15 Comments
jozsef

Thank you very much!

It's good to see that there is someone who pays proper attention to analyzing hardware in depth.

David. Hellyx

Interesting observation...

"CP is very slow on GFX12 and parsing the packet header is the main bottleneck. Using paired context regs reduce the number of packet headers and it should be more optimal.

It doesn't seem worth when only one context reg is emitted (one packet header and same number of DWORDS) or when consecutive context regs are emitted (would increase the number of DWORDS)."

https://www.phoronix.com/news/AMD-RDNA4-Paired-Context-Regs
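
A rough way to see the tradeoff in that quote is to count DWORDs per packet form. This is a simplified sketch of the accounting, not the real PM4 encoding (real headers carry more fields), and the helper names are made up for illustration:

```
// Simplified DWORD counts for emitting context registers (sketch only).
static inline int dwords_classic(int n_packets, int regs_per_packet) {
    // classic SET_CONTEXT_REG-style packet:
    // 1 header + 1 base register offset + N values for consecutive registers
    return n_packets * (2 + regs_per_packet);
}

static inline int dwords_paired(int n_regs) {
    // paired form: 1 header + one (offset, value) pair per register,
    // so the registers do not have to be consecutive
    return 1 + 2 * n_regs;
}

// dwords_classic(1, 1) == 3 == dwords_paired(1)       -> one reg: same size, no win
// dwords_classic(1, 4) == 6 <  dwords_paired(4) == 9  -> consecutive regs: paired is bigger
// dwords_classic(4, 1) == 12 > dwords_paired(4) == 9  -> scattered regs: fewer headers and fewer DWORDs
```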

Dante Fr.

They implemented all these optimizations that brought the architecture close to Nvidia in most respects. Will UDNA discard all that, or is it just a minor upgrade, aside from the addition of matrix cores?

Nathan Gabriel

Anyone have a recommendation for material that explains writing code for the GPU with this level of understanding of the hardware? I don't have to do anything super complex, but I do want to be able to take advantage of the GPU's bandwidth, and I don't mind working out the function at a register-by-register level.

Chester Lam

Usually GPUs aren't coded in assembly, because GPU ISAs differ vastly across manufacturers and even between generations from the same manufacturer.

People usually go in via a high-level compute API like OpenCL or CUDA, and don't touch assembly or intrinsics unless performance is extremely critical and the code will only run on known hardware.
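
For a concrete (hypothetical) example, a trivial CUDA kernel plus its launch looks like this; the compiler and driver handle register allocation and generate machine code for whatever GPU it actually runs on:

```
#include <cuda_runtime.h>

// Trivial elementwise kernel: no assembly or intrinsics involved; the
// compiler allocates registers and emits the target GPU's machine code.
__global__ void scale(const float* in, float* out, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

void launch_scale(const float* d_in, float* d_out, float factor, int n) {
    int block = 256;
    int grid = (n + block - 1) / block;
    scale<<<grid, block>>>(d_in, d_out, factor, n);  // standard high-level launch
}
```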

Fredrik Tolf

I must admit the whole thing sounds unnecessarily complicated to me. Wouldn't it be much simpler and easier to just have the sections with different VGPR usage run as separate kernels, with software-managed queues in between? What does this hardware-based approach offer over that?

Chester Lam

It avoids kernel launch latency and having to spill/reload register contents between kernel launches.
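
To make the tradeoff concrete, here is a minimal sketch (hypothetical kernels, CUDA used only as an example API) of the "separate kernels with a software-managed queue" approach: stage A writes its live results to an intermediate buffer and stage B reads them back, so each boundary costs another kernel launch plus a round trip through memory instead of keeping values in registers.

```
#include <cuda_runtime.h>

// Stage with low register demand: its live state has to be written out
// to the intermediate buffer ("queue") before the next launch.
__global__ void stageA_lowVgpr(const float* in, float* queue, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) queue[i] = in[i] * 2.0f;   // spill live values to memory
}

// Stage with high register demand: reloads that state from memory.
__global__ void stageB_highVgpr(const float* queue, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = queue[i] + 1.0f;  // reload values from memory
}

void run_pipeline(const float* d_in, float* d_queue, float* d_out, int n) {
    int block = 256, grid = (n + block - 1) / block;
    stageA_lowVgpr<<<grid, block>>>(d_in, d_queue, n);    // first launch
    stageB_highVgpr<<<grid, block>>>(d_queue, d_out, n);  // second launch: pays launch latency again
    cudaDeviceSynchronize();
}
```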

Ken Esler

Great article and quite interesting. One minor technical point: you mention that NVIDIA's register file size is 64 kB, but the specs note that it is 64k 32-bit registers, i.e. 256 kB.

"The register file size is 64K 32-bit registers per SM." Did I misunderstand your accounting?

Chester Lam

Per register file instance, so per processing block or SMSP within an SM. An SM has 4x 64 KB register files.
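
Spelled out, the accounting is (assuming 4 SMSPs per SM, as above):

\[
64\mathrm{K} \times 32\text{-bit registers} = 65536 \times 4\,\mathrm{B} = 256\,\mathrm{KB\ per\ SM}, \qquad 256\,\mathrm{KB} / 4\ \mathrm{SMSPs} = 64\,\mathrm{KB\ per\ register\ file\ instance}.
\]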

Ken Esler

I suspected I misunderstood -- thanks for the clarification! I was surprised at the disparity in the register file sizes between AMD and NVIDIA. I suppose NVIDIA's architecture relies more on the relatively large L1 cache for register spills, allowing the compiler to allocate fewer registers to maintain occupancy while keeping reasonable performance. In contrast, as I understand it, RDNA 4's L1 is read-only, so any spilling would have to be serviced by L2 at higher cost. I'm curious which strategy is more effective.

jozsef

One SM contains 4 subpartitions, so one subpartition contains 256/4 = 64 KB of register file. In this article the register file size was given per SIMD, not per CU or SM.

jozsef

More precisely, up to Blackwell Nvidia had two 16-lane-wide SIMDs per SM partition, and both used the same 64 KB register file. Now, if I recall correctly, Blackwell has one 32-lane-wide FP32 SIMD per subpartition.

Erik Stubblebine

Thank you for making me a little smarter.

jozsef

And also a question: is there a cache-tag-like memory to track available register blocks?

jozsef

Thanks for the answers!