15 Comments
jozsef:

Thank you very much!

It's good to see that there is someone who pays proper attention to analyzing hardware in depth.

David. Hellyx:

Interesting observation...

"CP is very slow on GFX12 and parsing the packet header is the main bottleneck. Using paired context regs reduce the number of packet headers and it should be more optimal.

It doesn't seem worth when only one context reg is emitted (one packet header and same number of DWORDS) or when consecutive context regs are emitted (would increase the number of DWORDS)."

https://www.phoronix.com/news/AMD-RDNA4-Paired-Context-Regs

Dante Fr.:

They implemented all these optimizations that brought the architecture close to Nvidia in most respects. Will UDNA discard all of that, or will it be just a minor upgrade aside from the addition of matrix cores?

Nathan Gabriel:

Does anyone have a recommendation for material that explains writing GPU code with this level of understanding of the hardware? I don't need to do anything super complex, but I do want to be able to take advantage of the GPU's bandwidth, and I don't mind working things out at a register-by-register level.

Chester Lam:

GPUs usually aren't coded in assembly, because GPU ISAs differ vastly across manufacturers and even between GPU generations.

People usually go in via a high-level compute API like OpenCL, CUDA, etc., and don't touch assembly or intrinsics unless performance is extremely critical and the code will only run on known hardware.
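
As an illustration (my own minimal sketch, not something from the article), a simple CUDA kernel like the one below is written entirely at the C++ level; register allocation and the generated ISA are left to the compiler and driver:

__global__ void saxpy(int n, float a, const float *x, float *y) {
    // One element per thread; the compiler decides how many registers each thread uses.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
// Host-side launch, e.g.: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);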

Fredrik Tolf:

I must admit the whole thing sounds unnecessarily complicated to me. Wouldn't it be much simpler and easier to just have the sections with different VGPR usage run as separate kernels, with software-managed queues in between? What does this hardware-based approach offer over that?

Chester Lam:

It avoids kernel launch latency and having to spill/reload register contents between kernel launches.

Ken Esler:

Great article and quite interesting. One minor technical point: you mention that NVIDIA's register file size is 64 kB, but the specs note that it is 64k 32-bit registers, i.e. 256 kB.

"The register file size is 64K 32-bit registers per SM." Did I misunderstand your accounting?

Chester Lam:

It's per register file instance, so per processing block or SMSP within an SM. An SM has 4x 64 KB register files.
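
Spelled out (my own arithmetic, based on the numbers above):

64K registers × 4 bytes per 32-bit register = 256 KB per SM
256 KB ÷ 4 SMSPs = 64 KB per register file instance, which matches the 64 kB figure in the article.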

Ken Esler:

I suspected I misunderstood -- thanks for the clarification! I was surprised at the disparity in register file sizes between AMD and NVIDIA. I suppose NVIDIA's architecture relies more on the relatively large L1 cache for register spills, allowing the compiler to allocate fewer registers to maintain occupancy while keeping reasonable performance. In contrast, as I understand it, RDNA 4's L1 is read-only, so any spilling would have to be serviced by L2 at higher cost. I'm curious which strategy is more effective.

jozsef:

One SM contains 4 subpartitions, so each subpartition contains 256/4 = 64 KB of register file. In this article the register file size was given per SIMD, not per CU or SM.

jozsef:

More precisely, up to Blackwell, Nvidia had two 16-lane-wide SIMDs per SM subpartition, and both used the same 64 KB register file. Now, if I remember correctly, Blackwell has one 32-lane-wide FP32 SIMD per subpartition.

Erik Stubblebine:

Thank you for making me a little smarter.

jozsef:

And also a question: is there cache-tag-like memory to track available register blocks?

jozsef:

Thanks for the answers!
