It's good to see that someone pays appropriate attention to analyzing hardware in depth.
Anyone have a recommendation for explanations of writing code for the GPU that understand the hardware at this level of precision? I don't have to do anything super complex, but I do want to be able to take advantage of the GPU's bandwidth, and I don't mind working out the function at a register-by-register level.
Usually GPUs aren't coded in assembly because GPU ISAs differ vastly across manufacturers and even between GPU generations.
People usually go in via a high-level compute API like OpenCL, CUDA, etc., and don't touch assembly or intrinsics unless performance is extremely critical and the code will only run on known hardware.
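For a concrete picture of what the high-level route looks like, here's a minimal CUDA sketch (my own toy example, not anything from the article): a bandwidth-bound vector add where register allocation and ISA selection are left entirely to the compiler.

    // Minimal CUDA sketch: a bandwidth-bound vector add.
    // The compiler handles register allocation and ISA details.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];  // one element per thread
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));  // unified memory for brevity
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
        vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();
        printf("c[0] = %f\n", c[0]);  // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Even from this level you can still get the register-by-register view: nvcc -Xptxas -v reports each kernel's register count, and cuobjdump -sass dumps the actual SASS the hardware runs.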
I must admit the whole thing sounds unnecessarily complicated to me. Wouldn't it be much simpler and easier to just have the sections with different VGPR usage run as separate kernels, with software-managed queues in between? What does this hardware-based approach offer over that?
It avoids kernel launch latency and the need to spill/reload register contents between kernel launches.
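To make that tradeoff concrete, a toy CUDA sketch (my own deliberately trivial example, not the article's mechanism): the split version forces the intermediate value through global memory and pays for a second launch, while the fused version keeps it in a register for its whole lifetime.

    #include <cuda_runtime.h>

    // Split version: two launches, and tmp round-trips through DRAM.
    __global__ void stage1(const float* in, float* tmp, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tmp[i] = in[i] * 2.0f;   // "spilled" to global memory
    }
    __global__ void stage2(const float* tmp, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tmp[i] + 1.0f;  // reloaded from global memory
    }

    // Fused version: the intermediate never leaves a register.
    __global__ void fused(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float t = in[i] * 2.0f;         // lives in a register
            out[i] = t + 1.0f;
        }
    }

    int main() {
        const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
        float *in, *tmp, *out;
        cudaMallocManaged(&in,  n * sizeof(float));
        cudaMallocManaged(&tmp, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        stage1<<<blocks, threads>>>(in, tmp, n);
        stage2<<<blocks, threads>>>(tmp, out, n);
        fused<<<blocks, threads>>>(in, out, n);
        return cudaDeviceSynchronize() != cudaSuccess;
    }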
Great article and quite interesting. One minor technical point: you mention that NVIDIA's register file size is 64 kB, but the specs note that it is 64k 32-bit registers, i.e. 256 kB.
"The register file size is 64K 32-bit registers per SM." Did I misunderstand your accounting?
Per register file instance, so per processing block or SMSP within an SM. An SM has 4x 64 KB register files.
I suspected I misunderstood -- thanks for the clarification! I was surprised at the disparity in register file sizes between AMD and NVIDIA. I suppose NVIDIA's architecture relies more on the relatively large L1 cache for register spills, allowing the compiler to allocate fewer registers to maintain occupancy while keeping performance reasonable. In contrast, as I understand it, RDNA 4's L1 is read-only, so any spilling would have to be serviced by L2 at higher cost. I'm curious which strategy is more effective.
One SM contains 4 subpartitions, so one subpartition contains 256/4 = 64 KB of register file. In this article the register file size was given per SIMD, not per CU or SM.
More precisely, up to Blackwell NVIDIA had two 16-lane-wide SIMDs per SM subpartition, and each of these shared the same 64 KB register file. Now, if I recall correctly, Blackwell has one 32-lane-wide FP32 SIMD per subpartition.
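Spelling out the accounting from this subthread as a back-of-envelope check (using only the figures quoted above: 64K 32-bit registers per SM, 4 subpartitions per SM):

    /* Back-of-envelope check of the register file accounting. */
    #include <stdio.h>

    int main(void) {
        const int regs_per_sm   = 64 * 1024;  // 64K registers per SM
        const int bytes_per_reg = 4;          // 32-bit registers
        const int subpartitions = 4;          // SMSPs per SM
        int sm_kib  = regs_per_sm * bytes_per_reg / 1024;  // 256 KiB per SM
        int sub_kib = sm_kib / subpartitions;              // 64 KiB per subpartition
        printf("per SM: %d KiB, per subpartition: %d KiB\n", sm_kib, sub_kib);
        return 0;
    }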
Thank you for making me a little smarter.
And also a question: is there a cache-tag-like memory to track available register blocks?
Thanks for the answers!