13 Comments

Thank you very much!

It's good to see that there is someone who pays this much attention to analyzing hardware in depth.


Anyone have a recommendation for explanations of writing GPU code that understand the hardware at this level of precision? I don't need to do anything super complex, but I do want to be able to take advantage of the GPU's bandwidth, and I don't mind working out the function at a register-by-register level.


Usually GPUs aren't coded in assembly because GPU ISAs are vastly different across manufacturers and even different between GPU generations.

People usually go in via a high-level compute API like OpenCL, CUDA, etc., and don't touch assembly or intrinsics unless performance is extremely critical and the code will only run on known hardware.


I must admit the whole thing sounds unnecessarily complicated to me. Wouldn't it be much simpler and easier to just have the sections with different VGPR usage run as separate kernels, with software-managed queues in between? What does this hardware-based approach offer over that?


It avoids kernel launch latency and having to spill/reload register contents between kernel launches.


Great article and quite interesting. One minor technical point: you mention that NVIDIA's register file size is 64 kB, but the specs note that it is 64k 32-bit registers, i.e. 256 kB.

"The register file size is 64K 32-bit registers per SM." Did I misunderstand your accounting?


Per register file instance, i.e. per processing block or SMSP within an SM. An SM has 4x 64 KB register files.
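The accounting in this exchange can be sanity-checked with a little arithmetic. A minimal sketch, assuming the figures quoted in the thread (64K 32-bit registers per SM, four processing blocks/SMSPs per SM):

```python
# Sanity check of the register file accounting discussed above.
# Figures are those quoted in the thread, not independently sourced here.

REGISTERS_PER_SM = 64 * 1024   # "64K 32-bit registers per SM"
BYTES_PER_REGISTER = 4         # 32-bit registers
SMSPS_PER_SM = 4               # processing blocks (SMSPs) per SM

sm_total_kb = REGISTERS_PER_SM * BYTES_PER_REGISTER // 1024
per_smsp_kb = sm_total_kb // SMSPS_PER_SM

print(sm_total_kb)   # 256 -> 256 KB of register file per SM
print(per_smsp_kb)   # 64  -> 64 KB per SMSP (one register file instance)
```

So both statements are consistent: 256 KB per SM in total, 64 KB per register file instance.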


I suspected I misunderstood -- thanks for the clarification! I was surprised at the disparity in register file sizes between AMD and NVIDIA. I suppose NVIDIA's architecture relies more on the relatively large L1 cache for register spills, allowing the compiler to allocate fewer registers to maintain occupancy while keeping reasonable performance. In contrast, as I understand it, RDNA 4's L1 is read-only, so any spilling would have to be serviced by L2 at higher cost. I'm curious which strategy is more effective.


One SM contains 4 subpartitions, so one subpartition contains 256/4 = 64 KB of register file. In this article the register file size was given per SIMD, not per CU or SM.


More precisely, up to Blackwell NVIDIA had two 16-lane-wide SIMDs per SM subpartition, and both shared the same 64 KB register file. If I remember correctly, Blackwell now has one 32-lane-wide FP32 SIMD per subpartition.


Thank you for making me a little smarter.


And also a question: is there a cache-tag-like memory to track available register blocks?


Thanks for the answers!
