It's good to see that someone pays appropriate attention to analyzing hardware in depth.
Anyone have a recommendation for explanations of writing code for the GPU that understand the hardware at this level of precision? I don't have to do anything super complex, but I do want to be able to take advantage of the GPU's bandwidth, and I don't mind working out the function at a register-by-register level.
Usually GPUs aren't coded in assembly because GPU ISAs differ vastly across manufacturers and even between GPU generations.
People usually go in via a high-level compute API like OpenCL, CUDA, etc., and don't touch assembly or intrinsics unless performance is extremely critical and the code will only run on known hardware.
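For a concrete picture of what the high-level route looks like, here's a minimal CUDA sketch (my own toy example, not anything from the article): a bandwidth-bound vector add where register allocation and ISA selection are left entirely to the compiler.

    // Minimal CUDA sketch: a bandwidth-bound vector add.
    // The compiler handles register allocation and ISA details.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];  // one element per thread
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));  // unified memory for brevity
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
        vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();
        printf("c[0] = %f\n", c[0]);  // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Even from this level you can still get the register-by-register view: nvcc -Xptxas -v reports each kernel's register count, and cuobjdump -sass dumps the actual SASS the hardware runs.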
I must admit the whole thing sounds unnecessarily complicated to me. Wouldn't it be much simpler and easier to just have the sections with different VGPR usage run as separate kernels, with software-managed queues in between? What does this hardware-based approach offer over that?
It avoids kernel launch latency and the need to spill/reload register contents between kernel launches.
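To make that tradeoff concrete, a toy CUDA sketch (my own deliberately trivial example, not the article's mechanism): the split version forces the intermediate value through global memory and pays for a second launch, while the fused version keeps it in a register for its whole lifetime.

    #include <cuda_runtime.h>

    // Split version: two launches, and tmp round-trips through DRAM.
    __global__ void stage1(const float* in, float* tmp, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tmp[i] = in[i] * 2.0f;   // "spilled" to global memory
    }
    __global__ void stage2(const float* tmp, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tmp[i] + 1.0f;  // reloaded from global memory
    }

    // Fused version: the intermediate never leaves a register.
    __global__ void fused(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float t = in[i] * 2.0f;         // lives in a register
            out[i] = t + 1.0f;
        }
    }

    int main() {
        const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
        float *in, *tmp, *out;
        cudaMallocManaged(&in,  n * sizeof(float));
        cudaMallocManaged(&tmp, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        stage1<<<blocks, threads>>>(in, tmp, n);
        stage2<<<blocks, threads>>>(tmp, out, n);
        fused<<<blocks, threads>>>(in, out, n);
        return cudaDeviceSynchronize() != cudaSuccess;
    }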
Great article and quite interesting. One minor technical point: you mention that NVIDIA's register file size is 64 kB, but the specs note that it is 64k 32-bit registers, i.e. 256 kB.
"The register file size is 64K 32-bit registers per SM." Did I misunderstand your accounting?
Per register file instance, so per processing block or SMSP within an SM. An SM has 4x 64 KB register files.
I suspected I misunderstood -- thanks for the clarification! I was surprised at the disparity in register file sizes between AMD and NVIDIA. I suppose NVIDIA's architecture relies more on the relatively large L1 cache for register spills, allowing the compiler to allocate fewer registers to maintain occupancy while keeping performance reasonable. In contrast, as I understand it, RDNA 4's L1 is read-only, so any spilling would have to be serviced by L2 at higher cost. I'm curious which strategy is more effective.
One SM contains 4 subpartitions, so one subpartition contains 256/4 = 64 KB of register file. In this article the register file size was given per SIMD, not per CU or SM.
More precisely, up to Blackwell NVIDIA had two 16-lane-wide SIMDs per SM subpartition, and each of these shared the same 64 KB register file. Now, if I recall correctly, Blackwell has one 32-lane-wide FP32 SIMD per subpartition.
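Spelling out the accounting from this subthread as a back-of-envelope check (using only the figures quoted above: 64K 32-bit registers per SM, 4 subpartitions per SM):

    /* Back-of-envelope check of the register file accounting. */
    #include <stdio.h>

    int main(void) {
        const int regs_per_sm   = 64 * 1024;  // 64K registers per SM
        const int bytes_per_reg = 4;          // 32-bit registers
        const int subpartitions = 4;          // SMSPs per SM
        int sm_kib  = regs_per_sm * bytes_per_reg / 1024;  // 256 KiB per SM
        int sub_kib = sm_kib / subpartitions;              // 64 KiB per subpartition
        printf("per SM: %d KiB, per subpartition: %d KiB\n", sm_kib, sub_kib);
        return 0;
    }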
Thank you for making me a little smarter.
And also a question: is there a cache-tag-like memory to track available register blocks?
Thanks for the answers!