12 Comments

Yukimasa Sugizaki

Thank you for your insightful post, as always!

As indicated in the PTX documentation (https://docs.nvidia.com/cuda/parallel-thread-execution/#integer-arithmetic-instructions-mad ), IMAD and IMAD.WIDE mean 32-bit × 32-bit → 32-bit and 32-bit × 32-bit → 64-bit integer multiplication, respectively.

According to Table 7 in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions , widening multiplication delivers only half the throughput of its non-widening counterpart, so the compiler appears to favor the non-widening form when generating 32-bit addresses for the shared space.
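
To illustrate (a sketch of my own, not from the post): in a kernel like the one below, the shared-memory offset math always fits in 32 bits, so the compiler can emit plain IMAD for it, while the 64-bit global pointers are where IMAD.WIDE tends to show up.

```cuda
// Hypothetical kernel, just to contrast the two kinds of address math.
// Assumes the launch configuration keeps idx below the tile size.
__global__ void smem_index_demo(const float *in, float *out, int pitch)
{
    __shared__ float tile[4096]; // 16 KB; the shared space tops out at 128 KB

    int row = threadIdx.y;
    int col = threadIdx.x;
    // Shared-memory offset: a 32-bit multiply-add suffices (plain IMAD).
    int idx = row * pitch + col;

    // Global pointers are 64-bit, so this offset math may widen (IMAD.WIDE).
    tile[idx] = in[idx];
    __syncthreads();
    out[idx] = tile[idx] * 2.0f;
}
```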

Chester Lam

I suspect it's also because Shared Memory doesn't need memory addresses larger than 32 bits. 17 bits (128 KB) would be sufficient to address the entire Shared Memory space in Blackwell.

Peter W.

Also, thanks to @Chester for another great deep dive!

Fredrik Tolf

It is interesting that the SM execution path is back to one 32-wide FP/INT unit. Does it actually work like it used to on Pascal, or are there significant differences?

Back on Turing, Nvidia introduced the FP/INT split with the reasoning that the increasing INT usage in modern shaders effectively "blocked" the FP units from being fully utilized. Is this no longer considered an issue, or are there other mechanisms to compensate for it?

Was the whole "dual-issue" arc just a detour, and one that's over now?

Chester Lam

I personally disagree with Nvidia's characterization of Turing, because what they did was cut the FP32 units to 16-wide. So yes, INT32 execution no longer competes with FP32. But that's because you can no longer issue FP32 instructions back-to-back (FP is always "blocked" for a cycle after you issue an FP instruction), and you're trying to fill those stall cycles with INT32 instructions. I don't consider that dual issue because it never issues two instructions in the same cycle.

From the POV of trying to finish each wave quickly, Turing looks like a regression from Pascal because you'd need a ~1:1 FP32/INT32 mix to minimize stalls when compute bound. Pascal can fully utilize its vector units with any ratio of FP32/INT32. Of course Turing makes progress in other areas, so it's a complicated picture.
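
To make that concrete, here's a toy cycle model (my own simplification, nothing official) of one scheduler's worth of execution: "Pascal" gets a single 32-wide unit that accepts any instruction every cycle; "Turing" gets 16-wide FP32 and 16-wide INT32 pipes, each occupied for 2 cycles per 32-wide warp instruction, with the scheduler issuing at most one instruction per cycle.

```cuda
// Toy issue model for one SM scheduler (host code; my own assumptions).
#include <algorithm>
#include <cstdio>

// "Pascal": one 32-wide unit, any instruction type issues every cycle.
long pascalCycles(long fp, long in) { return fp + in; }

// "Turing": bound by whichever 16-wide pipe is busiest (2 cycles per
// instruction), and by the scheduler's one-issue-per-cycle limit.
long turingCycles(long fp, long in)
{
    return std::max({2 * fp, 2 * in, fp + in});
}

int main()
{
    // Sweep the FP32:INT32 mix across 1000 total instructions.
    for (int fpPct = 0; fpPct <= 100; fpPct += 25) {
        long fp = fpPct * 10, in = 1000 - fp;
        std::printf("FP32 %3d%%: Pascal %4ld cycles, Turing %4ld cycles\n",
                    fpPct, pascalCycles(fp, in), turingCycles(fp, in));
    }
    return 0;
}
```

In this model only the 50/50 mix gets "Turing" down to "Pascal's" flat 1000 cycles; every other ratio leaves stall cycles, which is the ~1:1 point above.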

Fredrik Tolf

Certainly I don't disagree, but you can kind of pretend it's like dual-issue on an imaginary half-as-fast-clocked SM. :)

Nevertheless, am I correct in reading the text to say that Blackwell has simply left those 16-wide split execution units behind, and essentially gone back to what Pascal did?

Chester Lam

I would be more amenable to that interpretation if Turing clocked significantly higher than Pascal, which it did not.

That's what it sounds like from Nvidia's whitepaper, yes.

Fredrik Tolf

The article states 2.9 GHz observed for the 5090, whereas the official spec states 2.4 GHz. This matches my experience as well, where e.g. the 1060's official boost clock is 1.7 GHz, but in practice it often hits 1.9 GHz if not higher.

What's the deal with Nvidia cards often clocking well above their advertised boost clocks? You'd normally think they'd advertise that.

Chester Lam

I think it's because people would complain if the boost clock was too hard to reach. Kepler and Maxwell would often get pretty close to their boost clock with a decent cooler. Pascal, not really.

Also, that 5090 figure is from TechPowerUp, linked in the references above. Originally I put the link in the table, but Substack can't do tables, so it has to be an image here. I don't have access to a 5090 myself, so I'm trusting their figures.

David. Hellyx

Funny how the XTX has FP64 throughput equivalent to this massive and expensive GPU.

Peter W.

There has been a general trend to scale back support for FP64, and having the biggest Blackwell GPU barely matching the XTX seems to be part of this. Apparently, double precision compute isn't considered as important anymore. My impression of the GB203 design is that it's mainly an AI accelerator that also does some nice graphics on the side.
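
For rough scale (my numbers, pulled from public spec sheets, so double-check them): the 7900 XTX runs FP64 at 1/32 of roughly 61 TFLOPS FP32, so 61.4 / 32 ≈ 1.9 TFLOPS, while the 5090 runs FP64 at 1/64 of roughly 105 TFLOPS, so 104.8 / 64 ≈ 1.6 TFLOPS. If those ratios are right, the much cheaper card actually comes out slightly ahead in double precision.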

Peter W.

Mistyped, of course I meant GB202.
