19 Comments

What I see here is that, since Vega, AMD could have had a much more competent and efficient architecture using this type of implementation.

I was also struck by how much AMD emphasized just how novel OoO memory access is with RDNA4, which is why I appreciated that @Chester did this analysis.

And, I had the same question: why didn't AMD use it before, and, conversely, what benefits did in-order have?

I'm no expert in this area, but in-order designs are often used for DSPs, where that makes sense.

I wonder if the RDNA3.5+ architecture, which will be used in the next batch of APUs, will focus solely on this kind of optimization and on support for FP8 (FSR4). RDNA4's RT cores take up significant space, which could make the design overly bulky without providing enough ray tracing performance to justify it. Just guessing.

Thank you very much for the excellent article! Finally, someone understood my questions. I was just as confused after seeing AMD's slides, and I found the cross-wave load dependency in the prior generation strange. I felt that no one but me correctly understood those slides, and everyone thought it was a revolutionary feature. Thank you for showing the truth. I hope an equally great article will be written about RDNA 4's dynamic vector register allocation, especially with deadlocks in focus. Thank you!!!

AFAIK x86 does enforce such load ordering (otherwise it wouldn't get acquire/release atomic ordering for free). Perhaps AMD previously reused a load unit design from their CPU division to cut costs and save time, and has only recently taken the time to design a proper weakly ordered load unit?

No, on CPUs, including x86 ones, there are no cross-thread load ordering guarantees.

Ah yes, sorry, I misread; this is across waves.

Are these shaders on GitHub, please? I want to run them on the cards I have access to and see what results come of it.

They are, but they're not polished to a standard suitable for public consumption, so I won't be linking them.

This might be a naive question, but how is memory access handled by current or last gen Nvidia and Intel GPUs? Are they using out-of-order, in-order or "something else"? (If the latter, what?)

Thanks!

I did see the sentence on Nvidia's OoO handling in Turing, but don't know if it's unchanged in Ada and Blackwell; hence my question. Back when Turing launched, Nvidia argued that their memory handling and management was better than AMD's. That argument was used to justify why they didn't need to equip their mid-range cards with more than 6 GB VRAM.

The newest NV GPU I have is Turing, though I can't imagine why they would go backward with subsequent generations. I don't think it has anything to do with VRAM capacity. That level of memory management would happen at the driver level, like what the driver decides to keep in VRAM vs. fetch from host memory.

Could this have been an artifact of going with a chiplet design for RDNA 3? I would imagine that keeping multiple chips in sync would introduce more challenges.

No, this applies to waves within the same WGP.

It would be interesting to compare the features of RDNA 4 and the Blackwell-based GeForce RTX 50, including Shader Execution Reordering, Neural Rendering, Neural Shaders, Mega Geometry, etc. Are these features actually used? Does RDNA 4 have an AMP processor? How does it handle multiple AI models?

This is all just software and marketing.

Neural Shaders: running a small neural network on shaders (not tensor cores) is possible on Blackwell; I was wondering if this will work on RDNA4. This is not just software. The general idea is to use a small neural network, stored on the GPU, to approximate something that would otherwise be very expensive to compute, shader-wise or data-wise. RTX Neural Shaders bring AI to programmable shaders.

At the end of the day, this is software, and even though it relies on specific hardware features to work, there is almost always flexibility in how to achieve similar results. RDNA4 is comparable or superior to Blackwell in raw ML performance.

Running even small neural networks requires storing intermediate results. The size and management of the register file within GPU shaders could become a bottleneck.

The overhead of this computation might outweigh the benefits in certain scenarios, leading to lower overall performance compared to traditional shader techniques.
