What I see here is that, since Vega, AMD could have had a much more competent and efficient architecture using this type of implementation.
I was also struck by how much AMD emphasized the novelty of OoO memory access in RDNA4, which is why I appreciated that @Chester did this analysis.
And I had the same question: why didn't AMD use it before, and, conversely, what benefits did in-order have?
I'm no expert in this area, but in-order designs are often used for DSPs, where that makes sense.
I wonder if the RDNA3.5+ architecture, which will be used in the next batch of APUs, will focus solely on this kind of optimization and FP8 support (for FSR4). This is because RDNA4's RT cores take up significant space, making the design overly bulky without providing enough ray tracing performance to justify it(?). Just guessing.
Thank you very much for the excellent article! Finally, someone understood my questions. I was just as confused after seeing AMD's slides, and I found the cross-wave load dependency in prior generations strange. I felt that no one but me correctly understood those slides, and everyone thought it was a revolutionary feature. Thank you for showing the truth. I hope an equally great article will be written about RDNA 4's dynamic vector register allocation, with deadlocks in particular in focus. Thank you!!!
AFAIK x86 does enforce such load ordering (otherwise they wouldn't get acquire/release atomics ordering for free). Perhaps AMD previously reused a load unit design from their CPU division to cut costs / save time, and has only recently taken the time to design a proper weakly ordered load unit?
No, on CPUs, including x86 ones, there are no cross-thread load ordering guarantees.
Ah yes, sorry, I misread; this is across waves.
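For anyone curious what "for free" means above, here is a minimal C++ sketch (generic host-side std::atomic code, assumed for illustration and not taken from the article) of the publish/consume pattern the ordering argument is about: the acquire load of the flag must not be reordered with the later load of the data.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Plain data published through an atomic flag.
int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // write the data first
    ready.store(true, std::memory_order_release);  // then publish it
}

void consumer() {
    // Acquire load: once this observes true, the write to payload
    // (made before the release store) is guaranteed to be visible.
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    std::printf("payload = %d\n", payload);        // always prints 42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```

On x86 both the release store and the acquire load compile down to ordinary MOVs because the memory model already forbids the relevant reorderings; a weakly ordered design would need explicit barriers, which is roughly the trade-off being discussed for how a GPU returns loads.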
Are these shaders on GitHub, please? I'd like to run them on the cards I have access to and see what results come of it.
They are, but they're not really polished to a standard suitable for public consumption, so I won't be linking them.
This might be a naive question, but how is memory access handled by current or last gen Nvidia and Intel GPUs? Are they using out-of-order, in-order or "something else"? (If the latter, what?)
Thanks!
I did see the sentence on Nvidia's OoO handling in Turing, but I don't know if it's unchanged in Ada and Blackwell; hence my question. Back when Turing launched, Nvidia argued that their memory handling and management was better than AMD's. That argument was used to justify why they didn't need to equip their mid-range cards with more than 6 GB of VRAM.
The newest NV GPU I have is Turing, though I can't imagine why they would go backward with subsequent generations. I don't think it has anything to do with VRAM capacity; that level of memory management would happen at the driver level, like what the driver decides to keep in VRAM vs. fetch from host memory.
Could this have been an artifact of going with a chiplet design for RDNA 3? I would imagine that keeping multiple chips in sync would introduce more challenges.
No, this applies to waves within the same WGP.
It would be interesting to compare the features of RDNA 4 and the Blackwell-based GeForce RTX 50, including Shader Execution Reordering, Neural Rendering, Neural Shaders, Mega Geometry, etc. Are these features actually used? Does RDNA 4 have an AMP processor? How does it handle multiple AI models?
This is all just software and marketing.
Neural Shaders: running a small neural network on shaders (not tensor cores) is possible on Blackwell; I was wondering if this will also work on RDNA4. This is not just software. The general idea is to use a small neural network, stored on the GPU, to approximate something that would otherwise be very expensive to compute, shader-wise or data-wise. RTX Neural Shaders bring AI to programmable shaders.
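To make the idea concrete, here is a minimal plain-C++ sketch (CPU code with made-up placeholder weights, purely illustrative and not taken from the article or either vendor's API) of what "a small neural network approximating something expensive" looks like; in an actual neural shader the evaluate() step would run per pixel inside a shader, ideally through the hardware's low-precision matrix paths.

```cpp
#include <cstdio>

// Toy stand-in for a "neural shader": a tiny 2-layer MLP with baked-in
// weights approximating some expensive per-pixel quantity. All weights
// here are arbitrary placeholders, not a trained network.
constexpr int IN = 2;      // e.g. UV coordinates
constexpr int HIDDEN = 8;  // hidden layer width

float W1[HIDDEN][IN], b1[HIDDEN];  // first layer
float W2[HIDDEN], b2 = 0.0f;       // output layer

float relu(float x) { return x > 0.0f ? x : 0.0f; }

// The part that would live in a pixel/compute shader: one forward pass
// per invocation. Every hidden activation is an intermediate result that
// has to live somewhere (registers, on a GPU).
float evaluate(float u, float v) {
    float out = b2;
    for (int i = 0; i < HIDDEN; ++i) {
        float h = relu(W1[i][0] * u + W1[i][1] * v + b1[i]);
        out += W2[i] * h;
    }
    return out;
}

int main() {
    // Fill in arbitrary placeholder weights so the sketch actually runs.
    for (int i = 0; i < HIDDEN; ++i) {
        b1[i] = 0.01f * i;
        W2[i] = 0.1f;
        for (int j = 0; j < IN; ++j) W1[i][j] = 0.05f * (i + j + 1);
    }
    std::printf("approx value at (0.25, 0.75) = %f\n", evaluate(0.25f, 0.75f));
    return 0;
}
```

The question in this thread is really whether RDNA4 can run that kind of per-pixel inference as efficiently as Blackwell, not whether it can run it at all.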
At the end of the day, this is software, and even though it relies on specific hardware features to work, there is almost always flexibility in how to achieve similar results. RDNA4 is comparable or superior to Blackwell in raw ML performance.
Running even small neural networks requires storing intermediate results. The size and management of the register file within GPU shaders could become a bottleneck.
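As a rough back-of-the-envelope illustration (generic numbers, assumed for the sake of the example rather than taken from the article): keeping a 32-wide FP32 hidden-layer activation live costs 32 VGPRs per lane, and with typical budgets on the order of 256 VGPRs, that single layer already claims roughly an eighth of the register file before counting inputs, weights in flight, and accumulators. As allocation grows, occupancy drops, which is exactly the bottleneck described here; packing activations as FP16 halves that cost, which is one reason low-precision formats matter for this use case.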
The overhead of this computation might outweigh the benefits in certain scenarios, leading to lower overall performance compared to traditional shader techniques.