Would the out-of-order memory access capability that AMD added to RDNA4 have any applications in a CDNA4-based design? I know UDNA is rumored to represent a fusion of CDNA and RDNA, implying that some design ideas will be kept from each. I know RDNA4's OOO abilities are unique, but I don't know if there would even be a theoretical benefit to adopting a similar approach for AI or HPC workloads.
Depends on whether AMD expects compute workloads to frequently have different waves going down different code paths with varying cache hit/miss behavior. I suspect a lot of compute workloads, especially ML ones, will be quite regular and won't see too much benefit from the "OOO" memory accesses.
Also RDNA4's OOO memory accesses are not unique (https://www.bing.com/search?pc=MOZI&form=MOZLBR&q=chipsandcheese+rdna4). It's resolving a false dependency between different waves when waiting for data to arrive from the memory subsystem. Nvidia and Intel don't have this false dependency even on older GPUs, so they always had "OOO" memory accesses.
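As a rough illustration, here's a hypothetical HIP/CUDA-style micro-pattern (my own sketch, not the article's actual benchmark) where the false cross-wave dependency would show up: one wave streams a DRAM-sized buffer and mostly misses, while the other waves loop over a small cache-resident buffer and mostly hit. On hardware that returns memory data to different waves strictly in request order, the hit-heavy waves can end up waiting behind the miss-heavy wave's outstanding loads.

```cpp
#include <hip/hip_runtime.h>

// Hypothetical kernel: waves in the same workgroup take different code paths
// with very different cache hit/miss behavior.
__global__ void divergent_wave_loads(const float* big, const float* small_buf,
                                     float* out, size_t big_len, size_t small_len)
{
    const int lane = threadIdx.x % warpSize;
    const int wave = threadIdx.x / warpSize;

    float acc = 0.0f;
    if (wave == 0) {
        // Miss-heavy path: stride once through a buffer far larger than the caches.
        for (size_t i = lane; i < big_len; i += warpSize)
            acc += big[i];
    } else {
        // Hit-heavy path: repeatedly walk a buffer that fits in cache.
        for (int rep = 0; rep < 1024; ++rep)
            for (size_t i = lane; i < small_len; i += warpSize)
                acc += small_buf[i];
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```

If every wave ran the same loop with the same hit rate, as in a lot of regular ML kernels, relaxing that ordering wouldn't buy much.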
Thank you for replying, Chester.
I stand corrected. After re-reviewing this article, I think the only explanation is that I paused reading the RDNA4 article partway through and then forgot I had done so. You clearly lay out that Nvidia and Intel GPUs are not affected by the same false dependency issue.
Apologies.
Miss you at ExtremeTech, Joel :-)
It was a great place, with great people.
I wonder if the FP6 advantage will be maintained against Rubin, too.
I don't understand the stochastic rounding note. For HPC, I think stochastic rounding could be very helpful in making smaller data types usable, but that would make sense for math operations or conversion to lower precision. Can anyone explain what stochastic rounding means when increasing precision?
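For reference, the down-conversion case I mean works like this: the rounding direction is chosen randomly with probability proportional to the discarded fraction, so the error averages out over many accumulations instead of biasing one way. A minimal sketch of FP32-to-BF16 stochastic rounding (my own illustration, not anything from the article; ignores NaN/Inf handling):

```cpp
#include <cstdint>
#include <cstring>

// Adding a random 16-bit value before truncation makes the result round up
// with probability (discarded fraction / 2^16), so the expected value of the
// rounded number equals the original value.
uint16_t fp32_to_bf16_stochastic(float x, uint16_t rand16)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));     // reinterpret the float as raw bits
    bits += rand16;                            // random carry into the kept bits
    return static_cast<uint16_t>(bits >> 16);  // keep sign, exponent, top 7 mantissa bits
}
```

That makes sense going down in precision; I don't see what the equivalent operation would be when going up.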
The core (CU) count for MI300X is incorrect; it should be 304 instead of 288 in the first table.
Yep, my mistake when moving the article from Google Docs to Substack and WordPress! It has been fixed!