9 Comments
Joel Hruska

Would the out-of-order memory access capability that AMD added to RDNA4 have any applications in a CDNA4-based design? I know UDNA is rumored to represent a fusion of CDNA and RDNA, implying that some design ideas will be kept from each. I know RDNA4's OOO abilities are unique, but I don't know if there would even be a theoretical benefit to adopting a similar approach for AI or HPC workloads.

Chester Lam

Depends on whether AMD expects compute workloads to frequently have different waves going down different code paths with varying cache hit/miss behavior. I suspect a lot of compute workloads, especially ML ones, will be quite regular and won't see too much benefit from the "OOO" memory accesses.

Also RDNA4's OOO memory accesses are not unique (https://www.bing.com/search?pc=MOZI&form=MOZLBR&q=chipsandcheese+rdna4). It's resolving a false dependency between different waves when waiting for data to arrive from the memory subsystem. Nvidia and Intel don't have this false dependency even on older GPUs, so they always had "OOO" memory accesses.
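As a rough illustration of the false dependency described above, here is a toy model (not a model of any real GPU). It assumes one memory request per wave, issued one cycle apart, and compares returning data strictly in issue order against returning it as soon as it is ready. The function name and the latency numbers are hypothetical.

```python
def completion_times(latencies, in_order):
    """Return the cycle at which each wave's data becomes usable.

    latencies[i] is the (hypothetical) memory latency of wave i's request;
    wave i issues its request at cycle i.
    """
    done = []
    prev = 0
    for issue_cycle, latency in enumerate(latencies):
        ready = issue_cycle + latency
        if in_order:
            # False dependency: data cannot return before the previous
            # wave's data, even though the two waves are unrelated.
            ready = max(ready, prev)
        done.append(ready)
        prev = ready
    return done


# Wave 0 misses cache (long latency); waves 1 and 2 hit (short latency).
latencies = [500, 50, 50]
in_order_times = completion_times(latencies, in_order=True)
out_of_order_times = completion_times(latencies, in_order=False)
print(in_order_times)      # [500, 500, 500]
print(out_of_order_times)  # [500, 51, 52]
```

In this sketch the two cache-hitting waves are held back only when returns are forced into issue order, which is why regular workloads with uniform hit/miss behavior would see little difference.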

Joel Hruska

Thank you for replying, Chester.

I stand corrected. After re-reviewing this article, I think the only explanation is that I paused reading the RDNA4 article partway through and then forgot I had done so. You clearly lay out that Nvidia and Intel GPUs are not affected by the same false dependency issue.

Apologies.

Farfle

Miss you at ExtremeTech, Joel :-)

Joel Hruska

It was a great place, with great people.

tt
Jun 18 (edited)

I wonder if the FP6 advantage will be maintained against Rubin, too.

ET3D

I don't understand the stochastic rounding note. For HPC I think that stochastic rounding could be very helpful in making smaller data types usable, but that would make sense for math operations or conversion to lower precision. Can anyone explain what stochastic rounding means when increasing precision?
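For the lower-precision case mentioned above, a minimal sketch of stochastic rounding might look like the following. It assumes a simple uniform grid with spacing `step` rather than a real floating-point format, and the helper names are hypothetical. The value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounding error averages out instead of always going the same way.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Round x to a multiple of step, choosing up or down at random.

    The probability of rounding up equals the fractional distance of x
    from the grid point below it, so the expected result equals x.
    """
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    return (lower + (1 if rng.random() < frac else 0)) * step


def round_to_nearest(x, step):
    """Deterministic rounding to the nearest multiple of step."""
    return round(x / step) * step


rng = random.Random(0)  # seeded so the run is repeatable
x, step, n = 0.3, 1.0, 100_000

nearest = round_to_nearest(x, step)  # always 0.0, the 0.3 is lost
sr_mean = sum(stochastic_round(x, step, rng) for _ in range(n)) / n

print(nearest)
print(sr_mean)  # close to 0.3 on average
```

Each individual stochastically rounded value is still either 0.0 or 1.0 here; only the average over many roundings recovers the original value, which is the property that helps when accumulating many low-precision results.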

Jakub

The core (CU) count for MI300X in the first table is incorrect; it should be 304 instead of 288.

George Cozma

Yep, my mistake when moving the article from Google Docs to Substack and WordPress! It has been fixed!
