AMD acquired ATI in 2006, hoping ATI's GPU expertise would combine with AMD's CPU know-how to create integrated solutions worth more than the sum of their parts.
Do you think that there will be a MI355A?
The cross-node bandwidth pattern looks like your memory allocation is all in just one channel of one HBM stack per socket. Are there no options to stripe memory across all channels?
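For context, on Linux this kind of single-channel hotspot is usually addressed with an interleave policy (e.g. `numactl --interleave=all ./app`), which round-robins pages across memory nodes. Here's a toy model of the idea; the channel count and 4 KiB interleave granularity are illustrative assumptions, not MI300A specifics:

```python
# Toy model: striping an allocation across memory channels vs. leaving it
# all on one channel. Channel count and granularity are made-up numbers.
PAGE = 4096          # interleave granularity (illustrative)
CHANNELS = 8         # HBM channels per socket (illustrative)

def channel_for(addr: int, interleaved: bool) -> int:
    """Map an address to a channel: round-robin per page if interleaved,
    otherwise everything lands on channel 0."""
    return (addr // PAGE) % CHANNELS if interleaved else 0

def channels_touched(size: int, interleaved: bool) -> set[int]:
    """Set of channels an allocation of `size` bytes actually uses."""
    return {channel_for(a, interleaved) for a in range(0, size, PAGE)}

print(sorted(channels_touched(64 * PAGE, interleaved=False)))  # [0]
print(sorted(channels_touched(64 * PAGE, interleaved=True)))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

With no striping, every page maps to one channel and bandwidth is limited to that channel; interleaving spreads pages so all channels contribute.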
You mention sharing pointers. Does that mean the GPUs have paged virtual memory with address translation like a CPU?
Yep, GPUs have supported virtual memory for a long time. AMD's GCN had it, and even later Terascale versions (like Terascale 2 in Llano) had virtual memory support.
Hmm. I forgot graphics GPUs were integrated, of course. I was thinking more of the AI GPUs. I would expect Nvidia GPUs like the H100 to have no virtual paging on their HBM, but they could use PCIe/ATS to work with host memory.
With the MI300A everything is HBM, and I would have expected the XCDs to have a stripped-down direct mapping to the HBM. But maybe AMD gave the XCDs virtual addressing?
Discrete GPUs have virtual memory support too, so GCN had page tables (though it used virtually addressed caches, so no TLB lookup was needed on a cache hit).
This applies to Nvidia as well, at least since Pascal (https://nvidia.github.io/open-gpu-doc/pascal/gp100-mmu-format.pdf), but NV likely had virtual memory support well before that.
Very interesting, thank you. Yes, there was paging before Pascal; I found documentation describing the changes made for Pascal.
https://nvidia.github.io/open-gpu-doc/pascal/
The HBM paging seems to use 2MB large pages, and the MMU is generally flexible enough to address host memory.
https://arxiv.org/html/2408.11556v2
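As a rough sketch of what 2MB large-page translation means: the low 21 bits of the virtual address are the page offset and bypass translation entirely, while only the upper bits go through the page tables. A minimal toy translator (the page-table contents below are invented for illustration, and real GPU MMUs use multi-level tables, not a flat dictionary):

```python
# Toy 2MB large-page translation: VA -> PA via a single-level lookup.
# The mapping below is invented; real MMU formats differ (see the
# Pascal MMU documentation linked above).
PAGE_SHIFT = 21                  # 2 MiB pages -> 21 offset bits
PAGE_SIZE = 1 << PAGE_SHIFT

page_table = {0x40: 0x1F3, 0x41: 0x002}   # virtual page -> physical page

def translate(va: int) -> int:
    """Split VA into page number + offset, look up the physical page."""
    vpn, offset = va >> PAGE_SHIFT, va & (PAGE_SIZE - 1)
    return (page_table[vpn] << PAGE_SHIFT) | offset

va = (0x40 << PAGE_SHIFT) | 0x1234
print(hex(translate(va)))        # 0x3e601234
```

Larger pages mean far fewer table entries (and TLB entries) are needed to cover a big HBM footprint, which is one reason accelerators favor 2MB pages over 4KB ones.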
My impression of AI accelerators like the Instinct MI300 has always been that the CPU cores are mostly there to feed the GPU/AI cores, i.e. they're more like traffic cops with a little bit of orchestra conductor thrown in. Thus the much higher latencies for data going to and from the 8-core CPU clusters aren't that injurious to overall performance, as long as they don't affect the CPU's ability to keep the GPU cores busy. I guess that's another reason why updating the CPU chiplets to Zen 5 isn't such a high priority; from what I read, the CPU chiplets are not what bottlenecks the MI300's maximum throughput.
Regarding the statement about desktop PC users being reluctant to accept the trade-offs of APUs with large iGPUs: I'm not sure that's entirely so. Apple M-series desktops, such as the Mac Studio and the higher-end M4 Mac Mini variants, have seen significant uptake. However, on (purchase) price-performance, the classic desktop PC setup is still ahead, at least for now.
Mac Studio isn't such a great example because you have no alternative if you want a reasonably powerful GPU within the Apple ecosystem.
I wrote this piece last year before Strix Halo launched, but it looks like we'll find out whether consumers will be interested in a large iGPU, when good dGPU options are also available.
You're right! I am one of those people who doesn't like macOS, and thus never became a captive audience for the fruity company. Despite hearing "get a Mac" umpteen times; or maybe because of it 😁.
I thought I'd test the M4 Mac Mini with some computational codes. I discovered that Apple removed OpenMP support from the version of clang shipped in Xcode, and that macOS doesn't support setting processor affinity.
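For what it's worth, the usual workaround for the missing OpenMP support is to install the LLVM OpenMP runtime via Homebrew and point Apple's clang at it manually. A sketch of the build commands, assuming a Homebrew install (paths come from `brew --prefix`, so they vary by machine; this is the commonly documented recipe, not something I've verified on an M4 specifically):

```
# Install the LLVM OpenMP runtime (libomp); Apple's clang lacks the driver flag
brew install libomp

# Build with OpenMP enabled via the preprocessor, linking libomp explicitly
clang -Xpreprocessor -fopenmp \
      -I"$(brew --prefix libomp)/include" \
      -L"$(brew --prefix libomp)/lib" -lomp \
      my_code.c -o my_code
```

The affinity limitation has no equivalent workaround, though: macOS only exposes scheduling hints, not hard core pinning.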
Even so, it would be nice to see a performance analysis of the M4 on this website.
And, I omitted this: Thanks Chester for another great article!
Now if Strix Halo supported that type of bandwidth... :D Okay, very different market segment and programming models