That was an awesome post. Something that really piqued my interest is the amount of external memory bandwidth AMD GPUs require. I wonder how Adreno and its tile-based architecture would behave in such a scenario - as in, how the external bandwidth compares between the two. Keep up the good work 🙏
Great article!
Could someone benchmark the Apple M5’s cache setup? I find the reason behind its unbelievable single-core speed very interesting. Thanks!
This is exactly the kind of technical deep dive I love to see. The fact that you were able to programmatically access Infinity Fabric performance counters on Strix Halo is incredible for understanding how well the cache actually works in real-world scenarios. Your finding that the 32MB cache captures about 73% of traffic in Time Spy Extreme is impressive, and it really shows why AMD went this route versus just scaling DRAM bandwidth. The resolution scaling results are particularly interesting, showing that at reasonable gaming resolutions the cache is doing its job perfectly.
This is absolutely brilliant work. The methodology of comparing CS-side traffic vs UMC-side traffic to infer cache hit rate is elegant, and the granularity you achieved with your custom performance monitoring tool is impressive despite the 1-second sampling limitation. Your observation that hit rate variation across and within workloads makes single-number summaries meaningless is crucial - this is something hardware reviewers often miss when they try to boil everything down to one metric. The resolution-dependent behavior you documented aligns with AMD's Hot Chips data, and your data showing maximum CS bandwidth demands in 3DMark Time Spy Extreme exceeding 335 GB/s really underscores why Infinity Cache exists. The fact that Strix Halo achieves PS5-level gaming performance with 256 GB/s LPDDR5X + 32 MB cache vs the PS5's 448 GB/s GDDR6 without such a cache is a testament to the cache's efficacy. Outstanding piece!
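For anyone curious what that CS-vs-UMC differential works out to, here's a minimal sketch of the arithmetic (in Python, with made-up counter values; programming the actual Infinity Fabric events is AMD-specific and not shown):

```python
# Sketch: infer Infinity Cache hit rate from fabric-level traffic counters.
# cs_bytes  = traffic seen at the Coherent Stations (requests from the GPU side)
# umc_bytes = traffic seen at the Unified Memory Controllers (what actually hit DRAM)
# The numbers below are hypothetical, chosen only to illustrate the math.

def infer_hit_rate(cs_bytes: float, umc_bytes: float) -> float:
    """Fraction of CS-side traffic absorbed before reaching DRAM."""
    if cs_bytes <= 0:
        return 0.0
    return max(0.0, 1.0 - umc_bytes / cs_bytes)

sample_interval_s = 1.0          # the tool's sampling granularity
cs_gb, umc_gb = 335.0, 90.0      # hypothetical GB moved during one sample
hit_rate = infer_hit_rate(cs_gb, umc_gb)

print(f"CS bandwidth:   {cs_gb / sample_interval_s:.0f} GB/s")
print(f"UMC bandwidth:  {umc_gb / sample_interval_s:.0f} GB/s")
print(f"Inferred Infinity Cache hit rate: {hit_rate:.0%}")  # ~73% with these inputs
```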
Outstanding analysis Chester! The methodology you used to capture CS vs UMC bandwidth is really clever - using the differential to infer cache hit rate is ingenious given AMD's limited tooling. The fact that Strix Halo stays well under DRAM bandwidth limits even at high resolutions shows the 32MB Infinity Cache was well-sized for this design. Would love to see similar analysis on the MI300 series if you ever get access!
Excellent deep dive into Strix Halo's memory subsystem! The decision to pair a 256-bit LPDDR5X interface with 32MB of Infinity Cache shows AMD's understanding of the bandwidth-latency tradeoff. What I find particularly interesting is how this approach allows them to compete with wider memory interfaces at lower power consumption. The cache hit rate data really demonstrates the effectiveness of the Infinity Cache for GPU workloads. It's a smart way to deliver performance without the cost and power penalty of GDDR memory.
Thanks Chester, and nice to see that you're also using YT now. However, please continue to post your findings and thoughts here and on the website - Thanks!
Two comments and one question about this analysis of Strix Halo's Infinity Cache and memory system:
1. Due to Strix Halo being an APU, going with, let's say, GDDR6 or GDDR7 instead of LPDDR5 would probably have exposed the CPU cores to significantly longer latencies, something GDDR is known for. In a console like the PS5, using GDDR as RAM for the CPU as well isn't that much of a downside. In more general compute situations, latency becomes more of an issue - which is probably why Apple also stuck with LPDDR for their "APUs", including the Max versions of their M SoCs with a large number of GPU cores.
2. Here's my question: do you know, or can you estimate, how large the cost in die area/transistors and power draw of Strix Halo's wide memory controller is compared with the more standard dual-channel design? Any information is appreciated!
Especially in the consumer space (laptops and desktops), the standard reply to any request for more than two channels and/or wider buses has for decades been "that would just increase costs because of the increased die area required, increase power draw even when idle, and wouldn't make a difference for office stuff or gaming anyway".
I hope that this mindset will change now that we have beefier iGPUs that really benefit from higher throughput. Besides, with AI now everywhere, higher data rates are getting more important even in consumer SoCs.
I'm not sure what the power cost is, but people have taken die shots of Strix Halo (https://misdake.github.io/ChipAnnotationViewer/?map=StrixHalo). The memory controllers are on the left and right sides of the IO die.
Idle power draw seems fine because the memory bus clocks down. From the MemClk event, the UMCs appear to idle at 200-300 MHz, and only go up to ~1 GHz under load. RAPL (however accurate that is) puts idle power at ~2W. So the cost would likely be more on the die area side than the idle power side. And yeah, it would only help for big iGPUs, not much for CPU workloads.
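In case anyone wants to reproduce the idle-power observation, here's a minimal sketch of sampling RAPL through the Linux powercap sysfs interface - the domain path below is an assumption that varies by platform and driver, and counter wraparound is ignored:

```python
# Sketch: estimate average package power from the powercap RAPL energy counter.
# The path below is an assumption; check /sys/class/powercap on your machine,
# since the exposed domain names differ between platforms and kernel drivers.
import time

RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"  # assumed package domain

def read_energy_uj(path: str = RAPL_ENERGY) -> int:
    with open(path) as f:
        return int(f.read().strip())

def average_power_w(interval_s: float = 5.0) -> float:
    start = read_energy_uj()
    time.sleep(interval_s)
    end = read_energy_uj()
    # energy_uj is a wrapping microjoule counter; wrap handling is omitted here
    return (end - start) / 1e6 / interval_s

if __name__ == "__main__":
    print(f"Average package power over 5 s: {average_power_w():.2f} W")
```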
Thanks for answering, Chester! This dispels the notion that an SoC having more than two memory channels automatically results in higher power draw. I hope that we'll see more consumer-class SoCs/APUs that feature multi-channel (>2) memory controllers.