ML workloads will be really interesting. A decent-sized GPU with access to 128GB of RAM could be faster than any other consumer device in cases where those just can't fit the model into memory. Intel's B60 already showed that memory capacity alone can make a difference.
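A rough back-of-envelope sketch of why capacity matters here (all numbers are placeholder assumptions, not measurements from the article): for a memory-bound decode, throughput is roughly memory bandwidth divided by bytes read per token, but only if the model fits in memory at all.

    # Rough memory-bound LLM decode estimate: tokens/s ~ memory bandwidth / bytes read per token.
    # All numbers below are illustrative placeholders, not benchmark results.

    def tokens_per_second(model_size_gb: float, mem_bw_gb_s: float) -> float:
        # Assumes every weight is streamed from memory once per generated token
        # (pure bandwidth limit, ignoring compute and KV-cache traffic).
        return mem_bw_gb_s / model_size_gb

    # A ~70 GB model (e.g. ~70B parameters at 8-bit weights) doesn't fit in a typical
    # 16-24 GB consumer dGPU at all, but fits easily in 128 GB of unified memory.
    print(tokens_per_second(model_size_gb=70, mem_bw_gb_s=250))  # ~3.6 tok/s at an assumed 250 GB/s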
Thank you for the article!
Could you double-check the RTX 5070 Mobile cache bandwidth? I think the data isn't correct.
In theory a Blackwell SM can reach 128 bytes/clock cycle from L1. Of course that's the upper limit.
But in the RTX PRO 6000 Blackwell article, one SM reached about 100 bytes/clock cycle.
If we do the math with this number and assume a minimal 1.5 GHz operating frequency and the 5070M's SM count of 36, then the L1 bandwidth should be 36 * 100 * 1.5 GB/s, which is 5400 GB/s. If we assume a 2 GHz operating frequency we reach 7200 GB/s, and at 2.5 GHz, 9000 GB/s. So why is there such a big difference between your measurement (ca. 3000 GB/s) and the math? Thank you very much for the answer, and sorry about my English!
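For reference, a minimal sketch of that arithmetic (the ~100 bytes/clock/SM figure and the clock speeds are the assumptions above, not measured values):

    # Theoretical L1 bandwidth = SM count * bytes per clock per SM * clock (GHz) -> GB/s.
    SM_COUNT = 36                 # RTX 5070 Mobile
    BYTES_PER_CLOCK_PER_SM = 100  # assumed, based on the RTX PRO 6000 Blackwell article

    for clock_ghz in (1.5, 2.0, 2.5):
        bw_gb_s = SM_COUNT * BYTES_PER_CLOCK_PER_SM * clock_ghz
        print(f"{clock_ghz:.1f} GHz -> {bw_gb_s:.0f} GB/s theoretical L1 bandwidth")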
As always a big thanks for the article. However, I would have liked to see a bit more emphasis on the new D2D interconnect. Are there any measurable differences regarding the C2C latencies or regarding power usage in idle/load compared to the good old IFoP of their Desktop counterparts?
Exceptional technical deep-dive! Your dissection of the D2D interconnect versus traditional IFoP is incredibly valuable. The insight that Strix Halo achieves latency parity with monolithic Strix Point despite the chiplet architecture demonstrates AMD's interconnect engineering prowess. I particularly appreciate your bandwidth analysis - 192GB/s from a 4096-bit LPDDR5X interface is transformative for APU workloads. The L3 cache behavior across the CCDs with the unified 32MB structure is fascinating. George, your testing methodology consistently reveals architectural details that other reviewers miss!
Thanks for the article. Who in their right mind would want to get one of these with a dGPU?
I'm a little surprised by the similarity in CPU-to-DRAM latency between Strix Halo and Strix Point. I thought half the point of the new die-to-die interconnect they were using with Strix Halo was to decrease latency by removing the SERDES overhead. Unless I just completely misunderstood that, any idea what might be up with that? Does the large GPU/IO die just add back enough latency for some reason, bringing it back up to Strix Point parity, or what?
Speculating here, but Strix Point is monolithic while Strix Halo is chiplet based, so that upside evaporates in light of the chiplet downside and that’s why latency is largely the same.
Haha, obviously you're right, I don't know what I was thinking. I have no idea how I got it into my mind that Strix Point wouldn't be monolithic.
From that point of view then, it's actually just impressive that Strix Halo is keeping up with Strix Point in terms of latency, I guess.
Yeah, I mean it's been like this since Zen 2; they've always done a relatively good job with latency, or else these CPUs would've failed, I guess.
Yeah, sure, but the desktop CPUs also have the latency advantage of not using LPDDR memory, so it's hard to compare more directly.
I don't think AMD has a traditional chiplet-based CPU that uses LPDDR memory, nor a Strix Halo design that uses non-LP DDR memory, so there's no direct comparison to be made, right?
For the first, only Strix Halo. For the second, I'm pretty sure Strix Halo also works with DDR5, because other mobile APUs also had the ability to use both types of RAM; it's just not used because they want maximum bandwidth, and that's easier with LPDDR5-8000, which is pretty common compared to DDR5 at the same speed (and the latter maybe doesn't even work reliably here, due to the lower latencies).