Great article! Although I believe Ice Lake has up to 40 cores, not 28 :P
Ah, you're right. I was looking at Intel's Hot Chips slides from when they presented Ice Lake SP, and they showed a 28-core die there. But there's a 40-core SKU (https://www.intel.com/content/www/us/en/products/sku/212287/intel-xeon-platinum-8380-processor-60m-cache-2-30-ghz/specifications.html)
Substack makes it a bit hard to fix tables because there is no table support. They're just images. I'll get to it...
For AI workloads in the cloud, say when paired with hardware like Google TPU pods, which form of interconnect best suits AI-related workloads?
Mesh (Granite Rapids) or "semi-mesh-like" (Turing)?
And do AI workloads tend to favour Xeon CPUs, because of the onboard accelerators and because all cores are connected via a mesh?
I assume you mean Turin, not Turing. Turin is the codename for the EPYC Zen 5 family (9005 series). Turing is the name of the NVIDIA GPU microarchitecture used in the RTX 20 series. Turin has a hub-and-spoke connection between DRAM controllers and clusters of 8 Zen 5 CPU cores or 16 Zen 5c CPU cores. When using an accelerator, such as a Google TPU or a GPU, the interconnect between CPU cores doesn't make much difference for AI applications because the heavy lifting is being done by the accelerator. Based on this Chips and Cheese article, it looks to me like Intel's current mesh interconnect is not as good as AMD's hub-and-spoke, but that has nothing to do with AI applications. I wrote a proposal for what I think would be a better interconnect between CPU cores for a future Xeon processor and I posted it here:
https://www.realworldtech.com/forum/?threadid=226708
A mesh interconnect would have to run slower (it boils down to the top metal layers and vias), since more signal and power wiring has to be installed to establish the connections.
But I would imagine a chiplet design (AMD EPYC) is more to accommodate cloud virtualization, serving multiple "independent" workloads.
Whilst a "mesh" interconnect is more for "dependent" workloads that need to move data to another core to tap its execution ports there.
Back to the laws of physics: if I need more things to communicate, I have to go with a mesh-style interconnect.
Reading Chester's write-up, the mesh interconnect isn't good yet.
But is that due to the implementation of the memory controllers, or just too many vias being crammed onto the Intel 3 process node for Xeon 6? Because if it's the latter, there will be a "heat effect", since too much wiring pulls back how fast (uncore frequency) I can run that mesh.
That leads me to the key point: with Intel 18A they implemented a new form of power delivery ("backside power delivery"), which improves IR drop. And the Panther Lake laptop reviews do point to that improvement.
So if Intel adopts 18A for "Diamond Rapids" (the high-performance version of 7th Gen Xeon), I'm anticipating a big uplift in interconnect performance.
Lowering IR drop can allow me to run the mesh at a higher frequency.
Any views, feel free to share :)
I think the main problem with Xeon's current mesh interconnect is that there are up to 3 transfers over the mesh to load from DRAM or L3 cache, as explained in the link below.
https://www.realworldtech.com/forum/?threadid=226708&curpostid=226711
This results in the L3 latency of Granite Rapids being 3x higher than Turin (33ns vs 11ns) in sub-NUMA clustering mode. Also, the mesh in Granite Rapids becomes clogged when all the cores are accessing L3 cache. As Chips and Cheese measured for Granite Rapids, the L3 bandwidth per core is 2.7x higher when only one core is doing read-modify-writes of L3 compared to when all cores are doing read-modify-writes of L3 (60 GB/s vs 22 GB/s) in sub-NUMA clustering mode. When all cores are reading L3, Turin has about 9x more L3 bandwidth per core than Granite Rapids (98 GB/s vs 11 GB/s). Xeon's traditional strength has been when all cores are running a single application sharing a huge L3. It remains to be seen if that traditional strength will hold up against Zen 6 Venice-X with 3D V-Cache and the MI400A with 384 MB of L4 cache.
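As a rough sanity check tying this to the "3 transfers" point above (purely illustrative: I'm assuming each mesh transfer costs about what Turin's single hub-and-spoke transfer does, which is certainly not exact):

```python
# Rough sanity check, not a model: if an L3 load on Granite Rapids takes
# up to 3 mesh transfers, and each transfer costs roughly what Turin's
# single transfer does, the 3x latency gap falls out directly.
turin_l3_latency_ns = 11   # measured L3 latency, Turin, SNC mode
transfers_per_load = 3     # up to 3 mesh transfers per L3/DRAM load
print(transfers_per_load * turin_l3_latency_ns)  # -> 33 ns, the Granite Rapids figure
```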
So it's a bottleneck that has more to do with the amount of work the Intel memory controllers have to do when MANY cores are trying to access the L3 cache slices, and is probably a smaller function of the mesh interconnect performance?
The mesh interconnect in Granite Rapids has poor performance regardless of whether only a single core is accessing L3 or many cores are accessing L3. The 4th graph in the article shows the L3 bandwidth of a single core doing read-modify-writes is 261 GB/s for Turin versus 60 GB/s for Granite Rapids. The 5th graph in the article shows total chip-wide cache bandwidth. This 5th graph is more difficult to interpret because it is comparing a 32 core Turin to a 96 core Granite Rapids and the horizontal axis is "Data Across All Threads". A fair comparison is the L3 bandwidth per core, which is 3123.61/32 = 98 GB/s for Turin versus 1076.8/96 = 11 GB/s for Granite Rapids.
The 5th graph appears to be for reads only, rather than read-modify-writes. An upper bound for the L3 bandwidth per core when all cores are doing read-modify-writes on L3 is 22 GB/s for Granite Rapids (twice the read-only rate). The actual read-modify-write L3 bandwidth per core is probably a bit lower than that. The effect of mesh congestion due to all cores accessing L3 compared to a single core accessing L3 is therefore even more than the factor of 60/22 = 2.7x mentioned in my previous post. This is in sub-NUMA clustering mode, where the source and destination of L3 accesses are on the same chiplet. Mesh congestion and L3 latency for Granite Rapids would be significantly worse in uniform memory access mode because the source for an L3 access would be evenly spread across all the chiplets instead of being localized to the destination chiplet.
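Here is the arithmetic from these two posts in one place (all figures are read off the article's graphs, not re-measured):

```python
# Per-core L3 read bandwidth from the article's chip-wide numbers (SNC mode).
turin_total_gbps, turin_cores = 3123.61, 32  # EPYC 9355P (Zen 5 Turin)
gnr_total_gbps, gnr_cores = 1076.8, 96       # Granite Rapids

turin_per_core = turin_total_gbps / turin_cores  # ~98 GB/s
gnr_per_core = gnr_total_gbps / gnr_cores        # ~11 GB/s
print(turin_per_core / gnr_per_core)             # ~8.7x, the "about 9x" above

# Read-modify-write upper bound for Granite Rapids: at best 2x the
# read-only rate, since each element is read once and written once.
rmw_bound = 2 * gnr_per_core
print(rmw_bound)       # ~22 GB/s per core with all cores active
print(60 / rmw_bound)  # >= 2.7x drop vs the 60 GB/s single-core rate
```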
The measurements in this article show that Intel has work to do improving their on-chip interconnect network. Some clues about future improvements in Diamond Rapids and Coral Rapids are visible in Intel's Clearwater Forest presentation from Hot Chips 2025. All of these products use Intel 18A and will undoubtedly have similar chiplet constructions. Clearwater Forest has 2 PCIe/accelerator chiplets and 3 base chiplets. On top of each base chiplet, 4 CPU chiplets are placed side-by-side. The base chiplets contain the L3 cache, DRAM controllers and part of the interconnect network. Stacking CPU chiplets on top of base chiplets roughly doubles the number of wire layers available for the fabric between cores and L3 cache slices, which should improve fabric performance. This chiplet construction also gives Intel an opportunity to provide an option like AMD's 3D V-Cache. The large package needed to support 16 DRAM channels for Diamond Rapids and Coral Rapids provides space for High Bandwidth Memory (HBM) and/or High Bandwidth Flash (HBF). The smaller CPU chiplets will help 18A yield.
Clearwater Forest has only a 35 GBytes/sec connection from each quad-core module to L3 cache, which is 7.5x slower than AMD's Zen 5 Turin. Hopefully, future P core Xeons will do a lot better than that. For a 256 core Coral Rapids processor, one option is 16 cores per CPU chiplet, 4 CPU chiplets per base chiplet and 4 base chiplets. The interconnection network could be:
* 2 levels of fully-connected order 4 meshes to connect the 16 CPUs on each CPU chiplet,
* 1 level of fully-connected order 5 mesh to connect the 4 CPU chiplets and DRAM controller on each base chiplet and
* a ring to connect the 4 base chiplets and 2 PCIe/accelerator chiplets.
A different option for the same 256 core Coral Rapids processor is 8 cores per CPU chiplet, 4 CPU chiplets per base chiplet and 8 base chiplets. The interconnection network for this option could be (a rough link-count sketch comparing both options follows this list):
* 1 level of fully-connected order 8 mesh to connect the 8 CPUs on each CPU chiplet,
* 1 level of fully-connected order 5 mesh to connect the 4 CPU chiplets and DRAM controller on each base chiplet and
* a ring to connect the 8 base chiplets and 2 PCIe/accelerator chiplets.
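To make the wiring cost of the two options concrete, here's a rough link count. This is speculation on top of speculation: I'm reading "fully-connected order N" as a complete graph with N*(N-1)/2 links, counting only logical links, and ignoring link widths, the DRAM controller ports and EMIB details.

```python
# Link counts for the two hypothetical Coral Rapids topologies above.
def k(n):
    """Links in a fully-connected (complete) graph on n nodes."""
    return n * (n - 1) // 2

def ring(n):
    """Links in a ring of n nodes."""
    return n

# Option 1: 16 cores per CPU chiplet (4 clusters of 4), 4 CPU chiplets
# per base chiplet, 4 base chiplets -> 16 CPU chiplets total.
per_cpu_chiplet_1 = 4 * k(4) + k(4)  # level 1 + level 2 = 30 links
option_1 = 16 * per_cpu_chiplet_1 + 4 * k(5) + ring(4 + 2)

# Option 2: 8 cores per CPU chiplet, 4 CPU chiplets per base chiplet,
# 8 base chiplets -> 32 CPU chiplets total.
per_cpu_chiplet_2 = k(8)             # 28 links
option_2 = 32 * per_cpu_chiplet_2 + 8 * k(5) + ring(8 + 2)

print(option_1, option_2)  # 526 vs 986 links
```

Option 2 spends nearly twice the links, but any core can reach any other core on its chiplet in a single hop; option 1 trades link count for an extra hierarchy level inside each CPU chiplet and a shorter ring.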
I'm guessing the base chiplets and PCIe/accelerator chiplets have to be connected by a ring because the EMIB interconnect density is not high enough to support a fully-connected (i.e. all-to-all) mesh. This ring could use SerDes embedded clock signalling to increase the bandwidth. Since sending a physical address and receiving data contribute equally to latency but the physical address is about 1/10th the width of a cache line (a 64-byte line is 512 bits of data, while a physical address is around 50 bits), there could be a faster and more expensive interconnect network to send the address for a load compared to the data.
To reduce L3 cache latency, the L3 cache could be only shared across the cores on a CPU chiplet and not accessible by the other cores in the processor. L4 slices could be added that are shared across all the CPU chiplets stacked on a base chiplet or shared across the whole processor. For example, a CPU chiplet could access 32 MBytes of L3 shared across 16 cores and 128 MBytes of L4 shared across the 64 cores stacked on a base chiplet or 512 MBytes of L4 shared across the 256 cores in a Coral Rapids processor. Optional 3D V-Cache could be used to double the size of the L3 and L4 caches.
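To illustrate why the split could pay off, here's a back-of-the-envelope average-latency comparison. Only the 33 ns shared-L3 figure comes from the Granite Rapids measurement above; the hit rates and the local-L3/L4/DRAM latencies are invented for illustration.

```python
def avg_latency_ns(levels):
    """levels: (hit_rate_among_remaining_accesses, latency_ns) per level,
    ending with a catch-all level whose hit rate is 1.0."""
    total, remaining = 0.0, 1.0
    for hit_rate, latency in levels:
        total += remaining * hit_rate * latency
        remaining *= 1.0 - hit_rate
    return total

# One big shared L3: every hit pays the long mesh round trip.
shared = avg_latency_ns([(0.60, 33), (1.0, 90)])  # L3, then DRAM
# Small chiplet-local L3 plus base-chiplet L4: most hits are cheap,
# and the L4 catches a good share of what spills out of the local L3.
split = avg_latency_ns([(0.45, 12), (0.50, 30), (1.0, 90)])
print(shared, split)  # ~56 ns vs ~38 ns with these made-up rates
```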
In the 4th graph titled "Single Core Bandwidth", does anyone here have a guess about why the part of the curve for the L2 bandwidth starts to increase after a test size of 256 KBytes and reaches a local maximum at a test size of about 600 to 800 KBytes for the EPYC 9355P (1 MB L2 per core)? In other words, for the EPYC 9355P, why does using more of the L2 capacity cause the L2 bandwidth to increase, from 420 to 550 GB/sec for "EPYC 9355P, Add" and from 190 to 280 GB/sec for "EPYC 9355P, Read"? This is about a 30% to 50% increase in L2 bandwidth just from using more of the L2 capacity.
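For anyone who wants to poke at this locally, the experiment has roughly the shape below. To be clear, this is not the Chips and Cheese harness (theirs is native code with explicit vector loops); a numpy sketch only illustrates the size sweep and what an "Add"-style pattern might look like.

```python
import time
import numpy as np

def add_bandwidth_gbps(test_size_bytes, iters=100):
    """'Add'-style kernel: read two arrays, write a third. The three
    arrays together make up the test size, to mimic the graph's x-axis."""
    n = test_size_bytes // (8 * 3)  # float64 elements per array
    a, b, c = np.ones(n), np.ones(n), np.empty(n)
    start = time.perf_counter()
    for _ in range(iters):
        np.add(a, b, out=c)         # 2 reads + 1 write per element
    elapsed = time.perf_counter() - start
    return 24 * n * iters / elapsed / 1e9  # 3 arrays x 8 bytes x n per iter

# Sweep across the region where the EPYC 9355P curve climbs (256-800 KB).
for kib in (128, 256, 384, 512, 768, 1024, 2048):
    print(f"{kib:5d} KiB: {add_bandwidth_gbps(kib * 1024):6.1f} GB/s")
```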
The section titled "Single Core Performance: SPEC CPU2017" says "I’ve been running the rate suite with a single copy to summarize single-thread performance." Wouldn't it be better to use the SPECspeed suite for single-thread performance? My understanding is that SPECrate is designed for measuring the throughput of multiple cores while SPECspeed is designed for measuring latency.
The "System Overview" section says the Amazon r8i instance uses a 6985P-C processor but it's actually a 6975P-C processor. The graphs in the article are labeled correctly.
I think some high-core-count, multi-chiplet processors will be used to replace older scale-up systems. Currently, eight-socket Lenovo x3950 X6 servers are selling for much less than $1000 on eBay.
In my opinion these older 8-socket servers have interesting performance characteristics and make an interesting point of comparison for the new machines.
Thanks Chester, another great deep dive! I also appreciate the inclusion of the Graviton CPU instances, as hyperscalers like AWS deploy more and more ARM-based servers. A question more related to the AWS rentals: what were (roughly) the relative rental expenses for the Xeons, the EPYC and AWS's own Graviton at comparable compute capabilities?
And did you or George hear anything new about the status of the next-generation Xeons? I believe it'll be critical for Intel to deliver both Panther Lake for notebooks and the next "Rapids" on Intel 18A in the next 6-9 months.
I didn't hear anything besides Clearwater Forest (288 E-Cores), but I also don't keep up with the rumor mill. Logically they would want to put Lion Cove on the server side, with AVX-512 and AMX enabled just like with Redwood Cove on Xeon 6. Hopefully a public cloud will have an instance for rent when/if those show up.
Expense was high for the Xeon 6 one because I spent perhaps a bit more time than I should have testing things, including failed experiments like "can I get more bandwidth via
1-->2
1------>3
or
1--->2<---3
kind of access patterns" (nope)
Thanks for the response, and for being so persistent scouting for additional bandwidth in Xeon 6. Wendell from Level1Techs had an interesting video (YT) on his experiences with the largest Xeon 6 (128 cores) in a single-socket system. One area where Intel's Xeon 6 still has a bit of a niche is multi-socket setups; they can utilize enormous amounts of RAM, which might be interesting for really large (several TB) RAM-resident databases (HANA...) and possibly for very large language models. The latter possibly even more so if NVLink becomes available for future Xeons.