The measurements in this article show that Intel has work to do improving their on-chip interconnect network. Some clues about future improvements in Diamond Rapids and Coral Rapids are visible in Intel's Clearwater Forest presentation from Hot Chips 2025. All of these products use Intel 18A and will undoubtedly have similar chiplet constructions. Clearwater Forest has 2 PCIe/accelerator chiplets and 3 base chiplets. On top of each base chiplet, 4 CPU chiplets are placed side-by-side. The base chiplets contain the L3 cache, DRAM controllers and part of the interconnect network. Stacking CPU chiplets on top of base chiplets roughly doubles the number of wire layers available for the fabric between cores and L3 cache slices, which should improve fabric performance. This chiplet construction also gives Intel an opportunity to provide an option like AMD's 3D V-Cache. The large package needed to support 16 DRAM channels for Diamond Rapids and Coral Rapids provides space for High Bandwidth Memory (HBM) and/or High Bandwidth Flash (HBF). The smaller CPU chiplets will help 18A yield.
Clearwater Forest has only a 35 GBytes/sec connection from each quad core module to L3 cache, which is 7.5x slower than AMD's Zen 5 Turin. Hopefully, future P core Xeons will do a lot better than that. For a 256 core Coral Rapids processor, one option is 16 cores per CPU chiplet, 4 CPU chiplets per base chiplet and 4 base chiplets. The interconnection network could be:
* 2 levels of fully-connected order 4 meshes to connect the 16 CPUs on each CPU chiplet,
* 1 level of fully-connected order 5 mesh to connect the 4 CPU chiplets and DRAM controller on each base chiplet and
* a ring to connect the 4 base chiplets and 2 PCIe/accelerator chiplets.
A different option for the same 256 core Coral Rapids processor is 8 cores per CPU chiplet, 4 CPU chiplets per base chiplet and 8 base chiplets. The interconnection network for this option could be:
* 1 level of fully-connected order 8 mesh to connect the 8 CPUs on each CPU chiplet,
* 1 level of fully-connected order 5 mesh to connect the 4 CPU chiplets and DRAM controller on each base chiplet and
* a ring to connect the 8 base chiplets and 2 PCIe/accelerator chiplets.
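To get a feel for the wiring cost of these two layouts, here is a quick back-of-the-envelope sketch. Reading "2 levels of fully-connected order 4 meshes" as 4 clusters of 4 cores joined by another order-4 mesh, and all of the link arithmetic, are my own assumptions, not anything Intel has described:

```python
# Rough link-count comparison of the two hypothetical 256-core Coral Rapids
# layouts above. My own interpretation and arithmetic, not Intel's.

def full_mesh_links(n):
    """Point-to-point links needed to fully connect n nodes (n choose 2)."""
    return n * (n - 1) // 2

# Option 1: 16 cores per CPU chiplet, 4 CPU chiplets per base chiplet, 4 base chiplets.
# "2 levels of order-4 meshes" read as 4 fully-connected clusters of 4 cores,
# plus one more order-4 mesh joining the clusters.
opt1_cpu   = 4 * full_mesh_links(4) + full_mesh_links(4)  # 30 links per CPU chiplet
opt1_base  = full_mesh_links(5)                           # 4 CPU chiplets + DRAM controller
opt1_ring  = 4 + 2                                        # ring links = ring nodes
opt1_total = 16 * opt1_cpu + 4 * opt1_base + opt1_ring

# Option 2: 8 cores per CPU chiplet, 4 CPU chiplets per base chiplet, 8 base chiplets.
opt2_cpu   = full_mesh_links(8)                           # 28 links per CPU chiplet
opt2_base  = full_mesh_links(5)
opt2_ring  = 8 + 2
opt2_total = 32 * opt2_cpu + 8 * opt2_base + opt2_ring

print(f"Option 1: {opt1_total} links ({opt1_cpu}/CPU chiplet, {opt1_base}/base chiplet, {opt1_ring} in the ring)")
print(f"Option 2: {opt2_total} links ({opt2_cpu}/CPU chiplet, {opt2_base}/base chiplet, {opt2_ring} in the ring)")
```

By this rough count, option 2 needs more links in total but keeps each fully-connected group small, while option 1 concentrates more wiring inside each CPU chiplet and keeps the base-chiplet ring shorter.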
I'm guessing the base chiplets and PCIe/accelerator chiplets have to be connected by a ring because the EMIB interconnect density is not high enough to support a fully-connected (i.e. all-to-all) mesh. This ring could use SerDes with embedded-clock signalling to increase bandwidth. Since sending out the physical address and receiving the data back contribute roughly equally to load latency, but a physical address is only about 1/10th the width of a cache line, the address path of a load could use a faster, more expensive interconnect than the data path.
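As a quick sanity check on that 1/10th figure (my own arithmetic, assuming a 64-byte cache line and roughly a 52-bit physical address):

```python
# Back-of-the-envelope check of the address-vs-data width argument above.
# The 52-bit physical address width is an illustrative assumption; the exact
# number varies by product.
cache_line_bits = 64 * 8   # 64-byte cache line = 512 bits
phys_addr_bits  = 52       # assumed physical address width

print(f"address / cache line width ratio: {phys_addr_bits / cache_line_bits:.2f}")
# ~0.10: the outgoing request is roughly a tenth the width of the returning
# cache line, so a narrow-but-fast address network could trim load latency
# without needing an equally fast (and wide) data path.
```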
To reduce L3 cache latency, the L3 cache could be shared only across the cores on a CPU chiplet and not be accessible by the other cores in the processor. L4 slices could then be added that are shared across all the CPU chiplets stacked on a base chiplet, or across the whole processor. For example, a CPU chiplet could access 32 MBytes of L3 shared across its 16 cores, plus either 128 MBytes of L4 shared across the 64 cores stacked on a base chiplet or 512 MBytes of L4 shared across all 256 cores in a Coral Rapids processor. Optional 3D V-Cache could be used to double the size of the L3 and L4 caches.
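To make the per-core numbers in that example concrete (again, these cache sizes are hypothetical, not anything Intel has announced):

```python
# Per-core capacity of the hypothetical L3/L4 hierarchy sketched above.
# All sizes are illustrative assumptions, not announced specs.
l3_per_cpu_chiplet = 32    # MB, shared by 16 cores on one CPU chiplet
l4_per_base        = 128   # MB, shared by 64 cores on one base chiplet
l4_per_package     = 512   # MB, shared by all 256 cores

print(f"L3 per core:                 {l3_per_cpu_chiplet / 16:.1f} MB")
print(f"L4 per core (base chiplet):  {l4_per_base / 64:.1f} MB")
print(f"L4 per core (whole package): {l4_per_package / 256:.1f} MB")
# Each level works out to 2 MB per core; 3D V-Cache-style stacking that
# doubles each level would raise that to roughly 4 MB of L3 plus 4 MB of L4
# per core.
```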
In the 4th graph titled "Single Core Bandwidth", does anyone here have a guess about why the part of the curve for the L2 bandwidth starts to increase after a test size of 256 KBytes and reaches a local maximum at a test size of about 600 to 800 KBytes for the EPYC 9355P (1 MB L2 per core)? In other words, for the EPYC 9355P, why does using more of the L2 capacity cause the L2 bandwidth to increase, from 420 to 550 GB/sec for "EPYC 9355P, Add" and from 190 to 280 GB/sec for "EPYC 9355P, Read"? This is about a 30% to 50% increase in L2 bandwidth just from using more of the L2 capacity.
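The percentages come from simple ratios of the bandwidths read off the graph:

```python
# Ratios behind the 30% to 50% figure, using bandwidths read off the graph.
add_low, add_high   = 420, 550   # GB/s, "EPYC 9355P, Add"
read_low, read_high = 190, 280   # GB/s, "EPYC 9355P, Read"

print(f"Add:  {add_high / add_low - 1:.0%} increase")    # ~31%
print(f"Read: {read_high / read_low - 1:.0%} increase")  # ~47%
```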
The section titled "Single Core Performance: SPEC CPU2017" says "I’ve been running the rate suite with a single copy to summarize single-thread performance." Wouldn't it be better to use the SPECspeed suite for single-thread performance? My understanding is that SPECrate is designed for measuring the throughput of multiple cores while SPECspeed is designed for measuring latency.
The "System Overview" section says the Amazon r8i instance uses a 6985P-C processor but it's actually a 6975P-C processor. The graphs in the article are labeled correctly.
I think some high-core-count, multi-chiplet processors will be used to replace older scale-up systems. Currently, eight-socket Lenovo x3950 X6 servers are selling for much less than $1000 on eBay.
In my opinion these older 8-socket servers have interesting performance characteristics and make an interesting point of comparison for the new machines.
Thanks Chester, another great deep dive! Appreciate also the inclusion of the Graviton CPU instances, as hyperscalers like AWS deploy more and more ARM-based servers. A question more related to the AWS rentals: what were (roughly) the relative rental costs for the Xeons, the EPYC and AWS's own Graviton at comparable compute capabilities?
And, did you or George hear anything new about the status of the next-generation Xeons? I believe it'll be critical for Intel to deliver both Panther Lake for notebooks and the next "Rapids" on 18A within the next 6-9 months.
I didn't hear anything beyond Clearwater Forest (288 E-Cores), but I also don't keep up with the rumor mill. Logically they would want to put Lion Cove on the server side, with AVX-512 and AMX enabled just like with Redwood Cove on Xeon 6. Hopefully a public cloud will have an instance for rent when/if those show up.
Expense was high for the Xeon 6 one because I spent perhaps a bit more time than I should have testing things, including failed experiments like "can I get more bandwidth via
1-->2
1------>3
or
1--->2<---3
kind of access patterns" (nope)
Thanks for the response, and for being so persistent scouting for additional bandwidth in Xeon 6. Wendell from Level1techs had an interesting video (YT) on his experiences with the largest Xeon 6 (128 cores) in a single-socket system. One area where Intel's Xeon 6 still has a bit of a niche is multi-socket setups; they can utilize enormous amounts of RAM, which might be interesting for really large (several TB) RAM-resident databases (HANA...) and possibly for very large language models. The latter possibly even more so if NVLink becomes available for future Xeons.
Great article! Although I believe Ice Lake has up to 40 cores, not 28 :P
Ah you're right. I was looking at Intel's Hot Chips slides when they presented Ice Lake SP, and they showed a 28 core die there. But there's a 40 core SKU (https://www.intel.com/content/www/us/en/products/sku/212287/intel-xeon-platinum-8380-processor-60m-cache-2-30-ghz/specifications.html)
Substack makes it a bit hard to fix tables because there is no table support. They're just images. I'll get to it...