Discussion about this post

nicball:

Great article! Although I believe Ice Lake has up to 40 cores, not 28 :P

Evan R.:

The measurements in this article show that Intel has work to do improving its on-chip interconnect network. Some clues about future improvements in Diamond Rapids and Coral Rapids are visible in Intel's Clearwater Forest presentation from Hot Chips 2025. All of these products use Intel 18A and will undoubtedly have similar chiplet constructions.

Clearwater Forest has 2 PCIe/accelerator chiplets and 3 base chiplets. On top of each base chiplet, 4 CPU chiplets are placed side by side. The base chiplets contain the L3 cache, DRAM controllers and part of the interconnect network. Stacking CPU chiplets on top of base chiplets roughly doubles the number of wire layers available for the fabric between cores and L3 cache slices, which should improve fabric performance. This chiplet construction also gives Intel an opportunity to offer an option like AMD's 3D V-Cache. The large package needed to support 16 DRAM channels on Diamond Rapids and Coral Rapids provides space for High Bandwidth Memory (HBM) and/or High Bandwidth Flash (HBF), and the smaller CPU chiplets will help 18A yields.
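
For concreteness, here is a minimal sketch of how those chiplet counts add up; the 288 E-core total is the commonly reported figure for Clearwater Forest and is an assumption here, not something stated above.

```python
# Clearwater Forest package as described above.
PCIE_CHIPLETS = 2
BASE_CHIPLETS = 3
CPU_CHIPLETS_PER_BASE = 4                       # placed side by side on each base chiplet

cpu_chiplets = BASE_CHIPLETS * CPU_CHIPLETS_PER_BASE            # 12 CPU chiplets
total_chiplets = cpu_chiplets + BASE_CHIPLETS + PCIE_CHIPLETS   # 17 chiplets in the package

REPORTED_E_CORES = 288                          # widely reported figure (assumption)
cores_per_cpu_chiplet = REPORTED_E_CORES // cpu_chiplets        # 24 cores = 6 quad-core modules

print(cpu_chiplets, total_chiplets, cores_per_cpu_chiplet)      # 12 17 24
```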

Clearwater Forest has only a 35 GBytes/sec connection from each quad-core module to L3 cache, about 7.5x less bandwidth than AMD's Zen 5 Turin. Hopefully, future P-core Xeons will do a lot better than that. For a 256-core Coral Rapids processor, one option is 16 cores per CPU chiplet, 4 CPU chiplets per base chiplet and 4 base chiplets. The interconnection network could be (a rough link-count sketch follows the list):

* 2 levels of fully-connected order-4 meshes to connect the 16 cores on each CPU chiplet,

* 1 level of fully-connected order-5 mesh to connect the 4 CPU chiplets and DRAM controller on each base chiplet and

* a ring to connect the 4 base chiplets and 2 PCIe/accelerator chiplets.
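
As a rough sanity check on the wiring cost of this first option, here is a back-of-the-envelope sketch (my own arithmetic, not from Intel), treating each "fully-connected order-n mesh" as a complete graph on n nodes:

```python
def all_to_all_links(n: int) -> int:
    """Point-to-point links in a complete (all-to-all) graph of n nodes."""
    return n * (n - 1) // 2

def ring_links(n: int) -> int:
    """Links in a ring of n nodes."""
    return n

# Option 1: 16 cores per CPU chiplet, 4 CPU chiplets per base chiplet, 4 base chiplets.
cpu_chiplets_per_base, base_chiplets, pcie_chiplets = 4, 4, 2
cpu_chiplets = cpu_chiplets_per_base * base_chiplets          # 16 CPU chiplets, 256 cores

# Levels 1+2: four clusters of 4 cores, each a complete K4, plus a K4 joining the clusters.
links_per_cpu_chiplet = 4 * all_to_all_links(4) + all_to_all_links(4)   # 30
# Level 3: 4 CPU chiplets + 1 DRAM controller per base chiplet, a complete K5.
links_per_base_chiplet = all_to_all_links(5)                            # 10
# Level 4: ring over 4 base chiplets + 2 PCIe/accelerator chiplets.
top_level_links = ring_links(base_chiplets + pcie_chiplets)             # 6

total = (cpu_chiplets * links_per_cpu_chiplet
         + base_chiplets * links_per_base_chiplet
         + top_level_links)
print(total)   # 16*30 + 4*10 + 6 = 526 links
```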

A different option for the same 256-core Coral Rapids processor is 8 cores per CPU chiplet, 4 CPU chiplets per base chiplet and 8 base chiplets. The interconnection network for this option could be (again with a link-count sketch after the list):

* 1 level of fully-connected order-8 mesh to connect the 8 cores on each CPU chiplet,

* 1 level of fully-connected order-5 mesh to connect the 4 CPU chiplets and DRAM controller on each base chiplet and

* a ring to connect the 8 base chiplets and 2 PCIe/accelerator chiplets.
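
The same link count for this second layout, for comparison (again my own arithmetic under the same complete-graph reading):

```python
def all_to_all_links(n: int) -> int:
    """Point-to-point links in a complete (all-to-all) graph of n nodes."""
    return n * (n - 1) // 2

# Option 2: 8 cores per CPU chiplet, 4 CPU chiplets per base chiplet, 8 base chiplets.
cpu_chiplets_per_base, base_chiplets, pcie_chiplets = 4, 8, 2
cpu_chiplets = cpu_chiplets_per_base * base_chiplets     # 32 CPU chiplets, 256 cores

links_per_cpu_chiplet = all_to_all_links(8)              # K8 over 8 cores: 28 links
links_per_base_chiplet = all_to_all_links(5)             # 4 CPU chiplets + DRAM ctrl: 10 links
top_level_links = base_chiplets + pcie_chiplets          # ring of 10 nodes: 10 links

total = (cpu_chiplets * links_per_cpu_chiplet
         + base_chiplets * links_per_base_chiplet
         + top_level_links)
print(total)   # 32*28 + 8*10 + 10 = 986 links, versus 526 for the first option
```

The flat order-8 mesh keeps every pair of cores on a CPU chiplet one hop apart, but it roughly doubles the total link count and lengthens the top-level ring from 6 to 10 stops.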

I'm guessing the base chiplets and PCIe/accelerator chiplets have to be connected by a ring because the EMIB interconnect density is not high enough to support a fully-connected (i.e. all-to-all) mesh. This ring could use SerDes links with embedded-clock signalling to increase bandwidth. Sending the physical address and returning the data contribute equally to a load's latency, yet the address is only about 1/10th the width of a cache line, so a faster and more expensive interconnect could be used for sending addresses than for returning data.
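
A quick sketch of that width argument, assuming a 64-byte cache line and a 52-bit physical address (both widths are my assumptions for illustration; the actual physical address width varies by product):

```python
CACHE_LINE_BITS = 64 * 8    # 64-byte cache line = 512 bits
PHYS_ADDR_BITS = 52         # assumed physical address width; varies by product

ratio = PHYS_ADDR_BITS / CACHE_LINE_BITS
print(f"address width / cache line width = {ratio:.2f}")   # ~0.10, i.e. about 1/10th

# If the request (address) path and the response (data) path contribute equally
# to load latency, speeding up the narrow address path costs roughly a tenth of
# the wires that the same speed-up would cost on the data return path.
```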

To reduce L3 cache latency, the L3 cache could be shared only across the cores on a CPU chiplet and not be accessible by the other cores in the processor. L4 slices could then be added that are shared across all the CPU chiplets stacked on a base chiplet, or across the whole processor. For example, a CPU chiplet could access 32 MBytes of L3 shared across its 16 cores, plus 128 MBytes of L4 shared across the 64 cores stacked on a base chiplet or 512 MBytes of L4 shared across the 256 cores in a Coral Rapids processor. Optional 3D V-Cache could be used to double the size of the L3 and L4 caches.
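
Working through the capacities in that example (sizes and sharing domains are taken from the numbers above; the doubling factor is the optional 3D V-Cache just mentioned):

```python
MB = 2**20

# Cache level, total capacity, and number of cores sharing it, from the example above.
levels = [
    ("L3 (per CPU chiplet)",   32 * MB,  16),
    ("L4 (per base chiplet)", 128 * MB,  64),
    ("L4 (whole processor)",  512 * MB, 256),
]

for name, size, cores in levels:
    per_core = size / cores / MB
    print(f"{name}: {size // MB} MB total, {per_core:.1f} MB per core "
          f"({2 * per_core:.1f} MB with 3D V-Cache)")
```

Notably, every level in the example works out to 2 MB per core; the latency benefit comes from shrinking the L3's sharing domain, not from giving each core more capacity.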
