6 Comments
Evan R.

Why do the numbers in the left diagram of the 7th image (for AMD EPYC 9355P NPS4) not match the table in the 6th image? The numbers along the upper-left to lower-right diagonal of the table in the 6th image differ by 5%, meaning the latency of 3 DRAM channels in the I/O hub to the closest CPU chiplet differs by 5%. CPU0 to Mem2 or Mem3 has 6% less latency than CPU2 to Mem0 or Mem1. Does this mean these measurements are only accurate to 5% to 6% or is there some other explanation for these variations? Was the processor changing the clock frequencies while this test was running?

David. Hellyx

The outdated IOd is getting in the way.

Peter W.

Thanks Chester! Would you expect the results for the 32-core Xeon 6 (Granite Rapids) 6745P to be any different from the larger Xeon 6 you tested? The 6745P can hit similarly high all-core turbo frequencies as the 9355P you tested here, hence my question. Thanks!

Evan R.

EPYC 9355P (32 cores) has 12 channels of DDR5-6400 DRAM while Xeon 6745P (32 cores) has only 8 channels of DDR5-6400 DRAM. The smallest Granite Rapids SKU with 12 channels of DRAM has 72 cores. With the performance profile set to compute mode, Xeon 6745P has approximately the same CPU clock frequencies as EPYC 9355P. The L3 cache on both processors runs on a separate clock domain. The main things that are hurting Granite Rapids performance compared to Turin are the L3 latency, L3 bandwidth and DRAM latency. For the Xeon 6975P-C and EPYC 9355P tested by Chips and Cheese, Granite Rapids has:

* about 3x higher L3 latency than Turin (33.25ns vs 10.58ns),

* roughly 1/4 the single-core L3 bandwidth of Turin (59.81 vs 261.32 GBytes/sec),

* 1.4x higher worst-case unloaded DRAM latency than Turin (181.54ns vs 129.42ns) and

* 1.14x higher best-case unloaded DRAM latency than Turin (131.97ns vs 115.42ns).

See the Chips and Cheese article titled "A Look into Intel Xeon 6’s Memory Subsystem" for the L3 latency and L3 bandwidth measurements. The Xeon 6975P-C was in sub-NUMA clustering mode (SNC3) for these measurements.
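
The ratios quoted above can be rechecked from the raw numbers with a few lines of Python. This is only a sketch: the dictionary keys and the `ratios` name are made up for illustration, and the values are copied from the list above (Granite Rapids first, Turin second).

```python
# (Granite Rapids, Turin) pairs copied from the comment above.
# Key names are illustrative only, not taken from any tool.
measurements = {
    "l3_latency_ns": (33.25, 10.58),
    "l3_bandwidth_gbs": (59.81, 261.32),
    "dram_latency_worst_ns": (181.54, 129.42),
    "dram_latency_best_ns": (131.97, 115.42),
}

# Ratio of Granite Rapids to Turin for each metric.
# For latencies, >1 means Granite Rapids is slower;
# for bandwidth, <1 means Granite Rapids is slower.
ratios = {name: gnr / turin for name, (gnr, turin) in measurements.items()}

for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
```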

MZ

Is there a list of Wide-GMI OPNs/SKUs available anywhere? I have spent quite a while searching for it, but so far I could not find any document that would explicitly say which Genoa/Turin CPUs are wide-GMI. Are some of the single socket parts wide?

Evan R.

I have not found any official data from AMD that indicates which SKUs are GMI-Wide. Since the I/O die on Turin has 16 GMI ports, it seems reasonable to guess that GMI-Wide is used whenever there are 8 or fewer CCDs, so that each CCD can get 2 GMI ports. There is a maximum of 8 Zen 5 P cores per CCD, so any SKU with more than 64 P cores cannot use GMI-Wide to each CCD, since those SKUs have more than 8 CCDs. There is a maximum of 32MB of L3 on Zen 5 P core CCDs. This implies that when

total L3/cores = 4MB, there are 8 cores per CCD and 8 CCDs have 64 cores;

total L3/cores = 5.33MB, there are 6 cores per CCD and 8 CCDs have 48 cores;

total L3/cores = 8MB, there are 4 cores per CCD and 8 CCDs have 32 cores;

total L3/cores = 10.67MB, there are 3 cores per CCD and 8 CCDs have 24 cores;

total L3/cores = 32MB, there is 1 core per CCD and 8 CCDs have 8 cores.

For each Zen 5 P core SKU, I divided the total L3 by the number of cores to put each SKU in one of the 5 categories above and then compared the number of cores to the number of cores for 8 CCDs. It turns out that all the Zen 5 EPYC processors with 64 or fewer cores have 8 or fewer CCDs and therefore are probably GMI-Wide, except the 9175F (16 cores, 512MB total L3, total L3/cores = 32MB, 1 core/CCD, 16 CCDs). This agrees with the number of CCDs in each SKU listed on the Wikipedia page for Zen 5.

The smallest Zen 5c processor has 96 cores. Zen 5c CCDs contain 1 CCX with 16 cores and 32MB of shared L3 for the whole CCD. By similar reasoning as used for Zen 5, two Zen 5c SKUs have 8 CCDs: 9745 (128 cores, 256MB total L3, total L3/cores = 2MB, 16 cores/CCD, 8 CCDs) and 9645 (96 cores, 256MB total L3, total L3/cores = 2.67MB, 12 cores/CCD, 8 CCDs).
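
The guesswork above can be written down as a rough Python sketch. It assumes, as the reasoning above does, that every CCD has its full 32MB of L3 enabled and that GMI-Wide means 2 of the 16 GMI ports per CCD. The function names are made up for illustration, and the three SKUs are the ones with core and L3 figures quoted above. The result is only a guess, not anything AMD has confirmed.

```python
L3_PER_CCD_MB = 32   # assumed: full 32MB L3 enabled on every CCD (Zen 5 and Zen 5c)
GMI_PORTS = 16       # GMI ports on the Turin I/O die, per the comment above


def infer_ccds(cores, total_l3_mb):
    """Guess (ccd_count, cores_per_ccd) from the core count and total L3."""
    ccds, leftover = divmod(total_l3_mb, L3_PER_CCD_MB)
    if ccds == 0 or leftover or cores % ccds:
        raise ValueError("figures do not fit the 32MB-per-CCD assumption")
    return ccds, cores // ccds


def maybe_gmi_wide(ccds):
    """True if every CCD could be given 2 GMI ports."""
    return 2 * ccds <= GMI_PORTS


# (cores, total L3 in MB) as quoted in the comment above.
skus = {"9175F": (16, 512), "9745": (128, 256), "9645": (96, 256)}

for name, (cores, l3) in skus.items():
    ccds, per_ccd = infer_ccds(cores, l3)
    print(name, ccds, "CCDs,", per_ccd, "cores/CCD, maybe GMI-Wide:", maybe_gmi_wide(ccds))
```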

I don't know if Zen 5c GMI-Wide SKUs allow a single core to use both GMI ports at the same time. Zen 5c is mainly for hyperscalers. Hyperscalers may prefer a single core to be limited to using one GMI port at a time so that the performance of virtual machines running on the same processor is more independent of each other.

If the Zen 6 P core CCD has 12 cores, there could be GMI-Wide with up to 8 x 12 = 96 P cores, assuming the I/O die is similar to what is used for Zen 5 EPYC (Turin).
