Zen 5's AVX-512 Frequency Behavior

Mar 1

Zen 5 is AMD's first core to use full-width AVX-512 datapaths.

16 Comments

I'm interested in how much a fairly light AVX-512 workload (say, a small but highly optimized loop that just runs for a couple of microseconds) affects core behavior. If I recall correctly, a common criticism of Skylake-X was that the core clock dropped sharply as soon as *any* AVX-512 instructions were executed, causing many developers to avoid AVX-512 altogether, since it left their code running slower than AVX2 code that should nominally have been the slower option.

The "rapid switching" graph seems to indicate that it shouldn't be nearly as big a deal on Zen 5, since the core at least seems to recover immediately when small-ish AVX-512 sequences end. It does also clearly show IPC throttling kicking in immediately, though that could of course just be due to measurement granularity. Do you think there might be a "maximum size" of AVX-512 workload below which the core would not throttle at all?
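To illustrate the granularity concern, here is a minimal sketch in Python. The IPC trace and window sizes are made-up assumptions, not data from the article: the hypothetical core keeps full IPC for 7 microseconds after the AVX-512 sequence starts and only then throttles.

```python
# Toy model of measurement granularity. All values are hypothetical.

def windowed_average(trace, window):
    """Average a per-microsecond trace over fixed-size sampling windows."""
    chunks = [trace[i:i + window] for i in range(0, len(trace), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]

# Assumed "true" IPC per microsecond: 7 us at full IPC, then 9 us throttled.
trace = [2.0] * 7 + [1.0] * 9

fine = windowed_average(trace, 1)     # 1 us sampling shows the 7 us delay
coarse = windowed_average(trace, 8)   # 8 us sampling hides it

print(fine[:8])   # [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0]
print(coarse)     # [1.875, 1.0]
```

With 8-microsecond samples the very first data point already sits below full IPC, so a delayed throttle can look immediate on a graph.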

Chester Lam

Mar 3

My interpretation is that only the sequence with AVX-512 instructions gets throttled. If you had an AVX-512 sequence that lasts for a few microseconds surrounded by lighter code that runs for orders of magnitude longer, that AVX-512 sequence would get throttled but everything else would run at full speed.

Because the AVX-512 sequence is so short, dropping a bit of performance there is preferable to running everything at a lower clock.
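As a rough worked example of that trade-off, here is a toy model in Python. Every number in it (durations, the 2x burst slowdown, the 10% clock reduction) is an illustrative assumption, not a measurement from the article.

```python
# Toy comparison of two throttling strategies. All numbers are assumptions.

def runtime_throttle_burst_only(light_us, burst_us, burst_slowdown):
    """Only the AVX-512 burst runs slower; surrounding code keeps full speed."""
    return light_us + burst_us * burst_slowdown

def runtime_lower_clock_everywhere(light_us, burst_us, clock_ratio):
    """The whole core runs at a reduced clock (clock_ratio < 1)."""
    return (light_us + burst_us) / clock_ratio

light_us = 1000.0  # lighter code, measured at full speed
burst_us = 5.0     # short AVX-512 sequence, measured at full speed

burst_only = runtime_throttle_burst_only(light_us, burst_us, burst_slowdown=2.0)
everywhere = runtime_lower_clock_everywhere(light_us, burst_us, clock_ratio=0.9)

print(f"throttle burst only:    {burst_only:.1f} us")   # 1010.0 us
print(f"lower clock everywhere: {everywhere:.1f} us")   # 1116.7 us
```

Even with the burst running at half speed, the total is about 1% longer, while a 10% lower clock for everything costs roughly 11%.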

Jan

May 7

Not to mention you don't get the 1:4 decrease typical of Skylake-X, which from memory was the worst thing for short AVX-512 instruction sequences, or even a single instruction. Wasn't it because Skylake-X would only activate a 128-bit datapath for 50.000 cycles in order to avoid self-immolation? I only read stuff from Alexander Yee when it was published, I didn't read anything by Travis Downs - I will gladly do so as soon as possible though. Shame he stopped publishing after 2021.

I also wonder how 11th Gen handled it. And if it's possible to get early 12th Gen with early BIOS where AVX-512 wasn't locked? Intel trying to destroy desktop AVX-512 for some reason is bonkers.

Fredrik Tolf

May 12

>Wasn't it because Skylake-X would only activate 128-bit datapath for 50.000 cycles in order to avoid self-immolation?

Aren't you thinking of Intel's infamous handling of mixed SSE/AVX code? If you mixed SSE and AVX instructions on XMM (128-bit) registers, the whole core would stall for tens of thousands of cycles to switch mode for the upper half of the 256-bit registers, in order to handle the SSE versions not clearing the upper halves. More info, e.g., at https://stackoverflow.com/questions/41819514/why-do-sse-instructions-preserve-the-upper-128-bit-of-the-ymm-registers

Adenilson Cavalcanti

Mar 6

Pretty awesome article, thanks a lot for sharing it!

A couple questions:

a) Is the same behavior expected in EPYC 5th gen processors?

b) I wonder what would be the results for Xeon 6th gen (e.g. Granite Rapids)?

c) Was the code used for the profiling published somewhere on GitHub?

Overall, excellent article. Pure class, as expected from you guys.

:-)

vic

Mar 1

That's colloquially called "clock stretching" behavior, when it slides in no-ops to prevent crashes from slamming into voltage droop (I believe, anyway).

Pierce Wiederecht

Mar 4

It’s literally only called that in the overclocking community, and the vast majority of them have no understanding whatsoever of why the “effective” clock speed is lower than the actual clock speed. They certainly don’t understand that the front end is firing blanks to keep the FPU from melting down lol.

vic

Mar 4

Wildly aggressive, aren't we! If only the first comment said something about "colloquially" and about being a belief instead of objective......

Peter W.

Mar 1

While Zen 5's ability to execute AVX512 often close to its regular clock speed is certainly impressive, the recurring comparison to SkylakeX is, IMHO, also a bit skewed. Both the manufacturing node and the architecture of Skylake are now several generations out of date. I am curious whether the relative throttling behavior of the new P-core Xeons when running AVX512 is more similar to what you saw here with the Zen 5 9900X, or whether they still suffer the significant drop in speed and long recovery time that SkylakeX did when executing your AVX512 tasks. Since those new Xeons (fabbed on Intel 3) are supposed to go head-to-head with the new EPYCs, I am sure I wouldn't be the only one who'd like to know. Maybe Intel or Supermicro can loan you a test setup 😄 to put through its paces?

And, thanks for continuing your deep dives - they are appreciated 👍🏻!

Y69

Mar 1

Articles like this one, with the investment made into them, are greatly appreciated.

However, yeah... comparing a 2024 processor (Zen 5) to a 2017 one (Skylake-X) should be done just for the sake of curiosity, not by treating them as equal contenders. I know it's easy to say, but a recent Xeon would do :)

Chester Lam

Mar 1

I didn't treat them as equal contenders. I just had Skylake-X data for perspective. I imagine recent Xeons would be just fine because they don't clock particularly high.

Thomas

Mar 1

The problem is that current-gen Intel consumer chips got rid of AVX512 support.

A head-to-head comparison of similar server SKUs would be interesting, though.

I suspect most Zen 5 server chips are running at low enough clock speeds in the first place that triggering the behavior analyzed here will be a lot trickier, if possible at all.

Peter W.

Mar 1

Some Zen 5 EPYCs are however specifically meant to run pretty fast, at the expense of more power use per core. There are applications that benefit from high core speed even more than from overall compute brawn.

Chester Lam

Mar 3

From a quick check on Wikipedia, there isn't a single Epyc SKU that boosts to the upper 5 GHz range. I'm guessing none will show that behavior.

Rob

Mar 10

While the EPYC can only do 5 GHz on one core and the Threadripper beat that by a bit, with all cores able to hit 4 GHz, I think we are interested in knowing if those chips can operate without the hiccup and downclock seen in the Ryzen.

Peter W.

Mar 4

That's entirely possible, you might well be right. I was thinking specifically of the versions with lower core counts but high TDP. I don't know if the following situation still exists, but there were a couple of applications that were (are?) licensed not by seats, but by cores or threads. For those, running fewer cores at high speed makes a lot of economic sense. From what I remember (from a long time ago), some of these licenses (e.g. Oracle!) could cost a lot more per year than even the most expensive Xeons did at the time.

Chips and Cheese
