Pretty awesome article, thanks a lot for sharing it!
A couple questions:
a) Is the same behavior expected in EPYC 5th gen processors?
b) I wonder what would be the results for Xeon 6th gen (e.g. Granite Rapids)?
c) Was the code used for the profiling published somewhere on GitHub?
Overall, excellent article. Pure class, as expected from you guys.
:-)
I'm interested in how much a fairly light AVX-512 workload (say, a small but highly optimized loop that just runs for a couple of microseconds) affects core behavior. If I recall correctly, a common criticism of Skylake-X was that the core downclocked as soon as *any* AVX-512 instructions were executed, causing many developers to just avoid AVX-512 completely since it left their code running actively worse than AVX2 code that should nominally be slower.
The "rapid switching" graph seems to indicate that it shouldn't be nearly as big of a deal on Zen 5, since the core at least seems to recover immediately when small-ish AVX-512 sequences end. It does also clearly show IPC throttling immediately, though this could of course just be due to measurement granularity. Do you think there might be a "maximum size" of AVX-512 workload below which the core would not throttle at all?
My interpretation is that the sequence with AVX-512 instructions gets throttled. If you had an AVX-512 sequence that lasts for a few microseconds surrounded by lighter code that runs for orders of magnitude longer, that AVX-512 sequence would get throttled but everything else would run at full speed.
Because the AVX-512 sequence is so short, dropping a bit of performance there is preferable to running everything at a lower clock.
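To put rough numbers on that tradeoff, here is a minimal back-of-the-envelope sketch. Every value in it is hypothetical and chosen only for illustration (the 5 µs burst, the 5000 µs of surrounding code, and the 1.25x slowdown factor are assumptions, not measurements from the article), and `total_time_us` is just a helper name made up for this example. It compares throttling only the AVX-512 burst against slowing everything down by the same factor.

```python
def total_time_us(avx_us, other_us, avx_slowdown, other_slowdown):
    """Wall time when each phase is stretched by its own slowdown factor."""
    return avx_us * avx_slowdown + other_us * other_slowdown


# Hypothetical inputs, for illustration only (not measured data).
AVX_US = 5.0        # AVX-512 burst duration at full speed
OTHER_US = 5000.0   # surrounding lighter code, orders of magnitude longer
SLOWDOWN = 1.25     # assumed penalty factor while throttled or downclocked

baseline = AVX_US + OTHER_US

# Policy A: only the AVX-512 burst is throttled.
throttle_burst_only = total_time_us(AVX_US, OTHER_US, SLOWDOWN, 1.0)

# Policy B: the whole run executes at the lower clock.
lower_clock_everywhere = total_time_us(AVX_US, OTHER_US, SLOWDOWN, SLOWDOWN)

for name, t in [("throttle burst only", throttle_burst_only),
                ("lower clock everywhere", lower_clock_everywhere)]:
    overhead_pct = (t / baseline - 1.0) * 100.0
    print(f"{name}: {t:.2f} us ({overhead_pct:.3f}% over baseline)")
```

With these made-up inputs, throttling only the burst costs a small fraction of a percent overall, while running everything at the lower clock costs the full 25%. The gap shrinks as the AVX-512 portion grows relative to the rest.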
That's colloquially called "clock stretching" behavior, when it slides in no-ops to prevent crashes from slamming into voltage droop (I believe, anyway).
It’s literally only called that in the overclocking community, and the vast majority of them have no understanding whatsoever of why the “effective” clock speed is lower than the actual clock speed. They certainly don’t understand that the front end is firing blanks to keep the FPU from melting down lol.
Wildly aggressive, aren't we! If only the first comment said something about "colloquially" and about being a belief instead of objective......
While Zen 5's ability to execute AVX512 often close to its regular clock speed is certainly impressive, the recurring comparison to SkylakeX is, IMHO, also a bit skewed. Both the manufacturing node and the architecture of Skylake are now several generations out of date. I am curious whether the relative throttling behavior of the new P-core Xeons when running AVX512 is more similar to what you saw here with the Zen 5 9900X, or whether they still suffer the significant drop in speed and long recovery time when executing your AVX512 tasks like SkylakeX did. Since those new Xeons (fabbed in Intel 3) are supposed to go head-to-head with the new EPYCs, I am sure I wouldn't be the only one who'd like to know. Maybe Intel or Supermicro can loan you a test setup 😄 to put through its paces?
And, thanks for continuing your deep dives - they are appreciated 👍🏻!
Articles like this one, with the investment made into them, are greatly appreciated.
However, yea... comparing a 2024 processor (Zen 5) to a 2017 one (Skylake-X) should be done just for the sake of curiosity, not treating them as equal contenders. I know it's easy to say, but a recent Xeon would do :)
I didn't treat them as equal contenders. Just had Skylake-X data for perspective. I imagine recent Xeons would be just fine because they don't clock particularly high.
The problem is that current-gen Intel consumer chips got rid of AVX512 support.
A head to head comparison for similar server SKUs would be interesting, though.
I suspect most Zen 5 server chips are running at a low enough clock speed in the first place that triggering the behavior analyzed here will be a lot trickier, if possible at all.
Some Zen 5 EPYCs are however specifically meant to run pretty fast, at the expense of more power use per core. There are applications that benefit from high core speed even more than from overall compute brawn.
From a quick check on Wikipedia, there isn't a single Epyc SKU that boosts to the upper 5 GHz range. I'm guessing none will show that behavior.
While the EPYC can only do 5 GHz on one core and the Threadripper beat that by a bit, with all cores able to hit 4 GHz, I think we are interested in knowing if those chips can operate without the hiccup and downclock seen in the Ryzen.
That's entirely possible, you might well be right. I thought specifically of the versions with lower core count but high TDP. I don't know if the situation that follows still exists, but there were a couple of applications that were (are?) licensed not by seats, but by cores or threads. For those, running fewer cores at high speed makes a lot of economic sense. From what I remember (from a long time ago), some of these licenses (e.g. Oracle!) could cost a lot more per year than even the most expensive Xeons did at the time.