Quite fascinating. Before Zen 5 launched, various sources, including articles here, seemed to at least vaguely imply that AMD's "two-ahead" branch predictor would be able to follow two branches per cycle even for a single thread, whereas post-launch it quickly became clear that that wasn't the case, and also that, as reiterated by this article, the op-cache only seems to be able to deliver six ops per cycle for one thread, which seems a bit at odds with the 8-wide renamer.
All taken together, I can't help but wonder if there wasn't something that turned out badly with Zen 5's front-end at a late stage, and they were forced to neuter it to prevent bugs. If true, and they manage to fix those problems with Zen 6, that could paint quite a positive picture for Zen 6 IPC improvements, not least coupled with the rumors that Zen 6 is using a new, lower-latency die-to-die interconnect (which they're already kind of using for Strix Halo, aren't they?).
Is there something intrinsic to video games that leads to low-IPC computations, or does low IPC simply follow from a lack of optimisation at the software development level?
Also, since Intel is backend latency constrained while AMD is front-end latency constrained, does that mean code needs to be optimised in different ways depending on which processor it will run on?
I assume games in general are pretty low IPC just because of how branch-y and control-flow dependent they are? I can imagine it's hard to have much ILP in game logic.
In my experience, most programs are kind of low-IPC "by default", and it's only really the ones with very regular instruction and data access patterns that achieve particularly high IPC. That's not a systematic and rigorous statement, just my experience from running `perf stat` on various different kinds of programs. Most "normal" programs in this sense generally seem to hit somewhere between 1-2 instructions per clock.
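If anyone wants to see the effect for themselves, here's a toy example (my own sketch, nothing from the article): two loops doing the same arithmetic, one as a serial dependency chain and one split across independent accumulators. `perf stat` reports the IPC gap directly.

```c
// ilp_demo.c -- toy example: same work, very different ILP.
// Build:  cc -O2 ilp_demo.c -o ilp_demo
// Run:    perf stat ./ilp_demo dep    (serial dependency chain, low IPC)
//         perf stat ./ilp_demo par    (independent accumulators, higher IPC)
#include <stdio.h>
#include <string.h>

#define N 400000000ULL

// Every step depends on the previous one: the core can only retire
// roughly one iteration per chain latency, so IPC stays low.
static unsigned long long dep_chain(void) {
    unsigned long long x = 1;
    for (unsigned long long i = 0; i < N; i++)
        x = x * 3 + 1;          // serial dependency through x
    return x;
}

// Four independent accumulators doing the same total work: the
// out-of-order core can overlap them, so IPC is several times higher.
static unsigned long long par_chains(void) {
    unsigned long long a = 1, b = 2, c = 3, d = 4;
    for (unsigned long long i = 0; i < N; i += 4) {
        a = a * 3 + 1;
        b = b * 3 + 1;
        c = c * 3 + 1;
        d = d * 3 + 1;
    }
    return a + b + c + d;
}

int main(int argc, char **argv) {
    unsigned long long r = (argc > 1 && strcmp(argv[1], "par") == 0)
                               ? par_chains()
                               : dep_chain();
    printf("%llu\n", r);        // keep the result live so nothing is optimized away
    return 0;
}
```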
There's a lot of scope for optimization in games; that's a fact.
If they're blowing out of instruction cache, one possible explanation could be too much inlining and loop-unrolling. In a sense, some of these games could actually be over-optimized.
Code size vs. straight-line speed is a very difficult tradeoff to make. Game programmers will nearly always prefer straight-line speed, even when there's not a lot to be gained by doing so.
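To make the tradeoff concrete, here's a rough sketch of the usual compromise (the function names are made up): keep the hot path tiny and force the rare, bulky path out of line so it doesn't sit in I-cache next to the hot code.

```c
// Illustrative only -- the function names here are hypothetical.
#include <stdio.h>

// GCC/Clang attributes: never inline this and place it with other
// rarely-executed ("cold") code, away from the hot path.
__attribute__((noinline, cold))
static void slow_path(int code) {
    fprintf(stderr, "rare error %d\n", code);   // stands in for a big error handler
}

static inline int fast_path(int x) {
    return x * 2 + 1;                           // small, always executed
}

int process(int x, int err) {
    if (__builtin_expect(err != 0, 0))          // hint: the error branch is unlikely
        slow_path(err);
    return fast_path(x);
}

int main(void) {
    printf("%d\n", process(20, 0));
    return 0;
}
```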
Thank you for another interesting write-up. It seems AMD has room to squeeze a little more performance out of its Zen 5.
Excellent write-up. Would love to see Factorio in future gaming benchmarks because it is unlike many other games due to its very high sensitivity to cache and relative lack of graphic intensity.
Do current branch predictors try to predict the confidence associated with the most likely outcome? If so, that could open the door to another dimension of optimizations, which is intelligently deciding whether to prefetch the less likely branch target.
Also, APX adds predication to many instructions, as a way to avoid cluttering up the branch predictor state. So, that could be another avenue where we might anticipate improvements on these sorts of low-IPC workloads.
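Compilers already do a narrow version of this with conditional moves. A toy sketch of the idea (my example, not the article's; APX would extend predication to far more instructions): the branchless form pays for a bit of extra work but never touches branch-predictor state.

```c
// Toy example: the same clamp written with a branch and branchlessly.
// Build with  cc -O2 -S select_demo.c  and compare the generated code.
#include <stdio.h>

// Branchy: performance depends on the predictor guessing right.
int clamp_branchy(int x, int limit) {
    if (x > limit)
        return limit;
    return x;
}

// Branchless: compute a mask and select with pure data flow,
// so there is no branch to predict (and no misprediction penalty).
int clamp_branchless(int x, int limit) {
    int m = -(x > limit);              // all-ones if x > limit, else 0
    return (limit & m) | (x & ~m);
}

int main(void) {
    printf("%d %d\n", clamp_branchy(7, 5), clamp_branchless(3, 5));
    return 0;
}
```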
I have not heard of that, besides schemes that use multiple predictors and a meta-predictor to track which sub-predictor has been doing better.
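For reference, a minimal toy model of what I mean (purely illustrative, not how any real core implements it): a bimodal predictor, a global-history predictor, and a meta-predictor of 2-bit counters that learns which one to trust per branch.

```c
// Toy tournament predictor sketch -- illustrative model only.
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TABLE 1024

static uint8_t bimodal[TABLE];   // 2-bit counters indexed by PC
static uint8_t gshare[TABLE];    // 2-bit counters indexed by PC ^ history
static uint8_t meta[TABLE];      // 2-bit counters: >= 2 means "trust gshare"
static uint32_t history;         // global taken/not-taken history

static void bump(uint8_t *c, bool up) {          // saturating 2-bit update
    if (up)  { if (*c < 3) (*c)++; }
    else     { if (*c > 0) (*c)--; }
}

bool predict(uint32_t pc) {
    uint32_t bi = pc % TABLE, gi = (pc ^ history) % TABLE;
    bool use_gshare = meta[bi] >= 2;
    return use_gshare ? gshare[gi] >= 2 : bimodal[bi] >= 2;
}

void update(uint32_t pc, bool taken) {
    uint32_t bi = pc % TABLE, gi = (pc ^ history) % TABLE;
    bool p_bi = bimodal[bi] >= 2, p_gs = gshare[gi] >= 2;
    // Train the meta-predictor only when the sub-predictors disagree.
    if (p_bi != p_gs)
        bump(&meta[bi], p_gs == taken);
    bump(&bimodal[bi], taken);
    bump(&gshare[gi], taken);
    history = (history << 1) | taken;
}

int main(void) {
    // Feed an alternating pattern at one PC; the history-based
    // sub-predictor should win it and the meta-predictor should notice.
    for (int i = 0; i < 100; i++)
        update(0x40, i & 1);
    printf("prediction for pc 0x40: %d\n", predict(0x40));
    return 0;
}
```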
Are you going to run the same IPC gaming tests on your 7950X3D? It'd be interesting to see the IPC limits given the big changes from Zen 4 to Zen 5.
If the L1 and L2 cache hit rates are already good, and the main limit to performance is low IPC, then how does the larger L3 cache on the X3D chips boost gaming performance? Given the popularity of X3D for gaming, it would be great to see a similar article explaining why X3D is so beneficial for gaming, but does little for most other workloads.
Thanks for the interesting article!
Low IPC just means the core doesn't manage to execute many instructions per cycle, which can have different root causes.
The benchmarks in this article show that the core is often frontend latency bound, meaning it is waiting on instructions. Since that can be caused by I-cache misses, I would guess a larger L3 reduces the average duration of a frontend stall.