28 Comments
c3dtops's avatar

Jeez... great piece of insightful work.

I guess a lot of time must have gone into running those tests.

I do have a question for Chester:

P-cores and E-cores use IDI to talk to the uncore, which starts with the ring bus on most Intel client designs.

Here, does "uncore" refer to everything from the shared L3 cache slices onwards: the ring-like interconnect on Arrow Lake, I/O, and the integrated memory controllers?

Chester Lam's avatar

On Meteor Lake and Arrow Lake I think Intel considers it a two-part uncore. There's the uncore on the CPU tile with the ring bus and L3, and that extends onto the SoC tile using the IDI protocol. Memory controllers I think are counted as part of the uncore, though I'm not sure about other components like the NPU or hardware video blocks.

Schrödinger's Cat's avatar

Also, I'd be curious to know more about how modern x86 CPUs implement nontemporal stores. Back when I first played with them on a Pentium 4, it seemed to me that they had the effect of restricting which L2 cache set could be used. So, they did cause some cache pollution, but it was limited in scope. Do modern x86 CPUs still do something similar, or do they truly bypass the cache hierarchy entirely?
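
For concreteness, by nontemporal stores I mean movnti/movntdq and friends. A minimal sketch using the SSE2 streaming-store intrinsics, purely to illustrate the kind of access I'm asking about (the function name and the 16-byte alignment assumption are mine):

    #include <emmintrin.h>  /* SSE2 streaming-store intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    /* Fill a 16-byte-aligned buffer with nontemporal (streaming) stores,
     * hinting that the written data should not displace existing cache lines. */
    void fill_nt(void *dst, uint64_t value, size_t bytes) {
        __m128i v = _mm_set1_epi64x((long long)value);
        __m128i *p = (__m128i *)dst;
        for (size_t i = 0; i < bytes / 16; i++)
            _mm_stream_si128(p + i, v);   /* compiles to movntdq */
        _mm_sfence();                     /* order the NT stores before later stores */
    }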

Schrödinger's Cat's avatar

Don't compilers default to "natural alignment", though? That means a 64-bit data type should have 64-bit alignment, for instance. In that case, you'd pretty much have to go out of your way to create one of these scenarios.

BTW, I recently ran across the LOCK prefix while attempting to validate Robert Hallock's claim that iBOT could lock cache lines (i.e. protect them from eviction). If Intel CPUs have any way to lock a cache line, I believe it's not documented.
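
To illustrate the "go out of your way" part: here's a deliberately contrived sketch (my own example, using the GCC/Clang __atomic builtins) that forces a LOCK-prefixed read-modify-write across a cache line boundary, i.e. a split lock:

    #include <stdint.h>

    int main(void) {
        /* A cache-line-aligned buffer spanning two 64-byte lines (C11 _Alignas). */
        _Alignas(64) static unsigned char buf[128];

        /* Deliberately misaligned: an 8-byte value starting at offset 60
         * straddles the boundary between the two lines. */
        uint64_t *split = (uint64_t *)(buf + 60);

        /* On x86 this becomes a LOCK-prefixed RMW (e.g. lock xadd). Because the
         * operand crosses a cache line, the lock can't be satisfied by holding
         * a single line: that's the split-lock scenario being discussed. */
        __atomic_fetch_add(split, 1, __ATOMIC_SEQ_CST);
        return 0;
    }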

Sunaabh Trivedi's avatar

This is what I was thinking too. I was under the impression that the CPU would just raise an exception on unaligned atomic accesses rather than jumping through all these hoops. This is definitely what happens on many Arm CPUs.

c3dtops's avatar

Hello - would you have the link to Robert's claim about cache line locking?

Schrödinger's Cat's avatar

“Did the cache miss happen because that data wasn’t tagged with priority, and it just got evicted? OK, that is a place where binary optimization could certainly help,” Intel’s Robert Hallock says. “We could just tag it and say, ‘hey, stay local, don’t evict me.’ Pretty powerful capability.”

https://www.tomshardware.com/pc-components/cpus/intels-binary-optimization-tool-tested-and-explained-how-the-ibot-translation-delivers-up-to-18-percent-faster-gaming-performance-8-percent-on-average

c3dtops's avatar

Tyvm, sir... have a nice weekend.

Random Verse's avatar

Wow, what I found most interesting here is the first picture, the spin lock aside. ARL P-core to P-core latency is twice that of the E-cores and the same as Zen 5 cross-CCD?!

What a massive fail on the P-core design.

Schrödinger's Cat's avatar

The P-cores in Arrow Lake have more levels of cache than the E-cores. I think that's a big part of it.

In Lion Cove, Intel added something they call L0 cache, but it walks, looks, and quacks like L1. It's the same 48 kiB as what they've been using for L1d since Sunny Cove (Ice Lake). Then, they added a 192 kiB cache that Chester dubbed L1.5. Finally, they enlarged the L2 cache from 2 MiB to 2.5 MiB. The attached L3 slice stayed the same 3 MiB that it's been since Willow Cove (Tiger Lake).

The Skymont E-cores have 32 kiB L1d, but then each cluster of 4 shares a 4 MiB L2 cache. This is clearly visible in that core-to-core latency chart. Then, each 4-core cluster also gets the same 3 MiB L3 slice as each P-core.

BTW, I've read that the way Intel CPUs decide which L3 slice to use is via address hashing. Just because a core or cluster has a slice of L3 attached to it doesn't mean that's the slice it'll use for a given access.

Anyway, the point I was driving towards is that having to do those additional cache lookups is surely responsible for some of the additional P-to-P and P-to-E latency. Also, larger caches have higher latency penalties. I think it's probably not the whole story, but perhaps the biggest part of it.

I find it an interesting contrast that Apple and Qualcomm are keeping their cache hierarchies shallower and wider, which is also similar to what Intel is doing with its E-cores.

Random Verse's avatar

Maybe. I haven't looked deep into uArch for many years.

What I recall is that it usually does not affect core-to-core latency, only fetch latency, due to the need to check more levels. Same for size, as you say.

But that is what I mean by a massive fail: your core is next to others on the same die.

If you have seen some IPC measurements, Darkmont is basically P-core performance but without the ability to scale to high clocks. For some workloads the cache will matter, but the P-core architects need to go on a diet fast. It is not like they are competing with Apple or even Qualcomm right now.

What they have been doing well for a while is the memory controller and probably CPU fetch parallelism. Not at the level Apple has been doing it, but much better than AMD.

Schrödinger's Cat's avatar

> What I recall is that it usually does not affect core-to-core latency, only fetch latency, due to the need to check more levels.

Core-to-core latency is a special case of fetch. The core has to check each level of its private cache hierarchy, as well as probably L3, before discovering that the address it wants is present in another core's private cache hierarchy. At that point, I think the cache line must then be flushed down through the second core's private cache hierarchy.
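
To make that concrete, core-to-core latency is usually measured with a ping-pong: one cache line bounced between two pinned threads, timing round trips. A rough Linux/pthreads sketch of the idea (not the article's actual benchmark; the core numbers are arbitrary):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000
    static _Alignas(64) atomic_int flag;   /* the one cache line being bounced */

    static void pin(int cpu) {             /* pin the calling thread to a core */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *pong(void *arg) {
        pin(*(int *)arg);
        for (int i = 0; i < ITERS; i++) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != 1) ;
            atomic_store_explicit(&flag, 0, memory_order_release);
        }
        return NULL;
    }

    int main(void) {
        int core_a = 0, core_b = 1;        /* pick any two cores to compare */
        pthread_t t;
        pthread_create(&t, NULL, pong, &core_b);
        pin(core_a);

        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ITERS; i++) {
            atomic_store_explicit(&flag, 1, memory_order_release);
            while (atomic_load_explicit(&flag, memory_order_acquire) != 0) ;
        }
        clock_gettime(CLOCK_MONOTONIC, &end);
        pthread_join(t, NULL);

        double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
        printf("~%.1f ns one-way (half the round trip)\n", ns / ITERS / 2);
        return 0;
    }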

> Darkmont is basically P-core performance

The article tested Arrow Lake, which has Skymont E-cores. Intel said those have comparable integer IPC to Raptor Cove, but I think their FPU isn't quite P-core tier.

> P-core architects need to go on a diet fast.

The P-cores are rumored to be discontinued after Razer Lake. Subsequent to that, Intel will supposedly use a unified microarchitecture based on the E-cores.

rcmz's avatar

Should atomic operations also be cache line aligned on GPUs?

I could not find any info related to this in the CUDA programming guide or Vulkan spec.

I suppose they don't care since GPU atomic operations are handled with specialized ALUs instead of locking cache lines. But I would sure like to get a confirmation on this :)

Schrödinger's Cat's avatar

Most architectures flat-out don't support split locks. x86 only got into this pickle because older CPUs allowed it, so they couldn't de-support it without breaking legacy compatibility.

Peter W.'s avatar

Wasn't that a key reason for Intel's iBOT and related optimizations? What Robert Hallock said when asked about it made sense to me. It would be interesting if Intel is, or will be, using AI to systematically identify legacy software that would benefit from, for example, keeping specific instructions resident in the cache, and then updating iBOT accordingly.

Also, if either of you (@Chester or @George) has the opportunity, I'd like to know whether AMD is working on something similar. Thanks!

Schrödinger's Cat's avatar

No, split locks are generally rare, in practice. That's why Linux was able to initially handle them by raising a SIGBUS, which terminates the process. It was only later that they had to walk that back, once people started complaining about it breaking somewhat niche apps that couldn't be fixed.

What Robert Hallock said might sound sensible, but I can't find that it has any basis in reality. If there's a way to say, ‘hey, stay local, don’t evict me’, I couldn't find it. Some CPU cores do have such a capability, but I've never seen it in x86. Of course, it might be an undocumented feature of Arrow Lake, but it could also be that he misremembered what someone had told him or was taking some liberties while paraphrasing one of the things it actually does (e.g. cache prefetching).

I'll refrain from saying anything further about iBOT, because I've read some things about it that suggested it doesn't work how I had thought. I think it would be a good subject for the Chips & Cheese team to dig into.

Tanj's avatar

Just trap them and fault the apps that use them. Don't elaborate the chip to support stupid, useless behavior.

Schrödinger's Cat's avatar

Linux tried this, but had to walk it back due to the breakage it caused in legacy non-OSS apps, where the user had no means to fix them. So, the current behavior of "trap and log" was the compromise.

Still, some of the apps where it happens are Windows games people play under WINE, and Linux's current default behavior is fairly disastrous for them.

You can find more about the history of Linux's split-lock handling on Phoronix.
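
From memory, so treat this as something to verify: the behavior is selectable at boot via the split_lock_detect= kernel parameter, roughly:

    split_lock_detect=off     # no detection
    split_lock_detect=warn    # the "trap and log" default: warn about the offending process
    split_lock_detect=fatal   # send SIGBUS, like the original handling did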

Tanj's avatar
Apr 10 (edited)

A less hostile response, which is probably entirely feasible today (a few days' work with the help of a codex), is to trap it, re-JIT it into properly aligned code, and run that instead. Like a Rosetta for bad code.

Schrödinger's Cat's avatar

You're talking about reorganizing data structures, which can precipitate highly non-local code changes. That seems like it'd exceed the capabilities of JIT-based recompilation frameworks.

It's hard enough for compilers to do pointer aliasing analysis in certain cases, when they have access to the source code. In this scenario, you need them to do something similar, without the source code or any assumptions about what language it was written in!

Tanj's avatar

I am not surprised, but I still think this has to be a hard line. Coherency is a pernicious expense, and anything you add to an architecture to appease misuse simply propagates into unexpected places and adds complexity. It is already complex enough for a CPU to maintain directories and other structures for well-behaved use of coherent operations. Playing legacy games which were badly written? Tough luck; go find an old PC. Don't add complexity to make it seem OK.

The costs of coherency are why GPUs beat CPUs. GPUs are not coherent by default; they are coherent only at deliberate places and only to limited effect. This liberated programmers to discover that many algorithms have vast freedom from Amdahl's law, which you would never know on a CPU due to the invisible coherency mechanisms adding a limit not inherent in the algorithm. I'm not saying CPUs should not be coherent, just pointing out the huge hidden costs of coherency, which is why I am intransigent about not appeasing misuse by making it worse. We need it to be as good as it can be, and old, bad code deserves to be put in a sandbox where it can be miserable without sharing the misery. Time to move on.

Schrödinger's Cat's avatar

Well, Linux's handling of it is configurable. So, you can compile your kernel to raise a SIGBUS, if you want.

But, for someone who wants or needs to use software they can't modify, CPUs and operating systems aren't going to spoil the main selling point of x86, which is backwards compatibility. No matter what Linux does, Intel and AMD are not going to drop hardware support for split locks, so it's a pointless battle to fight.

GPUs beat CPUs for lots of reasons. It's not just due to the memory ordering and consistency model. GPUs are also in-order (preferring not to spend precious die area on OoO execution or branch prediction), they run at lower clock speeds (which enables more complex pipeline stages), they have local memory that's much more efficient to access than CPU caches, and they have wider SIMD. They took a different evolutionary path than CPUs and optimized everything around a different class of computational workloads.

Tanj's avatar
Apr 11 (edited)

Yah, it is difficult. But it remains a possible competitive advantage. Every so often the urge to eliminate x86 legacy is considered, and this would be a good part of that. While backwards compatibility is often cited, what is often forgotten is that the folks trying to support old, bad software are a small fraction of new purchases. And they can buy old systems. Why should everyone else carry that burden? Convince the 95% of the market who do not have that need to buy a faster, cheaper system.

True, GPUs have many differences. But the reason they can scale Amdahl-lite while CPUs cannot is still the invisible drag of coherency. Lightening that load will always be a good move for CPUs.

Schrödinger's Cat's avatar

Intel tried to drop compatibility with some seemingly no-longer-used x86 ISA features, in a proposal they called x86S. After the formation of the x86 Ecosystem Advisory Group, which included AMD and several of both companies' big customers, they announced that the x86S proposal had been withdrawn.

The best explanation I've seen for this is probably what Linus, himself, wrote: https://www.realworldtech.com/forum/?threadid=221944&curpostid=221965

I think that serves as a demonstration of just how hard it is to drop backwards compatibility. It's a ball-and-chain that x86 is going to drag along for each and every one of its remaining days.

I actually agree with you that GPUs benefit from their weak memory consistency model (as well as having direct-mapped local memory, where coherency is a non-issue). However, it goes way beyond split locks. x86 is never getting away from TSO, which I think is a much more consequential issue than supporting split locks.

Tanj's avatar

Kind of sad to see Linus as an exponent of leaving legacy debt unpaid.

I've been doing this a lot longer than him. Sometimes you just have to repair the foundations, otherwise you just end up living in a hovel.

Compare x86 to M5: which cleaned house, and which would you rather use?

Choosing your future is not free of cost.

Peter W.'s avatar

Legacy apps are called "legacy" for a reason - it's really hard to just dump them outright. Apple was able to do so mainly because they fully controlled both hardware and OS, and had enough of a fiercely loyal customer base to make those two transitions viable.