Hello you fine Internet folks,
I was at Intel Tech Tour this year, where Intel talked about their upcoming Panther Lake and Clearwater Forest CPUs. I had a chance to sit down with Stephen Robinson, Lead Architect for x86 Cores at Intel, and talk about Intel's approach to Cougar Cove and Darkmont and what changed from the prior generations.
Hope y’all enjoy!
The transcript has been edited for readability and conciseness.
George Cozma: Hello you fine Internet folks! We’re here in Phoenix, Arizona at Intel Tech Tour 2025, where there’s been a number of disclosures about Panther Lake and Clearwater Forest. And joining me, to talk about the core architectures in both of those CPUs, is Stephen Robinson. What do you do at Intel?
Stephen Robinson: I am a CPU architect and I lead the architecture team for the x86 cores.
George Cozma: Awesome. So, diving straight in: we did do a recorded interview about Skymont. But going back to Skymont, what were the big changes from your previous architecture, Crestmont, moving into Skymont?
Stephen Robinson: Yeah, so Skymont, we did a lot. We wanted to build a wider, deeper machine so that we could sort of run more workloads. So kind of, “coverage” is one of the terms we use sometimes. If we can get more workloads running on an E core, then we can bring more efficiency to the whole platform.
So, you know, sometimes people ask, “Why are you adding IPC to an E core? You’re making it more expensive, right?” Well, actually, software runs better. So we made the out-of-order depth about 50% bigger, somewhere in that ballpark. We went from two load ports to three. We roughly doubled the vector hardware. So we had two FMAs in Crestmont; now we have four FMAs in Skymont. And then the front end, we went from a six-wide front end with two three-wide decode clusters to a nine-wide one with three clusters. And then eight-wide alloc and, you know, more branch prediction, a little bit more L2 bandwidth, the whole lot.
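To put the FMA change in concrete terms, here's a minimal sketch (our example, not Intel's) of the kind of loop that leans on those units: a compiler targeting FMA hardware turns each multiply-add below into a fused multiply-add, so going from two to four FMA pipes raises the peak throughput of code like this.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Dot product: every iteration is a multiply-add, which the compiler can
// vectorize onto the core's FMA units (e.g. -O2 -mfma -ffast-math).
// More FMA pipes means more of these can execute per cycle.
float dot(const std::vector<float>& a, const std::vector<float>& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}

int main() {
    std::vector<float> a(1024, 1.5f), b(1024, 2.0f);
    std::printf("%f\n", dot(a, b));
    return 0;
}
```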
George Cozma: So sort of an interesting quirk that I noticed about Skymont is that it has four store ports and three load ports. Why the four store ports? Usually you see more load ports than store ports. Why more store ports in this case?
Stephen Robinson: Yeah. So let’s break it down into address generation versus execution. So, when you have three load execution ports, you need three load address generators. And so that’s there. On the store side, we have four store address generation units, but we only sustain two stores into the data cache.
So we have a little bit of asymmetry on the store side. So you’re right. Why on earth do we have more store address units than store ports? The answer is, we have hazards between loads and stores. And sometimes loads get blocked on stores because we don’t know the store address because we’re all out of order. So by increasing the store address bandwidth, that reduces the latency to resolving unknown stores.
So basically we get performance by just spending more time and effort generating store addresses so that loads don’t end up blocking.
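To picture the hazard Robinson is describing, here's a small sketch (our illustration, not Intel code). Until the store's address is generated, the hardware can't tell whether the following load reads the same location, so the load either waits or speculates; faster store-address generation shrinks that window.

```cpp
#include <cstdio>

// Illustration of a load stuck behind an unresolved store address: if
// i == j, the load of a[j] must see the value just stored to a[i], and
// the core can't know whether they alias until the store's address
// (a + i) has come out of an address-generation unit.
int update(int* a, int i, int j) {
    a[i] = a[i] + 1;   // store: address known only after the AGU computes a + i
    int x = a[j];      // load: may alias a[i]; blocked (or speculated) until then
    return x * 2;
}

int main() {
    int buf[8] = {0};
    std::printf("%d\n", update(buf, 3, 3));  // aliasing case: the load depends on the store
    return 0;
}
```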
George Cozma: Awesome. So I know in Darkmont, something that caught my eye was the memory disambiguation. And that usually has to do with store-to-load forwarding. But that technique has been around for quite a while, so what have you enhanced in Darkmont that you would like to talk about?
Stephen Robinson: Sure, yeah. So you’re right. It’s about store-load connections. And there are several different ways people do it. You can have a big, big table that tells you when it’s safe to ignore stores. You can have a small table that tells you when it’s unsafe to ignore stores.
So you can kind of do it either way. And those two techniques end up kind of collapsing to the same answer, because the big table saturates and everything’s safe. And then so, you know, you discover the hazard. What we’ve done here is we’ve spent a little bit more time trying to have hardware that isn’t just sort of a history table, that actually figures out, before address generation, whether things are going to be connected.
So when we bring uops to the memory subsystem to address generation, we kind of look at some information and say, “Oh, I’m fairly confident that these loads and stores are connected.” So I’m not using a table, I’m using sort of the inherent information about the instructions, whether I think they’re going to be connected. And that gives us the ability to kind of slow down the load or something like that so that, you know, we know when the store’s gone now it should be safer to do the load that we think might be connected.
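For context, here's a toy model of the classic history-table approach Robinson contrasts against -- the "small table that tells you when it's unsafe to ignore stores." The class, table size, and thresholds are our assumptions for illustration; Darkmont, as he describes it, reasons about the instructions themselves before address generation rather than relying on a table like this.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy store-to-load dependence predictor indexed by a hash of the load's
// PC. A load predicted "unsafe" is held until older store addresses are
// known; otherwise it may issue speculatively ahead of them.
class DependencePredictor {
    std::array<std::uint8_t, 256> unsafe_{};  // small saturating counters, one per load-PC hash

    static std::size_t index(std::uint64_t load_pc) { return (load_pc >> 2) & 0xFF; }

public:
    // Consulted before the load issues.
    bool predict_unsafe(std::uint64_t load_pc) const {
        return unsafe_[index(load_pc)] >= 2;
    }

    // Trained when a speculative load turns out to have collided with an
    // older store (a memory-ordering violation that forces a flush).
    void train_collision(std::uint64_t load_pc) {
        auto& c = unsafe_[index(load_pc)];
        if (c < 3) ++c;
    }

    // Decay so loads that stop colliding become aggressive again.
    void train_no_collision(std::uint64_t load_pc) {
        auto& c = unsafe_[index(load_pc)];
        if (c > 0) --c;
    }
};
```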
George Cozma: Interesting. Now, I know on Cougar Cove, which is the P core in Panther Lake, that you also commented on the memory disambiguation. Is that similar in Cougar Cove?
Stephen Robinson: It’s similar in kind of concept, but it’s a bit different in implementation. So it’s a different method, but in the end, we’re still getting to where we’re trying to figure out when’s a good time to schedule the load relative to the store that we’re going to be dependent on. So, you know, I tell a story of two tables. Well, this is kind of another table. And again, we’re trying to say, “Okay, now I think it’s time to do the load because I think it’s going to be connected to the store.” So similar concept, different implementation.
George Cozma: Okay. And I guess sort of what drives the two different implementations? Like the reasoning behind the two different implementations, I should say.
Stephen Robinson: I would say it’s as simple as two teams working in parallel, doing independent research, solving localized problems, coming up with solutions. And then we end up with two similar but different implementations.
George Cozma: Okay.
Stephen Robinson: Two teams.
George Cozma: Cool. So sort of talking about Cougar Cove, a key change made in Lion Cove was the lack of SMT; SMT in Lunar Lake and Arrow Lake is no longer there. Could you have re-added it to Cougar Cove if you had wished? And why haven’t you added it back -- what would be the reason why you wouldn’t?
Stephen Robinson: Yeah. So let’s talk about client first, right, where this is where we’ve shipped products without SMT. When you have hybrid compute, SMT isn’t necessarily as valuable, right? So when you schedule something, you know, if you want performance, you schedule it on a P core. And then you schedule it on an E core. And then, once you’ve exhausted those, then you would come back and schedule a thread.
So in Alder Lake, Raptor Lake, that’s kind of how it works. So those are the threads on top of the dessert, right? In Lion Cove, in Lunar Lake and Arrow Lake, you know, we removed threads. We didn’t have threads implemented. Let me say it that way. And so that gave us -- we didn’t lose a lot in client because of hybrid and the core count. But we gained a bit in our design execution, so a little bit lower power because you don’t have the transistors and the logic to support SMT. A little bit smaller area because -- same reason. And it’s a little bit easier to achieve your frequency target. Because, you know, the old joke that SMT is a bit in the mux [multiplexer], right? So there’s truth to that. There’s a mux somewhere. And that causes delay. So now you’ve kind of got something that’s maybe a little bit easier and less expensive and maybe can go a little bit faster.
So when you’re doing Cougar Cove, you just take those basic premises and say, yeah, this is what I’m going to do for the next gen as well.
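To visualize the placement order Robinson describes for hybrid client parts -- idle P cores first, then idle E cores, and SMT siblings only as a last resort -- here's a toy priority function. The types and the ordering are our sketch, not Intel's or any OS scheduler's actual policy.

```cpp
#include <algorithm>
#include <vector>

enum class SlotKind { PCore, ECore, SmtSibling };

struct CpuSlot {
    int id;
    SlotKind kind;
    bool busy;
};

// Place a newly ready thread: prefer an idle P core, then an idle E core,
// and only then an SMT sibling of a busy P core. Returns the slot id, or
// -1 if everything is occupied.
int pick_slot(std::vector<CpuSlot>& slots) {
    const SlotKind order[] = {SlotKind::PCore, SlotKind::ECore, SlotKind::SmtSibling};
    for (SlotKind want : order) {
        auto it = std::find_if(slots.begin(), slots.end(),
                               [&](const CpuSlot& s) { return s.kind == want && !s.busy; });
        if (it != slots.end()) {
            it->busy = true;
            return it->id;
        }
    }
    return -1;
}
```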
George Cozma: And so on server, I know that there have been some data points that suggest that SMT does help. So what is sort of your opinion there?
Stephen Robinson: Yeah. So server is a little bit different than client. You know, people have talked about doing hybrid compute in servers, but nobody does it. And the simple explanation is: if you want to be hybrid in servers, you do it at the rack level, not inside an SOC. Why would I want asymmetry inside my SOC when I can have asymmetry across, you know, a 200-core server, another 200-core server, and I’ve got a bunch of those? So you have the choice. You know, Amazon and others have different instances that you can go and get. So what’s the value of different instances within one?
So first, there’s no hybrid in servers today in general. The second thing is, you know, the story I told about scheduling on P cores and E cores and then coming back to threads: well, if you don’t have the E cores, then you’re going to go to threads. Server workloads, and gaming workloads and others, right, they miss a lot. They can have long latency. And when you miss and you have long latency, you’ve got available hardware. So in the server area, there are more workloads that like threads. You know, take a networking workload. Those usually like threads because they’re moving a lot of data around and they’re exposing those latencies. So server workloads are a bit different, and without hybrid, SMT has more value.
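For a feel of why long-latency misses make SMT attractive on server, here's a pointer-chasing loop of the kind that shows up in networking and database code (our example, just to illustrate the access pattern). While one thread sits on a cache miss at `p = p->next`, an SMT sibling can keep the core's otherwise idle execution resources busy.

```cpp
#include <cstddef>

// Pointer chasing: each iteration's address depends on the previous load,
// so a last-level-cache miss leaves the core mostly idle for hundreds of
// cycles -- exactly the window an SMT sibling thread can fill.
struct Node {
    Node* next;
    long payload;
};

long chase(const Node* p, std::size_t hops) {
    long sum = 0;
    for (std::size_t i = 0; i < hops && p != nullptr; ++i) {
        sum += p->payload;  // a little computation...
        p = p->next;        // ...then a dependent load that may miss in cache
    }
    return sum;
}
```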
George Cozma: And actually, speaking of the difference between client and server, Darkmont is used in both Panther Lake and Clearwater Forest. What sort of differences do you have to make in a core between server and client in terms of stuff like RAS features? What differences are there in terms of implementation and what you have to design?
Stephen Robinson: Yeah. Great question. So in the client space, you can have RAS features, but they don’t quite have as much value because the client system is different, right? If I have hundreds of cores or thousands of cores, reliability becomes very, very important.
If I’m on my own little laptop and I have fewer, it’s a different concern, right? When Google Cloud goes down, everyone’s very upset.
George Cozma: Everything goes down.
Stephen Robinson: That’s right. Everyone’s very upset. So clearly, you know, the bar for reliability is there. So in a core, if we want to target server, there are additional features we’ll do. You know, ECC in the caches... so inside the core, we do add features. We can put that core in both if we want, right? So there aren’t a lot of physical differences between the cores in the two, but the environment is very different. So on the server side, maybe we have power gates per core, maybe we don’t. The power delivery is different. Because the power delivery is different, you may change the decision on when to power gate and when not to. And the power level is different. So maybe power gating isn’t as important in server because, you know, 24/7, I’m always running.
The other thing is there are things that can only truly work at the SOC level, because you need SOC components to be part of that. You know, take technology like SGX or TDX -- security, secure computing elements. If you don’t have the security and the controllers in your client part, then even if you implemented it inside the core, it doesn’t matter, because you need that whole system to do it. So there are a lot of things where maybe it’s in the core, but you can really only test it, run it, and productize it with the complete stack.
George Cozma: And speaking of sort of the differences between client and server, I know in Lunar Lake, you talked a lot about how there was some novel branch prediction stuff going on. Do you see that being helpful in server workloads or was that -- were those sort of improvements more targeted towards client?
Stephen Robinson: Everyone wants branch prediction. Honestly, everyone does. So in client, you know, it’s funny. Games. Are games similar to web servers?
George Cozma: Not really.
Stephen Robinson: Not really, right. But in terms of code footprints and paths and sizes, they’re more similar than you realize. Same kind of thing for databases. Databases are very large binaries.
George Cozma: Databases are actually close -- very similar to games in terms of what they like in a core.
Stephen Robinson: Exactly, right. So honestly, when it comes to branch prediction, we do it for everybody, right? We do it for client, we do it for server. And the things we do will be workload-specific sometimes in where you get the gains. But there’s always a workload in both client and server that will appreciate what you did.
George Cozma: So sort of evolving on that, is it such that potentially you could make a branch predictor that is more targeted for server workloads and/or client workloads? Or is it such that there isn’t really a difference there, so to speak?
Stephen Robinson: I would say that, internally within Intel, we tend to think that server wants more branch prediction, larger capacity, right? Bigger paths. Because we know that the workloads are complex -- and the binaries are large -- in server. But it really is in client as well, right? You know, it’s just -- which workloads are you looking at, right?
George Cozma: Exactly.
Stephen Robinson: You know. SPEC, okay, that’s different, obviously, right? But again, games and databases, yeah, they’re --
George Cozma: I would argue games and databases are closer to each other than SPEC is to either, in most cases.
Stephen Robinson: That can be true.
George Cozma: But of course, my final question here is, what’s your favorite type of cheese?
Stephen Robinson: Oof. I like a good smoked Gouda. But honestly, we’re doing blue cheese, Roquefort type things these days. Because, you know, a little musky.
George Cozma: I will admit, blue cheese is not my favorite. I had a really good cheddar from Washington. And, yeah, that was actually really good. It was a smoked cheddar. Which I’m not usually the biggest fan of.
Stephen Robinson: I do like the smoked cheeses. I really do.
George Cozma: Well, thank you so much.
Stephen Robinson: Of course.
George Cozma: So thank you so much for watching. If you like interviews like this, hit like, hit subscribe. Unfortunately I do have to say all that because it does help with the algorithm. And go check out the Substack, where there will be a written transcript of this. And, well, if you want to donate, PayPal and Patreon are down below. And have a good one, folks!