14 Comments
Jason Ross

I didn’t notice whose article I was reading. Saw the Chips and Cheese note at the end. Immediately went, “Oh, of course.”

Nobody else does analysis like this. Wow.

Peter W.

First, thanks Chester for another great article!

Regarding the RISC-V CPU design here: their main problem will be finding one or more customers to license it, as at-risk production is a very high risk indeed for any company that is not a hyperscaler. Hyperscalers have a captive audience for their designs, which is a key reason why Graviton made it. Of course, AWS owned and still owns Annapurna, and AWS was willing to invest in Graviton and push it at (initially) substantial discounts over x86 instances. That worked out really well for AWS.

In contrast, Ampere didn't have that very large in-house customer, and struggled to find a niche for their ARM-based designs. With a large RISC-V multicore CPU, the challenge is likely even greater, unless the CPU is adopted by a large player like Tencent or Alibaba, which are interested in RISC-V designs.

Schrödinger's Cat

> In contrast, Ampere didn't have that very large in-house customer

Ampere is a CPU company, while Condor is an IP company.

Furthermore, it seems to me Andes' niche is in the embedded market, where there's perhaps a need for a core with more horsepower than the rest of Andes' product offerings. This gives them more potential avenues than Ampere had.

Also, it seems Ampere started to suffer when they tried competing in a fairly mature market with an inferior product (i.e. AmpereOne). The RISC-V server market is quite new and we don't yet know how these cores will compare with other market entrants.

Of course, it's not a given the market for these cores will materialize, but I wouldn't count them out.

Chester Lam

Keep in mind Ampere did just fine with Altra and "inferior" cores. They made up for it by stuffing more cores into the same space/power budget. AmpereOne tries to push that concept further.

Schrödinger's Cat

My comment about inferior cores was in reference to AmpereOne.

In the first couple years of Altra, you're right that they did fine with Neoverse N1 cores and simply adding more of them to compensate for what they individually lacked. By the time AmpereOne finally reached the market, it seemed like that trick was no longer working for them. It probably didn't help that so many hyperscalers had started building their own CPUs.

Will

Bit of trivia: Ty Garibay was one of the architects of the Cyrix 4x86 and 6x86

theduck

Very nice very nice, highly detailed 👍

Schrödinger's Cat

> Schedulers can be expensive structures in conventional out-of-order CPUs, because they have to check whether instructions are ready to execute every cycle.

Don't any of these CPUs' schedulers know how to walk a data dependency graph? Each physical register could be augmented with a list of which reorder buffer entries depend on it. Each reorder buffer entry could have a counter that gets decremented as each dependency is satisfied. When the counter reaches zero, the entry is ready to run.

Sure, you'd have to manage those variable-length lists of each register's dependencies. But, that scales better than trying to search the reorder buffer for something that's ready to run. With an approach like this, the only cost of increasing your reorder buffer length (besides the cost of the reorder buffer, itself) is having to increase the number of bits needed to index within it. Importantly, the cost of searching it wouldn't change, because you're not actually searching it.

Sorry if this sounds naive. I'm obviously not a CPU designer.
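The counter scheme described above can be sketched in a few lines of (hypothetical) Python. This is just a software model of the idea as stated in the comment, not any real scheduler's logic; all names are made up.

```python
# Sketch of the proposed counter-based wakeup: each physical register keeps
# a list of dependent entries, and each entry keeps a counter of unmet
# dependencies. When a counter hits zero, the entry becomes ready.
from collections import defaultdict

class Entry:
    def __init__(self, name, n_srcs):
        self.name = name
        self.unmet = n_srcs           # count of unsatisfied source operands

dependents = defaultdict(list)        # phys reg -> entries waiting on it
completed = set()                     # phys regs already written back
ready = []                            # entries whose counters reached zero

def dispatch(name, srcs):
    """Allocate an entry, registering it on each still-pending source."""
    e = Entry(name, len(srcs))
    for r in srcs:
        if r in completed:
            e.unmet -= 1              # dependency already satisfied
        else:
            dependents[r].append(e)   # wait for this register's writeback
    if e.unmet == 0:
        ready.append(e)
    return e

def writeback(reg):
    """A result was produced: decrement the counters of its waiters."""
    completed.add(reg)
    for e in dependents.pop(reg, []):
        e.unmet -= 1
        if e.unmet == 0:              # last dependency satisfied
            ready.append(e)
```

Note this only touches the entries actually waiting on the written register, which is the scaling argument being made above.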

Chester Lam

Generally the scheduler only contains incomplete instructions, not everything in the ROB.

You don't want to walk a data dependency graph because that would introduce extra latency. What "regular" schedulers do is basically what you say. "as each dependency is satisfied" = compare register numbers to see if the result of a prior instruction satisfies a dependency. "counter that gets decremented..." would functionally be the same as a ready flag that gets set when all dependencies are satisfied.

But every cycle, you still need to check those ready flags. See https://github.com/XUANTIE-RV/openc910/blob/main/C910_RTL_FACTORY/gen_rtl/idu/rtl/ct_idu_is_aiq1.v and https://github.com/XUANTIE-RV/openc910/blob/main/C910_RTL_FACTORY/gen_rtl/idu/rtl/ct_idu_is_aiq1_entry.v for example (C910). Check the x_rdy (ready) signal
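A rough software analog of that per-cycle wakeup-and-select loop, to make the "check the ready flags every cycle" point concrete (illustrative names only; the C910 RTL linked above does this per entry in hardware with an x_rdy signal, not a Python loop):

```python
# Conventional issue-queue model: every cycle, broadcast result tags set
# per-source ready flags (wakeup), then selection scans all entries'
# overall ready state to pick something to issue (select).
class IQEntry:
    def __init__(self, op, srcs):
        self.op = op
        self.src_ready = {s: False for s in srcs}

def wakeup(queue, broadcast_tags):
    """Compare broadcast destination tags against every entry's sources."""
    for e in queue:
        for tag in broadcast_tags:
            if tag in e.src_ready:
                e.src_ready[tag] = True

def select(queue):
    """Scan each entry's ready flag, oldest-first; issue the first hit."""
    for e in queue:
        if all(e.src_ready.values()):  # the per-entry 'ready' check
            queue.remove(e)
            return e
    return None
```

The cost being discussed is that both loops run over every queue entry every cycle, which is why scheduler size is expensive in hardware.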

Schrödinger's Cat

> You don't want to walk a data dependency graph because that would introduce extra latency.

Yes, I can see that. What I was imagining is that each physical register write would trigger updates to the ROB entries of the instructions dependent on that register. While that whole process could add a little delay between each instruction retiring and its dependencies issuing, it does seem like this could happen concurrently with the first instruction issuing, given that pipeline latencies are usually deterministic. So, you can time the ROB updates such that no dependent instruction issues before its data dependency is ready.

> every cycle, you still need to check those ready flags.

I don't really see why. If a counter is used to track the number of unmet dependencies, then you know when it reaches zero and can then add it to a queue for the set of issue ports that could accept it.

Sorry, I know I'm probably too clueless to be opining on this stuff, but reading about Condor's design got me thinking about the problem and how I'd approach it (absent some of the constraints hardware designers probably contend with). Thank you for humoring my musings and for your excellent writeup.

Fredrik Tolf

>when it reaches zero and can then add it to a queue for the set of issue ports that could accept it

Not a CPU designer either, but I'd expect that kind of queue to be quite challenging to implement, given that it would have to accept an arbitrarily large number of instructions in a single cycle: you could easily have dozens of instructions waiting on the same dependency, and routing all of them into the issue queue in one cycle would seem to require a ridiculous amount of gating, which would also need to be powered. You'd basically need one "port" for every instruction in the "waiting-for-dependencies queue", and each of them would need to be able to multiplex into any of the "issue queue" slots; at that point you're powering such a port for every instruction every cycle anyway. (If the alternative is a more limited number of ports that leaves excess instructions behind, you'd still have to check every instruction in the "waiting-for-dependencies queue" every cycle anyway, to see if it's ready but didn't have room to be routed in a previous cycle.)
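The port-limit point can be put in numbers with a toy model (all figures here are illustrative, not from any real design): with P enqueue ports, N simultaneously-woken instructions take ceil(N/P) cycles to drain, and the leftovers have to be tracked and re-checked each cycle.

```python
# Toy model: only PORTS newly-ready entries can move into the issue queue
# per cycle; the rest are leftovers that must be re-examined next cycle.
PORTS = 4  # hypothetical number of enqueue ports

def drain_cycles(n_woken, ports=PORTS):
    """Cycles needed to route n_woken simultaneously-ready entries."""
    return -(-n_woken // ports)        # ceiling division

def drain_one_cycle(woken, issue_queue, ports=PORTS):
    """Move at most `ports` entries this cycle; return the leftovers."""
    issue_queue.extend(woken[:ports])
    return woken[ports:]
```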

Schrödinger's Cat

> it would have to be able to accept an arbitrarily large number of instructions in a single cycle

It seems to me like CPUs routinely deal with many practical limits on what you can do in one cycle. For instance, register bank conflicts limit how many instructions can issue. In the same spirit, queuing instructions dependent on a register value would have to be practically limited to some total number per cycle. I'm sure OoO schedulers have lots of such limitations.

> If the alternative would be to have a more limited number of ports and leave excessive instructions behind, you'd still have to check every instruction in the "waiting-for-dependencies queue"

The only time you'd put an instruction in the run queue is when its dependencies are satisfied, which you'd know because its unmet-dependency counter hits zero. That's fed by the process above: as prior instructions complete, their dependents are read off and have their counters decremented.

The point, in all of this, is that nothing should be walking the ROB to search for work to do. Therefore, you don't need any sort of windowing approach, like what this core uses.

Chester Lam

Usually, CPUs don't have register bank conflicts. They use multiported register files with enough RF ports to feed the execution ports except in very limited situations (like Zen 1/2, where one FMA pipe shares a port with a FADD pipe). They really want to issue each instruction as fast as possible without unnecessary delay. Any delay along a critical path would be bad for performance.

Instead of putting an instruction in a run queue, why not execute it? Seems like the run queue creates a harder problem in front of the schedulers. The scheduler would still have to check which micro-ops are ready to decide what to put into the run queue.

And again, no one walks the ROB to search for ready instructions, on either conventional OoO designs or Cuzco. Cuzco walks a time resource matrix to see when execution resources will be available, not whether micro-ops are ready.

Schrödinger's Cat

Thanks for another reply. I respectfully yield, as I'm clearly out of my depth, here.
