Condor Computing, a subsidiary of Andes Technology that creates licensable RISC-V cores, has a business model with parallels to Arm (the company) and SiFive.
I didn’t notice whose article I was reading. Saw the Chips and Cheese note at the end. Immediately went, “Oh, of course.”
Nobody else does analysis like this. Wow.
First, thanks Chester for another great article!
Regarding the RISC-V CPU design here: their main problem will be finding one or more customers to license it, as at-risk production is a very high risk indeed for any company that is not a hyperscaler. Hyperscalers have a captive audience for their designs, which is a key reason why Graviton made it. Of course, AWS owned and still owns Annapurna, and it was willing to invest in Graviton and push it at (initially) substantial discounts over x86 instances. That worked out really well for AWS.
In contrast, Ampere didn't have that very large in-house customer, and struggled to find a niche for their Arm-based designs. With a large RISC-V multicore CPU, the challenge is likely even greater, unless the CPU is adopted by a large player like Tencent or Alibaba, both of which are interested in RISC-V designs.
> In contrast, Ampere didn't have that very large in-house customer
Ampere is a CPU company, while Condor is an IP company.
Furthermore, it seems to me Andes' niche is the embedded market, where there is perhaps a need for a core with more horsepower than the rest of Andes' product lineup offers. This gives them more potential avenues than Ampere had.
Also, it seems Ampere started to suffer when they tried competing in a fairly mature market with an inferior product (i.e. AmpereOne). The RISC-V server market is quite new and we don't yet know how these cores will compare with other market entrants.
Of course, it's not a given the market for these cores will materialize, but I wouldn't count them out.
Keep in mind Ampere did just fine with Altra and "inferior" cores. They made up for it by stuffing more cores into the same space/power budget. AmpereOne tries to push that concept further.
My comment about inferior cores was in reference to AmpereOne.
In the first couple years of Altra, you're right that they did fine with Neoverse N1 cores and simply adding more of them to compensate for what they individually lacked. By the time AmpereOne finally reached the market, it seemed like that trick was no longer working for them. It probably didn't help that so many hyperscalers had started building their own CPUs.
Bit of trivia: Ty Garibay was one of the architects of the Cyrix 4x86 and 6x86
Very nice very nice, highly detailed 👍
> Schedulers can be expensive structures in conventional out-of-order CPUs,
> because they have to check whether instructions are ready to execute every cycle.
Don't any of these CPUs' schedulers know how to walk a data dependency graph? Each physical register could be augmented with a list of which reorder buffer entries depend on it. Each reorder buffer entry could have a counter that gets decremented as each dependency is satisfied. When the counter reaches zero, the entry is ready to run.
Sure, you'd have to manage those variable-length lists of each register's dependencies. But, that scales better than trying to search the reorder buffer for something that's ready to run. With an approach like this, the only cost of increasing your reorder buffer length (besides the cost of the reorder buffer, itself) is having to increase the number of bits needed to index within it. Importantly, the cost of searching it wouldn't change, because you're not actually searching it.
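To make that concrete, here's a toy software model of what I'm picturing (purely illustrative, made-up names, obviously nothing like real RTL):

```python
# Toy model of the per-register dependent lists + unmet-dependency counters
# idea (illustrative software only, not real scheduler hardware).
from collections import defaultdict

class RobEntry:
    def __init__(self, idx, srcs):
        self.idx = idx      # reorder buffer index
        self.srcs = srcs    # source physical registers
        self.unmet = 0      # count of sources not yet produced

reg_ready = defaultdict(bool)   # physical register -> value available?
dependents = defaultdict(list)  # physical register -> waiting ROB entries
ready_queue = []                # entries eligible to issue

def dispatch(entry):
    """On dispatch: count unmet sources and link the entry to each of them."""
    for r in entry.srcs:
        if not reg_ready[r]:
            entry.unmet += 1
            dependents[r].append(entry)
    if entry.unmet == 0:
        ready_queue.append(entry)

def writeback(reg):
    """On writeback: wake dependents by decrementing their counters."""
    reg_ready[reg] = True
    for e in dependents.pop(reg, []):
        e.unmet -= 1
        if e.unmet == 0:
            ready_queue.append(e)   # becomes eligible without any searching

# r3 = r1 + r2 ; r4 = r3 * r1
reg_ready[1] = reg_ready[2] = True
add, mul = RobEntry(0, [1, 2]), RobEntry(1, [3, 1])
dispatch(add); dispatch(mul)
print([e.idx for e in ready_queue])  # [0]  -> only the add is ready
writeback(3)                         # the add's result arrives
print([e.idx for e in ready_queue])  # [0, 1] -> the mul wakes up
```

Nothing here ever scans the whole buffer; work only happens at dispatch and at writeback.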
Sorry if this sounds naive. I'm obviously not a CPU designer.
Generally the scheduler only contains incomplete instructions, not everything in the ROB.
You don't want to walk a data dependency graph because that would introduce extra latency. What "regular" schedulers do is basically what you say: "as each dependency is satisfied" = compare register numbers to see if the result of a prior instruction satisfies a dependency, and the "counter that gets decremented..." would functionally be the same as a ready flag that gets set when all dependencies are satisfied.
But every cycle, you still need to check those ready flags. See https://github.com/XUANTIE-RV/openc910/blob/main/C910_RTL_FACTORY/gen_rtl/idu/rtl/ct_idu_is_aiq1.v and https://github.com/XUANTIE-RV/openc910/blob/main/C910_RTL_FACTORY/gen_rtl/idu/rtl/ct_idu_is_aiq1_entry.v for an example (C910). Check the x_rdy (ready) signal.
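As a rough software analogy of that compare-and-set process (heavily simplified, not the C910's actual logic, just the shape of it):

```python
# Conventional tag-compare wakeup: every valid scheduler entry compares its
# not-yet-ready source tags against the destination tags broadcast by
# completing instructions each cycle.
def wakeup(entries, broadcast_tags):
    for e in entries:
        for i, tag in enumerate(e["src_tags"]):
            if not e["src_rdy"][i] and tag in broadcast_tags:
                e["src_rdy"][i] = True

def select(entries):
    # The x_rdy-style check: an entry is ready once all its sources are.
    return [e for e in entries if all(e["src_rdy"])]

entries = [
    {"op": "add", "src_tags": [1, 2], "src_rdy": [True, True]},
    {"op": "mul", "src_tags": [3, 1], "src_rdy": [False, True]},
]
print([e["op"] for e in select(entries)])  # ['add']
wakeup(entries, broadcast_tags={3})        # the add's destination tag broadcasts
print([e["op"] for e in select(entries)])  # ['add', 'mul'] (issued entries would
                                           # normally be removed; omitted here)
```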
> You don't want to walk a data dependency graph because that would introduce extra latency.
Yes, I can see that. What I was imagining is that each physical register write would trigger updates to the ROB entries of the instructions dependent on that register. While that whole process could add a little delay between each instruction retiring and its dependents issuing, it does seem like this could happen concurrently with the first instruction issuing, given that pipeline latencies are usually deterministic. So you could time the ROB updates such that no dependent issues before its data dependency is ready.
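In toy form, the timing idea I have in mind is something like this (a made-up model that ignores variable-latency ops like loads):

```python
# Latency-timed wakeup: the producer schedules its dependents' updates for a
# known future cycle at issue time, rather than waiting for the value itself.
import heapq

LATENCY = {"add": 1, "mul": 3}
pending = []  # min-heap of (cycle at which the result becomes usable, register)

def issue(op, dst_reg, now):
    heapq.heappush(pending, (now + LATENCY[op], dst_reg))

def tick(now):
    # Registers whose producers finish this cycle wake their dependents,
    # concurrently with whatever else is issuing.
    woken = []
    while pending and pending[0][0] <= now:
        woken.append(heapq.heappop(pending)[1])
    return woken

issue("mul", dst_reg=3, now=0)
for cycle in range(5):
    print(cycle, tick(cycle))  # dependents of r3 get woken at cycle 3
```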
> every cycle, you still need to check those ready flags.
I don't really see why. If a counter is used to track the number of unmet dependencies, then you know when it reaches zero and can then add it to a queue for the set of issue ports that could accept it.
Sorry, I know I'm probably too clueless to be opining on this stuff, but reading about Condor's design got me thinking about the problem and how I'd approach it (absent some of the constraints hardware designers probably contend with). Thank you for humoring my musings and for your excellent writeup.
> when it reaches zero and can then add it to a queue for the set of issue ports that could accept it
Not a CPU designer either, but I'd expect that kind of queue to be quite challenging to implement, given that it would have to be able to accept an arbitrarily large number of instructions in a single cycle: you could easily have dozens of instructions waiting on the same dependency, and routing all of them to the issue queue in one cycle seems like it would require a ridiculous amount of gating, which would also need to be powered. You'd basically need one "port" for every instruction in the "waiting-for-dependencies queue", and each of them would need to be able to multiplex into any of the "issue queue" slots, and at that point you're powering such a port for every instruction every cycle anyway. (If the alternative would be to have a more limited number of ports and leave excessive instructions behind, you'd still have to check every instruction in the "waiting-for-dependencies queue" every cycle anyway to see whether it's ready but didn't have room to be routed on a previous cycle.)
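To put some made-up numbers on that last point:

```python
# If only WAKEUP_PORTS waiters can be routed to the ready queue per cycle,
# the leftovers have to be looked at again on later cycles anyway.
WAKEUP_PORTS = 4

def drain(waiters, ready_queue):
    ready_queue.extend(waiters[:WAKEUP_PORTS])  # what the routing can handle
    return waiters[WAKEUP_PORTS:]               # stragglers, re-checked later

waiters = [f"uop{i}" for i in range(10)]  # ten uops all waiting on one register
ready_queue, cycles = [], 0
while waiters:
    waiters = drain(waiters, ready_queue)
    cycles += 1
print(cycles, len(ready_queue))  # 3 cycles to move all ten across
```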
> it would have to be able to accept an arbitrarily large number of instructions in a single cycle
It seems to me like CPUs routinely deal with many practical limits on what you can do in one cycle. For instance, register bank conflicts limit how many instructions can issue. In the same spirit, queuing the instructions dependent on a register value would have to be practically limited to some number per cycle. I'm sure OoO schedulers have lots of such limitations.
> If the alternative would be to have a more limited number of ports and leave excessive instructions behind, you'd still have to check every instruction in the "waiting-for-dependencies queue"
The only time you'd put an instruction in the run queue is when its dependencies are satisfied, which you'd know because its unmet-dependency counter goes to zero. That's fed by the process above: when a prior instruction completes, you read off the list of instructions that depend on its result and decrement their counters.
The point, in all of this, is that nothing should be walking the ROB to search for work to do. Therefore, you don't need any sort of windowing approach, like what this core uses.
Usually, CPUs don't have register bank conflicts. They use multiported register files with enough RF ports to feed the execution ports except in very limited situations (like Zen 1/2 and one FMA pipe sharing a port with a FADD pipe). They really want to issue each instruction as fast as possible without unnecessary delay. Any delay along a critical path would be bad for performance.
Instead of putting an instruction in a run queue, why not execute it? Seems like the run queue creates a harder problem in front of the schedulers. The scheduler would still have to check which micro-ops are ready to decide what to put into the run queue.
And again no one walks the ROB to search for ready instructions, either on conventional OoO or Cuzco. Cuzco walks a time resource matrix to see when execution resources will be available, not whether micro-ops are ready.
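As a loose sketch of the time-resource-matrix idea (invented names and sizes, not Condor's actual implementation): rows are future cycles, columns are execution pipes, and a micro-op reserves the first free slot for its pipe at or after the cycle its operands are predicted to be ready.

```python
# Loose sketch of a time-resource matrix: reservation of execution resources
# happens at dispatch, based on predicted operand timing, not readiness checks.
DEPTH, PIPES = 8, ("alu0", "alu1", "mul")
matrix = [{p: False for p in PIPES} for _ in range(DEPTH)]  # False = pipe free

def reserve(pipe, operands_ready_cycle):
    for cycle in range(operands_ready_cycle, DEPTH):
        if not matrix[cycle][pipe]:   # walk forward until the pipe is free
            matrix[cycle][pipe] = True
            return cycle              # issue time fixed at dispatch
    return None                       # no slot inside the window -> stall

print(reserve("mul", operands_ready_cycle=3))  # 3
print(reserve("mul", operands_ready_cycle=3))  # 4: that pipe is already booked
```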
Thanks for another reply. I respectfully yield, as I'm clearly out of my depth, here.