Condor Computing, a subsidiary of Andes Technology that creates licensable RISC-V cores, has a business model with parallels to Arm (the company) and SiFive.
> Schedulers can be expensive structures in conventional out-of-order CPUs,
> because they have to check whether instructions are ready to execute every cycle.
Don't any of these CPUs' schedulers know how to walk a data dependency graph? Each physical register could be augmented with a list of which reorder buffer entries depend on it. Each reorder buffer entry could have a counter that gets decremented as each dependency is satisfied. When the counter reaches zero, the entry is ready to run.
Sure, you'd have to manage those variable-length lists of each register's dependencies. But, that scales better than trying to search the reorder buffer for something that's ready to run. With an approach like this, the only cost of increasing your reorder buffer length (besides the cost of the reorder buffer, itself) is having to increase the number of bits needed to index within it. Importantly, the cost of searching it wouldn't change, because you're not actually searching it.
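To make the idea concrete, here's a toy software model of it (all names and structures here are my own invention for illustration, not taken from any real core):

```python
from collections import defaultdict, deque

class GraphScheduler:
    """Toy model of the proposed wakeup-list scheme: each physical
    register keeps a list of waiting ROB entries, and each entry
    keeps a counter of unmet dependencies."""

    def __init__(self):
        self.dependents = defaultdict(list)  # phys reg -> ROB indices waiting on it
        self.unmet = {}                      # ROB index -> unmet-dependency counter
        self.ready = deque()                 # entries whose counter hit zero

    def dispatch(self, rob_idx, pending_sources):
        # pending_sources: physical registers this instruction still waits on
        self.unmet[rob_idx] = len(pending_sources)
        for reg in pending_sources:
            self.dependents[reg].append(rob_idx)
        if not pending_sources:
            self.ready.append(rob_idx)

    def writeback(self, phys_reg):
        # A register write walks only its own dependent list;
        # nothing ever scans the whole ROB for ready entries.
        for rob_idx in self.dependents.pop(phys_reg, []):
            self.unmet[rob_idx] -= 1
            if self.unmet[rob_idx] == 0:
                self.ready.append(rob_idx)
```

In this model, the work done per writeback is proportional to the number of dependents of that one register, independent of ROB size.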
Sorry if this sounds naive. I'm obviously not a CPU designer.
Generally the scheduler only contains incomplete instructions, not everything in the ROB.
You don't want to walk a data dependency graph because that would introduce extra latency. What "regular" schedulers do is basically what you say: "as each dependency is satisfied" = compare register numbers to see if the result of a prior instruction satisfies a dependency, and a "counter that gets decremented..." would functionally be the same as a ready flag that gets set when all dependencies are satisfied.
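As a rough illustration of that tag-comparison wakeup (a toy model; real schedulers do these comparisons in parallel CAM hardware, not loops):

```python
# Conventional broadcast wakeup: every cycle, completing instructions
# broadcast their destination register tags, and every scheduler entry
# compares them against the source tags it's still waiting on.
# (Names invented for illustration.)

class Entry:
    def __init__(self, rob_idx, src_tags):
        self.rob_idx = rob_idx
        self.waiting = set(src_tags)  # source tags not yet produced

    @property
    def ready(self):
        return not self.waiting  # "ready flag": all dependencies satisfied

def wakeup(entries, broadcast_tags):
    """One cycle of wakeup: every entry checks every broadcast tag."""
    for e in entries:
        e.waiting -= set(broadcast_tags)

def select(entries):
    """Pick ready entries (oldest-first in this sketch) to issue."""
    return [e.rob_idx for e in entries if e.ready]
```

The cost driver in hardware is that `wakeup` is effectively entries × broadcast ports comparators, all active every cycle.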
But every cycle, you still need to check those ready flags. See https://github.com/XUANTIE-RV/openc910/blob/main/C910_RTL_FACTORY/gen_rtl/idu/rtl/ct_idu_is_aiq1.v and https://github.com/XUANTIE-RV/openc910/blob/main/C910_RTL_FACTORY/gen_rtl/idu/rtl/ct_idu_is_aiq1_entry.v for an example (C910); check the x_rdy (ready) signal.
> You don't want to walk a data dependency graph because that would introduce extra latency.
Yes, I can see that. What I was imagining is that each physical register write would trigger updates to the ROB entries of the instructions dependent on that register. While that whole process could add a little delay between an instruction retiring and its dependents issuing, it does seem like this could happen concurrently with the first instruction executing, given that pipeline latencies are usually deterministic. So, you could time the ROB updates so that no dependent issues before its data dependency is ready.
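A toy model of that timing argument (my own sketch, assuming fully deterministic latencies and ignoring port contention, which a real design couldn't):

```python
import heapq

def schedule(instrs):
    """instrs: list of (rob_idx, latency, source_rob_idxs).
    Returns the cycle each instruction issues. Because latencies are
    deterministic, each wakeup can be timed at issue so a dependent
    never issues before its producer's result is available."""
    unmet = {i: len(srcs) for i, _, srcs in instrs}
    lat = {i: l for i, l, _ in instrs}
    deps = {i: [] for i, _, _ in instrs}
    earliest = {i: 0 for i, _, _ in instrs}  # earliest legal issue cycle
    for i, _, srcs in instrs:
        for s in srcs:
            deps[s].append(i)
    issue = {}
    heap = [(0, i) for i, _, srcs in instrs if not srcs]
    heapq.heapify(heap)
    while heap:
        cycle, i = heapq.heappop(heap)
        issue[i] = cycle
        done = cycle + lat[i]  # result available at this known cycle
        for d in deps[i]:
            earliest[d] = max(earliest[d], done)
            unmet[d] -= 1
            if unmet[d] == 0:
                heapq.heappush(heap, (earliest[d], d))
    return issue
```

For example, with a 3-cycle producer and a 1-cycle producer feeding one consumer, the consumer's wakeup is timed to cycle 3, the later of the two completion times.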
> every cycle, you still need to check those ready flags.
I don't really see why. If a counter is used to track the number of unmet dependencies, then you know when it reaches zero and can add the entry to a queue for the set of issue ports that could accept it.
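For concreteness, a sketch of that counter-plus-queue arrangement (invented names; one ready queue per issue-port class, so issue just pops from a queue instead of scanning flags):

```python
from collections import deque

class CounterScheduler:
    """Toy model: an entry is pushed onto its port class's ready
    queue the moment its unmet-dependency counter reaches zero."""

    def __init__(self, port_classes):
        self.queues = {pc: deque() for pc in port_classes}
        self.unmet = {}    # ROB index -> unmet-dependency counter
        self.port_of = {}  # ROB index -> port class that can run it

    def add(self, rob_idx, n_deps, port_class):
        self.unmet[rob_idx] = n_deps
        self.port_of[rob_idx] = port_class
        if n_deps == 0:
            self.queues[port_class].append(rob_idx)

    def satisfy(self, rob_idx):
        # One dependency of rob_idx has been resolved.
        self.unmet[rob_idx] -= 1
        if self.unmet[rob_idx] == 0:
            self.queues[self.port_of[rob_idx]].append(rob_idx)

    def issue(self, port_class):
        # No scan of ready flags: just pop the queue, if non-empty.
        q = self.queues[port_class]
        return q.popleft() if q else None
```

One trade-off this sketch glosses over: a strict FIFO per port class can't prefer the oldest ready instruction across classes, which real select logic often wants.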
Sorry, I know I'm probably too clueless to be opining on this stuff, but reading about Condor's design got me thinking about the problem and how I'd approach it (absent some of the constraints hardware designers probably contend with). Thank you for humoring my musings and for your excellent writeup.