To be sure, RISC-V is already used in various settings, such as controllers. But, I agree, it needs at least one proof of concept in actual silicon that it can go toe-to-toe with, for example, ARM's Neoverse cores. AFAIK, that PoC in hardware is yet to be delivered.
I tried https://github.com/clamchowder/Microbenchmarks/blob/master/CoherencyLatency/PThreadsCoherencyLatency.c and https://github.com/nviennot/core-to-core-latency , but unfortunately their values cannot be cross-checked against each other at all -- on my EPYC 7K62 system, the Rust tool reports latency from logical CPU 0 to logical CPUs [1,7] as {37,36,35,101,100,100,99} and the C tool reports {56.4,56.3,57.7,436.9,442.5,440.3,445.1}. When I test on the EIC7700 (Pine64 StarPro64), my result from the Rust tool is also around 100ns (a little smaller when the board is clocked @ 1.8G -- the SiFive board cannot reach 1.8G, so I run it with the cpufreq scaling max freq set to 1.4G).
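For readers unfamiliar with what such tools measure, here is a minimal sketch of a core-to-core "ping-pong" test in C, assuming Linux with glibc-style pthreads. It is not the code of either tool linked above; the names `pingpong`, `pin_to_cpu`, `responder` and `one_way_latency_ns` are made up for illustration. Two threads pinned to different logical CPUs bounce one shared variable back and forth, and the average round-trip time is halved. Choices such as which atomic operation is used, or whether a round trip is halved at all, change the absolute values, so numbers from different tools are not necessarily comparable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical shared state for the sketch. */
struct pingpong {
    _Atomic uint64_t flag;  /* the value bounced between the two cores */
    uint64_t round_trips;   /* how many round trips to time */
    int cpu_b;              /* logical CPU for the responder thread */
};

/* Best-effort pinning of the calling thread; returns 0 on success. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

/* Responder: waits for the odd value 2i+1 and answers with 2i+2. */
static void *responder(void *arg)
{
    struct pingpong *pp = arg;
    pin_to_cpu(pp->cpu_b); /* failure is ignored: the thread then runs unpinned */
    for (uint64_t i = 0; i < pp->round_trips; i++) {
        while (atomic_load_explicit(&pp->flag, memory_order_acquire) != 2 * i + 1)
            ; /* spin */
        atomic_store_explicit(&pp->flag, 2 * i + 2, memory_order_release);
    }
    return NULL;
}

/* Returns the average one-way latency in ns, or -1.0 if the thread
 * could not be created. The first round trip also absorbs the responder's
 * startup time, so use many round trips to amortize it. */
static double one_way_latency_ns(int cpu_a, int cpu_b, uint64_t round_trips)
{
    struct pingpong pp;
    struct timespec t0, t1;
    pthread_t t;

    assert(round_trips > 0);
    atomic_init(&pp.flag, 0);
    pp.round_trips = round_trips;
    pp.cpu_b = cpu_b;

    if (pthread_create(&t, NULL, responder, &pp) != 0)
        return -1.0;
    pin_to_cpu(cpu_a);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < round_trips; i++) {
        atomic_store_explicit(&pp.flag, 2 * i + 1, memory_order_release);
        while (atomic_load_explicit(&pp.flag, memory_order_acquire) != 2 * i + 2)
            ; /* spin until the responder answers */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (2.0 * (double)round_trips);
}
```

Calling `one_way_latency_ns(0, 4, 1000000)` would give one entry of a latency matrix like the ones quoted above. Build with `-pthread`.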
A few years ago you guys did an awesome comparison of some GPU hardware encoders, including the 6900 XT and RTX 2060. Any way you could do the same for the 7000 series encoder and the upcoming AMD GPUs this March? I want to upgrade from my 3090, but the Nvidia cards this gen are atrocious value. Depending on the value of the new AMD 9000 series, I want to consider one of those cards, but I have no idea what the encoding is like on, say, the 7900 XTX.
These are some pretty big architectural deficiencies. SIMD has been nearly ubiquitous on application-processor-class ARM CPUs for over 15 years, and hardware unaligned access support for even longer.
The OS designers could probably bring the unaligned access cost down a bit if they decided to optimize and fast-track it; 500+ instructions is pretty enormous. But the whole ethos of lacking unaligned memory accesses was that software was just not expected to do it, and that the OS would fault the program and let the developer know to change the code so it doesn't need unaligned accesses. Of course, that approach has not been realistic for existing software ecosystems for decades now, which makes it pretty baffling that an even remotely modern CPU would lack it.
The lack of SIMD is less of a potential performance disaster, but it does at least make their core size and performance comparison highly questionable. But it's not surprising. RISC-V has really shot itself in the foot with SIMD. Its earliest revisions intentionally avoided it, trying to get designers to use some bizarre loop-vectorizing coprocessor instead. When they got around to adding it properly, the standard was still pretty volatile for years and the base feature set was pretty demanding (not an entirely bad thing, but probably hurting adoption). It also has some questionable aspects, such as needing a separate instruction to set the element width, which is probably going to hurt performance on some vectorized integer code.
It's interesting that the comparison was made with the Pentium Pro. That CPU was probably fine for what it was: a high-end, expensive workstation/server-class part with relatively specialized code bases. But it needed robust features (e.g. MMX, which was still pretty new then, and better 16-bit support) as soon as it moved into its first actual mainstream consumer part (the Pentium II). This SoC is neither a specialized HPC part nor a specialized microcontroller, so it doesn't really have the same excuses.
I think for the P550, it's very likely that the vector extension was not yet frozen by the time they finished designing the core, considering SiFive had already showcased the original Horse Creek P550 dev board in late 2022. T-Head (XuanTie) made their cores based on the vector 0.7 spec, and that just created a lot more issues with the OS/toolchain… To make matters worse, something then happened between Intel and SiFive, and eventually SiFive turned to the Chinese vendor ESWIN for the SoC. Basically the board was delayed too much, and you are looking at a 3-4 year old core. I was hoping for P6x0 dev boards, which have vector 1.0, to be available in this day and age, and obviously I was hoping for too much. For misaligned load/store, I doubt any implementation, including SiFive's, has support for it. By default it'll go through traps into M mode, which does the instruction decoding and emulation (a fault that is invisible to the OS). And the RISC-V spec basically says to expect it to be slow and to fix your code to not have misaligned accesses. The support for it should be considered as being for correctness, not performance.
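As a concrete illustration of "fix your code to not have misaligned access", here is a minimal C sketch of two common ways to read a 32-bit value from an address with no alignment guarantee. The function names are made up for this example. The `memcpy` form leaves the access pattern to the compiler, which, when targeting a strict-alignment configuration, typically emits byte loads rather than a single word load that could trap. The second form spells out the byte loads by hand and assumes the data is little-endian.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from a possibly misaligned address.
 * Unlike *(const uint32_t *)p, this is well-defined C and never
 * asks the hardware for a misaligned word load directly. */
static inline uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Same idea with explicit byte loads, assuming little-endian data. */
static inline uint32_t load_u32_le_bytes(const uint8_t *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```

Either helper costs a handful of instructions per access, which is small next to a trap-and-emulate path measured in hundreds of instructions.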
How was the time measured? I own a Premier P550 myself, and mtime runs at 1 MHz. I wonder what clock source you were using to get ns accuracy. And thanks so much for the analysis!
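One common way to get sub-tick resolution from a coarse clock is to time a large number of iterations and divide. This may or may not be what the article did. With a 1 MHz timer the whole run is off by at most about one tick (1000 ns), so the per-iteration error is roughly 1000/N ns: about 0.001 ns for N = 1,000,000. A minimal sketch in C, assuming a POSIX system with `clock_gettime`; `ns_per_iter` and `dummy_work` are hypothetical names used only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Difference b - a in nanoseconds. */
static double elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec) * 1e9
         + (double)(b->tv_nsec - a->tv_nsec);
}

/* Run fn() `iters` times and return the average time per call in ns.
 * The clock's tick error is spread across all iterations. */
static double ns_per_iter(void (*fn)(void), uint64_t iters)
{
    struct timespec t0, t1;
    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(&t0, &t1) / (double)iters;
}

/* Hypothetical workload: a volatile increment the compiler cannot drop. */
static volatile uint64_t sink;
static void dummy_work(void)
{
    sink++;
}
```

For example, `ns_per_iter(dummy_work, 100000000)` reports an average that includes the loop and call overhead, so real microbenchmarks usually subtract or minimize that overhead.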
Great read. It definitely seems like an odd choice to forgo hardware support for misaligned accesses, considering the substantial loss in performance. I assume it was a conscious design choice. Do you think it is just because of the increased area/power requirements of the required logic?
Thanks for another great deep dive! So, is the inordinately large number of cycles you observed the SiFive core needing for unaligned loads its current glass (brittle) jaw? And is that specific to this implementation of RISC-V, or a general feature of current RISC-V designs? If it's more of a general feature, the idea of substituting ARM cores with RISC-V would be quite a hill to climb!
Yeah, it's something to be avoided as much as possible. I don't think it's a general RISC-V thing (much as high penalties for subnormal FP numbers aren't an x86 or ARM thing), just an implementation choice on this CPU.
The RISC-V architectural standard allows misaligned accesses to complete transparently (and correctly) without faulting. Going with not supporting it was a pretty questionable decision for this processor. But you could argue that the base ISA profile should have never been specified to disallow it to begin with.
Even a low profile implementation with microfaulting and microcode would be highly preferable to this.
It would be interesting to run tests on Tenstorrent's silicon.
Expensive, slow and incompatible. A long way to go before it becomes something meaningful for consumers.
I assume there's no way to run SPEC on this?
https://www.notebookcheck.net/Allwinner-A733-Processor-Benchmarks-and-Specs.951751.0.html
Allwinner added a RISC-V core to this ARM SoC for whatever reason.
Speculation (!) on my part: this small RISC-V core is serving a similar role as a small ARM Cortex M. Probably saves Allwinner a few cents in licensing fees.
I have a different theory. The E902 likely oversees power states, dynamically adjusting voltage/frequency for the ARM cores or peripherals. It can keep the main cores in sleep mode during idle periods, waking them only when needed, thereby optimizing battery life(?)