Dumb questions..
1) How do you declare and initialize an array of NOPs?
__asm__ __volatile__(".byte 0x90") or __asm__ __volatile__("nop")
like this?
void Test_Nops(void);

void (*func_ptr[4])(void) = {
    &Test_Nops,
    &Test_Nops,
    &Test_Nops,
    &Test_Nops
};

void Test_Nops(void) {
    __asm__ __volatile__("nop");
    return;
}
2) It shows Zen cores have two sets of 4-wide decoders. Is that on a per-core basis, i.e. 1 core has 8 decoders? So if I have an 8-core CPU (say, the X3D), my CPU has 8*8 == 64 decoders in total?
Fill opcodes into the array, just like filling any normal array. Then use whatever OS call you need to make it executable and jump to it. Make sure you have a return at the end of the array.
My code has
for (uint64_t nopIdx = 0; nopIdx < elements; nopIdx++)
    nops[nopIdx] = *nop8bptr;
where nop8bptr started out pointing to an 8-byte NOP, but I've hacked a lot of other things in since, so it's no longer recognizable.
then mprotect((void *)nopfuncPage, mprotectLen, PROT_EXEC | PROT_READ | PROT_WRITE) on Linux, or VirtualAlloc(..., MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE) on Windows, and jump to it from a function pointer.
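Putting it together, here's a minimal self-contained sketch of the same idea for Linux/x86-64 (not the actual benchmark code; names like nop_buf, NOP8 and nop_func are made up for illustration): fill a page-aligned buffer with 8-byte NOPs, append a RET, mark the buffer executable with mprotect, and call it through a function pointer.

/* Minimal sketch (Linux, x86-64, glibc); illustrative names, not the real code. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const size_t elements = 1024;                  /* number of 8-byte NOPs */
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = elements * 8 + 1;                 /* +1 for the trailing RET */
    size_t alloc = (len + pagesize - 1) & ~(pagesize - 1);

    /* 8-byte NOP encoding: nop dword ptr [rax+rax*1+0] */
    static const uint8_t NOP8[8] = {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00};

    uint8_t *nop_buf = aligned_alloc(pagesize, alloc);
    for (size_t i = 0; i < elements; i++)
        memcpy(nop_buf + i * 8, NOP8, sizeof(NOP8));
    nop_buf[elements * 8] = 0xC3;                  /* RET so the call returns */

    /* Make the buffer executable, then jump to it via a function pointer. */
    mprotect(nop_buf, alloc, PROT_READ | PROT_WRITE | PROT_EXEC);
    void (*nop_func)(void) = (void (*)(void))nop_buf;
    nop_func();

    free(nop_buf);
    return 0;
}

On hardened systems that enforce W^X you may need to allocate with mmap(MAP_ANONYMOUS) instead of heap memory; on Windows, VirtualAlloc with PAGE_EXECUTE_READWRITE fills the same role.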
And yes, decoders are per-core
At this point I'm starting to wonder if Zen CPUs even need an L1 instruction cache. The omission was pretty disastrous for Netburst, but Zen 5 has a much bigger op cache, with the L1i offering maybe only 1.5x to 2x the effective coverage, and L2 is basically as good at feeding the decoders anyway.
Maybe it'd be better served by an L2 op cache? Built from much denser SRAM cells, it could offer more effective capacity in the same area as the existing L1i at a probably harmless latency cost, while benefiting from higher energy efficiency, higher typical frontend throughput, and maybe an easier path to using both decode clusters for a single thread on queued L1 op cache misses.
Great article, as always. I really appreciate these articles that might not have practical relevance but help explain the decisions and trade-offs made by chip designers.
Another aspect of interest for me would have been the impact on power efficiency when disabling the op cache. Did power consumption (in W) increase significantly? How much did it affect the overall energy (in J, i.e. W·s) needed for a fixed workload like a CB24 run?