ARM has traditionally targeted the low end of the power and performance curve, but just as Intel has been looking to expand into the low-power market, ARM is looking to expand into higher power and performance segments.
Your code seems to execute 20 instructions, as two blocks of 10 instructions. So the throughput will actually be min(10/latency, peak throughput), which corresponds to the 2.5 result above. Using 16 independent instructions makes the throughput go to 4 IPC, as advertised.
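To make the min(10/latency, throughput) reasoning concrete, here is a rough sketch of that model in Python. The 4-cycle latency is my assumption, inferred from 10/latency = 2.5; the 4 IPC peak is the advertised figure mentioned above. The function name `modeled_ipc` is made up for illustration, and this is a back-of-the-envelope model, not a measurement.

```python
def modeled_ipc(independent_ops: int, latency_cycles: float, peak_ipc: float) -> float:
    """Estimate steady-state IPC when only `independent_ops` instructions
    can be in flight at once, each taking `latency_cycles` to complete."""
    # A block of N independent instructions finishes in about `latency_cycles`
    # cycles, so it sustains N / latency IPC, capped by the core's peak rate.
    return min(independent_ops / latency_cycles, peak_ipc)


LATENCY = 4.0   # assumed: inferred from 10 / latency = 2.5
PEAK_IPC = 4.0  # advertised peak throughput quoted in the comment

for n in (10, 16):
    print(n, modeled_ipc(n, LATENCY, PEAK_IPC))
# prints:
# 10 2.5
# 16 4.0
```

Under these assumptions, 10 independent instructions per block are not enough to cover the latency, while 16 are exactly enough to reach the peak.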
Some other tests are showing 2.53 too.