Thanks Chester, great test and review of ARM's A725!
My (admittedly jaded) take on the slim-down of some features (vs the A710) is that, in addition to reducing area, this also reminds licensees to kindly use and pay for an X core design if they want a more performant core.
And I have this question about SVE: how much of the "de-prioritizing [of] vector execution to favor density" you observed is also because SVE sees little if any use by current software? I've heard and read that many developers find SVE hard to implement, and AFAIK Qualcomm has omitted SVE entirely from their own core designs.
So, does SVE really see little use in software, is it as unwieldy as many say, and might this be the reason why ARM has now reduced its presence in their recent core designs? Thanks!
I think it's really density focused. Going higher performance is probably not worth it when you need several of these cores to do well in cell phones. A small SVE implementation probably fits with that too. I don't see people running well vectorized, throughput bound workloads on cell phones. It would be horrible for battery life.
Also the important part is they can at least support SVE from this "little" core. It doesn't necessarily have to be fast. It helps to get a lot of SVE-capable hardware out there so people can play with it and start developing for it.
Because the width of the SVE implementation is software-visible, it seems to me that you can't have a SoC with a set of cores featuring mixed SVE widths. They must all report the same width, so that they all have matching thread contexts.
The C1 Ultra, Premium, Pro, and Nano cores all feature 128-bit SVE width.
That makes sense! I mainly remembered the initial implementation of SVE in Fujitsu's A64FX, used in their Fugaku (Mt. Fuji) supercomputer. If I remember correctly, SVE was actually developed jointly by ARM and Fujitsu.
Here's an (IMHO) interesting blog-entry about how to test for SVE register size: https://lemire.me/blog/2022/11/29/how-big-are-your-sve-registers-aws-graviton/
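If I remember right, the approach in that post boils down to just asking the kernel for the thread's current vector length. Roughly something like this (a minimal sketch for Linux on arm64, not the exact code from the post):

```c
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SVE_GET_VL
#define PR_SVE_GET_VL 51            /* from <linux/prctl.h> on newer kernels */
#endif
#define PR_SVE_VL_LEN_MASK 0xffff   /* low 16 bits hold the length in bytes */

int main(void) {
    /* PR_SVE_GET_VL reports the calling thread's SVE vector length in bytes;
       the upper bits carry flags, so mask them off before printing. */
    int ret = prctl(PR_SVE_GET_VL);
    if (ret < 0) {
        printf("SVE not supported (or the kernel is too old)\n");
        return 1;
    }
    int vl_bytes = ret & PR_SVE_VL_LEN_MASK;
    printf("SVE vector length: %d bytes (%d bits)\n", vl_bytes, vl_bytes * 8);
    return 0;
}
```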
I think it's sort of funny and a little sad how SVE has gotten narrower as it has increased in adoption.
First, Fujitsu implemented it at 512-bit, which made a lot of sense for a HPC chip of its era. Then, ARM rolled it out in the Neoverse V1 at 256-bit. Now that ARM has included it in their phone SoC cores, it's being implemented at just 128-bit!
I get that you can compensate for lower width by having more pipelines, but it seems unlikely to fully match the potential benefits of going wider.
Thanks Chester! Another thing about SVE that I have wondered about: how did ARM deal with SVE in "big.LITTLE" SoCs? AFAIK, the small cores don't have SVE, so how did ARM manage what Intel couldn't with AVX-512? Again only AFAIK, ARM didn't delete SVE from their P-core designs, or am I wrong? (That's also entirely possible 😄)
> AFAIK, the small cores don't have SVE
Yes, they do. All ARMv9-A IP cores by ARM support SVE2.
In fact, ISA symmetry is so important that a patch went in to disable BF16 on all SoCs that include A510's, because an early revision of that core had a hardware bug.
https://www.phoronix.com/news/Linux-6.1-ARM64-Updates
BTW, you can look up the SVE width of different ARM IP cores in their official reference documentation. They don't make it too easy to find, but that's how I was able to verify that all four classes of C1 cores implement it at 128-bit.
> My (admittedly jaded) take on the slim-down of some features
I think it's just ARM facing the reality that their A5x-series cores are no longer viable against competing E-cores. By insisting on remaining in-order, the A5x-series seems more like a replacement for their old A3x-series. So, the A7x cores had to step into the breach.
Meanwhile, you've got some SoCs with a combination of A725, X4, and X925, where the X4 is being used as the mid cores. That option further enables area & power optimizations in the A7x cores, as they now slot firmly into the role of the "little" cores. In the new C1 generation, those probably align with the Pro, Premium, and Ultra cores, respectively.
BTW, the C1 Nano is the successor to the Cortex-A520. According to the Software Optimization Guide for it, it's also an in-order core: https://documentation-service.arm.com/static/68c946f08a337a2bc66460c8
It's (IMHO) very unfortunate that ARM continues to limit their LITTLE cores for smartphone SoCs* to in-order execution. Many of us had hoped for/expected ARM to follow Apple's E-cores here (out-of-order) long before the release of "Nano". I have heard (read) ARM's arguments for why they believe in-order execution is better for LITTLE cores, but the performance/W (or per Joule 😄) of Apple's E-cores is a lot better.
* Adding to my puzzlement over this is that ARM has and had designs for small-ish cores that use OoO execution for years.
A large part of Apple's E-core advantage comes from the fact that they only ship in Apple's own lineup, which emphasizes performance almost as much as efficiency. Apple was willing to spend a little more die area per E-core because they only have to meet Apple's needs instead of those of thousands of different customers all over the world. Apple's E-cores have usually been about 20% larger than ARM's in any given generation. Apple's P-cores are also larger than Qualcomm's Oryon P-cores. I'm not sure about Oryon "E-cores" because they're now equivalent to an A725, A710, etc. rather than an A520, A55, etc.
> It's (IMHO) very unfortunate that ARM continues to limit their LITTLE cores for Smartphone SoCs* to in-order execution.
But the better SoCs are migrating away from Cortex-A5x tier cores, and that's probably why. I was too lazy to look it up earlier, but the MediaTek Dimensity 9400 is an example: 1x Cortex-X925 + 3x Cortex-X4 + 4x Cortex-A720. No A5x cores!
> Many of us had hoped for/expected ARM to follow Apple's E-cores here ( out-of-order)
By making the A7x cores their new E-cores, that's effectively what they've done!
They still make cores in the A520/A55 family. Phone chip designers such as MediaTek just aren't using them in flagship chips; you'll still see them in midrange chips. Keep in mind that the A725 is much smaller than the A77 was (the last "A-series" prime core, as the SD888 used an X1 + 3x A78 + 4x A55). Even the A78 had already started the process of downsizing the A-series cores. If you read the coverage when the A78 released, ARM specifically mentioned that they focused more on reducing area than increasing performance because they now had the X1 prime core to provide the snappy desktop feel.
> Keep in mind that the A725 is much smaller than the A77 was
It seems like you're contradicting yourself. On the one hand, you lament that the A5x-series are still in-order. Yet, now that the A7x-series is stepping into the role of E-cores, you're complaining they're not as wide as when they were the P-cores.
Yes, it's all true! A7x is no longer the P-tier. That's now the X (or "Ultra", in the new parlance). So, it's natural that the A7x would focus more on efficiency.
Anyway, ARM shook up the whole product stack with the new C-based naming scheme and there are now 4 tiers. The successor to the A5x-series is the bottom of those (i.e. the "Nano" tier). The A7x-series is succeeded by the second tier (i.e. the "Pro"). Ultra is the new name for the X-series, and then "Premium" slots in between.
This is interesting, could you provide any references for further reading on the awkwardness of SVE? My understanding was that Apple, for example, has gotten very good use out of SVE and SME, so I wonder what the difference is.
A related question for anyone: if vector instructions are deprioritized in some of the cores in a heterogeneous design like this, do different OSes have sufficient logic to schedule programs that need them appropriately? Do chip vendors supply patches to Linux, Windows, etc. for each core design?
> Apple for example has gotten very good use of SVE and SME
Apple hated SVE and refused to implement it. I wonder if it was even the main thing holding them back from ARMv9-A.
I suspect SSVE (Streaming SVE) was added at their insistence. Furthermore, when the core is in streaming mode, regular SVE/SVE2 instructions aren't allowed to execute (unless the optional FA64 extension is implemented). This seems like Apple coercing ARM to kill off SVE/SVE2.
> if vector instructions are deprioritized in some of the cores in a heterogeneous design like this, do different OS’s have sufficient logic to schedule programs that need them appropriately?
Nope. I'm not aware of any examples of heterogeneous CPUs with any ISA differences, including vector width. Alder Lake seemed like it was close to releasing with AVX-512 in the P-cores, but Intel disabled it in BIOS and then fused it off in later revisions. The problem with having ISA-level asymmetries between cores is primarily that userspace software isn't designed for something like that.
Apps would do things like starting too many threads for the number of cores that could actually run them, and then you'd get performance problems with those threads all competing for execution time on a smaller number of P-cores. Or the app starts on an E-core, sees no AVX-512, and doesn't even try to use it in threads that later get scheduled onto a P-core.
It'd just be an even bigger headache than heterogeneous CPUs already are, for both app and OS developers.
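To make that concrete: userspace typically checks CPU features once and caches the answer for the whole process. A rough sketch of that pattern on arm64 Linux (on x86 it'd be a CPUID check instead), with a function name of my own choosing:

```c
#include <stdbool.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>   /* arm64-specific HWCAP2_* bits */

/* Typical one-shot feature detection: ask the kernel once, cache the answer
   process-wide. If cores could disagree on the ISA, whichever core this ran
   on first would silently decide for every thread in the process. */
static bool cpu_has_sve2(void) {
    static int cached = -1;
    if (cached < 0)
        cached = (getauxval(AT_HWCAP2) & HWCAP2_SVE2) != 0;
    return cached;
}
```

AFAIK the arm64 kernel only advertises a hwcap when every core supports the feature, which is exactly the symmetry being enforced here.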
Unlike earlier versions, the most recent Apple CPUs (M4 and M5) apparently use or can use SME, but not SVE, at least not all of it; they seem to support SVE2. I'm not an expert on this, but it's mentioned here: https://code.videolan.org/videolan/dav1d/-/merge_requests/1787
While IDK exactly why SVE has (had?) a "bad rep" as being difficult, ARM didn't help it with remarks about using assembly to make it work well, for example in here: https://documentation-service.arm.com/static/6168352bac265639eac59083?token=
Lastly, I believe it'll go with SVE on ARM just like it did with AVX-512 on x86-64 years ago: once there are really compelling upsides to using SVE (and SVE2), it'll show up in a lot more applications.
I think once there's enough hardware out there with SVE/SVE2 support, we'll start to see more applications take advantage of it. I don't think needing assembly to make it work well is unusual. Vector extensions often need intrinsics or assembly to really shine.
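For what it's worth, the intrinsics aren't terrible once you get used to the predicate-driven, vector-length-agnostic style. A minimal sketch (my own example, nothing from the article) of an array add that runs unmodified whether the hardware vectors are 128-bit (the C1 cores) or 512-bit (A64FX):

```c
#include <arm_sve.h>
#include <stddef.h>

/* Vector-length-agnostic SVE loop: svcntw() tells us how many 32-bit lanes
   the hardware actually has, and the predicate handles the tail, so there is
   no scalar cleanup loop. */
void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_u64(i, n);       /* active lanes: i..n-1 */
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_m(pg, va, vb));
    }
}
```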
I sure hope we get a look at the X925 cores in there as well. It would be very interesting to see the structure sizes and queue depths of ARM's most aggressive off-the-shelf IP.
While not in the same depth as Chester's excellent analysis, you can read some of those details about it, here:
https://fuse.wikichip.org/news/7761/arm-launches-next-gen-flagship-cortex-x925/
The SVE approach simply doesn't fit many cases of general-purpose SIMD usage well. While it would be a nice extra, what people really want and need is 256-bit NEON.
Great write-up! The DGX Spark runs on a proprietary, Ubuntu-like version of Linux.
Did Nvidia make any optimizations to the OS scheduler in terms of workload distribution?
Any chance you can run some gaming tests on the DGX Spark with titles that have native ARM builds?
Wendell from Level1techs (his YouTube channel) has done exactly that. It's quite interesting 😄.