16 Comments
Schrödinger's Cat's avatar

"554.roms is the worst offender, and makes X925 execute more than twice as many instructions compared to Zen 5."

It would be interesting to know how heavily that test is utilizing SVE2 and AVX. It seems like the main reason it needs so many more instructions could be simply due to its narrower vector width.

I'd be curious if the same test would report a different instruction rate on a Graviton 3 CPU, which has Neoverse V1 cores with a 256-bit wide SVE implementation. Maybe ARM just can't keep trying to get by with 128-bit vectors, at a time when even Intel is going back to 512-bit.
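
To put rough numbers on the vector-width argument, here is a minimal sketch of how many vector instructions a fully vectorized loop needs at different widths. The element count, the 64-bit element size, and the helper name `vector_instructions` are assumptions picked for illustration; nothing here is measured from 554.roms, and it ignores scalar code, loads/stores, and loop overhead.

```python
import math


def vector_instructions(num_elements: int, element_bits: int, vector_bits: int) -> int:
    """Vector instructions needed to apply one operation to every element.

    Hypothetical helper: assumes the loop vectorizes perfectly.
    """
    lanes = vector_bits // element_bits  # elements handled per instruction
    return math.ceil(num_elements / lanes)


NUM_ELEMENTS = 1_000_000  # assumed array length
ELEMENT_BITS = 64         # assumed double-precision data

# 128-bit (X925's SVE2), 256-bit (Neoverse V1's SVE), 512-bit (AVX-512)
counts = {
    width: vector_instructions(NUM_ELEMENTS, ELEMENT_BITS, width)
    for width in (128, 256, 512)
}

for width, count in counts.items():
    print(f"{width}-bit vectors: {count} instructions")
```

Under those assumptions a 128-bit machine issues 4x the vector instructions of a 512-bit one for the same work, so a bit more than 2x overall is plausible if only part of the instruction stream is vector code.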

Schrödinger's Cat's avatar

"That said, getting a high performance core is only one piece of the puzzle. Gaming workloads are very important in the consumer space, and benefit more from a strong memory subsystem than high core throughput."

Yes, and you compared a system with LPDDR5X against two desktop CPUs with regular DDR5 memory. LPDDR has an extra latency penalty, compared to regular DDR memory, because it must multiplex address and data over the same pins. This makes the GB10's rate-1 performance even more impressive, because it's paying the LPDDR latency penalty without getting any real benefits from the 256-bit data path.

c3dtops's avatar

I thought GB10 was using MediaTek's (Taiwan) micro-architecture for the CPU?

So Nvidia is paying ARM Holdings both the ARM Architectural License fees and also micro-architecture design fees?

https://www.mediatek.com/press-room/newly-launched-nvidia-dgx-spark-features-gb10-superchip-co-designed-by-mediatek

The GB10 Grace Blackwell Superchip leverages MediaTek’s experience in designing power-efficient and high-performance CPU, memory subsystem, and high-speed interfaces to power the Grace 20-core Arm CPU. Combined with the latest generation Blackwell GPU and 128GB of unified memory, GB10 delivers up to 1 PFLOP of AI performance to accelerate model tuning and real-time inferencing.

Schrödinger's Cat's avatar

> I thought GB10 was using MediaTek's micro-architecture for the CPU?

I'm not aware of MediaTek ever designing their own CPU or GPU microarchitecture. I'm nearly certain they've always licensed that IP from others.

I presume the way this particular partnership might've got going was MediaTek effectively trying to license the GPU IP from Nvidia. The collaboration came to light a couple years after Samsung licensed RDNA from AMD. So, maybe MediaTek worried that it needed to counter with an iGPU a bit more powerful than it could get from ARM or Imagination Technologies.

c3dtops's avatar

I was under the impression that MediaTek SoCs in smartphones (those Chinese OEM phones) with Dimensity cores were using their own micro-architecture?

https://www.mediatek.com/products/smartphones/dimensity-5g

So in a way, that strategy from MediaTek is/was slowly encroaching onto the traditional stronghold of Qualcomm's mobile SoC business.

Maybe I'm wrong then. Tyvm for sharing

Peter W.'s avatar

I concur. AFAIK, Mediatek is and has been using stock ARM designs for their CPU cores (TLA license) and has, at least recently, also used ARM's designs of various GPU cores. It would be interesting to know if Mediatek even has an Architecture License Agreement (ALA) with ARM. This is the kind of license required to design and launch one's own core designs that are still ARM ISA compatible.

However, Mediatek does use their own designs and has significant IP for their LTE and 5G modems that are usually integrated in their mobile SoCs, making them one of only six or seven companies in the world with that capability. For an example of how difficult that is, I recommend a look at Apple's struggle to finally get there in 2025. That journey cost them billions of dollars and several years of trying until they finally had a good 5G modem with the low power draw needed for smartphones. It also illustrated the fact that modems and RF modules are their own universe, and just because a company is good at designing CPU or GPU cores doesn't mean they're automatically good at designing modems.

Schrödinger's Cat's avatar

> I was under the impression that MediaTek SOC on smartphone with Dimensity cores were using their own micro-architecture?

No, I don't recall any of them using in-house cores. The better phone sites, like gsmarena, tend to have pretty good coverage of the SoCs, as well. Just search for a SoC on there, and you'll probably find a lot more details than whatever press release the manufacturer puts out about it. https://www.gsmarena.com/mediatek_announces_dimensity_9500_flagship_chipset-news-69618.php

NotebookCheck is probably another good resource. https://www.notebookcheck.net/MediaTek-Dimensity-9500-Processor-Benchmarks-and-Specs.957550.0.html

With the Qualcomm SoCs, they even tend to say which cores the SoCs actually have. Before Oryon, Qualcomm tried to obscure which IP they licensed by calling everything Kryo, but the better sites would tell you which ARM IP cores they really were. Qualcomm did have their own in-house cores, but from like 2016 until the last couple years, all of their SoCs used IP cores licensed from ARM.

Almost none of the phone SoC makers design their own cores. Right now, I think it's just Apple, Qualcomm, and HiSilicon/Huawei. Samsung used to be in that club, but they ended their in-house ARM core design efforts about 7 years ago.

Peter Lafreniere's avatar

Just so y'all know, this article isn't on the old site. I hope this is the result of an oversight, and not that the old site has stopped receiving new content with no notice.

Chester Lam's avatar

It’s a lot of overhead to copy content into Google Docs for editing, then to both Substack and Wordpress. I neglected to do that for Wordpress, but also want to deprecate the old site. When there’s only 1-2 people involved in running the site day-to-day, it’s not sustainable. We don’t even have resources to address Substack-related problems, let alone Wordpress-related ones too.

Avik De's avatar

Are there any comparisons of these microarchitectural choices to Apple or Qualcomm’s designs published anywhere? I think people using the GB10 may for example compare to Qualcomm’s IQ series.

Schrödinger's Cat's avatar

Snapdragon X's core was covered in these articles:

* https://chipsandcheese.com/p/qualcomms-oryon-core-a-long-time-in-the-making

* https://chipsandcheese.com/p/hot-chips-2024-qualcomms-oryon-core

Snapdragon X2 was discussed here:

* https://chipsandcheese.com/p/qualcomms-snapdragon-x2-elite

To my knowledge, this publication has yet to cover any Apple CPUs, but you can find a fairly comprehensive compilation of information about their M-series, here:

* https://github.com/name99-org/AArch64-Explore/tree/main

Anandtech benchmarked & analyzed Apple's cores up to & including the M1, but the publisher took down those articles. You should be able to get them on archive.org, if you know which one you're looking for.

Peter W.'s avatar

Apple has been notoriously quiet about any details of their own silicon. Their engineers were also (AFAIK) not that "popular" at conferences like Hot Chips, because they take a lot of notes, but never share their own knowledge. Chester or George would know more, especially if that has changed in recent years.

Peter W.'s avatar

Firstly, thanks for another good write-up, Chester!

One comment, two questions:

My comment is about the X925's SVE being limited to 128 bits. That sort of makes sense in a mobile SoC (smartphone), but in the CPU of an SoC as powerful as the one in the Spark, it's almost a bit weird. SVE implementations, since their inception in the Fujitsu A64FX, were up to 512 bits wide, and ARM has further widened the maximum SVE since. Is the stock X925 SVE design limited to 128 bits?

My other question is about the transistor count and approximate area of the X925: do you have any sources or estimates for those?

Thanks again!

Schrödinger's Cat's avatar

I think ARM designed the Cortex X925 with 128-bit SVE2 and that's it. I doubt they provide customers with an option to order it with wider SVE2, as that would require a new layout of the entire core. L2 cache is probably easier to parameterize, since it's sitting at the core's periphery.

https://kurnal-insights.com/dieshot/nvidia-gb10/

Furthermore, every core in the SoC would need the same width SVE2 implementation, since it's software-visible. So, if they made it 256-bit, then the A725 would also need a 256-bit SVE2 implementation. Heck, if your SoC had A520 cores, even those would need to have a 256-bit implementation! I think this explains why they've stayed at 128-bit, in all the client cores that have it. I checked the docs and even the new C1 cores are still just using 128-bit implementations.

AFAIK, the only ARM cores with SVE > 128-bit were Fujitsu's (512-bit) and the Neoverse V1 (256-bit).

Note that ARM now has SME for matrix operations. Also, SSVE seems to supersede SVE2. Apple has shunned SVE/SVE2 and basically gone straight to SSVE + SME.

Regarding die area, see above link. I've seen claims about transistor count, but none that I consider trustworthy.

Schrödinger's Cat's avatar

BTW, this article's SPEC2017int rate-1 scores mostly align with David Huang's, but the 9800X3D is somewhat of an outlier. He gets 13.8, whereas this article claims just 10.8.

https://blog.hjc.im/spec-cpu-2017

However, this article's score of ~11.6 for the 9900X is better aligned with his score of 12.6 for the 9950X. If you scale 11.6 by the relative clock speed differential, the expected score of the 9950X would be 11.8, which is only about 6.3% below Huang's. So, that makes the case of the 9800X3D rather odd.

Obviously, given different OS, RAM, and compiler versions, some differences in score are to be expected. What I found surprising was that the Zen 5 CPUs didn't all scale somewhat proportionately between the two sets of benchmarks.

FWIW, he benchmarked the GB10's X925 at 12, which is much closer to this article's score of ~11.8.
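
For anyone who wants to check the arithmetic above, here is a small sketch of the comparison. The scores are the ones quoted in this comment; the boost clocks (5.6 GHz for the 9900X, 5.7 GHz for the 9950X) are my assumption and are not taken from either set of benchmarks. Linear scaling with clock speed is only a first-order estimate.

```python
# Rate-1 SPEC CPU2017 integer scores as quoted in this thread.
ARTICLE = {"9800X3D": 10.8, "9900X": 11.6, "GB10 X925": 11.8}
HUANG = {"9800X3D": 13.8, "9950X": 12.6, "GB10 X925": 12.0}

# Assumed maximum boost clocks in GHz (not stated in the thread).
BOOST_GHZ = {"9900X": 5.6, "9950X": 5.7}


def scale_by_clock(score: float, from_ghz: float, to_ghz: float) -> float:
    """Scale a score linearly by the clock ratio (first-order estimate only)."""
    return score * to_ghz / from_ghz


def percent_below(value: float, reference: float) -> float:
    """How far `value` falls below `reference`, as a percentage of `reference`."""
    return (reference - value) / reference * 100.0


# Estimate a 9950X score from the article's 9900X result.
est_9950x = scale_by_clock(ARTICLE["9900X"], BOOST_GHZ["9900X"], BOOST_GHZ["9950X"])

gaps = {
    "9950X (estimated)": percent_below(est_9950x, HUANG["9950X"]),
    "9800X3D": percent_below(ARTICLE["9800X3D"], HUANG["9800X3D"]),
    "GB10 X925": percent_below(ARTICLE["GB10 X925"], HUANG["GB10 X925"]),
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.1f}% below Huang's score")
```

Laid out this way, the 9800X3D gap is several times larger than the other two, which is the oddity I was pointing at.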

Schrödinger's Cat's avatar

"Matching the best from Intel and AMD must have been a distant dream in 2012, when Arm launched their first 64-bit core, the Cortex A57."

I think it's slightly generous to credit ARM for launching it in 2012. As far as I can tell, the first SoCs using it didn't ship until 2014.