In the GB10 chip specs, both the S-dielet (CPU and memory subsystem) and the G-dielet (GPU) are built on TSMC 3nm.
Blackwell discrete GPU cards use the TSMC N4P node, and since the SM count on the G-dielet is similar to the RTX 5070's, any chance you can test out the performance of the GPU portion?
Firstly, thanks Chester and a Happy and Healthy New Year to you and George!
Second, that makeup of the CPU reminded me of the recent large Dimensity SoCs - large X cores and medium-size A700-series cores. Maybe MediaTek figured that making the two CPU clusters larger versions of their mobile CPUs was easier and quicker?
But I can't help wondering if the area spent on the ten A cores in the Spark wouldn't have been better invested in ~4 additional X cores; saving a few milliwatts here and there makes sense for a smartphone SoC, but I'm not sure it's the way to go for a ~200 W mini AI accelerator running off a wall outlet.
That all being said, this is, after all, an AI box, and the key function of the CPU in those is to keep the GPU cores fed and manage traffic. A bit different from Strix Halo, which is intended to be both this and a strong general purpose computer. Strix Halo manages to be quite good at both, but it's (IMHO) a lot less specialized than the Spark. However, AFAIK, the Spark is still the way to go if AI and the Nvidia ecosystem is important; ROCm is getting better, but, from what I hear, still needs a lot more hands-on effort to make it work.
I wonder if you had a chance to test just how busy those two clusters of Arm cores are when the Spark is used as intended - for AI? Maybe MediaTek was right about the medium-size cores after all, and they'll do just fine "managing traffic"? In that case, were the X cores even necessary?
Lastly, your findings on memory and power allocation prioritizing the Spark's GPU reminded me more than a little of your findings for the custom SoC in the Steam Deck (of all things 😄). Another case of CPU cores getting starved of power and memory so that the GPU can run at full tilt.
In closing, I really appreciate your tests and reviews. And, hopefully, you'll get to test a Battlematrix system from Intel soon, maybe even at or after CES (George, don't be shy asking Intel 😁).
You say GB10 is a specialized AI accelerator, but the N1X Windows-on-ARM chip is rumored also to have the same 10+10 core composition. Not sure whether they're actually the same silicon, even.
Considering that, one thought I've had is that the 10+10 core configuration was chosen to offer competitive multi-threaded performance in Windows benchmarks like Cinebench. If so, then 10+10 is surely better than 14+0. X925 cores are fast, but nowhere close to 2.5x as fast as A725s. Maybe X925s are 2.5x as fast as A520 cores.
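As a back-of-envelope check (the per-core numbers here are invented assumptions for illustration, not measurements): if an X925 counts as 1.0 units of multi-threaded throughput and an A725 as 0.6, then 10+10 beats 14+0, and the medium cores would have to fall below 0.4 units each before the all-big config wins.

```c
/* Aggregate multi-threaded throughput under perfect scaling.
 * Per-core values are hypothetical, chosen only to illustrate the
 * 10+10 vs 14+0 comparison. */
static double aggregate(int n_big, int n_mid,
                        double big_perf, double mid_perf) {
    return n_big * big_perf + n_mid * mid_perf;
}
```

With these made-up numbers, aggregate(10, 10, 1.0, 0.6) gives 16.0 versus 14.0 for aggregate(14, 0, 1.0, 0.6); solving 10 + 10x = 14 puts the break-even medium-core performance at x = 0.4.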
Hopefully, we'll be hearing more about the N1X in the next week, since a Computex announcement was rumored and seems likely.
The rumored N1X SoC is, AFAIK, intended to be a more "general" solution, and a 10+10 CPU might make more sense for general computing. Given the current RAM prices, I wouldn't be surprised if laptops with the N1X will yet again be no-shows at CES.
OT: I wonder how Qualcomm's board partners for the newest and greatest Elite Snapdragon will handle the shortage of LPDDR5 RAM.
I have a Snapdragon X1P laptop with only 16 GB of DRAM. I wouldn't be surprised if we even see some X2E laptops with that amount.
Funny enough, even with so little RAM, my X1P has a full NPU. It's the only part of the chip that matches the same spec as the X1E laptops.
Weird chip! Could Windows' thread scheduler struggling with the asymmetrical clusters possibly have anything to do with the decision to delay the N1X, assuming this aspect is structured the same as in the GB10?
Also, the cluster read + read-modify-write bandwidth table makes it seem like some of the GB10 test cases aren't measuring the same thing. Otherwise, it's quite paradoxical!
Finally, why is Add considered Read-Modify-Write? Is it simply adding a constant to an array in-place? Or is it doing A += B?
Yes, I'm adding a constant to an array in place.
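For anyone following along, the two variants being discussed could look like this (a minimal sketch of a hypothetical streaming kernel, not the actual benchmark code): the in-place version reads and writes a single stream, while A += B reads two streams and writes one, so the two put different read:write mixes on the memory subsystem.

```c
#include <stddef.h>

/* In-place add: reads a[i], adds a constant, writes it back.
 * Roughly one read and one write per element hits memory,
 * which is why it counts as read-modify-write. */
static void add_constant_inplace(double *a, size_t n, double c) {
    for (size_t i = 0; i < n; i++)
        a[i] += c;
}

/* A += B for contrast: two read streams and one write stream,
 * so the traffic mix leans more toward reads. */
static void add_arrays(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}
```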
My first thought when I saw the heterogeneous cores was that they wanted to set things up so they could bin out SKUs with homogeneous cores. However, it definitely does look odd, and I can't see NVDA wanting to offer parts like that; perhaps MTEK does?
In any event, OSes have been doing thread scheduling between heterogeneous cores in a single cluster for a while, and are improving at scheduling mixes of homogeneous and heterogeneous clusters that have homogeneous cores inside each cluster. Throwing in heterogeneous clusters with heterogeneous cores inside each cluster just feels like the hardware guys have decided to mess with the OS guys' minds and thrown them a curveball.
Scheduling of SMT seems a lot like scheduling heterogeneous cores. In that case, Strix Point should already be an example of scheduling asymmetrical cores within asymmetrical clusters. Perhaps OS thread schedulers treat SMT as a special case, but you could certainly look at it that way.
I guess that means we need to be careful about what we mean by a "cluster". To me "shared cache" = cluster, and from that perspective an SMT core is a symmetric mini-cluster (the threads share L1s and the various predictors), and we therefore end up with clusters of clusters, and potentially clusters of clusters of clusters once we get into multi-socket scenarios. It's turtles all the way down.
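That nesting can be made concrete with a toy example (the topology IDs below are invented for illustration; on Linux the real data is exposed under /sys/devices/system/cpu/cpu*/cache): define a cluster at each level as the set of logical CPUs sharing a cache ID, and the SMT pair, core cluster, and socket all fall out of the same definition.

```c
#define NCPUS 8

/* Invented topology for an imaginary 8-thread chip: SMT siblings share
 * an L1, four threads share each L2 "cluster", one L3 spans the chip. */
static const int l1_id[NCPUS] = {0, 0, 1, 1, 2, 2, 3, 3};
static const int l2_id[NCPUS] = {0, 0, 0, 0, 1, 1, 1, 1};
static const int l3_id[NCPUS] = {0, 0, 0, 0, 0, 0, 0, 0};

/* A "cluster" at a level is one distinct shared-cache ID, so counting
 * distinct IDs counts the clusters at that level. */
static int count_clusters(const int *ids, int n) {
    int max = -1;
    for (int i = 0; i < n; i++)
        if (ids[i] > max)
            max = ids[i];
    return max + 1;
}
```

By the shared-cache definition, this toy chip has 4 SMT mini-clusters inside 2 L2 clusters inside 1 chip-level L3 cluster: clusters of clusters of clusters.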