4 Comments
David. Hellyx:

"Nvidia’s conservative hardware" This is a f***ing 1600mm² die, it's not conservative. lol

Chester Lam:

I meant it's a conservative multi-die setup compared to using two or four base dies and stacking eight compute dies on top.

Avik De:

Thanks for the impressive testing, as always. I have an article idea/request that seems well within your expertise: looking at the new CUDA tile programming model and how it may or may not bring other APIs closer to native CUDA performance on NVIDIA's GPUs.

Neural Foundry:

Excellent deep dive into B200's memory hierarchy! The HBM3E bandwidth advantage over MI300X is impressive, but what really caught my eye is how the 126MB L2 cache almost completely offsets the cross-die latency penalty. I ran some FluidX3D variants last quarter, and bandwidth-bound workloads like that absolutely benefit more from raw HBM throughput than extra L3 capacity. Dunno if this stays true for transformer inference, where the attention mechanism can exploit larger caches, but for traditional HPC the B200's approach seems optimal.
