Thanks for the impressive testing, as always. I have an article idea/request that seems like it would be well within your expertise - looking at the new CUDA tile programming model and how it may or may not bring other APIs closer to native CUDA performance on NVIDIA's GPUs.
Excellent deep dive into B200's memory hierarchy! The HBM3E bandwidth advantage over MI300X is impressive, but what really caught my eye is how the 126MB L2 cache almost completely offsets the cross-die latency penalty. I ran some FluidX3D variants last quarter, and bandwidth-bound workloads like that absolutely benefit more from raw HBM throughput than extra L3 capacity. Not sure if this stays true for transformer inference, where the attention mechanism can exploit larger caches, but for traditional HPC the B200's approach seems optimal.
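For anyone wondering what "bandwidth-bound" means in practice, here's a minimal sketch (a hypothetical streaming kernel, not FluidX3D's actual code): each element is touched exactly once, so there's essentially no reuse for a big L2 or L3 to capture, and sustained throughput ends up tracking HBM bandwidth.

```cuda
// Hypothetical streaming kernel: one coalesced load and one coalesced store
// per element, no temporal reuse -- so caches barely help and the achieved
// throughput is set almost entirely by HBM bandwidth.
__global__ void stream_update(const float* __restrict__ in,
                              float* __restrict__ out,
                              float scale, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = scale * in[i];
    }
}

// Launch sketch (assumed buffer names d_in/d_out):
//   stream_update<<<(n + 255) / 256, 256>>>(d_in, d_out, 1.5f, n);
// Effective bandwidth ~= 2 * n * sizeof(float) / elapsed_time, which for a
// kernel like this approaches the HBM limit rather than cache bandwidth.
```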
"Nvidia’s conservative hardware" This is a f***ing 1600mm² die, it's not conservative. lol
I meant it's a conservative multi-die setup compared to using two or four base dies and stacking eight compute dies on top.