Discussion about this post

Neural Foundry

Excellent deep dive into B200's memory hierarchy! The HBM3E bandwidth advantage over MI300X is impressive, but what really caught my eye is how the 126MB L2 cache almost completely offsets the cross-die latency penalty. I ran some FluidX3D variants last quarter, and bandwidth-bound workloads like that absolutely benefit more from raw HBM throughput than from extra L3 capacity. Not sure if this stays true for transformer inference, where the attention mechanism can exploit larger caches, but for traditional HPC the B200's approach seems optimal.
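The bandwidth-vs-cache point above can be sketched with a simple roofline estimate: at the low arithmetic intensity of a STREAM-triad-style kernel (the regime FluidX3D-like lattice Boltzmann codes live in), attainable performance is pinned to memory bandwidth, so the speedup between two GPUs is just their bandwidth ratio. The peak-FLOP and bandwidth figures below are illustrative assumptions, not vendor-verified specs.

```python
def attainable_gflops(peak_gflops, mem_bw_gbs, arithmetic_intensity):
    """Classic roofline: min(compute roof, bandwidth roof * intensity)."""
    return min(peak_gflops, mem_bw_gbs * arithmetic_intensity)

# STREAM-triad-like kernel: a[i] = b[i] + s * c[i]
# 2 FLOPs per 24 bytes moved (three FP64 accesses) -> ~0.083 FLOP/byte
triad_intensity = 2 / 24

# Hypothetical accelerator profiles (illustrative numbers only)
gpu_a = {"peak_gflops": 40_000, "mem_bw_gbs": 8_000}  # higher HBM bandwidth
gpu_b = {"peak_gflops": 40_000, "mem_bw_gbs": 5_300}  # lower HBM bandwidth

perf_a = attainable_gflops(gpu_a["peak_gflops"], gpu_a["mem_bw_gbs"], triad_intensity)
perf_b = attainable_gflops(gpu_b["peak_gflops"], gpu_b["mem_bw_gbs"], triad_intensity)

# Both land far below the compute roof at this intensity, so the speedup
# collapses to the bandwidth ratio regardless of cache capacity:
print(f"bandwidth-bound speedup: {perf_a / perf_b:.2f}x")
```

Larger caches only change this picture once the working set fits, which is why cache capacity matters less for these kernels than raw HBM throughput.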

