Interesting building block. Though does it seem like 100Gb is lagging a bit?

I think the bigger question is: if you want something general and scalable, how do you structure it? 800Gb data-center Ethernet or comparable NVLink will do it, but that's pretty expensive. Lots of switching, lots of cabling. Of course, AI is not really cost-sensitive right now.

But am I just being a sentimental coot for remembering BlueGene, or even SiCortex? Systems where the network topology was fundamental. (That reminds me: a big part of BG was trying to achieve reliable systems at scale - something you don't seem to hear much about when the PR starts flying about how many bazillions of GPUs someone has. Has large-scale training figured out a trick that lets machines crash at some realistic rate?)