Interesting building block. Though doesn't it seem like 100Gb is lagging a bit?
I think the bigger question is: if you want something general and scalable, how do you structure it? 800Gb datacenter Ethernet or comparable NVLink will do it, but that's pretty expensive: a lot of switching, a lot of cabling. Of course, AI is not really cost-sensitive right now.
But am I just being a sentimental coot for remembering BlueGene, or even SiCortex? Those were systems where the network topology was fundamental to the design. (That reminds me: a big part of BlueGene was trying to obtain reliable systems at scale - something you don't seem to hear much about when the PR starts flying about how many bazillions of GPUs someone has. Has large-scale training figured out a trick to tolerate machines crashing at some realistic rate?)