Meta (formerly Facebook) has a new AI training platform for the newest NVIDIA GPUs. Dubbed “Grand Teton,” the new platform is designed to house 8x NVIDIA H100 GPUs while greatly increasing interconnect bandwidth.
This is the Meta Grand Teton 8x NVIDIA H100 Machine
This is the side view of the massive Meta Grand Teton system with 8x NVIDIA H100 GPUs in a pull-out tray:
One can see a few common features. For example, Meta is using many OCP NIC 3.0 form factor network cards. There are at least ten that we could count on the front of the chassis. In this generation with PCIe Gen5, they can be 200GbE or 400GbE/400Gbps IB ports. The company also has an EDSFF (E1) storage array on the front of the chassis and another section for E1.S boot drives. If you want to learn more about EDSFF you can see E1 and E3 EDSFF to Take Over from M.2 and 2.5 in SSDs. This is a great example of M.2 boot devices being phased out in favor of E1.S.
Meta says that the new system offers significantly more bandwidth, and also simplifies deployment with a single box versus having a CPU head node, a switching box, and then the accelerators.
At the same time, power is up 2x over the previous generation, which is quite impressive. That is a bigger jump than the NVIDIA A100 to NVIDIA H100 jump, with the H100’s public specs at 700W versus 400W for the SXM A100. CPU TDP is going up in this generation, as is NIC TDP.
Meta uses AI throughout its business, and is a leader in the space, so seeking more performance makes a lot of sense here. This is an especially interesting announcement since Meta RSC Selected NVIDIA and Pure Storage for its AI Research Cluster in the NVIDIA A100 generation. Now, Meta has its own training platform. This does not seem like the company’s craziest platform, but it is nice to see some innovation happening in the space. For companies that are not Meta, this platform looks very similar to many of the Supermicro, QCT, and other platforms that we have seen for 8x NVIDIA H100 configurations.