As supercomputers get larger and accelerators get faster, teams planning infrastructure need faster interconnects. While there are future products that are still in the planning phase that could displace some of the markets, like Gen-Z, one of the tangible technologies aside from HPE Cray Slingshot that we know will power exascale supercomputers, is Infiniband NVIDIA Mellanox NDR 400Gbps Infiniband. To that end, NVIDIA’s recent Mellanox acquisition is now yielding a NDR 400Gbps Infiniband line.
NVIDIA Mellanox NDR 400Gbps Infiniband
Mellanox drove Infiniband to effectively double performance every generation and that is happening again. As current systems are looking to HDR 200Gbps Infiniband, NDR 400Gbps Infiniband is the next stop now that the NVIDIA-Mellanox deal closed.
For the new NDR generation, we get a portfolio of products including adapters, DPUs, switches, and cables. NVIDIA has not announced ConnectX-7 but Mellanox has been following a pattern of implementing a new generation of Infiniband in one generation of adapter, then refining offloads in the next. On the adapter side, we need PCIe Gen5 x16 or two PCIe Gen4 x16 slots to drive ports in servers, so it may be some time until we see adapters since we do not expect PCIe Gen5 until after the early 2021 generations of server CPUs.
NVIDIA also is calling out DPUs with Infiniband which is exciting since combined with GPUs that could obviate the need for a server altogether with direct fabric attached GPUs as we discussed in our NVIDIA to Acquire Mellanox a Potential Prelude to Servers piece.
On the cable side, NVIDIA told us they are able to use copper DACs (likely thick ones) and reach 1.5m. As a result, switches will not require co-packaged optics.
As we would expect, the new NDR Infiniband provides more performance than the previous generation as we double bandwidth from 200Gbps to 400Gbps.
As network speeds increase, two things happen. First, offloading functions for communication become more important so ConnectX’s accelerators will be required. Second, we can push limits of disaggregation further. With NDR likely slated for PCIe Gen5, that means we could see it in the CXL 1.1/ CXL 2.0 era which will be an exciting time for servers and infrastructure since it is when we will start to see disaggregation.
Finally, while writing this, something came up that is interesting. NVIDIA is using NDR 400G Infiniband in its marketing materials. At what point can we just have 400G Infiniband and drop the “xDR” naming? It seems like we could do this already just as we say “400GbE”.