One of the coolest cards we saw at the NVIDIA GTC 2020 keynote was not a pure GPU. Instead, it is the NVIDIA EGX A100, a marriage of a GPU and a NIC on a single PCIe card. The card features an NVIDIA A100 Ampere-based GPU package along with a Mellanox ConnectX-6 Dx NIC. That means one can get 200Gbps of networking plus a GPU on a single card.
NVIDIA EGX A100
The NVIDIA EGX A100 is a product NVIDIA needed to show at this GTC. With the recent Mellanox acquisition, NVIDIA needed to show it has a vision for combining fabrics and GPUs. That is exactly what we see with the EGX A100.
One can see that this is a traditional PCIe Gen4 card. One will quickly notice that there is a rounded edge, likely for a cooling solution. There are also two QSFP28 100Gbps ports. Using ConnectX-6 Dx VPI IP, the company gets various networking and security offloads. One can also use the card to connect to either InfiniBand or Ethernet fabrics.
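As a rough sanity check on the networking figures above, here is a minimal back-of-the-envelope sketch in Python. The port count and per-port speed come from the card as described. The x16 link width is our assumption, since NVIDIA did not state the host interface width, and the function names are purely illustrative.

```python
# Back-of-the-envelope bandwidth check using nominal line rates only.
PORTS = 2             # two QSFP28 ports, as described above
GBPS_PER_PORT = 100   # 100Gbps per port, as described above

# Assumption (not stated in the announcement): a PCIe Gen4 x16 host link.
PCIE_GEN4_GT_PER_LANE = 16   # PCIe 4.0 signals at 16 GT/s per lane
PCIE_ENCODING = 128 / 130    # 128b/130b line encoding overhead
PCIE_LANES = 16              # assumed link width


def network_gbps(ports=PORTS, per_port=GBPS_PER_PORT):
    """Aggregate nominal network bandwidth in Gbps."""
    return ports * per_port


def pcie_gbps(lanes=PCIE_LANES):
    """Nominal one-direction PCIe Gen4 bandwidth in Gbps, before protocol overhead."""
    return PCIE_GEN4_GT_PER_LANE * PCIE_ENCODING * lanes


def to_gigabytes_per_s(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


if __name__ == "__main__":
    net = network_gbps()
    pcie = pcie_gbps()
    print(f"Network:  {net} Gbps ({to_gigabytes_per_s(net):.1f} GB/s)")
    print(f"PCIe x16: {pcie:.1f} Gbps ({to_gigabytes_per_s(pcie):.1f} GB/s)")
```

Under that assumption, the two ports together come close to what a Gen4 x16 host link can nominally carry in one direction, which helps explain why moving data straight from the NIC to the onboard GPU is attractive.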
NVIDIA is already touting a large ecosystem for its EGX platform. In some ways, this is what Mellanox and NVIDIA were trying to accomplish with existing products. This is a new level of integration, so, hopefully, we will see new classes of solutions arise from this type of device.
The impact of the NVIDIA EGX A100 is not about saving a PCIe slot. Instead, it is NVIDIA moving in the direction of CPU offload. The vision of the SmartNIC capabilities is that the EGX A100 can be connected to the network via InfiniBand or Ethernet. Another option is that one can use InfiniBand for GPU-to-GPU communication and Ethernet to get data from NVMe-oF storage. That data can then be securely moved to the onboard NVIDIA A100 GPU. The GPU can do whatever processing it needs, then send data back out over the network without host CPU intervention.
If one looks at what NVIDIA is doing with this product, it is essentially the first step in disaggregating GPU compute from x86-based CPU servers. While these cards are still likely to be used in PCIe slots in standard servers, the EGX A100 gives NVIDIA an opportunity to show real bypass of the host system. As we discussed over a year ago, when NVIDIA moved to purchase Mellanox, in "NVIDIA to Acquire Mellanox a Potential Prelude to Servers", the next step is a BlueField version of this device.
Assuming NVIDIA is relentless in moving to this model, the potential cost savings are enormous. NVIDIA is already working on full pipeline offload for large application areas such as Apache Spark 3.0. The next step is adding network-attached GPUs to existing clusters to greatly speed up workloads without adding new x86 servers with CPUs from competitors such as AMD or Intel.
While the Ampere-generation A100 is a big deal, the EGX A100 may prove to be the most impactful announcement when we look back five years from now.
Note: We got late word that this is now the NVIDIA A100 without the “Tesla” branding.