At Supercomputing 2018, Inspur displayed its newest supercomputer building block. The Inspur AGX-5 is built on the NVIDIA HGX-2 platform but offers a number of improvements. One immediately noticeable improvement is a shrink of the 10kW system to an 8U form factor versus the NVIDIA DGX-2’s 10U form factor. Inspur says that it makes 60% of the AI servers for the Chinese market and is the top server provider to hyper-scale clients like Baidu, Tencent, and Alibaba. In this article, we are going to talk about the Inspur AGX-5 deep learning solution and recap our discussion with the company.
Inspur AGX-5 Supercomputer in an 8U Box
The Inspur AGX-5 is the company’s design that stems from the NVIDIA HGX-2 architecture. Some of the HGX-2 designs we have seen follow the NVIDIA template, but the Inspur design is in many ways radically different. One still gets 16x NVIDIA Tesla V100 32GB GPUs, dual Intel Xeon Scalable CPUs, and the NVSwitch fabric.
The Inspur AGX-5 supports current generation NVIDIA Tesla V100 packages as well as SXM3 for NVIDIA Volta-Next. That is code for NVIDIA’s next-generation GPU.
The server supports Intel Xeon Scalable CPUs connected by 3x UPI links. Customers can select how many GPUs they want. Each CPU has a full set of 12x DDR4 DIMM slots, for 24x DIMMs total. That means that with the upcoming Intel Xeon “Cascade Lake-SP” generation, this solution can support Intel Optane Persistent Memory DIMMs that will range from 128GB to 512GB per DIMM in its first generation.
Inspur supports SATA and NVMe storage (2.5″ drives and dual M.2) and utilizes integrated controllers for up to 4x 10GbE base networking. The server also supports a full set of Mellanox EDR/100GbE NICs or other fabrics via its PCIe slots.
Our Discussion with Inspur at SC18
We had an opportunity to talk with the company at SC18 about the Inspur AGX-5 as well as some more general market trends.
One of the key questions we asked was “what makes the AGX-5 different from the NVIDIA DGX-2?”
Inspur noted a few features that it believes make its design better than the NVIDIA DGX-2. The first was obvious: it is an 8U design instead of the DGX-2’s 10U design. That means, if you can handle 50kW+ racks, you can have five to a rack instead of four, increasing density.
Beyond this, Inspur noted that it offers full power supply redundancy using eight 3kW PSUs.
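The density and power claims above can be checked with some quick arithmetic. The sketch below is only an illustration: the 40U of usable rack space and the 50kW rack budget are our assumptions, and reading “fully redundant” as meaning half of the PSUs can fail is also an assumption rather than something Inspur specified.

```python
# Figures quoted in the article
SYSTEM_POWER_KW = 10   # approximate power draw per system
AGX5_HEIGHT_U = 8      # Inspur AGX-5 chassis height
DGX2_HEIGHT_U = 10     # NVIDIA DGX-2 chassis height
PSU_COUNT = 8          # PSUs in the AGX-5
PSU_RATING_KW = 3      # rating of each PSU

# Assumptions for illustration only (not from Inspur)
RACK_USABLE_U = 40         # hypothetical usable rack units
RACK_POWER_BUDGET_KW = 50  # a "50kW+" rack


def systems_per_rack(height_u, usable_u=RACK_USABLE_U,
                     power_kw=RACK_POWER_BUDGET_KW):
    """Systems that fit in a rack, limited by both space and power."""
    by_space = usable_u // height_u
    by_power = power_kw // SYSTEM_POWER_KW
    return min(by_space, by_power)


def psu_capacity_after_failures(failed):
    """Remaining PSU capacity in kW after a number of PSUs fail."""
    return (PSU_COUNT - failed) * PSU_RATING_KW


if __name__ == "__main__":
    print("AGX-5 per rack:", systems_per_rack(AGX5_HEIGHT_U))
    print("DGX-2 per rack:", systems_per_rack(DGX2_HEIGHT_U))
    # Assumed N+N reading: half of the PSUs are lost
    remaining = psu_capacity_after_failures(PSU_COUNT // 2)
    print("Capacity with half the PSUs failed (kW):", remaining)
    print("Still covers the system:", remaining >= SYSTEM_POWER_KW)
```

Under these assumptions, space and power both cap the AGX-5 at five systems per rack, while the DGX-2 is capped by space at four. With half of the PSUs offline, the remaining four still provide more than the roughly 10kW the system draws.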
Aside from the hardware basics, Inspur sees a few other advantages. It can configure these systems however a customer desires, whereas the DGX-2 is a set configuration. Inspur can add less or more RAM, and it can install different CPU SKUs and different storage configurations. The company can even introduce liquid cooling into its design for companies looking to reduce TCO using more exotic cooling methods. Inspur has used its joint design manufacturing (JDM) model to engage hyper-scale clients and sees this as an effective differentiator for selling the AGX-5 to its clients.
Inspur also has the supply chain to sell larger clusters of these systems for those customers who want more than a few. DGX-2 availability has lagged demand, so the company sees itself as a major supplier. Inspur claims it is the largest AI training system vendor in China, so it has the experience to deliver large numbers of systems to the world’s hyper-scalers.
We also asked Inspur where it believes the AGX-5 fits into its portfolio. The company has the AGX-2, a 2U 8x Tesla V100 design, as well as several PCIe solutions available. Here the company said that there are still use cases for the PCIe solutions, but it sees the Inspur AGX-5 as the model deployed for the highest-performing training clusters. The company was also quick to note that it provides more than NVIDIA solutions and also has other accelerators such as FPGAs available.
Another theme of the Inspur AGX-5 story was how the server fits into the company’s broader portfolio. Inspur also uses its JDM model to deliver software and management solutions for its customers. If a customer wants to go from bare hardware to an installed and running HPC cluster, Inspur can deliver a fully integrated solution.
These are certainly interesting machines. A key theme we learned is that customers are deploying the DGX-2 more as a proof of concept solution. The trend seems to be that after the POC is completed, companies that want to scale out their POC will turn to solutions like the Inspur AGX-5 to get a customized and higher-volume solution for their customers.