While the NVIDIA DGX-2 has been king of the deep learning / AI training world since its launch (see NVIDIA DGX-2 March 2018 launch), NVIDIA’s partners are catching up with their HGX-2 designs. Now, NVIDIA is upping its top-level offering with the NVIDIA DGX-2H. For this, NVIDIA is essentially adding faster processors and raising the thermal limit on the Tesla V100 GPUs even further.
The NVIDIA DGX-2H builds upon the DGX-2 platform to offer even more performance to NVIDIA’s customers as its OEM partners ramp up HGX-2 sales. Since the DGX-2H shares most of the same topology as the DGX-2, we can point you to our NVIDIA DGX-2 at Hot Chips 30 piece for more information on the basic platform.
NVIDIA DGX-2H Specs
While the NVIDIA DGX-2H is an evolution compared to the DGX-2, there are a few interesting morsels in their data sheets. Here is the side-by-side of the official data sheet PDFs:
It appears as though the major differences are:
- Intel Xeon Platinum 8174 v. Intel Xeon Platinum 8168
- 12kW Maximum Power Consumption v. 10kW
- Primary networking is dual 10/25/40/50/100GbE instead of dual 10/25GbE
- Weight has gone up 20lbs to 360lbs
- GPUs run at 450W instead of 350W TDP
- Maximum operating temperature decreases from 35C to 25C
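As a rough sanity check on the data sheet figures above, the sketch below estimates how much of the 2kW increase in maximum system power the GPU TDP change alone would account for. It assumes 16 Tesla V100 GPUs per system (the DGX-2 configuration) and treats TDP as a stand-in for peak draw, which is a simplification; the function and variable names are illustrative.

```python
# Rough power-budget check using the data sheet figures listed above.
# Assumption: 16 Tesla V100 GPUs per system, as in the DGX-2.
GPU_COUNT = 16


def gpu_power_kw(tdp_watts, gpu_count=GPU_COUNT):
    """Total GPU TDP for the whole system, in kW."""
    return tdp_watts * gpu_count / 1000


dgx2_gpu_kw = gpu_power_kw(350)   # DGX-2: 350W per GPU
dgx2h_gpu_kw = gpu_power_kw(450)  # DGX-2H: 450W per GPU

gpu_delta_kw = dgx2h_gpu_kw - dgx2_gpu_kw
system_delta_kw = 12 - 10  # maximum power consumption, DGX-2H vs DGX-2

# Whatever the GPUs do not explain is left for CPUs, networking, and cooling.
remainder_kw = system_delta_kw - gpu_delta_kw

print(f"GPU TDP total: {dgx2_gpu_kw:.1f} kW -> {dgx2h_gpu_kw:.1f} kW")
print(f"GPU share of the 2 kW increase: {gpu_delta_kw:.1f} kW")
print(f"Left for everything else: {remainder_kw:.1f} kW")
```

Under these assumptions, the GPUs account for most of the added power budget, with the remainder available to the faster CPUs, the upgraded networking, and additional cooling.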
What is extremely strange is that the quoted performance has not moved despite the higher-frequency CPUs and the extra 100W of TDP on each NVIDIA Tesla V100 module. We have reached out to NVIDIA and will update this piece with a response on why this is still considered a 2PF machine despite the CPU and GPU updates.
[Update 19 November 2018 at 9:30 AM Pacific] We reached out to NVIDIA regarding the 2 petaflop number. NVIDIA said that it should be 2.1 petaflops and will be updated accordingly.
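To put the 2 petaflop and 2.1 petaflop figures in perspective, here is a small sketch that divides each quoted peak across the GPUs and compares it to the GPU TDP. It assumes 16 GPUs per system and uses the data sheet TDP values as a proxy for power, so the per-watt numbers are only indicative; the dictionary layout and helper names are made up for illustration.

```python
# Per-GPU share of the quoted peak figure, and peak throughput per GPU watt.
# Assumption: 16 Tesla V100 GPUs per system; TDP stands in for power draw.
GPU_COUNT = 16

systems = {
    "DGX-2": {"petaflops": 2.0, "gpu_tdp_w": 350},
    "DGX-2H": {"petaflops": 2.1, "gpu_tdp_w": 450},
}


def per_gpu_tflops(petaflops, gpu_count=GPU_COUNT):
    """Quoted system peak divided evenly across the GPUs, in TFLOPS."""
    return petaflops * 1000 / gpu_count


def tflops_per_gpu_watt(petaflops, gpu_tdp_w, gpu_count=GPU_COUNT):
    """Quoted system peak per watt of total GPU TDP."""
    return petaflops * 1000 / (gpu_tdp_w * gpu_count)


for name, spec in systems.items():
    share = per_gpu_tflops(spec["petaflops"])
    efficiency = tflops_per_gpu_watt(spec["petaflops"], spec["gpu_tdp_w"])
    print(f"{name}: {share:.2f} TFLOPS per GPU, {efficiency:.3f} TFLOPS per GPU watt")
```

On these numbers, the quoted peak rises by about 5% while the GPU TDP rises by roughly 29%, so the peak throughput per GPU watt goes down rather than up.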
This is a big deal. You may have seen AMD Radeon MI60 numbers that were compared to a 250W PCIe Tesla V100. Most DGX-1 class offerings run the SXM2 NVIDIA Tesla V100s at 300W. The DGX-2 ran these V100s at 350W TDP. Now the NVIDIA DGX-2H ups this to 450W TDP.
It was not long ago that accelerators had TDPs in the 225W-300W range. Now, we are seeing 450W components. Extra TDP usually yields better performance, so we are not sure why the performance figure has not moved. At the same time, 450W is a sign of things to come. In a 10U chassis that is rated to consume 12kW, air cooling can be an option. For dense HPC deployments, cooling something like this will require liquid cooling. We are not sure if the extra cooling is the reason behind the 20lb weight gain, but the Intel Xeon Platinum SKU changes should weigh the same and the dual port networking changes should cause negligible weight differences.
I have a feeling that AMD will beat this configuration with EPYC2+VEGA20 in the same 10U form factor at 12 kW.
For a price of around $300k incl. 4 TB of memory, 15 Mellanox ConnectX-6 and a Mellanox QM8700 switch.
With Tensor operations the DGX-2H might still be around 15% faster; with FP16, FP32 and FP64 the AMD system will be around 70% faster.