NVIDIA DGX-2 Details at Hot Chips 30

NVIDIA DGX 2 System Packaging

At Hot Chips 30, NVIDIA gave out more details on its largest system for deep learning and AI training. The NVIDIA DGX-2 scales up to 16x NVIDIA Tesla V100 32GB GPUs for up to 512GB of HBM2 memory. It also uses the NVSwitch fabric to provide high-speed access between the GPUs' HBM2 memory. That high-speed interconnect reduces the cost of a “remote” (in the same chassis) memory lookup. Here are some of the details that NVIDIA provided.
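
The capacity figure above is simple arithmetic, and a short sketch makes it explicit. The function names here are illustrative, not anything NVIDIA ships. The second helper shows why the interconnect matters: from any one GPU's point of view, most of the pooled HBM2 lives on other GPUs.

```python
# Figures taken from the article: up to 16 Tesla V100 GPUs, 32GB of HBM2 each.
GPUS_PER_SYSTEM = 16
HBM2_PER_GPU_GB = 32


def aggregate_hbm2_gb(gpus: int = GPUS_PER_SYSTEM,
                      per_gpu_gb: int = HBM2_PER_GPU_GB) -> int:
    """Total HBM2 capacity across all GPUs in the chassis."""
    return gpus * per_gpu_gb


def remote_fraction(gpus: int = GPUS_PER_SYSTEM) -> float:
    """Share of the pooled HBM2 that is 'remote' to any single GPU."""
    return (gpus - 1) / gpus


print(aggregate_hbm2_gb())   # 512
print(remote_fraction())     # 0.9375
```

With 16 GPUs, 15/16 of the pooled memory sits behind the fabric from any given GPU, which is why the cost of a remote lookup dominates.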

NVIDIA DGX-2 with NVLink

Here are the key speeds and feeds of the NVIDIA DGX-2. We covered many of these in our launch piece, “NVIDIA DGX-2 Pushing Limits at 16 GPUs and 512GB of HBM2 RAM.”

NVIDIA DGX 2 Speeds And Feeds

Each NVIDIA DGX-2 has two GPU boards that use NVSwitches to create a fabric for fast GPU-to-GPU communication.

NVIDIA NVSwitch In DGX 2 Fabric

The NVIDIA DGX-2 also needs to connect the GPUs back to the CPUs and other PCIe devices such as storage and NICs. In the DGX-2, each NIC has a clear path to a pair of GPUs. This allows high-bandwidth GPUDirect RDMA.

NVIDIA DGX 2 PCIe Network Diagram

The DGX-2 has two GPU baseboards connected via passive PCBs. To conserve space and power, there are no buffers on the PCBs.

NVIDIA DGX 2 Baseboard Complex

PCIe is piped in via the midplane. The system also switches from 12V power distribution to 48V.

NVIDIA DGX 2 System Packaging
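
A rough back-of-the-envelope sketch shows why a move from 12V to 48V distribution is attractive at this power level. It uses the 10kW figure the article quotes for the system and assumes, as a simplification, that all of that power flows through a single distribution bus with a fixed resistance. The function names are illustrative.

```python
def bus_current_amps(power_w: float, bus_voltage_v: float) -> float:
    """Current needed to deliver power_w at bus_voltage_v (I = P / V)."""
    return power_w / bus_voltage_v


def relative_conduction_loss(old_voltage_v: float, new_voltage_v: float) -> float:
    """Ratio of I^2 * R loss at the new voltage to the old one.

    Assumes the same delivered power and the same conductor resistance,
    so the loss scales with the square of the current.
    """
    return (old_voltage_v / new_voltage_v) ** 2


SYSTEM_POWER_W = 10_000  # from the article; treating it as one bus is an assumption

for volts in (12, 48):
    amps = bus_current_amps(SYSTEM_POWER_W, volts)
    print(f"{volts}V bus: {amps:.0f} A")

print(relative_conduction_loss(12, 48))
```

Under these assumptions, quadrupling the voltage cuts the current to a quarter and the resistive loss in the same conductors to one sixteenth.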

A 10kW air-cooled system requires heavy-duty fans. Four 92mm fans cool each GPU baseboard, while 60mm front fans and extra internal fans cool the system.

NVIDIA DGX 2 System Cooling

The NVIDIA DGX-2 utilizes custom connectors and traces. Rather than using standard PCIe cards, the system uses a different package than the DGX-1 and groups Rx and Tx I/O.

NVIDIA DGX 2 Signal Integrity

Overall, a lot of systems engineering went into the DGX-2, and for good reason. The DGX-2 has performance benefits over two DGX-1 systems because it is a larger scale-up solution.

NVIDIA DGX-2 versus DGX-1

NVIDIA also shared some cherry-picked numbers on performance. The company admitted that these were best cases to show off the benefits of the DGX-2 over the DGX-1.

The first example is bisection bandwidth for GPU-to-GPU communication. You may notice a small kink at 8 to 9 GPUs. This is where requests start eating into the responses, but the kink is relatively muted.

NVIDIA DGX 2 Bisection Memory Bandwidth
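
For a sense of the ceiling on a chart like this, here is a minimal sketch of the theoretical peak bisection bandwidth. The per-GPU link figures are not from this article: it assumes the publicly quoted Tesla V100 NVLink numbers (6 links per GPU, 25GB/s per link per direction) and a non-blocking fabric. Measured results will be lower.

```python
# Assumptions (not stated in the article): public Tesla V100 NVLink figures.
NVLINKS_PER_GPU = 6
GB_S_PER_LINK_PER_DIRECTION = 25


def peak_bisection_bandwidth_gb_s(num_gpus: int) -> int:
    """Theoretical peak bidirectional bandwidth across an even split of GPUs.

    Splits the GPUs into two equal halves and assumes every GPU in one half
    drives all of its NVLinks, in both directions, toward the other half
    through a non-blocking fabric.
    """
    if num_gpus < 2 or num_gpus % 2 != 0:
        raise ValueError("num_gpus must be an even number of at least 2")
    gpus_per_half = num_gpus // 2
    per_gpu_bidirectional = NVLINKS_PER_GPU * GB_S_PER_LINK_PER_DIRECTION * 2
    return gpus_per_half * per_gpu_bidirectional


print(peak_bisection_bandwidth_gb_s(16))  # 2400
```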

NVIDIA’s cuFFT example compared half of a DGX-2 (eight GPUs) to a DGX-1. This is a bit of a best case since the cost of messaging between GPUs is high compared to the GPU compute required to solve the problems.

NVIDIA DGX 2 V DGX 1 CUFFT

NVIDIA compared two DGX-1 servers connected via InfiniBand against a single DGX-2 in all-reduce. The advantage was 3x at larger message sizes, while smaller message sizes offered an 8x advantage. This is another cherry-picked example, since the two DGX-1 servers have to push a lot of data over the InfiniBand fabric.

NVIDIA DGX 2 V DGX 1 All Reduce
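
For readers unfamiliar with the operation being benchmarked, all-reduce leaves every GPU holding the element-wise sum of all GPUs' buffers. The sketch below simulates one common way to do it, a ring all-reduce, in plain Python. It illustrates the communication pattern only. It is not the implementation behind NVIDIA's numbers, and `ring_all_reduce` is a hypothetical name.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over one equal-length list per GPU."""
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers) or length % n != 0:
        raise ValueError("buffers must share a length divisible by the GPU count")
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    def exchange(chunk_for_rank, combine):
        # Snapshot every outgoing payload first so all sends in a step are
        # simultaneous, then deliver each one to the next rank in the ring.
        sends = []
        for r in range(n):
            c = chunk_for_rank(r)
            sends.append((r, c, [data[r][i] for i in span(c)]))
        for r, c, payload in sends:
            dst = (r + 1) % n
            for i, value in zip(span(c), payload):
                data[dst][i] = combine(data[dst][i], value)

    # Phase 1, reduce-scatter: after n - 1 steps, rank r holds the complete
    # sum for chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(lambda r: (r - step) % n, lambda mine, theirs: mine + theirs)

    # Phase 2, all-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        exchange(lambda r: (r + 1 - step) % n, lambda mine, theirs: theirs)

    return data


result = ring_all_reduce([[1, 2, 3, 4], [10, 20, 30, 40]])
print(result)  # [[11, 22, 33, 44], [11, 22, 33, 44]]
```

Each of the 2 * (n - 1) steps is a separate message between neighbors, so for small buffers the per-message latency of the link dominates the total time.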

Other highlighted benchmarks show a 2x to 2.7x speed-up moving from two DGX-1 class systems to a DGX-2 system.

NVIDIA DGX 2 V DGX 1 2x Speedup

Final Words

There are always problems where scale-up servers outperform scale-out. Many machine learning and AI problems are constrained by memory bandwidth and capacity. Being able to scale up problem sizes means that researchers can work on bigger datasets without having to go back to main memory or over a network, both of which are slow by comparison.

Patrick has been running STH since 2009 and covers a wide variety of SME, SMB, and SOHO IT topics. Patrick is a consultant in the technology industry and has worked with numerous large hardware and storage vendors in Silicon Valley. The goal of STH is simply to help users find some information about server, storage, and networking building blocks. If you have any helpful information, please feel free to post on the forums.

1 COMMENT

  1. Very interesting package. When this dense integration is upgraded to Turing (1/2 FP), it should set a unit volume, compute record. I would like to get one to perform sub meter resolution planetary geophysics. I wonder if the machine is so noisy when running a large work load that it needs to be acoustically isolated.
