Gigabyte G481-S80 GPU-to-GPU Performance
With our system, we can do peer-to-peer GPU-to-GPU transfers over NVLink, which is different from the single root and dual root PCIe servers commonly used for deep learning. For our testing, we are using 8x NVIDIA Tesla P100 16GB SXM2 modules. Folks would probably want us to use Tesla V100 32GB modules, but STH has a tight budget for buying GPUs.
First off, we wanted to look at the 8x NVIDIA Tesla P100 16GB GPUs in the Gigabyte G481-S80 SXM2 GPU complex. This is important since we saw in the system topology that the GPUs are attached to different CPUs.
With NVLink, we have 64x PCIe uplink lanes to the server's CPUs while maintaining peer-to-peer connectivity. As we covered in our piece How Intel Xeon Changes Impacted Single Root Deep Learning Servers, a PCIe switch architecture with Intel Xeon Scalable is limited to 16x PCIe lanes attached to one CPU. The Gigabyte G481-S80's design is significantly more efficient.
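To put those lane counts in perspective, here is a quick back-of-the-envelope comparison of host uplink bandwidth. The roughly 0.985 GB/s of usable bandwidth per PCIe 3.0 lane is an approximation used purely for illustration:

```python
# Rough host-uplink bandwidth comparison (illustrative figures).
# PCIe 3.0 delivers roughly 0.985 GB/s of usable bandwidth per lane.
PCIE3_GBS_PER_LANE = 0.985

# Single root PCIe switch design: all GPUs funnel through one x16 uplink.
single_root_lanes = 16
# Gigabyte G481-S80: 64x PCIe uplink lanes to the CPUs.
g481_s80_lanes = 64

single_root_gbs = single_root_lanes * PCIE3_GBS_PER_LANE
g481_s80_gbs = g481_s80_lanes * PCIE3_GBS_PER_LANE

print(f"Single root uplink: ~{single_root_gbs:.1f} GB/s")
print(f"G481-S80 uplink:    ~{g481_s80_gbs:.1f} GB/s "
      f"({g481_s80_lanes // single_root_lanes}x)")
```

That 4x difference in host uplink bandwidth matters most when feeding training data from CPUs and storage to the GPU complex.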
NVIDIA Tesla P100 SXM2 / DGX-1 P2P Testing
Gigabyte G481-S80 P2P Bandwidth
Here is the bidirectional P2P bandwidth on the dual root PCIe server:
Here is that same server with P2P enabled. You can see a large uptick in performance.
Moving to the 10x GPU DeepLearning11 one can see the P2P disabled results:
Here are the P2P enabled results:
Here is the Gigabyte G481-S80 DeepLearning12 system with 8x NVIDIA Tesla P100 SXM2 GPUs and P2P disabled:
Here are the P2P enabled figures for DeepLearning12:
You can clearly see the single versus dual hop NVLink runs. The single hop runs hit nearly 37GB/s, compared to figures just over 25GB/s for the PCIe 3.0 root systems.
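For reference, P2P benchmarks like NVIDIA's p2pBandwidthLatencyTest CUDA sample derive these figures by timing large peer-to-peer copies and dividing bytes moved by elapsed time. A minimal sketch of that arithmetic, with a hypothetical transfer size and timing chosen to land near our single hop NVLink result:

```python
def bandwidth_gbs(bytes_moved: int, elapsed_s: float) -> float:
    """Effective bandwidth in GB/s for a timed transfer."""
    return bytes_moved / elapsed_s / 1e9

# Hypothetical example: a 256 MiB peer-to-peer copy completing in 7.25 ms
# works out to roughly the ~37 GB/s we measured on a single NVLink hop.
size = 256 * 1024 * 1024
elapsed = 0.00725
print(f"~{bandwidth_gbs(size, elapsed):.1f} GB/s")
```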
Gigabyte G481-S80 P2P Latency
Here are the DeepLearning12 latency figures between the Gigabyte G481-S80's eight GPUs:
Comparing these to the dual root server's P2P results, you can see how large a latency jump the PCIe system takes.
Even the single PCIe root server had a 2-3x increase in P2P latency over the DeepLearning12.
The key here is that the NVLink solution's latency is vastly improved over the PCIe solutions, and that is a major selling point over single root PCIe systems. Looking at this, you can clearly see why NVLink users tout GPU-to-GPU latency benefits.
Gigabyte G481-S80 with 8x NVIDIA Tesla P100 GPU Linpack Performance
One of the other advantages of a solution like this is double-precision compute performance. While much of the deep learning community is focused on lower precision, there are HPC applications, and indeed many deep learning applications, that still want the extra precision that double precision offers. Linpack is still the benchmark people use when discussing HPC performance. NVIDIA's desktop GPUs like the GTX and RTX series have atrocious double-precision performance as part of market de-featuring. We are instead using some HPC CPUs from Intel, AMD, and Cavium for comparison.
That is hugely impressive. Note that each of these results likely has more tuning headroom, but the improvement is immediately noticeable.
This is an area where the DeepLearning11 server with its 10x NVIDIA GeForce GTX 1080 Ti's cannot compete, as its double-precision compute performance is hamstrung. One can clearly see why Tesla GPUs are popular in the HPC world. While the deep learning crowd focuses on features like INT8 support and single or lower precision floating point performance, there is something to be said for deploying a box that can handle many types of compute. The NVIDIA Tesla P100 also features ECC memory, which can be important in large simulations.
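To see why the DeepLearning11 cannot compete here, compare nominal peak double-precision throughput. These figures are based on NVIDIA's published spec-sheet numbers (Tesla P100 SXM2 at 5.3 TFLOPS FP64; GTX 1080 Ti FP64 de-featured to 1/32 of its roughly 11.3 TFLOPS FP32 rate) and are approximations for illustration only:

```python
# Nominal peak FP64 throughput (TFLOPS), approximated from NVIDIA spec sheets.
P100_SXM2_FP64 = 5.3                      # Tesla P100 SXM2
GTX_1080TI_FP32 = 11.3                    # GeForce GTX 1080 Ti, FP32
GTX_1080TI_FP64 = GTX_1080TI_FP32 / 32    # FP64 de-featured to 1/32 rate

dl12_peak = 8 * P100_SXM2_FP64            # DeepLearning12: 8x Tesla P100
dl11_peak = 10 * GTX_1080TI_FP64          # DeepLearning11: 10x GTX 1080 Ti

print(f"DeepLearning12 FP64 peak: {dl12_peak:.1f} TFLOPS")
print(f"DeepLearning11 FP64 peak: {dl11_peak:.1f} TFLOPS")
print(f"Ratio: ~{dl12_peak / dl11_peak:.0f}x")
```

Even before any Linpack tuning, the eight Tesla P100s hold roughly an order of magnitude advantage in peak double-precision throughput over ten GTX 1080 Ti's.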