Inspur NF5488M5 Performance Testing
We wanted to show a few views of what makes this system different from other GPU compute servers we have tested.
Inspur NF5488M5 CPU Performance to Baseline
In the original draft of this piece, we had a deep-dive into CPU performance. Since we already have more in-depth CPU reviews, and CPU performance is not the focus of this system, we are instead just going to present our baseline Platinum 8276 performance versus the same CPUs' performance in the Inspur NF5488M5.
As you can see, we generally stayed very close to our primary testbed, which shows we are getting adequate cooling to the CPUs.
Inspur NF5488M5 P2P Testing
We wanted to take a look at what the peer-to-peer bandwidth looks like. For comparison, we have DeepLearning10, a dual root Xeon E5 server; DeepLearning11, a single root Xeon E5 server; and DeepLearning12, a Tesla P100 SXM2 server. If you want to compare some of these numbers to an 8x Tesla V100 32GB PCIe server, you can check out our Inspur Systems NF5468M5 review.
Inspur NF5488M5 P2P Bandwidth
Here is the unidirectional P2P bandwidth:
Here we can see the unidirectional P2P bandwidth is 143GB/s, versus about 9-18GB/s on the dual root PCIe server with Tesla V100s. The results are also more consistent across the GPUs, whereas the PCIe server showed a lot of variation depending on GPU placement.
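To put a number on that consistency, a short Python sketch can summarize the off-diagonal (GPU-to-GPU) entries of a bandwidth matrix like the one p2pBandwidthLatencyTest prints. The matrix here is illustrative, not our measured data:

```python
def p2p_spread(matrix):
    """Min, max, and spread of the off-diagonal (peer-to-peer)
    entries of a square bandwidth matrix in GB/s."""
    n = len(matrix)
    peers = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return min(peers), max(peers), max(peers) - min(peers)

# Illustrative 4-GPU matrix (not actual measurements): the diagonal is
# on-device bandwidth, everything else is a peer link.
example = [
    [780, 143, 142, 143],
    [143, 781, 143, 142],
    [142, 143, 779, 143],
    [143, 142, 143, 780],
]
print(p2p_spread(example))  # a tight spread indicates consistent links
```

A large spread here would be the signature of a PCIe topology where bandwidth depends on GPU placement; an NVSwitch fabric should keep it small.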
Looking at bidirectional bandwidth:
We again see 266GB/s of bandwidth between GPUs with very consistent results. That compares with about 18-37GB/s on a PCIe Gen3 switched server. You can also see the roughly 800GB/s figures here for the same GPU (e.g. 0,0), which was closer to 400GB/s in the Tesla P100 SXM2 generation.
Just for good measure, we also had the CUDA bandwidth test:
We wanted to show here that the on-device bandwidth is phenomenal, at around 800GB/s, as you can see in these P2P numbers.
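For context on that 800GB/s figure, an on-device copy both reads and writes device memory, so each byte crosses the memory bus twice; we believe this is the convention CUDA's bandwidthTest uses when reporting device-to-device numbers. A minimal sketch of the arithmetic:

```python
def dtod_bandwidth_gbps(bytes_copied: int, seconds: float) -> float:
    # A device-to-device copy reads and writes device memory, so each
    # byte moves twice -- the doubling we believe CUDA's bandwidthTest
    # applies when reporting device-to-device GB/s.
    return 2 * bytes_copied / seconds / 1e9

# Example: copying 1 GiB on-device in 10 ms works out to roughly 215 GB/s.
print(f"{dtod_bandwidth_gbps(2**30, 0.01):.1f} GB/s")
```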
Inspur NF5488M5 P2P Latency
Beyond raw bandwidth, we wanted to show Inspur Systems NF5488M5 GPU-to-GPU latency. Again, see links above for comparison points:
Here are the P2P enabled latency figures:
These figures are again very low and consistent.
While this is not intended to be an exact performance measurement, it is a tool you can quickly use on your deep learning servers to see how they compare.
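If you want to script those comparisons across several servers, a small parser for the tool's output helps. This sketch assumes the usual p2pBandwidthLatencyTest text layout (a title line, a `D\D` column-header row, then one row per device); the exact layout can vary by CUDA version, so treat it as a starting point:

```python
import re

def parse_matrix(text: str, title: str):
    """Extract the numeric matrix that follows `title` in
    p2pBandwidthLatencyTest-style output (layout assumed, see above)."""
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines) if title in line)
    rows = []
    for line in lines[start + 1:]:
        if "D\\D" in line:           # column-header row
            continue
        nums = re.findall(r"\d+(?:\.\d+)?", line)
        if not nums:                 # end of the matrix block
            break
        rows.append([float(n) for n in nums[1:]])  # drop the row index
    return rows
```

With the matrices in hand, you can diff bandwidth and latency across machines rather than eyeballing console output.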
Raw Deep Learning / AI Performance Increase over PCIe
We had data from the Inspur Systems NF5468M5 that we reviewed, so we ran some of the same containers on this system to see if we indeed saw a direct speedup in performance.
Of course, the usual disclaimers apply: these are not highly optimized results, so you are seeing more of a real-world, out-of-box speedup across a few companies whose workloads we help test on different machines as part of DemoEval. Realistically, if one uses newer frameworks and optimizes for the system, better results are obtainable. The above chart took almost two weeks to generate, so we did not get to iterate on optimizations since we had a single system for a limited time.
This is one where we are just going to say that our testing confirmed what one would expect: faster GPUs and faster interconnects yield better performance, with the degree depending on the application.
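As a rough sanity check on that "degree depends on the application" point, the interconnect figures quoted earlier put an upper bound on the interconnect-side speedup. Real workloads land well below these ratios because they are not purely interconnect-bound:

```python
# Figures quoted earlier in this piece (GB/s)
nvlink_uni, pcie_uni = 143, 18   # unidirectional: NF5488M5 vs. upper end of the PCIe server's range
nvlink_bi, pcie_bi = 266, 37     # bidirectional

print(f"Unidirectional interconnect ratio: {nvlink_uni / pcie_uni:.1f}x")
print(f"Bidirectional interconnect ratio: {nvlink_bi / pcie_bi:.1f}x")
```

Both ratios come out between 7x and 8x; an application that spends only a fraction of its time on GPU-to-GPU transfers will see correspondingly less of that.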