Putting it Together
Now we have a few key components. Similar daughterboards are used by Supermicro and other vendors for their Intel Xeon Scalable Skylake-SP 10x GPU systems. We know that the Intel Xeon E5 V3/V4 generation was advertised as single root while the Skylake-SP version is not. We showed the differences in how the Intel IIO controllers work in each generation. The final step is putting it all together in a diagram.
Here we can see how you would connect a 10x GPU daughterboard to a single Intel Xeon CPU to avoid the QPI/UPI socket-to-socket interconnect. We circled two PCIe x16 controllers on each CPU diagram. On the Skylake side (left), there are three controllers available; we picked two of them, but the third could just as easily have been used.
The implication of this diagram is that the arrows terminate at the same PCIe x16 lanes that feed each of the daughterboard's PCIe switches. What changes with CPU generation is where the arrows originate: from a single IIO block on the E5 V3/V4 parts, but from two different IIO blocks on Skylake-SP. That difference is the reason that single root 8x and 10x GPU servers are much less common in the Intel Xeon Skylake-SP generation.
One can still make a single root server, but it would have a few drawbacks. It would need an extra PCIe switch of at least 48 lanes. It would also have a maximum of 16 PCIe 3.0 lanes upstream for all 10 GPUs, a 10:1 oversubscription ratio, versus the 32 PCIe lanes and 5:1 ratio of today's designs. Since each GPU is a x16 device, 10 GPUs present 160 lanes below the switches, so the ratio is simply 160 divided by the upstream lane count.
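For those who want to play with the numbers, here is a trivial sketch of that arithmetic. The lane counts are the ones discussed above; nothing here queries real hardware, and the file name and build line are our own:

```cpp
// oversub.cpp - back-of-the-envelope PCIe uplink oversubscription
// for a 10x GPU daughterboard. Build with: g++ -o oversub oversub.cpp
#include <cstdio>

int main() {
    const int gpus = 10;
    const int lanes_per_gpu = 16;                 // each GPU is a PCIe 3.0 x16 device
    const int downstream = gpus * lanes_per_gpu;  // 160 lanes below the switches
    for (int uplink : {16, 32}) {                 // single x16 uplink vs. dual x16 uplinks
        printf("%2d uplink lanes -> %d:1 oversubscription\n",
               uplink, downstream / uplink);
    }
    return 0;
}
```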
Why AMD EPYC Naples Cannot Fill This Gap
For the record, we think that the AMD EPYC “Naples” platform is excellent in many applications, but it is simply not as good for NVIDIA GPU-to-GPU computing. You can read AMD EPYC and Intel Xeon Scalable Architecture Ultimate Deep Dive to get a sense of the differences between AMD and Intel in this generation.
With the mechanics understood, we looked back through the materials we had and could not find the exact view we wanted. However, we can make do with what we have. The AMD EPYC “Naples” chips have four dies per package. Each die has on-package Infinity Fabric (die-to-die) I/O. For external I/O, there are two 16-lane SERDES links per die, each with two primary operating modes: socket-to-socket Infinity Fabric or PCIe.
In a dual socket Naples system, one set is dedicated to the die-to-die links across the two sockets. In a single socket AMD EPYC system, one can use all 32 lanes per die for PCIe. At this point, you may be thinking this is the answer to getting 32 lanes to connect to two PLX switches and 10 GPUs for a single root server. Unfortunately, this is not the case.
With Infinity Fabric in this generation, each die has both a top (G) and a bottom (P) 16-lane SERDES link. One of them swaps from PCIe to socket-to-socket Infinity Fabric in dual socket configurations. Therefore, there are two different controller locations instead of a monolithic PCIe block as on the Intel Xeon E5-2600 V3/V4 series. One gets back to the Skylake-SP issue of having two PCIe 3.0 x16 controllers on the same die, but not in the same PCIe root.
While one may think that this puts AMD and Intel on equal footing, it does not. The deep learning community does just about anything it can to avoid NUMA transfers. With AMD EPYC, the die that a PCIe switch or switches connect to only has two DDR4 DRAM channels. Essentially, building a GPU compute architecture with PCIe switches on Naples means that 75% of the system's DDR4 channels are a NUMA hop away. Likewise, although one can get 128 lanes on an AMD EPYC system, that involves connecting the GPUs across NUMA nodes, which the deep learning and AI communities avoid if at all possible.
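To make the NUMA point concrete, here is a minimal sketch (assuming a Linux box with the CUDA runtime installed; the file name and build line are our own illustration) that maps each visible GPU to the NUMA node its PCIe device reports through sysfs. On a Naples system built as described above, we would expect all of the switch-attached GPUs to report the same node, leaving the memory on the other three dies a hop away:

```cpp
// gpu_numa.cu - print the NUMA node behind each visible GPU's PCIe device.
// Build (illustrative): nvcc -o gpu_numa gpu_numa.cu
#include <cuda_runtime.h>
#include <algorithm>
#include <cctype>
#include <cstdio>
#include <fstream>
#include <string>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        char busId[32] = {0};
        cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);  // e.g. "0000:3B:00.0"
        std::string bdf(busId);
        // sysfs names PCI devices with lowercase hex
        std::transform(bdf.begin(), bdf.end(), bdf.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        std::ifstream f("/sys/bus/pci/devices/" + bdf + "/numa_node");
        int node = -1;  // stays -1 if the kernel recorded no NUMA affinity
        f >> node;
        std::printf("GPU %d (%s) -> NUMA node %d\n", dev, busId, node);
    }
    return 0;
}
```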
The quest for single root is largely driven by NVIDIA's software considerations. If the companies were perfectly collaborative, with Intel not viewing NVIDIA as a threat and NVIDIA not trying to push its own margins higher with NVLink for P2P, this might be fixed. There is a chance that tools like NCCL could be adapted to work in scenarios with multiple PCIe roots on a single NUMA node/die, making both Intel Xeon Scalable and AMD EPYC more competitive. Business concerns may be influencing the technical requirements here.
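For reference, the capability NCCL and similar tools care about can be queried directly from the CUDA runtime. This sketch (again our own illustration, not vendor code) asks the driver which GPU pairs it is willing to route P2P traffic between; on PCIe-only machines this generally comes back negative across separate root complexes, which is the single root story in one API call:

```cpp
// p2p_matrix.cu - ask the CUDA driver which GPU pairs support P2P.
// Build (illustrative): nvcc -o p2p_matrix p2p_matrix.cu
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int a = 0; a < count; ++a) {
        for (int b = 0; b < count; ++b) {
            if (a == b) continue;
            int ok = 0;
            // Returns 1 only when the driver can route peer-to-peer
            // transfers between the pair, e.g. over a shared PCIe root
            // complex or an NVLink connection.
            cudaDeviceCanAccessPeer(&ok, a, b);
            std::printf("GPU %d -> GPU %d : P2P %s\n", a, b, ok ? "yes" : "no");
        }
    }
    return 0;
}
```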
With the new Intel Xeon Scalable generation, one can still do single root, but it requires a third PCIe switch and only has a PCIe x16 link back to the CPU through a single IIO module. Likewise, AMD EPYC Naples has a similar constraint. If you want a single root deep learning / AI training server, the Intel Xeon E5-2600 V3/V4 series platforms are a better bet, as they can provide 32 (or technically up to 40) PCIe lanes between the CPU and GPUs in a single root complex.
Stay tuned to STH: DeepLearning12 will be our first 8x GPU NVLink server based on Skylake-SP, in what we are going to dub a “DGX-1.5” style system. The final components for DeepLearning12 are arriving the day this article goes live.