How Intel Xeon Changes Impacted Single Root Deep Learning Servers

10x NVIDIA GTX 1080 Ti FE Plus Mellanox Top

In this article, we are going to explain why Intel’s Skylake-SP PCIe changes have had an enormous impact on deep learning servers. There are several vendors on the market advertising Skylake-SP “single root” servers that are not single root. With the Intel Xeon E5 generation, each CPU had a single PCIe root, so saying that all GPUs were attached to a single CPU was an easy way of determining whether a server was single root. That is no longer the case with Intel Xeon Scalable.

The world of AI/deep learning training servers is currently centralized around NVIDIA GPU compute architectures. As deep learning training scales beyond a single GPU and into many GPUs, PCIe topology becomes more important, especially if you want to use NVIDIA’s P2P tools for GPU-to-GPU communication. Intel Xeon chips are still the standard for deep learning training servers. With the Intel Skylake-SP generation, Intel increased the number of PCIe lanes per CPU by eight. At the same time, the company changed its PCIe controller architecture in a way that makes single root servers less feasible. Combined with CPU price increases, this means that Intel Xeon E5 V3/V4 systems are still widely sought after by customers for training servers.
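
To make the P2P point concrete, here is a minimal sketch (standard CUDA Runtime API, built with nvcc; the file name is our own) that asks the driver whether each GPU pair can use direct peer-to-peer access. On a true single root topology you would generally expect a “yes” for every pair behind the switches; nvidia-smi topo -m and the p2pBandwidthLatencyTest CUDA sample report similar information.

// p2p_check.cu - report which GPU pairs support direct P2P access
// Build: nvcc p2p_check.cu -o p2p_check
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("Found %d CUDA devices\n", n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            // 1 only if device i can map device j's memory directly, which on
            // PCIe generally requires the GPUs to share a root complex.
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU%d -> GPU%d : P2P %s\n", i, j, canAccess ? "yes" : "no");
        }
    }
    return 0;
}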

We explored Single Root v Dual Root for Deep Learning GPU to GPU Systems previously, and now we are going to show why the new generation of CPUs creates a problem for the newest architectures.

Noticing Something is Different with Skylake-SP Single Root PCIe Servers

The Intel Xeon Scalable generation has been out for over a year. With the Skylake-SP launch, something strange happened: the number of single root deep learning training servers plummeted. In Silicon Valley, the Supermicro SYS-4028GR-TRT2 is perhaps one of the most popular deep learning training platforms we see in local data centers, yet the successor SYS-4029GP-TRT2 has been less popular. We covered the former in our DeepLearning11: 10x NVIDIA GTX 1080 Ti Single Root Deep Learning Server (Part 1) article.

When we looked at the next generation, we noticed that the “4029” Skylake-SP version did not advertise that all GPUs would be under a single PCIe root complex.

Supermicro uses large riser platforms to host the PLX PCIe switches and PCIe slots for GPUs to plug into. Both the Intel Xeon E5-2600 V3/V4 and the Xeon Scalable versions utilize the Supermicro X10DRG-O-PCIE daughter board.

Supermicro X10DRG O PCIe For SYS 4028GR TRT2
Supermicro X10DRG O PCIe For SYS 4029GP TRT2

This daughter board takes CPU1 PCIe lanes and pipes them through the PEX PCIe switches. Each switch is the large and expensive 96-97 lane PLX model. Each switch uses 16 lanes as a backhaul to CPU1, leaving 80 lanes for GPUs. With each GPU claiming 16 lanes, each of the two PCIe switches can handle five PCIe GPUs. That topology means that in both cases all 10 GPUs are attached to the same CPU.
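
As a quick sanity check on that lane budget, here is a trivial worked example (the constants simply restate the paragraph above):

// lane_budget.cpp - back-of-the-envelope lane math for one riser switch
#include <cstdio>

int main() {
    const int switch_lanes  = 96;  // lanes on the PLX switch
    const int uplink_lanes  = 16;  // backhaul to CPU1
    const int lanes_per_gpu = 16;  // each GPU gets a PCIe 3.0 x16 link
    const int gpus_per_switch = (switch_lanes - uplink_lanes) / lanes_per_gpu;
    printf("GPUs per switch: %d\n", gpus_per_switch);               // 5
    printf("GPUs across two switches: %d\n", 2 * gpus_per_switch);  // 10
    return 0;
}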

This is why the SYS-4028GR-TRT2 is advertised as a single root system:

Supermicro SYS 4028GR TRT2 Single PCIe Root

Conversely, the newer Skylake-SP SYS-4029GP-TRT2 is not:

Supermicro SYS 4029GP TRT2 No PCIe Root Mentioned

To be clear, a single root PCIe complex for 8-10 GPUs is a highly sought-after configuration for P2P, so the fact that the newer model omits this language is intriguing. This is not a Supermicro-specific or 10x GPU-specific concern. Our Tyan Thunder HX GA88-B5631 Server Review 4x GPU in 1U ran into the same limitation due to the new Intel Xeon Scalable architecture.

After speaking to multiple vendors, we found that this is actually a change in the Skylake-SP generation of Intel Xeon Scalable CPUs and it has a big impact. It is also one that the AMD EPYC 7001 cannot remedy.

Getting into this Mesh

One of the biggest changes with this generation is how the various parts of the Intel Xeon Scalable die are connected. We did an in-depth look at this in our piece: New Intel Mesh Interconnect Architecture and Platform Implications. For our purposes, the key change was moving from the company’s ring architecture to a mesh architecture to support higher core counts.

Intel Mesh Architecture V Ring

We actually picked up on the nuance of this change last year but missed the implication, specifically what it means for PCIe topology. We focused on this in our first mesh article:

Broadwell Ring V Skylake Mesh PCIe Example

If you look at the Broadwell-EP (Intel Xeon E5-2600 V4) architecture, all of the integrated IO (IIO) is on a single controller in the ring. We are going to use a low core count example here so you can see the two PCIe x16 controllers and the single PCIe x8 controller on the top right ring stop. Single root servers would typically use one of these PCIe 3.0 x16 controllers to connect to a PCIe switch, and from there to half of the GPUs. The other PCIe 3.0 x16 controller would connect to another PCIe switch and the other half of the GPUs. Since everything hangs off a single IIO block, all of the GPUs shared a PCIe root, and for deep learning everything worked well.

Broadwell LCC Architecture

Since there is one IIO component, the industry colloquially equated one CPU to mean one PCIe root. It was this way for four generations of Intel Xeon E5 CPUs.
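
If you want to see this from software, a small sketch like the one below (CUDA Runtime API again; the file name is ours) prints each GPU’s PCIe bus ID. Matching those IDs against lspci -tv output shows which switch and root complex each GPU hangs from.

// gpu_bus_ids.cu - print each GPU's name and PCIe bus ID
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        char busId[32] = {0};
        cudaDeviceGetPCIBusId(busId, sizeof(busId), i);  // e.g. "0000:04:00.0"
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU%d  %-24s  %s\n", i, prop.name, busId);
    }
    return 0;
}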

Now let us take the Intel Skylake-SP 28-core die used on high-end CPUs like the Intel Xeon Platinum 8180 and Xeon Platinum 8176. As you can see, there are three external PCIe 3.0 x16 controllers atop the mesh. There is a fourth labeled “On Pkg PCIe x16” which can be used for on-package components such as Omni-Path “F” series parts. For our purposes, the other three are the ones we are interested in. Note that they each have their own stop on the mesh.

Skylake SP 28 Core Die Mesh

Digging a bit deeper, we can see why these PCIe x16 controllers are different. Here is what the Processor Integrated I/O looks like on the Skylake-SP mesh where you can see that each PCIe 3.0 x16 controller sits on its own mesh stop:

Intel Skylake SP Mesh Interconnect Integrated IO

Instead of a single big IIO block, there are multiple smaller IIO blocks, each with its own traffic controllers and caches. For deep learning servers, that means each PCIe x16 controller connected to a 96-lane PCIe switch and its downstream GPUs sits on a different mesh interconnect stop. For more technical PCIe enumeration and root complex reasons that we are not going to delve into here, there is an important implication: the architecture is not single root like the Intel Xeon E5 generation, where all of the controllers sat behind the monolithic IIO ring stop.
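
This is where application code feels the difference. The sketch below (a simplified illustration, not production code) tries to enable direct peer access from GPU 0 to GPU 1. When the two GPUs sit under different root complexes, cudaDeviceCanAccessPeer typically reports 0, and GPU-to-GPU copies via cudaMemcpyPeer fall back to staging through host memory with a corresponding bandwidth penalty.

// enable_p2p.cu - attempt to enable direct GPU0 -> GPU1 peer access
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) {
        printf("GPU0 cannot directly access GPU1 (e.g. different PCIe root)\n");
        return 1;
    }
    cudaSetDevice(0);                                    // peer access is enabled from the current device
    cudaError_t err = cudaDeviceEnablePeerAccess(1, 0);  // flags must be 0
    printf("Enable GPU0 -> GPU1 peer access: %s\n", cudaGetErrorString(err));
    return 0;
}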

Next, we are going to put the pieces together and show what all of this means. We are also going to discuss AMD EPYC Naples and how it fits into the picture. Finally, we are going to give some of our closing thoughts.

7 COMMENTS

  1. You’ve got no idea how bad this is for us in the field. It’s OK now since you can still get E5s, but you can’t upgrade to the new-gen Xeon Gold and still use P2P.

    What this article didn’t talk about enough is that the reason we use single root is to allow inter-GPU traffic. GPUs talk to each other without having to go back to the CPU.

  2. What is the impact of PLX multiplexers in the whole system?
    Looking forward to VEGA 20 and NAVI with their own implementation of NVidia’s NVLINK.
    What is the impact of 8 vs. 16 PCIe links with current GPUs like the GTX 1080 Ti (without PLX chips)?
    EPYC 2 will support PCIe 4.0 and PCIe 5.0 is on its way.
    Meanwhile NVidia is laughing its butt off and people are stupid enough to buy single-supplier stuff like CUDA; people never learn.

  3. @Lee Chen, thank you. That part was missing for me too, as I had no idea why single root is so important or nice to have.

  4. Hi, very nice work on this article.

    Some other feedback: The “next page” buttons are almost invisible. I saw them only by chance. Consider making them more contrasting.

  5. @Lee Chen
    The thing is, monolithic architectures are a thing of the past for server CPUs. The mesh is Intel’s last-ditch effort to keep it alive a little bit longer, but even they will move on to an MCM design sooner rather than later, just like AMD is doing now. If you guys don’t start adapting your tools to cope with these limitations now, things are just gonna be more painful in a few years. Right now you can still find Broadwell-based CPUs, but you won’t have that option anymore in a few years.

  6. What is the typical CPU utilization of a system like this? Does it even make sense to use Skylake-SP Platinum parts? Does the Xeon D 2xxx have the same issue with PCIe?
