A few weeks ago we were allowed to discuss the new Intel mesh interconnect architecture at a high level. You can read Things are getting Meshy: Next-Generation Intel Skylake-SP CPUs Mesh Architecture for that initial overview. Today we can share full details on the Intel Xeon Scalable Processor Family (Skylake-SP) mesh architecture and what it means for the Intel platform.
Some of the information here appeared, scattered, in our Intel Xeon Scalable Processor Family (Skylake-SP) platform overview piece and in our other microarchitecture coverage. The mesh architecture is a cornerstone of the design, important enough to have a dramatic impact on performance.
Understanding Intel Moving From Ring to Mesh and AMD Infinity Fabric
The first concept we want to cover is the Intel Mesh Interconnect Architecture and why Intel is moving away from rings. As we covered in our previous piece, the ring architecture was the product of a much smaller topology. As larger core counts became normal and a second set of rings was added, Intel needed on-die bridges to connect the two hemispheres. Moving to a mesh interconnect aligns resources in rows and columns, yielding higher overall bandwidth and lower latencies than Intel could have achieved by scaling up its older rings yet again.
Astute eyes will notice that the Intel diagram labeled “Broadwell EX 24-core die” is actually the XCC Broadwell-EP die. The Broadwell-EX 24 core die actually has a second QPI stop on the second ring set for its third QPI link. That is important for our discussion of the mesh, as it is comparable to the third UPI link on the Xeon Platinum and the top end of the Gold range.
Here is a larger view of the comparable new mesh:
We covered the UPI links in our Skylake-SP Platform Overview piece. In the meantime, here is the video we recently published on AMD EPYC Infinity Fabric and Intel Broadwell-EP rings that will help you understand how the install base (Intel Xeon E5 V1 to V4) and the new competitor (AMD EPYC) work:
In the 6×6 mesh topology of the 28 core die, there are a total of 36 stops on the fabric: 28 for the cores, two for memory controllers (each handling 3x DDR4-2666 channels), two for UPI, three for PCIe and PCH connectivity, and one for integrated package devices such as Omni-Path fabric.
Also, before we get too far ahead of ourselves: we will often use the 28 core die in our examples, but the vast majority of Skylake-SP CPUs sold are going to be smaller, 18 core and lower, die configurations. These have less complex mesh structures.
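The stop count above can be sanity-checked with a quick tally. This is our own back-of-the-envelope accounting; the labels are ours, not Intel terminology:

```python
# Tally of mesh stops on the 28-core Skylake-SP die, per the
# breakdown above (illustrative labels, not Intel's).
stops = {
    "cores": 28,
    "memory_controllers": 2,  # each drives 3x DDR4-2666 channels
    "upi_links": 2,
    "pcie_and_pch": 3,
    "on_package_fabric": 1,   # e.g. integrated Omni-Path
}

total = sum(stops.values())
print(total)  # 36, matching the 6x6 mesh
```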
Two items are missing from the smaller die mesh diagrams. First, there is no on-package PCIe x16 interconnect, and second, the third UPI link is absent. This dovetails well with the Intel Xeon Processor Family Platinum, Gold, Silver, and Bronze segmentation.
Diving into the Mesh Interconnect Distributed CHA
The first stop on the mesh interconnect discussion is the distributed Caching and Home Agent (CHA). One of the major performance bottlenecks in the Intel Xeon E5 V4 architecture was the limited number of QPI home agents. With Skylake-SP, each core gets its own caching and home agent.
There are a few inter-socket tests where one will see performance figures that look out of line with the simple performance improvements across the rest of the platform. The new distributed CHA is one of the reasons for this. Intel needed to move to this type of architecture because it is planning for even more cores, not just the 28 we have today. Cascade Lake, the next generation, will undoubtedly have more cores, so being able to manage mesh traffic will be important.
Intel Mesh Interconnect Memory Implications
Looking at the memory controllers, there are two per die, as in the Intel Xeon E5 generation. The similarities essentially end there.
Each memory controller is a three channel controller, up from two on the Xeon E5 series. Each channel can support up to DDR4-2666 but is limited to 2 DIMMs per channel (down from 3 DPC on Xeon E5). Two controllers, three channels each, and two DIMMs per channel give us 12 DIMM slots total (2x3x2).
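Here is the per-socket memory math in one place. The DIMM count comes straight from the paragraph above; the peak bandwidth figure is our own arithmetic from the DDR4-2666 data rate and a 64-bit channel, not an official Intel number:

```python
# Per-socket memory arithmetic for Skylake-SP (our own calculation).
controllers = 2
channels_per_controller = 3
dimms_per_channel = 2

dimm_slots = controllers * channels_per_controller * dimms_per_channel
print(dimm_slots)  # 12 DIMM slots per socket

# Theoretical peak bandwidth: DDR4-2666 moves 2666 MT/s over a
# 64-bit (8-byte) bus per channel.
per_channel_gb_s = 2666e6 * 8 / 1e9   # ~21.3 GB/s per channel
peak = per_channel_gb_s * controllers * channels_per_controller
print(round(peak, 1))  # ~128.0 GB/s per socket, theoretical peak
```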
The two controllers sit on opposite ends of the mesh fabric, which allows RAM to core cache transfers, as well as RAM to PCIe, storage, and networking transfers, to happen without crossing the inter-ring bridges found in Intel Xeon E5 CPUs.
If we compare this with AMD, AMD does not need as “fancy” an on-die design. AMD EPYC uses an 8 core silicon die that has access to two DDR4 channels locally. Everything else is over the Infinity Fabric.
The net impact is better performance and lower latency at a given bandwidth figure:
With more cores, latencies usually go up. Here we can see that core-to-core latency remains relatively similar to what we saw in the previous generation 24 core parts.
If you are looking at this and wondering how it compares to AMD EPYC, here is a key distinction. The blue squares are local NUMA accesses while the light blue dots are maximum remote NUMA hops. Intel has half of its resources on each NUMA node (each chip) in a two socket configuration. With AMD, only 12.5% of system RAM is local to a given NUMA node in a two socket configuration. While the latencies may differ, that is an important reason why Intel’s design is the higher performance piece of silicon.
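Those locality percentages fall out of the NUMA node counts directly. A minimal sketch, assuming RAM is spread evenly across all memory controllers in the system:

```python
# Fraction of system RAM that is NUMA-local to a given core in a
# two-socket system, assuming evenly populated memory channels.
def local_fraction(numa_nodes: int) -> float:
    return 1 / numa_nodes

# Skylake-SP: one NUMA node per socket -> 2 nodes in a 2S box.
intel_2s = local_fraction(2)
# EPYC: one NUMA node per 8-core die, 4 dies per socket -> 8 nodes in 2S.
epyc_2s = local_fraction(8)

print(intel_2s, epyc_2s)  # 0.5 vs 0.125
```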
Intel Mesh Interconnect PCIe Performance
PCIe bandwidth is likewise important. Whereas in the Intel Xeon E5 generation all PCIe connectivity sat on a single ring, with the Skylake mesh there are three full stops. One of these also carries the DMI traffic.
The completely odd part of the PCIe complex of Skylake is that each PCIe controller can only bifurcate down to a PCIe 3.0 x4 link. We double-checked this at an Intel tech day and it was confirmed. The implication is huge:
Intel Skylake-SP PCIe controllers cannot bifurcate down to PCIe 3.0 x2 to support twice the number of Intel DC D-series dual port NVMe drives.
That revelation is shocking, as we have been hearing a push for dual port NVMe this generation. If each controller could bifurcate down to eight x2 links, Intel could attach significantly more NVMe drives.
From a storage perspective, that is why AMD EPYC single CPU configurations are hot with OEMs right now and why EPYC is an early frontrunner for storage arrays. This is a spec that Intel simply whiffed on, which is doubly strange since Intel sells dual port NVMe SSDs that use PCIe x2 lanes for each host.
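The lane math behind that complaint is simple. A sketch of our own accounting, assuming one x2 host port per dual port NVMe drive hanging off a single x16 controller:

```python
# Lane accounting behind the dual port NVMe limitation (illustrative).
lanes_per_controller = 16
x2_per_host_port = 2  # a dual port NVMe drive exposes x2 to each host

# With an x4 bifurcation floor, each x4 root port feeds one x2 host
# port and strands the other two lanes:
drives_at_x4_floor = lanes_per_controller // 4
# If the controller could bifurcate down to x2, every lane pair
# would be usable:
drives_at_x2 = lanes_per_controller // x2_per_host_port

print(drives_at_x4_floor, drives_at_x2)  # 4 vs 8 drives per x16 controller
```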
Having three stops instead of one per CPU has a significant impact on overall IO performance.
The key here is that with more stops along more mesh paths, Intel is able to push more data over the mesh fabric to PCIe devices. That was a major weakness of the Intel Xeon E5 design that has now been remedied.
The Hidden Gem: Integrated Omni-Path Fabric
One of the biggest features is going to be Skylake-F, or Skylake CPUs with integrated fabric. Although sales volumes will likely be limited, the potential impact of the integrated fabric is underestimated.
Skylake-F is the codename for the Skylake package with integrated Omni-Path fabric, much like the Intel Xeon Phi x200 generation has. From a price perspective, Intel charges under a $200 premium for a dual port Omni-Path 100Gbps Xeon Phi x200 part. For comparison, a Mellanox EDR Infiniband card costs around $1200.
With the new mesh and more memory bandwidth, Omni-Path sees a significant speedup over previous generation CPUs.
The actual implementation of Skylake-F looks something like this:
Each package has a protrusion from which a cable is fed to a backplane QSFP28 cage. That means each CPU has a 100Gbps RDMA network connection. In a four-socket system, that is 400Gbps of network capacity without using an add-in card or a single standard PCIe lane.
If you adopt Omni-Path, you essentially get ultra low-cost 100Gbps RDMA enabled networking without using a PCIe 3.0 x16 slot. As a result, your effective PCIe lane count versus competitive products (e.g. AMD EPYC) jumps by 16 lanes per CPU. The downside is that you have to use Omni-Path; unlike Mellanox VPI solutions, you cannot put these adapters into Ethernet mode.
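One way to see the lane advantage: with the fabric on package, none of the CPU's PCIe lanes are consumed by networking. A sketch using the commonly cited 48 PCIe 3.0 lanes per Skylake-SP CPU; the x16 100Gbps adapter is our assumption for the non-integrated case:

```python
# Free PCIe lanes per CPU with and without on-package 100Gbps fabric
# (our accounting, assuming a PCIe 3.0 x16 adapter otherwise).
native_lanes = 48       # PCIe 3.0 lanes per Skylake-SP CPU
nic_slot_lanes = 16     # lanes a discrete 100Gbps adapter would consume

free_with_opa = native_lanes                    # all lanes stay available
free_without_opa = native_lanes - nic_slot_lanes

print(free_with_opa, free_without_opa)  # 48 vs 32 lanes free per CPU
```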
Just how low cost? Integrated 100Gbps OPA is about $155 more than the SKUs without it.
If Intel ever needs to change the game in this generation, competing with EPYC in the scale-out storage/ hyper-converged storage space, adding an inexpensive 100GbE fabric option instead of OPA would be a complete win.
In the meantime, the bigger implication is that Intel is killing off the advantage that Xeon Phi has over the mainstream Xeons. In conjunction with the addition of AVX-512, we are seeing Intel move HPC back to its mainstream CPUs.
Overall, the new Intel mesh interconnect architecture is required so the company can scale CPU cores and add more I/O without seeing ever-increasing ring latencies. Each path on the mesh has fewer hops than the old rings had. If Intel gets serious about pushing fabric on package, it is going to provide a huge value proposition for users.