Intel Xeon Scalable Processor Family Microarchitecture Overview

[Image: Intel Skylake SP Microarchitecture Changes]

In this piece, we are going to go a bit deeper into the microarchitecture features of the new Intel Xeon Scalable Processor Family. We are doing extensive coverage of the Intel Xeon Scalable Processor Family launch including platform-level details, chipsets, mesh interconnects, benchmarks, vendor launches, and SKU stacks. Check out our Intel Xeon Scalable Processor Family Launch Coverage Central for more information. Since the full product name is a mouthful to keep writing out, you may see us use the official codename Skylake-SP in this article.

We are going to focus this piece specifically on the Skylake core architecture and one of the major new instruction set additions, AVX-512.

Starting with a Skylake Core

We start the microarchitecture discussion with the basic Skylake-SP core. What may surprise some readers is that this is essentially the same basic Skylake core we have seen Intel use for years, with additional L2 cache and an additional FMA/AVX-512 execution unit bolted on.

[Image: Intel Skylake SP Microarchitecture Core Cache AVX]

Not all chips will have the second FMA unit. That is another part of the differentiation between the various processor tiers.

Here is the slide that describes some of the more technical microarchitecture improvements over Broadwell-EP.

[Image: Intel Skylake SP Microarchitecture Changes]

The net impact is that Intel expects a roughly 10% IPC improvement over Intel Xeon E5-2600 V4 (Broadwell-EP) cores. Our later benchmarking will show that, at the same clock speed, we expect a >30% performance-per-clock improvement over Intel Xeon E5-2600 V1 (Sandy Bridge-EP) cores.

That is an important metric for two reasons. First, Intel Xeon E5-2600 V1 servers are the ones hitting a 5-year refresh cycle with the Intel Xeon Scalable Processor Family generation. Second, that 30%+ IPC improvement is combined with a maximum core count increase from 8 to 28 cores. This is the replacement cycle where consolidation can happen at an enormous scale. A general rule of thumb is that, at the same clock speeds, three Skylake-SP cores will perform about equal to four Intel Xeon E5-2600 V1 cores (3 cores x 1.3 performance per clock is roughly 3.9 core-equivalents).

Major Intel Xeon Scalable Processor Family L2 and L3 Cache Changes

If you compare the Skylake-SP core to a desktop Skylake core, you will notice that the L2 cache is much larger than the desktop counterpart's 256KB. The server part has an additional 768KB of L2 cache for a total of 1MB per core. Intel made this change to increase the amount of low-latency data available to each core.

[Image: Intel Skylake SP Microarchitecture Major L2 And L3 Cache Changes]

At the same time, Intel reduced the L3 cache by almost half, from 2.5MB/core in Broadwell-EP to 1.375MB/core in Skylake-SP. That reduction frees up transistor budget to increase the L2 cache size and core counts, all on a 14nm process.

The answer to why Intel would do this lies in cache latency. L2 cache is roughly 3.5-4x faster than L3 cache. As core counts rise, L3 cache latencies rise as well, so Intel needs to move more data closer to the cores.

[Image: Intel Skylake SP Microarchitecture L2 L3 Cache Latency]
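
As a rough illustration of these latency tiers, a pointer-chasing microbenchmark is the classic way to see cache latency for yourself. Here is a minimal sketch of ours (not Intel's measurement methodology); the buffer sizes are illustrative picks meant to sit in L2 and L3 respectively, and the random chase order defeats the hardware prefetcher so that load-to-use latency is what actually gets timed.

    /* Minimal pointer-chase latency sketch (illustrative only).
     * Compile: gcc -O2 chase.c -o chase */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double chase(size_t bytes, size_t iters) {
        size_t n = bytes / sizeof(size_t);
        size_t *buf = malloc(n * sizeof(size_t));
        size_t *idx = malloc(n * sizeof(size_t));
        /* Build a random cyclic permutation so the hardware prefetcher
         * cannot hide the load-to-use latency. */
        for (size_t i = 0; i < n; i++) idx[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < n; i++) buf[idx[i]] = idx[(i + 1) % n];

        volatile size_t p = idx[0];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++) p = buf[p];  /* dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        free(buf); free(idx);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        return ns / (double)iters;  /* average ns per load */
    }

    int main(void) {
        /* 512KB fits in Skylake-SP's 1MB L2; 8MB spills into L3 */
        printf("~L2 resident: %.1f ns/load\n", chase(512 * 1024, 20000000));
        printf("~L3 resident: %.1f ns/load\n", chase(8 * 1024 * 1024, 20000000));
        return 0;
    }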

One technique Intel is using is making the L3 cache non-inclusive. Here is Intel's slide describing the difference and the process:

[Image: Intel Skylake SP Microarchitecture L3 Cache Inclusive V Non Inclusive]

The key here is that instead of data being copied to both the L2 and L3 caches, data can be loaded directly into the L2 cache. If you are a storage professional accustomed to storage tiering, this is roughly similar to loading data directly into an NVMe tier, working with it there, and then pushing it down to a SATA/SAS SSD or HDD tier as it becomes less frequently used, instead of needing to copy it to both tiers before using it.
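
To make the fill-policy difference concrete, here is a toy sketch in C. It is purely conceptual, our own illustration rather than a model of Intel's actual cache controller: the inclusive path duplicates every fill into both levels, while the non-inclusive path fills only L2 and lets L3 catch L2 victims.

    /* Toy sketch of inclusive vs. non-inclusive fill policy -- conceptual
     * illustration only, not a model of Intel's cache controller. */
    #include <stdbool.h>
    #include <stdio.h>

    #define WAYS 4
    typedef struct { int lines[WAYS]; int used; } cache_t;

    static bool lookup(cache_t *c, int addr) {
        for (int i = 0; i < c->used; i++)
            if (c->lines[i] == addr) return true;
        return false;
    }

    /* Insert a line; return the evicted line, or -1 if nothing spilled. */
    static int insert(cache_t *c, int addr) {
        if (c->used < WAYS) { c->lines[c->used++] = addr; return -1; }
        int victim = c->lines[0];              /* FIFO eviction for brevity */
        for (int i = 1; i < WAYS; i++) c->lines[i - 1] = c->lines[i];
        c->lines[WAYS - 1] = addr;
        return victim;
    }

    static void fill_inclusive(cache_t *l2, cache_t *l3, int addr) {
        insert(l2, addr);
        insert(l3, addr);                      /* duplicated in both levels */
    }

    static void fill_non_inclusive(cache_t *l2, cache_t *l3, int addr) {
        int victim = insert(l2, addr);         /* L2 only on fill... */
        if (victim >= 0 && !lookup(l3, victim))
            insert(l3, victim);                /* ...L3 catches L2 victims */
    }

    int main(void) {
        cache_t l2a = {0}, l3a = {0}, l2b = {0}, l3b = {0};
        for (int a = 0; a < 4; a++) {
            fill_inclusive(&l2a, &l3a, a);     /* copy lands in L2 and L3 */
            fill_non_inclusive(&l2b, &l3b, a); /* L2 only until evicted */
        }
        printf("inclusive:     addr 0 in L3? %s\n", lookup(&l3a, 0) ? "yes" : "no");
        printf("non-inclusive: addr 0 in L3? %s\n", lookup(&l3b, 0) ? "yes" : "no");
        return 0;
    }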

To be clear, the total cache size per core goes down with this design: L2 plus L3 on Broadwell-EP was 2.75MB (256KB + 2.5MB), while on Skylake-SP it is 2.375MB (1MB + 1.375MB). Because Broadwell-EP's inclusive L3 holds a copy of the L2 data, its effective capacity is closer to 2.5MB per core.

Intel never gave us this view directly, but after a long presentation, our extremely rough sketch of why this works looks something like this:

[Image: Intel Skylake SP V Broadwell SP Average Cache Latency Cumulative Capacity]

Effectively, since each core gets more low-latency L2 cache, it sees big gains at working-set sizes between 256KB and 1MB while giving up relatively little from 0-256KB and from 1MB to 2.375MB. The final 128KB is the remaining delta in effective cache sizes.
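
A back-of-envelope model shows the shape of that curve. The latency figures below are our illustrative assumptions in cycles, not Intel-published numbers; the point is simply that a working set that fits in Skylake-SP's 1MB L2 but spills out of Broadwell-EP's 256KB L2 sees a large drop in average load latency.

    /* Back-of-envelope model of the trade-off above. Latency numbers are
     * our illustrative assumptions (cycles), not Intel-published figures. */
    #include <stdio.h>

    #define L2_LAT 14.0
    #define L3_LAT 60.0

    /* Average load latency for a working set: the part that fits within
     * l2_kb is served at L2 speed, the remainder from L3 (uniform access). */
    static double avg_latency(double ws_kb, double l2_kb) {
        if (ws_kb <= l2_kb) return L2_LAT;
        return (l2_kb * L2_LAT + (ws_kb - l2_kb) * L3_LAT) / ws_kb;
    }

    int main(void) {
        double ws = 768;  /* KB: fits Skylake-SP's L2, spills on Broadwell-EP */
        printf("Broadwell-EP (256KB L2): %.1f cycles avg\n", avg_latency(ws, 256));
        printf("Skylake-SP   (1MB L2):   %.1f cycles avg\n", avg_latency(ws, 1024));
        return 0;
    }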

That is a very rough diagram to illustrate the trade-off. The other obvious factor is that a shared L3 cache means the higher-latency data in Broadwell-EP's L3 is available to other cores. With Skylake-SP, there is not as much of this shared L3 cache data on the chip.

In practice, combined with the Skylake-SP mesh improvements, this means Skylake-SP is faster most of the time, while Broadwell-EP is sometimes faster. Here is Intel's data using SPECint_rate 2006 components:

[Image: Intel Skylake SP Microarchitecture L3 Cache Inclusive V Non Inclusive SPECint]

Here is the SPECfp_rate 2006 version of that chart:

[Image: Intel Skylake SP Microarchitecture L3 Cache Inclusive V Non Inclusive SPECfp]

Essentially, what this data tells us is that for SPECint and SPECfp workloads, having a larger L2 cache means the probability of hitting low-latency caches goes up. It also tells us that for these workloads, L3 miss rates are relatively similar even with Skylake-SP's smaller, non-inclusive L3 cache.

Intel Xeon SP Instruction Set Changes Featuring AVX-512

Perhaps the most impactful change, from a business perspective, is the addition of AVX-512 to Skylake-SP's instruction set. Beyond AVX-512, Intel added virtualization and security architecture enhancements.

[Image: Intel Skylake SP Microarchitecture ISA Changes]

We wanted to focus our energy on AVX-512. AVX-512 allows computation on 512-bit wide vectors, twice the width of AVX2, which can greatly improve throughput for vectorizable code.
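
To make that width concrete, here is a minimal intrinsics sketch (our own example, not Intel sample code) that performs one fused multiply-add across sixteen single-precision floats in a single instruction. It assumes a compiler and CPU with AVX-512F support.

    /* Minimal AVX-512F sketch: one fused multiply-add across sixteen
     * single-precision floats at a time. Compile: gcc -O2 -mavx512f fma.c */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[16], b[16], c[16], out[16];
        for (int i = 0; i < 16; i++) { a[i] = i; b[i] = 2.0f; c[i] = 1.0f; }

        __m512 va = _mm512_loadu_ps(a);
        __m512 vb = _mm512_loadu_ps(b);
        __m512 vc = _mm512_loadu_ps(c);
        __m512 vr = _mm512_fmadd_ps(va, vb, vc);  /* r = a*b + c, 16 lanes */
        _mm512_storeu_ps(out, vr);

        for (int i = 0; i < 16; i++) printf("%.0f ", out[i]);  /* 1 3 5 ... 31 */
        printf("\n");
        return 0;
    }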

[Image: Intel Skylake SP Microarchitecture ISA AVX 512]

AVX-512 was formerly an Intel Xeon Phi x200 exclusive and is popular in the HPC space. During Intel's HPC overview, the rationale for adding the instruction set to the mainstream Xeon line was that those wanting to run general-purpose compute workloads in HPC can use Xeon instead of GPUs or Xeon Phi chips.

The Skylake-SP AVX-512 instruction set is not the same as the Knights Landing AVX-512 instruction set. Both share the foundation (AVX-512F) and conflict detection (AVX-512CD) subsets, but Skylake-SP adds the byte/word, doubleword/quadword, and vector length extensions (BW/DQ/VL), while Knights Landing instead has the exponential/reciprocal and prefetch extensions (ER/PF). If you are using gcc, you will likely need to compile with different flags versus what is used for Xeon Phi.
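
As a quick way to see the difference, gcc predefines a macro for each AVX-512 subset it targets. The sketch below prints which subsets a given -march selection enables; try building it with -march=skylake-avx512 versus -march=knl and comparing the output.

    /* Which AVX-512 subsets did the compiler target? These predefined
     * macros are standard in gcc/clang. You can also dump them with:
     *   gcc -march=skylake-avx512 -E -dM subsets.c | grep AVX512
     *   gcc -march=knl            -E -dM subsets.c | grep AVX512 */
    #include <stdio.h>

    int main(void) {
    #ifdef __AVX512F__
        puts("AVX-512F  (foundation, common to both)");
    #endif
    #ifdef __AVX512BW__
        puts("AVX-512BW (byte/word -- Skylake-SP, not Knights Landing)");
    #endif
    #ifdef __AVX512VL__
        puts("AVX-512VL (128/256-bit vector lengths -- Skylake-SP only)");
    #endif
    #ifdef __AVX512ER__
        puts("AVX-512ER (exponential/reciprocal -- Knights Landing only)");
    #endif
    #ifdef __AVX512PF__
        puts("AVX-512PF (prefetch -- Knights Landing only)");
    #endif
        return 0;
    }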

Here is Intel's performance and efficiency slide on AVX-512.

[Image: Intel Skylake SP Microarchitecture AVX 512 Performance]

With AVX and AVX2, we saw power draw and heat rise at a given clock speed. As a result, Intel downclocked cores while running AVX code.

[Image: Intel Skylake SP Microarchitecture AVX2 AVX 512 Clocks]

With Skylake-SP, cores running different AVX code can run in different frequency ranges. In older CPU generations, running AVX code on a single core meant all cores would downclock. Intel has come a long way to its current per-core implementation of clocking for AVX workloads.
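
Because heavy AVX-512 sequences can still pull a core into a lower frequency range, some codebases select a vector path at runtime rather than committing to the widest one everywhere. Here is a minimal sketch using gcc's __builtin_cpu_supports; the kernel functions are hypothetical stand-ins for real vectorized routines, and this is a common pattern rather than an Intel-prescribed rule.

    /* Runtime dispatch sketch: pick a vector path at startup. */
    #include <stdio.h>

    static void kernel_avx512(void) { puts("using 512-bit path"); }
    static void kernel_avx2(void)   { puts("using 256-bit path"); }

    int main(void) {
        if (__builtin_cpu_supports("avx512f"))
            kernel_avx512();  /* wide path: peak throughput, lower clocks */
        else if (__builtin_cpu_supports("avx2"))
            kernel_avx2();    /* narrower path: higher sustained frequency */
        else
            puts("scalar fallback");
        return 0;
    }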

Final Words

The cache changes are huge, as they give Intel transistor budget to use elsewhere. Likewise, the FMA and AVX-512 changes are very significant. We do believe that the AVX-512 inclusion may have a profound effect:

Intel Xeon Phi may be on the road to an untimely phase out.

We make this prediction for a few reasons. First, by moving AVX-512 to the CPU, Intel gives its HPC customers a migration path that they can use alongside GPUs and/or FPGAs. With alternative HPC architectures like the ARM-based Cavium ThunderX2 and GPUs delivering massive floating-point performance, Intel needed to beef up its standard compute cores.

For the emerging AI/deep learning workloads, Intel acquired both Altera and Nervana. Both have high-performance solutions better suited to solving deep learning problems. HPC and deep learning infrastructure look very similar these days. Despite Intel's efforts, workloads are not moving to Xeon Phi.

AVX-512 was the killer feature of Xeon Phi Knights Landing, alongside MCDRAM and on-package Omni-Path. As Xeon chips bring AVX-512 to general-purpose compute cores, it gets very difficult to choose Knights Landing over Intel Skylake-SP if they offer similar performance. If Knights Mill and the future roadmap do not change the picture, Intel Xeon Phi may be a casualty of the Intel Xeon Scalable Processor Family.

1 COMMENT

  1. “Intel Xeon Phi may be on the road to an untimely phase out.”

    It does leave Xeon Phi in a bit of a pickle. However, all is not lost.

    First, the MCDRAM advantage means most applications that are bound by memory bandwidth will still favor the Xeon Phi. A not insignificant number, I hear.

    Second, peak FLOPS is still noticeably higher on the Phi. The 8180 at its 2.5GHz AVX-512 frequency would have 2.2TFLOPS of performance. The Xeon Phi 7250 has 2.6TFLOPS and the 7290 has 3TFLOPS.

    In DL workloads we'll see Knights Mill in a quarter or so. While the DP performance is said to be somewhat reduced, you get Quad Vector extensions for 32-bit to double SP FP performance, and also half-precision vector support for yet another 2x increase in DL performance, for a total of 4x. That would result in 6-7TFLOPS in SP FP and 12-14TFLOPS in 16-bit FP.

    The real threat is Nvidia. Volta is already out and beats Knights Mill slightly in SP and 16-bit FP, trounces it in DP FP, and crushes it on tensor FLOPS.

    Xeon Phi also has dual Omni-Path connections for better scaling at large node counts.
