Hands-on Benchmarking with Intel Sapphire Rapids Xeon Accelerators

Intel QuickAssist Technology

Intel QAT is one of the strangest accelerators out there. It focuses on crypto and compression offloads. Intel has had several generations of QAT since Tolapai in 2007-2008. Each generation (usually) brings more performance but also different cipher support. We actually have an Intel QuickAssist Parts and Cards by QAT Generation guide. If you missed it, we showed what the QAT accelerator is and what it does in Intel QuickAssist in Ice Lake Servers What You Need to Know.

When QAT is leveraged, it is basically a cheat code. Intel’s customers, such as VPN and firewall box vendors, utilize it for low-cost acceleration. Other vendors use it for compression, not just for networking, but also for storage. If you have a Dell EMC PowerStore and are using compression, that is most likely being powered by QAT. Many other vendors use it as well, but it usually takes ISV or platform vendor enablement because it also requires having a card, a built-in accelerator, or the right PCH.

Still, general adoption has lagged quite a bit. We did pieces in 2016-2017 around the technology. Really, it took until OpenSSL 1.1 was (finally) released with QAT support for asymmetric crypto for it to become usable by a broader audience. Even how QAT has been implemented is a bit strange. For example, only some of the Intel Xeon D-2700 and D-1700 series chips support QAT. The ones that do use two different models, with the Xeon D-2700 supporting in-line (higher-end) offloads and the D-1700 supporting look-aside offloads.

Intel Xeon D Ice Lake D 3

With that, I wanted to take a bit of a look at what the performance means for the same workloads. To be clear, we searched the parameter space to find the right thread parking and thread count for the systems, both during the Ice Lake piece and for this piece. You will see a great example of that in the upcoming Ice Lake-D QAT piece. I frankly wanted to check Intel’s work on this to ensure the slides it showed at Intel Innovation a few weeks ago were above board. Intel’s team clearly did this work as well, as they had the right sweet spots, and those also align with what we have seen previously.
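To give a feel for what a thread count search looks like, here is a minimal sketch. It is not the harness used for these results: it uses Python's `zlib` as a stand-in for the compression library (not ISA-L or QAT), the function names `compress_throughput` and `sweep` are hypothetical, and it does not handle thread parking (pinning threads to specific cores), which would need OS-level tooling.

```python
import time
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_throughput(threads, payload, rounds=4):
    """Return aggregate compression throughput in MB/s for one thread count."""
    jobs = threads * rounds
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # zlib releases the GIL while compressing large buffers,
        # so the worker threads can run in parallel.
        list(pool.map(lambda _: zlib.compress(payload, 6), range(jobs)))
    elapsed = time.perf_counter() - start
    return jobs * len(payload) / elapsed / 1e6


def sweep(thread_counts, payload):
    """Measure each thread count; return (best_count, results_by_count)."""
    results = {n: compress_throughput(n, payload) for n in thread_counts}
    best = max(results, key=results.get)
    return best, results


if __name__ == "__main__":
    # Synthetic, highly compressible payload (an assumption for illustration).
    data = b"sapphire rapids qat isa-l " * 40000
    best, results = sweep([1, 2, 4], data)
    for n, mbps in results.items():
        print(f"{n:>3} threads: {mbps:8.1f} MB/s")
    print("sweet spot:", best, "threads")
```

On a real system, the sweep would cover the full range of thread counts and would be repeated several times per point, since a single short run is noisy.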

We tested Intel QAT against the AMD EPYC 7763 using the Sapphire Rapids development platform and a Dell EMC PowerEdge XE8545 server. We picked the Dell simply because it had the right chips in it and has huge cooling. Since we are not using the NVIDIA Redstone for AI acceleration, that platform has perhaps too much cooling, since it is built to run at more than 2kW under its max load.

Intel QAT Compression

Compression offload is probably the best case for QAT. Here, we are going to use the same setup that we did when using QAT as an add-in card. In all of these, you will likely note that our AMD numbers are slightly different from Intel’s, but we are going to attribute that to using the EPYC 7763s in a Dell server designed to cool 2kW of GPUs that are not being utilized.

Intel Pre Production Sapphire Rapids Preview QAT Compression Performance Preview

In this, we have the Xeon Gold 6338N and the AMD EPYC 7513 as well. There we gave AMD’s Milan a slight edge in TDP to account for the QAT accelerator TDP. The other big item to note is that with ISA-L, AMD can beat Intel Sapphire Rapids with Milan. We sense that some other sites that get a result like this will pass up the opportunity to showcase it, but AMD actually performs better here unless the QAT accelerator is utilized. When it is utilized, things are completely different. Intel needs vastly fewer cores to hit the same level. Our charts are labeled as threads because for some other tasks, we needed to use all threads in a system (like the 32-core Ice Lake and Milan testing).

Intel Pre Production Sapphire Rapids Preview QAT Compression Performance Per Thread Preview

This is a really interesting result. Zen 4 is faster than Zen 3. We would expect a minimum 10% increase in 64C performance with Genoa (again, I know the answer, but we are just using a general generational plug from the desktop IPC gains here). That will put 64C Genoa ISA-L in line with the Sapphire Rapids QAT throughput here. Indeed, since ISA-L performance scales with cores, we would expect the 96-core Genoa ISA-L to have the highest throughput. Intel will only use 4 cores versus AMD’s 64-96 cores, but the other side is that this performance is limited by the accelerator. Using onboard accelerators, Intel will cap out at this performance, but AMD will scale, albeit at the expense of many more cores and higher power consumption (we assume).

That is really the trick with accelerators. One gets more performance per core, but is capped by the accelerator performance.
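The trade-off can be made concrete with a toy model. The sketch below is only an illustration under stated assumptions: throughput is in arbitrary units rather than measured data, software compression is assumed to scale linearly with cores, the 10% generational uplift is the rough plug used above, and the function names are hypothetical. The numbers are chosen so that 4 cores saturate the accelerator and a 64-core part with the 10% uplift just reaches the same level.

```python
def software_throughput(cores, per_core):
    """Software (ISA-L style) throughput, assumed linear in core count."""
    return cores * per_core


def accelerated_throughput(cores, per_core_feed, accelerator_cap):
    """Offloaded throughput: cores feed the accelerator until it saturates."""
    return min(cores * per_core_feed, accelerator_cap)


def crossover_cores(per_core, accelerator_cap):
    """Smallest core count whose software throughput reaches the cap."""
    return -(-accelerator_cap // per_core)  # integer ceiling division


if __name__ == "__main__":
    # Placeholder values in arbitrary units, NOT benchmark results.
    CAP = 704            # accelerator ceiling
    FEED_PER_CORE = 176  # what one core can push to the accelerator
    ZEN3_PER_CORE = 10   # baseline software throughput per core
    ZEN4_PER_CORE = 11   # baseline plus the assumed 10% uplift

    print("accelerated,  4 cores:", accelerated_throughput(4, FEED_PER_CORE, CAP))
    print("accelerated,  8 cores:", accelerated_throughput(8, FEED_PER_CORE, CAP))
    print("software,    64 cores:", software_throughput(64, ZEN4_PER_CORE))
    print("software,    96 cores:", software_throughput(96, ZEN4_PER_CORE))
    print("cores to match the cap:", crossover_cores(ZEN4_PER_CORE, CAP))
```

Adding cores beyond the saturation point does nothing for the accelerated path in this model, while the software path keeps climbing as long as cores (and power budget) are available.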

Next, let us look at the IPsec VPN performance.

18 COMMENTS

  1. Read Hot Hardware’s version: “We don’t know what any of this means really, but here’s the scripts Intel gave us, and here’s what Intel told us to run. We also don’t have AMD EPYC so who cares about competition.”

    Read STH: “Here’s an in-depth look at a few storylines we have been showing you since last year, and here’s what and why you can expect the market to change.”

    It’s a world of difference out there.

  2. One thing left unmentioned is the physical impacts of acceleration. For instance, what is the die space budget for something like QAT and AVX? How are thermals impacted running QAT or (especially AVX) in mixed or mostly-accelerator workloads? And on the software side, (which was briefly touched on in the earlier QAT piece), what software enablement/dev work is needed to get these accelerators to work?

    In a future piece with production silicon I’d be eager to get some thoughts on the above.

  3. I don’t think QAT uses AVX. It’s like its own accelerator. Can you show the PCIE or other connection to the QAT accelerator?

  4. Thanks for the balanced view. It’s good you talked AMD and history.

    Now can you do that MikroTik switch review you mentioned in this article???

  5. Why acceleration? It seems like Intel is either not able to or does not want to go in the direction of multiple chips on a package. That leads to packaging specialized chips (accelerators) into the same package with the CPU.
    Intel was the company that brought the generic purpose CPU that can be tuned for multiple usages. It seems like AMD is heavily betting on that while Intel is taking the sideway with custom chips for individual workloads, like mainframes did.

    It almost seems like Intel is playing to its strength of being able to deliver custom chips by leveraging its army of engineers. Would this work? Really hard to say.

    Server workloads are getting pushed more and more into the cloud. So hyperscalers will make the decision, but AMD’s strategy sounds better to me. Software is always more malleable than hardware, and making the cores/CPUs cheaper and abundant was the winning strategy of Intel. I expect it would work again for AMD.

  6. What would be the picture with a QAT card + AMD processor?
    There are enough PCIe lines for that.
    Looks like it would be the best of both worlds: highest general purpose compute, QAT accelerator if useful.

  7. I remember reading AMX is even worse regarding CPU clock down than (early) AVX512, when they added it to Linux they made it very difficult for workloads to run with AMX (the admin has to explicitly allow it for an application).

    All of this needs software support, which only seems to be widely available for QAT. A repo on github probably isn’t enough for most people who don’t want to spend their operations budget on recompiling large parts of their software stack. The only way this accelerator strategy is going to work is if you can replace an AMD machine with an Intel machine, install a few packages through your distribution and it magically runs a lot faster/more efficient.

    Also the big question what part of this is available in a virtualized environment. If AMX slows down adjacent workloads that might be cause enough to disable it for VMs in shared environments. I don’t know if you can pass down QAT to a VM.

  8. > I don’t know if you can pass down QAT to a VM.
    THIS is really the point in today’s cloud world.
    Can QAT and other accelerating technologies be easily used in VMs and in containers (Kubernetes/Docker)?
    If they can be used:
    – what does one have to do to make it work (effort)?
    – what’s the loss of efficiency, and with it
    – how does a bare metal deployment compare to a deployment in Kubernetes/Docker, e.g. on AWS EC2?

  9. I can’t say I’m on board with the STH opinion that these accelerators are a game changer in their market. The trend I see is that every buyer, but especially hyperscalers, doesn’t want these vendor-specific accelerators; they want general-purpose accelerators.

    Even Intel QAT support is pretty scarce and harder than needed to use and for network functions seemingly overtaken by DPU/TPU hardware. I don’t really see a space for the other Intel specific extensions, and am not sure why STH is such a subscriber to this idea of encouraging vendor specific extensions.

  10. David, sorry but that’s crazy. NVIDIA has a huge vendor specific accelerator market. If hyperscalers didn’t want QAT Intel wouldn’t be putting it into its chips. I don’t think any features go into chips without big customers supporting it. TPU’s are Google only. DPUs outside of hyperscale how many orgs are going to deploy them before Sapphire servers? Even if you’ve got a DPU, you then have a vendor’s accelerator on it.

  11. “Imagine there is an application where you are doing various other work but then need to do one or a handful of AI inference tasks.”

    1. Servers are not Desktop PCs where you do a lil bit of this, then a lil bit of that.
    2. If it’s really just a handful of tasks you can do it on CPU fast enough without VNNI/AMX

    CPU extensions like VNNI and AMX have been designed many years before the CPUs came to market. Today it is clear that they are useless as they can’t compete with GPUs/real accelerators.

    Both Intel and AMD are stepping away from VNNI and moving to dedicated AI accelerators on CPU, just like smartphones SOCs. These are much faster and much more efficient than these silly VNNI/AMX gimmicks:

    Intel will start Meteor Lake embedding their “VPU”.
    AMD will integrate their “AIE” first in their Phoenix Point APU next year. They have also shown AIE is on their Epyc roadmap. I seriously doubt that we will ever see AMX on AMD chips.

    These accelerators will usually not be programmed directly. They will be called through an abstraction layer (WinML for Windows), just like on smartphones.

    VNNI and AMX are both basically dead.

    “AMD’s strategy is to allow Intel to be the first with features like VNNI and AMX. Intel does the heavy lift on the software side, then AMD brings those features into its chips and takes advantage of the more mature software ecosystem.”

    Please stop making things up here: Intel is doing stupid things like VNNI and AMD has to follow for compatibility. Next to no one is using VNNI and there is almost no software ecosystem. They only did make VNNI accessible for standard libraries like Tensorflow you are using and also to WinML.

    I am really surprised that you are still pushing Intel’s narrative from few years ago, as even Intel has stepped away from VNNI/AMX and is embedding dedicated inference accelerator units (VPU) in their CPUs.

  12. Even the Arm makers are embedding AI inference extensions in their next DC procs so I’m not sure why there’s an idea that they’re dead. FP16 matrix multiply is useful itself.

  13. @Viktor

    “FP16 matrix multiply is useful itself.”

    Yes, but why do it in your CPU core with all inefficiencies that come along. Instead these small datatype matrix multiplication will be executed on dedicated units that have better power efficiency, better performance and they don’t stop your CPU from doing anything else while executing.

    Effective matrix multiplication is exactly what these VPU, AIE, NPU (Qualcomm), APU (AI processing unit / MediaTek) are doing.

    AMX and VNNI are zombies.
