Intel Xeon D-2700 Onboard QuickAssist QAT Acceleration Deep-Dive

Intel Ice Lake D QAT Platform

Today we are going to take a look at Intel QuickAssist acceleration in the Intel Xeon D-2700 series. The “Ice Lake-D” series of processors is designed for edge boxes, with applications ranging from simple compute all the way to storage and networking installations. Intel QuickAssist Technology (QAT) was largely designed for these types of use cases, yet we hear very little about it. Today, that changes.

Intel Xeon D-2700 Onboard QuickAssist QAT Acceleration Background

A few weeks ago, we published Intel QuickAssist in Ice Lake Servers What You Need to Know. The goal today is the same, and earlier this week we also published a piece on Sapphire Rapids that includes QAT acceleration. Our basic game plan is to go through the same tests, but with embedded parts. There are accompanying videos for both this piece and the QAT card version. In either case, we suggest opening them in their own tabs, windows, or app for the best viewing experience.

Also, as a quick disclosure, I recorded parts of this video on my trip up to Intel’s Jones Farm site in Hillsboro, Oregon. We are going to say that Intel is sponsoring this piece, but that is largely because they helped cover the travel and helped us get a Xeon D SKU that was challenging to find. We also did this testing back-to-back over several days with the Ice Lake version because it is frankly easier to do this at once with everything set up. The systems under my left arm are the Intel Xeon D-2776NT and AMD EPYC 3451 systems.

Patrick With Intel QAT Test Setup In Oregon

The goal of the series was simple. First, we would look at adding QAT hardware acceleration via add-in cards. Then, we would show the Intel Xeon D onboard acceleration, and finally build to the Sapphire Rapids release with built-in QAT hardware acceleration as those parts are released. The small wrinkle is that Intel allowed STH and a few analysts to show some accelerators (including QAT) this week, well before the Sapphire Rapids launch.

Intel QAT Test Setup In Oregon 3 Wires

For this, one needs a number of items, including switches, a “rat’s nest” of cables (this was set up and torn down over the time I was in Oregon), load generation nodes, and systems to test. The CPU I wanted to use was the Intel Xeon D-2776NT. This is a high-end, 16-core Intel Xeon D-2700 CPU that, importantly, has QAT acceleration.

Intel Xeon D 2700 D 1700 Ethernet Technology

Using the Intel Xeon D-2700 CPU was strategic. The Xeon D-1700 is designed for lower power form factors, and thus QAT acceleration on the D-1700 uses the previous generation technology and does not have the inline packet interface.

Intel Xeon D Ice Lake D Platform Architecture 2

For this, I wanted the Xeon D-2700 because it has the newer QAT technology and we could also get more cores.

Intel Xeon D 2700 Gen 3 Intel QuickAssist Technology

We are not going too in-depth into the Ice Lake-D Xeon D series here, but we have another piece, Welcome to the Intel Ice Lake D Era with the Xeon D-2700 and D-1700 series, if you want to learn more about that.

We set up both the Intel Xeon D-2776NT and the AMD EPYC 3451 nodes with 128GB of memory and got to testing.

Intel Xeon D 2776 Topology

One quick note on the AMD EPYC 3451 we are using. Before using this, I sent a note to AMD asking if there was anything new coming soon in this space to replace the EPYC 3000 series, AMD’s direct competitor to the Xeon D. The EPYC 3000 series is from the EPYC 7001 “Naples” generation of parts, so it is quite old. Indeed, on the EPYC 3451 system, we can see two NUMA nodes that together make up the 16 cores on a single package. During our testing, we managed to test the impact of that layout as well.
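
For readers who want to check the NUMA layout on their own Linux system, here is a minimal sketch in Python. It is not the tooling we used for this piece. It assumes the standard Linux sysfs location `/sys/devices/system/node/node*/cpulist`, and the helper names `parse_cpulist` and `numa_layout` are hypothetical names chosen for illustration.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-7,16-23" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a node with no CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def numa_layout(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} read from sysfs, or {} if the path is absent."""
    layout = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return layout
    for node_dir in root.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            node_id = int(node_dir.name[len("node"):])
            layout[node_id] = parse_cpulist(cpulist.read_text())
    return layout


if __name__ == "__main__":
    # Print one line per NUMA node with its logical CPU ids.
    for node, cpus in sorted(numa_layout().items()):
        print(f"node{node}: {len(cpus)} logical CPUs -> {cpus}")
```

On a system that exposes two NUMA nodes, this would print two lines, one per node. The CPU ids in each node are what one would pass to a tool such as `taskset` or `numactl` when experimenting with thread placement.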

ASRock Rack AMD EPYC 3451 Topology

We will have AMD’s results in this piece, but they will be less competitive than in the Intel Ice Lake Xeon vs. AMD EPYC Milan comparison from the previous piece because AMD has not updated this line in so long.

Intel QAT AMD Snowy Owl Atop Ice Lake D 2

Also, to get the performance we needed on the networking side, we had to use a different NIC than the onboard EPYC 3000 series 10GbE NICs, so we added an Intel 800 series NIC to match the IP the Xeon D had.

With that, let us get to our testing.

8 COMMENTS

  1. You are probably under NDA but did you learn something about the D-2700 ethernet switching capabilities? Like for example dataplane pipeline programmability like the Mount Evans/E2000 network building block ? As THAT would be a gamechanger for enterprise edge use!!!

  2. Hi patrik, also a follow up question did you try to leverage the CCP (crypto co-processor) on AMD EPYC 3451 for offloading cipher and HMAC?

  3. Hi patrik, thanks for the review. couple of pointers and query

    1. Here we are getting better performance with two cores instead of using the entire chip for less performance.
    – A physical CPU is combination of front-end (fetch, decode, opcode, schedule) + back-end (alu, simd, load, store) + other features. So when SMT or HT is enabled, basically the physical core is divided into 2 streams at the front end of the execution unit. While the back end remains the same. with help of scheduler, outof order and register reorder the opcodes are scheduled to various ports (backend) and used. So ideally, we are using the alu, simd which was not fully leveraged when no-HT or no-SMT was running. But application (very rarely and highly customized functions) which makes use of all ports (alu, load, store, simd) will not see benefit with SMT (instead will see halving per thread).

    2. is not Intel D-2700 atom (Tremont) based SoC https://www.intel.com/content/www/us/en/products/sku/59683/intel-atom-processor-d2700-1m-cache-2-13-ghz/specifications.html . If yes, these cores make use of SSE and not AVX or AVX512. Maybe I misread the crypto-compression numbers with ISAL & IPSEC-MB, as it will make use of SSE unlike AMD EPYC 3451. hence CPU SW (ISAL & IPSEC_MB) based numbers should be higher on AMD EPYC 3451 than D2700?

    3. did you try to leverage the CCP (crypto co-processor) on AMD EPYC 3451 for offloading cipher and HMAC?

  4. people don’t use the ccp on zen 1 because the sw integration sucks and it’s a different class of accelerator than this. qat is used by real world even down to pfsense vpns.

  5. D-2700 is ice lake cores not Tremont. They’re the same cores as in the big Xeon’s not the Tremont cores. I’d also say if they’re testing thread placement like which ccd they’re using, they know about SMT. SMT doesn’t halve performance in workloads like these.

  6. @nobo `if you are talking about ccp on zen 1` on linux, this could be true. But have you tried DPDK same as ISAL with DPDK?

    @AdmininNYC thank you for confirming it is icelake-D and not Tremont cores, which confirms it has AVX-512. Checking Nginx HTTPS Performance, Compression Performance comparison with SW accelerated libraries, show AMD EPYC 3451 (avx2) is on par with Xeon-D icelake (avx512). Only test cases which use VAES (AVX512) there is a leap in performance in SW libraries. It does sound really odd right?

    Running ISAL inflate-deflate micro benchmarks on SMT threads clearly shows half on AMD EPYC. I agree in real use cases, not all cores will be fed 100% compression operation since it will have to run other threads, interrupts, context switches.

  7. Something is wrong with this sentence fragment: “… quarter of the performance of AMD’s mainstream Xeons.”
