Intel Xeon D-2700 Onboard QuickAssist QAT Acceleration Deep-Dive


Testing Intel QAT Compression

We are going to ease into this one. We need a few things to get this working. Here we are going to look at five cases compressing the Calgary Corpus, a well-known, older data set commonly used for compression benchmarks. We are using Intel’s QATzip, which you can find on GitHub, along with Intel’s ISA-L (Intelligent Storage Acceleration Library), also on GitHub. We are doing this specifically so that we can use one program and hook in the QAT hardware accelerator. For those who assume that AMD’s performance is going to be poor just because we are using Intel projects for QAT, hold that thought for the discussion of the results.
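To make the measurement idea concrete, here is a minimal sketch of how aggregate and per-thread compression throughput can be timed. It is not the harness used for this article: it uses Python's standard `zlib` as a stand-in for the software-only base case, and the function name `compress_throughput_gbps`, the `repeats` parameter, and the placeholder payload are all assumptions made for illustration. It does not touch QATzip, ISA-L, or the QAT hardware.

```python
import time
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_throughput_gbps(data: bytes, threads: int,
                             repeats: int = 4, level: int = 6) -> float:
    """Compress `data` `repeats` times on each of `threads` workers.

    Returns aggregate input throughput in gigabits per second.
    zlib is only a stand-in for the article's software base case.
    """
    def worker(_index: int) -> int:
        for _ in range(repeats):
            zlib.compress(data, level)
        return len(data) * repeats  # input bytes this worker processed

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        total_bytes = sum(pool.map(worker, range(threads)))
    elapsed = time.perf_counter() - start
    return total_bytes * 8 / elapsed / 1e9


if __name__ == "__main__":
    # Hypothetical placeholder payload; substitute the Calgary Corpus
    # files to get closer to the test described in the article.
    payload = b"the quick brown fox jumps over the lazy dog\n" * 25_000
    for n in (1, 2, 4):
        gbps = compress_throughput_gbps(payload, n)
        print(f"{n} threads: {gbps:.2f} Gbps total, "
              f"{gbps / n:.2f} Gbps per thread")
```

Reporting both the total and the per-thread figure mirrors the two chart views used below. Absolute numbers from a sketch like this will not match the article's results, since the library, data, and thread placement all differ.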

What we are showing is a view of running five cases:

  1. Intel Xeon D-2776NT with:
    • Neither ISA-L nor QAT (base case)
    • ISA-L
    • QAT Hardware Acceleration
  2. AMD EPYC 3451 with:
    • No ISA-L (base case)
    • ISA-L

We are going to express these in two ways. The first looks at performance and the number of threads used:

Intel Xeon D 2776NT V AMD EPYC 3451 Compression Performance

As we can see with the base case, the AMD EPYC 3451, using all 32 threads, is not performing better than the Xeon D-2776NT. This is different from what we saw in the previous exercise, which looked at the EPYC 7003 “Milan” and the mainstream Ice Lake part.

We can also see that ISA-L compression performance gets a huge bump on both Intel and AMD. When we tested the mainstream processors, the Zen 3 architecture was faster here, but the lack of updates in AMD’s embedded line means that result is flipped.

With the onboard hardware accelerator, it takes only two Ice Lake generation Xeon D-2700 cores to reach that ~65Gbps mark.

Here is the second view looking at the throughput per thread:

Intel Xeon D 2776NT V AMD EPYC 3451 Compression Performance Per Thread

As you can see, ISA-L is a big improvement on both Intel and AMD, but the gap to QAT hardware acceleration is again absolutely massive. To put it into perspective, we got about 5x better performance per thread using ISA-L, but around a 19-20x improvement in throughput per thread using QAT.
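As a back-of-the-envelope check on those ratios, the per-thread arithmetic can be sketched as follows. This assumes the "two cores" QAT case means two threads and takes the ~65Gbps figure at face value; the implied base-case and ISA-L values are derived from the stated multipliers, not measured, and the helper name `per_thread_gbps` is made up for this example.

```python
def per_thread_gbps(total_gbps: float, threads: int) -> float:
    """Divide aggregate throughput evenly across the threads used."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return total_gbps / threads


# Assumption: ~65 Gbps reached with 2 threads (one per core) using QAT.
qat_per_thread = per_thread_gbps(65.0, 2)

# The article cites a 19-20x per-thread gain for QAT over the base case,
# so the implied base-case per-thread range is an estimate, not a result.
implied_base_low = qat_per_thread / 20
implied_base_high = qat_per_thread / 19

# ISA-L is cited at about 5x the base case per thread (also an estimate).
implied_isal_low = implied_base_low * 5
implied_isal_high = implied_base_high * 5

if __name__ == "__main__":
    print(f"QAT per thread: {qat_per_thread:.1f} Gbps")
    print(f"Implied base per thread: {implied_base_low:.2f} "
          f"to {implied_base_high:.2f} Gbps")
    print(f"Implied ISA-L per thread: {implied_isal_low:.2f} "
          f"to {implied_isal_high:.2f} Gbps")
```

Under those assumptions the QAT case works out to 32.5 Gbps per thread, which puts the implied base case at roughly 1.6 to 1.7 Gbps per thread.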

When we looked at the mainstream server processors, it took 43 threads on Intel and 34 on AMD for our base cases, and those were running on physical cores. With only 16 cores and 32 threads here, we just went for the best performance possible using all cores and threads. Using SMT is why the performance per thread here is different. That is very important in this space and for accelerators. Here we are getting better performance with two cores instead of using the entire chip for less performance.

QAT hardware powers compression for a number of commercial storage vendors. My go-to example is Dell EMC PowerStore. Storage vendors figured out this QAT offload years ago, and the cost is relatively minimal in many cases. Offering storage compression using QAT became very inexpensive because of this type of hardware acceleration.

Next, let us move to the crypto side and look at IPsec VPN performance.

8 COMMENTS

  1. You are probably under NDA but did you learn something about the D-2700 ethernet switching capabilities? Like for example dataplane pipeline programmability like the Mount Evans/E2000 network building block ? As THAT would be a gamechanger for enterprise edge use!!!

  2. Hi patrik, also a follow up question did you try to leverage the CCP (crypto co-processor) on AMD EPYC 3541 for offloading cipher and HMAC?

  3. Hi patrik, thanks for the review. couple of pointers and query

    1. Here we are getting better performance with two cores instead of using the entire chip for less performance.
    – A physical CPU is combination of front-end (fetch, decode, opcode, schedule) + back-end (alu, simd, load, store) + other features. So when SMT or HT is enabled, basically the physical core is divided into 2 streams at the front end of the execution unit. While the back end remains the same. with help of scheduler, outof order and register reorder the opcodes are scheduled to various ports (backend) and used. So ideally, we are using the alu, simd which was not fully leveraged when no-HT or no-SMT was running. But application (very rarely and highly customized functions) which makes use of all ports (alu, load, store, simd) will not see benefit with SMT (instead will see halving per thread).

    2. is not Intel D-2700 atom (Tremont) based SoC https://www.intel.com/content/www/us/en/products/sku/59683/intel-atom-processor-d2700-1m-cache-2-13-ghz/specifications.html . If yes, these cores makes use of SSE and not AVX or AVX512. Maybe I misread the crypto-compression numbers with ISAL & IPSEC-MB, as it will make use of SSE unlike AMD EPYC 3451. hence CPU SW (ISAL & IPSEC_MB) based numbers should be higher on AMD EPYC 3541 than D2700?

    3. did you try to leverage the CCP (crypto co-processor) on AMD EPYC 3541 for offloading cipher and HMAC?

  4. people don’t use the ccp on zen 1 because the sw integration sucks and it’s a different class of accelerator than this. qat is used by real world even down to pfsense vpns.

  5. D-2700 is ice lake cores not Tremont. They’re the same cores as in the big Xeon’s not the Tremont cores. I’d also say if they’re testing thread placement like which ccd they’re using, they know about SMT. SMT doesn’t halve performance in workloads like these.

  6. @nobo `if you are talking about ccp on zen 1` on linux, this could be true. But have you tried DPDK same as ISAL with DPDK?

    @AdmininNYC thank you for confirming it is icelake-D and not Tremont cores, which confirms it has AVX-512. Checking Nginx HTTPS Performance, Compression Performance comparison with SW accelerated libraries, show AMD EPYC 3451 (avx2) is on par with Xeon-D icelake (avx512). Only test cases which use VAES (AVX512) there is a leap in performance in SW libraries. It does sound really odd right?

    Running ISAL inflate-deflate micro benchmarks on SMT threads clearly shows half on ADM EPYC. I agree in real use cases, not all cores will be feed 100% compression operation since it will have to run other threads, interrupts, context switches.

  7. Something is wrong with this sentence fragment: “… quarter of the performance of AMD’s mainstream Xeons.”
