We recently had the opportunity to test Intel QuickAssist Technology with OpenSSL 1.1. This is a test we have been waiting for since May 2016, but OpenSSL 1.1 was delayed for several months. In the interim, we have had several QuickAssist 1.6 capable cards in the lab, including those from Intel and Netgate. We will have more benchmarks with more applications in the coming weeks but wanted to provide initial impressions as well as some thoughts around getting something set up.
What is Intel QuickAssist Technology?
Intel QuickAssist Technology (commonly referred to as QuickAssist or QAT) is a hardware accelerator for cryptographic and compression algorithms. We first saw QuickAssist in September 2013 with the Intel Atom C2xx8 “Rangeley” parts. At that point, we had the hardware but the software ecosystem was far from user-friendly. If you were sticking closely to reference designs or had the ability to develop large-scale telecom service provider applications or big web scale systems it was accessible, but there was more effort involved. The benefits were enormous. You could use very low power Atom chips, certain PCHs or accelerator cards and essentially switch to encryption and compression without burdening your CPU.
With the web moving to HTTPS and HTTP/2, there is an increasing need to quickly handle cryptographic tasks. We expect to see QAT become a much more broadly adopted technology in Intel chips in 2017 and beyond.
OpenSSL 1.1: Why it is a big deal for Intel QuickAssist
OpenSSL is one of the most widely adopted technologies for securing communications between compute nodes. It powers encryption on an enormous number of public and private websites. OpenSSL is used to secure communications in other applications as well. To say it is an important part of the modern communication infrastructure is an understatement. When major vulnerabilities are found in OpenSSL it sets panic in the hearts of countless admins, software developers, and executives around the world.
OpenSSL 1.1, released in August 2016, greatly simplified getting Intel QuickAssist working in a system. Intel was able to get asynchronous OpenSSL adopted and a part of the base package. We have been using the term “out-of-the-box” support for QuickAssist in OpenSSL 1.1 which is slightly misleading. You still do need to install and use the QAT engine for the hardware you have in order to take advantage of the acceleration. With that said, you can now drop the engine into standard OpenSSL 1.1 rather than needing a special OpenSSL version which makes the solution much easier to deploy.
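A quick way to confirm that a stock OpenSSL 1.1 build can see a separately installed engine is the `openssl engine` subcommand. The sketch below wraps it in a small helper. The helper name is hypothetical, and the default engine id `qat` is an assumption that may not match your engine build, so treat it as a placeholder.

```shell
# Hypothetical helper: ask OpenSSL whether a given engine can be loaded.
# The default id "qat" is an assumption; pass your engine's actual id.
check_engine() {
    engine_id="${1:-qat}"
    # -t tests whether the engine is available, -c lists its capabilities
    openssl engine -t -c "$engine_id"
}
```

Calling `check_engine` (or `check_engine some_other_id`) should report whether the engine loaded and which algorithms it advertises. If it fails, the engine library is most likely not where OpenSSL expects it or was built against different hardware.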
One thing I learned over the last few weeks is that there are actually different QuickAssist versions. For example, the Intel Atom C2xx8 series parts have QuickAssist 1.5. The add-in cards from Netgate and Intel we had are both QAT 1.6 cards. You may think that this does not matter; however, you do need to ensure you are using binaries (either self-compiled or distributed) that work with the hardware you have. We tried using binaries compiled for QAT 1.6 with an Intel Rangeley system and that was an exercise in futility. Here is the easy way to remember where you will commonly encounter QAT accelerators through the end of 2016 and the versions:
- QAT 1.5 = Rangeley Atom C2xx8 series hardware. The Atom C2xx0 series does not have the onboard QAT accelerator making that final 8 in product names very important.
- QAT 1.6 = Coleto Creek. This is found in add-in cards (what we are using) and some embedded solutions with Intel Xeon E3 and E5 chips.
You do need to specify the version when compiling some QAT accelerated applications and when you install the cards into systems the Intel QAT driver will distinguish between the two main QAT versions.
We are going to have more on QAT 1.5 with Rangeley hardware in a follow-up piece. For today’s tests we used two sets of cards, both based on Intel’s highest-end Coleto Creek 8955 chipset. We utilized the Netgate CPIC-8955 as our primary QAT accelerators. At the time of this writing they retail for around $800, but Netgate loaned us two cards for our QAT testing. The Netgate CPIC-8955 is the highest-end “Coleto Creek” PCIe QAT accelerator on the market, rated for up to 50Gbps QAT throughput. These are full-height cards with active cooling. Netgate provided both the cards and support for getting us started with QAT. That support made this article possible by greatly speeding our test setup cycle. If you are thinking of embarking on QuickAssist acceleration, having a QAT guru is extremely helpful.
Second, we had the Intel QuickAssist Adapter 8950 cards that Intel sent. We have been using these cards for some time and they will be used more heavily in one of our upcoming pieces. They utilize the Intel Coleto Creek 8955 chipset. We do wish Intel just called these cards the “Intel QuickAssist Adapter 8955” to make things easy. They are half height cards and require good chassis cooling to provide airflow over the large heatsink.
The load generation servers we were testing were dual Intel E5-2698 V4 machines with 128GB RAM. We also had a quad Intel Xeon E7-8870 V3 system with 512GB of RAM and a dual 40GbE Mellanox ConnectX-3 Pro NIC for prototyping our tests including our upcoming IPsec VPN tests. For our system under test in this article, we were using a dual Intel Xeon E5-2650 V4 system with 24x 2TB NVMe SSDs and 512GB RAM. Networking in that system was provided by an Intel XL710-QDA2 dual 40GbE card as well as 4x Intel X550 10Gbase-T ports.
Here is a look at one of the first test beds we used for the QAT series, with 180Gbps worth of networking and several QuickAssist accelerators:
When testing the higher-end QAT cards, dual 10GbE is not sufficient for network bandwidth testing and you need at least 40GbE.
Today we are going to publish two benchmarks using the QuickAssist accelerators. First, we have an OpenSSL speed test which is informative but is less of a real-world application. For the second we are going to have a thoroughly real-world test, a preview of our STH HTTPS WordPress CDN benchmark that is based on hosting this site.
OpenSSL 1.1 Speed
We used a number of fairly standard OpenSSL speed metrics and also attempted to quantify what the difference in performance would be between using an Intel QAT hardware accelerator versus using more cores and software. We have been waiting to do this testing for many months as we were waiting for asynchronous OpenSSL support alongside OpenSSL 1.1 to make this happen.
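For readers who want to try a similar comparison, here is a minimal sketch of the two kinds of `openssl speed` runs: a software baseline spread across cores, and an engine-backed run with asynchronous jobs. This is not necessarily the exact methodology behind our charts. The function names, the engine id `qat`, and the async job count of 72 are all assumptions for illustration.

```shell
# Engine id is an assumption; override it with QAT_ENGINE_ID if yours differs.
QAT_ENGINE_ID="${QAT_ENGINE_ID:-qat}"

# Software baseline: -multi forks one benchmark process per core.
speed_software() {
    openssl speed -elapsed -multi "$(nproc)" "$@"
}

# Accelerated run: a single process keeps many asynchronous jobs in flight.
# 72 is an illustrative value, not a setting taken from this article.
speed_qat() {
    openssl speed -engine "$QAT_ENGINE_ID" -elapsed \
        -async_jobs "${ASYNC_JOBS:-72}" "$@"
}
```

With these defined, `speed_software rsa2048` and `speed_qat rsa2048` give the two sides of the RSA comparison, and passing `ecdh` instead runs the ECDH curves. The `-elapsed` flag matters for the accelerated run because the CPU spends most of its time waiting on the hardware, so CPU time would misstate throughput.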
RSA 2048 Cipher
RSA2048 is a widely used public key algorithm. We wanted to see how much processing power adding an Intel QuickAssist hardware adapter would free.
As you can see, getting asynchronous support in OpenSSL 1.1 was a big deal for Intel. QAT performance skyrockets to the point that it would take around 32-33 cores to reach the same RSA2048 sign/s values. On the verify/s side, a single core using QAT delivered roughly the performance of 5.5 cores working in software. That is a stellar result.
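The "cores to match" figure is simply the accelerated rate divided by the software rate of one core. A small sketch of that arithmetic follows. The function name is hypothetical and the input numbers are placeholders chosen for illustration, not measurements from our charts.

```shell
# core_equivalents ACCELERATED_OPS_PER_SEC SOFTWARE_OPS_PER_SEC_PER_CORE
# Prints how many software cores it would take to match the accelerated rate.
core_equivalents() {
    awk -v accel="$1" -v core="$2" 'BEGIN { printf "%.1f\n", accel / core }'
}

# Placeholder figures only, NOT results from this article.
core_equivalents 33000 1000   # prints 33.0
```

Feed it the sign/s (or verify/s) values from your own accelerated run and from a single-core software run to get the equivalent core count for your hardware.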
Elliptic curve Diffie–Hellman (ECDH)
This is a popular key exchange algorithm for nginx OpenSSL configurations. If you look at the ssl_ciphers line in your nginx configuration and see ECDH (or, more likely, something like ECDH+AES256), you are looking at an example of this being used.
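If you are not sure what a cipher string in nginx's `ssl_ciphers` directive actually selects, OpenSSL can expand it for you. The one-liner below uses the `ECDH+AES256` example string from above. The exact list it prints depends on your OpenSSL version and build options.

```shell
# Expand a cipher string the way OpenSSL interprets it.
# Each output line shows the suite name plus its key exchange (Kx),
# authentication (Au), encryption (Enc) and MAC.
openssl ciphers -v 'ECDH+AES256'
```

Look for `Kx=ECDH` in the output to confirm that a suite uses an elliptic curve Diffie–Hellman key exchange.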
These results show the power of the asynchronous OpenSSL and Intel QAT accelerator technology. While making the chart, I had to re-run the benchmarks just to ensure I did not make a typographical error in the results spreadsheet.
Chained Cipher AES CBC
We set this test scenario up from the viewpoint of a CDN provider serving larger files, using aes-256-cbc-hmac-sha1 as our cipher, and we therefore focus on the 1k, 8k and 16k block size results.
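As a rough sketch of how a chained cipher like this can be exercised, `openssl speed` accepts an EVP cipher name through `-evp`. The wrapper name below is hypothetical, and any engine options you pass through (such as an engine id of `qat`) are assumptions about your setup rather than our exact invocation.

```shell
# Hypothetical wrapper: benchmark the chained AES-CBC + HMAC-SHA1 cipher.
# Extra arguments are passed straight through to openssl speed.
speed_chained() {
    # -evp selects the cipher by its EVP name
    openssl speed -elapsed -evp aes-256-cbc-hmac-sha1 "$@"
}
```

Running `speed_chained` on its own gives a software figure, while something like `speed_chained -engine qat -async_jobs 72` would be the accelerated counterpart under the assumptions above. The summary table reports throughput per block size, which is where the 1k, 8k and 16k columns come from.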
Here you can see that at the larger 16k block size the Intel QAT hardware accelerator is competitive with seven Broadwell cores, and at 8k it is competitive with two cores. Overall, these are solid results.
Overall our OpenSSL speed benchmarks show significant performance improvements that will free up cores. Depending on the network controller you are using (and the offload capabilities thereof) at 40GbE and 100GbE speeds this type of offload will become more important.
“Beta” Benchmark Preview – STH HTTPS WordPress Nginx CDN
This is a new benchmark we are working on. It is still far from prime time as we have a lot of work to do in order to get it tuned to where we want it. Expect to see more of this in 2017. The basic test infrastructure utilizes a snapshot of the entire STH WordPress site as of when we switched to OpenSSL earlier in 2016. We then simply took the ~120,000 files in /wp-includes/uploads/ and built a server image to serve these files as if it were a CDN server for the site. If you were wondering why there are so many image files, each image uploaded for reviews has different sizes stored for thumbnails, larger images, etc. By storing resized files, they can be served as static content when a web request is made, which is much more efficient. After over 7 years, this accumulates to become a large number of static files.
While we generally use Ubuntu, and plan to in the future, we swapped over to CentOS 7.2 just for this test. That was driven by the fact that QAT on CentOS was easier to get working. We are using OpenSSL 1.1 and have an nginx build specifically for QAT installed. That version of nginx only works with up to 32 threads (seriously) so we limited our test system accordingly. We whittled the ssl_ciphers down to AES256-SHA. We then ran a wrk script that simulates an access distribution from the live site. Although this is not going to be the most widely applicable nor the most reproducible test, I wanted to see for myself what swapping over STH to a QAT accelerated stack would look like. STH does host its own infrastructure so staying on top of hardware trends has real-world applicability to us.
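For anyone wanting to build a similar load test, the general shape of a scripted wrk run is sketched below. Everything specific here is hypothetical: the function name, the target URL, the Lua script, and the thread, connection and duration values are placeholders, not the settings used for our results.

```shell
# Hypothetical wrapper around wrk for a scripted HTTPS load test.
# Usage: run_cdn_load TARGET_URL LUA_SCRIPT
run_cdn_load() {
    target_url="$1"    # e.g. https://cdn.example.test/ (placeholder)
    lua_script="$2"    # Lua script that chooses which paths to request
    # Thread, connection and duration values are illustrative only.
    wrk -t 32 -c 512 -d 60s --latency -s "$lua_script" "$target_url"
}
```

The Lua script is where an access distribution would live: it can weight request paths so that popular files are fetched more often than rarely viewed ones, which is closer to real CDN traffic than hammering a single URL.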
Caveats aside, there was a demonstrable impact on performance simply by changing to the QAT accelerated nginx.
At over 23.7Gbps, this is excellent performance. At the time of this writing, 10Gbps costs $2,000/month or (much) more in US data centers depending on connectivity providers. Adding an $800 card that can handle SSL offload duties on such a pipe while allowing the CPU to handle other tasks is a bargain.
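To put those numbers side by side, here is a small sketch of the arithmetic using only the figures quoted above. It assumes bandwidth pricing scales linearly from the $2,000 per 10Gbps floor, which is a simplification, and the function name is hypothetical.

```shell
# monthly_bandwidth_cost GBPS USD_PER_10GBPS_PER_MONTH
# Assumes price scales linearly with bandwidth (a simplification).
monthly_bandwidth_cost() {
    awk -v gbps="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", gbps / 10 * rate }'
}

monthly_bandwidth_cost 23.7 2000   # prints 4740
```

Under that assumption, a pipe matching the throughput we measured would run at least $4,740 every month, while the accelerator is a one-time $800 purchase.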
Power Consumption Impact
Since Intel QAT is not part of the mainstream Intel Xeon E5-2650 V4 chips we were using, we did need a Netgate CPIC-8955 (25W) add-on card for this. We took measurements of the two systems during the WordPress tests just to show what the difference in power was between the two. Our environmental sensors showed the same temperature and humidity readings during both tests.
As one can see, the QAT enabled system did require more power to run. On the other hand, we were able to push a higher load through the machine so it was well worth the relatively minor bump in power consumption.
Although I do quite a bit of admin work these days, I am by no means the world’s best admin. I am happy to admit that I am far from it.
- Intel QAT resources on Intel.com
- Intel QAT 01.org on GitHub (mandatory reading if you want to use QAT)
- Intel’s Getting Started Guide
- Intel’s QAT resources on 01.org
- Netgate – invaluable for getting STH QAT testing online and they are doing a lot of QAT work for products they sell
- Intel – Intel did send me some rather long (70+ pages) QAT guides so the company does have reference materials available
Of these resources, Netgate’s assistance was by far the most helpful. Netgate is the company behind the popular pfSense network appliance distribution. The company’s QuickAssist Technology gurus had me up and running with a phone call in a matter of minutes, versus spending hours running into small “gotchas” by myself.
My hope is that as we move into 2017 we will see QAT embedded in more systems. Intel is set to expand the availability of QAT in new generations of processors. We are already seeing other chip vendors, such as Cavium, introduce their own crypto and compression hardware accelerators, so this is a space that will heat up in the near future. QuickAssist is finally becoming a technology that is accessible to intermediate admins. We still recommend getting support from a company such as Intel or Netgate that has gurus on staff to ease setup. We do have a number of exciting results to share, including acceleration of IPsec VPNs, which are set to receive a major boost from QAT. We have also heard that there has been work on a FreeBSD QAT driver that will enable QAT accelerated FreeBSD IPsec VPN appliances in the not-too-distant future.
Thanks for the great write up. Will you have a chance to benchmark the zlib acceleration as well?
Hi Michael, we are going to have compression in the near future. We are also going to test the Cavium ThunderX compression acceleration.
Bookmarked. I can’t believe this is the only QuickAssist benchmarking done. STH powah
That’s an odd choice of curves for ECDH. The most commonly used are P-256 and P-384. In regions with clusters of sceptics of the NIST P curves’ origins you will encounter Koblitz 283 and 409. No one sane uses any B- curves. Speed test would then be:
openssl speed -multi $(nproc) ecdhp256 ecdhp384 ecdhk283 ecdhk409
taskset -c 2 openssl speed -multi $(nproc) ecdhp256 ecdhp384 ecdhk283 ecdhk409
… and then, of course, you would use Intel’s patches to OpenSSL for AVX and AVX2. Here’s a ready package:
Remove the last `-multi $(nproc)` please. I did mean to show how to pin a process to a single core to avoid a flawed benchmark due to process relocation.
Please take a look at the Chelsio T6 in December timeframe. It has integrated crypto offload in addition to other impressive features for a variety of applications http://www.chelsio.com/terminator-6-asic/
Do you have results for xtr, ccm or gcm modes? Did you consider HTTP time-out values? Only the initial asymmetric handshake is worth the hardware acceleration; symmetric workloads are too easy to handle by allocating a few cores of your multicore CPUs.
Any chance you’ll benchmark the Rangeley chip without external accelerators?
How was IPsec benchmarking done? With below driver/test, I observe only 10 Gbps.
Tunnel config: ip xfrm
Authenc Algo: sha1
Cipher Algo: aes
Do you have any suggestion here?
The Exar/Maxlinear DX2040 has better performance than Intel 8955.