Intel Atom C3958 16-Core Top End Embedded QAT Linux Benchmarks and Review


Intel Atom C3958 Benchmarks

For this exercise, we are using our legacy Linux-Bench scripts, which help us see cross-platform “least common denominator” results. We do have a full set of expanded benchmarks from our next-gen test suite (Linux-Bench2), which you may see in other STH reviews that include this chip. The target market of the Intel Atom C3958 is embedded applications, which makes the original tests more useful. Generally, embedded applications such as storage controllers and networking appliances will not see heavy workloads where AVX2 / AVX-512 would be useful.

From what we saw in the Intel Atom C2000 series, there are only two OSes that matter for these embedded parts: Linux and FreeBSD. OSes like Windows have a negligible market share on these platforms and we would not recommend using an Atom C3000 series as a desktop. There are many offerings in the market more appropriate for that use case.

Python Linux 4.4.2 Kernel Compile Benchmark

This is one of the most requested benchmarks for STH over the past few years. The task is simple: we take the Linux 4.4.2 kernel from kernel.org and a standard configuration file, then build the standard auto-generated configuration utilizing every thread in the system. We express results in compiles per hour to make them easier to read.
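As a rough illustration of the compiles-per-hour metric, here is a minimal Python sketch. It is not the Linux-Bench script itself. The function names are hypothetical, and it assumes an already-configured kernel source tree and that `make` is on the path.

```python
import os
import subprocess
import time


def compiles_per_hour(elapsed_seconds: float) -> float:
    """Convert the wall-clock time of one build into compiles per hour."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return 3600.0 / elapsed_seconds


def time_kernel_build(source_dir: str) -> float:
    """Time `make -j<threads>` in a configured kernel tree (hypothetical helper).

    Uses one make job per logical CPU, mirroring "every thread in the system".
    """
    threads = os.cpu_count() or 1
    start = time.perf_counter()
    subprocess.run(
        ["make", f"-j{threads}"],
        cwd=source_dir,
        check=True,
        stdout=subprocess.DEVNULL,
    )
    return time.perf_counter() - start
```

For example, a build that takes 120 seconds works out to 30 compiles per hour, so higher numbers are better.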

Intel Atom C3958 Linux Kernel Compile Benchmark

Here we see solid performance, though not quite up to what the compute-focused Intel Atom C3955 can offer. Keen eyes will place performance around that of an 8-core Xeon D part. The microarchitecture difference is going to highlight some bigger performance differences than we would otherwise be accustomed to in our other tests.

c-ray 1.1 Performance

We have been using c-ray for our performance testing for years now. It is a ray tracing benchmark that is extremely popular to show differences in processors under multi-threaded workloads.

Intel Atom C3958 C Ray Benchmark

Here you can see solid performance due to having more cores and L1 cache. The Intel Xeon E3 line is not really a competitor as it lacks the features of the Atom and has significantly higher power consumption.

7-zip Compression Performance

7-zip is a widely used compression/decompression program that works cross-platform. We started using the program during our early days with Windows testing. It is now part of Linux-Bench.

Intel Atom C3958 7zip Compression Benchmark

There is a fairly large chasm between the 16-core Atom C3000 series part and the 16-core Xeon D part. This compression is not using QAT offload, which we will have more on soon. We also sorted the chart by compression speed, which puts the Intel Atom C3958 between the six- and eight-core Xeon D low-power parts. Sorting by decompression would have put it between the eight- and twelve-core Xeon D parts. That is solid performance either way.

Sysbench CPU test

Sysbench is another one of those widely used Linux benchmarks. We specifically are using the CPU test, not the OLTP test that we use for some storage testing.

Intel Atom C3958 Sysbench CPU Benchmarks

We had to remove the 2-core Atom CPUs such as the C2358 and D525 from this list, as those generations made this chart borderline unreadable. This test tends to favor many cores and scales strongly with core count, which is why the C3958 performs so well here.

OpenSSL Performance

OpenSSL is widely used to secure communications between servers. It is an important component of many server stacks. We first look at our sign tests:

Intel Atom C3958 OpenSSL Sign Benchmark

We also have the verify results sorted in the same order to make comparison easier.

Intel Atom C3958 OpenSSL Verify Benchmark

Here we see the Intel Atom C3958 competitive with the Xeon Silver 4108. The Intel Xeon Silver 4108 is a similarly priced part for higher-power, more expandable Xeon Scalable servers. The other key point to look at here is the generational improvement. The Intel Atom C2758 was the top-end Rangeley generation Intel Atom C2000 series SKU with QuickAssist. Even without leveraging QAT, the top-bin performance has increased 4x on this test. OpenSSL is a key metric for these parts as they are commonly used in network and storage appliances.
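For readers who want to try a comparable software-only (non-QAT) sign/verify run, the following Python sketch wraps the stock `openssl speed` tool. The review does not state the exact algorithm or options used, so `rsa2048` and one process per logical CPU are assumptions, and the helper names are hypothetical.

```python
import os
import subprocess
from typing import List, Optional


def openssl_speed_cmd(algorithm: str = "rsa2048",
                      processes: Optional[int] = None) -> List[str]:
    """Build an `openssl speed` command line (algorithm choice is an assumption)."""
    if processes is None:
        # Default to one benchmark process per logical CPU.
        processes = os.cpu_count() or 1
    return ["openssl", "speed", "-multi", str(processes), algorithm]


def run_openssl_speed(algorithm: str = "rsa2048") -> str:
    """Run the benchmark and return the raw text report from openssl."""
    result = subprocess.run(
        openssl_speed_cmd(algorithm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `run_openssl_speed()` requires the `openssl` binary to be installed; the sign and verify rates appear in the report it returns.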

UnixBench Dhrystone 2 and Whetstone Benchmarks

One of our longest-running tests is the venerable UnixBench 5.1.3 Dhrystone 2 and Whetstone results. They are certainly aging; however, we constantly get requests for them, and many angry notes when we leave them out. UnixBench is widely used, so we are including it in this data set. Here are the Dhrystone 2 results:

Intel Atom C3958 UnixBench Dhrystone 2 Benchmark

Here are the Whetstone results:

Intel Atom C3958 UnixBench Whetstone Benchmark

Having a lot of cores makes up for some of the microarchitecture trade-offs made to keep power consumption low. Still, we see some solid performance out of this part.

Final Words

Gone are the days of the “wimpy” Atom. The Atom C3958 sports a low clock speed (2.0 GHz) and does not have turbo boost, L3 cache, or higher-end features such as AVX2 / AVX-512 support. Yet with 1MB L2 cache per core, massive IPC improvements, and 16 cores, the Intel Atom C3958 is competitive with the Xeon D and Xeon Bronze / Silver lines in terms of performance. Although the Xeon lines are better for virtualization and general purpose compute, for most networking and storage appliances this is a very fast chip.

From a competitive standpoint, there is a lot of talk about AMD EPYC in the market. AMD does not yet have a competitive offering in this segment since even the EPYC 7251 is a 120W TDP CPU before adding any other component to the system, or about 2x what we are seeing an entire configured Gigabyte MA10-ST0 test system pull at the outlet. To be fair, AMD is not targeting this market with EPYC. Likewise, ARM has made lots of noise, but the Intel Atom C3958 provides a solid mix of core performance and acceleration for crypto and compression. The Intel Atom C3000 series is certainly enough to hold current ARM offerings at bay for the near-term future.

Looking at the top-end QAT SKU from this generation versus the previous generation (Atom C2758), one can see that the lineup has significantly expanded its market coverage at the top end. Clock speeds are down ~17%, but that is the only area where we are seeing specs decline. Core count has doubled from 8 to 16 cores. Cache size and RAM capacity have quadrupled to 16MB and 256GB respectively. Networking is effectively 10x the speed of the previous generation. PCIe and SATA have moved up a generation and greatly expanded in numbers. TDP is up 55% to match the massive performance and platform upgrades. At the same time, pricing is now much higher, up around 116%. Of course, Intel has parts like the Atom C3758 which address a similar market segment to the previous top-of-the-line part, but it shows how Intel is allowing the Intel Atom C3000 line to creep up higher in the performance stack.
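The generational percentages can be reproduced with a short Python sketch. The C3958 clock, core count, cache, RAM, and price come from this review. The C3958 TDP (31 W) and all Atom C2758 baseline values except its core count (2.4 GHz, 4MB cache, 64GB RAM, 20 W, $208) are assumptions taken from Intel's published specifications rather than from this article, chosen because they are consistent with the percentages quoted above.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new (negative means a decline)."""
    return (new - old) / old * 100.0


# (C2758, C3958) pairs. C2758 values other than core count are assumptions
# from Intel's published specs, as is the 31 W C3958 TDP.
specs = {
    "clock_ghz": (2.4, 2.0),
    "cores": (8, 16),
    "cache_mb": (4, 16),
    "max_ram_gb": (64, 256),
    "tdp_w": (20, 31),
    "price_usd": (208, 449),
}

for name, (old, new) in specs.items():
    # Show both the percentage change and the simple ratio for each spec.
    print(f"{name}: {pct_change(old, new):+.0f}% ({new / old:.2f}x)")
```

Under those assumed baselines, the clock speed change rounds to -17%, TDP to +55%, and price to +116%, while cache and RAM capacity both come out at 4x.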

Overall, this is an enormous generational upgrade in performance, but we expect the Intel Atom C3958 to be a lower volume part given its hefty price tag. At $449 for the CPU it is competing with the Intel Xeon Silver 4108 and Xeon D lines.

If you want to learn more, we have complete coverage at our Denverton Day Official STH Intel Atom C3000 Launch Coverage Central.

11 COMMENTS

  1. I’m very confused. Where are the test results of benchmarks that actually use QAT here? I was expecting something in the SSL benchmarks but according to your numbers the non-QAT C3955 is faster than this chip that has QAT?

    It would be nice to see an article that focuses heavily on the QAT feature of this chip actually use that feature in some tests.

  2. @Don,

    In 2016 we published a few articles around using QuickAssist with OpenSSL and for 40GbE VPN acceleration. In the meantime, Intel has now launched a 100Gbps QAT version and has built QAT into the Burgeoning Intel Xeon SP Lewisburg PCH Options. We will have some cool QAT results soon. For now a few notes:

  3. Something that I saw when I went to Gigabyte’s website… using the PCIe slot disables 2 of the 4 SFF-8087 ports.

    I like the plethora of storage, but I’m not sure I have a need of the QAT… if I could find a use-case, this might be interesting. And I’m not sure I like just the SFP+ ports since I haven’t tried running FO wires thru residential walls yet. A copper option would be nice.

  4. How board. Price? Full review?

    All the commentary around QA was helpful. I don’t know when STH started doing it but I like this new direction.

    Your comparison dataset is real useful since that’s all the competition almost. I can extrapolate more data points.

  5. @Eddie – If you want copper then get a 10Gb copper SFP+ transceiver.

    I would trade the QAT for a GPU on it for transcoding anyday but this isn’t what these things are aimed at.

  6. @Goose – I’ve seen those but haven’t tried to figure out what they cost yet. Depending on the mobo manufacturer, C3xxx SKUs can offer either, both or neither of the 10GbE standards. Take the Supermicro they reviewed earlier, I think it had 2x of each. Of course that one only offered 12 SATA ports vs this one.

  7. Please redo the openssl tests enabling QAT, to check if QAT engine is available run:
    openssl engine
    Testing:
    openssl speed -engine qat -elapsed -multi 2 -evp aes-128-cbc-hmac-sha1
    (more info in 01.org/intel-quickassist-technology)

  8. I noticed every single motherboard doesn’t offer RAID capability as opposed to Intel C236/238 chipsets which do offer at least some basic ones.

    That means additional investment and +10/15W of power consumption, which in case of RAID 1 (which is only thing I need) is really questionable why should I pick this platform instead of XEON 45W one. For 5W of power savings in best case scenario?

    Really? Can someone elaborate how are you sorting RAID, especially RAID 1 and RAID 10 with say 8 disks on low power server. Purchasing £400 pound RAID card and using additional *precious* PCIe slot (which are rare on mini-ITX) while consuming more power when ALL THAT comes for free in case of XEON C236 solution, I don’t see a point.

    Please elaborate!

  9. Karol – I would suspect very few Atom C3958 users are going to use hardware RAID. Even fewer will use chipset RAID. Most applications will use software RAID which is great on this platform.
