Intel Atom C3958 16-Core Top End Embedded QAT Linux Benchmarks and Review

Gigabyte MA10 ST0 Top

Pivoting slightly from our focus on high-end, high-power server CPUs, we have Intel Atom C3958 performance benchmarks under Linux. We have already published benchmarks on the Intel Atom C3338, C3558, and C3955, which are instructive points of reference within the Intel Atom C3000 series. While the Intel Atom C3958 does not have the clock speeds to match the Intel Atom C3955, it still has 16 cores. What it can claim is that it is the highest-bin QuickAssist part in the current Intel Atom C3000 series lineup.

A Quick Word on Intel QuickAssist

In 2016 we published a few articles on using QuickAssist with OpenSSL and for 40GbE VPN acceleration. Since then, Intel has launched a 100Gbps QAT version and has built QAT into the burgeoning Intel Xeon SP Lewisburg PCH options. We will have some cool QAT results soon. For now, a few notes:

  1. Intel Atom C3000 and Intel Atom C2000 QAT have different features and therefore do not use the same driver version.
  2. You do need Intel Atom C3xxx compatible QAT drivers.
  3. The Intel QAT ecosystem is significantly stronger than it was in 2016. We speculate this is due to carrier network adoption and making the ecosystem more mature.

QuickAssist Technology acceleration still requires some effort because Intel is not adding it to every chip. Until Intel does so, we expect most software to require an additional step (or much more) to get QAT working.

Intel Atom C3xxx QAT Device

We do have QAT working on the Intel Atom C3958 already and it is enumerated as a different device type than other QAT solutions as can be seen in the screenshot.
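
For readers who want to check enumeration on their own system, here is a minimal sketch that scans the Linux sysfs PCI tree for a matching device. The device ID `0x19e2` is an assumption about what the C3xxx QAT endpoint reports, not something taken from the screenshot, so verify it against your own `lspci -nn` output. The function and constant names are hypothetical.

```python
from pathlib import Path

INTEL_VENDOR_ID = "0x8086"
# Assumption: 0x19e2 is the PCI device ID of the Atom C3xxx QAT endpoint.
# Confirm against `lspci -nn` on your own platform before relying on it.
ASSUMED_C3XXX_QAT_IDS = {"0x19e2"}


def find_pci_devices(vendor, device_ids, root=Path("/sys/bus/pci/devices")):
    """Return PCI addresses under `root` whose vendor/device IDs match."""
    matches = []
    if not root.is_dir():
        return matches
    for dev in sorted(root.iterdir()):
        try:
            dev_vendor = (dev / "vendor").read_text().strip().lower()
            dev_id = (dev / "device").read_text().strip().lower()
        except OSError:
            # Skip entries without readable vendor/device attributes.
            continue
        if dev_vendor == vendor and dev_id in device_ids:
            matches.append(dev.name)
    return matches


if __name__ == "__main__":
    found = find_pci_devices(INTEL_VENDOR_ID, ASSUMED_C3XXX_QAT_IDS)
    print(found if found else "no matching device found")
```

An empty result only means no device with the assumed ID was found. It does not say whether a QAT driver is loaded or whether QAT is enabled in the BIOS.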

Gigabyte MA10 ST0 IQAT In BIOS

The iQAT can be disabled if that is desired. We wish Intel added this to every chip so it became automatic in applications as that would help QAT support considerably.

At the same time, there are only three reasons you would get a C3958 over a C3955: QAT support, extended lifecycle, and if a specific platform you wanted to use did not have a C3955 option. That makes the QAT support a significant piece of the puzzle.

Intel Atom C3958 Key Stats

Key stats for the Intel Atom C3958: 16 cores / 16 threads, 2.0GHz. Unlike the C3955, the C3958 does not feature turbo boost, so 2.0GHz is also the maximum speed. The CPU has a 31W TDP. It also features a full 20 high-speed I/O lanes and 4x 10GbE, making it top-bin in terms of features for QuickAssist parts. These chips are not socketed, so end customer pricing will include a motherboard at a minimum. The CPU alone has a 1K unit tray price of $449. Virtualization features such as VT-d and SR-IOV are supported on this generation. Here is the ARK page for the CPU.

Also, for our readers who want to see feature flags, here is the Linux lscpu output:

Intel Atom C3958 Lscpu

Test Configuration

Our test configuration is very similar to what we used for our Intel Atom C2000 series reviews.

  • Motherboard: Gigabyte MA10-ST0
  • CPU: Intel Atom C3958
  • RAM: 4x 16GB DDR4-2400 RDIMMs (Micron)
  • SSD: Intel DC S3710 400GB
  • Boot device: Intel DC S3700 200GB

We are using the Gigabyte MA10-ST0 for our test platform. This is an absolutely stunning storage server solution with 16x SATA ports and onboard 10Gb SFP+ networking.

Gigabyte MA10 ST0 Top

The board comes with onboard 32GB eMMC storage from Kingston. For an embedded system this is an awesome feature. On this platform we expect the eMMC to be used as a boot device rather than a more expensive SATA DOM. The four SFF-8087 ports mean that using a SATA DOM is not easy on the platform in either case, but they provide easy connectivity to storage backplanes.

We will have a full review of the Gigabyte MA10-ST0 soon, but for those wondering, the maximum power consumption with 2x 10Gb SFP+ links (SR optics) and 2x 1GbE links we have seen is around 61W. We will publish formal figures with our platform reviews but this is certainly a solid low-power platform for the performance and connectivity you are getting.

Intel Atom C3958 Benchmarks

For this exercise, we are using our legacy Linux-Bench scripts which help us see cross-platform “least common denominator” results. We do have a full set of expanded benchmarks from our next-gen test suite (Linux-Bench2) which you may see in other STH reviews that include this chip. The target market of the Intel Atom C3958 is on embedded applications making the original tests more useful. Generally, embedded applications such as storage controllers and networking appliances will not see heavy workloads where AVX2 / AVX-512 will be useful.

From what we saw in the Intel Atom C2000 series, there are only two OSes that matter for these embedded parts: Linux and FreeBSD. OSes like Windows have a negligible market share on these platforms and we would not recommend using an Atom C3000 series as a desktop. There are many offerings in the market more appropriate for that use case.

Python Linux 4.4.2 Kernel Compile Benchmark

This is one of the most requested benchmarks for STH over the past few years. The task is simple: we take the Linux 4.4.2 kernel from kernel.org along with a standard configuration file, and build the standard auto-generated configuration utilizing every thread in the system. We express results in compiles per hour to make them easier to read.
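
As a rough illustration of the compiles-per-hour metric, here is a minimal sketch. It is not the Linux-Bench script itself: the helper names are hypothetical, and the `make -j<threads>` invocation is an assumption about how one might use every thread in the system.

```python
import os
import subprocess
import time


def compiles_per_hour(elapsed_seconds: float) -> float:
    """Convert one build's wall-clock time into compiles per hour."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return 3600.0 / elapsed_seconds


def time_build(source_dir: str) -> float:
    """Time a parallel `make` in source_dir and return elapsed seconds.

    Assumes the kernel tree is already unpacked and configured.
    """
    threads = os.cpu_count() or 1
    start = time.perf_counter()
    subprocess.run(["make", f"-j{threads}"], cwd=source_dir, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical timing for illustration, not a result from this review.
    example_seconds = 240.0
    print(f"{compiles_per_hour(example_seconds):.1f} compiles/hour")
```

Expressing the result as a rate means a higher number is better, which keeps the chart consistent with the other throughput-style benchmarks.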

Intel Atom C3958 Linux Kernel Compile Benchmark

Here we see solid performance, though not quite up to what the compute-focused Intel Atom C3955 can offer. Keen eyes will place performance around that of an 8-core Xeon D part. The microarchitecture difference is going to highlight some bigger performance differences than we would otherwise be accustomed to in our other tests.

c-ray 1.1 Performance

We have been using c-ray for our performance testing for years now. It is a ray tracing benchmark that is extremely popular to show differences in processors under multi-threaded workloads.

Intel Atom C3958 C Ray Benchmark

Here you can see solid performance due to having more cores and L1 cache. The Intel Xeon E3 line is not really a competitor as it lacks the features of the Atom and has significantly higher power consumption.

7-zip Compression Performance

7-zip is a widely used compression/decompression program that works cross-platform. We started using the program during our early days with Windows testing. It is now part of Linux-Bench.

Intel Atom C3958 7zip Compression Benchmark

There is a fairly large chasm between the 16 core Atom C3000 series part and the 16 core Xeon D part. This compression is not using QAT offload which we will have more on soon. We also sorted the chart based on compression speed which puts the Intel Atom C3958 between the six and eight core Xeon D low power parts. Decompression sort would have put it between the eight and twelve core Xeon D parts. That is solid performance either way.

Sysbench CPU test

Sysbench is another one of those widely used Linux benchmarks. We specifically are using the CPU test, not the OLTP test that we use for some storage testing.

Intel Atom C3958 Sysbench CPU Benchmarks

We had to remove the 2-core Atom CPUs such as the C2358 and D525 from this list as those generations made this chart borderline unreadable. This test tends to favor many cores and scales strongly with core count, which is why the C3958 performs so well here.

OpenSSL Performance

OpenSSL is widely used to secure communications between servers. It is an important component in many server stacks. We first look at our sign tests:

Intel Atom C3958 OpenSSL Sign Benchmark

We also have the verify results sorted in the same order to make comparison easier.

Intel Atom C3958 OpenSSL Verify Benchmark

Here we see the Intel Atom C3958 competitive with the Xeon Silver 4108. The Intel Xeon Silver 4108 is a similarly priced part for higher-power, more expandable Xeon Scalable servers. The other key point to look at here is the generational improvement. The Intel Atom C2758 was the top-end Rangeley generation Intel Atom C2000 series SKU with QuickAssist. Even without leveraging QAT, the top-bin performance has increased 4x on this test. OpenSSL is a key metric for these parts as they are commonly used in network and storage appliances.

UnixBench Dhrystone 2 and Whetstone Benchmarks

One of our longest running tests is the venerable UnixBench 5.1.3 Dhrystone 2 and Whetstone results. They are certainly aging, however, we constantly get requests for them, and many angry notes when we leave them out. UnixBench is widely used so we are including it in this data set. Here are the Dhrystone 2 results:

Intel Atom C3958 UnixBench Dhrystone 2 Benchmark

Here are the Whetstone results:

Intel Atom C3958 UnixBench Whetstone Benchmark

Having a lot of cores makes up for some of the microarchitecture trade-offs made to keep power consumption low. Still, we see some solid performance out of this part.

Final Words

Gone are the days of the “wimpy” Atom. The Atom C3958 sports a low clock speed (2.0GHz) and does not have turbo boost, L3 cache, or higher-end features such as AVX2 / AVX-512 support. Yet with 1MB L2 cache per core, massive IPC improvements, and 16 cores, the Intel Atom C3958 is competitive with the Xeon D and Xeon Bronze / Silver lines in terms of performance. Although the Xeon lines are better for virtualization and general purpose compute, for most networking and storage appliances this is a very fast chip.

On the competitive side, there is a lot of talk about AMD EPYC in the market. AMD does not yet have a competitive offering in this segment since even the EPYC 7251 is a 120W TDP CPU before adding any other component to the system, or about 2x what we are seeing an entire configured Gigabyte MA10-ST0 test system pull at the outlet. To be fair, AMD is not targeting this market with EPYC. Likewise, ARM has made lots of noise, but the Intel Atom C3958 provides a solid mix of core performance and acceleration for crypto and compression. The Intel Atom C3000 series is certainly enough to hold current ARM offerings at bay for the near-term future.

Looking at the top-end QAT SKU from this generation versus the previous generation (Atom C2758), one can see that the lineup has significantly expanded its market coverage at the top end. Clock speeds are down ~17%, but that is the only area where we are seeing specs decline. Core count has doubled from 8 to 16 cores. Cache size and RAM capacity have quadrupled to 16MB and 256GB respectively. Networking is effectively 10x the speed of the previous generation. PCIe and SATA have moved up a generation and greatly expanded in numbers. TDP is up 55% to match the massive performance and platform upgrades. At the same time, pricing is now much higher, up around 116%. Of course, Intel has parts like the Atom C3758 which address a similar market segment to the previous top-of-the-line part, but it shows how Intel is allowing the Intel Atom C3000 line to creep up higher in the performance stack.
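
The generational percentages can be reproduced with a quick calculation. This is a sketch under stated assumptions: the C3958 figures come from this review, while the Atom C2758 clock speed (2.4GHz), TDP (20W), and tray price ($208) are not stated here and are assumed from Intel's published specs. The C2758 cache and RAM values are back-derived from the "quadrupled" figures.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change going from old to new."""
    return (new - old) / old * 100.0


# (Atom C2758, Atom C3958). C2758 clock, TDP, and price are assumptions
# drawn from Intel's public specs; they are not stated in this article.
specs = {
    "clock_ghz": (2.4, 2.0),
    "cores": (8, 16),
    "cache_mb": (4, 16),
    "max_ram_gb": (64, 256),
    "tdp_w": (20, 31),
    "tray_price_usd": (208, 449),
}

for name, (old, new) in specs.items():
    print(f"{name}: {old} -> {new} ({pct_change(old, new):+.0f}%)")
```

Note that "quadrupled" corresponds to a +300% change in this formulation, while "doubled" is +100%.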

Overall, this is an enormous generational upgrade in performance, but we expect the Intel Atom C3958 to be a lower volume part given its hefty price tag. At $449 for the CPU it is competing with the Intel Xeon Silver 4108 and Xeon D lines.

If you want to learn more, we have complete coverage at our Denverton Day Official STH Intel Atom C3000 Launch Coverage Central.

11 COMMENTS

  1. I’m very confused. Where are the test results of benchmarks that actually use QAT here? I was expecting something in the SSL benchmarks but according to your numbers the non-QAT C3955 is faster than this chip that has QAT?

    It would be nice to see an article that focuses heavily on the QAT feature of this chip actually use that feature in some tests.

  2. @Don,

    In 2016 we published a few articles on using QuickAssist with OpenSSL and for 40GbE VPN acceleration. Since then, Intel has launched a 100Gbps QAT version and has built QAT into the burgeoning Intel Xeon SP Lewisburg PCH options. We will have some cool QAT results soon. For now, a few notes:

  3. Something that I saw when I went to Gigabyte’s website… using the PCIe slot disables 2 of the 4 SFF-8087 ports.

    I like the plethora of storage, but I’m not sure I have a need of the QAT… if I could find a use-case, this might be interesting. And I’m not sure I like just the SFP+ ports since I haven’t tried running FO wires thru residential walls yet. A copper option would be nice.

  4. How board. Price? Full review?

    All the commentary around QA was helpful. I don’t know when STH started doing it but I like this new direction.

    Your comparison dataset is real useful since that’s all the competition almost. I can extrapolate more data points.

  5. @Eddie – If you want copper then get a 10Gb copper SFP+ transceiver.

    I would trade the QAT for a GPU on it for transcoding any day but this isn’t what these things are aimed at.

  6. @Goose – I’ve seen those but haven’t tried to figure out what they cost yet. Depending on the mobo manufacturer, C3xxx SKUs can offer either, both or neither of the 10GbE standards. Take the Supermicro they reviewed earlier, I think it had 2x of each. Of course that one only offered 12 SATA ports vs this one.

  7. Please redo the openssl tests enabling QAT, to check if QAT engine is available run:
    openssl engine
    Testing:
    openssl speed -engine qat -elapsed -multi 2 -evp aes-128-cbc-hmac-sha1
    (more info in 01.org/intel-quickassist-technology)

  8. I noticed every single motherboard doesn’t offer RAID capability as opposed to Intel C236/238 chipsets which do offer at least some basic ones.

    That means additional investment and +10/15W of power consumption, which in case of RAID 1 (which is only thing I need) is really questionable why should I pick this platform instead of XEON 45W one. For 5W of power savings in best case scenario?

    Really? Can someone elaborate how are you sorting RAID, especially RAID 1 and RAID 10 with say 8 disks on low power server. Purchasing £400 pound RAID card and using additional *precious* PCIe slot (which are rare on mini-ITX) while consuming more power when ALL THAT comes for free in case of XEON C236 solution, I don’t see a point.

    Please elaborate!

  9. Karol – I would suspect very few Atom C3958 users are going to use hardware RAID. Even fewer will use chipset RAID. Most applications will use software RAID which is great on this platform.
