Cavium ThunderX2 Review and Benchmarks a Real Arm Server Option

Cavium ThunderX2 Chip In Socket

At STH, we firmly believe that alternative architectures help spur the technology industry’s innovation. The new entrant for 64-bit Arm servers is the Cavium ThunderX2. We have long held that Cavium is the only vendor publicly selling an Intel Xeon alternative. With the ThunderX2, there is now a dual socket capable 64-bit Arm CPU that has up to 32 cores and 128 threads in each socket. The Cavium ThunderX2 that we see today has its origins in Broadcom’s Project Vulcan, so many of the features we saw in the Cavium ThunderX, such as 40GbE ports, are not present. Instead, we have an Arm chip that can go toe-to-toe with Intel and AMD and come out ahead in some cases. Best of all, the list price of the 32 core top-bin CN9980 part is $1795, about half that of the competitive Intel and AMD chips.

In this article, we are going to take you on a comprehensive journey exploring different aspects of the Cavium ThunderX2. We are going to look at how the ecosystem and platforms have evolved and why ThunderX2 is usable by a broader set of organizations than previous generations. There is a set of performance benchmarks where we explore how the 256 threads in a dual ThunderX2 system perform against Intel Xeon and AMD EPYC. We also have a few numbers exploring the SKU stack in terms of 24, 28, 30 and 32 core versions. Finally, we are going to end with a look at power consumption and the competitive landscape. Suffice it to say, grab a cup of coffee and dive in.

Previous Cavium ThunderX2 Pieces

We have covered the ThunderX2 for some time so for our readers, the launch of ThunderX2 may seem like déjà vu. Here is a sample of ThunderX2 coverage on STH to date:

Cavium ThunderX2 CPU and SKU Stack

While the original ThunderX was a BGA design, the Cavium ThunderX2 comes in both BGA and LGA form factors. The impact is tangible. It can be deployed soldered onto motherboards as with previous generations or now as a socketed part. That has major advantages for the supply chain as it allows a server OEM to stock platforms then socket ordered CPUs as needed. With this generation, Cavium has a SKU stack that can interchangeably utilize a standard socket. Here is a picture of that socket:

Cavium ThunderX2 Socket

Just to give a sense of scale, here is the ThunderX2 package alongside four of the most popular x86 package types today: AMD EPYC, Xeon Scalable, Xeon E5-2600 V3/V4 and Xeon E5-2600 V1/V2.

AMD EPYC 7000 Cavium ThunderX2 Intel Xeon Scalable And E5 V1 V4

One can see that AMD and Intel have components on the bottom of the socket and a very interesting pin/pad layout. The square bottom of a ThunderX2 package is just a large pad grid.

AMD EPYC Cavium ThunderX2 Intel Xeon Scalable Pins

When taking the above photo, one thing became clear: ThunderX2 is significantly slimmer than its counterparts, and we are using the 32 core ThunderX2 in this photo.

AMD EPYC Cavium ThunderX2 Intel Xeon Scalable Thickness

Cavium ThunderX2 Features, SKUs and Specs

The Cavium ThunderX2 is very competitive with both Intel Xeon Scalable and AMD EPYC 7000 series parts in terms of performance, but also in terms of features. Here is the key features overview of the Cavium ThunderX2:

Cavium ThunderX2 Key Features

There are up to 32 cores and 128 threads per socket. Unlike competitive Arm development chips, ThunderX2 is a dual socket capable design and indeed, we tested a dual socket server with a total of 64 cores and 256 threads. Cache is 32KB L1 and 256KB L2 per core, plus a 32MB distributed L3 cache. Cavium also has a 600Gbps interconnect (CCPI2). Interconnects are hard, especially with multi-socket designs. That is a key feature that separates ThunderX2 engineering from some of the single-socket only Arm options.
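Since thread counts come up repeatedly in this review, here is a minimal sketch of the arithmetic behind them. It uses only the figures quoted in this article (32 cores per socket, 4 threads per core, one or two sockets). The function name `hardware_threads` is ours and purely illustrative.

```python
def hardware_threads(sockets: int, cores_per_socket: int, smt_ways: int) -> int:
    """Return the number of hardware threads an OS would see."""
    return sockets * cores_per_socket * smt_ways


# Figures quoted in this article: 32 cores per socket, 4-way SMT.
per_socket = hardware_threads(sockets=1, cores_per_socket=32, smt_ways=4)
dual_socket = hardware_threads(sockets=2, cores_per_socket=32, smt_ways=4)
# With SMT off, each core exposes a single thread.
dual_socket_smt_off = hardware_threads(sockets=2, cores_per_socket=32, smt_ways=1)

print(per_socket, dual_socket, dual_socket_smt_off)
```

With these inputs, a single socket exposes 128 threads, a dual socket system exposes 256, and the same dual socket system with SMT off exposes 64.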

Cavium ThunderX2 Architecture Block Diagram

Memory bandwidth is excellent, with up to 8x DDR4-2666 memory controllers per socket, which is equivalent to AMD EPYC and more than Intel Xeon Scalable. These memory channels even support RAS features and NVDIMMs.
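To put the channel count in perspective, here is a rough sketch of theoretical peak memory bandwidth per socket. It assumes a standard 64-bit (8 byte) DDR4 data bus per channel and treats DDR4-2666 as 2666 MT/s. The six channel figure for Intel Xeon Scalable is our assumption for comparison, not a number from the slide. These are ceilings, not measured STREAM results, and the helper name is hypothetical.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (decimal) for one socket.

    Assumes every channel moves bus_bytes per transfer (8 bytes for a
    64-bit DDR4 channel, ignoring ECC bits).
    """
    return channels * mega_transfers_per_s * bus_bytes / 1000


# Eight DDR4-2666 channels, as on ThunderX2 (and AMD EPYC per the text).
eight_channel = peak_bandwidth_gbs(channels=8, mega_transfers_per_s=2666)
# Six DDR4-2666 channels, our assumption for Intel Xeon Scalable.
six_channel = peak_bandwidth_gbs(channels=6, mega_transfers_per_s=2666)

print(f"8 channels: {eight_channel:.1f} GB/s")
print(f"6 channels: {six_channel:.1f} GB/s")
```

Under these assumptions, eight channels work out to roughly 170 GB/s per socket versus roughly 128 GB/s for six, which is the gap the channel count advantage refers to.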

Cavium ThunderX2 Memory Capabilities

PCIe support extends up to PCIe 3.0 x16 slots, with a total of 56 PCIe 3.0 lanes. One can bifurcate the PCIe lanes down to x1, and there are a total of 14 PCIe controllers for system vendors to utilize. Other features like SR-IOV are supported, which helps maintain parity with the x86 ecosystem.
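As a rough illustration of what 56 lanes means, the sketch below estimates theoretical PCIe 3.0 throughput per direction. It assumes the standard PCIe 3.0 signaling rate of 8 GT/s per lane with 128b/130b encoding and ignores protocol overhead, so real devices will see less. The function name is ours.

```python
def pcie3_bandwidth_gbs(lanes: int) -> float:
    """Theoretical PCIe 3.0 bandwidth in GB/s, per direction.

    Assumes 8 GT/s per lane and 128b/130b line encoding, and ignores
    packet and protocol overhead.
    """
    giga_transfers_per_s = 8.0
    encoding_efficiency = 128 / 130
    bits_per_byte = 8
    return lanes * giga_transfers_per_s * encoding_efficiency / bits_per_byte


x16_slot = pcie3_bandwidth_gbs(16)    # a single x16 slot
all_lanes = pcie3_bandwidth_gbs(56)   # all 56 lanes on one socket

print(f"x16 slot: {x16_slot:.1f} GB/s per direction")
print(f"56 lanes: {all_lanes:.1f} GB/s per direction")
```

That works out to just under 16 GB/s per direction for an x16 slot and about 55 GB/s per direction across all 56 lanes of a socket.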

Cavium ThunderX2 IO Capabilities

PCIe is a big deal since it allows for the platform to be utilized with high-speed devices like GPUs, FPGAs, NVMe SSDs, and high-speed networking. This level of connectivity puts the ThunderX2 squarely between the AMD EPYC and Intel Xeon Scalable lines which is a major achievement in itself.

In terms of actual launch SKUs, Cavium’s list has around 40 SKUs ranging from 16 to 32 cores. The company sent us specs for five of them, ranging from 24 to 32 cores and 96 to 128 threads. Cache comes in at 32MB per chip.

Selected Cavium ThunderX2 Launch SKUs

If you want to see the full SKU stack and its positioning relative to Intel Xeon Scalable, from Cavium’s point of view, here is the current SKU stack we were provided with:

Cavium ThunderX2 SKU Stack

We are going to go into how each Cavium ThunderX2 core can handle 4 threads, the performance of the chip, and the power consumption soon. The old adage that more is usually, but not always, better holds true here, as does the saying “TDP does not equal power consumption.” We are going to get to that, but we are going to first set the stage in terms of the context behind why the Cavium ThunderX2 is the most important Arm data center release this year.

22 COMMENTS

  1. I’m through page 3. I’m loving the review so far but I need to run to a meeting.

  2. Looks like a winner. Are you guys doing more with TX2?

    It’s crazy Cavium is doing what Qualcomm can’t. All that money only to #failhardcentriq

  3. Cool chart with the 24 28 30 and 32 core models

    Cavium needs to fix their dev station pricing. $10k+ for two $1800 cpus in a system is too much. Their price performance is undermined by their system pricing

  4. Read the whole thing, very impressed with the TX2 performance and pricing, think i’m going to try one out. But was a bit bummed out when i found out on page 8, the most important thing, power usage, wasn’t properly covered and compared to the Intel and AMD systems :(

  5. Welcome competition, always good to see that there is pressure on the market leader.
    Microsoft is also working on an ARM version for windows, so this can go the right way…

  6. I’m very confused by some of what you wrote and the exact testing setups of these platforms is extremely unclear. To cite just one example of Linpack test where you state:

    “Our standard is to run with SMT on since that is what most of the non-HPC environments look like. This is a case where having 256 threads simply is too much. We also ran the test with 32 threads per CPU, or SMT off which yielded a solid improvement. ”

    On a 4-way SMT system you get 256 threads by operating 64 cores. You claim that the CPU only has at most 32 cores and in the same statement you re-tested at 32 threads. So…. what exactly did you test? A 32 core CPU that *cannot* have 256 threads? A two-socket 64-core system that can have 256 SMT threads but that was then dropped to a single-socket configuration with only a single 32 core processor?

    Please put in a clear and unambiguous table that provides the *real* hardware configurations of all the test systems.
    That means:
    1. How many sockets were in-use. Were *ALL* the systems dual socket? All single socket? A mixture? I can’t tell based on the article!

    2. Full memory configuration. Yes I know about the channel differences, but what are the details.

    3. That’s just a start. The article jumps from vague slides about general architecture to out-of-context benchmark results too quickly.

  7. Competition in the server industry is great!

    Don – 32 threads per cpu means 64 threads total right? 2x 32 isn’t that hard.

  8. I don’t think this convinced me to buy them. But I’ll at least be watching arm servers now. We run a big VMware cluster so I’d have a hard time convincing my team to buy these since we can’t redeploy in a pinch to our other apps.

  9. We’ll be discussing TX 2 at our next staff meeting. Where can we get system pricing to compare to Intel and AMD?

  10. Can you do more about using this as Ceph or ZFS or something more useful? Can you HCI with this?

    Love the write-up. You guys have grown so much and it shows in how much you’re covering on this which is still a niche architecture in the market.

  11. Nice write-up, with plenty of details, on the newly launched. Congrats to Cavium.

    Cavium Arm server processor launch, suddenly Microsoft shows up and reiterates it still wants >50% of data center capacity to be Arm powered. And it’s loving Cavium’s Thunder X2 Arm64 system. Together designed two-socket Arm servers…

  12. Looks like cavium is taking on Intel with armv8 workstation. Same processor as used by cray. Interesting. Comparing to Xeon ThunderX2 is good in all aspects like performance, band width, No.of cores, sockets, power usage etc.

    Competition in silicon is good for the market.

  13. CaviumInc steps up with amazing 2.2GHz 48-core ThunderX2 part, along with @Cray and @HPE Apollo design wins, and @Microsoft and @Oracle SW support. Early days for #ARM server, but compelling story being told.

    ThunderX2 Arm-based chips are gaining more firepower for the cloud.

    The Qualcomm Centriq 2400 motherboard had 12 DDR4 DIMM slots and a single >> 48 core CPU.

    The company also showed off a dual socket Cavium ThunderX 2. That system had over >> 100 cores and can handle gobs of memory

    “With list prices for volume SKUs (32 core 2.2GHz and below) ranging from $1795 to $800, the ThunderX2 family offers 2-4X better performance per dollar compared to Qualcomm Centriq 2400 and Xeon…”

    Cavium continues to make inroads with the ThunderX2 @Arm-compatible platform..

  14. Nice Coverage. 40 different versions of the chip that are optimized for a variety of workloads, including compute, storage and networking. They range from 16-core, 1.6GHz SoCs to 32-core, 2.5GHz chips ranging in price from $800 to $1,795. Cavium officials said the chips compete favorably with Intel’s “Skylake” Xeon processors and offer up to three times the single-threaded performance of Cavium’s earlier ThunderX offerings.

    The ThunderX2 SoCs provide eight DDR4 memory controllers and 16 DIMMS per socket and up to 4TB of memory in a dual-socket configuration. There also are 56 lanes of integrated PCIe Gen3 interfaces, 14 integrated PCIe controllers and integrated SATAv3, GPIOs and USB interfaces.

    Kudos to Cavium…

  15. Those power numbers look horrendous. A comparable intel system would be less than half that draw. In fact, 800W is the realm 2P IBM POWER operates in. I get that it’s unbinned silicon and not latest firmware but I can’t see all that accounting for ~50-75W tops. My guess is Broadcom didn’t finish the job before it was sold to Cavium, and if Cavium had to launch it now lest they come up against the next x86 server designs (likely starting to sample late 2018).

    I guess when Patrick gets binned silicon with production firmware, he’ll also have to redo the performance numbers because it’s quite possible that the perf numbers will likely take some hit. 800W! At least it puts paid to the nonsense about ARM ISA being inherently power efficient. Power efficiency is all about implementation.

  16. The performance looks quite good, but yeah the 800W are a show stopper…
    The xeons and epyc processors consume way less than that.
    I doubt they can get to the power consumption of the xeons and epyc without lowering quite a lot the max frequency and voltages accordingly. If they can do it, then that’s great. But I have some doubts.

  17. For the STREAMS benchmark (“Cavium ThunderX2 Stream Triad Gcc7”) I assume the Intel compiler is leveraging the FMA instructions, giving them the boost in performance.

  18. RuThaN – how would you propose performance per dollar? All SKUs used in the performance parts have list prices that are easy to get. Discounts, of course, are a reality in enterprise gear. The ThunderX2 is sub $1800 which is by far the least expensive.

    Beyond the chips, what system/ configuration are you using them in? How do you factor in the additional memory capacity of ThunderX2 versus Skylake-SP, will that mean fewer systems deployed?

    What cost for power/ rack/ networking should we use for the TCO analysis?

    I do not think that performance/ dollar at the CPU level is a metric those outside of the consumer space look at too heavily versus at least at the system cost. For example, this is a fairly basic TCO model we do: https://www.servethehome.com/deeplearning11-10x-nvidia-gtx-1080-ti-single-root-deep-learning-server-part-1/

  19. Failure to publish measured power during *every* benchmark run is evasive. This is critical data, for the spread of workloads, and allows calculating energy efficiency.
    Please be honest and report the data. Caveats are fine but failure to report is not fine.

  20. Richard Altmaier – thank you for sharing your opinion. There are two components for sure, performance and power consumption. Both are certainly important, but for this review, performance seemed ready, power did not due to a variety of noted factors.

    As mentioned, the test system we have is fairly far from what we would consider comparable to the AMD/ Intel platforms that have been in our labs for more than a year. We do enough of these that it is fairly easy to see that power is higher than it should be. We do not want to publish numbers we are not confident in, lest they get used by competitors.

    We also mentioned that there will be a follow-up piece to this. The other option was to publish zero power numbers. Despite your opinion, performance alone is a compelling story. Unlike the x86 side, the ARM side has never had a platform that can hit this level of performance which makes the raw performance numbers quite important themselves.

    BTW – There was a well-known Intel executive also named Richard Altmaier.

  21. Would love to see the commands used to generate these results, especially on STREAM on the 8180. I’ve not seen more than ~92GB/s with 768GB installed across all 6 channels with OpenMP parallelization across all 56 threads…
