Cavium ThunderX2 Socket Performance
When looking at Cavium ThunderX2 performance, there are a few key concepts to address. Although the Arm ecosystem has come a long way, the x86 ecosystem has generations of software optimizations behind it, including compilers such as Intel's icc. Those compilers exist, but on the Linux side, gcc is still the most popular (and free/open source) compiler out there. Also, as we saw when AMD re-entered the market with the EPYC 7000 series, alternatives to Intel do not necessarily need to be the best across every workload. They need some workloads where they perform well, and they need ballpark performance on the rest. With that context, we are going to look at gcc results versus vendor-compiler-optimized results, as well as some comparative gcc numbers.
The first workload we wanted to look at is SPECrate2017_int_peak performance. Specifically, we wanted to show the difference between published Intel Xeon icc and AMD EPYC AOCC results and the ThunderX2 with gcc. We expect that Cavium will get slightly better results when it actually publishes, but this should give you an idea of where dual-socket results land for each option. First, we have the compiler-optimized results:
Here you can see that the Intel and AMD results are very strong using custom compilers. Again, the ThunderX2 CN9980 is about half the cost per CPU of the Intel and AMD options, so even against highly tuned compilers, the ThunderX2 is price competitive.
Officially, AMD and Intel do not support our efforts to run SPEC CPU2017 on gcc since their optimized compilers give vastly superior results. Some say these compilers have benchmark-specific optimizations, but we are going to leave that debate for elsewhere. Here is what happens when we run the same tests using gcc at the -Ofast optimization level, not the -O2 that many Arm vendors like to show:
With gcc, the ThunderX2 CN9980 rises to the top. That is impressive because, while these are not official published results, they are likely more representative of the many workloads that are built with gcc. gcc may not produce the fastest binaries, but it is the open source compiler that has seen the widest adoption in the market.
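To make the -O2 versus -Ofast distinction concrete, here is a minimal sketch (our own illustration, not SPEC code) of the kind of loop where the two levels diverge:

```c
/* reduction.c - a minimal sketch of why -Ofast and -O2 can diverge.
 * Under -O2, gcc must preserve the serial order of the floating-point
 * additions, so this loop generally stays scalar. -Ofast implies -O3
 * plus -ffast-math, which permits reassociation and lets the reduction
 * be vectorized across SIMD lanes.
 *
 *   gcc -O2    -c reduction.c   # scalar loop
 *   gcc -Ofast -c reduction.c   # vectorized loop
 */
#include <stddef.h>

double sum(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i]; /* reassociation is needed to vectorize this reduction */
    return s;
}
```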
Cray has its own compiler that works with the ThunderX2 and provides better results than our gcc numbers, but we do not have access to it. We also expect officially published SPEC CPU2017 results to be slightly higher than what we are getting.
STREAM Triad Memory Bandwidth
When we look at memory bandwidth, STREAM by John D. McCalpin is what the industry uses (learn more here). Since it is an industry-standard workload, it is one where we see compiler optimizations. Here is what a similar view looks like for compiler-optimized STREAM Triad results:
We want to pause here because these results highlight one important fact: the ThunderX2 is an 8-channel memory controller design running at DDR4-2666 speeds, while the Intel Xeon Gold is only a 6-channel design. Back of the envelope, that is 8 channels × 2,666 MT/s × 8 bytes ≈ 170.6 GB/s of theoretical peak per socket versus roughly 128 GB/s for the Xeon, so even with compiler optimizations, the ThunderX2 remains ahead.
When we look at a baseline from a common compiler, gcc, the picture again changes:
The Cavium ThunderX2 CN9980 now comes out ahead. What is extraordinarily interesting is that the Intel Xeon Gold 6148 results fall by almost exactly 25%. That is consistent with icc finding a way to perform three memory operations instead of four: with ordinary cached writes, Triad actually moves four data streams per iteration (two reads, a read-for-ownership of the destination cache line, and the write back) while STREAM only credits three, so a compiler that avoids that extra read reports correspondingly higher bandwidth.
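For reference, the Triad kernel itself is a single line. This is our minimal sketch of the standard STREAM loop (the array size and scalar are illustrative; the real benchmark also times and verifies the run), with the uncounted fourth stream noted in the comments:

```c
/* triad.c - a minimal sketch of the STREAM Triad kernel. STREAM credits
 * three 8-byte streams per iteration: read b[j], read c[j], write a[j].
 * With ordinary stores, the CPU also performs a read-for-ownership of
 * a[j]'s cache line, a fourth, uncounted stream. A compiler that emits
 * non-temporal (streaming) stores skips that read, which would explain
 * the roughly 25% gap we see between the icc and gcc results.
 */
#include <stddef.h>

#define N 10000000 /* 10M doubles per array, far larger than any cache */

static double a[N], b[N], c[N];

void triad(double scalar) {
    for (size_t j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];
}
```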
Cavium ThunderX2 Linpack Performance
Linpack is perhaps the most widely used HPC benchmark, so we wanted to give some idea of where the ThunderX2 stands. We also want to caveat this on two counts. First, the HPC space is rife with custom compilers. Here, using tools like icc is essentially a must, and since we do not have access to Cray's Arm compiler, we are using gcc for the ThunderX2. Second, we are using 32 threads per CPU in this benchmark for the ThunderX2 since 4-way SMT hurts performance, and HPC shops that run these types of workloads run with SMT off.
The ThunderX2 performs well. AVX-512 and icc are simply excellent here, which helps the Intel platforms significantly. Cavium does well, but the company let us know that we have room for improvement. Our standard is to run with SMT on since that is what most non-HPC environments look like, and this is a case where having 256 threads is simply too many. We also ran the test with 32 threads per CPU, effectively SMT off, which yielded a solid improvement. The flip side is that we did not have the same level of optimized binaries that some of the custom Arm compilers, e.g. Cray's, would provide for this test. We expect that HPC sites focused on Linpack will have access to vendor tools and therefore get better figures than we show here.
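To illustrate the SMT sensitivity, here is a hedged sketch of the kind of probe one can use to compare thread counts. This is not HPL; the file name, kernel, and matrix size are our own illustrative stand-ins:

```c
/* flops_probe.c - an illustrative probe for comparing thread counts.
 * Build: gcc -Ofast -fopenmp flops_probe.c -o flops_probe
 * Run:   OMP_NUM_THREADS=64  ./flops_probe   # 32 threads/CPU, SMT-off style
 *        OMP_NUM_THREADS=256 ./flops_probe   # full 4-way SMT, two sockets
 * Pair with OMP_PLACES=cores OMP_PROC_BIND=close to pin one thread per core.
 */
#include <omp.h>
#include <stdio.h>

#define N 1024 /* small DGEMM-style kernel, illustrative only */

static double a[N][N], b[N][N], c[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0;
            b[i][j] = 2.0;
            c[i][j] = 0.0;
        }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)           /* naive matrix multiply */
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
    double t1 = omp_get_wtime();

    /* a matrix multiply performs 2*N^3 floating-point operations */
    printf("threads=%d  %.1f GFLOPS\n",
           omp_get_max_threads(),
           2.0 * N * N * (double)N / (t1 - t0) / 1e9);
    return 0;
}
```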