Today we are launching Part 1 of our performance series on the dual AMD EPYC 7601 processors. This is going to be the first of a multi-part series. For those wondering why it took this long, we discussed the delay a bit in our AMD EPYC Infinity Fabric Latency DDR4 2400 v 2666: A Snapshot piece: we have been waiting for our test platforms to mature to the point that we feel confident our numbers will resemble what our readers will see if they buy systems.
With that said, our expanded benchmark suite runs are automated but take 10+ days to complete, especially with the higher-end applications. We also wanted to gather enough data to compare systems that will actually compete in the marketplace. Comparing a $10,000 Intel Xeon Platinum and a $4,200 AMD EPYC CPU is not overly useful since they are focused on different market segments. Every day that goes by we are collecting a significant amount of data on AMD and Intel platforms, but this process does take time. Just to give our readers a sense, at any given time we are now using about 10kW in the data center testing this new generation of gear.
The AMD EPYC 7601 is a great chip and an absolute monster in terms of raw specs. It has 32 cores and 64 threads. L3 cache measures 64MB. There are eight DDR4-2666 memory channels, each capable of two DIMMs per channel, for up to 2TB of RAM per CPU. I/O lanes abound, with 128 (1P) or 64 (2P) lanes per CPU that can be used for PCIe or SATA III. Let us be clear, on a platform level, the AMD EPYC 7000 series is simply awesome.
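As a quick sanity check, the 2TB-per-CPU figure falls out of simple arithmetic, assuming 128GB LRDIMMs in every slot (the DIMM size here is an assumption for illustration):

```python
# Hypothetical sanity check on maximum RAM per EPYC 7601 socket.
channels = 8           # DDR4 memory channels per CPU
dimms_per_channel = 2  # two DIMMs per channel supported
dimm_size_gb = 128     # assumes 128GB LRDIMMs, the largest commonly cited

max_ram_gb = channels * dimms_per_channel * dimm_size_gb
print(max_ram_gb)  # 2048GB, i.e. 2TB per CPU
```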
For our tests we have been using a Supermicro Ultra platform configured as follows:
- System: Supermicro 2U Ultra EPYC Server (AS-2023US)
- CPUs: 2x AMD EPYC 7601 32-core/ 64-thread CPUs
- RAM: 256GB (16x16GB DDR4-2400 or 16x16GB DDR4-2666)
- OS SSD: Intel DC S3710 400GB
- OS: Ubuntu 17.04 “Zesty” Server 64-bit
- NIC: Mellanox ConnectX-3 Pro 40GbE
We ran tests with both DDR4-2400 and DDR4-2666 and stand by our recommendation that anyone purchasing an AMD EPYC 7000 series system use only DDR4-2666.
We also wanted to make a few notes that we thought warranted discussion. First, we do have some professional application benchmarks that run on CentOS/ RHEL. We use CentOS 7.3, but with EPYC we are going to suggest upgrading the kernel. Likewise, Ubuntu 16.04 is the current LTS release, but we suggest using Ubuntu 17.04 “Zesty” or upgrading the kernel to 4.10 or later if you must stay on 14.04 LTS or 16.04 LTS. Our general policy is to use standard Ubuntu LTS and CentOS releases, but AMD EPYC gets more performance from newer software ecosystems. This is normal, and we expect EPYC to see performance gains as software is optimized for the new CPU architecture.
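To check whether an existing install already meets that 4.10+ suggestion, a quick comparison against the `uname -r` string works. This helper is our own illustration, not part of any distribution tooling:

```python
def kernel_at_least(release: str, minimum=(4, 10)) -> bool:
    """Compare a `uname -r` string such as '4.10.0-28-generic' to a minimum."""
    major, minor = release.split(".")[:2]
    # Strip any non-numeric suffix from the minor component.
    minor = minor.split("-")[0]
    return (int(major), int(minor)) >= minimum

print(kernel_at_least("4.10.0-28-generic"))  # True: fine for EPYC
print(kernel_at_least("4.4.0-83-generic"))   # False: upgrade suggested
```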
To give a sense of power consumption, here is what we saw with the platform:
Overall, strong results. We are going to delve more into power consumption as we do the system review, since chassis cooling, power supplies, and platform configuration make such a big difference in today’s overall server power consumption. These numbers are directionally correct and should provide a sense of the platform’s power consumption.
We also expect the Supermicro EPYC platforms to be among the first that are commercially available so it is a reasonable starting point.
Dual AMD EPYC 7601 Benchmarks Part 1
For our testing, we are splitting up the benchmarks into a few different segments. This Part 1 is intended to provide a glimpse into EPYC compared to the legacy tests we have been running for years. While we bent our standard setup in terms of OS and tool chains and re-ran comparative data, we are still running these legacy tests to provide a glimpse of what was. Many organizations have VMs and applications that are going to move directly to EPYC from E5 V1 or V2 servers without much effort in re-tooling.
For Part 2, we will have an expanded comparison set. We will also have applications such as Elasticsearch, Redis (expanded), Ansys/ LS-Dyna results, GROMACS, containerized workloads, greatly expanded OpenSSL testing and more. We have been running the new test set on several dozen configurations to ensure consistency, but we are still building a comparison set. At 10-14 days per run this data simply takes time to build.
For our Part 1 testing, we are using Linux-Bench scripts which help us see cross platform “least common denominator” results.
Python Linux 4.4.2 Kernel Compile Benchmark
This is one of the most requested benchmarks for STH over the past few years. The task is simple: we have a standard configuration file and the Linux 4.4.2 kernel from kernel.org, and we compile the kernel using the standard auto-generated configuration, utilizing every thread in the system. We are expressing results in terms of compiles per hour to make the results easier to read.
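The conversion from a timed run to the compiles-per-hour figure we chart is straightforward; the 120-second compile time below is purely illustrative, not a measured result:

```python
def compiles_per_hour(seconds_per_compile: float) -> float:
    """Convert one kernel compile's wall time into compiles per hour."""
    return 3600.0 / seconds_per_compile

# Illustrative example: a 120 second compile charts as 30 compiles/hour.
print(compiles_per_hour(120))  # 30.0
```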
Here is the key takeaway: use DDR4-2666. Also, in terms of where Intel and AMD are competing, it is the $3200-$4200 price point. We already published our Intel Xeon Gold 6150 benchmarks, which included some EPYC results. As one can see, on this test the 18 core/ 36 thread Intel Xeon Gold 6150 at around $3300 is competitive with the $4200 AMD EPYC 7601 part. We expect that when we finish 2P Gold series benchmarks closer to the $4200 price point, Intel will be competitive with AMD here.
As we are going to see, AMD excels in several of our other workloads. We are going to revisit a subset of these results at the end so keep reading.
c-ray 1.1 Performance
We have been using c-ray for our performance testing for years now. It is a ray tracing benchmark that is extremely popular to show differences in processors under multi-threaded workloads.
In all of these types of benchmarks, AMD is simply awesome. We have seen this since the Ryzen 7 launch. There is a reason AMD uses Cinebench R15 so heavily in its marketing. Incidentally, we did test a quad Intel Xeon Platinum 8180 system that was so fast Cinebench R15 broke.
We started using c-ray in the Sandy Bridge generation, when doing a 4K render seemed big. You will notice that over the years many sites adopted our “hard” test, which I made up by using 4K resolution since that seemed hard at the time. With “hard” starting to get demolished, we are going to add a new 8K class in Part 2 and already have a few dozen configurations, both physical and cloud, finished.
7-zip Compression Performance
7-zip is a widely used compression/ decompression program that works cross platform. We started using the program during our early days with Windows testing. It is now part of Linux-Bench.
Here we see solid results from the AMD EPYC platform. We like the platform’s ability to run existing workloads effectively.
NAMD Performance
NAMD is a molecular modeling benchmark developed by the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. More information on the benchmark can be found here. We are going to augment this with GROMACS in the next-generation Linux-Bench in the near future. With GROMACS we have been working hard to support both Intel’s Skylake AVX-512 and AVX2 on the AMD Zen architecture. Here are the comparison results for the legacy data set:
With a least common denominator view, AMD EPYC does very well here. We are going to preview our GROMACS result near the end of this article. That is frankly more applicable to today’s work in this area.
Sysbench CPU test
Sysbench is another one of those widely used Linux benchmarks. We specifically are using the CPU test, not the OLTP test that we use for some storage testing.
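Sysbench’s CPU test is a prime-verification workload. As a rough sketch of the style of work it does, here is a minimal trial-division prime counter; this is our own illustration, not sysbench’s actual implementation:

```python
def count_primes(limit: int) -> int:
    """Trial-division prime count, similar in spirit to sysbench's CPU test."""
    count = 0
    for n in range(2, limit + 1):
        # n is prime if no divisor in [2, sqrt(n)] divides it evenly.
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

print(count_primes(100))  # 25 primes at or below 100
```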
Here AMD EPYC does very well. We should be getting Xeon Gold 6152’s shortly which should close the gap considerably. We also added a few single socket results to this picture to give additional breadth to the discussion.
OpenSSL is widely used to secure communications between servers. This is an important protocol in many server stacks. We first look at our sign tests:
This is one area that will be greatly expanded in Part 2. Frankly, we underestimated the demand for this data when we did the first Linux-Bench tests. We also wanted to make this CPU rather than accelerator bound but will be expanding that viewpoint in the next round.
UnixBench Dhrystone 2 and Whetstone Benchmarks
One of our longest running tests is the venerable UnixBench 5.1.3 Dhrystone 2 and Whetstone results. They are certainly aging, however, we constantly get requests for them, and many angry notes when we leave them out. UnixBench is widely used so we are including it in this data set. Here are the Dhrystone 2 results:
And the Whetstone results:
Overall, EPYC does well and has chart topping performance which we would expect from a 64 core/ 128 thread implementation.
With these results, we wanted to pivot into a conversation that drives a huge portion of the enterprise segment. What kind of per-core performance are we seeing?
A Different View: Per Core and Per Thread Performance
With high core count servers, many are virtualized or running containers. In theory, it should be uncommon for a 64 core/ 128 thread server to rely heavily on single threaded performance. If you have a single thread per system workload, there are better options in the SKU stacks. Instead, we wanted to look at per core performance running multi-threaded workloads.
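The per-core and per-thread views that follow are simple normalizations of the aggregate score. The numbers in this sketch are placeholders to show the math, not our measured results:

```python
def per_core_view(aggregate_score: float, cores: int, threads: int):
    """Normalize an aggregate multi-threaded score by core and thread counts."""
    return aggregate_score / cores, aggregate_score / threads

# Placeholder example: a dual EPYC 7601 system has 64 cores / 128 threads.
per_core, per_thread = per_core_view(128000.0, cores=64, threads=128)
print(per_core, per_thread)  # 2000.0 1000.0
```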
Let us first look at those UnixBench Whetstone results. EPYC was a clear leader in the performance charts.
Here we see a general trend around the Intel chips performing a bit better on a per core/ thread basis. The Dhrystone 2 results:
Here you can see that EPYC’s commanding lead shifts significantly when it comes to per core performance. Directionally, these views make sense.
Turning to the Linux Kernel Compile benchmark, here is what that view looks like:
The Broadwell-EP v. Skylake-SP numbers are exaggerated a bit by clock speed differences but the trend is fairly simple: EPYC draws strength from having lots of cores.
For those reading this article who either are not in corporate IT or do not have to deal with per-core licensing, the AMD EPYC 7601 is a beastly CPU. Frankly, when buying your next generation of servers you should look into getting a few to run your workloads. We hope to have EPYC in DemoEval soon, so you can run your own workloads on our hardware, but we are short on platforms to spare for that while we are testing. Likely in the next few months this will happen, so stay tuned.
If you live in environments with Microsoft, Oracle, or other applications where licensing is on a per-core basis then your TCO calculations will look considerably different. [Edit: Thanks, commenters! VMware vSphere version(s) that EPYC is supported on (vSphere 6.5u1) are per socket not per core licensing like older VMware licenses. Removed VMware from the list.] Especially if you are being hit by a double-whammy of OS/ hypervisor per core licensing/ support as well as application per core licensing/ support, per core performance is paramount. In Part 2 we will have performance on some high-end engineering applications that are several thousand dollars per core so using as few cores as possible is important.
At the initial AMD EPYC 7000 series launch, there are essentially nine performance variants, excluding “P” single socket only SKUs. We have seven of the nine already in the lab so there will be a lot of follow up. At the same time, Intel has several dozen public performance variants without touching other vectors of differentiation. Intel’s CPUs are not just optimized for performance, they are explicitly optimized for licensing as well. For example, we have a pair of Intel Xeon Silver 4112 CPUs in the lab that are around $500 but with only 4 cores and an extended L3 cache. Compare that to the similarly priced 8 core/ 16 thread Xeon Silver 4110s, which have lower clock speeds and less L3 cache per core, and you can see that Intel’s per-core licensing optimized SKUs reach far down into the stack.
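The cache-per-core contrast is easy to quantify. The L3 sizes below are the commonly listed figures for these SKUs (8.25MB for the Silver 4112, 11MB for the Silver 4110), which we treat as assumptions for this illustration:

```python
# Commonly listed specs, treated as assumptions for this illustration.
skus = {
    "Xeon Silver 4112": {"cores": 4, "l3_mb": 8.25},
    "Xeon Silver 4110": {"cores": 8, "l3_mb": 11.0},
}

for name, s in skus.items():
    # L3 cache per core is the per-core licensing angle.
    print(name, s["l3_mb"] / s["cores"], "MB L3 per core")
```

The 4-core 4112 ends up with roughly 50% more L3 per core than the 4110 at a similar price, which is the licensing optimization at work.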
If your environment costs thousands of dollars per core, Intel is very interesting. If your main licensing and support costs are at the server level, or if you have a case where you want to plug heaps of DIMMs and PCIe devices into a server, AMD looks extremely interesting.
We did want to address one other elephant in the room, Intel Xeon Platinum comparisons.
Sneak Preview: Please Stop Comparing Intel Xeon Platinum and AMD EPYC
Intel seeded several review sites, including STH, with Xeon Platinum CPUs and systems. At STH, we tested a 4P Intel Xeon Platinum 8180 system that is an absolute beast. It uses $10,000 CPUs that are meant to be low volume parts whose list price (note: list price) carries a significant premium. These parts command premium list prices both to allow for discounting and because they are specialized tools for specific workloads.
To illustrate the concept of specialized tools, here is a preview of our first set of GROMACS benchmarking results using our “small” workload.
A key note on the above, it took us some time to get to that level of performance. We needed to grab the latest AMD AVX2 patches and the latest AVX-512 patches for Intel. The above we are going to call a “Work in Progress” but on the flip side, we have several thousand runs completed. For the particular workload, the above was very consistent run-to-run but we are going to modify the go-forward test cases.
The key takeaway here is that the quad socket Intel Xeon Platinum 8180 CPUs are over 3x as fast as the dual socket AMD EPYC 7601 parts, largely thanks to AVX-512. We say our “small” workload because we found that the 4P Intel Xeon Platinum 8180 machine had so many cores and threads that we were well past the point where performance scales well, due to the number of simulated atoms per CPU core being too low. We are currently testing an updated 150,000 atom model which we are calling “hard” for now. More on this in Part 2.
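The scaling wall is easier to see expressed as atoms per core. Using the 150,000 atom model mentioned above, with core counts of 4 x 28 for the quad Platinum 8180 and 2 x 32 for the dual EPYC 7601:

```python
atoms = 150_000  # size of the updated "hard" GROMACS model

systems = {
    "4P Xeon Platinum 8180": 4 * 28,  # 112 cores
    "2P EPYC 7601": 2 * 32,           # 64 cores
}

for name, cores in systems.items():
    # Too few atoms per core and the simulation stops scaling well.
    print(name, atoms // cores, "atoms per core")
```

The 4P machine has to spread the same model across nearly twice as many cores, which is why the smaller model stopped scaling there first.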
From a price perspective, AMD and Intel are going to compete in the single and dual socket markets between the Intel Xeon Bronze and Gold lines. The Platinum SKUs are Intel’s low volume highly specialized tools. That is why we pulled results from most of the charts above. While a $3300 and a $4200 CPU may compete in some markets, a $4200 and $10000 CPU are unlikely to compete.
If you are using open source software, or if you license software on a per node basis, AMD EPYC is going to be strong in many use cases. Performance even on legacy workloads is going to be extremely strong. From a platform perspective, AMD has more memory channels, more PCIe lanes (NB) and a great architecture to service a variety of workloads. There is no premium for going over 768GB/ socket of memory (even though hitting 2TB/ socket is itself around a $70K per socket memory cost).
When looking at AMD EPYC performance there is a time element. There is a very good chance that several quarters down the road, AMD EPYC performance gets significantly better. We are testing this before commercially available systems are shipping en masse. Once EPYC gets to developers, we expect performance to improve. Likewise, Skylake is a significant architecture change so we expect Xeon Scalable systems to get a performance bump over time.
At STH we have a view of the platform informed by the fact that around 30% of all AMD EPYC and Intel Xeon Scalable 1P and 2P SKUs are covered by the CPUs in our lab, along with about two dozen legacy configurations online. We have been testing these AMD EPYC 7601 CPUs for some time. Now that we are seeing system firmware working as we would expect, and the software ecosystem improving, we can say that for our web hosting cluster, EPYC will be a great fit. We are working on adding EPYC to the cluster and have actually had a Ryzen 7 1700 system in the cluster for over a quarter just to profile where we would put EPYC.