AMD EPYC Genoa Zen 4 Performance
There is something we noticed in this generation that we had to address. When we design benchmark suites, we, like many of the workloads you see online, ideally try to run one workload across an entire CPU. If all goes well, the result is a system with all 384 threads at 100%:
In the real world, a single workload run on a large chip like this often has single-threaded phases. Those phases lead to really poor utilization on big chips, where the system looks something like this, with 1 of 384 threads running at 100%:
The above is a bigger issue today than it was in the past. On a dual 4-core/8-thread server, a single thread is over 6% of the total thread count (1 of 16 threads). On a dual 96-core/192-thread server, a single thread is just over 0.26% (1 of 384 threads).
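The impact of those single-threaded phases can be sketched with Amdahl's law. This is a minimal illustration, not a model of any specific benchmark in this article; the 5% serial fraction is an assumed example value.

```python
# Amdahl's law: ideal speedup over one thread is 1 / (s + (1 - s) / n),
# where s is the serial fraction of the workload and n is the thread count.
# The 5% serial fraction below is purely illustrative.

def amdahl_speedup(serial_fraction: float, threads: int) -> float:
    """Ideal speedup over a single thread for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)

# Even a 5% serial portion caps a 384-thread system far below 384x:
for n in (16, 384):
    print(n, round(amdahl_speedup(0.05, n), 1))  # -> 16 9.1, then 384 19.1
```

In other words, once a workload has even a small serial component, going from 16 to 384 threads buys barely 2x, which is why profiling these phases mattered so much this generation.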
Depending on the workload, these periods where progress is held up by a single thread have become a substantial issue. As a result, we had to do a lot more profiling than we normally do.
The other fun case we ran into was the below view:
We found a number of workloads that have scaled for years but are now limited to 256 threads. On a 384-thread system, that means a third of the threads (128 of 384) sit idle.
That brings up the philosophical question: “Are these CPUs, especially the 96-core parts, designed to run single workloads?” John and I spoke about this for some time. The world of benchmarking has almost always been about running a single workload across an entire CPU: rendering workloads, HPC workloads, and so forth that use entire chips. Still, most of these chips are really being used for containerized or virtualized workloads; the cloud segment is a prime example. The key point is that going forward, we are going to increasingly use bare metal containers, and then virtualized workloads, to scale. This is similar to what VMware VMmark does, but KVM is the bigger hypervisor given its cloud adoption, and VMware puts restrictions on VMmark. Still, looking at both will be important in the future, since there is an argument that hitting a single-threaded part of a workload on a 384-thread system is terrible for overall performance.
Python Linux 4.4.2 Kernel Compile Benchmark
This is one of the most requested benchmarks for STH over the past few years. The task is simple: we take a standard configuration file and the Linux 4.4.2 kernel from kernel.org, then compile the kernel using the standard auto-generated configuration and every thread in the system. We are expressing results in terms of compiles per hour to make them easier to read.
This is one of the workloads where we had to look at the scaling performance. In the future, we are going to express more benchmarks in terms of runs per unit of time, because that metric will eventually allow for comparisons where the system is running parallel instances as well.
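The runs-per-time conversion is simple arithmetic; here is a minimal sketch, with a hypothetical helper name and an optional parallel-instance multiplier as an assumption about how parallel runs would be aggregated.

```python
# Hypothetical helper: convert wall-clock seconds per run into runs per
# hour, optionally scaled by the number of parallel instances running.

def runs_per_hour(seconds_per_run: float, parallel_instances: int = 1) -> float:
    return 3600.0 / seconds_per_run * parallel_instances

# e.g. a single 13-second run completes ~277 times per hour:
print(round(runs_per_hour(13)))  # -> 277
```

The benefit of this metric is that two 192-thread instances each doing N runs per hour can be summed directly into a whole-system throughput figure, which a single elapsed-time number cannot capture.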
c-ray 1.1 Performance
We have been using c-ray for our performance testing for years now. It is a ray tracing benchmark that is extremely popular to show differences in processors under multi-threaded workloads. Here are the 8K results:
This perhaps looks much closer than it is; AMD is far ahead, but that is masked because the chart is not expressed in runs per hour. A fun note: we started collecting data on this rendering-style benchmark back when 8K renders would stress four-socket servers for many minutes. Now, the new generation completes the run in 13 seconds.
7-zip Compression Performance
7-zip is a widely used compression/decompression program that works cross-platform. We started using the program during our early days with Windows testing. It is now part of Linux-Bench. We are using our legacy runs here to show scaling even without hitting accelerators.
Again, this is stellar performance, albeit with scaling challenges at higher core counts on the compression side. Compression is a function that will be ubiquitous in the future, but it will also increasingly warrant offloading to accelerators.
Chess Benchmarking
Chess is an interesting use case since it has almost unlimited complexity. Over the years, we have received a number of requests to bring back chess benchmarking. We have been profiling systems and are ready to start sharing results:
A major challenge here was that our benchmark stopped scaling at 256 threads. We had to split the benchmark up to run in two 192-thread instances via containers to get the result above. Otherwise, a third of the chip was not being used.
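The exact container invocation we used is not shown here, but the CPU partitioning behind a split like this can be sketched. This assumes Docker-style `--cpuset-cpus` range strings and a naive contiguous split; a real run should also respect NUMA and CCD boundaries (see `lscpu` or `numactl --hardware`).

```python
# Sketch: partition the logical CPUs of a 384-thread system into equal
# cpuset range strings for container pinning, e.g.
#   docker run --cpuset-cpus=0-191 ...
#   docker run --cpuset-cpus=192-383 ...
# This contiguous split ignores NUMA/CCD topology, which matters in practice.

def contiguous_cpusets(total_cpus: int, instances: int):
    per = total_cpus // instances
    return [f"{i * per}-{(i + 1) * per - 1}" for i in range(instances)]

print(contiguous_cpusets(384, 2))  # -> ['0-191', '192-383']
```

With two 192-thread cpusets, each container instance stays under the 256-thread ceiling while the chip as a whole is fully occupied.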
SPEC CPU2017 Results
First, we are going to show the most commonly used enterprise and cloud benchmark, SPEC CPU2017’s integer rate performance:
Here, adding more cores, higher clock speeds, and more memory bandwidth yielded crushing results. We managed to do better than some of the initial estimates, which were in the 1590 range. From what we have heard, OEMs doing their full platform tuning will land just shy of 1800, at 1790. That is higher than we achieved, but it is a crushing figure either way. AMD will have effectively 3x the top dual-socket Intel Xeon 8380 result in the same socket count. It also means AMD is achieving better performance per core, even while packing cores into 96-core parts.
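The per-core claim follows from simple arithmetic on the figures quoted above: the ~1790 OEM-tuned estimate for dual 96-core Genoa, and the statement that this is roughly 3x the dual 40-core Xeon 8380 result. This is illustrative back-of-the-envelope math, not published SPEC data.

```python
# Rough per-core arithmetic using the article's figures: ~1790 estimated
# SPEC CPU2017 int rate for dual 96-core Genoa, and an Intel dual Xeon
# 8380 (2x 40 cores) result taken as roughly one third of that.

genoa_score, genoa_cores = 1790, 2 * 96
intel_score, intel_cores = 1790 / 3, 2 * 40  # derived from the ~3x claim

print(round(genoa_score / genoa_cores, 1))  # Genoa per-core -> 9.3
print(round(intel_score / intel_cores, 1))  # Ice Lake per-core -> 7.5
```

So even with 2.4x the core count, Genoa's per-core rate result comes out ahead, which is what makes the generational jump notable rather than just a core-count win.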
On the floating point side, we see something similar:
Here, AMD is expected to get a massive performance jump because of AVX-512. The dual Intel Xeon Platinum 8380s were much more competitive last generation (40 cores v. 64 cores per socket) because of AVX-512. Now that Genoa has AVX-512 and core counts have jumped, AMD pops to more than a 2x lead over Intel's top end. Remember, Intel is only adding around 50% more cores at its maximum and getting some IPC benefits from the new generation, but it needs more than 2x the performance to reach AMD. We also saw some preview publication results from OEMs that were higher, so we added those results as well.
Next, we are going to get to some more complex workloads to see how Genoa performs.