AMD Milan-X Delivers AMD EPYC Caches to the GB-era

STH AMD Milan-X Performance Testing

As a quick note, our full test suite running all of our normal iterations now takes over a week to run, even on high-end CPUs like these. We only had six days to review these parts, including doing video and photography, so we had to go with a shorter methodology here. Still, we just wanted to see what would happen.

Python Linux 4.4.2 Kernel Compile Benchmark

This is one of the most requested benchmarks for STH over the past few years. The task is simple: we have a standard configuration file and the Linux 4.4.2 kernel from kernel.org, and we make the standard auto-generated configuration utilizing every thread in the system. We are expressing results in terms of compiles per hour to make the results easier to read:

AMD EPYC 7773X Linux Kernel Compile Benchmark

When we first started doing this benchmark, there was no way we thought we would ever see a sub-one-minute compile, or more than 60 compiles per hour. Yet here we are. The AMD EPYC 7773X had a narrow performance lead over the EPYC 7763, but it was there.
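For readers who want to approximate this test, here is a minimal sketch of the flow. This is not the STH harness; the build steps are the generic kernel.org procedure, with the auto-generated defconfig standing in for our standard configuration:

```python
#!/usr/bin/env python3
"""Rough sketch of a kernel-compile timing loop, assuming an unpacked
Linux 4.4.2 source tree from kernel.org (not the actual STH harness)."""
import os
import subprocess
import time

KERNEL_DIR = "linux-4.4.2"   # hypothetical path to the unpacked source
THREADS = os.cpu_count()     # use every thread in the system

def one_compile() -> float:
    """Clean, configure, and build once; return elapsed seconds."""
    subprocess.run(["make", "mrproper"], cwd=KERNEL_DIR, check=True)
    subprocess.run(["make", "defconfig"], cwd=KERNEL_DIR, check=True)
    start = time.perf_counter()
    subprocess.run(["make", f"-j{THREADS}"], cwd=KERNEL_DIR, check=True)
    return time.perf_counter() - start

elapsed = one_compile()
print(f"{elapsed:.1f}s per compile -> {3600 / elapsed:.1f} compiles/hour")
```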

c-ray 8K Performance

Although this is not the intended job for these parts, we also wanted something very simple that scales extremely well to more cores.

AMD EPYC 7773X C Ray 1.1 Benchmark

For those coming from the desktop side, this is probably closer to a Cinebench-style look at Milan-X performance. Here the cores, IPC, and clock speed dominate, while the workload fits into the existing Milan caches without the extra 3D V-Cache. This is a great example of where the extra cache does not help and the lower clock speeds hurt Milan-X performance.
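For reference, a c-ray run generally looks like the sketch below. The binary path, scene file, and flags follow common c-ray-mt 1.1 usage and are assumptions rather than the exact STH configuration:

```python
#!/usr/bin/env python3
"""Sketch of a c-ray-mt 1.1 run at 8K. Binary path, scene file, and
options are illustrative, not the exact STH test configuration."""
import os
import subprocess
import time

cmd = [
    "./c-ray-mt",
    "-t", str(os.cpu_count()),  # one render thread per hardware thread
    "-s", "7680x4320",          # 8K output resolution
    "-i", "sphfract",           # sample scene shipped with c-ray
    "-o", "out.ppm",
]
start = time.perf_counter()
subprocess.run(cmd, check=True)
print(f"Render time: {time.perf_counter() - start:.2f}s")
```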

STH nginx CDN Performance

On the nginx CDN test, we are using an old snapshot and access patterns from the STH website, with DRAM caching disabled, to show what the performance looks like when fetching data from disks. This requires low-latency nginx operation plus an additional step of low-latency I/O access, which makes it interesting at the server level. Here is a quick look at the distribution:

AMD EPYC 7773X STH Nginx CDN Benchmark

We are using a consistent sorting between these two larger workloads to make the charts easier to compare. For hosting STH, or hosting a circa 2017-2018 version of STH, AMD EPYC Milan-X would not help. Perhaps that is why AMD is not targeting web hosting with this.
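As a rough illustration of the replay approach, one might drive the server along these lines. This is far simpler than the actual STH harness, and the host and log path are hypothetical:

```python
#!/usr/bin/env python3
"""Sketch of replaying an access-log snapshot against an nginx instance
and collecting latency percentiles. Host and log path are hypothetical."""
import statistics
import time
import urllib.request

HOST = "http://test-server"       # nginx serving the site snapshot
LOG = "sth_access_paths.txt"      # one request path per line

latencies = []
with open(LOG) as f:
    for path in (line.strip() for line in f):
        start = time.perf_counter()
        urllib.request.urlopen(HOST + path).read()
        latencies.append((time.perf_counter() - start) * 1000)

latencies.sort()
print(f"p50: {statistics.median(latencies):.2f} ms")
print(f"p99: {latencies[int(len(latencies) * 0.99)]:.2f} ms")
```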

MariaDB Pricing Analytics

This one is personally very interesting for me. The origin of this test is a workload that runs deal-management pricing analytics on a set of data that has been anonymized from a major data center OEM. The application is effectively looking for pricing trends across product lines, regions, and channels to determine good deal/bad deal guidance based on market trends to inform real-time BOM configurations. If this seems very specific, the big difference between this and something deployed at a major vendor is the data we are using. This is the kind of application that has moved to AI inference methodologies, but it is a great real-world example of something a business may run in the cloud.

AMD EPYC 7773X MariaDB Pricing Analytics Benchmark

This was an example where we actually saw the working set fit nicely into cache, both for the database and for the application running the pricing analytics. As a result, we get a very nice speedup.
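To give a flavor of this workload class, a pricing-trend aggregation might look like the sketch below. The schema, query, and connection details are invented for illustration and are not the actual STH test:

```python
#!/usr/bin/env python3
"""Sketch of the kind of aggregation a pricing-analytics workload runs.
Table/column names and connection details are invented for illustration."""
import mariadb  # MariaDB Connector/Python

conn = mariadb.connect(host="localhost", user="bench", database="pricing")
cur = conn.cursor()

# Average discount by product line, region, and channel over one quarter;
# the analytics layer compares each new deal against these trend lines.
cur.execute("""
    SELECT product_line, region, channel, AVG(discount_pct)
    FROM deals
    WHERE quote_date >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
    GROUP BY product_line, region, channel
""")
for product_line, region, channel, avg_discount in cur:
    print(product_line, region, channel, f"{avg_discount:.1f}%")
conn.close()
```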

STH STFB KVM Virtualization Testing

One of the other workloads we wanted to share is from one of our DemoEval customers. We have permission to publish the results, but the application itself being tested is closed source. This is a KVM virtualization-based workload where our client is testing how many VMs it can have online at a given time while completing work under the target SLA. Each VM is a self-contained worker.
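Conceptually, the test ramps up the number of concurrent workers until any of them blows past the SLA. Here is a minimal stand-in sketch; the real test boots one KVM VM per worker running the closed-source application, so the SLA value and the worker job below are placeholders:

```python
#!/usr/bin/env python3
"""Conceptual sketch of the VM-count SLA search: increase the number of
concurrent workers until one exceeds the SLA. The worker is a stand-in;
the real test runs one KVM VM per worker (closed-source application)."""
import concurrent.futures
import time

SLA_SECONDS = 5.0  # illustrative target, not the client's actual SLA

def worker(_: int) -> float:
    """Stand-in for one self-contained VM completing its work unit."""
    start = time.perf_counter()
    sum(i * i for i in range(2_000_000))  # placeholder compute job
    return time.perf_counter() - start

def max_workers_under_sla(limit: int = 64) -> int:
    best = 0
    for n in range(1, limit + 1):
        with concurrent.futures.ProcessPoolExecutor(n) as pool:
            times = list(pool.map(worker, range(n)))
        if max(times) > SLA_SECONDS:
            break
        best = n
    return best

if __name__ == "__main__":
    print("Max concurrent workers within SLA:", max_workers_under_sla())
```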

AMD EPYC 7773X STH STFB SLA KVM Virtualization Benchmark

This is perhaps the coolest chart of them all. Using smaller VMs, the EPYC 7763 and EPYC 7773X are relatively close. In the middle of our range (the "H" size, which is "Hard" because I wanted a fifth size to make an odd number years ago when we started using this), the two are identical in how many VMs they can support. At the top end, we can actually see a larger percentage uptick with Milan-X.

The reason this chart is so exciting is simple: it shows a key concept. Milan-X is very fast if it is making effective use of the cache, and effective cache use depends on the workload being run. Something like our STH nginx web hosting example does not benefit from the extra cache, but perhaps if we ran multiple instances, it might. It also may not if the access pattern is too random. That is exactly what makes Milan-X so interesting and why I think it lends itself to a conceptual model.

An AMD Milan-X Performance Conceptual Model

From what we have seen from ISVs, AMD's data, and our own data, the cache hit ratio and working set size are the keys to Milan-X performance. It is probably right to assume that the simulation vendors were presenting findings because they were generally positive, and I did not get the impression that AMD was showing Synopsys EDA cases because they were negative for Milan-X. Still, there is a pattern in the data we have seen, and it is very simple.

AMD EPYC 7003X Milan X SoC Architecture

If a server can effectively use the additional cache now that its size has tripled, then Milan-X is better. We say "server" instead of "workload" because, for smaller workload sizes, it may take scaling multiple workloads to fill and utilize the caches effectively. It also matters how accesses happen: if cache misses are frequent even with the larger size, then the cache is not necessarily helping.

Still, the conceptual model is that if a server is able to effectively use the 96MB of L3 cache per CCD, then Milan-X is a winner.
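One back-of-the-envelope way to express this model is average memory access time as a function of L3 hit rate. The latencies and hit rates below are illustrative round numbers, not measured Milan or Milan-X values:

```python
#!/usr/bin/env python3
"""Toy average-memory-access-time (AMAT) model of the Milan-X conceptual
model: speedup comes from whatever L3 hit rate improvement the tripled
cache buys. Latencies are illustrative round numbers, not measurements."""

L3_NS = 15.0     # assumed L3 hit latency (ns)
DRAM_NS = 100.0  # assumed DRAM access latency (ns)

def amat(l3_hit_rate: float) -> float:
    """Average latency for accesses that reach L3."""
    return l3_hit_rate * L3_NS + (1.0 - l3_hit_rate) * DRAM_NS

# If 3x the L3 moves a workload's hit rate from 60% to 90%...
milan, milan_x = amat(0.60), amat(0.90)
print(f"Milan AMAT:   {milan:.1f} ns")
print(f"Milan-X AMAT: {milan_x:.1f} ns ({milan / milan_x:.2f}x faster)")

# ...but if the working set never fit anyway (e.g. 20% -> 25%), little changes.
print(f"Random-access case: {amat(0.20) / amat(0.25):.2f}x")
```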

A Word on Power Consumption

At STH, we usually test servers with specific workloads under specific temperature and humidity ranges in the data center. During our testing window for this review, either the temperature or the humidity was outside our target ranges, so it was hard to get exact data.

Generally, AMD CPUs are fairly good at staying within their TDPs, and the AMD EPYC 7773X seemed close to the EPYC 7763 in terms of power consumption. That means we need to hit the right environmental target range in order to see if there are appreciable differences.

HPE Cray EX AMD Instinct MI250X At SC21 Node Front

In modern servers though, 280W TDPs are at the edge of what many dense systems, like the Cray Shasta system or 2U 4-node systems, are able to handle on air cooling. If you are thinking about deploying 2U 4-node Milan-X servers, we have an awesome liquid cooling piece coming that uses 2U4N 64-core Milan (non-X) servers and that we filmed last week. Expect to see that soon; we are going to show some of the massive power savings one gets with liquid cooling as well.

Next, let us take a look at the market impact and our guide to the AMD EPYC 7003 Milan series with this release.

12 COMMENTS

  1. This is excellent. I’m excited to get an r7525 with these and try them out. I sent this to my boss this morning and he OK’d ordering one so we can do profiling on our VMware servers

  2. @cedric – make sure you order it with all the connectivity you’ll ever want. Dell has been a bunch of [censored] when we’ve opened cases about bog-standard Intel X710 NICs not working correctly in our 7525s. So much for being an open platform.

    Not that I’m bitter.

  3. Now that the 7003x is “shipping”, perhaps they can get around to shipping the 7003 in bulk. I’ve got orders nearly 9 months old.

  4. While per-core licensing costs seem to be a consideration for some people, I think this kind of optimisation is only possible because certain proprietary licensing models need updating to account for modern computer hardware. Given the nonlinear scaling between frequency and power consumption, it appears environmentally backwards to base hardware choices on weird software licensing costs rather than performance per watt or something similar that neglects arbitrary licensing constraints.

    On another note, NOAA open sourced their weather forecasting codes a few years ago and WRF (based on models developed by NCAR) has been open source for much longer. I think the benchmark problems associated with these applications would make for an interesting journalistic comparison between new server CPUs with larger cache sizes.

  5. @Eric – Environmentally backwards, possibly, but so often the hardware platform is the cheapest part of the solution – at least in terms of capital costs. I don’t think it’s necessarily unreasonable to optimize for licensing costs when the software can easily dwarf the hardware costs–sometimes by multiple orders of magnitude. To your point though, yes, the long-term operational expense, including power consumption, should be considered as well.

    The move to core-based licensing was largely a response to increasing core counts – per-socket licensing was far more common before cores started reaching the dozen+ level. Hopefully you’re not advocating for a performance/benchmark based licensing model…it’s certainly been done (Oracle).

  6. I find the speedups in compilation a bit underwhelming. My hunch is that the tests are performed the usual way – each file as a separate compilation unit. I work on projects with tens of thousands of C++ files, and the build system generates files that each contain includes for several hundred cpp files, then compiles those.

    When you have a complicated set of header files, just parsing and analyzing the headers takes most of the compilation time. When you bunch lots of source files together, you amortize this cost. I guess in such a scenario the huge L3 cache would help more than for a regular file-by-file build.
