We finally have our Supermicro 2U Ultra server with what seems like it may be a production firmware. Over the past two weeks, we have upgraded the system to a new motherboard PCB revision, cycled through several BIOS versions, and upgraded to DDR4-2666. Our reviews live for many quarters so we wanted to ensure that if we are publishing numbers on something that one can actually buy. Now, we are ready to publish some numbers based on AMD Infinity Fabric and have a quick comparison point against the Intel UPI interconnect as well.
Some Background on Intel MLC
The Intel Memory Latency Checker tool has had several revisions and one of the main purposes is to show memory latency and bandwidth between different CPUs in a system. For example, Intel has, for years used different QPI speeds throughout its range that impacts socket to socket communication speed. MLC is one of those tools that has pushed companies to use slightly higher-end CPUs even in high-end 8x and 10x GPU systems as we described in DeepLearning10 and DeepLearning11 pieces.
We are using Intel’s MLC to provide the figures below as an initial data set. While one may immediately cry afoul with this, Intel essentially had no opportunity to fudge the program for AMD EPYC. We asked Intel about if they had tried MLC on AMD EPYC and as of the time we ran these benchmarks the company confirmed they did not yet have access to an EPYC system. That makes sense since AMD EPYC is still not shipping in retail available systems as of a month after its launch. What that also means is that Intel’s developers were not on an AMD EPYC system prior to these benchmarks. Without access to AMD EPYC systems, Intel could not reasonably tune or detune for AMD’s reliance upon NUMA as the NUMA enumeration looks slightly different on AMD v. Intel architectures.
We still took the figures with a bit of skepticism. We looked at the results we obtained and compared them against what we know of the architecture. Further, we shared them with a handful of industry analysts who are technically excellent and the general gut check is that they make sense. That gave us the confidence to publish these numbers.
We are also publishing figures for both DDR4-2400 and DDR4-2666 operating modes. The AMD Infinity Fabric is tied to the memory clock domains. If you want to see an example of how the firmware evolution and the change to DDR4-2666 has real-world implications, here is our Linux Kernel Compile Benchmark using the initial pre-production AMD AGESA firmware we received, and production AMD AGESA at DDR4-2400 and DDR4-2666.
That picture above is a prime example of why we have held off on publishing full benchmark result sets to date. Expect more coming. Infinity Fabric plays such an enormous role in AMD EPYC performance that we wanted to show real-world performance. If you are buying a new AMD EPYC server, we highly suggest using DDR4-2666 over DDR4-2400. We expect very few AMD EPYC dual socket buyers to opt for DDR4-2400 anyway now that DDR4-2666 is shipping.
Not to belabor the background information, but here is the test platform we are using:
- System: Supermicro 2U Ultra EPYC Server (AS-2023US)
- CPUs: 2x AMD EPYC 7601 32-core/ 64-thread CPUs
- RAM: 256GB (16x16GB DDR4-2400 or 16x16GB DDR4-2666)
- OS SSD: Intel DC S3710 400GB
- OS: Ubuntu 17.04 “Zesty” Server 64-bit
- NIC: Mellanox ConnectX-3 Pro 40GbE
We are going to have more on the server and performance at a later date.
AMD EPYC Infinity Fabric DDR4 2400 v. 2666
We are going to bound this discussion in terms of two MLC outputs. First core to core latency as well as bandwidth. We ran the tests on the system with DDR4-2400 and DDR4-2666 RAM to make a direct comparison. Since the AMD EPYC dual socket system has 8 NUMA nodes, we also have summary tables so you can easily parse the large data set. There is a ton of misinformation going around, so we wanted to simply present some data.
We did ten runs on each and the results were extremely consistent. We used the final runs (generally in the middle of a +/- 1 or 2 ns variance for each NUMA node to NUMA node figure) to showcase the configurations. MLC produced very consistent results.
AMD EPYC Infinity Fabric DDR4 2400 v. 2666 Latency
Here is a quick view of what AMD EPYC Infinity Fabric latency looks like across different cores using DDR4-2400. For these charts the 0-7 on the horizontal and vertical headers correspond to NUMA nodes.
Here is the same system with DDR4-2666 DRAM:
You can see that there is an appreciable drop in overall latency. Instead of trying to calculate each drop on an absolute basis, here is a table with the differences:
If you want to see those figures on a percentage basis, here is the view:
The overall average is about an 8% lower latency using the faster RAM. In an 8 NUMA node design, that is a big deal.
The extremely cool finding here is that we saw a clear pattern that one may have hypothesized. AMD EPYC socket-to-socket communication is done via four Infinity Fabric links that we were told connect each die to its respective die on the other package. So Socket 0 Die 0 has a direct link to Socket 1 Die 0 which is a single hop trip. Conversely, if Socket 0 Die 0 wants to communicate with resources from Socket 1 Die 3, we expect this is a two hop trip. Here is a diagram where Yellow is Socket 0 Die 0 and Blue is Socket 1 Die 0 for reference.
That is exactly what Intel MLC is showing us, to an extent. We highlighted four sets of figures corresponding to the same die results, the same package results, other package/ socket, and other package/ socket with the corresponding die. Taking that view, we saw a clear pattern:
Now, one will note that the bold and italicized results do not line up perfectly. E.g. we would expect one to fall at the intersection of NUMA 0 and 4, 1 and 5 and etc. Although we do not have a great answer for this, we do know AMD EPYC NUMA enumeration is slightly different. One extremely simple answer for this is that Intel did/ does not have a test system to check the enumeration is the same. Still, we are seeing a clear delta between those four latency domains as we discussed in our AMD EPYC and Intel Xeon Scalable Architecture Ultimate Deep Dive and our launch AMD EPYC 7000 Series Architecture Overview for Non-CE or EE Majors piece.
AMD EPYC Infinity Fabric DDR4 2400 v. 2666 Memory Bandwidth
Latency is great, but in NUMA architectures one of the other major factors is memory bandwidth. Here we have similar data to the latency numbers, instead expressed in terms of bandwidth at DDR4-2400:
Swapping that to DDR4-2666 you will see a significant bump in bandwidth:
Again here is the delta chart:
Like we did for the latency side, here is the improvement expressed as a percentage:
Again we see around 9% improvement.
We took this data and applied the same formatting from the latency measurements and the results did give us pause.
Here the inter-socket bandwidth between partners is not showing additional bandwidth versus the non-partners across sockets. The major gains are again coming across the board from DDR4-2666 over DDR4-2400.
AMD EPYC DDR4-2400 v. DDR4-2666 Quick Summary
If you cannot tell our theme thus far: do not buy an AMD EPYC system with DDR4-2400. Just get DDR4-2666. You can see from the Linux kernel compile benchmark (copied again for ease of reference) that there are tangible differences in a real-world application of around 5% that accompany the 8% latency and 9% bandwidth improvements we are seeing between the NUMA nodes.
Using DDR4-2400 on an AMD EPYC 64-core system is equivalent to running a DDR4-2666 system with only 60 cores.
We now have 50+ benchmarks run with the three firmware levels shown in that chart. We will note that smaller workloads that do not have much die-to-die traffic work about equally well on DDR4-2400 and DDR4-2666 (and even that “Initial Firmware” revision.) However, when you configure a system, the performance gained with DDR4-2666 over DDR4-2400
Since we are going to get asked the question, we did want to put some UPI perspective around these figures.
Intel Xeon Scalable 4 Socket UPI v. AMD EPYC Infinity Fabric
Since comparisons are all the rage these days, and we boldly made the statement that AMD’s on package latency looked much like Intel’s inter-socket latency (in a quad socket configuration.) To an OS, both an Intel Xeon Scalable quad socket system and a single socket AMD platform will look similar with four NUMA nodes.
Here is what a DDR4-2666 quad socket Intel Xeon Platinum 8180 platform looks like in terms of latency next to the numbers from the first socket of the AMD EPYC DDR4-2666 platform:
As you can see, the Intel inter-socket latency is roughly equivalent to the intra-socket latency for AMD EPYC Infinity Fabric. Intel is still doing a bit better in its most complex (with full 3x UPI link direct connection) four NUMA node topology than AMD EPYC is.
Quad socket for Intel is a decidedly tougher case than dual socket, but we wanted to show both configurations near their practical maximums. The Intel Xeon Scalable CPUs can scale to 8-socket configurations as an example, but those configurations lack direct die-to-die communication channels via UPI. Instead, we wanted to show what happens when we stress that architecture to its practical maximum.
Overall, the proof is in the real world benchmarks. The Linux Kernel Compile Benchmark above is a great example of how Infinity Fabric speeds have significant impacts on overall performance running real application workloads. With DDR4-2400 memory speeds there are significant performance hits that are frankly not worthwhile for anyone ordering a new AMD EPYC system.
Since the above will likely incite scores of supporters from the AMD and Intel camps, feel free to review our Editorial and Copyright Policies. We are the only major publication to publish a full and regularly updated conflicts list, down to pizza companies we have indirect investments in. We are also the largest technology website that does not sell ads or sponsored content posts to either AMD or Intel. That is terrible for our revenue but allows us to remain a highly independent source. All of our display ad placement sales are through Google so nobody from our organization is ever involved. Then again, we are also decreasing display ads significantly bucking the overall industry trend to give our readers the best experience and best information possible. This is done to provide STH’s audience of IT professionals the most independent and objective coverage out there.