AMD EPYC 7002 Topology Impact
We wanted to show a few views of why this matters from a system topology perspective. You may have heard Intel and other commentators mention how AMD needed multiple NUMA nodes to hit high core counts due to its EPYC 7001 chiplet design. That has changed. Now it is Intel that needs more NUMA nodes to hit a given core count.
AMD EPYC 24-core Topology Changes
In the first-generation AMD EPYC 7001 CPUs, there are four dies packaged into a socket. Each die has two channels of local memory. This creates four NUMA nodes per socket, and there is a major latency penalty when going from one die's memory to another. Off-socket access was even worse. Here is a look at an AMD EPYC 7401P in one of the HPE ProLiant DL325 Gen10 systems that we have been deploying in droves.
As you can see, there are four NUMA nodes. PCIe devices are also attached to different NUMA nodes, so there is a lot of traffic that may need to cross the first-generation Infinity Fabric.
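If you want to verify this on your own system, tools like lstopo or numactl --hardware will show the same layout. Here is a minimal Python sketch, assuming a Linux system with a standard sysfs layout, that lists each NUMA node with its CPUs and local memory; on a single-socket EPYC 7001 system like this one it prints four nodes.

```python
# List each NUMA node with its CPUs and local memory, similar to what
# numactl --hardware or lstopo reports. A minimal sketch assuming a
# Linux system with a standard sysfs layout.
import re
from pathlib import Path

nodes = sorted(Path("/sys/devices/system/node").glob("node[0-9]*"),
               key=lambda p: int(p.name[4:]))
for node in nodes:
    cpus = (node / "cpulist").read_text().strip()
    meminfo = (node / "meminfo").read_text()
    mem_kb = int(re.search(r"MemTotal:\s+(\d+)\s+kB", meminfo).group(1))
    print(f"{node.name}: CPUs {cpus}, {mem_kb / 1048576:.1f} GiB local memory")
```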
Here is a different example from a Gigabyte R272-Z32 server where we have NVMe SSDs populated in different slots across the 24x NVMe SSD front bays.
Even with all of that PCIe Gen4, here is what the topology looks like:
As you can see, the package has a single memory domain since all of the DDR4 memory sits off of the I/O die. Also, all of the PCIe devices go to that NUMA node.
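A quick way to confirm the PCIe side of this is to ask the kernel which NUMA node each NVMe controller reports. Here is a minimal sketch, again assuming a Linux system with a standard sysfs layout; on an EPYC 7002 system at default settings every device comes back on node 0.

```python
# Show which NUMA node each NVMe controller (PCI class 0x0108xx) reports.
# A minimal sketch assuming a Linux system with a standard sysfs layout;
# devices may report -1 where the kernel records no affinity.
from pathlib import Path

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    pci_class = (dev / "class").read_text().strip()
    if not pci_class.startswith("0x0108"):   # mass storage, non-volatile memory (NVMe)
        continue
    numa_node = (dev / "numa_node").read_text().strip()
    print(f"{dev.name}: NUMA node {numa_node}")
```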
AMD EPYC 64-core Topology Changes
Scaling up, here is what 64 cores look like using an AMD EPYC 7702P 64-core CPU in a Supermicro WIO 1U platform:
That massive 256MB L3 cache is split across the cores. All 256GB of memory is attached to this one large NUMA node. Likewise, PCIe is attached to a single NUMA node. Again, this would have been split across four NUMA nodes previously.
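If you want to see how that 256MB of L3 is actually organized, the kernel exposes each cache slice and the CPUs that share it. Here is a minimal sketch, assuming a Linux system with a standard sysfs layout; on a 64-core EPYC 7002 part it lists sixteen 16MB L3 slices, one per four-core CCX.

```python
# Enumerate the distinct L3 cache slices and which CPUs share each one.
# A minimal sketch assuming a Linux system with a standard sysfs layout;
# on EPYC 7002 every four-core CCX has its own 16MB L3, so a 64-core
# part shows 16 slices that together make up the 256MB total.
from pathlib import Path

l3_slices = {}
for index in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cache/index*"):
    if (index / "level").read_text().strip() != "3":
        continue
    shared_cpus = (index / "shared_cpu_list").read_text().strip()
    l3_slices[shared_cpus] = (index / "size").read_text().strip()

for shared_cpus, size in sorted(l3_slices.items()):
    print(f"L3 {size} shared by CPUs {shared_cpus}")
```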
Intel cannot field 64 cores in even a mainstream dual-socket platform, which tops out at 56 cores (2x 28-core CPUs). Instead, one needs to span four sockets, and therefore four NUMA nodes. The example below uses only 4x 12-core CPUs, but to reach the 64 cores AMD can now offer in a single NUMA node, Intel needs at least a quad-socket 4x 16-core configuration.
Disparaging AMD's four NUMA nodes versus one NUMA node on Intel now sees the tables turned at 64 cores: Intel needs four NUMA nodes where AMD only needs one.
Getting Big: 128-Core/ 256-Thread Topology
That scales up as well. With 64 cores/ 128 threads per socket, AMD can now do this in only two sockets:
Intel can only get to 112 cores/ 224 threads in a quad-socket configuration. If you wanted this many cores with Intel Xeon Platinum 8200 series parts, you would need to move to an exotic (and costly) 8-socket design.
Impact: Memory Bandwidth and NPS
AMD now has a setting in the EPYC 7002 generation to present multiple NUMA domains on a system. Although AMD did not want to make a direct comparison, the feature is similar to Intel Xeon Scalable Sub-NUMA Clustering (SNC). One can effectively partition an EPYC 7002 CPU to behave like four NUMA nodes, each with up to two compute dies and one quarter of the I/O die. This keeps data flowing through the shortest paths in the system. Indeed, with NPS=4 (four NUMA nodes per socket), one has a topology not dissimilar to the AMD Ryzen 3000 topology: a quarter of the I/O die and up to two compute dies.
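To see what a given NPS setting actually exposes to the OS, you can check how many NUMA nodes the firmware presents and the distances between them. Here is a minimal sketch, assuming a Linux system with a standard sysfs layout.

```python
# Print the NUMA node count and distance matrix the firmware exposes.
# A minimal sketch assuming a Linux system with a standard sysfs layout;
# at NPS=1 a single-socket EPYC 7002 shows one node, at NPS=4 it shows
# four nodes with a matrix of intra-socket distances.
from pathlib import Path

nodes = sorted(Path("/sys/devices/system/node").glob("node[0-9]*"),
               key=lambda p: int(p.name[4:]))
print(f"{len(nodes)} NUMA node(s)")
for node in nodes:
    print(node.name, (node / "distance").read_text().strip())
```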
For this test, we are using the industry-standard STREAM benchmark. STREAM needs virtually no introduction; it is considered by many to be the de facto memory bandwidth benchmark. Authored by John D. McCalpin, Ph.D., it can be found here.
The default for AMD EPYC 7002 systems will be NPS=1. That is what we showed in the charts above, and it is what we use in our benchmarks. In most of the tests we run, NPS=2 or NPS=4 does not gain much more performance, but for those optimizing hardware and software platforms for peak performance, the option is there. Since NPS changes the topology and the results, we wanted it on this page rather than in our main benchmark runs.
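For readers who want a feel for what STREAM is measuring, here is a rough NumPy sketch of its triad kernel (a = b + scalar * c). This is not the official benchmark, which is McCalpin's C/OpenMP code and should be used for any real numbers; the array size, iteration count, and byte accounting here are illustrative assumptions. To observe the NPS effect, one would pin it to a single node, for example with numactl --cpunodebind=0 --membind=0 before invoking Python.

```python
# A rough, single-process illustration of the STREAM "triad" kernel
# (a = b + scalar * c) using NumPy. This is NOT the official STREAM
# benchmark -- use McCalpin's C/OpenMP source for real numbers -- but it
# shows the kind of streaming access pattern being measured.
import time
import numpy as np

N = 80_000_000            # 8-byte doubles: ~640 MB per array, far larger than any cache
scalar = 3.0
a = np.zeros(N)
b = np.random.rand(N)
c = np.random.rand(N)

best = float("inf")
for _ in range(5):                       # keep the best of several runs, as STREAM does
    t0 = time.perf_counter()
    np.multiply(c, scalar, out=a)        # a = scalar * c   (reads c, writes a)
    np.add(a, b, out=a)                  # a = a + b        (reads a and b, writes a)
    best = min(best, time.perf_counter() - t0)

bytes_moved = 5 * N * 8                  # traffic of the two passes above
print(f"Approximate triad bandwidth: {bytes_moved / best / 1e9:.1f} GB/s")
```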
All of this architecture background is great, but we know our readers want to see the performance. We are going to cover that on the next page.