AMD EPYC 7002 PCIe Gen4 Network Performance
PCIe Gen4 has been around for some time, but the ecosystem of PCIe Gen4-capable devices is still relatively new. Mellanox has been a pioneer here and has had PCIe Gen4-capable controllers since its previous-generation ConnectX-5 line. Since we had dual-port Mellanox ConnectX-5 VPI PCIe Gen3 cards from our Gigabyte G481-S80 8x NVIDIA Tesla GPU Server Review, we decided to test those cards against the Mellanox ConnectX-5 VPI PCIe Gen4 (CX556A) variants. We hooked each of the cards up to our 100GbE Dell Z9100-ON switch to see whether there was indeed a benefit to the newer PCIe Gen4.
There is certainly a lot of room for improvement here. Pushing dual 100GbE links, even with something as simple as iperf/iperf3, is not trivial. Still, this is more or less out-of-the-box performance of dual 100GbE ports without a lot of tuning.
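For reference, a run along those lines can be driven with something like the sketch below. The server address, stream count, and duration are placeholders, and it assumes iperf3 is installed on both endpoints; multiple parallel streams are generally needed because a single TCP stream will bottleneck on one CPU core long before 100GbE line rate.

```python
import subprocess

# Hypothetical target; replace with the IP of the machine running "iperf3 -s".
SERVER = "192.168.100.2"

STREAMS = 8    # parallel TCP streams; one stream cannot saturate 100GbE
DURATION = 30  # seconds

# -c: client mode, -P: parallel streams, -t: duration, -Z: zero-copy sends
cmd = ["iperf3", "-c", SERVER, "-P", str(STREAMS), "-t", str(DURATION), "-Z"]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
```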
Make no mistake, this is a big deal. The PCIe Gen3 100GbE network links are driving their upstream switch ports at under 60% utilization. In essence, more than 40% of an expensive 100GbE switch port is being wasted with Intel Xeon Scalable and PCIe Gen3. More than 40% of a $1000 NIC's capability goes unused with Intel. Likewise, the capacity of the “inexpensive” $100 DACs between the machine and its switch is being wasted with Intel.
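As a back-of-the-envelope check on that utilization figure, consider the raw arithmetic below. PCIe Gen3 runs at 8 GT/s per lane with 128b/130b encoding, so even before protocol overhead, a x16 slot cannot feed two 100GbE ports at line rate:

```python
# PCIe Gen3: 8 GT/s per lane, 128b/130b encoding
lanes = 16
gen3_gbps = lanes * 8 * (128 / 130)   # ~126 Gbps of raw payload bandwidth

dual_100gbe_gbps = 2 * 100            # 200 Gbps of network line rate

ceiling = gen3_gbps / dual_100gbe_gbps
print(f"PCIe Gen3 x16: {gen3_gbps:.1f} Gbps")
print(f"Best-case switch port utilization: {ceiling:.0%}")  # ~63% before
# PCIe/TLP protocol overhead, which pushes real-world numbers under 60%
```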
If you want to do NVMe over Fabrics, as an example, then PCIe Gen4 has another benefit: it allows for more I/O even to PCIe Gen3 devices. Architecturally, a 24-drive PCIe x4 NVMe 2U chassis requires 96 lanes. That means that for a dual Intel Xeon Scalable server to add a single 100GbE NIC, one needs a minimum of 112 lanes, or 16 more than the dual Xeon platform can provide. It also leads to a 6:1 oversubscription ratio with 16 PCIe lanes for the network and 96 lanes for SSDs. Instead, we often see Intel Xeon-based NVMe servers with one or two PCIe switches and a x16 link back to the CPU from each switch. That keeps the 100GbE NIC bandwidth at 1:1, but it also limits the configuration to a maximum of PCIe Gen3 x16 speeds.
AMD EPYC 7002, even in a single-socket configuration, can handle 24x NVMe SSDs with 96 PCIe Gen3 lanes at full speed (theoretically, even using only 48 PCIe Gen4 lanes with appropriate PCIe switches). It can still place one or two PCIe Gen3 NICs in a Gen3-only system; with two, that is only a 3:1 oversubscription ratio, half of Intel's. Alternatively, it can handle one PCIe Gen4 100GbE NIC and still see 3:1 oversubscription, with 16 lanes left over for SATA or NVMe boot. The other option is two 100GbE PCIe Gen4 NICs, in which case NVMe bandwidth to network bandwidth is a 1.5:1 ratio. All of this also happens on a single NUMA node, whereas Intel needs to span two NUMA nodes to the detriment of performance.
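To make the lane budgets in the last two paragraphs concrete, here is a quick sketch of the arithmetic, treating one PCIe Gen4 lane as roughly two Gen3 lanes' worth of bandwidth:

```python
# SSD side: 24 NVMe drives at x4 each = 96 Gen3 lanes of storage bandwidth
ssd_lanes_gen3 = 24 * 4

def oversubscription(ssd_gen3_lanes, nic_gen3_equiv_lanes):
    """Ratio of storage PCIe bandwidth to network PCIe bandwidth."""
    return ssd_gen3_lanes / nic_gen3_equiv_lanes

# Dual Intel Xeon Scalable: one Gen3 x16 NIC behind 96 lanes of NVMe
print(oversubscription(ssd_lanes_gen3, 16))      # 6.0  -> 6:1

# AMD EPYC 7002, Gen3-only mode: two Gen3 x16 NICs
print(oversubscription(ssd_lanes_gen3, 2 * 16))  # 3.0  -> 3:1

# AMD EPYC 7002: one Gen4 x16 NIC (~ two Gen3 x16 of bandwidth)
print(oversubscription(ssd_lanes_gen3, 2 * 16))  # 3.0  -> 3:1

# AMD EPYC 7002: two Gen4 x16 NICs
print(oversubscription(ssd_lanes_gen3, 2 * 32))  # 1.5  -> 1.5:1
```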
Bottom line: PCIe Gen4 matters and will enable a new class of performance in systems, even those using legacy PCIe Gen3 storage, networking, and accelerators. System builders can create systems with a single AMD EPYC 7002 CPU that are not possible even with two Intel Xeon CPUs.
AMD EPYC 7002 Power Consumption
As vendors crank up the TDP on their parts, power consumption is an area we hear less about. Packages like the Intel Xeon Platinum 9200 series and NVIDIA Tesla V100 can put pressure on cooling at a rack level due to high TDPs. On a relative scale, today’s mainstream CPUs are much easier to cool. Still, there is a big story with the AMD EPYC 7002 series.
We are going to present a few data points as a min/max range. The minimum is system idle; the maximum is the highest power observed across our testing for the system. AMD specifically asked us not to use power consumption figures from the 2P test server with pre-production fan control firmware we used for our testing. We are also not allowed to name the OEM because Intel put pressure on the OEM who built it to have AMD not disclose this information, despite said OEM having its logo emblazoned all over the system. Yes, Intel is applying that level of competitive pressure on its industry partners ahead of AMD's launch.
Instead, we are going to present 1P power consumption. Each system was set up with Micron 32GB DIMMs at the fastest speed the CPU would allow. The systems also had four 3.84TB Micron 9300 NVMe SSDs installed. We found three 2U systems that could handle this type of configuration for AMD and Intel: the Dell EMC PowerEdge R7415 (AMD EPYC 7001), the Supermicro SuperStorage SSG-5029P-E1CTR12L (Intel Xeon), and a new Gigabyte R272-R32 (AMD EPYC 7002). Our Gigabyte R272-R32 review is coming, but this is a PCIe Gen4-capable 24-bay NVMe server with extra PCIe Gen4 slots available for high-speed networking.
Here we can see an intriguing pattern emerge. AMD uses about the same power as the higher-end Xeon Scalable processors. In the Intel Xeon Silver range, power is much lower, but we are looking at higher-end CPUs in this piece. Remember, an Intel Xeon Scalable system also requires a Lewisburg PCH with a 15-21W TDP, whereas that functionality is integrated into the AMD EPYC design. As such, Intel's lower-end power figures tend to be a bit higher than what one may expect from the chip's TDP alone.
AMD seems to have simply decided against sampling us low-TDP parts. One can set the cTDP to a lower value and constrain power/performance. One can also set it higher, for example up to a 240W cTDP on the AMD EPYC 7742 part. We are going to have a dedicated piece on those impacts on power and performance soon. Still, there are no sub-100W TDP parts on AMD's SKU list. AMD is pushing a consolidation story, and that makes absolute sense.
This is the important part to remember: performance per watt. Intel absolutely hammered AMD on this in previous generations of server and client products. Now, the tables have turned, especially at the high end. AVX-512, in particular, is a feature that pushes Intel Xeon power consumption up. AMD's decision to omit AVX-512 in this generation appears to make it more power-efficient. In every dual-socket test, we are seeing the dual AMD EPYC 7742 configuration outperform the dual Intel Xeon Platinum 8280 configuration, sometimes by ~2x, and yet the maximum power is lower.
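To illustrate why that combination matters for performance per watt, here is the shape of the calculation. The ~2x performance figure reflects the dual-socket results discussed above, but the wattages below are placeholder values, not our measured numbers:

```python
# Illustrative only: relative performance and observed maximum system power.
# The ~2x performance figure comes from the dual-socket results above; the
# wattages here are placeholders standing in for real measurements.
epyc_perf, epyc_watts = 2.0, 850     # dual EPYC 7742 (relative to Xeon)
xeon_perf, xeon_watts = 1.0, 900     # dual Xeon Platinum 8280 (baseline)

perf_per_watt_ratio = (epyc_perf / epyc_watts) / (xeon_perf / xeon_watts)
print(f"{perf_per_watt_ratio:.2f}x perf/W advantage")  # > 2x whenever
# performance doubles and maximum power is equal or lower
```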
Real-World Power Savings Example: STH Load Generation
Practically, performance per socket has a big impact on power. We have a series of 2U 4-node chassis in the lab, each node with dual Intel Xeon E5-2630 V4 CPUs, 128GB of RAM, and 10/50GbE NICs, that we use as load generation targets. This was the high-volume mainstream market through Q2 2017.
You may have seen a few of the results from this configuration leak into our performance results. We are looking at somewhere between 3-4 of these dual-socket mainstream Xeon E5 V4 systems consolidating into a single-socket AMD EPYC 7702P system.
Frankly, this completely breaks corporate IT purchasing cycles. Consolidating 6-8 sockets into one has an immense VMware licensing impact. Further, we are using about one-quarter of the power, even with some of the efficiencies we had already gained by going 2U 4-node. To your IT procurement folks, this is the view:
The TDP of the AMD EPYC 7702P is higher than the Intel Xeon E5-2630 V4's, but when you are replacing 6-8 sockets with one socket, the power savings are absolutely immense. We have not seen the industry deliver 6:1 consolidation ratios from under 2.5 years of technology advancement before.
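As a rough sketch at the CPU TDP level (using the published 85W TDP of the Xeon E5-2630 V4 and 200W TDP of the EPYC 7702P; the node counts mirror the 3-4 system range above, and CPU TDP is of course only part of total system power):

```python
# Published TDPs: Xeon E5-2630 V4 is an 85W part; EPYC 7702P is a 200W part.
e5_2630v4_tdp = 85
epyc_7702p_tdp = 200

# 3-4 dual-socket nodes consolidating into one single-socket node
for nodes in (3, 4):
    old_cpu_watts = nodes * 2 * e5_2630v4_tdp
    print(f"{nodes} nodes: {old_cpu_watts}W of CPU TDP -> "
          f"{epyc_7702p_tdp}W ({epyc_7702p_tdp / old_cpu_watts:.0%})")
# 3 nodes: 510W -> 200W (~39%); 4 nodes: 680W -> 200W (~29%).
# Fans, PSUs, and per-node overhead push the real-world total-system
# savings toward the roughly one-quarter figure noted above.
```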
Next, we are going to discuss the solution’s market positioning and then give our final thoughts.