What Changes with AMD EPYC “Rome”
With AMD EPYC “Rome”, there is a central I/O die with x86 core chiplets that hang off it. When we look externally, AMD has a set number of pins that need to carry data to and from the socket. That includes pins for eight memory channels along with eight x16 links. We are going to leave the memory channels alone and discuss the x16 links.
AMD EPYC 7002 or “Rome” already has PCIe Gen4 support announced, while the 2nd Gen Intel Xeon Scalable CPUs still use 14nm and Intel’s PCIe Gen3 IP. In a single socket, that means AMD EPYC Rome will have 128 PCIe lanes using eight x16 links for PCIe. We expect AMD will use the Rome generation to add another PCIe lane, making 129 PCIe lanes total, and we are going to discuss that in our “WAFL Bonus PCIe Lane(s)” section later.
PCIe Gen4 is significant. AMD has to upgrade its SerDes to accommodate PCIe Gen4, essentially twice the bandwidth of Gen3, along with faster chip-to-chip Infinity Fabric. Using the rough model that first-generation Infinity Fabric ran at Gen3 speeds, the AMD EPYC Rome Infinity Fabric links are set to double in speed.
Doubling speed means that AMD’s aggregate socket-to-socket, I/O die to I/O die bandwidth will increase much faster than we have seen Intel move across QPI and UPI generations. With the 2nd Gen Intel Xeon Scalable chips, Intel is still using 10.4GT/s UPI. Quad socket Intel Xeon Scalable systems have one UPI link between sockets. Dual socket systems will usually have two links but may have three on higher-end platforms. AMD’s aggregate socket-to-socket bandwidth is set to double, taking what was a close cross-socket story and turning it into a lopsided advantage.
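To put rough numbers on that, here is a quick back-of-the-envelope sketch in Python. It assumes Infinity Fabric runs at the PCIe per-lane rate over x16 links, per the rough model above; real Infinity Fabric encoding and efficiency will differ, so treat these as directional figures only.

```python
# Back-of-the-envelope per-direction link bandwidth.
# Assumption: Infinity Fabric runs at the PCIe per-lane rate over
# x16 links (the rough model above); real IF efficiency differs.

def pcie_gb_per_s(gt_per_s, lanes, encoding=128 / 130):
    """Per-direction bandwidth in GB/s for a PCIe-style link."""
    return gt_per_s * encoding * lanes / 8  # 8 bits per byte

gen3_x16 = pcie_gb_per_s(8.0, 16)   # ~15.8 GB/s, "Naples"-era IF model
gen4_x16 = pcie_gb_per_s(16.0, 16)  # ~31.5 GB/s, "Rome"-era IF model
upi_link = 10.4 * 20 / 8            # ~20.8 GB/s, 20 data lanes at 10.4GT/s

print(f"4x Gen3 x16 IF links:    {4 * gen3_x16:.1f} GB/s per direction")
print(f"4x Gen4 x16 IF links:    {4 * gen4_x16:.1f} GB/s per direction")
print(f"3x UPI links @ 10.4GT/s: {3 * upi_link:.1f} GB/s per direction")
```

Even with generous assumptions for a three-link UPI topology, the doubled Infinity Fabric links pull roughly 2x ahead in aggregate.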
Having so much bandwidth available raises a question: does one need 64 (4×16) lanes between the two sockets? Since AMD uses a flexible interconnect for its PCIe and Infinity Fabric designs, and because the SP3 socket is designed for 128 of these lanes, AMD and its platform partners have an option: they can use fewer x16 lane sets for socket-to-socket bandwidth.
For those who want PCIe Gen4, we have had it confirmed since before Rome was officially announced that many platforms will see new generations. The PCB quality used today was fine for Gen3, but Gen4 needs better PCB to keep the communication channels clean. For smaller, compact single socket systems, some vendors think they will not need a PCB re-spin to support PCIe Gen4. Larger systems exposing large numbers of PCIe lanes, as we have discussed previously, will require a Gen4 re-spin.
With new PCBs and a virtual superhighway of socket-to-socket bandwidth, some systems designers are looking at using x16 links for additional PCIe Gen4 connectivity rather than socket-to-socket Infinity Fabric links. Instead of using four x16 links on each CPU for socket-to-socket, vendors are looking at Rome re-spins as opportunities to use one of those x16 links on each CPU as extra PCIe Gen4. Using five of the x16 lane sets on each CPU for PCIe gives 160 lanes total, instead of the 128 one gets with four sets per CPU.
Some systems designers may elect to use three x16 socket-to-socket links rather than four, choosing instead to maximize PCIe Gen4 lanes at 160. That decreases socket-to-socket bandwidth, which is not necessarily desirable. It also does not fit well with how some larger systems vendors look at systems. For more niche players, and those willing to accept this trade-off, it is a possible way to differentiate and create very cool platforms for applications like GPU compute, NVMe storage, traditional SAS storage, and Xilinx CCIX FPGAs.
In theory, using two x16 lane sets at twice the speed and half the lanes would allow for socket-to-socket bandwidth similar to current “Naples” chips. From what we hear, and given the additional system girth we can see with Rome, a two x16 link Infinity Fabric topology will not be officially supported. Stepping down to fewer cross-socket links would require additional validation resources, and given how much compute, RAM, and PCIe the system will support, it is a trade-off AMD is not expected to invest in.
Update 05-April-2019: We have not heard of an OEM supporting 2x inter-socket links with 2×6 x16 (192) PCIe lane configurations yet, even though it would be good for AMD’s GPU division.
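To make the lane accounting concrete, here is a minimal sketch of the trade-off, assuming each SP3 CPU exposes eight x16 lane sets that can each be allocated to either Infinity Fabric or PCIe Gen4. The function and constants are illustrative, not an AMD-defined interface.

```python
# Hypothetical lane accounting for a dual-socket SP3 "Rome" platform.
# Assumption: each CPU has eight x16 lane sets, each usable as either
# a socket-to-socket Infinity Fabric link or as PCIe Gen4.

LANE_SETS_PER_CPU = 8
LANES_PER_SET = 16

def dual_socket_pcie_lanes(if_links_per_cpu):
    """PCIe lanes left after reserving x16 sets for Infinity Fabric."""
    pcie_sets = LANE_SETS_PER_CPU - if_links_per_cpu
    return 2 * pcie_sets * LANES_PER_SET

for links in (4, 3, 2):
    print(f"{links} IF links/CPU -> {dual_socket_pcie_lanes(links)} PCIe lanes")
# 4 IF links/CPU -> 128 PCIe lanes
# 3 IF links/CPU -> 160 PCIe lanes
# 2 IF links/CPU -> 192 PCIe lanes (not expected to be supported)
```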
That still does not solve a constraint found on “Naples” that Intel does not face: the lack of a PCH. Without a PCH, the AMD EPYC 7001 series loses lanes to lower-speed I/O such as the BMC. This lower-speed I/O is one of the reasons we do not see systems with 128x PCIe Gen3 lanes exposed today. AMD is aware of this and has a plan to fix it.
WAFL Bonus PCIe Lane(s)
Unlike Intel’s PCH, AMD’s chips are designed to have all I/O on package. When looking at 1st and 2nd Gen Intel Xeon Scalable systems versus AMD EPYC, one has to remember that Intel systems include a Lewisburg PCH that uses power and adds cost.
AMD had a goal of using flexible I/O and its Server Controller Hub (SCH) to remove the need for the PCH.
To address the extra I/O needs of lower-speed devices, AMD is readying an extra lane per CPU. From what we understand, these lanes are not meant for high-end devices. Instead, they are meant to keep lower-speed devices, the server’s BMC being the common example, from consuming the main x16 lane blocks.
As we have worked with dozens of AMD EPYC platforms and published many reviews, a common question is why the platforms do not expose 128x PCIe lanes to the motherboard or server in the AMD EPYC 7001 generation. Here is a good example of where that is in play on a Gigabyte G291-Z20 dual GPU EPYC platform.
A common reason is that current platforms still include SATA, which the EPYC 7001 can support via its I/O lanes. Another key reason is the BMC. Adding a BMC to an AMD EPYC 7001 lane can take three other lanes away from NVMe storage, which generally uses x4 links. You can see that there is a PCIe 3.0 x2 M.2 slot to help use the remaining lanes since the ASPEED AST2500 is using an x1 lane. This block diagram shows the impact of adding the BMC, as we conceptually showed in our diagram above.
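As a rough illustration, here is a minimal accounting sketch under the assumption that a x16 block is bifurcated into four x4 groups, with a device mix mirroring the G291-Z20 example above; the split is our simplification of the block diagram, not a vendor specification.

```python
# Minimal sketch: how an x1 BMC fragments one x16 block that could
# otherwise feed four x4 NVMe drives. Assumes x16 -> 4x x4 bifurcation;
# device mix mirrors the G291-Z20 example above.

groups = [4, 4, 4, 4]                                 # four x4 groups
low_speed = {"ASPEED AST2500 BMC": 1, "M.2 slot": 2}  # lanes consumed

groups.pop()                    # one x4 group is given up to low-speed I/O
used = sum(low_speed.values())  # 3 lanes
stranded = 4 - used             # 1 lane left unused

print(f"x4 NVMe-capable groups left: {len(groups)} of 4")
print(f"Low-speed I/O uses {used} lanes; {stranded} lane is stranded")
```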
Server BMCs are ubiquitous, and they use PCIe to connect to the system. Moving that PCIe x1 connection off the main lane sets creates a cleaner solution for motherboard re-spins. If you recall from earlier in this article, Intel has the Lewisburg PCH with its own SATA and PCIe lanes that often handle the BMC connectivity. Here is an example using an Intel-based Gigabyte G191-H44.
For newer platform spins, that can mean 129 PCIe lanes for a single socket CPU, or 130 / 162 lanes for dual socket AMD EPYC Rome systems (130 with four socket-to-socket links, 162 with three). We do not expect this to be available on current AMD EPYC platforms since those lanes and platforms are already set.
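Extending the earlier sketch with the rumored bonus lane, and assuming one extra low-speed lane per CPU, the arithmetic works out as follows.

```python
# Extending the lane accounting with the rumored bonus ("WAFL") lane.
# Assumption: one extra low-speed lane per CPU, per what we have heard.

BONUS_LANES_PER_CPU = 1

def total_lanes(sockets, if_links_per_cpu=0):
    pcie_sets = 8 - if_links_per_cpu  # x16 sets left for PCIe
    return sockets * (pcie_sets * 16 + BONUS_LANES_PER_CPU)

print(total_lanes(1))                      # 129: single socket
print(total_lanes(2, if_links_per_cpu=4))  # 130: dual socket, four IF links
print(total_lanes(2, if_links_per_cpu=3))  # 162: dual socket, three IF links
```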
Assuming we do see uptake on this feature, it will open up the AMD EPYC Rome generation to have even better PCIe connectivity. With Dell EMC PowerEdge BOSS-like cards for SATA boot drive connectivity as well as NVMe proliferation, this could be a great way to modernize the AMD EPYC platform.
Update 05-April-2019: Everyone seems to be calling this feature WAFL.
Putting this into perspective, AMD’s strategy is looking very strong. Seeing the Gen3 to Gen4 transition, one can imagine a similar architecture scaling to Gen5 in the future (via another board re-spin), doubling performance yet again. This is a vastly different strategy than Intel UPI, and the bandwidth is scaling much faster for AMD. In this generation, Intel left UPI at 10.4GT/s, the first time in the last decade it has not incremented QPI/UPI speeds.
For most systems, the less sexy 129/130 numbers are actually the most exciting. The addition of auxiliary I/O lane(s) means the main lane blocks stay free for more higher-speed PCIe devices, like NVMe SSDs or GPUs.
This article has been in the works for a few weeks, and I did give AMD the heads-up that it would be going live just after the Cascade Lake launch. They did not sanction this article (indeed, they will not be overly excited to see it live.) I wanted to ensure that we could compare AMD’s plan to the actual numbers for the new Intel platforms that we were under embargo for at the time this was being written. Now that we can discuss Intel’s mainstream 2019 platform, we wanted to bring up the coming competition. Things may change between today and the actual launch. As always, there is an asterisk here until the platform ships from AMD and its partners.
For the record, knowing the above and armed with benchmark data we had already generated on 2nd Gen Intel Xeon Scalable, we purchased several AMD EPYC platforms for our infrastructure the week before the new Xeon launch. We will add Xeon Gold with Optane DCPMM in the coming quarter as well simply because DCPMM memory mode is very useful to us.
Intel has been shipping Cascade Lake for revenue for months while the AMD EPYC Rome generation is not publicly available. There is always something better coming and Intel made major gains against AMD EPYC 7001 with this generation. 2019 is shaping up to be an exciting year and Rome is leading the way toward a modern architecture.