AMD EPYC 9004 Genoa Platform
AMD breaks its chip delivery into two parts: the SoC itself, and then the platform enabled by that SoC. Here is the key overview slide. We have already discussed the top-left side with the Zen 4 changes, so let us now get to the rest of the chip.
The 4th Gen EPYC CPU scales from 4 to 12 of its (up to) 8-core CCDs. The halo part at 96 cores (and the 84-core variant) uses 12 CCDs, while lower core count options can use fewer, such as 8 or 4. The 4-CCD variants (up to 32 cores) have an interesting trick where each CCD can get 2x the links to the IO die. The 12- and 8-CCD variants have only one GMI3 link per CCD to the IO die.
Here is a look at that 4-CCD configuration with two GMI3 links to the IOD per CCD.
Here is the 8-CCD version, where each CCD gets a single GMI3 connection.
On the 12-CCD chips, all CCDs have a single GMI3 link to the IO die.
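To keep the CCD-to-IOD wiring straight, here is a minimal Python sketch tabulating the configurations described above. The CCD counts, core counts, and GMI3 links per CCD come from the slides discussed here; the data structure and helper names are purely illustrative.

```python
# Illustrative sketch of the Genoa CCD-to-IOD topologies described above.
# CCD counts, cores, and GMI3 links per CCD are from AMD's slides; the
# structure and names below are our own, not AMD's.

GENOA_CCD_CONFIGS = {
    # ccd_count: (max_cores, gmi3_links_per_ccd)
    12: (96, 1),  # halo parts (96 and 84 cores) use one GMI3 link per CCD
    8:  (64, 1),  # 8-CCD parts also use one GMI3 link per CCD
    4:  (32, 2),  # low-CCD parts can run two GMI3 links per CCD
}

def describe(ccd_count: int) -> str:
    max_cores, links = GENOA_CCD_CONFIGS[ccd_count]
    total_links = ccd_count * links
    return (f"{ccd_count} CCDs: up to {max_cores} cores, "
            f"{links} GMI3 link(s) per CCD, {total_links} links into the IOD")

if __name__ == "__main__":
    for ccds in (12, 8, 4):
        print(describe(ccds))
```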
That is how the CCDs and the IO die are connected. Next comes building out the rest of the system with I/O and memory, which is what gives us single- and dual-socket servers.
AMD has a new generation of Infinity Fabric, its socket-to-socket interconnect, and it gets another speed bump in this generation. Infinity Fabric runs over SerDes that can also handle things like PCIe, so as those SerDes get faster for PCIe, Infinity Fabric follows suit. AMD moving from the PCIe Gen3 era to the Gen5 era on these links means the interconnect bandwidth is now massive.
In the above diagram, you may notice that there are 3Link and 4Link options. AMD now has so much interconnect bandwidth that some customers will decide three links are enough. The fourth link, which would otherwise require an x16 port on each CPU, can instead be configured to operate as PCIe Gen5 x16, giving us 160 PCIe Gen5 lane configurations in 2P servers. We first covered this in Dell and AMD Showcase Future of Servers 160 PCIe Lane Design. Many applications, such as storage and GPU servers, were already finding with the previous PCIe Gen4 era of Infinity Fabric that only three socket-to-socket links were needed instead of four. Here is what this functionality can look like in a dual-socket server:
The cabled connections are part of the socket-to-socket interconnect. Not only does that free up some motherboard trace requirements, but it also allows the same connectors to be used for PCIe instead, simply by using different (longer) cables to connect to PCIe devices. In this server, these links sit forward of the CPUs, placing them closer to the front panel for servicing PCIe Gen5 SSDs and accelerators. In the above example, the links are configured socket-to-socket, but they can instead be re-routed to other components such as front-panel NVMe SSDs, FPGAs, AI accelerators, or GPUs.
This functionality is due to the awesome SerDes that EPYC employs, and has employed for generations. Each CPU has 128 lanes that are generally broken up into x16 segments. 64 of them (4x x16) are xGMI (socket-to-socket Infinity Fabric) and PCIe Gen5 capable. The other 64 lanes (also 4x x16) are xGMI, PCIe, CXL, and SATA capable. SATA is so slow compared to a PCIe Gen5 x1 lane that it is a very inefficient use of lanes; AMD still supports it, but at some point we expect support to be dropped and a small amount of transistor budget to be reclaimed.
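As a rough mental model of how those 128 lanes are grouped, here is a small sketch. The lane groupings and capabilities are as described above; the group names and the representation are our own simplification.

```python
# Simplified model of the 128 SerDes lanes on a Genoa socket, grouped into
# x16 segments as described above. Group names are ours, not AMD's.

X16_SEGMENTS = 8  # 8 x16 segments = 128 lanes per CPU
GROUP_A = {"lanes": 64, "modes": {"xGMI", "PCIe"}}                  # Gen5-capable
GROUP_B = {"lanes": 64, "modes": {"xGMI", "PCIe", "CXL", "SATA"}}   # Gen5-capable

assert GROUP_A["lanes"] + GROUP_B["lanes"] == X16_SEGMENTS * 16

def cxl_capable_lanes() -> int:
    # Only the second 64-lane group can be placed into CXL mode.
    return GROUP_B["lanes"] if "CXL" in GROUP_B["modes"] else 0

print(f"Total lanes per CPU: {GROUP_A['lanes'] + GROUP_B['lanes']}")  # 128
print(f"CXL-capable lanes:   {cxl_capable_lanes()}")                  # 64
```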
You may also notice a few other features on the slide above. First, one can have up to 9 devices on an x16 root, with bifurcation going down to x1. The other big one is the IO connectivity lanes. These are PCIe Gen3 lanes, not Gen5. Instead of a single WAFL lane (WAFL still lives in Genoa), AMD now has a few extra lanes for things like motherboard low-speed SATA connections, 1/10GbE NICs, and so forth. In dual-socket servers, one gets 12 extra lanes, and in a single socket, 8 extra lanes. The difference is that two of the eight on each CPU are used for inter-socket communication in dual-socket configurations.
The impact of this is that AMD is able to get 128 PCIe Gen5 lanes (up to 64 of which can be CXL) in a single socket, plus eight PCIe Gen3 lanes for miscellaneous functions. It would be strange to call that a 136 PCIe lane CPU because of the vastly different speeds, but that is one way to look at it. In dual-socket configurations, we will see 2x 64 = 128 or 2x 80 = 160 PCIe Gen5 lanes, depending on the socket-to-socket xGMI link configuration mentioned earlier. Add another 12 PCIe Gen3 lanes, and we get up to 172 total lanes, but only up to 160 PCIe Gen5 lanes.
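For those keeping score, the lane math above works out as in this quick sketch. The figures are the ones from this section; the function itself is just our back-of-the-envelope illustration, not an AMD tool.

```python
# Back-of-the-envelope Genoa lane math, per the figures above.
# Each xGMI (Infinity Fabric) link consumes an x16 segment on each CPU.
GEN5_LANES_PER_CPU = 128
GEN3_MISC_LANES_1P = 8    # bonus Gen3 lanes in a single-socket system
GEN3_MISC_LANES_2P = 12   # 6 usable per CPU in 2P (2 of 8 go to inter-socket use)

def gen5_lanes(sockets: int, xgmi_links: int = 4) -> int:
    """PCIe Gen5 lanes left for devices in a 1P or 2P configuration."""
    if sockets == 1:
        return GEN5_LANES_PER_CPU                      # 128
    return sockets * (GEN5_LANES_PER_CPU - 16 * xgmi_links)

print(gen5_lanes(1))                 # 128 (plus 8 Gen3 misc lanes = 136 total)
print(gen5_lanes(2, xgmi_links=4))   # 128
print(gen5_lanes(2, xgmi_links=3))   # 160 (plus 12 Gen3 misc lanes = 172 total)
```

Running this reproduces the 128-lane and 160-lane 2P cases, depending on whether three or four xGMI links are used.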
PCIe Gen5 is going to add a lot of cost to systems. OEMs have told us that the signaling challenges are becoming formidable. Cabled connections cost more, but they can bridge longer distances. Even with cables, we are seeing designs like this where connectors are angled to help with bend radius and to shorten cable runs. PCIe Gen5 is fast, but that speed comes at a cost.
AMD also has a new AVIC controller. This helps with interrupt performance and means that, out of the box, the new chips can nearly saturate 200Gbps links. With 100Gbps, AMD platforms could already saturate the links, but more configuration steps were needed to get there.
AMD also has new power management features. Something many do not know is that AMD platforms can be set to performance-focused or power-focused profiles. In performance determinism mode, performance is kept consistent across all chips; this is the standard setting and what most benchmarks are run with. In power determinism mode, one gets to play the "silicon lottery" a bit: set a power level and let the EPYC chips in a system run at whatever performance level they can achieve. We did not get to play with this as much as we wanted, but in about 30 minutes of running a few workloads we were able to eke out about 2% better performance, albeit with somewhat higher overall power consumption (~50W), using power determinism. That will not make the main piece because we did not have time to properly profile the result, but it was clearly doing something.
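As a rough way to frame that trade-off, here is the simple perf-per-watt arithmetic on our quick-and-dirty numbers. The baseline system power here is an assumed placeholder, and the 2% and ~50W figures are only the unprofiled observations above.

```python
# Rough perf-per-watt comparison of the two determinism modes using the
# approximate numbers we observed. Illustrative only, not a profiled result.

baseline_perf = 1.00       # performance determinism (normalized)
baseline_power_w = 700.0   # assumed 2P system power, placeholder for illustration

power_det_perf = 1.02                          # ~2% more performance observed
power_det_power_w = baseline_power_w + 50.0    # ~50W more power observed

baseline_eff = baseline_perf / baseline_power_w
power_det_eff = power_det_perf / power_det_power_w

print(f"Perf/W change in power determinism: {100 * (power_det_eff / baseline_eff - 1):+.1f}%")
```

With those assumed inputs, the small performance gain comes at a perf-per-watt penalty, which is why profiling the modes on a real workload matters.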
Next, let us get to the memory and CXL 1.1 implementation.