CXL 3.1 Specification Aims for Big Topologies

CXL 3.1 Fabric Port Based Routing

Recently, the CXL 3.1 spec was announced. The new spec has additional fabric improvements for scale-out CXL, new trusted execution environment enhancements, and improvements for memory expanders. A lot has happened since our Compute Express Link or CXL What it is and Examples piece, so let us get to it. Here is the old CXL Taco video for the introduction.

CXL 3.1 Specification Aims for Big Topologies

CXL 3.1 has a number of big changes under the hood, largely to address what happens when teams build larger CXL systems and topologies.

CXL 3.1 Fabric Enhancements Overview

CXL 3.x brings features like port-based routing (PBR), which is different from hierarchy-based routing (HBR), the approach that more closely resembles a PCIe tree topology. PBR is needed to facilitate larger topologies and any-to-any communication.

CXL 3.1 Fabric Port-Based Routing

One of the CXL 3.1 enhancements is support for host-to-host communication over a CXL fabric using Global Integrated Memory (GIM).

CXL 3.1 Fabric Host To Host Global Integrated Memory
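
For a sense of what host-to-host communication through shared fabric memory could look like from software, here is a toy Python example where a plain bytearray stands in for a shared memory window and two "hosts" pass a message through it. This is only an assumption-laden illustration of the pattern; the real GIM mechanics, coherence, and signaling are defined by the spec and platform software, not by application code like this.

```python
# Illustrative sketch only: a bytearray stands in for a fabric-attached memory
# window that two hosts can both reach. Real GIM behavior is not modeled here.
import struct

SHARED = bytearray(4096)   # pretend this is a shared CXL fabric memory window
HDR = struct.Struct("<I")  # 4-byte length prefix written by the sender

def host_a_send(message: bytes) -> None:
    """Host A writes a length-prefixed message into the shared window."""
    SHARED[:HDR.size] = HDR.pack(len(message))
    SHARED[HDR.size:HDR.size + len(message)] = message

def host_b_receive() -> bytes:
    """Host B reads the length prefix and pulls the message back out."""
    (length,) = HDR.unpack_from(SHARED, 0)
    return bytes(SHARED[HDR.size:HDR.size + length])

host_a_send(b"hello from host A over the fabric")
print(host_b_receive())
```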

Another big one is direct peer-to-peer (P2P) support for CXL.mem transactions over CXL. With all of the discussion about GPU memory capacity being a limiter for AI, this is the type of use case where one could add CXL memory and accelerators onto a CXL switch and have the accelerators directly use Type 3 CXL memory expansion devices.

CXL 3.1 Fabric Direct P2P Mem Support For Accelerators Through PBR Switches
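
To illustrate that memory expansion use case, here is a toy allocator sketch in Python: an accelerator prefers its local HBM and spills to a CXL Type 3 expander behind the switch once local capacity runs out. The pool names and capacities are invented for illustration, and none of this reflects how the P2P CXL.mem mechanics actually work at the hardware level.

```python
# Toy sketch of the spill-over use case described above. Every class and
# number here is made up; the real P2P CXL.mem path lives in hardware.

class MemoryPool:
    def __init__(self, name: str, capacity_gb: int):
        self.name = name
        self.capacity_gb = capacity_gb
        self.used_gb = 0

    def try_alloc(self, size_gb: int) -> bool:
        if self.used_gb + size_gb > self.capacity_gb:
            return False
        self.used_gb += size_gb
        return True

def place_tensor(size_gb: int, local: MemoryPool, expander: MemoryPool) -> str:
    """Prefer fast local accelerator memory, spill to CXL expander memory."""
    if local.try_alloc(size_gb):
        return local.name
    if expander.try_alloc(size_gb):
        return expander.name
    raise MemoryError("both pools exhausted")

hbm = MemoryPool("accelerator HBM", capacity_gb=80)
cxl = MemoryPool("CXL Type 3 expander", capacity_gb=512)

for size in (60, 30, 100):
    print(f"{size} GB tensor -> {place_tensor(size, hbm, cxl)}")
# 60 GB lands in HBM; the 30 GB and 100 GB tensors spill to the CXL expander.
```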

There is also a Fabric Manager API definition for port-based routing CXL switches. The fabric manager might end up being a key CXL ecosystem battleground since it will need to track much of what is going on in the cluster.

CXL 3.1 Fabric Manager API
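
For a flavor of the kind of work a fabric manager does, here is a hypothetical Python sketch of a client that inventories switch ports and binds a memory device to a host. The class and method names are invented purely for illustration; the actual Fabric Manager API is defined as a command set in the specification and does not look like this.

```python
# Hypothetical sketch of fabric-manager-style orchestration. Names and port
# labels are invented and do not match the real CXL FM API command set.
from dataclasses import dataclass, field

@dataclass
class FabricManager:
    # switch device port -> host port it is currently bound to
    bindings: dict = field(default_factory=dict)

    def list_ports(self) -> list:
        """Pretend inventory of a PBR switch's ports."""
        return ["usp0 (host A)", "usp1 (host B)", "dsp0 (Type 3 expander)"]

    def bind(self, device_port: str, host_port: str) -> None:
        """Assign a downstream device (or one of its logical devices) to a host."""
        if device_port in self.bindings:
            raise RuntimeError(f"{device_port} already bound to {self.bindings[device_port]}")
        self.bindings[device_port] = host_port

    def unbind(self, device_port: str) -> None:
        self.bindings.pop(device_port, None)

fm = FabricManager()
print(fm.list_ports())
fm.bind("dsp0", "usp0")  # give host A the memory expander behind dsp0
print(fm.bindings)       # {'dsp0': 'usp0'}
```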

The CXL 3.1 Trusted-Execution-Environment Security Protocol (TSP) is the next step in handling security on the platform. Imagine a cloud provider with multi-tenant VMs sharing devices that are connected via CXL.

CXL 3.1 Security Trusted Security Protocol TSP

As a result, things like confidential computing, a hot topic in today’s cloud VMs, need to extend past the confines of a server to the devices attached to the fabric.

CXL 3.1 Security Elements Of TSP

CXL-attached memory is also getting a number of RAS features, plus additional bits for metadata. Again, this is important for ensuring reliability as the topologies get bigger.

CXL 3.1 Memory Enhancements

There are a surprising number of new features in CXL 3.1.

Final Words

CXL 3.0/CXL 3.1 is still far enough out in terms of products that our sense is most companies will adopt CXL 3.1 over CXL 3.0 when products hit the market. At the same time, my question to the CXL folks at Supercomputing 2023 was whether CXL 3.1 is designed to be big enough. Currently, the specification is designed for a few thousand CXL devices to be connected. Meanwhile, we have AI clusters today being built with tens of thousands of accelerators. In the CXL world, there may be more attached CXL devices than today’s accelerators, so my question is whether CXL will need to scale up as well.

There is still a lot of work to be done, but the good news is that we will start to see CXL support pick up in 2024 with not just experimental, but also useful production use cases.
