Hyper-scalers Are Using CXL to Lower the Impact of DDR5 Supply Constraints

Marvell Structera A 16-core Arm and DDR5 Chip

Let us get a bit more exciting and take the Structera X 2504 up a notch by adding sixteen Arm Neoverse V2 cores. These are high-performance Arm cores like you would see in an NVIDIA Grace CPU, but here they sit on a CXL accelerator. That is what makes the Marvell Structera A a different animal.

Marvell Structera A 2504 Overview

Taking a look at the block diagram, this is just like the X 2504, but the A 2504 has the Arm Neoverse V2 cores in the middle.

Marvell Structera A Block Diagram

If you are wondering how this works in practice, think of it as an endpoint with a lightweight Linux distribution running onboard. That makes it like a mini-server inside another server, designed to run at the speed of memory: fast Arm cores whose goal is to do compute on the attached memory.

Marvell Structera A CXL In Lab 2

The cards we saw in the lab were low-profile x16 cards with DDR5 packages on both sides.

Marvell Structera A CXL Rear DDR5

If you need a mental model of how this works, think of it like a DPU, but instead of processing networking flows, it is designed to work on its local memory. That means one can scale both memory and compute in a server at the same time. Of course, the fun part of going to Marvell is that we got to see a live demo.

The CXL Can Be Fast Demo

Folks know that at STH I prefer to show the behind-the-scenes view instead of perfectly curated images. Take Exhibit A in that journey: the CXL Demo Rig. This is one of the systems with Structera A 2504 cards, along with a Supermicro Intel Xeon server, an Antec power supply, a few fans, and lab-appropriate wiring.

Marvell Structera A CXL Demo Time

The other demo rig looked quite a bit better.

Marvell Structera A CXL Rear DDR5

In the video, the demo we showed had the Structera X acting as storage for the KV cache when the GPU would otherwise have run out of memory. This was effectively telling a Llama 3 model to read books, then asking it questions about the books.

Marvell Structera X Demo 20 Questions With Dante S Inferno

The model was still running on the GPU, but just being able to keep the KV cache stored in memory changed the time to first token (TTFT). The runs varied a bit, but over many runs the TTFT savings added up to about 30 seconds in total. Of course, this is really just saying that having more memory is good even if the latency is a bit higher.
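
As a rough mental model of why the extra tier helps, here is a minimal Python sketch. It is not Marvell's software or any inference framework's API. The `TieredKVCache` class, the capacities, the book names beyond the Inferno, and the strings standing in for KV tensors are all hypothetical. It only illustrates the idea that when the fast tier fills up, entries are demoted to a larger, slower tier instead of being dropped, so a later question about the same book does not force the prefill to be recomputed.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier cache: a small fast tier (think GPU memory) backed by
    a larger slow tier (think CXL-attached memory). Purely illustrative."""

    def __init__(self, fast_capacity, slow_capacity):
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity
        self.fast = OrderedDict()  # least recently used entry comes first
        self.slow = OrderedDict()
        self.recomputes = 0  # full misses, where prefill would be redone

    def _put_fast(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        # Demote overflow from the fast tier to the slow tier.
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value
            # Only when the slow tier is also full is an entry really lost.
            while len(self.slow) > self.slow_capacity:
                self.slow.popitem(last=False)

    def get(self, key, compute):
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        if key in self.slow:
            value = self.slow.pop(key)  # promote back to the fast tier
        else:
            self.recomputes += 1
            value = compute(key)  # stand-in for an expensive prefill
        self._put_fast(key, value)
        return value


def run(slow_capacity):
    """Ask about three hypothetical books twice; return prefill recomputes."""
    cache = TieredKVCache(fast_capacity=2, slow_capacity=slow_capacity)
    books = ["inferno", "purgatorio", "paradiso"]
    for book in books + books:
        cache.get(book, compute=lambda k: f"kv({k})")
    return cache.recomputes


print("no slow tier:", run(0), "recomputes")
print("with slow tier:", run(8), "recomputes")
```

In this toy setup, the second pass over the books hits the slow tier instead of recomputing. That is the same trade the demo makes: a slower read from extra memory in place of redoing the work.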

Still, I wanted to know how the Structera A worked. Here is a fun one. On the left, you can see a system where a search is running on 48 Intel Xeon cores using the local memory attached to the Xeon socket.

Marvell Structera A Demo Image Three Structera A Devices

On the right-hand side, we are logged into the three Structera A cards, which are running near 100% utilization, while the Xeon cores are not loaded at all (in the bottom right corner). The idea with this demo is that we have 48 Arm Neoverse V2 cores across the three cards. With the Xeon cores free, the system can do more by scaling memory with the added compute.
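
To make the offload pattern concrete, here is a small Python sketch under loud assumptions. The `NearMemoryDevice` class and `host_search` function are hypothetical names, threads stand in for the cards, and the article does not describe how commands actually reach the Arm cores. The point is only the shape of the work: each card scans the data in its own memory, and the host merges small result lists rather than scanning everything itself.

```python
from concurrent.futures import ThreadPoolExecutor

CARDS = 3
CORES_PER_CARD = 16  # sixteen Arm Neoverse V2 cores per Structera A
TOTAL_ARM_CORES = CARDS * CORES_PER_CARD  # the 48 cores in the demo


class NearMemoryDevice:
    """Hypothetical stand-in for one card: it owns a shard of the data
    and searches it locally, next to the memory that holds it."""

    def __init__(self, shard):
        self.shard = shard

    def search(self, needle):
        # Return the local indices of records that contain the needle.
        return [i for i, rec in enumerate(self.shard) if needle in rec]


def host_search(devices, needle):
    """Host side: fan the query out, then merge the small result lists.
    Threads here only simulate cards working in parallel."""
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        per_device = list(pool.map(lambda d: d.search(needle), devices))
    return {dev_id: hits for dev_id, hits in enumerate(per_device) if hits}


records = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
devices = [NearMemoryDevice(records[i::CARDS]) for i in range(CARDS)]
print(TOTAL_ARM_CORES, host_search(devices, "eta"))
```

The host never touches the records themselves, only the hit lists, which is why the Xeon cores can sit idle while the cards are busy.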

These demos were focused on showing how the modules can be used for performance benefits, but there are a number of use cases where having more memory, or more memory and compute, in a system can add performance.

Final Words

This was a neat behind-the-scenes look at a technology that is being used at hyper-scalers. From an industry perspective, the idea started at one hyper-scaler, but almost all of them are now using it, especially since they see the benefits of recycling their DDR4.

Marvell Structera A CXL In Lab 1

The Marvell Structera is currently focused on those hyper-scale projects, which makes sense. Recycling DDR4 or adding DDR5 is easier when there are a limited number of host systems, module types, and so forth. Just getting all of the firmware to work is one of the reasons we do not see commercially available CXL expansion devices. Structera A also requires managing the Linux distribution on the cards, sending commands to the Arm cores, and so forth. Still, this is super-cool technology that is being deployed at hyper-scalers and that we do not get to see often. Hopefully the STH community likes seeing these behind-the-scenes lab visits.

1 COMMENT

  1. Without third-party testing of latency, bandwidth and latency while being bandwidth bottlenecked it’s difficult to know whether this is useful or not.
