At AWS re:Invent 2021, the company detailed one of the fundamental changes it has used to accelerate innovation: its Nitro cards. Drawing the closest comparison we can, these cards are most similar to the DPUs that are starting to become prominent in the industry, but AWS was already deploying them at scale in 2017. Still, this is the model the industry will need to adopt, so it is worth looking at what drove the change.
AWS Nitro: The Big Cloud DPU Deployment Detailed
AWS is on a path to build its own chips. While some think the simple reason is cost, we are going to get into why it goes beyond cost and is instead about delivering capabilities before the rest of the industry.
We have discussed AWS Nitro many times on STH and the unique capabilities it provides. Key here is that it has been a cornerstone of AWS for the past four years. Putting that into perspective, the capabilities Nitro provides are still not replicated for private clouds.
The impact has been huge. Nitro has enabled AWS to rapidly expand the types of instances it can provide by being a common point to manage hardware.
Taking a step back, here is AWS’s diagram of a standard server. One can see the various components of the server, and probably one of the most interesting PCIe connectors ever rendered.
Prior to the Nitro system, AWS utilized the Xen hypervisor. The key here is that the VPC networking, EBS storage, local storage, and management functions all happen at the dom0 level. An easier way to look at this is that the same CPU being provisioned for EC2 instances must also handle the networking, storage, and management.
One of the interesting bits in the re:Invent presentation is that there are several types of Nitro cards. There are cards for networking, EBS storage, SSDs, Aqua for Analytics, and also a system controller. Looking at the VPC networking side, there are several key features. First, the data plane is offloaded to the Nitro card, as is the end-to-end encryption. One of the key capabilities here is that the Elastic Network Adapter (ENA) is presented by the Nitro card. This ENA can scale from 10Gbps to 100Gbps on the same driver, which aids portability within AWS's cloud. If you have followed Intel's Ethernet drivers changing across generations, that is quite an accomplishment. The Elastic Fabric Adapter is a higher-performance option aimed more at HPC-style workloads.
Nitro cards effectively are designed to take the VPC networking, EBS storage, local SSD storage, and the management controller functions from the hypervisor.
On the security side, the Nitro security chip is very important. One of its big functions is to block unauthorized writes to nonvolatile storage. While one may immediately think this means local SSDs, the more impactful way to think about this is all of the small stores in a server. For example, the BMC has its own flash storage. Likewise, a fan controller can have its own storage for its microcontroller. In an environment like the AWS cloud, AWS cannot afford for an attacker to compromise microcontroller storage. The security chip also allows AWS to do things like push firmware updates without impacting customers running workloads on a server.
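To make the write-gating idea concrete, here is a toy sketch of a nonvolatile store that rejects writes lacking the security controller's authorization. This is purely illustrative; the class names and token check are invented, and the real Nitro security chip sits on the hardware path, not in software:

```python
class UnauthorizedWrite(Exception):
    """Raised when a write arrives without the controller's authorization."""


class FirmwareStore:
    """Toy model of a write-gated nonvolatile store (e.g. BMC or
    fan-controller flash). Hypothetical illustration, not AWS code."""

    def __init__(self, authorized_token: str):
        self._token = authorized_token
        self._flash: dict[str, bytes] = {}

    def write(self, region: str, data: bytes, token: str) -> None:
        # The gate: writes without the controller's token are rejected,
        # so a compromised microcontroller cannot persist malicious code.
        if token != self._token:
            raise UnauthorizedWrite(f"blocked write to {region}")
        self._flash[region] = data

    def read(self, region: str) -> bytes:
        return self._flash[region]


if __name__ == "__main__":
    store = FirmwareStore(authorized_token="nitro-controller")
    store.write("bmc", b"signed-firmware-v2", token="nitro-controller")
    try:
        store.write("fan-mcu", b"malicious-blob", token="attacker")
    except UnauthorizedWrite as e:
        print(e)  # blocked write to fan-mcu
```

The design point is that the gate sits in front of every small store in the chassis, not just the obvious SSDs, which is what makes the fleet-wide firmware update story possible.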
Another part of AWS’s magic is that it stopped using Xen and moved to the lighter-weight, KVM-based Nitro hypervisor.
Since AWS is using Nitro cards for the functions we discussed earlier, the host CPU no longer needs to handle them. That does a few things. First, the functions formerly in dom0 are offloaded, so the hypervisor can focus on just providing CPU and memory provisioning. The second impact is that host CPU cycles that used to be consumed by networking, storage, and management are now handled by the Nitro cards.
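The back-of-envelope math here is simple. With hypothetical numbers (AWS does not publish its dom0 overhead figures, so the 12% below is purely illustrative), one can sketch how much host capacity offloading returns:

```python
def reclaimed_cores(total_cores: int, dom0_overhead_frac: float) -> float:
    """Cores' worth of capacity freed for customer instances when
    dom0-style networking, storage, and management work moves off the
    host CPU onto Nitro cards. The overhead fraction is a hypothetical
    illustration, not an AWS-published figure."""
    return total_cores * dom0_overhead_frac


if __name__ == "__main__":
    # Hypothetical example: a 64-core host losing ~12% of its cycles
    # to dom0 duties gets roughly 7-8 cores' worth of capacity back.
    print(reclaimed_cores(64, 0.12))  # → 7.68
```

At fleet scale, even a single-digit percentage reclaimed per host compounds into a large amount of sellable capacity, which is part of why this goes beyond a simple cost story.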
Offloading the dom0 functions reduced jitter. “Series 1” below is a c4-generation instance pre-Nitro. “Series 2” is a c5-generation instance with Nitro. “Series 3” is Graviton with Nitro.
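For a rough sense of how one might observe this kind of jitter, here is a minimal sketch that measures timer-wakeup deviation on the local host. This is not AWS's benchmark methodology, just an illustration of the metric: on a host whose CPUs also service networking and storage work (the pre-Nitro dom0 model), wakeup deviations tend to be larger and more variable.

```python
import statistics
import time


def measure_jitter(interval_s: float = 0.001, samples: int = 500) -> dict:
    """Repeatedly sleep for a fixed interval and record how far each
    wakeup deviates from the requested interval (one simple proxy for
    scheduling jitter)."""
    deviations = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(interval_s)
        elapsed = time.perf_counter() - start
        deviations.append(abs(elapsed - interval_s))
    deviations.sort()
    return {
        "mean_us": statistics.mean(deviations) * 1e6,
        "p99_us": deviations[int(samples * 0.99)] * 1e6,
        "max_us": deviations[-1] * 1e6,
    }


if __name__ == "__main__":
    stats = measure_jitter()
    print(f"mean wakeup error: {stats['mean_us']:.1f} us, "
          f"p99: {stats['p99_us']:.1f} us, max: {stats['max_us']:.1f} us")
```

The interesting number in charts like AWS's is the tail (p99/max), since tail jitter is what noisy-neighbor-sensitive workloads actually feel.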
So there are performance benefits for AWS in getting rid of the old Xen setup and instead moving to the Nitro solution.
The key takeaway here is really the gap between what AWS is doing and what the private clouds of the world are doing. The promise of DPUs is that they can bring this operating model to other clouds. Intel’s Mount Evans was designed to bring this functionality to Google Cloud. What is clearly missing is the solution for the rest of the industry. This is a gap that HPE, Dell EMC, Lenovo, and other vendors in the industry need to close. The NVIDIA BlueField-2 is a step in this direction, but the software support is still early. VMware Project Monterey is hoping to bring this to VMware’s ecosystem. Support for this functionality needs to go beyond VMware and into OpenStack and other clusters. The challenge is that AWS Nitro is the leading practice in the industry, and AWS is four, going on five, years ahead of the industry.
The big challenge, of course, is that while AWS can rapidly expand the types of instances it offers, even adding Apple Mac mini nodes (AWS added the M1 Mac mini at this re:Invent), adding DPUs is a bigger challenge for the big hardware providers. On one hand, they need the functionality to stay relevant. On the other, adding DPUs takes away a lot of their differentiation. This is a big engineering challenge, but it is one that will need to be tackled for the industry to thrive.