Confidential Computing Needs to Go Mainstream


The TEE and Remote Attestation Magic

Taking a step back, the idea of a secure TEE where you can bring in your data, run your application code, and know that others cannot see what is going on inside is a great one. The fundamental challenge for a customer is determining whether they actually have a secure TEE, rather than a random virtual machine on a compromised hypervisor. That is where remote attestation comes in. Remote attestation is the technical mechanism that allows an application to verify that its compute environment is secure before sharing secrets. Instead of simply trusting that a virtual machine is secure, it is a survey-and-verification process carried out as a cryptographic handshake.

AMD Measurements Of Trusted Computing Base

How this works is that the TEE generates a signed “attestation document,” which acts as verifiable proof that the environment is a genuine TEE. Producing it means gathering measurements of the hardware, firmware, and other components in the environment to show that none have been tampered with. Imagine keeping a known-good baseline of the entire stack, all the way from the hardware up into the virtual machine, obtaining a cryptographically signed record of what is actually running the TEE, and comparing the two. If the attestation fails, even for a benign reason, the customer knows not to treat the environment as trusted.
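To make the handshake concrete, here is a minimal sketch of the verifier side in Python. The AttestationReport layout is hypothetical and heavily simplified (real SEV-SNP or TDX reports have vendor-defined formats and full certificate chains), but it shows the three core checks: a signature that ties the report to genuine hardware, a nonce that proves freshness, and a measurement that must match the known-good baseline.

```python
# A minimal sketch of the verifier side of remote attestation, assuming a
# hypothetical, simplified report format. Real SEV-SNP / TDX reports have
# vendor-defined layouts and certificate chains; this only illustrates the
# core checks.
import hmac
from dataclasses import dataclass

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec


@dataclass
class AttestationReport:          # hypothetical, simplified report
    measurement: bytes            # hash of firmware + guest image
    report_data: bytes            # caller-supplied nonce, binds freshness
    raw: bytes                    # the signed byte range
    signature: bytes              # ECDSA P-384 signature over `raw`


def verify_report(report: AttestationReport,
                  vendor_key: ec.EllipticCurvePublicKey,
                  expected_measurement: bytes,
                  nonce: bytes) -> bool:
    # 1. The signature proves the report came from genuine hardware.
    try:
        vendor_key.verify(report.signature, report.raw,
                          ec.ECDSA(hashes.SHA384()))
    except InvalidSignature:
        return False
    # 2. The nonce proves the report is fresh, not a replay.
    if not hmac.compare_digest(report.report_data, nonce):
        return False
    # 3. The measurement proves the expected stack is what actually booted.
    return hmac.compare_digest(report.measurement, expected_measurement)
```

Only if all three checks pass does the verifier proceed; a failure on any one of them is treated the same way, as an untrusted environment.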

AMD SEV SNP Attestation Flow

These days, the major cloud providers have reputations, earned over years, for doing a lot to further security in the industry. Still, instead of just trusting that the compute environment a company like Microsoft or Google provides is secure, remote attestation of a TEE lets you verify that it is. Once you know the TEE is secure, you can move your encrypted data (data at rest) to the TEE over encrypted networking (data in transit) and then process it there (data in use).
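In practice, the flow is “attest first, then release secrets.” The sketch below builds on the verify_report() sketch above; fetch_report() and send_over_tls() are hypothetical placeholders for whatever transport and key-release service your stack actually uses.

```python
import os

# "Attest, then release secrets": a sketch building on verify_report() above.
# fetch_report() and send_over_tls() are hypothetical placeholders for the
# real transport in your stack.
def provision_data_key(channel, vendor_key, expected_measurement, data_key):
    nonce = os.urandom(32)                     # fresh challenge per session
    report = fetch_report(channel, nonce)      # TEE returns a signed report
    if not verify_report(report, vendor_key, expected_measurement, nonce):
        raise RuntimeError("attestation failed; refusing to send the key")
    # Only after verification does the key for the data at rest leave home,
    # and only over an encrypted channel (data in transit).
    send_over_tls(channel, data_key)
```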

What About AI Accelerators?

At this point, you might be thinking that this works well for CPUs and locally attached memory, but what about AI accelerators? After all, we are in an AI supercycle, and so far we have been talking about the TEE in terms of server CPUs running VMs on locally attached memory. That is where TDISP, or the TEE Device Interface Security Protocol, comes in. Here is a good PCI-SIG primer for TDISP if you want to get into more detail.
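At its core, TDISP defines a lifecycle for a TEE Device Interface (TDI): the host locks the device configuration, the TEE attests it, and only then is the interface moved to RUN, where it may touch confidential memory. The sketch below encodes the four states the spec defines; the transition table is my own simplification of the protocol messages.

```python
# A sketch of the TDISP TDI (TEE Device Interface) lifecycle. The four state
# names follow the PCI-SIG spec; the transition table is a simplification.
from enum import Enum, auto


class TDIState(Enum):
    CONFIG_UNLOCKED = auto()  # host may still reconfigure the device
    CONFIG_LOCKED = auto()    # config frozen so it can be measured/attested
    RUN = auto()              # attested; may access confidential memory
    ERROR = auto()            # any violation drops the interface here


ALLOWED = {
    TDIState.CONFIG_UNLOCKED: {TDIState.CONFIG_LOCKED},
    TDIState.CONFIG_LOCKED: {TDIState.RUN, TDIState.CONFIG_UNLOCKED,
                             TDIState.ERROR},
    TDIState.RUN: {TDIState.CONFIG_UNLOCKED, TDIState.ERROR},
    TDIState.ERROR: {TDIState.CONFIG_UNLOCKED},
}


def transition(cur: TDIState, new: TDIState) -> TDIState:
    if new not in ALLOWED[cur]:
        raise ValueError(f"illegal TDISP transition {cur.name} -> {new.name}")
    return new
```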

MiTAC G8825Z5 AMD Instinct MI325X GPUs OAM UBB Tray Heatsinks And Handle

With TDISP, the goal is to get to Confidential AI and other confidential accelerated computing platforms by securing the PCIe communication channels along with the accelerators themselves. While CPUs have been implementing confidential computing for years, AI accelerators are quickly adding the capability, and we expect that to continue.
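The link security underneath TDISP comes from PCIe IDE (Integrity and Data Encryption), which protects traffic on the wire with AES-GCM. The toy snippet below illustrates the same primitive in software; real IDE key programming and TLP framing happen in hardware, and the payload and header here are made-up stand-ins.

```python
# PCIe IDE, which TDISP builds on, encrypts and integrity-protects link
# traffic with AES-GCM. This is a software toy showing the same primitive;
# the payload and header below are made-up stand-ins.
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
aead = AESGCM(key)
nonce = os.urandom(12)

tlp_payload = b"DMA write to confidential guest memory"  # stand-in data
header = b"tlp-header"  # authenticated but not encrypted, like a TLP prefix

ciphertext = aead.encrypt(nonce, tlp_payload, header)          # confidentiality
assert aead.decrypt(nonce, ciphertext, header) == tlp_payload  # integrity
```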

AMD SEV TIO Diagram

While at first you might think this section is just another “because everything needs to be AI these days” pitch, take a step back. Think of all the confidential datasets that are massive, but also sensitive because they are proprietary and/or cover so much that the details cannot be made public. Those are exactly the large datasets you might want to train or fine-tune an AI model on. You might also want to perform inference on the existing data or use it to process new data. The kinds of companies and organizations that need confidential computing today are also looking at that same data as an advantage in the era of AI.

A Few Thoughts on Where We Go From Here

That brings us to who is leading the charge. These days, cloud providers see industries like financial services, healthcare, and government as key confidential computing users. Whether that is keeping financial transaction data safe, maintaining HIPAA or GDPR compliance, or keeping secrets safe, these are the key use cases pushing confidential computing to proliferate. Those are large industries, but the broader AI industry may be the largest of all.

Given the importance and size of the markets that need confidential computing, it makes sense that there is an industry effort behind it, and that modern hardware continues to add new confidential computing features. We expect cloud providers to eventually offer confidential computing as the default. Once the hardware supports it, and providers have to deliver it for one class of customers, it becomes easier to roll out to everyone. If you recall, one of the big reasons Intel moved from its older SGX enclave approach to TDX mirrors AMD’s idea all along: confidential computing will eventually be everywhere, not just in small application enclaves.

Google Confidential VM SEV Enabled On AMD EPYC 7B12 CPU

I also think that we will see more innovation in this space. Security researchers are wildly creative. I remember seeing a demo in Austin, TX, back in 2017 of the original AMD EPYC 7001 “Naples” and its SEV and memory encryption features. Folks then were talking about the possibility of freezing DRAM chips and pulling data off of them, the classic cold boot attack. That was just before the Spectre and Meltdown side-channel attacks came to light. Side-channel attacks remain some of the biggest threat vectors to confidential computing.

AMD EPYC SEV Capabilities By Generation

We are already seeing new use cases, such as the AMD EPYC 9005 adding support for Trusted I/O via TDISP (what AMD had called SEV-TIO) to address confidential computing in the era of AI accelerators. My sense is that as systems get larger and confidential computing takes over, we will see more features added in future chips.

One of the bigger challenges is that providing the TEE and remote attestation is not a trivial task, since you have to build a chain of trust all the way back to the hardware providers. For a cloud provider, this is a capability it can build once and then deploy everywhere. It also gives the provider a way to show that it is a service provider with no visibility into what its customers are doing on its platforms. For an organization with a rack full of virtualization servers, implementing this capability is very challenging.
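As an illustration of what that chain of trust looks like, here is a sketch of walking a three-link certificate chain, modeled loosely on AMD’s ARK -> ASK -> VCEK chain (self-signed vendor root, signing key, per-chip key). A production verifier would also check validity windows, extensions, and revocation; this only shows the signatures chaining back to the silicon vendor’s root.

```python
# A sketch of walking a vendor certificate chain (modeled loosely on AMD's
# ARK -> ASK -> VCEK). Only the signature chain is checked here; real
# verifiers must also check validity periods, extensions, and revocation.
from cryptography import x509
from cryptography.exceptions import InvalidSignature


def verify_chain(root_pem: bytes, signing_pem: bytes, chip_pem: bytes) -> bool:
    root = x509.load_pem_x509_certificate(root_pem)        # vendor root (ARK)
    signing = x509.load_pem_x509_certificate(signing_pem)  # signing key (ASK)
    chip = x509.load_pem_x509_certificate(chip_pem)        # per-chip key (VCEK)
    try:
        root.verify_directly_issued_by(root)        # root is self-signed
        signing.verify_directly_issued_by(root)     # root vouches for ASK
        chip.verify_directly_issued_by(signing)     # ASK vouches for VCEK
    except (InvalidSignature, ValueError, TypeError):
        return False
    return True
```

In AMD’s scheme, the per-chip key at the end of this chain is what signs the attestation report, which is how a report like the one in the earlier sketch ties back to the hardware provider.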

Final Words

If you asked any customer a question like “Would you prefer your computing environments to be secure, or would you prefer that others, including the cloud provider and strangers, could interact with your data?” I think most would answer that they want secure and confidential computing. Likewise, if you asked, “Do you want the AI agents you interact with to run on confidential and secure platforms, or not?” I think most would opt for that security and confidentiality. It just seems like the way things should be by default. We now have hardware that supports these features and a mechanism to verify that they are working.

This is likely a capability we will see more of as traditional CPU compute servers are updated. If you are still running your cloud VMs on old Intel Xeon Cascade Lake processors, for example, the required hardware features are missing. Today, as an industry, we talk about this technology in the context of regulated industries, governments, and so forth, but really, it is a capability we should expect from all of our computing environments.

1 COMMENT

  1. I wonder how many resources are sacrificed (silicon, power, efficiency, and headaches for programmers) to ensure secure encryption across every part of the computer while still keeping compute really fast.

    Compare that to the old way of computing, where we tried to squeeze every cycle just to make compute faster.

    That is how much progress we have made in half a century.
