Tesla Dojo AI Tile Microarchitecture

HC34 Tesla Dojo UArch D1 Die Cover

At Hot Chips 34, we got a look at the microarchitecture of Tesla Dojo. For those who do not know, Tesla is now consuming so many AI resources that not only does it have a giant NVIDIA GPU cluster, but it is also designing its own AI training infrastructure. At HC34, the company has two talks. This article covers the microarchitecture; the next will cover the systems-level presentation.

Note: We are doing this piece live at HC34 during the presentation so please excuse typos.

Tesla Dojo AI System Microarchitecture at HC34

Tesla has built its own exascale AI-class system for machine learning. Basically, as with its in-vehicle systems, Tesla has decided it has the scale to hire folks and build silicon and systems specifically for its application.

HC34 Tesla Dojo UArch What Is Dojo

Tesla is looking at building the system from the ground up. It is not just building its own AI chips, it is building a supercomputer.

HC34 Tesla Dojo UArch Anatomy Of A Distributed System

Each Dojo node has its own CPU, memory, and communication interface.

HC34 Tesla Dojo UArch Of The Dojo Node

Here is the processing pipeline for Dojo’s processor.

HC34 Tesla Dojo UArch Processing Pipeline

Each node has 1.25MB of SRAM. In AI training and inference chips, a common technique is to co-locate memory with compute to minimize data transfers, since data movement is very expensive from a power and performance perspective.

HC34 Tesla Dojo UArch Node Memory

Each node then is connected to a 2D mesh.

HC34 Tesla Dojo UArch Network Interface

Here is the data path overview:

HC34 Tesla Dojo UArch Datapath

Here is an example of the list parsing that the chip can do.

HC34 Tesla Dojo UArch List Parsing

Here is more on the DOJO instruction set. This is Tesla's own creation, not your typical Intel, Arm, NVIDIA, or AMD CPU/GPU instruction set.

HC34 Tesla Dojo UArch Instruction Set

In AI, arithmetic formats are a big deal, specifically which formats are supported by a chip. DOJO was an opportunity for Tesla to look at commonly available industry formats such as FP32, FP16, and BFP16.

HC34 Tesla Dojo UArch Arithmetic Formats

Tesla is also looking at configurable FP8, or CFP8. It has a 4/3 as well as a 5/2 exponent/mantissa range option. That is similar to what the NVIDIA H100 Hopper does with FP8. We also saw the Untether.AI Boqueria 1458 RISC-V Core AI Accelerator focus on different FP8 types.
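To see why the exponent/mantissa split matters, here is a quick sketch of the largest normal value of a generic IEEE-style float with a given bit split. Note this is the textbook formula, not Tesla's CFP8 (which adds a configurable bias/exponent) nor the OCP E4M3 variant (which reclaims some special encodings); it is just meant to show the range trade-off.

```python
def max_normal(e_bits: int, m_bits: int) -> float:
    """Largest normal value of an IEEE-754-style format with e_bits
    exponent bits and m_bits mantissa bits (all-ones exponent code
    reserved for inf/NaN, as in IEEE 754)."""
    bias = 2 ** (e_bits - 1) - 1
    max_exp = (2 ** e_bits - 2) - bias   # top exponent code is reserved
    mantissa = 2 - 2 ** -m_bits          # 1.111...1 in binary
    return mantissa * 2 ** max_exp

# A 5/2 split trades mantissa precision for far more dynamic range than 4/3:
print(max_normal(4, 3))   # 240.0
print(max_normal(5, 2))   # 57344.0
```

The orders-of-magnitude range difference between the two splits is why having both options (and a configurable bias on top) is attractive for training, where gradients and activations have very different distributions.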

HC34 Tesla Dojo UArch Arithmetic Formats 2

Tesla also has a different CFP16 format for higher precision. DOJO supports FP32, BFP16, CFP8, and CFP16.

HC34 Tesla Dojo UArch Arithmetic Formats 3

These cores are then integrated into a manufactured die. The Tesla D1 die is fabbed at TSMC on 7nm. Each die has 354 DOJO processing nodes and 440MB of SRAM.
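The per-die SRAM figure lines up with the per-node figure; a quick sanity check using the numbers from the talk:

```python
nodes_per_die = 354
sram_per_node_mb = 1.25   # MB of SRAM per processing node

total_sram_mb = nodes_per_die * sram_per_node_mb
print(total_sram_mb)      # 442.5 MB, in line with the ~440MB quoted per die
```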

HC34 Tesla Dojo UArch First Integration Box D1 Die

These D1 dies are packaged onto a Dojo Training Tile. The D1 chips are tested and then assembled into a 5×5 tile. These tiles have 4.5TB/s of bandwidth per edge. They also have a 15kW power delivery envelope per module, or roughly 600W per D1 chip minus whatever is used by the 40 I/O dies. The tile also includes all of the liquid cooling and mechanical packaging. This is conceptually similar to how Cerebras packages its WSE-2 giant chip. One can also see why something like the Lightmatter Passage would be attractive if a company did not want to design this.

HC34 Tesla Dojo UArch Second Integration Box Dojo Training Tile

The DOJO interface processors are located on the edges of the 2D mesh. Each training tile has 11GB of SRAM and 160GB of shared DRAM.

HC34 Tesla Dojo UArch Dojo System Topology

Here are the bandwidth figures for the 2D mesh connecting the processing nodes.

HC34 Tesla Dojo UArch Dojo System Communication Logical 2D Mesh

There are 32GB/s PCIe links to the DIPs and host systems.

HC34 Tesla Dojo UArch Dojo System Communication PCIe Links DIPs And Hosts

Tesla also has Z-plane links for longer routes. In the next talk, Tesla discusses system-level innovation.

HC34 Tesla Dojo UArch Communication Mechanisms

There are latency boundaries at the die and tile levels, which is why they are treated differently in Dojo. The Z-plane links are needed because long routes are expensive.

HC34 Tesla Dojo UArch Dojo System Communication Mechanisms

Any processing node can access data across the system. Each node can push or pull data to SRAM or DRAM.

HC34 Tesla Dojo UArch Dojo System Bulk Communication

Tesla Dojo uses a flat addressing scheme for communication.
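As an illustration only (Tesla has not published its actual address layout, and these field widths are made up), a flat scheme might carve a global address into a node ID and a local offset, so any node can name any other node's memory directly:

```python
# Hypothetical field widths -- NOT Tesla's actual layout.
NODE_BITS = 14     # 2^14 = 16384 IDs, enough for the ~8850 nodes in a tile
OFFSET_BITS = 21   # 2^21 bytes covers the 1.25MB of per-node SRAM

def split(addr: int):
    """Split a flat address into (node_id, local_offset)."""
    return addr >> OFFSET_BITS, addr & ((1 << OFFSET_BITS) - 1)

def join(node_id: int, offset: int) -> int:
    """Build a flat address from a node ID and a local offset."""
    return (node_id << OFFSET_BITS) | offset

node, off = split(join(1234, 511))
print(node, off)   # 1234 511
```

The appeal of a flat scheme is that push/pull transfers become plain reads and writes to a global address, with the network interface doing the node lookup.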

HC34 Tesla Dojo UArch Dojo System Network 1

The chips can route around dead processing nodes in software.

HC34 Tesla Dojo UArch Dojo System Network 2

That means that software has to understand the system topology.
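A minimal sketch of the idea (not Tesla's router): software that knows the mesh topology can compute a detour around failed nodes with a plain breadth-first search.

```python
from collections import deque

def route(width, height, src, dst, dead):
    """BFS shortest path on a width x height 2D mesh, avoiding `dead` nodes."""
    prev = {src: None}
    q = deque([src])
    while q:
        x, y = q.popleft()
        if (x, y) == dst:                     # reconstruct path back to src
            path, node = [], dst
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < width and 0 <= nxt[1] < height
                    and nxt not in dead and nxt not in prev):
                prev[nxt] = (x, y)
                q.append(nxt)
    return None                               # destination unreachable

# Detour around a dead node at (1, 0):
print(route(3, 3, (0, 0), (2, 0), dead={(1, 0)}))
# [(0, 0), (0, 1), (1, 1), (2, 1), (2, 0)]
```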

HC34 Tesla Dojo UArch Dojo System Network 3

DOJO does not guarantee end-to-end traffic ordering so packets need to be counted at their destination.
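The consequence of unordered delivery is that completion is detected by counting rather than by ordering. A toy sketch of the concept (not Tesla's implementation):

```python
import random

class CountingBarrier:
    """Signals completion once the expected number of packets has
    arrived, regardless of the order in which they arrive."""
    def __init__(self, expected: int):
        self.expected = expected
        self.arrived = 0

    def on_packet(self, packet) -> bool:
        self.arrived += 1
        return self.arrived == self.expected   # True once transfer is complete

barrier = CountingBarrier(expected=8)
packets = list(range(8))
random.shuffle(packets)                        # simulate out-of-order delivery
done = [barrier.on_packet(p) for p in packets]
print(done[-1])   # True: complete after the 8th packet, in any order
```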

HC34 Tesla Dojo UArch Dojo System Network 4

Here is how the packets are counted as part of the system synchronization.

HC34 Tesla Dojo UArch Dojo System Sync

The compiler needs to define a tree with nodes.

HC34 Tesla Dojo UArch Dojo System Sync 2

Tesla says one exa-pod has more than 1 million CPUs (or compute nodes). These are large-scale systems.

HC34 Tesla Dojo UArch Summary

Tesla built Dojo specifically to work at a very large scale. Some startups look to build AI chips for systems with only one or a few chips each; Tesla was focused on a much larger scale.

Final Words

In many ways, it makes sense that Tesla has a giant AI training farm. What is more exciting is that it is not only using commercially available systems, but is also building its own chips and systems. Some of the scalar side of the ISA is borrowed from RISC-V, but Tesla made the vector side and much of the architecture custom, so this took a lot of work.

Next, we are going to take a look at the Dojo system level design. If this seemed too low level, the next talk is the one you may be looking for.
