At Hot Chips 34 (HC34), we got a look at the microarchitecture of Tesla Dojo. For those who do not know, Tesla now consumes so many AI resources that it not only has a giant NVIDIA GPU cluster, it is also designing its own AI training infrastructure. At HC34, the company had two talks. This article covers the microarchitecture; the next will cover the systems-level presentation.
Note: We are doing this piece live at HC34 during the presentation so please excuse typos.
Tesla Dojo AI System Microarchitecture at HC34
Tesla has built an exascale AI-class system for machine learning. As with its in-vehicle systems, Tesla has decided it has the scale to hire folks and build silicon and systems specifically for its own application.
![HC34 Tesla Dojo UArch What Is Dojo](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-What-is-Dojo.jpg)
Tesla is building the system from the ground up. It is not just building its own AI chips; it is building a supercomputer.
![HC34 Tesla Dojo UArch Anatomy Of A Distributed System](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Anatomy-of-a-distributed-system.jpg)
Each Dojo node has its own CPU, memory, and communication interface.
![HC34 Tesla Dojo UArch Of The Dojo Node](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-of-the-Dojo-Node.jpg)
Here is the processing pipeline for Dojo’s processor.
![HC34 Tesla Dojo UArch Processing Pipeline](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Processing-Pipeline.jpg)
Each node has 1.25MB of SRAM. In AI training and inference chips, a common technique is to co-locate memory with compute to minimize data transfers, since moving data is very expensive from both a power and a performance perspective.
![HC34 Tesla Dojo UArch Node Memory](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Node-Memory.jpg)
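To see why co-locating memory with compute matters, here is a rough order-of-magnitude sketch using widely cited energy-per-operation estimates from older process nodes (Tesla has not published Dojo's actual per-access energy figures, so the numbers below are illustrative only):

```python
# Illustrative, order-of-magnitude energy figures in picojoules, roughly in
# line with widely cited ~45nm estimates. Dojo's 7nm figures are not public.
ENERGY_PJ = {
    "fp32_multiply": 3.7,    # one arithmetic operation
    "sram_read_32b": 5.0,    # local on-die SRAM access
    "dram_read_32b": 640.0,  # off-chip DRAM access
}

# Fetching an operand from off-chip DRAM costs two orders of magnitude more
# energy than reading it from local SRAM next to the compute.
dram_vs_sram = ENERGY_PJ["dram_read_32b"] / ENERGY_PJ["sram_read_32b"]
print(f"A DRAM read costs ~{dram_vs_sram:.0f}x a local SRAM read")
```

With ratios like this, keeping working data in per-node SRAM is the difference between compute-bound and power-bound operation.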
Each node is then connected to a 2D mesh.
![HC34 Tesla Dojo UArchNetwork Interface](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArchNetwork-Interface.jpg)
Here is the data path overview:
![HC34 Tesla Dojo UArch Datapath](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Datapath.jpg)
Here is an example of the list parsing that the chip can do.
![HC34 Tesla Dojo UArch List Parsing](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-List-parsing.jpg)
Here is more on the DOJO instruction set. This is Tesla's own creation, not a typical Intel, Arm, NVIDIA, or AMD CPU/GPU instruction set.
![HC34 Tesla Dojo UArch Instruction Set](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Instruction-Set.jpg)
In AI, arithmetic formats are a big deal, and specifically which formats a chip supports. DOJO was an opportunity for Tesla to look at the commonly available industry formats such as FP32, FP16, and BFP16.
![HC34 Tesla Dojo UArch Arithmetic Formats](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-arithmetic-Formats.jpg)
Tesla is also looking at a configurable FP8, or CFP8. It has both a 4/3 and a 5/2 exponent/mantissa range option. That is similar to what the NVIDIA H100 Hopper does with FP8. We also saw the Untether.AI Boqueria 1458 RISC-V Core AI Accelerator focus on different FP8 types.
![HC34 Tesla Dojo UArch Arithmetic Formats 2](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-arithmetic-Formats-2.jpg)
Tesla also has a CFP16 format for higher precision. In all, DOJO supports FP32, BFP16, CFP8, and CFP16.
![HC34 Tesla Dojo UArch Arithmetic Formats 3](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-arithmetic-Formats-3.jpg)
These cores are then integrated into a manufactured die. The Tesla D1 die is fabbed on TSMC's 7nm process. Each die has 354 DOJO processing nodes and 440MB of SRAM.
![HC34 Tesla Dojo UArch First Integration Box D1 Die](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-First-integration-Box-D1-Die.jpg)
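As a quick sanity check, the per-die SRAM figure lines up with the per-node figure from earlier in the talk:

```python
# Per-node and per-die figures from the HC34 slides.
nodes_per_die = 354
sram_per_node_mb = 1.25

# 354 nodes x 1.25MB = 442.5MB, which Tesla rounds to the quoted 440MB.
total_die_sram_mb = nodes_per_die * sram_per_node_mb
print(f"{total_die_sram_mb}MB of SRAM per D1 die")
```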
These D1 dies are packaged onto a Dojo Training Tile. The D1 chips are tested, then assembled into a 5×5 tile. These tiles have 4.5TB/s of bandwidth per edge. They also have a 15kW power delivery envelope per module, or roughly 600W per D1 chip minus whatever is used by the 40 I/O dies. The tile also includes all of the liquid cooling and mechanical packaging. This is conceptually similar to how Cerebras packages its giant WSE-2 chip. One can also see why something like the Lightmatter Passage would be attractive to a company that did not want to design all of this itself.
![HC34 Tesla Dojo UArch Second Integration Box Dojo Training Tile](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Second-Integration-Box-Dojo-Training-Tile.jpg)
The DOJO Interface Processors (DIPs) sit on the edges of the 2D mesh. Each training tile has 11GB of SRAM and 160GB of shared DRAM.
![HC34 Tesla Dojo UArch Dojo System Topology](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Dojo-System-Topology.jpg)
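The tile-level figures also check out against the die-level ones:

```python
# Tile-level figures from the HC34 slides.
dies_per_tile = 5 * 5          # a 5x5 grid of D1 dies
sram_per_die_mb = 440
tile_power_w = 15_000          # 15kW power delivery envelope

# 25 dies x 440MB = 11,000MB, the quoted 11GB of tile SRAM.
tile_sram_gb = dies_per_tile * sram_per_die_mb / 1000

# 15kW / 25 dies = 600W per D1, before subtracting the I/O dies' share.
power_per_die_w = tile_power_w / dies_per_tile

print(f"{tile_sram_gb}GB SRAM per tile, ~{power_per_die_w:.0f}W per D1")
```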
Here are the bandwidth figures for the 2D mesh connecting the processing nodes.
![HC34 Tesla Dojo UArch Dojo System Communication Logical 2D Mesh](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Dojo-System-Communication-Logical-2D-Mesh.jpg)
32GB/s links connect the DIPs and host systems.
![HC34 Tesla Dojo UArch Dojo System Communication PCIe Links DIPs And Hosts](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Dojo-System-Communication-PCIe-Links-DIPs-and-Hosts.jpg)
Tesla also has Z-plane links for longer routes. In the next talk, Tesla discusses its system-level innovation.
![HC34 Tesla Dojo UArch Communication Mechanisms](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Communication-Mechanisms.jpg)
Here are the latency boundaries for dies and tiles, which is why they are treated differently in Dojo. The Z-plane links are needed because long routes across the mesh are expensive.
![HC34 Tesla Dojo UArch Dojo System Communication Mechanisms](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Dojo-System-Communication-Mechanisms.jpg)
Any processing node can access data across the system. Each node can push or pull data to SRAM or DRAM.
![HC34 Tesla Dojo UArch Dojo System Bulk Communication](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Dojo-System-Bulk-Communication.jpg)
Tesla Dojo uses a flat addressing scheme for communication.
![HC34 Tesla Dojo UArch Dojo System Network 1](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Dojo-System-Network-1.jpg)
The chips can route around dead processing nodes in software.
![HC34 Tesla Dojo UArch Dojo System Network 2](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Dojo-System-Network-2.jpg)
That means that software has to understand the system topology.
![HC34 Tesla Dojo UArch Dojo System Network 3](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Dojo-System-Network-3.jpg)
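A minimal sketch of what topology-aware fault avoidance looks like on a 2D mesh, assuming a simple breadth-first search. Tesla's actual routing tables and algorithm are not public; this only illustrates the idea of software steering traffic around dead nodes:

```python
from collections import deque

def mesh_route(start, goal, width, height, dead):
    """Find a shortest path on a width x height 2D mesh, detouring
    around nodes in the `dead` set. Returns a list of (x, y) hops,
    or None if no route exists. Illustrative only."""
    if start in dead or goal in dead:
        return None
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        x, y = path[-1]
        if (x, y) == goal:
            return path
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            in_bounds = 0 <= nx < width and 0 <= ny < height
            if in_bounds and nxt not in dead and nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

# Route across a 3x3 mesh with the direct neighbor (1, 0) marked dead:
route = mesh_route((0, 0), (2, 0), 3, 3, dead={(1, 0)})
print(route)  # detours through row y=1 around the dead node
```

This is also why the software has to know the topology: a dead node changes which routes are legal, so the routing decision cannot be purely local.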
DOJO does not guarantee end-to-end traffic ordering, so packets need to be counted at their destination.
![HC34 Tesla Dojo UArch Dojo System Network 4](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Dojo-System-Network-4.jpg)
Here is how the packets are counted as part of the system synchronization.
![HC34 Tesla Dojo UArch Dojo System Sync](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Dojo-System-Sync.jpg)
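The idea behind destination-side counting can be sketched in a few lines. Since delivery order is not guaranteed, the receiver only tracks how many packets have arrived against an expected total; the names and structure below are illustrative, not Tesla's actual implementation:

```python
class PacketCounter:
    """Completion tracking for unordered delivery: the destination
    counts arrivals against an expected total instead of relying on
    arrival order. Minimal illustrative sketch."""

    def __init__(self, expected):
        self.expected = expected
        self.received = 0

    def on_packet(self, payload):
        # Order does not matter, only the count.
        self.received += 1
        # True once the whole transfer has landed.
        return self.received == self.expected

counter = PacketCounter(expected=3)
# Packets can arrive in any order; only the third arrival completes it.
done = [counter.on_packet(p) for p in ("b", "c", "a")]
print(done)
```

The completion signal then feeds the system-wide synchronization tree that the compiler lays out across the nodes.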
The compiler needs to define a tree with nodes.
![HC34 Tesla Dojo UArch Dojo System Sync 2](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Dojo-System-Sync-2.jpg)
Tesla says one exa-pod has more than 1 million CPUs (or compute nodes). These are large-scale systems.
![HC34 Tesla Dojo UArch Summary](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Tesla-Dojo-uArch-Summary.jpg)
Tesla built Dojo specifically to work at a very large scale. Some startups build AI chips intended to be deployed as one or a few chips per system; Tesla focused on a much larger scale.
Final Words
In many ways, it makes sense that Tesla has a giant AI training farm. What is more exciting is that it is not only using commercially available systems but also building its own chips and systems. Part of the scalar-side ISA is borrowed from RISC-V, but Tesla made the vector side and much of the architecture custom, so this took a lot of work.
Next, we are going to take a look at the Dojo system level design. If this seemed too low level, the next talk is the one you may be looking for.
“Open the pod bay doors, HAL.”