Enflame DTU 1.0 AI Compute Chip at Hot Chips 33

Enflame CloudBlazer T11 OAM

Today, we get more on the Enflame DTU 1.0, Enflame’s AI compute chip meant for servers. Note that the DTU 1.0 is a 2018/2019 chip, so the talk covers an older design. We previously heard about Enflame in the OCP China Day 2020 Interview with Bill Carter and Shen Rong of Inspur. Like other pieces, we are doing this live at Hot Chips 33, so please excuse typos.


The Enflame DTU 1.0 is a 12nm FinFET chip that has a PCIe Gen4 x16 interface along with 200GB/s interconnects.

HC33 2021 Enflame AI Compute Chip DTU 1.0 1
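
To put the two link speeds in perspective, here is a rough back-of-the-envelope sketch. The PCIe Gen4 numbers (16 GT/s per lane, 128b/130b encoding) are standard. The talk did not say whether the 200GB/s interconnect figure is aggregate or per direction, so treat the ratio as approximate. The function and constant names are ours, for illustration only.

```python
# Rough comparison of the host link vs. the chip-to-chip interconnect.
# PCIe Gen4 parameters are standard; the 200 GB/s value is the figure
# quoted in the talk (directionality not stated, so this is approximate).

PCIE_GEN4_GT_PER_S = 16.0      # giga-transfers per second, per lane
PCIE_ENCODING = 128 / 130      # 128b/130b line-encoding efficiency
PCIE_LANES = 16

INTERCONNECT_GB_PER_S = 200.0  # as quoted in the talk


def pcie_gen4_x16_gb_per_s() -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    gbits_per_s = PCIE_GEN4_GT_PER_S * PCIE_ENCODING * PCIE_LANES
    return gbits_per_s / 8  # bits -> bytes


if __name__ == "__main__":
    pcie = pcie_gen4_x16_gb_per_s()
    print(f"PCIe Gen4 x16: ~{pcie:.1f} GB/s per direction")
    print(f"Interconnect vs. PCIe: ~{INTERCONNECT_GB_PER_S / pcie:.1f}x")
```

Under these assumptions, the chip-to-chip links offer several times the bandwidth of the host interface, which is why accelerator-to-accelerator traffic is kept off PCIe where possible.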

The package itself includes HBM2 onboard, but we did not get the capacity.

HC33 2021 Enflame AI Compute Chip DTU 1.0 Package

The DTU 1.0 SoC has four clusters and 32 AI compute cores. It also has data transfer engines and high-speed interconnects for chip-to-chip communication.

HC33 2021 Enflame AI Compute Chip DTU 1.0 SoC

Here is a basic look at the VLIW cores.

HC33 2021 Enflame AI Compute Chip GCU Care 1.0

The cores have Tensor ALUs to accelerate matrix/vector operations.

HC33 2021 Enflame AI Compute Chip GCU Care 1.0 2

One of the big aspects of Enflame’s architecture is exploiting sparsity. Enflame had a number of detailed slides (~9) on those concepts.

HC33 2021 Enflame AI Compute Chip GCU Care 1.0 3

The key part here is that Enflame is able to skip instructions and data that do not need to be processed because of sparsity.

HC33 2021 Enflame AI Compute Chip 256 Kernels In 32 Groups
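
As a rough illustration of why skipping on sparsity helps, here is a minimal software sketch of zero-skipping in a dot product. This is not Enflame's hardware mechanism, which the slides describe at the instruction and data level. It only shows the general idea that a multiply-accumulate with a zero operand contributes nothing and can be dropped. The function names and sample values are ours.

```python
from typing import Sequence, Tuple


def dense_dot(w: Sequence[float], x: Sequence[float]) -> Tuple[float, int]:
    """Multiply-accumulate every pair; returns (result, MACs executed)."""
    acc, macs = 0.0, 0
    for wi, xi in zip(w, x):
        acc += wi * xi
        macs += 1
    return acc, macs


def zero_skipping_dot(w: Sequence[float], x: Sequence[float]) -> Tuple[float, int]:
    """Same result for finite inputs, but skips pairs with a zero operand."""
    acc, macs = 0.0, 0
    for wi, xi in zip(w, x):
        if wi == 0.0 or xi == 0.0:
            continue  # product would be zero, so no work is issued
        acc += wi * xi
        macs += 1
    return acc, macs


if __name__ == "__main__":
    # Illustrative values only: most weights and some activations are zero.
    weights = [0.5, 0.0, 0.0, -1.0, 0.0, 2.0, 0.0, 0.0]
    activations = [1.0, 3.0, 0.0, 2.0, 4.0, 0.0, 0.0, 1.0]
    print("dense:   ", dense_dot(weights, activations))
    print("skipping:", zero_skipping_dot(weights, activations))
```

Both functions return the same value for these inputs, but the skipping version issues far fewer multiply-accumulates. In hardware the savings depend on how cheaply the zeros can be detected and how well the remaining work keeps the ALUs busy.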

Here is the flow in the data pipeline:

HC33 2021 Enflame AI Compute Chip GCU DARE 1.0

The interconnect is not cache coherent. It is Enflame’s own interconnect, which can directly connect up to four accelerators to one another and scale to eight, much like the 3rd Generation Intel Xeon Scalable Cooper Lake 4P and 8P topologies.

HC33 2021 Enflame AI Compute Chip GCU LARE 1.0

These can be cabled between 8x DTU chassis to make bigger training clusters.

HC33 2021 Enflame AI Compute Chip GCU LARE 1.0 2

The training accelerator card comes in the CloudBlazer T10 for PCIe or the CloudBlazer T11 for OAM.

HC33 2021 Enflame AI Compute Chip Enflame Training Accelerator Card CloudBlazer T10 And T11

Notable here is that all of the system photos are of the PCIe version, not the OAM version. OAM and the UBB are designed to scale out to multiple systems.

HC33 2021 Enflame AI Compute Chip Enflame Training Solution

Enflame says that it gets fairly linear training scaling even with 160 accelerators, or 20 chassis worth of accelerators.

HC33 2021 Enflame AI Compute Chip Large Scale Distributed Training Cluster
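
The arithmetic behind the cluster size, and what "fairly linear" scaling means in practice, can be sketched as follows. The 8 accelerators per chassis and the 160-accelerator total come from the talk. The throughput numbers in the example are made-up placeholders, not Enflame results, and the function names are ours.

```python
def chassis_needed(accelerators: int, per_chassis: int = 8) -> int:
    """Number of chassis needed to host the accelerators (ceiling division)."""
    return -(-accelerators // per_chassis)


def scaling_efficiency(throughput_n: float, throughput_1: float, n: int) -> float:
    """Measured throughput on n accelerators relative to perfect linear scaling.

    1.0 means perfectly linear; lower values mean communication or other
    overheads are eating into the added compute.
    """
    return throughput_n / (throughput_1 * n)


if __name__ == "__main__":
    print("chassis for 160 accelerators:", chassis_needed(160))

    # HYPOTHETICAL placeholder throughputs (samples/s), for illustration only.
    single = 100.0
    cluster = 14400.0
    print(f"scaling efficiency: {scaling_efficiency(cluster, single, 160):.0%}")
```

With 8 accelerators per chassis, 160 accelerators works out to the 20 chassis mentioned in the talk. A "fairly linear" claim corresponds to a scaling efficiency that stays close to 1.0 as the accelerator count grows.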

Enflame has had the DTU 2.0 since July, but it is not sharing many details other than saying it has performance improvements around FP32, along with 3x the memory bandwidth and 4x the memory capacity. It says the new product will be shipping soon.

Final Words

We do not often get to see the Enflame solution. While most of the talks look at current or future technology, this one covers an older chip, a 2019 solution. Still, it is interesting to see what the previous generation was and to get some sense of the next generation as well.
