New Meta AI Accelerator MTIA 2 Revealed

Meta MTIA 2 Dual Accelerator Card Shot Of One Accelerator And Memory

Meta has a new in-house AI accelerator for the company’s growing AI workloads. Dubbed the MTIA 2, or Meta Training and Inference Accelerator 2, the new chip has a different architecture than many of the newer parts we see, with some clear optimizations for scale-out.

New Meta AI Accelerator MTIA 2 Revealed

Here is the MTIA 2 die, which measures around 421mm² and is built on TSMC's 5nm process. Meta says that the die has around 2.35B gates.

Meta MTIA 2 Die

The chip is laid out with an 8×8 grid of processing elements, or PEs, at the center. On the outer edge are other features like the host interface, a memory controller for the LPDDR5 memory that sits around the accelerator, and the fabric. We can also see a control core and a decompression engine on the diagram. At STH, we often discuss how much of a die is used for non-compute tasks. This is a great example since it is easy to see the PE versus non-PE area. It also shows just how much of AI-focused designs centers on memory and data movement.

Meta MTIA 2 Block Diagram
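
To make the 8×8 PE layout a little more concrete, here is a toy Python sketch of how a GEMM could be blocked across a 64-PE grid, with each PE producing one output tile out of its own local memory. This is purely illustrative on our part; the tile sizes and mapping are arbitrary and are not Meta's actual compiler or scheduling scheme.

```python
import numpy as np

# Toy illustration only: block an M x N GEMM output across an 8x8 grid of PEs,
# the way a spatial accelerator with per-PE local memory might partition the work.
# This is NOT Meta's scheduler; the tile sizes and mapping are arbitrary choices.

GRID = 8                       # 8x8 grid of processing elements
M, K, N = 1024, 512, 1024      # arbitrary GEMM dimensions
TILE_M, TILE_N = M // GRID, N // GRID

A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

# Each (row, col) PE owns one TILE_M x TILE_N output tile and would stream the
# matching A row-block and B column-block through its local SRAM.
for pe_row in range(GRID):
    for pe_col in range(GRID):
        m0, n0 = pe_row * TILE_M, pe_col * TILE_N
        C[m0:m0 + TILE_M, n0:n0 + TILE_N] = (
            A[m0:m0 + TILE_M, :] @ B[:, n0:n0 + TILE_N]
        )

# float32 accumulation order differs slightly between tiled and full GEMM
assert np.allclose(C, A @ B, rtol=1e-4)
```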

The MTIA 2 accelerators are packed into chassis in a way that is very different from what NVIDIA is doing with its accelerators. Each board holds two 90W accelerators so it can be air-cooled. With a PCIe Gen5 x8 connection to the host for each accelerator, the two can share a x16 edge connector. There are twelve boards in each chassis, for 24 accelerators per chassis. Meta says that it deploys these in sets of three chassis for 72 accelerators, with an option for an RDMA NIC. Whereas we usually credit AWS Nitro for the interest in DPUs, Facebook/Meta made a major innovation years ago by deploying multi-host adapters at scale to lower networking costs. Here, the NIC-to-accelerator ratio appears to be much lower than in NVIDIA systems.
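
As a quick tally of the figures above, here is a short Python sketch that rolls up the per-accelerator numbers into per-chassis and per-group totals. The aggregation is our own arithmetic from the published per-accelerator specs, not an official Meta figure, and the power number covers accelerator TDP only.

```python
# Back-of-the-envelope totals for the MTIA 2 deployment unit described above.
# Per-accelerator figures are Meta's published specs; the roll-ups are our own.

ACCEL_TDP_W = 90          # per-accelerator TDP
ACCEL_LPDDR5_GB = 128     # off-chip LPDDR5 per accelerator
ACCELS_PER_BOARD = 2      # dual-accelerator PCIe board
BOARDS_PER_CHASSIS = 12
CHASSIS_PER_GROUP = 3     # Meta deploys the chassis in sets of three

accels_per_chassis = ACCELS_PER_BOARD * BOARDS_PER_CHASSIS   # 24
accels_per_group = accels_per_chassis * CHASSIS_PER_GROUP    # 72

print(f"Accelerators per chassis: {accels_per_chassis}")
print(f"Accelerators per three-chassis group: {accels_per_group}")
# Accelerator silicon only; excludes hosts, NICs, and cooling.
print(f"Accelerator TDP per group: {accels_per_group * ACCEL_TDP_W / 1000:.2f} kW")
print(f"LPDDR5 capacity per group: {accels_per_group * ACCEL_LPDDR5_GB / 1024:.1f} TB")
```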

Meta MTIA 2 System 72x Accelerator

Here are the key specs from Meta on the accelerator’s performance (we run a quick sanity check on a few of these numbers after the list):

GEMM TOPS

  • 708 TOPS (INT8, with sparsity)
  • 354 TOPS (INT8)
  • 354 TFLOPS (FP16/BF16, with sparsity)
  • 177 TFLOPS (FP16/BF16)

SIMD TOPS

  • Vector core:
    • 11.06 TOPS (INT8)
    • 5.53 TFLOPS (FP16/BF16)
    • 2.76 TFLOPS (FP32)
  • SIMD:
    • 5.53 TOPS (INT8) / 5.53 TFLOPS (FP16/BF16)
    • 2.76 TFLOPS (FP32)

Memory Capacity

  • Local memory: 384 KB per PE
  • On-chip memory: 256 MB
  • Off-chip LPDDR5: 128 GB

Memory Bandwidth

  • Local memory: 1 TB/s per PE
  • On-chip memory: 2.7 TB/s
  • Off-chip LPDDR5: 204.8 GB/s (Source: Meta)
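
A few quick sanity checks on those numbers: the sparse GEMM figures are exactly 2x the dense ones, which is the usual structured-sparsity doubling, and the per-PE local memory and bandwidth can be rolled up across the 64 PEs. The short Python snippet below is our own arithmetic on Meta's published specs, not additional official figures.

```python
# Quick arithmetic on Meta's published MTIA 2 specs (our roll-ups, not official figures).
PE_COUNT = 8 * 8                     # 8x8 grid of processing elements

# Sparse GEMM throughput is 2x dense (structured sparsity).
print(708 / 354, 354 / 177)          # 2.0 2.0

# Dense compute per watt at the 90W accelerator TDP.
print(f"{354 / 90:.2f} INT8 TOPS/W, {177 / 90:.2f} FP16/BF16 TFLOPS/W")

# Aggregate per-PE local memory and bandwidth across the die.
local_mem_mb = PE_COUNT * 384 / 1024   # 384 KB per PE -> 24 MB total
local_bw_tb_s = PE_COUNT * 1           # 1 TB/s per PE -> 64 TB/s aggregate
print(f"{local_mem_mb:.0f} MB local SRAM, {local_bw_tb_s} TB/s aggregate local bandwidth")
```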

Something that stands out is the ratio of memory capacity to accelerator watts. While the LPDDR5 memory may not offer huge amounts of bandwidth compared to HBM-based accelerators, it offers relatively high capacity. Meta has 128GB of memory against a 90W TDP, or about 1.42GB/W. Compare that to the Intel Gaudi 3 we showed this week, with 128GB of HBM2E and a 900W TDP, which works out to roughly 0.14GB/W. It is clear that Meta is targeting a different memory capacity to compute capacity ratio than many other chips we see.
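
For reference, here is that capacity-per-watt math as a tiny Python snippet. The capacity and TDP figures are the published numbers quoted above, and the comparison covers accelerator-level TDP only, not full system power.

```python
# Memory capacity per watt of accelerator TDP, using the figures quoted above.
chips = {
    "Meta MTIA 2 (LPDDR5)": {"mem_gb": 128, "tdp_w": 90},
    "Intel Gaudi 3 (HBM2E)": {"mem_gb": 128, "tdp_w": 900},
}

for name, c in chips.items():
    print(f"{name}: {c['mem_gb'] / c['tdp_w']:.2f} GB/W")
# Meta MTIA 2 (LPDDR5): 1.42 GB/W
# Intel Gaudi 3 (HBM2E): 0.14 GB/W
```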

Final Words

Meta’s post on MTIA 2 talks about hardware and software co-design. The company has applications big enough to justify building dedicated accelerators. It feels a bit like showing this off was less about the impact on production and more about serving as a recruiting tool. Meta has been a leader in AI and buys a lot of compute and memory, so it makes a lot of sense that it is exploring different architectures.
