Case Study on the Google TPU and GDDR5 from Hot Chips 29

Google TPU Board

At Hot Chips 29 (2017), Google presented a case study on its original TPU ASIC. The Google TPU is a custom PCIe accelerator for deep learning inference. The project was completed in around five quarters, which is very fast for custom silicon. Google discussed a key lesson about memory bandwidth: what would have happened had it used GDDR5 instead of DDR3.

Google TPU Background

The basic premise of the TPU is that it is a PCIe accelerator, much like GPUs or FPGAs when they are used as accelerators.

Google TPU Architecture And Implementation

Specifically, it is designed for fast matrix multiplication, the dominant operation in deep learning workloads.
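As a rough illustration (not Google's actual code), the core operation boils down to an 8-bit integer matrix multiply with wide accumulation, something like this NumPy sketch:

```python
import numpy as np

def int8_matmul(activations: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Illustrative model of the TPU's core op: 8-bit inputs,
    accumulated in int32 so products cannot overflow. The real
    TPU v1 matrix unit is a 256x256 MAC array."""
    assert activations.dtype == np.int8 and weights.dtype == np.int8
    return activations.astype(np.int32) @ weights.astype(np.int32)

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(4, 256), dtype=np.int8)
w = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)
print(int8_matmul(a, w).shape)  # (4, 256)
```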

Google TPU High-Level Architecture

Here is a view of how much of the chip's logic is dedicated to different functions:

Google TPU Neural Network Accelerator Chip

Google also provided the programmer's view of what using the TPU looks like.

Google TPU Architecture Programmer View

The TPU has only 11 instructions, of which five are commonly used (a sketch follows below). It is an in-order design that relies on software to handle complexity.
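For a flavor of how small that ISA is, here is a toy model of the five commonly used instructions. The names follow Google's public descriptions; the enum and the sample program are purely illustrative, not a real ISA definition:

```python
from enum import Enum, auto

class TPUInstruction(Enum):
    READ_HOST_MEMORY = auto()   # DMA activations from the host into the unified buffer
    READ_WEIGHTS = auto()       # stage weights from DRAM into the weight FIFO
    MATRIX_MULTIPLY = auto()    # also covers convolution on the MAC array
    ACTIVATE = auto()           # apply a nonlinearity (ReLU, sigmoid, ...)
    WRITE_HOST_MEMORY = auto()  # DMA results back to the host

# In-order design: the hardware consumes instructions strictly in
# program order, leaving scheduling and overlap to software.
program = [
    TPUInstruction.READ_HOST_MEMORY,
    TPUInstruction.READ_WEIGHTS,
    TPUInstruction.MATRIX_MULTIPLY,
    TPUInstruction.ACTIVATE,
    TPUInstruction.WRITE_HOST_MEMORY,
]
for instr in program:
    print(instr.name)
```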

Google is focused on systolic execution in the TPU:

Google TPU Systolic Execution Diagram
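To make the systolic idea concrete, here is a toy cycle-by-cycle model of a weight-stationary array: each processing element holds one weight, fires when its operands arrive on the diagonal wavefront, and each input element is reused across a whole row of the array instead of being re-read from memory. This is a sketch of the general technique, not Google's implementation:

```python
import numpy as np

def systolic_matmul(A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Toy wavefront model of a weight-stationary systolic array.

    PE (i, j) permanently holds weight W[i, j]. Input element A[r, i]
    reaches PE (i, j) at cycle r + i + j, is multiplied against the
    stored weight, and the product joins the partial sum flowing down
    column j. Each input is fetched once and reused n times, which is
    where the memory-traffic savings come from."""
    m, k = A.shape
    k2, n = W.shape
    assert k == k2
    C = np.zeros((m, n))
    last_cycle = (m - 1) + (k - 1) + (n - 1)
    for cycle in range(last_cycle + 1):
        # Every multiply-accumulate scheduled for this cycle fires at once.
        for r in range(m):
            for i in range(k):
                j = cycle - r - i
                if 0 <= j < n:
                    C[r, j] += A[r, i] * W[i, j]
    return C

A = np.arange(6, dtype=float).reshape(2, 3)
W = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(systolic_matmul(A, W), A @ W)
```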

Memory bandwidth is extremely important in this architecture, so the TPU is designed to move data around efficiently.

Google TPU Ignore Pipelining In Matrix

In this architecture, Google’s TPU team found that it was limited by memory bandwidth. As a result, the team modeled what would have happened had it used GDDR5 instead.
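A quick roofline-style calculation shows why bandwidth is the cap. The figures below are approximate, publicly reported numbers (92 TOPS peak and 34 GB/s of DDR3 bandwidth for the TPU, roughly 180 GB/s for the modeled GDDR5 system), and the operational intensity is an assumed value for illustration:

```python
# Roofline: attainable throughput is the lesser of peak compute and
# (memory bandwidth x operational intensity).
PEAK_TOPS = 92    # TPU v1 peak 8-bit TOPS (approximate public figure)
DDR3_GBS = 34     # original two-channel DDR3-2133 memory system
GDDR5_GBS = 180   # modeled GDDR5 memory system

def attainable_tops(bandwidth_gbs: float, ops_per_byte: float) -> float:
    # GB/s * ops/byte = Gops/s; divide by 1000 for TOPS
    return min(PEAK_TOPS, bandwidth_gbs * ops_per_byte / 1000.0)

oi = 100  # assumed operational intensity (8-bit ops per DRAM byte)
print(f"DDR3:  {attainable_tops(DDR3_GBS, oi):.1f} TOPS")   # 3.4, bandwidth-bound
print(f"GDDR5: {attainable_tops(GDDR5_GBS, oi):.1f} TOPS")  # 18.0, ~5x faster
```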

What If There Was a Google TPU v1.1 Updated with GDDR5 for Bandwidth?

Google used DDR3 in its original design. One option was to move to DDR4 for higher-bandwidth operation. Another potential design decision was to move to GDDR5, which is what Google showed at Hot Chips:

Updated Google TPU GDDR5

The company explained that it was not necessarily compute limited in the original design; DDR3 was simply too slow. Here is what Google claims the TPU’s performance per watt would have been with GDDR5:

Updated Google Revised TPU GDDR5

The key here is that the GDDR5 version would have had more bandwidth and lower latency, boosting performance. Instead of building a v1.1 with GDDR5, Google leapfrogged it and moved to HBM for the TPUv2, which can be used for training as well as inference.

Cloud TPU v2 Architecture

Google’s actual follow-up to the TPU is the TPUv2. Instead of updating to higher-bandwidth GDDR5, Google is using even faster HBM in a 16GB capacity. That gives each chip 600GB/s of memory bandwidth. For comparison, that is roughly four sockets’ worth of AMD EPYC memory bandwidth (a quick sanity check follows below).
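That comparison checks out on a napkin, assuming eight channels of DDR4-2666 per EPYC socket (an assumption on our part; actual platform configurations vary):

```python
TPU_V2_GBS = 600                            # HBM bandwidth per TPUv2 chip, as stated
DDR4_2666_CHANNEL_GBS = 2666e6 * 8 / 1e9    # 8 bytes/transfer -> ~21.3 GB/s per channel
EPYC_SOCKET_GBS = 8 * DDR4_2666_CHANNEL_GBS # 8 memory channels -> ~171 GB/s per socket

print(f"EPYC socket: {EPYC_SOCKET_GBS:.0f} GB/s")
print(f"Sockets to match one TPUv2 chip: {TPU_V2_GBS / EPYC_SOCKET_GBS:.1f}")  # ~3.5
```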

TPUv2 Architecture

Here is what the Cloud TPU v2 looks like with four of those blocks combined:

Google TPUv2

These are the much larger compute units with more TDP headroom that Google is making available to developers.

Final Words

While this is a study of an older-generation product, it is still interesting from a design standpoint. One area where we would like to see Google’s TPU team move the ball forward is its comparisons. While the K80 may have been excellent years ago, and is still widely deployed, it is now three generations old as the NVIDIA Tesla V100 (Volta) is available. At the same time, the key takeaway is that the large hyper-scale players are building their own hardware and topologies well beyond simply adding a GPU to a server.
