Graphcore Colossus Mk2 IPU at Hot Chips 33

Graphcore GC200 Chip At Launch

Graphcore is presenting its new IPU at Hot Chips 33. We covered the Mk2 in our Graphcore GC200 IPU launch piece. We have also covered how confusingly similar Intel's new use of "IPU" in an adjacent space is to the term Graphcore has been using for years. Finally, we covered how Graphcore celebrated a stunning loss at MLPerf Training v1.0. Still, we wanted to cover the Hot Chips 33 talk on the company's progress. As with our other coverage, this is being done live, so please excuse typos.

Graphcore Colossus Mk2 IPU at Hot Chips 33

Graphcore has a ton of slides, at around one slide per minute for its talk. As a result, we are going to let you read a few of the slides to keep pace for the rest of the day. One of those covers the foundations of Graphcore's IPU (as opposed to Intel's IPUs, which are really DPUs).

HC33 Graphcore Colossus Mk2 IPU Foundations

The next two slides we will let you read with the context that Graphcore has seemingly put more effort into developing its software stack than its hardware, so the software side is important.

HC33 Graphcore Colossus Mk2 IPU Software Abstraction

As with many AI chips, Graphcore has many processors with many threads and a solution to feed those processors via high-speed memory.

HC33 Graphcore Colossus Mk2 IPU Hardware Abstraction

For some reference, the number of processor tiles did not increase dramatically, but the shrink from TSMC 16nm to TSMC 7nm seems to have had a huge impact on SRAM capacity while also doubling compute.

HC33 Graphcore Colossus Mk2 Below Mk1
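
For rough context, here is a quick back-of-the-envelope comparison using Graphcore's commonly quoted tile counts and on-die SRAM capacities for the Mk1 GC2 and Mk2 GC200. These are the widely cited public figures rather than numbers read off this slide, so treat them as approximate.

```python
# Rough Mk1 (GC2, TSMC 16nm) vs Mk2 (GC200, TSMC 7nm) comparison.
# Figures are Graphcore's commonly quoted public numbers, used only to
# illustrate the scaling claim above; check the slide for exact values.
mk1 = {"tiles": 1216, "sram_mb": 300}
mk2 = {"tiles": 1472, "sram_mb": 900}

for key in mk1:
    print(f"{key}: {mk1[key]} -> {mk2[key]} ({mk2[key] / mk1[key]:.2f}x)")

# tiles:   1216 -> 1472 (1.21x)  - a modest increase
# sram_mb:  300 -> 900  (3.00x)  - the big jump enabled by the node shrink
# Peak compute roughly doubles per the talk, on top of the SRAM gain.
```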

Here are the key lessons learned. The PCIe power density constraint is worth highlighting since, at STH, we have recently been doing a lot of work reviewing SXM4 A100 systems like the Inspur NF5488A5 and Dell EMC PowerEdge XE8545, and covering OAM solutions like the Intel Ponte Vecchio Spaceship GPU. We have certainly seen the impact of PCIe power constraints first-hand.

HC33 Graphcore Colossus Mk2 Lessons From Mk1

The M2000 IPU-Machine takes four Colossus Mk2 IPUs and puts them into a chassis with a controller and 512GB of local memory. There is also an interconnect to other nodes.

HC33 Graphcore Colossus Mk2 M2000 IPU Machine IPU POD512
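
The IPU-POD naming in the slide encodes the number of Mk2 IPUs, and each M2000 carries four of them, so the machine count falls out directly. A trivial sketch of that arithmetic (the POD sizes listed are examples, not an exhaustive product list):

```python
# IPU-POD sizing sketch: the POD number is the IPU count, and each
# IPU-M2000 carries four Colossus Mk2 IPUs.
IPUS_PER_M2000 = 4

def m2000s_for_pod(pod_ipus: int) -> int:
    """Number of IPU-M2000 machines needed for a given POD size."""
    assert pod_ipus % IPUS_PER_M2000 == 0, "POD sizes are multiples of 4 IPUs"
    return pod_ipus // IPUS_PER_M2000

for pod in (16, 64, 256, 512):
    print(f"IPU-POD{pod}: {m2000s_for_pod(pod)} x IPU-M2000")
# IPU-POD512: 128 x IPU-M2000
```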

Here are the structural headlines comparing the Graphcore, NVIDIA, and Google TPU options. The vendors use different units, so this is always hard to compare directly.

HC33 Graphcore Colossus Mk2 Structural Headlines

Here is a look at the IPU. This is the Colossus Mk2 IPU with over 59B transistors in TSMC 7nm. The 23/24 redundancy means that there is a spare tile for every 23 active tiles.

HC33 Graphcore Colossus Mk2 IPU Many Transistors
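
Here is the arithmetic that 23/24 redundancy implies, assuming the commonly quoted 1,472 figure is the count of active tiles per GC200:

```python
# 23/24 redundancy: of every 24 physical tiles, 23 are active and 1 is a spare.
ACTIVE_TILES = 1472                      # usable tiles per GC200 (public figure)
GROUP_ACTIVE, GROUP_TOTAL = 23, 24

groups = ACTIVE_TILES // GROUP_ACTIVE    # 64 groups of 23 active tiles
physical_tiles = groups * GROUP_TOTAL    # 1536 tiles actually laid out on the die
spares = physical_tiles - ACTIVE_TILES   # 64 spares available for yield recovery
print(groups, physical_tiles, spares)    # 64 1536 64
```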

These are the basics behind the Graphcore IPU Tile Processor.

HC33 Graphcore Colossus Mk2 IPU Tile Processor

Here is a look at the tile's N+1 barrel threading.

HC33 Graphcore Colossus Mk2 IPU N Plus 1 Barrel Threading
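
For readers unfamiliar with barrel threading, the idea is that hardware thread contexts issue in a fixed round-robin order, so each gets a deterministic slice of the pipeline and one thread's memory latency does not stall the others. Graphcore's public tile description is six worker threads plus a supervisor; the toy loop below only illustrates the round-robin issue pattern, it is not the IPU's actual scheduler.

```python
from itertools import cycle

# Toy model of N+1 barrel threading: N worker contexts issue round-robin,
# while a supervisor context (not modeled here) hands out work to them.
N_WORKERS = 6  # per Graphcore's public description: 6 workers + 1 supervisor per tile

def barrel_schedule(n_issue_slots: int):
    """Yield (slot, context) pairs in strict round-robin order."""
    rr = cycle(f"worker{i}" for i in range(N_WORKERS))
    for slot in range(n_issue_slots):
        yield slot, next(rr)

for slot, ctx in barrel_schedule(8):
    print(slot, ctx)
# Each worker gets every 6th issue slot, so per-thread timing is deterministic.
```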

Graphcore can maximize memory utilization with its sparse load/store.

HC33 Graphcore Colossus Mk2 IPU Sparse Load Store
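
Conceptually, a sparse load/store is a gather/scatter: each operation touches only the scattered elements named by an index list rather than whole dense blocks, so memory bandwidth is spent on useful data. A minimal NumPy sketch of the pattern (an illustration of the concept, not the IPU instruction set):

```python
import numpy as np

memory = np.arange(16, dtype=np.float32)   # stand-in for a tile's local SRAM
indices = np.array([1, 5, 6, 12])          # sparse, non-contiguous addresses

gathered = memory[indices]                 # "sparse load": fetch only what is needed
memory[indices] = gathered * 2.0           # "sparse store": write results back scattered

print(gathered)         # [ 1.  5.  6. 12.]
print(memory[indices])  # [ 2. 10. 12. 24.]
```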

Above, Graphcore focused on its FP32 performance compared to the NVIDIA A100 and other chips. Here is how that arithmetic works.

HC33 Graphcore Colossus Mk2 IPU IEEE F16 And F32
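
The standard way to combine IEEE FP16 and FP32 in AI hardware is to multiply in FP16 and accumulate in FP32 so that long dot products do not lose precision; that is the pattern sketched below. It is a generic mixed-precision illustration, not a description of the IPU's specific datapath.

```python
import numpy as np

def dot_fp16_mul_fp32_acc(a: np.ndarray, b: np.ndarray) -> np.float32:
    """Multiply in FP16, accumulate in FP32 (the common mixed-precision pattern)."""
    a16, b16 = a.astype(np.float16), b.astype(np.float16)
    acc = np.float32(0.0)
    for x, y in zip(a16, b16):
        acc += np.float32(x) * np.float32(y)   # products and running sum kept in FP32
    return acc

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float32)
b = rng.standard_normal(4096).astype(np.float32)

print(float(dot_fp16_mul_fp32_acc(a, b)))                          # FP16 inputs, FP32 accumulation
print(float(np.dot(a.astype(np.float64), b.astype(np.float64))))   # high-precision reference
```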

The chips can also generate random numbers in hardware, which is what enables stochastic rounding.

HC33 Graphcore Colossus Mk2 IPU Random Numbers And Stochastic Rounding
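
The on-chip RNG matters because it enables stochastic rounding: a value is rounded up or down with probability proportional to how close it is to each neighbor, so quantization error averages out instead of accumulating as bias during training. A minimal sketch of the idea on a simple fixed-step grid (the IPU does this in hardware, and for FP16 rather than a fixed grid):

```python
import numpy as np

rng = np.random.default_rng(42)

def stochastic_round(x: np.ndarray, step: float) -> np.ndarray:
    """Round x to multiples of `step`, going up with probability equal to the
    fractional distance to the upper neighbor (unbiased in expectation)."""
    scaled = x / step
    floor = np.floor(scaled)
    p_up = scaled - floor                       # closeness to the upper neighbor
    return (floor + (rng.random(x.shape) < p_up)) * step

x = np.full(100_000, 0.30)
nearest = np.round(x / 0.25) * 0.25             # round-to-nearest: always 0.25
sr = stochastic_round(x, 0.25)                  # mix of 0.25 and 0.50
print(nearest.mean(), sr.mean())                # 0.25 vs ~0.30: stochastic rounding is unbiased
```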

The tiles execute instructions asynchronously, and data is then synchronized when the exchange between tiles and chips must happen.

HC33 Graphcore Colossus Mk2 IPU Global Program Order
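
This compute/sync/exchange pattern is essentially the bulk synchronous parallel (BSP) model. Here is a toy sketch of the superstep structure with Python threads standing in for tiles; it illustrates the phases only and has nothing to do with Graphcore's Poplar API.

```python
# Toy BSP superstep loop: local compute, barrier sync, then exchange.
from threading import Barrier, Thread

N_TILES = 4
barrier = Barrier(N_TILES)
outbox = [0] * N_TILES          # staging area each "tile" exposes for exchange

def tile(tile_id: int, supersteps: int):
    local = tile_id
    for step in range(supersteps):
        local = local * 2 + step                   # 1) compute: purely local work
        outbox[tile_id] = local                    #    stage data for exchange
        barrier.wait()                             # 2) sync: all tiles finished computing
        local += outbox[(tile_id + 1) % N_TILES]   # 3) exchange: read a neighbor's data
        barrier.wait()                             #    sync again before the next superstep

threads = [Thread(target=tile, args=(i, 3)) for i in range(N_TILES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(outbox)
```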

Here is how that exchange occurs.

HC33 Graphcore Colossus Mk2 IPU Exchange Mechanics

Here is the power consumption breakdown of the chip. It is particularly interesting how much is used by the memory and data transport functions.

HC33 Graphcore Colossus Mk2 IPU Chip Power

Here is the TFLOPS per watt. Graphcore shows it is pushing more than NVIDIA and Google.

HC33 Graphcore Colossus Mk2 IPU System Power
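
The metric here is simply peak throughput divided by power. As a trivial worked example using NVIDIA's published A100 SXM figures; the GC200 values below are placeholders that should be read off the slide rather than taken from this sketch:

```python
def tflops_per_watt(peak_tflops: float, watts: float) -> float:
    return peak_tflops / watts

# NVIDIA's published A100 SXM figures: 312 TFLOPS FP16 Tensor (dense) at 400 W.
print(tflops_per_watt(312, 400))            # ~0.78 TFLOPS/W

# GC200: substitute the peak TFLOPS and per-chip power shown on the slide.
gc200_tflops, gc200_watts = 250, 300        # placeholder/assumed values only
print(tflops_per_watt(gc200_tflops, gc200_watts))
```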

One of the big questions is why Graphcore does not use HBM. Graphcore says that it is expensive and capacity-limited, though HBM2e is not the main reason that the NVIDIA A100 80GB is well over $10,000 each.

HC33 Graphcore Colossus Mk2 IPU Why No HBM

Effectively, Graphcore can make smaller, less expensive chips by using traditional DDR4 memory instead of HBM2.

HC33 Graphcore Colossus Mk2 IPU DRAM Economics
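
The economics argument reduces to dollars per gigabyte plus the packaging and die-area cost that HBM adds (PHYs, interposer, stacked dies). A sketch of that comparison with purely hypothetical prices; the real numbers are on the slide and move constantly:

```python
# Hypothetical cost comparison: attaching capacity via commodity DDR4 DIMMs
# vs HBM2 stacks. All dollar figures are made-up placeholders for illustration.
def memory_cost(capacity_gb: float, usd_per_gb: float, packaging_usd: float = 0.0) -> float:
    return capacity_gb * usd_per_gb + packaging_usd

ddr4 = memory_cost(capacity_gb=512, usd_per_gb=5, packaging_usd=0)     # DIMMs on the board
hbm2 = memory_cost(capacity_gb=80, usd_per_gb=20, packaging_usd=150)   # stacks + interposer

print(f"DDR4: ${ddr4:.0f} for 512GB, HBM2: ${hbm2:.0f} for 80GB")
# Even at far higher capacity, the commodity-DRAM route can come in cheaper, and it
# avoids the HBM PHYs and interposer that make the main die larger and costlier.
```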

Because Graphcore's memory model is a large on-die SRAM backed by slower DDR instead of more expensive HBM, it needs to manage where model state is placed.

HC33 Graphcore Colossus Mk2 IPU Placing Model State
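
A minimal sketch of what managing model-state placement can look like: a greedy policy keeps the most bandwidth-hungry tensors in on-die SRAM and streams the rest (for example, optimizer state) from DRAM. This is a generic illustration of the idea, not Poplar's actual placement logic; the tensor sizes and access counts are made up.

```python
from typing import NamedTuple

class Tensor(NamedTuple):
    name: str
    size_mb: float
    accesses_per_step: int      # crude proxy for bandwidth demand

def place(tensors, sram_budget_mb: float) -> dict:
    """Greedy placement: highest bandwidth demand per MB goes to SRAM first."""
    placement, used = {}, 0.0
    for t in sorted(tensors, key=lambda t: t.accesses_per_step / t.size_mb, reverse=True):
        if used + t.size_mb <= sram_budget_mb:
            placement[t.name], used = "on-die SRAM", used + t.size_mb
        else:
            placement[t.name] = "streaming DRAM"
    return placement

model_state = [                                  # hypothetical example tensors
    Tensor("weights",         size_mb=500,  accesses_per_step=2),   # fwd + bwd each step
    Tensor("activations",     size_mb=300,  accesses_per_step=2),
    Tensor("optimizer_state", size_mb=1000, accesses_per_step=1),   # touched once per update
]
print(place(model_state, sram_budget_mb=900))
# -> weights and activations land in SRAM; optimizer state streams from DRAM.
```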

Using these techniques, Graphcore is able to have enough SRAM on die that it can minimize the performance penalty of going to off-die DRAM.

HC33 Graphcore Colossus Mk2 IPU On Die SRAM Less DRAM BW

Of course, Graphcore is pushing its hardware-software co-design work.

HC33 Graphcore Colossus Mk2 IPU HW Helping SW

Graphcore is showing its performance gains from software. This is important since software gains can often be as big as, if not bigger than, generational hardware gains.

Final Words

Overall, Graphcore has a ton of investment money. It had a poor MLPerf Training v1.0 showing with its hardware, unable to show it is clearly better than the industry-standard NVIDIA A100. In this industry, one needs to be doing much better than NVIDIA to get ahead.

Also interesting: if you believe in Graphcore's thesis of needing a lot of on-chip SRAM instead of HBM2(e) and scaling to many IPUs, then the next talk by Cerebras shows what is probably the (significantly) higher-end version of going down that path with its Wafer-Scale Engine 2.

1 COMMENT

  1. Interesting (or weird):

    The first transistor computer had 92 transistors.

    Colossus Mk2 transistors / World population =

    59,334,610,787 / 7,902,068,494 = 7.50874418

    92 / 7.508 = 12.25

    A dozen CPUs have enough transistors to build everyone a computer.

    That’s the result of 68 years of progress.
