Graphcore v. NVIDIA HGX A100 Power Consumption
At this point, one may assume that perhaps the Graphcore IPU-POD16 is significantly better in terms of TCO because of lower power consumption. We looked up the specs, and that seems not to be the case.
The Graphcore IPU-POD16 is effectively a server plus four of the company’s IPU-M2000 units.
Here is an IPU-M2000 unit that makes up the four 1U chassis above:
Each of the four IPU-M2000 systems has four GC200 Mk2 IPUs (not to be confused with Intel’s confusingly similar IPU usage as its rebellion to DPUs.) While we could not find power consumption for the IPU-POD16, we could for the IPU-M2000.
Typical power is rated at 1.1kW. There are four of the IPU-M2000 units in an IPU-POD16 that Graphcore is using for MLPerf Training v1.0. That gives us 4.4kW typical power usage.
Above the four IPU-M2000’s is a 1U server. One of Graphcore’s big VC backers is Dell (those VCs may not be feeling too good about their investment after this piece, sorry.) The idea is that a reseller/ channel partner can use their own server here. One can also scale IPU-M2000 and compute servers asymmetrically, but the IPU-POD16 is what was submitted to MLPerf Training v1.0. Whether using higher-end Intel Xeon Ice Lake or AMD EPYC 7003 Milan servers, a current-generation 1U dual-socket server will use in the 800W – 1kW range these days. There are options for other lower-power configurations, but that range holds regardless of vendor given the CPU power, system component power, and cooling requirements. We test these servers all the time, and that is a very reasonable range with top-end CPUs, 16x DDR4-3200 DIMMs, and so forth.
That gives us 4.4kW for the Graphcore units and 0.8-1kW for the server which is 5.2kW to 5.4kW combined.
For comparison, we did a quick Linpack run on an HGX A100 320GB (8x 40GB) 400W GPU system with AMD EPYC CPUs and this is what we saw for power:
More on that system soon, but you can see that we hit 5.164kW when we ran this about an hour after the MLPerf results were announced. We did the 500W A100 system at the same time, and the power consumption is higher, but we are not considering that solution a direct competitor. This is not the Inspur NF5488A5 that was in MLPerf Training v1.0 and we will review next week, but the Inspur system is also very similar in terms of this 5.1kW to 5.3kW range. Since we have now tested both, this range seems directionally accurate for HGX A100 8 GPU 400W systems.
Comparing actual measured NVIDIA HGX A100 400W system power consumption with Linpack to the Graphcore “Typical” power consumption figures plus what we measure for 1U/2U dual-socket high-end servers is not perfect since we are not independently measuring the Graphcore figures. Power consumption also varies based on workload so this is far from a perfect representation. Still, it feels roughly similar.
We are going to quickly note that OEM systems with the HGX A100 8 GPU are typically 4U so Graphcore’s 5U against the DGX A100 6U is not a great comparison either. For example, we can see the Tesla Supercomputer with NVIDIA A100 80GB and Supermicro 4U servers shown clearly. Luckily, most racks run out of power and cooling before space with these systems so density is not as much of a factor in TCO for large portions of the market.
Still, at the end of the day, given the data we have, it seems like the Graphcore IPU-POD16 solution is close enough in power consumption to a NVIDIA HGX A100 system that they are roughly equivalent to the point where power is not the deciding factor. If we saw one at 10kW and the other at 5kW, we would be tempted to claim a major TCO impact. With this class of systems a few hundred watts most in the market will gladly burn to get more performance, better software support, or other benefits since it is small on a percentage basis.
Personally, I found the Graphcore MLPerf Training v1.0 submission to be borderline heartbreaking. I have wanted to see a legitimate challenger to NVIDIA to increase competition in the industry and frankly hoped that Graphcore would be it.
Instead, Graphcore showed that it has an IPU-POD16 solution that is roughly equivalent to the NVIDIA HGX A100-based servers in terms of list price and power consumption. Graphcore’s performance, given the price/ power context, was significantly worse than NVIDIA’s. Even allowing for customization by comparing an “Open” result to the standard “Closed” result, it was far behind in terms of performance.
Let us get frank for a moment. NVIDIA has been leading the training market for generations and has an expansive software stack. A startup does not win by equalling NVIDIA in price/ performance, or even being 10-30% ahead. Certainly not by being behind. NVIDIA has a huge value proposition on the software and ecosystem side that is very tough for any startup to match. Intel is an example of a company that can spend and catch up, but the market is not hospitable to those entering without a great software story. Graphcore has been putting in the work, but NVIDIA is effectively the default for most of the industry at this point.
This was a spectacular failure, and it will damage the industry. Graphcore’s poor showing in MLPerf Training v1.0 (along with the Intel-Habana Labs submission) will and should dissuade other companies from submitting to MLPerf in the future. I asked in our interview with Andrew Feldman CEO of Cerebras Systems and his response was to focus on customers, not MLPerf. That is something Graphcore probably should have followed as well since even burning through huge sums of VC money has not gotten Graphcore close to NVIDIA in MLPerf.
Real-world customer results may differ, but if that is the case submitting MLPerf results was a poor choice. The outcome of MLPerf Training v1.0 results for Graphcore should have triggered a decision not to submit those results. Using fantasy pricing numbers instead of publicly available HGX A100 numbers online to take its slower performance and try spinning it into a price/ performance story was illogical. Many online may not understand NVIDIA’s go-to-market model since it has rarely been detailed as seen in this piece. Those who are in the AI hardware industry and are building competitive products should know how the largest player sells systems. Customers certainly know. Using what is effectively fantasy pricing to make its systems look better is strange since this is an industry of highly-educated folks that know what real pricing is.
If anything, I fear such a poor showing in MLPerf will dissuade customers from considering Graphcore in the future which will decrease competition in the industry. It should also prove to be a cautionary tale to other AI training chip startups.