NVIDIA posted a teaser video ahead of its GTC 2020 keynote that is instructive as to what we will see from the company at the show. The video is a fun nod to the shelter-in-place orders that have kept shows such as GTC from happening in 2020. The company’s CEO, Jensen Huang, has something cooking: a new HGX-2 platform with NVIDIA Tesla A100 (GA100) GPUs. We have some details on this in light of our recent Inspur NF5488M5 Review, where we generated tens of gigabytes of our own images and video of the Tesla V100 version of this board. Looking at the differences, we can see some of the key specs emerge.
NVIDIA Tesla A100 Video by NVIDIA
In a fun nod to staying at home, Jensen Huang is taking a big HGX-2 board out of the oven.
At STH, we have had pictures of this assembly for some time, but wanted to respect NVIDIA’s formal launch. Since the company has now shown off the assembly from different angles, we consider it fair disclosure that we can analyze.
In the video, Jensen grunts as he lifts the assembly, and for good reason.
In our recent Tesla V100 review, we saw that the Tesla V100 HGX-2 assembly, with sheet metal around it, weighed over 50 lbs. That is an awkward weight to lift out of an oven.
Looking at the unit, we first see the PCIe side.
Here we can see the much larger heatsinks on the PCIe and power input side of the HGX-2 baseboard.
Here are the Tesla V100 edition PCIe and power side heatsinks for comparison.
As you can see, the new NVIDIA Tesla A100 board has much more robust heatsinks on this side, with more fins for more surface area. Since this is 2020, we think this is a good sign that the Tesla A100 will support PCIe Gen4, as the industry, except for Intel, has now moved to Gen4.
The next item we see is the NVSwitch side.
Something one will immediately notice is that the NVSwitch heatsinks are full tower coolers and much larger than on the Tesla V100 version.
Here is the Tesla V100 version for comparison.
One can easily see that the new Tesla A100 heatsinks are full towers, not the split-level design we see on the Tesla V100 version. On the Tesla V100 version, we can see three copper heat pipe tops on these heatsinks. On the Tesla A100 version, the larger towers have a total of ten heat pipe tops.
These tower coolers are for the six NVSwitch chips on the HGX-2 baseboard. The fact that we are seeing larger coolers means we are likely going to see NVLink 3.0 with faster speeds. In the past, NVIDIA has looked to double NVLink bandwidth each generation, so we think the 300GB/s to each SXM3 module on the Tesla V100 model is not enough for the Tesla A100 generation. Instead, with these large heatsinks, we expect NVIDIA will continue its tradition of at least doubling bandwidth, with NVLink 3.0 reaching 600GB/s per card, perhaps alongside an NVSwitch 2.0.
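As a back-of-the-envelope check on that doubling cadence, here is a small Python sketch. The per-generation figures are aggregate bidirectional NVLink bandwidth per GPU; the NVLink 3.0 number is our projection, not a confirmed spec.

```python
# Known aggregate per-GPU NVLink bandwidth by generation, in GB/s.
# The NVLink 1.0/2.0 figures are public specs; NVLink 3.0 is speculation.
nvlink_per_gpu_gbs = {
    "NVLink 1.0 (Tesla P100)": 160,
    "NVLink 2.0 (Tesla V100)": 300,
}

def projected_next_gen(current_gbs: int) -> int:
    """Project the next generation, assuming the roughly-doubling cadence holds."""
    return current_gbs * 2

# If NVIDIA doubles again from the Tesla V100's 300GB/s:
projection = projected_next_gen(nvlink_per_gpu_gbs["NVLink 2.0 (Tesla V100)"])
print(projection)  # 600
```

That 600GB/s figure is what the oversized NVSwitch coolers lead us to expect, but it remains a projection until NVIDIA publishes specs.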
On the subject of heat pipes, we will note that NVIDIA seems to have done some work on the Tesla A100 SXM3 heatsinks. We already noted in our recent review that the system was designed for “Volta-Next” GPUs that would reach 400W. We therefore expect the Tesla A100 GPUs to hit at least 400W, and likely more at some point, as we saw with the NVIDIA DGX-2H Now with 450W Tesla V100 Modules. Here is the best we could reconstruct using the angles in that video.
We can see that while the old version had the NVIDIA green shrouds over the SXM3 coolers with heatpipe ends popping through, the new “Terminator-themed” Tesla A100 coolers have a smooth finish without interruption of heat pipes.
There are a few bits we can see are missing from the promotional video NVIDIA put out. A good example: on each SXM3 cooler module, Tesla modules normally use the top edges of the heatsinks for labels, including part numbers and serial numbers. We often also see notes on those labels such as that they were produced at TSMC. We do not see that on the new version shown.
The launch of the GA100, or NVIDIA Tesla A100, is perhaps the worst-kept secret in the industry. That is what happens when you have to use the part to bid on supercomputer contracts and have a Tesla V100 that is nearing three years old at this point.
In terms of release date, the fact that this appears to be a PCIe Gen4 product, and the fact that NVIDIA tends to announce Tesla generations well ahead of availability, means there is a good chance we will not see the Tesla GA100 in systems for some time. NVIDIA may do something similar to what it did with the Tesla V100 and announce a DGX system with the parts early to capitalize on initial demand, then release modules to other OEMs. We expect other vendors to have Tesla A100 SXM3 systems in Q3 at the earliest, but more likely Q4 of 2020.
In the meantime, if you want to check out the current HGX-2 baseboard and Tesla V100 platform, here is a video overview that accompanied our review above: