In a standing-room-only session just before the end of the OCP Summit 2018, Alibaba presented its project to deploy immersion cooling at scale. The talk centered on the Alibaba immersion cooling hyperscale project for next-generation data centers. Alibaba has conducted hundreds of experiments and is looking to go into production with 100kW+ racks as soon as 2019.
Apologies in advance for the photo angles. It was a standing-room-only affair and we had to fight for a spot to snap photos.
The Alibaba Immersion Cooling Project
Alibaba is working on a major immersion cooling project for next-generation high-density data centers. For those not familiar with immersion cooling, the concept is to submerge electronics in an electrically non-conductive fluid and take advantage of its better thermal conductivity compared to air. Alibaba is using 3M fluids for its project, and 3M has been promoting this idea for many years.
Here is the slide comparing air cooling, cold plate cooling (used in water-cooled servers such as Baidu’s GPU liquid cooling), and Alibaba’s immersion cooling. The short summary is that Alibaba sees immersion cooling as the best option.
The main reason for Alibaba’s immersion cooling project is to increase efficiency. Cooling is a major data center cost, so increasing cooling efficiency has a direct impact on TCO.
In hotter regions, where free cooling is not practical, Alibaba is seeing significantly lower PUE numbers than with air or traditional liquid cooling setups with cold plates.
Alibaba said that while it is still in the first phase of the project, it sees immersion cooling as a key enabler for racks of 100kW and higher, and it expects to achieve density increases of 3x or more over air cooling.
Another area Alibaba is researching is the reliability impact of removing fans, which both cause vibration and move air over the assembly.
Some of the challenges of immersion cooling existing systems were fairly interesting. For example, in air-cooled systems the thermal interface material between hot components like CPUs and GPUs and their heatsinks is thermal paste. In immersion cooling, different strategies must be used, such as increasing the surface area of heat sources. Alibaba also said that it needed to keep high-speed signaling in the inner PCB layers, away from the 3M fluid.
Another intriguing point was that conventional optics and some other components need to be properly sealed in order to function.
Some of the issues in tank design were novel as well. For example, the first tank design developed tiny cracks from shipping damage, and those cracks leaked fluid. In another example, in southern China, condensation formed on the tank walls when ambient temperatures became too high.
Alibaba is currently in the development and testing phase of the project. It expects production deployment in next-generation data centers to begin in 2019.
The presentation at OCP Summit 2018 was interesting throughout, but the final points were especially important. Alibaba is the first, and probably only, cloud service provider to share details of its immersion cooling project at this point. Alibaba also stated that it needs ecosystem partners to help with the heavy lifting of designing and building the tooling necessary to turn immersion cooling into an industry standard rather than a leading-edge science project.
Alibaba Immersion Cooling Project Final Words
Many small-scale immersion cooling deployments have cropped up over the years, but this one is intriguing. Alibaba is specifically targeting 100kW+ racks for density while still maintaining a low PUE of 1.05-1.07. There are major challenges, though, and the company has been investing in research to bring the technology to the next phase of deployment. In the talk, the company also made clear that this effort is aimed at next-generation data centers, and from the discussion, the project appears to require significant changes to data center design. A great example is servicing: in a typical data center today, gear is pulled out horizontally from vertical racks, while with immersion cooling you generally pull gear out vertically from horizontal racks. If the two biggest barriers to adoption are building the physical facilities to deploy at scale and doing the design research to ensure entire data centers can be maintained over their useful lifetimes, then it takes a company of Alibaba’s scale to undertake such an ambitious project.
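To put the quoted PUE range in perspective: PUE is defined as total facility power divided by IT equipment power, so everything above 1.0 is non-IT overhead such as cooling and power distribution. The short sketch below simply applies that definition to a hypothetical, fully loaded 100kW rack at the 1.05 and 1.07 values from the talk. It is an illustration of the arithmetic under that assumption, not Alibaba’s measured data, and the function names are our own.

```python
def facility_power_kw(it_load_kw: float, pue: float) -> float:
    """Total facility power implied by a PUE value.

    PUE = total facility power / IT equipment power,
    so total facility power = IT load * PUE.
    """
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_load_kw * pue


def overhead_kw(it_load_kw: float, pue: float) -> float:
    """Non-IT overhead (cooling, power distribution, etc.) in kW."""
    return facility_power_kw(it_load_kw, pue) - it_load_kw


if __name__ == "__main__":
    # Hypothetical fully loaded 100kW rack; PUE values are the range quoted in the talk.
    rack_kw = 100.0
    for pue in (1.05, 1.07):
        print(
            f"PUE {pue:.2f}: total {facility_power_kw(rack_kw, pue):.1f} kW, "
            f"overhead {overhead_kw(rack_kw, pue):.1f} kW"
        )
```

Run as a script, this prints an overhead of 5.0 kW at PUE 1.05 and 7.0 kW at PUE 1.07, meaning only about 5-7 kW of supporting power per 100kW of IT load under this simplified assumption.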