Exploring the NVIDIA HGX B200 Lambda AI Cluster at Cologix with Supermicro

Air Cooling an AI Cluster with Cologix

Cologix operates a number of data centers, both in the Columbus, Ohio area and elsewhere. It has both liquid-cooled and air-cooled facilities, and we happened to visit a nice air-cooled location.

Supermicro Lambda Cologix Sign 1

Likewise, Lambda has a number of clusters. This was one that we could look at while it was both operating and being expanded.

Supermicro Lambda Cologix Patrick 1

Most data centers and clusters I get to tour have something iconic about them. For me, the blue cooling walls are going to be the iconic part of this tour. The scale of these, combined with them being a contrasting blue, is just neat.

Supermicro Lambda Cologix Patrick 5

Behind the blue mesh are heat exchangers. Chillers mounted on the roof circulate fluid to what are effectively massive radiators, like you would find in a car. The cooler room air in the cold aisles is pulled into the Supermicro GPU servers and through heatsinks, where heat is removed from the components and transferred into the air.

Supermicro Lambda Cologix Drive Bay 6

That warmed air is then contained in a hot aisle where it rises and is pulled around to the heat exchangers. My trips through the hot aisles were brief as they were both noisy and hot.

Supermicro Lambda Cologix Drive Bay 4

Those heat exchangers remove heat from the air, transferring the heat to the fluid loop. After that heat is removed, the air is then recirculated into the cold aisles of the data center.
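For a rough feel of how much air has to move through a server to carry its heat away, the standard sensible-heat relation (heat = density x volume flow x specific heat x temperature rise) can be sketched in a few lines. The 10 kW server load and 15 K temperature rise below are hypothetical values chosen for illustration, not figures from this tour, and the air properties are approximate room-condition constants.

```python
# Rough airflow estimate from the sensible-heat relation:
#   heat_load = air_density * volume_flow * specific_heat * temperature_rise
AIR_DENSITY_KG_M3 = 1.2      # approximate, room temperature near sea level
AIR_CP_J_PER_KG_K = 1005.0   # approximate specific heat of air


def required_airflow_m3_s(heat_load_w: float, delta_t_k: float) -> float:
    """Volume of air per second needed to absorb heat_load_w with a
    cold-aisle to hot-aisle temperature rise of delta_t_k."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    return heat_load_w / (AIR_DENSITY_KG_M3 * AIR_CP_J_PER_KG_K * delta_t_k)


# Hypothetical example values (assumptions, not measurements from the site).
flow = required_airflow_m3_s(heat_load_w=10_000, delta_t_k=15)
print(f"{flow:.2f} m^3/s")  # prints "0.55 m^3/s"
```

Multiply that by every server in a row and it becomes easier to see why the cooling walls are as large as they are.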

Supermicro Lambda Cologix Patrick 4

There are a number of ways that folks exchange air in a data center. Technically, there is a liquid loop removing heat from the cluster and bringing it outside. In the industry, however, we do not call this a liquid-cooled AI cluster. Instead, this is air cooling because the heat goes from the GPU/CPU to air via heatsinks, and only then to the liquid loops for the data center.

Supermicro NVIDIA GB200 NVL72 Rack Installed For Lambda Rear

We use liquid cooling to describe heat going from the GPU/CPU directly into liquid cooling blocks, like we see on Lambda’s Supermicro GB200 NVL72 rack.
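The naming convention comes down to the first medium that takes heat off the chip, not whether liquid appears somewhere later in the chain. A minimal sketch of that rule follows; the function name and the way the heat path is written as a list of strings are purely illustrative assumptions.

```python
def classify_cooling(heat_path: list[str]) -> str:
    """Label a system by the first medium that receives heat from the chip.

    heat_path starts with "chip" and lists each medium the heat passes
    through on its way out of the building, e.g. ["chip", "air", "liquid"].
    """
    if len(heat_path) < 2 or heat_path[0] != "chip":
        raise ValueError("heat_path must start with 'chip' and name a medium")
    first_medium = heat_path[1]
    if first_medium == "air":
        return "air cooling"
    if first_medium == "liquid":
        return "liquid cooling"
    raise ValueError(f"unknown medium: {first_medium}")


# Heatsinks to room air, then the facility's liquid loop (the air-cooled hall).
print(classify_cooling(["chip", "air", "liquid"]))  # prints "air cooling"
# Cooling blocks on the GPU/CPU (the GB200 NVL72 rack).
print(classify_cooling(["chip", "liquid"]))         # prints "liquid cooling"
```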

Supermicro NVIDIA GB200 NVL72 Rack Installed For Lambda Front Bottom

Cooling is fun, but you are probably wondering about power. Let us get to that next.

Powering the Cluster at Cologix

The Cologix data center has its own power substation.

Supermicro Lambda Cologix Power Outdoors 4

While this is not part of the AI cluster tour, some folks have never seen what this looks like.

Supermicro Lambda Cologix Power Outdoors 3

We did not tour inside the fenced area for safety reasons. This is a 36MW facility.

Supermicro Lambda Cologix Power Outdoors 2

For some sense of scale, here is the row of power containers with things like battery banks outside of the facility. Each of these is rated for something like 1.6MW. If you squint, you might be able to see me walking through this corridor between the data center and these pods.
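As a back-of-the-envelope comparison of those two figures, here is how many 1.6MW units it would take to add up to 36MW. This assumes each container is exactly 1.6MW and that the containers map one-to-one onto facility capacity, neither of which was stated on the tour, so treat it as a sense of scale rather than a count of what is installed.

```python
import math

FACILITY_MW = 36.0    # facility figure quoted above
CONTAINER_MW = 1.6    # "something like 1.6MW" per container (approximate)

# How many container-sized units add up to the facility figure.
equivalent_units = FACILITY_MW / CONTAINER_MW
# Whole containers needed to cover the full figure, with no redundancy.
whole_units = math.ceil(equivalent_units)

print(f"{equivalent_units:.1f} container-equivalents")  # prints "22.5 container-equivalents"
print(f"{whole_units} whole containers")                # prints "23 whole containers"
```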

Supermicro Lambda Cologix Patrick 3

That power is then brought inside the facility and distributed via busbars/busways. We looked at busbars/busways last year in another video.

Supermicro Lambda Cologix Cable Management 2

An advantage of this type of setup is that one can use tap-off boxes to bring the right type of power to a rack via a movable overhead box.

Supermicro Lambda Cologix Cable Management 4
Supermicro Lambda Cologix Busway

You can see examples of this with different tap-off boxes being used for different racks, depending on the type of rack being provisioned.

Supermicro Lambda Cologix Misc 17

This may seem like a small feature, but if an upgrade happens in the future and racks move or need different types of power, these tap-off boxes take a few minutes to swap and move.

We have looked at the GPU servers, networking, storage, cooling, and power so far. Still, there is quite a bit more to AI clusters that is often overlooked. Let us get to that next.

7 COMMENTS

  1. Article good. Video better. I’m not sure I’ve seen a dc video as fast paced as your xAI video since that one. It was like I was watching some gripping mission impossible action movie not some boring dc video. I don’t know how you did that, but keep doing more of it

  2. Eagerly awaiting the day when the AI bubble bursts after investors figure out that AI isn’t a magic black box that replaces human employees. Then some of this hardware can hit the secondary market for prices that hobbyists are willing to pay for hobbyist use of AI.

  3. AI isn’t some investor-fueled bubble. Companies are actively spending massive amounts of money on it. If the companies are collectively spending hundreds of billions of dollars per year on AI just to satisfy perceived investor interest then there is a much bigger problem in the marketplace than an AI bubble. If it is a bubble it is in the hopes and expectations of technology companies, not in the speculation of investors.

  4. @Matt it’s become so bad that the likes of Microsoft have started ‘trimming the fat’ despite record revenues, and are desperately tacking AI (‘copilot’) onto any popular product despite customer pushback that it doesn’t really add any value let alone justify a price increase.

  5. Looking at Cologix’s locations it seems that the most northerly location is Montreal Canada.

    Checking the Open Canada website for “Permafrost, Atlas of Canada, 5th Edition” we see that there are so many better locations available for cooling a server farm (at the lowest possible price). Just as you probably don’t want to set up in southern California (because of the temperature and electricity costs) you wouldn’t set up in southern Canada (when you can move to northern Canada, near a dam).

    Not my hundreds of millions ….

  6. Within a year or two, we’ll be looking back and wondering how absolutely *everyone* seemed to think this was a great idea.

    I’d love to see this equipment put to work doing scientific research but I fear that it’s already too tightly optimized for AI work.

    Regardless, fascinating view into how these clusters come together. Thanks, Patrick!
