Exploring the NVIDIA HGX B200 Lambda AI Cluster at Cologix with Supermicro

7

The Forgotten Parts of the AI Cluster

When you look at this picture, you may be drawn to the Supermicro GPU servers, the network switches, firewalls, cold aisle containment, or overhead cable raceways. If you look a little closer, you might see some of the more forgotten parts of the AI cluster.

Supermicro Lambda Cologix Misc 11
Supermicro Lambda Cologix Misc 11

In the middle of that rack, there are traditional compute servers as well. Here are three Supermicro 1U servers.

Supermicro Lambda Cologix Misc 10
Supermicro Lambda Cologix 1U servers 10

We also found a number of 2U Supermicro servers around the cluster.

Supermicro Lambda Cologix Misc 14
Supermicro Lambda Cologix Misc 14

We discussed some of the other 1U storage servers, but there is a good reason we see so many traditional CPU compute platforms around. Lambda needs to operate its cluster. Customers need traditional compute resources located with the AI servers to orchestrate activity and also to provide other services. We are not going to provide an exact count, but if you watch the video and look for them, you will notice there are many traditional compute nodes in different form factors alongside the GPU servers.

Supermicro Lambda Cologix Misc 4
Supermicro Lambda Cologix Misc 4

Here is a quick screen capture from a running cluster where you can see the GPU nodes running, but also the CPU-based nodes.

Lambda 1 Click Clusters CPU And GPU Nodes
Lambda 1 Click Clusters CPU And GPU Nodes

There are also a ton of other components. For example, you might have noticed the red and blue power cables, the PDUs, different types of labeling (the Sharpie version was perhaps the most fun), cable management devices, and so forth. It is always fun to see just how many things must come together to make clusters work.

Supermicro Lambda Cologix Misc 25
Supermicro Lambda Cologix Misc 25

While investors may focus on the high-dollar items like the Supermicro servers, NVIDIA GPUs, storage vendors, network switches, optical modules, and so forth, that is only a part of the overall BOM of a cluster like this.

Supermicro Lambda Cologix Misc 20
Supermicro Lambda Cologix Misc 20

Instead, if these other pieces are missing, often the cluster cannot be brought online, and costly GPU servers sit idle not able to service customer workloads.

Supermicro Lambda Cologix Misc 15
Supermicro Lambda Cologix Misc 15

What is more, even seemingly small details can have impacts on the supply chain like the airflow direction on the switches.

Supermicro Lambda Cologix Misc 5
Supermicro Lambda Cologix Misc 5

Small things can make a difference. For example, if you have fiber bundles custom-built, but to the incorrect lengths, they then need to be redone and delay go-live. AI clusters have a surprising amount of surrounding infrastructure that needs to be in place for the clusters to operate.

There was one cool item that we missed, and that is a special “tool” for helping run cabling in the overhead raceways. Hopefully, we will get that in a future video.

Supermicro Lambda Cologix Cable Management 6
Supermicro Lambda Cologix Cable Management 6

One of my hopes with this article and video is that folks appreciate just how many things must come together to make AI clusters work. While in mainstream media these parts are often forgotten, this is STH, and I want to show our readers whenever we can.

Lambda needs to manage this not just for this cluster, but for all of its existing and future clusters. Let us get to growth next.

7 COMMENTS

  1. Article good. Video better. I’m not sure I’ve seen a dc video as fast paced as your xAI video since that one. It was like I was watching some gripping mission impossible action movie not some boring dc video. I don’t know how you did that, but keep doing more of it

  2. Eagerly awaiting the day when the AI bubble bursts after investors figure out that AI isn’t a magic black box that replaces human employees. Then some of this hardware can hit the secondary market for prices that hobbyists are willing to pay for hobbyist use of AI.

  3. AI isn’t some investor-fueled bubble. Companies are actively spending massive amounts of money on it. If the companies are collectively spending hundreds of billions of dollars per year on AI just to satisfy perceived investor interest then there is a much bigger problem in the marketplace than an AI bubble. If it is a bubble it is in the hopes and expectations of technology companies, not in the speculation of investors.

  4. @Matt it’s become so bad that the likes of Microsoft have started ‘trimming the fat’ despite record revenues, and are desperately tacking AI (‘copilot’) onto any popular product despite customer pushback that it doesn’t really add any value let alone justify a price increase.

  5. Looking at Cologix’s locations it seems that the most northerly location is Montreal Canada.

    Checking the Open Canada website for “Permafrost, Atlas of Canada, 5th Edition” we see that there’s so many better locations available for cooling a server farm (at the lowest possible price). Just as you probably don’t want to setup in southern California (because of the temperature and electricity costs) you wouldn’t setup in southern Canada (when you can move to northern Canada, near a dam).

    Not my hundreds of millions ….

  6. Within a year or two, we’ll be looking back and wondering how absolutely *everyone* seemed to think this was a great idea.

    I’d love to see this equipment put to work doing scientific research but I fear that it’s already too tightly optimized for AI work.

    Regardless, fascinating view into how these clusters come together. Thanks, Patrick!

LEAVE A REPLY

Please enter your comment!
Please enter your name here

This site uses Akismet to reduce spam. Learn how your comment data is processed.