BIG AI Cluster Little Power the 8x NVIDIA GB10 Cluster

April 27, 2026

Building the 8x NVIDIA GB10 Cluster Monitoring and Firmware

Once the cluster had been running for some time, it became clear that we needed to manage it properly, just as you would any other cluster. It came to a head when we were forced to swap one GB10 out of the cluster and replace it with another unit that we had been running as a single node.

Claude Swapping GB10 Nodes In And Out Of Cluster

We built a small cluster monitoring setup with the following key features:

8x NVIDIA GB10 Cluster Monitoring Under Load

Monitor each GB10 node:
- CPU Utilization
- GPU Utilization
- LPDDR5X Usage
- Temperature
- Package Power Consumption
- Node Power Consumption
- Links to the 10GbE and 200GbE networks
- Ports on the PDU

Monitor 200GbE networking:
- Ensure 200GbE is connected
- Monitor each port
- Monitor RDMA networking status on each port
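
As a rough illustration of the per-node memory and temperature items, here is a minimal Python sketch. It is not the tooling used for this cluster. It assumes that system memory as reported by the kernel in /proc/meminfo is the LPDDR5X number of interest, and that the node exposes thermal zones under /sys/class/thermal, which may not hold on every GB10 software image. The function names are made up for this example.

```python
from pathlib import Path
from typing import Optional


def memory_used_fraction(meminfo_text: str) -> float:
    """Return used/total memory from /proc/meminfo text.

    "Used" here means MemTotal minus MemAvailable (both in kB).
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return (total - available) / total


def max_temperature_c(sysfs: Path = Path("/sys")) -> Optional[float]:
    """Return the hottest thermal zone in degrees C, or None if none are readable.

    The kernel reports thermal zone temperatures in millidegrees Celsius.
    """
    temps = []
    for temp_file in (sysfs / "class" / "thermal").glob("thermal_zone*/temp"):
        try:
            temps.append(int(temp_file.read_text().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
    return max(temps) if temps else None
```

On a node, something like `memory_used_fraction(Path("/proc/meminfo").read_text())` could be sampled on a timer and compared against an alert threshold of your choosing.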

Monitor 10GbE networking:
- Ensure 10GbE is connected
- Monitor each port

Monitor WiFi networking:
- Ensure WiFi is not connected
- Monitor each node so that a reboot does not re-enable WiFi
STH GB10 Cluster Manager Firmware Mismatch

Monitor firmware and updates:
- Check OS kernel version
- Check NVIDIA driver version
- Check GB10 firmware version
- Check ConnectX-7 firmware version
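
Once versions have been collected from each node, spotting a mismatch is a simple comparison. The sketch below is one hedged way to do it and is not the STH cluster manager's logic: it assumes you already have a per-node dictionary of component versions (how you gather them is up to you), and the function name and data layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def find_mismatches(
    inventory: Dict[str, Dict[str, str]],
) -> Dict[str, Dict[str, Optional[str]]]:
    """Report nodes whose version differs from the most common one.

    inventory maps node name -> {component: version string}.
    The result maps component -> {node: version} for the outliers only.
    A node that is missing a component is reported with version None.
    On a tie, the version encountered first counts as the majority.
    """
    components = sorted({c for versions in inventory.values() for c in versions})
    report = {}
    for component in components:
        seen = {node: versions.get(component) for node, versions in inventory.items()}
        counts = Counter(seen.values())
        if len(counts) <= 1:
            continue  # every node agrees on this component
        majority, _ = counts.most_common(1)[0]
        report[component] = {n: v for n, v in seen.items() if v != majority}
    return report
```

Treating the most common version as the reference is a design choice for this example. A stricter setup might pin an explicit target version per component instead.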

STH GB10 Cluster Manager Firmware Updates

Monitor cluster power:
- Monitor power for the cluster
- Allow for remote power cycling of nodes

8x NVIDIA GB10 PDU Power Idle For Cluster

This might all sound boring at first, but keeping all of the nodes on the same firmware is a pain if you are managing each one individually. Also, if you are away from the cluster often (I did 98 flights in 2025 as an example), then having the ability to do remote diagnostics and power cycling is important. Having the links mapped between individually switched and monitored PDU ports, switch ports, and nodes makes both troubleshooting and remote hands easier. I once had to remotely guide Sam through swapping out a node, and it was easy because I had a way to verify the issue and then tell him how to validate that the replacement node was in the correct spot.
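
To make the idea of mapping nodes to PDU ports and switch ports concrete, here is a small sketch of how a placement check might look. Everything in it is hypothetical: the `Slot` record, the field names, and the `observed` mapping (which node name shows up on which switch port, gathered however your switch exposes neighbors) are assumptions for illustration, not the tooling described in this article.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Slot:
    """Where one node is expected to be plugged in."""
    node: str
    pdu_outlet: int
    switch_port: str


def check_placement(slots: Iterable[Slot], observed: Dict[str, str]) -> List[str]:
    """Compare the expected layout against what is seen on each switch port.

    observed maps switch port -> node name currently seen on that port.
    Returns human-readable problems; an empty list means everything matches.
    """
    problems = []
    for slot in slots:
        seen = observed.get(slot.switch_port)
        if seen is None:
            problems.append(
                f"{slot.switch_port}: no link "
                f"(expected {slot.node}, PDU outlet {slot.pdu_outlet})"
            )
        elif seen != slot.node:
            problems.append(f"{slot.switch_port}: found {seen}, expected {slot.node}")
    return problems
```

After a swap, an empty result from `check_placement` gives whoever is on site a simple pass or fail, and a "no link" message names the PDU outlet to check first.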

Next, let us get to the performance and optimization.

4 COMMENTS

hans.f April 27, 2026 At 2:26 pm

This is f*n awesome. Good on ya bro

lin81 April 27, 2026 At 2:54 pm

BEST piece you’ve done recently. Wow. It makes me only want a 4N not 8N

mashie April 27, 2026 At 3:54 pm

This is indeed one of your best articles in years. My 2-node cluster will keep me entertained for a long time. Maybe one day I will go up to 4 nodes, we shall see.
The MikroTik CRS804 is perfect for a 4-node cluster, 4x200G for the RoCE, 4x100G for storage, and the last 400G port facing the NAS.

jpmomo April 27, 2026 At 4:31 pm

Have you tried implementing TurboQuant on the cluster? I am curious to see how long context windows impact the available vram over time and how TurboQuant might help.

Great article especially showing how you leveraged ai to set everything up!
