Advertisement


Home AI BIG AI Cluster Little Power the 8x NVIDIA GB10 Cluster

BIG AI Cluster Little Power the 8x NVIDIA GB10 Cluster

13

Building the 8x NVIDIA GB10 Cluster Monitoring and Firmware

Once we had been operating for some time, it became clear that we needed to better manage our cluster, just as you would any cluster. It all came to a head when we were forced to swap one GB10 out of the cluster and replace it with another unit we had running as a single node.

Claude Swapping GB10 Nodes In And Out Of Cluster
Claude Swapping GB10 Nodes In And Out Of Cluster

We built a small cluster monitoring setup with the following key features:

8x NVIDIA GB10 Cluster Monitoring Under Load
8x NVIDIA GB10 Cluster Monitoring Under Load
  • Monitor each GB10 node:
    • CPU Utilization
    • GPU Utilization
    • LPDDR5X Usage
    • Temperature
    • Package Power Consumption
    • Node Power Consumption
    • Links to 10GbE and 200GbE networks
    • Ports on the PDU
  • Monitor 200GbE networking:
    • Ensure 200GbE was connected
    • Monitor each port
    • Monitor RDMA networking status on each port
NVIDIA GB10 Cluster Manager 200GbE RDMA
NVIDIA GB10 Cluster Manager 200GbE RDMA
  • Monitor 10GbE networking
    • Ensure 10GbE was connected
    • Monitor each port
  • Monitor WiFi networking
    • Ensure WiFi is not connected
    • Monitor each so a reboot does not change this
STH GB10 Cluster Manager Firmware Mismatch
STH GB10 Cluster Manager Firmware Mismatch
  • Monitor Firmware and Update
    • Check for OS kernel version
    • Check NVIDIA driver
    • Check Firmware of GB10
    • Check Firmware of ConnectX-7
STH GB10 Cluster Manager Firmware Updates
STH GB10 Cluster Manager Firmware Updates
  • Monitor Cluster Power
    • Monitor power for the cluster
    • Allow for remote power cycling of nodes
8x NVIDIA GB10 PDU Power Idle For Cluster
8x NVIDIA GB10 PDU Power Idle For Cluster

This might all sound boring at first, but keeping all of the nodes on the same firmware is a pain if you are managing each one individually. Also, if you are away from the cluster often (I did 98 flights in 2025 as an example), then having the ability to do remote diagnostics and power cycling is important. Having all of the links between individually switched and monitored PDU ports, switches, and nodes makes both troubleshooting and remote hands easier. Using this, I once had to guide Sam to remotely swap out a node, and it was easy because I had a way to verify the issue and then tell him how to validate that the replaced node was in the correct spot.

Next, let us get to the performance and optimization.

13 COMMENTS

  1. This is indeed one of your best articles in years. My 2-node cluster will keep me entertained for a long time. Maybe one day I will go up to 4 nodes, we shall see.
    The MikroTik CRS804 is perfect for a 4-node cluster, 4x200G for the RoCE, 4x100G for storage, and the last 400G port facing the NAS.

  2. Have you tried implementing TurboQuant on the cluster? I am curious to see how long context windows impact the available vram over time and how TurboQuant might help.

    Great article especially showing how you leveraged ai to set everything up!

  3. Repost from the STH Forums, not sure which spot is better for responses:

    Fantastic article. I’m running 4N right now, was/am still considering a drop to 2N since the extra 2N arent necessary for all models but your point about the fungibility of having more/spare nodes is excellent.

    Very interesting to see the callout on using a flash NAS for shared storage – would love to hear more on best practices for this, especially the situations where a little more GPU in the NAS makes sense. Next click stop for my cluster is the addition of a flash NAS based on the ARR 1U E1S you guys reviewed a while back, but I hadn’t even considered putting a GPU into it. Would love more details on this.

    Also if you’re taking requests a network diagram would be fantastic. I’m running 2x 804DDQs to handle the 4N + flash + uplinks + mgmt bit I’m nowhere near settled on the topology, would love to see how you ended up structuring the full cluster network including NAS & DDQ trunks / uplink.

    Lastly, WTH was Ubiquiti 10G switch you were showing?! It definitely wasn’t the Pro XG-8 PoE (no screen) and AFAIK there’s no other UI 10G switches that aren’t rack mounted. I’m pretty familiar with the UI lineup and the only unit I know that looks like what you showed was the Enterprise 8 PoE (Vintage) model!

  4. I thought your SM Xeon 6 SOC review was unreal good. This is even better. I think I’m more sold on a 4N not and 8N but the M3 Ultra’s prefill is doggy doo doo. I’m lovin’ your new articles Patrick.
    One that I wish you’d done is the QNAP 100G and 25G switch. If you’re only getting 140G then maybe it’d be better to just do 100G

    I don’t get the tube comments where they’re so dead set on being the permanent underclass.

  5. @El Porto & Peter – Patrick touched on it briefly in the video, but the general idea is the GB10 Connectx7 interfaces (and the CRS804 switch) are purely for inter-node coordination as they are running RoCE.

    NAS and management traffic are all on the 10G NICs.

    Deviating from that pattern would likely degrade model performance.

    Regarding cluster size, I’ve been pretty happy with a 2 node with a simple DAC. An alternative (cheaper) way to scale out could simply be to add a totally separate 2n cluster, and load balance the LLM API requests.

  6. I read about 8. Now I’m ordering a 4-node cluster.

    Look at high-capacity DIMM pricing. If you get the 1TB you’re paying for the 128GB, maybe the SSD, but then the CPU, GPU, and NIC are free. These aren’t getting any cheaper. You aren’t going to get a better deal on this much VRAM.

    Apple stopped selling the studio 512 because the spot pricing of the memory alone is approaching $15k.

  7. Awesome writeup. STH is crushing it!

    Has anyone been able to PXE boot their DGX spark? I want to provision it with MAAS but it always hangs. Is it possible to PXE from the connectx instead of the realtek?

  8. Hi!
    “On the MikroTik CRS812 DDQ, set up MTU, PFC, ECN, and all of the QoS bits needed.”
    What is your RouterOS version? AFAIK PFC and ECN are not yet supported in v7.21.4 (long term) or v7.22.2 (stable), and are mandatory (?) functions for lossless RDMA connections.
    Are you using v7.23b/rc?
    Thank you

LEAVE A REPLY

Please enter your comment!
Please enter your name here

This site uses Akismet to reduce spam. Learn how your comment data is processed.