BIG AI Cluster Little Power the 8x NVIDIA GB10 Cluster

April 27, 2026

Building the 8x NVIDIA GB10 Cluster Key Setup Steps

Perhaps the coolest thing about using a system like this is that it uses NVIDIA’s NCCL and other libraries. On the one hand, that makes everything straightforward to set up. On the other hand, setting this up is far more complex than operating a single GPU, or a few GPUs, in a single system.

Gigabyte AI TOP ATOM NVIDIA GB10 Front 1

Here, we are just going to cover some of the key steps:

1. Physically connect all of the systems and switches.
   - Ensure that you are using the same ConnectX-7 ports on the back of each GB10, as that will make your life much easier when it comes to managing the networking.
   - If you are not using WiFi, turn those radios off.
   - Document all of the connections.
2. Update all of the firmware across the cluster.
   - Firmware changes can change performance significantly, especially if one or more nodes are on a different firmware version.
   - Once you are running the cluster, updating a node or nodes and rebooting will cause model downtime.
3. On the MikroTik CRS812 DDQ, set up MTU, PFC, ECN, and all of the QoS bits needed. This still requires going into the CLI.
4. Ensure that the ConnectX-7 NICs on each node are set up, can use RDMA and NCCL, and have the correct MTU to match the MikroTik switch.
5. Speed test the RDMA networking.
   - If you are getting ~10Gbps, that is likely because a node is going out over the 10GbE network.
   - Make sure to do bi-directional testing.
6. Ensure the 10Gbase-T NICs are handling management.
7. Set up shared NAS storage.
   - You can also cluster storage onboard each node, but given model sizes, it will be easier to just use a NAS.
   - Make sure each node tests access to the NAS and shared model storage directories.
   - Ensure you are getting the expected speed for the NAS.
8. Get vLLM working (or another serving platform that you can cluster) on all of the nodes.
   - Containers can make this a lot easier.
   - Make sure each node is running the same vLLM version.
   - Test connectivity for vLLM between nodes, and ensure it is using the ConnectX-7 path, not the 10Gbase-T path.
   - Smoke test running the model.
9. Set up some kind of monitoring solution. We will have more on this in the next section.
10. Document everything, as you will need it later.
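
A couple of the checks in the list above lend themselves to a short script. The following is a minimal sketch in Python, not the tooling used for this cluster. The node names, fact keys, build labels, MTU values, and bandwidth numbers are all hypothetical, and the thresholds are assumptions you would tune. It assumes you have already collected per-node settings and bi-directional speed test results by whatever means you prefer.

```python
from collections import Counter


def find_mismatches(node_facts):
    """Return {fact: {node: value}} for nodes that differ from the most common value.

    node_facts maps a node name to a dict of settings, for example the MTU
    on the RDMA interface or the vLLM build in use on that node.
    """
    mismatches = {}
    fact_names = sorted({name for facts in node_facts.values() for name in facts})
    for name in fact_names:
        values = {node: facts.get(name) for node, facts in node_facts.items()}
        most_common_value, _ = Counter(values.values()).most_common(1)[0]
        outliers = {node: v for node, v in values.items() if v != most_common_value}
        if outliers:
            mismatches[name] = outliers
    return mismatches


def flag_suspect_links(results_gbps, mgmt_gbps=10.0, margin=0.15, max_asymmetry=0.2):
    """Return {(src, dst): [reasons]} for speed test results that look wrong.

    results_gbps maps (src, dst) to measured throughput in Gbps.
    mgmt_gbps, margin, and max_asymmetry are assumed thresholds, not
    values from the article.
    """
    flagged = {}
    for (src, dst), gbps in results_gbps.items():
        reasons = []
        # A result at or near the management link speed suggests the
        # traffic left over the 10GbE network instead of the RDMA fabric.
        if gbps <= mgmt_gbps * (1 + margin):
            reasons.append("near management-network rate")
        reverse = results_gbps.get((dst, src))
        if reverse is None:
            reasons.append("no reverse-direction test")
        elif abs(gbps - reverse) > max_asymmetry * max(gbps, reverse):
            reasons.append("asymmetric versus reverse direction")
        if reasons:
            flagged[(src, dst)] = reasons
    return flagged


# Hypothetical example data for a three-node subset of a cluster.
example_facts = {
    "gb10-1": {"mtu": 9000, "vllm": "build-a"},
    "gb10-2": {"mtu": 9000, "vllm": "build-b"},
    "gb10-3": {"mtu": 1500, "vllm": "build-a"},
}
example_results = {
    ("gb10-1", "gb10-2"): 185.0,
    ("gb10-2", "gb10-1"): 9.4,
    ("gb10-1", "gb10-3"): 180.0,
}

for fact, nodes in find_mismatches(example_facts).items():
    print(f"{fact}: {nodes}")
for link, reasons in flag_suspect_links(example_results).items():
    print(f"{link}: {', '.join(reasons)}")
```

Comparing against the most common value, and not against a hard-coded target, keeps the sketch independent of whichever MTU or vLLM version you settle on. It only tells you which nodes disagree with the rest.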

Those are the high-level steps to get a cluster like this working. Many of our readers are going to look at that punch list and think, “That is a Saturday job, no problem.” Others are going to go into panic mode because there is a lot there and it crosses domains like servers, networking, storage, and AI platforms. Here is the fun bit: in 2026, you do not need to know any of that.

Instead, you can take the list above, give it to Claude Code, Codex, or even OpenClaw/Hermes with a decent model behind it, and just have an AI agent do it for you. I know, giving 1TB of memory, 160 Arm cores, several TB of storage, and eight Blackwell GPUs to an AI agent sounds a bit like how Skynet started. That is fair, but it worked. Given the pace of AI evolution, this was possible in February 2026, and would be much easier now that there are more models that are better with tool calling.

The last part is the monitoring, so let us get to that next.

14 COMMENTS

hans.f April 27, 2026 At 2:26 pm

This is f*n awesome. Good on ya bro

lin81 April 27, 2026 At 2:54 pm

BEST piece you’ve done recently. Wow. It makes me only want a 4N not 8N

mashie April 27, 2026 At 3:54 pm

This is indeed one of your best articles in years. My 2-node cluster will keep me entertained for a long time. Maybe one day I will go up to 4 nodes, we shall see.
The MikroTik CRS804 is perfect for a 4-node cluster, 4x200G for the RoCE, 4x100G for storage, and the last 400G port facing the NAS.

jpmomo April 27, 2026 At 4:31 pm

Have you tried implementing TurboQuant on the cluster? I am curious to see how long context windows impact the available vram over time and how TurboQuant might help.

Great article especially showing how you leveraged ai to set everything up!

Peter Drayton April 27, 2026 At 5:45 pm

Repost from the STH Forums, not sure which spot is better for responses:

Fantastic article. I’m running 4N right now, was/am still considering a drop to 2N since the extra 2N aren’t necessary for all models, but your point about the fungibility of having more/spare nodes is excellent.

Very interesting to see the callout on using a flash NAS for shared storage – would love to hear more on best practices for this, especially the situations where a little more GPU in the NAS makes sense. Next click stop for my cluster is the addition of a flash NAS based on the ARR 1U E1S you guys reviewed a while back, but I hadn’t even considered putting a GPU into it. Would love more details on this.

Also, if you’re taking requests, a network diagram would be fantastic. I’m running 2x 804DDQs to handle the 4N + flash + uplinks + mgmt, but I’m nowhere near settled on the topology. Would love to see how you ended up structuring the full cluster network including NAS & DDQ trunks / uplink.

Lastly, WTH was the Ubiquiti 10G switch you were showing?! It definitely wasn’t the Pro XG-8 PoE (no screen) and AFAIK there’s no other UI 10G switches that aren’t rack mounted. I’m pretty familiar with the UI lineup and the only unit I know that looks like what you showed was the Enterprise 8 PoE (Vintage) model!

Kris Leslie April 27, 2026 At 7:13 pm

What tool or service did you use to get the stats for all the power and servers?

El Porto Verde April 27, 2026 At 8:14 pm

I thought your SM Xeon 6 SOC review was unreal good. This is even better. I think I’m more sold on a 4N, not an 8N, but the M3 Ultra’s prefill is doggy doo doo. I’m lovin’ your new articles Patrick.
One that I wish you’d done is the QNAP 100G and 25G switch. If you’re only getting 140G then maybe it’d be better to just do 100G

I don’t get the tube comments where they’re so dead set on being the permanent underclass.

Tex April 27, 2026 At 11:29 pm

Use basic punctuation in your titles. I had a stroke trying to read it.

Joel April 28, 2026 At 5:00 am

@El Porto & Peter – Patrick touched on it briefly in the video, but the general idea is the GB10 ConnectX-7 interfaces (and the CRS804 switch) are purely for inter-node coordination as they are running RoCE.

NAS and management traffic are all on the 10G NICs.

Deviating from that pattern would likely degrade model performance.

Regarding cluster size, I’ve been pretty happy with a 2 node with a simple DAC. An alternative (cheaper) way to scale out could simply be to add a totally separate 2n cluster, and load balance the LLM API requests.

Tyler NYC April 28, 2026 At 12:23 pm

I read about 8. Now I’m ordering a 4-node cluster.

Look at high-capacity DIMM pricing. If you get the 1TB you’re paying for the 128GB, maybe the SSD, but then the CPU, GPU, and NIC are free. These aren’t getting any cheaper. You aren’t going to get a better deal on this much VRAM.

Apple stopped selling the studio 512 because the spot pricing of the memory alone is approaching $15k.

Chad April 29, 2026 At 9:33 am

Awesome writeup. STH is crushing it!

Has anyone been able to PXE boot their DGX Spark? I want to provision it with MAAS but it always hangs. Is it possible to PXE from the ConnectX instead of the Realtek?

sirca April 29, 2026 At 11:50 am

Hi!
“On the MikroTik CRS812 DDQ, set up MTU, PFC, ECN, and all of the QoS bits needed.”
What is your RouterOS version? AFAIK PFC and ECN are not yet supported in v7.21.4 (long term) or v7.22.2 (stable), and are mandatory (?) functions for lossless RDMA connections.
Are you using v7.23b/rc?
Thank you

m4r1k May 5, 2026 At 5:55 am

Amazing work!!

Jacob Johnson July 4, 2026 At 7:43 pm

GB10 follow-up: standard RDMA benchmark falsely reports broken fabric on GB10 — characterized with independent probe, 24 GB/s NCCL on 3-node dual-rail MikroTik CRS804 build

Hi STH team,

Your 8x GB10 cluster article and the ConnectX-7 networking deep-dive were the reference points for our 3-node GB10 build (Dell Pro Max FCM1253, dual-rail RoCEv2, 2× MikroTik CRS804 on RouterOS 7.23.1). Along the way we found something your readers building GB10 clusters will hit: ib_write_bw deterministically reports a hard 64 KiB RDMA WRITE ceiling on this platform — responder-side local protection errors, 72/72 failing cells across every node pair and RDMA device — that does not actually exist.

We characterized it fully before concluding: an independent minimal libibverbs probe (with responder-side content verification) passes the identical MR/WR geometry at every size, and NCCL acceptance on the same fabric sustained 24.0 GB/s busbw (3-node all_reduce, full sweep, zero validation errors) with per-MAC byte counters balanced to 0.02% across both rails. Kernel version, IOMMU mode, firmware currency, MR flags, ODP, and relaxed ordering were each eliminated by dedicated experiment. Two unresolved NVIDIA forum threads from 2023 (CX-5) and 2024 (CX-7/KVM) show the same boundary and syndrome — this has been biting people quietly for years.

Full writeup, elimination matrix, and probe source: https://github.com/linux-rdma/perftest/issues/394
Community PSA with GB10 cluster-builder notes (NCCL topology trap, counter methodology, 4-subnet/ARP-discipline config that produced perfectly symmetric MAC loading): [YOUR-FORUM-POST-URL]

Happy to share the dual-rail CRS804/RouterOS 7.23.1 configuration details or run comparison tests if useful for a follow-up piece.

Jacob Johnson
