Many years ago at STH, we had a brief Cluster-in-a-Box series. This was back in ~2013 when it was just me working part-time on the site and we did not have the team and resources we have at STH today. It was also before concepts like Kubernetes were around, and the idea of a full-fledged Arm server seemed far off. Today, it is time to show off a project that has been months in the making at STH: the 2021 Ultimate Cluster-in-a-Box x86 and Arm edition. We have over 1.4Tbps of network bandwidth, 64 AMD-based x86 cores, 56 Arm Cortex-A72 cores, 624GB of RAM, and several terabytes of storage. It is time to go into the hardware.
This project is being released today because it is a collaboration with Jeff Geerling. I often hear him referred to as “the Raspberry Pi guy” because he does some amazing projects with Raspberry Pi platforms. While I was in St. Louis for SC21, you may have seen me at the Gateway Arch. We filmed a bit there, and then Jeff dropped me off at the Anheuser-Busch plant that made a cameo in the recent Top 10 Showcases of SC21 piece. Our goal was simple: build our vision of a cluster-in-a-box.
As a bit of background, the cluster I had was a project from late May 2021 that you may have seen glimpses of on STH. The goal was to test some new hardware in a platform that was easy to transport since I knew that the Austin move would happen soon. Over the course of the move, we never had the chance to show this box off. Jeff had the new Turing Pi 2 platform to cluster Raspberry Pis, so it seemed like a challenge was in order. He went for the lower-cost version; I went for what is probably one of the best cluster-in-a-box solutions you can make in 2021.
The ground rules were simple. We needed at least four Arm server nodes and it had to fit into a single box with a single power supply. To be fair, I knew Jeff’s plan before we filmed the collaboration, but he did not know what I had sitting in a box in Austin ready for this one.
For those who want to see, here is the STH video on the ultimate cluster in a box:
Here is Jeff’s video using the Turing Pi 2:
These are great options to watch to see both the higher-end of what is possible today along with the lower-end. As always, it is best to open these in a new browser window, tab, or app for the best viewing experience.
Building the Ultimate x86 and Arm Cluster-in-a-Box
Let us get to the hardware. First, on the x86 side. For this, we are using the AMD Ryzen Threadripper Pro platform. Usually, this platform has the Threadripper PRO 3995WX, which is a Rome-generation 64-core, 128-thread “WEPYC,” or workstation EPYC. Sadly, the only photos I had were with the 3975WX and 3955WX in this motherboard before the cooling solution arrived.
The system is using 8x 64GB Micron DIMMs for 512GB of DDR4-3200 ECC memory. This was the platform where one DIMM decided to pull an Incubus “Pardon Me” and burst into flames. Micron, to its credit, saw that on Twitter and sent a replacement DIMM.
The motherboard being used is the ASUS Pro WS WRX80E-SAGE SE WiFi. This is an absolutely awesome platform for the Threadripper Pro, or basically anything unless you are looking for a low-power, low-cost platform. This is meant for halo builds.
The motherboard is fitted in the Fractal Design Define 7 XL. This is an enormous chassis, but that is almost required given how large the motherboard is. Even with that, it was actually a more difficult install than one would imagine due to the motherboard size.
The cooling solution for the CPU is the ASUS ROG Ryujin 360 RGB AIO liquid cooler. In the grand scheme of the overall system cost, getting a slightly more interesting cooler seemed like it was not a big deal. It was more expensive, but in May 2021, this was also what I could get on Amazon with one-day shipping.
Here is a look with more components installed:
As a quick note, the original plan was for this to be a 6x DPU cluster with extra storage via Samsung 980 Pro SSDs on the Hyper M.2 x16 Gen4 card. However, if one is doing a cluster-in-a-box, one gets more nodes by adding an extra DPU and using the onboard M.2 slots for storage. If you have seen this card on STH, that is the reason. In practice, this is actually the card that the 7th DPU replaces when the system is being used.
The DPUs are Mellanox NVIDIA BF2M516A units. A keen eye will note that we have two different revisions with slight differences in the system. Each card has an 8-core Arm Cortex-A72 CPU running at 2.0GHz, so across seven cards we have 56 Arm cores. These, unlike the Raspberry Pi, have higher-end acceleration for things like crypto offload. We recently covered Why Acceleration Matters on STH. Other important specs for these cards are that they have 16GB of memory and 64GB of onboard flash for their OS. Since STH uses Ubuntu, these cards are running Ubuntu, and the base image of our cards includes Docker so we can run containers on them out of the box.
The cards themselves have two 100Gbps network ports. Our particular cards are VPI cards. We covered what VPI means in our Mellanox ConnectX-5 VPI 100GbE and EDR InfiniBand Review. Basically, these can run either in 100GbE mode or as EDR InfiniBand. The cards can be configured in one of two ways. First, the Arm chip can be placed as a bump-in-the-wire between the host system and the NIC ports.
One may do this for firewall, provisioning, or other applications. How we actually use them is in the second mode, where both the host and the Arm CPU have access to the ports simultaneously, because putting the eight Arm cores in the data path usually has a negative impact on network performance. That mode does not put the Arm CPU in the path from the host to the NIC and usually increases performance. BlueField-2 feels very much like an early product.
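For those curious how one switches between these modes, on BlueField-2 this is handled through NVIDIA's mlxconfig firmware tool. A minimal sketch follows; the MST device path is an assumption and will vary by system, and a firmware reset or reboot is needed for the change to take effect:

```shell
# Query the current BlueField-2 operation mode (device path is an assumption)
mlxconfig -d /dev/mst/mt41686_pciconf0 q INTERNAL_CPU_MODEL

# INTERNAL_CPU_MODEL=1 is the embedded (bump-in-the-wire) mode;
# INTERNAL_CPU_MODEL=0 is separated host mode, where host and Arm
# cores access the NIC ports independently
mlxconfig -d /dev/mst/mt41686_pciconf0 s INTERNAL_CPU_MODEL=0
```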
We are going to quickly mention that there are lower-power and low-profile 25GbE cards as well. We did not have the full-height bracket, or one of those may have been used.
Here is a look at the seven cards stacked in the system:
Here is the docker ps output and the network configuration on a fresh BlueField-2 card. One can see that there is an interface back to the host system as well.
One other interesting point is just how much networking is on the back of this system. Here is a look:
Tallying this up:
- 2x 10Gbase-T ports (ASUS)
- 14x QSFP56/ QSFP28 100GbE ports (7x BlueField-2 cards)
- 7x Management ports (7x BlueField-2 with a shared ASUS port on a 10G NIC)
- WiFi 6
All told, we have a total of 24 network connections to make on the back of this system. That is a big reason why one of the next projects you will see on STH is running 1,700 fibers through the home studio and offices.
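As a quick sanity check, the rear-panel tally works out as follows (a sketch, counting the WiFi 6 radio as a single connection and the shared ASUS management port under the DPU management total):

```shell
# Rear-panel network connection tally for this build
asus_10gbaset=2                # 2x 10Gbase-T on the ASUS motherboard
dpu_100g=$((7 * 2))            # 7x BlueField-2 cards, 2x QSFP56/QSFP28 ports each
dpu_mgmt=7                     # one management port per DPU (ASUS shares one)
wifi=1                         # WiFi 6 counted as a single connection
total=$((asus_10gbaset + dpu_100g + dpu_mgmt + wifi))
echo "${total} connections"    # 24 connections
```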
With the mass of fiber installed, it is trivial to connect a system like this without having to put a loud and power-hungry switch in the studio.
Tallying Up the Solution
All told, we have the following specs, not counting the ASPEED baseboard management controller running the ASUS ASMB10-iKVM management:
Processors: 120 Cores/ 184 Threads
- 1x AMD Ryzen Threadripper Pro 3995WX with 64 cores and 128 threads
- 7x NVIDIA BlueField-2 8-core Arm Cortex A72 2.0GHz DPUs
- 512GB Micron DDR4-3200 ECC RDIMMs
- 7x 16GB on DPU RAM
- 2x Micron 7400 3.84TB M.2 SSDs
- 7x 64GB DPU Storage
- 2x 10Gbase-T ports from the ASUS motherboard
- 14x 100G ports from BlueField-2 DPUs
- 8x out-of-band management ports (one shared)
- WiFi 6
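The headline core, thread, and memory figures fall straight out of those line items. A quick sketch, assuming the Cortex-A72 has no SMT (so DPU cores equal DPU threads):

```shell
# Tallying the combined x86 + Arm resources in the box
dpus=7
arm_cores_per_dpu=8                                   # one 8-core Cortex-A72 per BlueField-2
total_cores=$((64 + dpus * arm_cores_per_dpu))        # 64 x86 cores + 56 Arm cores
total_threads=$((128 + dpus * arm_cores_per_dpu))     # x86 SMT doubles threads; Arm does not
total_ram_gb=$((512 + dpus * 16))                     # 512GB host DDR4 + 16GB per DPU
echo "${total_cores} cores / ${total_threads} threads / ${total_ram_gb}GB RAM"
```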
There is a lot here, for sure, and this is several steps beyond the 2013 Mini Cluster-in-a-Box series.
Of course, this is not something we would suggest building for yourself unless you really wanted a desktop DPU workstation. This is more of an “art of the possible” build, much like the Ultra EPYC AMD Powered Sun Ultra 24 Workstation we did. Still, this is effectively well over a 10x build versus what we did using the Intel Atom C2000 series in 2013. It is also something I personally wanted to do for a long time, so that is why it is being done now.
Again, I just wanted to say thank you to Jeff for the collaboration. I am secretly jealous of the Turing Pi 2 platform since I never managed to buy one. This was a very fun project to do with him, and without his nudging, it would have been delayed even more.