Building Our Office Storage for the NVIDIA GB10 Agent AI Cluster

Going Fast(er) with an All-Flash Local NAS for our AI Agent Machines

As a quick recap, a big reason to use this particular QNAP TS-h1290FX NAS was that it supports high-end U.2 SSDs, like the Solidigm D5-P5336. It also has an onboard NVIDIA ConnectX-6 dual-port 25GbE NIC. From a noise perspective, it is one of the few systems in this class, if not the only one, that offers this level of performance while being quiet enough to have in a studio. For any organization running lots of small AI agent machines in an office setting, there is a huge benefit to not having a big, loud system.

QNAP TS H1290FX Rear Fan 1

Inside, the system uses an AMD EPYC processor, and, just as exciting, it also has multiple PCIe Gen4 expansion card slots.

QNAP TS H1290FX Inside 9

We also reviewed the QNAP QXG-25G2SF-CX6. We are using this extra dual-port 25GbE card, with NVIDIA ConnectX-6 onboard, to give us a total of 4x 25GbE ports, and that is our base configuration. 100GbE of total network bandwidth is roughly a PCIe Gen5 x4 SSD worth of speed. Since we are using the 30.72TB Solidigm D5-P5336 SSDs, which are PCIe Gen4 x4 drives, the network gives us roughly twice the bandwidth of a single SSD.
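
As a rough sanity check on that math, here is the back-of-the-envelope arithmetic, using nominal interface rates and a roughly 7GB/s-class rated sequential read for the drive (assumed figures, not our measured numbers):

```bash
# Rough bandwidth math (nominal/assumed figures, not measurements)
awk 'BEGIN {
    net  = 4 * 25 / 8    # 4x 25GbE aggregate -> GB/s
    gen5 = 4 * 3.94      # PCIe Gen5 x4 usable per-lane rate -> GB/s
    ssd  = 7.0           # D5-P5336 class rated sequential read -> GB/s
    printf "Network aggregate (4x 25GbE): %.1f GB/s\n", net
    printf "PCIe Gen5 x4 SSD (usable):    %.1f GB/s\n", gen5
    printf "Single Gen4 SSD seq read:     %.1f GB/s (network ~%.1fx)\n", ssd, net / ssd
}'
```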

QNAP QXG 25G2SF CX6 Front Angled 1

What is more, using our high-end Keysight CyPerf setup, we found that we could push the cards to run at full 25GbE speeds even with a mix of real-world application traffic, not just generic packets.

QNAP QXG 25G2SF CX6 Keysight Cyperf Performance 25G 512 Users Education

Our original plan for this (and we should note it works) was to use a pair of low-power edge switches like the QNAP QSW-M7308R-4X. These have four 100GbE ports and eight 25GbE ports.

QNAP QSW M7308R 4X Front 1

Key to setting up performant AI cluster storage, even on smaller nodes, is to get RDMA and RoCEv2 working. For that, you need features like Priority Flow Control, or PFC.

QNAP QSW M7308R 4X QSS Pro System Port Management PFC
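
The screenshot above covers the switch side, but the hosts need matching settings. As a minimal sketch, assuming NVIDIA's mlnx_qos tool (from MLNX_OFED/DOCA) and a hypothetical interface name of enp1s0f0, enabling lossless priority 3, the common RoCE convention, looks something like this:

```bash
# Enable PFC on priority 3 only (common RoCE convention).
# Interface name enp1s0f0 is hypothetical; adjust for your system.
sudo mlnx_qos -i enp1s0f0 --pfc 0,0,0,1,0,0,0,0

# Trust DSCP markings so RoCE traffic lands on the right priority,
# then dump the current QoS state to verify.
sudo mlnx_qos -i enp1s0f0 --trust dscp
sudo mlnx_qos -i enp1s0f0
```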

ECN, or Explicit Congestion Notification, is another feature this particular switch has that we need for RoCEv2 networking.

QNAP QSW M7308R 4X QSS Pro QoS ECN
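
There is a host-side counterpart here as well. On mlx5-based NICs running MLNX_OFED, per-priority ECN knobs are exposed via sysfs; a minimal sketch, again using the hypothetical interface name enp1s0f0:

```bash
# Enable the ECN reaction point (rp) and notification point (np)
# for priority 3 on an mlx5 NIC. Paths are the MLNX_OFED sysfs layout;
# the interface name is hypothetical.
echo 1 | sudo tee /sys/class/net/enp1s0f0/ecn/roce_rp/enable/3
echo 1 | sudo tee /sys/class/net/enp1s0f0/ecn/roce_np/enable/3

# Confirm the setting took effect.
cat /sys/class/net/enp1s0f0/ecn/roce_np/enable/3
```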

This actually worked decently well. We had four 25GbE ports connected between the NAS and the switch, and the four 100GbE QSFP28 ports connecting the NVIDIA GB10 systems, like the Dell Pro Max with GB10 and the NVIDIA DGX Spark. The GB10 networking is really funky, as we went over in detail in The NVIDIA GB10 ConnectX-7 200GbE Networking is Really Different. We could use a second switch for 100GbE RDMA networking between the GB10s as an East-West GPU communication network, and then the original switch for the more North-South storage network. There is a lot of wisdom to setting this up as a single network, but splitting it allowed us to also hook up four additional systems with 25GbE adapters to the storage network, and use that as mostly a storage back-end network to hold models.

NVIDIA GB10 ConnectX 7 Ibdev2netdev
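
With the fabric configured, the standard perftest tools are an easy way to confirm RDMA is actually flowing at speed. A quick sketch, assuming the perftest package is installed and a hypothetical RDMA device name of mlx5_0 (use ibdev2netdev, as above, to find yours):

```bash
# On the server node: listen for an RDMA write bandwidth test.
# Device name mlx5_0 and the IP below are placeholders.
ib_write_bw -d mlx5_0 -R --report_gbits

# On the client node: run the test against the server's
# storage-network IP.
ib_write_bw -d mlx5_0 -R --report_gbits 192.168.50.10
```

If PFC and ECN are doing their jobs, this should report close to line rate without the bandwidth collapsing under congestion.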

To be clear, we were very happy with this: about $2000 in switches plus a few DACs and low-cost optics, and we got great performance accessing our 360TB pool of storage. Then we started to get the itch to do a bit more with our setup.
