Today we are showing off a build that is perhaps the most sought-after deep learning configuration on the market. DeepLearning11 has 10x NVIDIA GeForce GTX 1080 Ti 11GB GPUs, Mellanox Infiniband, and fits in a compact 4.5U form factor. There is also an important difference between this system and DeepLearning10, our 8x GTX 1080 Ti build: DeepLearning11 is a single-root design, which has become popular in the deep learning space.
At STH, we are creating more deep learning reference builds not just as theoretical exercises; these machines are being used either by our team or by our clients. We have done a number of smaller builds, including DeepLearning01 and DeepLearning02, that we published. While those builds were focused on getting your feet wet with frameworks at an introductory level, DeepLearning11 is at a completely different end of the spectrum. We know this exact configuration is used by a top 10 worldwide hyper-scale / deep learning company.
Had we asked NVIDIA, we would probably have been told to buy Tesla or Quadro cards. NVIDIA specifically requests that server OEMs not use their GTX cards in servers. Of course, this simply means resellers install the cards before delivering them to customers. As an editorial review site, we do have tight budget constraints, so we bought 10x NVIDIA GTX 1080 Ti cards. Each NVIDIA GTX 1080 Ti has 11GB of memory (up from 8GB on the GTX 1080) and 3584 CUDA cores (up from 2560 on the GTX 1080). The difference in price for us to upgrade from GTX 1080s was around $1,500. We did purchase cards from multiple vendors.
Our system is the Supermicro SYS-4028GR-TR2, one of the mainstay high-density GPU systems on the market. The -TR2 suffix is significant as it is the single-root version of the chassis, different from DeepLearning10's dual-root -TR system.
As with the DeepLearning10 build, DeepLearning11 has a “hump” bringing the total system size up to 4.5U. You can read more about this trend in our Avert Your Eyes from the Server “Humping” Trend in the Data Center piece.
This hump allows us to fit NVIDIA GeForce GTX cards, with their top-facing power connectors, in our system.
We are using a Mellanox ConnectX-3 Pro VPI adapter that supports both 40GbE (our main lab network) and 56Gbps FDR Infiniband (the deep learning network). We had the card on hand, and FDR Infiniband with RDMA is a very popular choice for these machines; 1GbE / 10GbE networking simply cannot feed them fast enough. We have also installed an Intel Omni-Path switch, which will be our first 100Gbps fabric in the lab.
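To put rough numbers on why the interconnect matters, here is a hedged back-of-the-envelope sketch in Python. The model size, step rate, and all-reduce factor are illustrative assumptions, not measurements from this build:

```python
# Rough sketch of gradient traffic for multi-node data-parallel training.
# All figures are illustrative assumptions, not measurements.

params = 25_000_000        # assumed model size (ResNet-50 scale)
bytes_per_param = 4        # fp32 gradients
steps_per_sec = 10         # assumed global training step rate

# A ring all-reduce moves roughly 2x the gradient volume per step.
gbps_needed = 2 * params * bytes_per_param * steps_per_sec * 8 / 1e9
print(f"Gradient traffic: ~{gbps_needed:.0f} Gbps")  # ~16 Gbps

for name, link_gbps in [("1GbE", 1), ("10GbE", 10), ("FDR InfiniBand", 56)]:
    verdict = "headroom" if link_gbps > gbps_needed else "bottleneck"
    print(f"{name:>15}: {verdict}")
```

Under these assumptions, even 10GbE is saturated by gradient exchange alone, before any training data moves over the wire.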
In terms of CPU and RAM, we utilized 2x Intel Xeon E5-2628L V4 CPUs and 256GB of ECC DDR4 RAM. We will note that dual Intel Xeon E5-2650 V4 CPUs are common chips for these systems; they are the lowest-end mainstream processors that support the 9.6GT/s QPI speed. We are using the Intel Xeon E5-2628L V4 CPUs since the single-root design bestows another important benefit: no more inter-GPU QPI traffic. Although we have heard one can use a single CPU to power the system, we are still using two for more RAM capacity with our inexpensive 16GB RDIMMs. These systems can take up to 24x DDR4 LRDIMMs for massive memory capacity.
We are going to do a single-root piece soon, but for those deep learning practitioners using building blocks such as NVIDIA NCCL, a common PCIe root is important. That is also a reason many deep learning build-outs will not switch to designs with higher PCIe lane counts but higher latency and more constrained topologies, such as AMD EPYC with Infinity Fabric.
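If you want to verify the topology on your own system, `nvidia-smi topo -m` prints the inter-GPU link matrix. Here is a minimal sketch that flags cross-socket paths; on a proper single-root system the GPU rows should show PIX/PXB/PHB entries rather than SYS:

```python
# Sketch: check whether any GPU pair communicates across the inter-socket
# link (QPI on these Xeons) -- the hop a single-root design avoids.
import subprocess

topo = subprocess.run(["nvidia-smi", "topo", "-m"],
                      capture_output=True, text=True).stdout
print(topo)

for line in topo.splitlines():
    # Matrix rows start with GPU0, GPU1, ...; SYS marks a cross-socket path.
    if line.startswith("GPU") and "SYS" in line:
        print(f"Warning: {line.split()[0]} has a cross-socket (QPI) path")
```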
In terms of a cost breakdown, here is what this might look like if you were using Intel Xeon E5-2650 V4 chips:
The striking part here is that the total cost of about $16,500 has a payback period of under 90 days compared to AWS p2.16xlarge instance types. We will include hosting costs below to show how that compares on a TCO basis.
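As a hedged sketch of that payback math: the build cost is from our breakdown above, while the cloud rate is an assumption (on-demand 16-GPU instances were in the ballpark of $14-15/hour at the time); plug in current pricing before relying on it:

```python
# Payback sketch: days of 24/7 cloud rental that equal the build cost.
build_cost = 16_500      # USD, from the cost breakdown above
aws_hourly = 14.40       # assumed on-demand $/hr for a 16-GPU instance
hours_per_day = 24       # training boxes tend to run flat out

payback_days = build_cost / (aws_hourly * hours_per_day)
print(f"Payback vs. on-demand cloud: ~{payback_days:.0f} days")  # ~48 days
```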
Comparing the DeepLearning11 10x GPU example to DeepLearning10 with its 8x GPUs, you can see that the ~25% performance bump comes at relatively little additional expense in terms of overall system cost:
As one may imagine, adding more GPUs means that the overhead of the rest of the system is amortized over more GPUs. As a result, if your application scales well, get 10x GPUs per system.
DeepLearning11: Environmental Considerations
Our system has four PSUs, which are necessary for the 10x GPU configuration. To test this, we let the system run a giant (for us) model for a few days just to see how much power was being used. Here is what the power consumption of the 10x GPU server looks like, as measured at the PDU while running our Tensorflow GAN workload:
Around 2600W is certainly not bad. Depending on where the model was in training, we saw sustained power consumption as high as the 3.0-3.2kW range on this machine without touching the power limits on the GPUs.
You will notice there was an enormous 5278W peak during some password cracking; more on this in a future STH piece. The peak over a few weeks of using different problems and frameworks in the deep learning field was just under 4kW. Using 4kW as our baseline, we can calculate colocation costs for such a machine easily.
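For a hedged version of that calculation: the 4kW figure is our measured planning number, while the per-kW rate is an assumption chosen to match our actual spend below; your facility's pricing will differ:

```python
# Colocation cost sketch from the ~4kW planning figure above.
draw_kw = 4.0                 # planning figure from observed peaks
rate_per_kw_month = 250       # assumed all-in colo rate, USD/kW/month

monthly = draw_kw * rate_per_kw_month
print(f"Monthly colo cost:  ~${monthly:,.0f}")       # ~$1,000
print(f"12-month colo cost: ~${12 * monthly:,.0f}")  # ~$12,000
```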
As you can see, over 12 months the colocation costs start to dwarf the hardware costs. For these figures, we are using our actual data center lab colocation costs. If you want to build a similar cost model, we are happy to provide contact information for the provider we use so you can replicate the above. These are not theoretical numbers; we are actually spending around $1k/month to run this system in the data center.
Compare the above to DeepLearning10 with 8x GPUs and you can see the impact of adding ~500W of additional compute:
Adding additional GPUs adds operational costs in line with system costs compared to DeepLearning10. Moving into subsequent years, colocation costs will far exceed hardware costs.
DeepLearning11: Performance Impact
We wanted to show just how much performance we gained out of the box with this new system. There is a large difference between a $1,600 system and a $16,000+ system, so we would expect the performance impact to be similarly large. We took our sample Tensorflow Generative Adversarial Network (GAN) image training test case and ran it on single cards, then stepped up to the 10x GPU system. We expressed our results in terms of training cycles per day.
This is a great example of how adding $1,400 or so more to the purchase price of the system yields tangible results. Whereas a single NVIDIA GeForce GTX 1080 Ti allows us to train the model once every eight hours, a 10x GPU box lets us train on a greater-than-hourly cadence. If you want to make progress in a work day, a big box or a cluster of big boxes helps.
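To make that cadence math concrete, here is a small sketch. The eight-hour single-GPU figure is from our testing; the 85% scaling efficiency is an assumption to show that even imperfect scaling keeps the 10x box above one training run per hour:

```python
# Training cycles per day: one GTX 1080 Ti vs. the 10x GPU box.
single_gpu_hours = 8.0        # measured: one full training run on one card
gpus = 10
scaling_efficiency = 0.85     # assumed; the real value depends on the model

cycles_single = 24 / single_gpu_hours
cycles_10x = cycles_single * gpus * scaling_efficiency
print(f"1x GPU:  {cycles_single:.1f} training cycles/day")  # 3.0
print(f"10x GPU: {cycles_10x:.1f} training cycles/day")     # 25.5 (>1/hour)
```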
As one may imagine, DeepLearning10 and DeepLearning11 use a lot of power. Just those two servers alone are averaging over 5kW of consistent power draw, with spikes going much higher. That has major implications for hosting: the “hump” adding 0.5U is not significant in many racks because most colocation racks cannot deliver the 25kW+ of power and cooling needed to fill them with GPU servers. We often see these hosted at two GPU compute machines per 30A 208V rack, so placement and blanking panels become important.
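The rack math is straightforward electrical arithmetic; the per-server draw below is the rough sustained average we measured, so treat the result as a sketch rather than a provisioning rule:

```python
# Why ~2 of these servers fill a 30A 208V rack.
volts, amps = 208, 30
usable_kw = volts * amps * 0.8 / 1000   # ~5.0 kW after the usual 80% derating
per_server_kw = 2.5                     # rough sustained average from above

print(f"Continuous rack budget: {usable_kw:.2f} kW")          # 4.99 kW
print(f"Two servers:            {2 * per_server_kw:.2f} kW")  # 5.00 kW
# Two boxes essentially consume the entire feed, leaving no headroom for
# peaks -- hence careful placement and blanking panels in the rest of the rack.
```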
In the end, we wanted a significant single-root system in the lab, and we have that with DeepLearning11 and its 10x NVIDIA GTX 1080 Ti 11GB GPUs. Since we advocate scaling up GPU size first, then the number of GPUs per machine, then to multiple machines, DeepLearning11 is both a great top-end single machine and, by design, a platform for scaling out to multiple machines. There are features such as GPUDirect RDMA that are great on this platform, assuming your software and hardware stack can support them. We are practically limited by budget, so we bought the best cards we could afford: the GTX 1080 Ti.