NVIDIA Tesla T4 AI Inferencing GPU Benchmarks and Review


NVIDIA Tesla T4 Deep Learning Benchmarks

As we continue to innovate on our review format, we are now adding deep learning benchmarks. In future reviews, we will add more results to this data set.

ResNet-50 Inferencing Using Tensor Cores

ImageNet is an image classification database launched in 2007 designed for use in visual object recognition research. Organized by the WordNet hierarchy, hundreds of image examples represent each node (or category of specific nouns).

For our inferencing benchmarks, we run a ResNet-50 model trained in Caffe using the following command line.

nvidia-docker run --shm-size=1g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm -v ~/Downloads/models/:/models -w /opt/tensorrt/bin nvcr.io/nvidia/tensorrt:18.11-py3 giexec --deploy=/models/ResNet-50-deploy.prototxt --model=/models/ResNet-50-model.caffemodel --output=prob --batch=16 --iterations=500 --fp16

Options are:
--deploy: Path to the Caffe deploy (.prototxt) file that describes the network
--model: Path to the model (.caffemodel)
--output: Output blob name
--batch: Batch size to use for inferencing
--iterations: The number of iterations to run
--int8: Use INT8 precision
--fp16: Use FP16 precision (for Volta or Turing GPUs); with neither precision flag, FP32 is used

We can change the batch size to 16, 32, 64, 128 and precision to INT8, FP16, and FP32.
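The batch size and precision sweep described above can be sketched as a small generator of giexec argument lists. This is only an illustration: the flags and model paths are copied from the command line shown earlier, while the helper names `giexec_args` and `sweep` are made up for this example and are not part of TensorRT.

```python
from itertools import product

# Batch sizes and precisions taken from the text above.
BATCH_SIZES = [16, 32, 64, 128]
# Per the option list, passing neither --int8 nor --fp16 means FP32.
PRECISION_FLAGS = {"int8": ["--int8"], "fp16": ["--fp16"], "fp32": []}


def giexec_args(batch, precision, iterations=500):
    """Return the giexec argument list for one batch size / precision pair."""
    return [
        "giexec",
        "--deploy=/models/ResNet-50-deploy.prototxt",
        "--model=/models/ResNet-50-model.caffemodel",
        "--output=prob",
        f"--batch={batch}",
        f"--iterations={iterations}",
        *PRECISION_FLAGS[precision],
    ]


def sweep():
    """Yield (precision, batch, args) for every combination in the sweep."""
    for precision, batch in product(PRECISION_FLAGS, BATCH_SIZES):
        yield precision, batch, giexec_args(batch, precision)


if __name__ == "__main__":
    for precision, batch, args in sweep():
        print(precision, batch, " ".join(args))
```

Each argument list would still need to be appended to the nvidia-docker invocation shown above before it can run.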

Results are reported as inference latency in seconds. Dividing the batch size by the latency gives the throughput (images/sec), which is what we plot on our charts.
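As a quick sketch of that conversion, assuming the latency is the time in seconds to process one batch (the function name and the sample numbers are illustrative, not measured results):

```python
def throughput_images_per_sec(batch_size, latency_seconds):
    """Convert per-batch latency (seconds) into throughput (images/sec)."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return batch_size / latency_seconds


if __name__ == "__main__":
    # Made-up example: a batch of 64 images that takes 0.03125 s per batch.
    print(throughput_images_per_sec(64, 0.03125))  # 2048.0
```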

We also found that this benchmark does not use two GPUs; it only runs on a single GPU. You can, however, run a separate instance on each GPU using commands like:
```
NV_GPU=0 nvidia-docker run ... &
NV_GPU=1 nvidia-docker run ... &
```

With these commands, a user can scale workloads across many GPUs. Our graphs show combined totals.
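A minimal sketch of how per-GPU instances and the combined totals might be handled from a script. It assumes the GPU is selected through an environment variable, with `NV_GPU` (the name the nvidia-docker 1.x wrapper reads) as the default; the variable name is a parameter, so adjust it to match your setup. The helper names are hypothetical.

```python
import os


def per_gpu_env(gpu_id, var_name="NV_GPU", base_env=None):
    """Return a copy of the environment that pins one instance to one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env[var_name] = str(gpu_id)
    return env


def combined_throughput(per_gpu_results):
    """Sum the images/sec reported by each single-GPU instance."""
    return sum(per_gpu_results)


if __name__ == "__main__":
    # Each env could be passed to subprocess.Popen(cmd, env=...) to launch
    # one benchmark instance per GPU; nothing is launched here.
    for gpu_id in (0, 1):
        print(per_gpu_env(gpu_id, base_env={}))
    # Made-up per-GPU numbers, only to show the summation.
    print(combined_throughput([1000.0, 1000.0]))
```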

We start with Turing’s new INT8 mode, which is one of the benefits of using the NVIDIA RTX cards.

NVIDIA Tesla T4 ResNet 50 Inferencing Int8

Here we did not get down to INT4, but INT8 is becoming very popular. INT8 precision is generally faster for inferencing than floating point. Significant research shows that in many situations INT8 is accurate enough for inferencing, which makes it a lower-compute choice for the workload.

We will discuss the inferencing results after showing the FP16 and FP32 numbers, so let us look at those next.

NVIDIA Tesla T4 ResNet 50 Inferencing FP16
NVIDIA Tesla T4 ResNet 50 Inferencing FP32

These results were somewhat shocking at first, but then seemed logical. In all three tests, the NVIDIA Tesla T4 performs below the GeForce RTX 2060 Super. Extrapolating, two Tesla T4s land at about the performance of a GeForce RTX 2070 Super or GeForce RTX 2080 Super.

If we look at execution resources and clock speeds, frankly this makes a lot of sense. The Tesla T4 has more memory, but fewer GPU compute resources than the modern GeForce RTX 2060 Super.

On the other hand, this is NVIDIA’s premier AI inferencing card, and it costs around $2000-$2500 in many servers. That is a fairly steep premium for something that performs at the level of a $349-$399 gaming GPU in its targeted application domain.
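To put a rough number on that premium, the price ranges quoted above can be turned into a ratio. This is plain arithmetic on the quoted prices, not a measured price/performance figure, and the function name is made up for the example.

```python
def price_ratio_range(card_low, card_high, ref_low, ref_high):
    """Return the (smallest, largest) price multiple of a card over a reference."""
    return card_low / ref_high, card_high / ref_low


# Price ranges quoted in the text: Tesla T4 vs. the comparable gaming GPU.
low, high = price_ratio_range(2000, 2500, 349, 399)
print(f"{low:.1f}x to {high:.1f}x")  # 5.0x to 7.2x
```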

ResNet-50 Training Using Tensor Cores and TensorFlow

We also wanted to train the venerable ResNet-50 using TensorFlow. During training, the neural network learns features of images (e.g. objects, animals, etc.) and determines which features are important. Periodically (every 1000 iterations), the network tests itself against the test set to measure its loss, which indicates how accurately it is learning. Accuracy can be increased through repetition (running a higher number of epochs).

The command line we will use is:

nvidia-docker run --shm-size=1g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v ~/Downloads/imagenet12tf:/imagenet --rm -w /workspace/nvidia-examples/cnn/ nvcr.io/nvidia/tensorflow:18.11-py3 python resnet.py --data_dir=/imagenet --layers=50 --batch_size=128 --iter_unit=batch --num_iter=500 --display_every=20 --precision=fp16

Parameters for resnet.py:
--layers: The number of neural network layers to use, e.g. 50.
--batch_size or -b: The number of ImageNet sample images to use for training the network per iteration. Increasing the batch size will typically increase training performance.
--iter_unit or -u: Specify whether to run batches or epochs.
--num_iter or -i: The number of batches or epochs to run, e.g. 500.
--display_every: How frequently training performance will be displayed, e.g. every 20 batches.
--precision: Specify FP32 or FP16 precision; FP16 also enables Tensor Core math on Volta and Turing GPUs.

While this TensorFlow script cannot select individual GPUs itself, you can choose them by setting export CUDA_VISIBLE_DEVICES= to a comma-separated list of GPU IDs (e.g. 0,1,2,3) within the Docker container workspace.
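The same restriction can be applied from Python, as long as the variable is set before TensorFlow initializes CUDA. A minimal sketch, with hypothetical helper names:

```python
import os


def cuda_visible_devices(gpu_ids):
    """Format GPU indices as the comma-separated value CUDA expects."""
    return ",".join(str(gpu_id) for gpu_id in gpu_ids)


def restrict_gpus(gpu_ids, env=None):
    """Set CUDA_VISIBLE_DEVICES in the given mapping (default: os.environ)."""
    env = os.environ if env is None else env
    env["CUDA_VISIBLE_DEVICES"] = cuda_visible_devices(gpu_ids)
    return env


if __name__ == "__main__":
    # Demonstrated on a scratch dict so the real environment is untouched.
    print(restrict_gpus([0, 1, 2, 3], env={}))
```

Calling `restrict_gpus([0, 1])` with no `env` argument modifies the current process, so it only has an effect if it runs before the framework is imported and touches the GPUs.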

We will run batch sizes of 16, 32, 64, 128 and change from FP16 to FP32. Our graphs show combined totals.

Some GPUs, like the new Super cards as well as the GeForce RTX 2060, RTX 2070, RTX 2080, and RTX 2080 Ti, will not show higher batch size runs because of limited memory.

NVIDIA Tesla T4 ResNet 50 Training FP16

Using two NVIDIA Tesla T4s in the same space as one full-sized GPU, we find that the pair comes near the NVIDIA RTX 2080 Ti results at lower power. This is a good result.

Moving to FP32:

NVIDIA Tesla T4 ResNet 50 Training FP32

One can see that with the 16GB of onboard memory, the NVIDIA Tesla T4 can train using a batch size of 128 here, and gets a performance boost from that. At the same time, it is only giving a 5-6% benefit and performance is unable to match our GeForce RTX 2060 results.

Deep Learning Training Using OpenSeq2Seq (GNMT)

While ResNet-50 is a Convolutional Neural Network (CNN) that is typically used for image classification, Recurrent Neural Networks (RNN) such as Google Neural Machine Translation (GNMT) are used for applications such as real-time language translation.

The command line we use for OpenSeq2Seq (GNMT) is as follows.

nvidia-docker run -it --shm-size=1g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v ~/Downloads/OpenSeq2Seq/wmt16_de_en:/opt/tensorflow/nvidia-examples/OpenSeq2Seq/wmt16_de_en -w /workspace/nvidia-examples/OpenSeq2Seq/ nvcr.io/nvidia/tensorflow:18.11-py3

We then open en-de-gnmt-like-4GPUs.py and edit our variables.

vi example_configs/text2text/en-de/en-de-gnmt-like-4GPUs.py

First, edit data_root to point to the below path:
data_root = "/opt/tensorflow/nvidia-examples/OpenSeq2Seq/wmt16_de_en/"

Additionally, edit the num_gpus, max_steps, and batch_size_per_gpu parameters under base_params to set the number of GPUs, run a lower number of steps (e.g. 500) for benchmarking, and set the batch size:
base_params = {
"num_gpus": 1,
"max_steps": 500,
"batch_size_per_gpu": 128,

We also edit lines 44 and below as shown to enable FP16 precision:

#"dtype": tf.float32, # to enable mixed precision, comment this line and uncomment two below lines
"dtype": "mixed",
"loss_scaling": "Backoff",

We then run the benchmarks as follows.

python run.py --config_file example_configs/text2text/en-de/en-de-gnmt-like-4GPUs.py --mode train

The result is the average number of objects trained per second (reported as "Avg. Objects per second"), which we plot.

We should note that other GPUs we tested, such as the RTX 2060 (Super), RTX 2070 (Super), RTX 2080 (Super), and RTX 2080 Ti, could not complete this benchmark because they lack sufficient installed memory. To get this benchmark to finish on these GPUs, one might need to lower the batch size to smaller values like 32, 16, or 8. We tried this but had no luck. A batch size of 4 did run, but we decided that was not a very usable size.
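The batch size back-off described above could be automated along these lines. This is a sketch under stated assumptions: `run_benchmark` is a hypothetical callable that raises `MemoryError` when a batch size does not fit, whereas a real TensorFlow run reports out-of-memory through its own error type, and `fake_run` is only a stand-in for demonstration.

```python
def first_batch_that_fits(run_benchmark, batch_sizes=(128, 64, 32, 16, 8, 4)):
    """Try batch sizes from largest to smallest; return (batch, result) or None."""
    for batch in batch_sizes:
        try:
            return batch, run_benchmark(batch)
        except MemoryError:
            # Assumption: the hypothetical runner signals out-of-memory this way.
            continue
    return None


def fake_run(batch):
    """Stand-in runner: pretends only batch sizes of 4 or less fit in memory."""
    if batch > 4:
        raise MemoryError(f"batch size {batch} does not fit")
    return "ok"


if __name__ == "__main__":
    print(first_batch_that_fits(fake_run))
```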

As the NVIDIA Tesla T4 has 16GB of installed memory, it is the first GPU we have tested, aside from the single and dual NVIDIA Titan RTX configurations, to break into the OpenSeq2Seq (GNMT) benchmark graph. No other graphics card that we have tested could run this test.

NVIDIA Tesla T4 OpenSeq2Seq FP16 Mixed
NVIDIA Tesla T4 OpenSeq2Seq FP32

The NVIDIA Titan RTX is a dual-slot, longer, and higher-power card. On the other hand, it would take more than three NVIDIA Tesla T4s to equal the performance of that similarly priced GPU cousin.

Next, we are going to look at the NVIDIA Tesla T4 power and temperature tests and then give our final words.


