AMD announced support for ROCm in conjunction with TensorFlow 1.8 (see this blog post). We applaud that AMD is pushing its TensorFlow support forward. At the same time, we cannot help but note that the gap between AMD and NVIDIA in experience and effort is widening. The latest announcement is that the company is maintaining a ROCm-enabled TensorFlow 1.8 stack. It is also releasing a Docker container with TensorFlow 1.8 and Python 2. Since many industry analysts will look superficially at this announcement and see it as closing the gap, we wanted to take a second and show that the gap is still enormous. AMD needs to do this work, but it also needs to speed up its efforts significantly.
AMD Announces a ROCm TensorFlow Docker Container
Containers are extremely popular in deep learning. One of the major problems they solve is keeping environments packaged and working. AMD announced a packaged TensorFlow with ROCm solution. Here is the AMD launch alias:
alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx -v /data/imagenet/tf:/imagenet'
For comparison, here is the NVIDIA packaged TensorFlow launch command:
nvidia-docker run -it --rm -v local_dir:container_dir nvcr.io/nvidia/tensorflow:<xx.xx>-py<x>
NVIDIA is on its second-generation nvidia-docker integration. That basic building block means that NVIDIA’s base implementation can be used for more than just TensorFlow. NVIDIA also supports a wide variety of frameworks and applications with maintained containers.
AMD Needs to Pick Up the Pace
Putting it frankly, AMD needs to pick up the pace of its software support for its GPUs in deep learning. The timing of this announcement, August 27, 2018, was a bit strange. TensorFlow 1.10 was released several weeks prior (see here). TensorFlow 1.8 was released April 27, 2018, or about four months earlier. TensorFlow 1.9 was released July 10, 2018. This industry moves so fast that a four-month lag time is enormous.
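To put numbers on that lag, here is a minimal Python sketch using only the dates quoted above. The `lag_days` helper and the `releases` table are our own illustrative names, and we assume the April 27 release date refers to 2018. TensorFlow 1.10 is left out because we only give "several weeks prior" for it rather than an exact date.

```python
from datetime import date

# Dates as quoted in this article (all assumed to be in 2018).
announcement = date(2018, 8, 27)  # AMD's ROCm TensorFlow 1.8 announcement

# Hypothetical lookup table of upstream TensorFlow release dates.
releases = {
    "1.8": date(2018, 4, 27),
    "1.9": date(2018, 7, 10),
}


def lag_days(released, announced=announcement):
    """Return the number of days between an upstream release and the announcement."""
    return (announced - released).days


for version, released in releases.items():
    print(f"TensorFlow {version}: released {lag_days(released)} days before the announcement")
```

By this count, TensorFlow 1.8 was 122 days old when AMD announced support for it, and TensorFlow 1.9 had already been out for 48 days.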
We know teams in the deep learning training community want AMD to compete. NVIDIA is charging huge premiums for its deep learning GPUs. AMD needs to pick up the pace if it wants to become competitive.
Working to Move Upstream
AMD said it is planning to upstream its work into TensorFlow. AMD is also committing to releasing future updates. Here is the excerpt from the blog post:
“In addition to supporting TensorFlow v1.8, we are working towards upstreaming all the ROCm-specific enhancements to the TensorFlow master repository. Some of these patches are already merged upstream, while several more are actively under review. While we work towards fully upstreaming our enhancements, we will be releasing and maintaining future ROCm-enabled TensorFlow versions, such as v1.10.”
In order to gain momentum, AMD needs more. AMD needs to state which versions of TensorFlow it will support (e.g. TensorFlow 1.9, which is not mentioned). AMD needs to give a timeline. Will it be four months behind in the future? Finally, and most importantly, AMD needs its work to be in mainline so data scientists can leverage investments in AMD GPUs immediately upon a new TensorFlow release.
I asked our Editor-in-Chief, and STH has AMD Vega cards in our lab, even working in conjunction with AMD EPYC. The general industry perception is that a healthy AMD is good for the industry and customers. AMD has some promising hardware, despite NVIDIA’s focus on specific Tensor Cores. As an industry, we need AMD’s software ecosystem to be ready to deploy.