AMD has hit the reset button on ROCm and quietly launched v5.0 this week. I could not be more excited. With it, they have added support for the MI200 server accelerator announced last November as well as enhanced support for RDNA2. It also seems to be doing a better job of providing a more robust experience. There is even a timely mention of FPGAs.
AMD Hits the ROCm Reset Switch with 5.0 – Get Excited
For those who do not know ROCm (just ROCm now, not an acronym), this is AMD’s accelerated computing framework. You are most likely to encounter it when using AMD GPUs for compute, and it is in many ways AMD’s answer to something like NVIDIA CUDA. To put it bluntly, NVIDIA has a head start in the space with CUDA, and AMD is catching up.
The MI25 is deprecated and will continue to be supported under the 4.5 branch until November 2022. This is important for several reasons. First, the MI25 has a small deployment base, and it either does not support the hardware features needed for continued development or would do so at a great performance penalty. Second, AMD is not dropping support for it in the driver; it is simply moving forward as NVIDIA does when new hardware brings a new CUDA version.
The new ROCm installer supports multi-version installs, uninstalls, as well as updates.
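As a rough illustration of what multi-version installs mean in practice, here is a minimal Python sketch that lists the ROCm versions present on a machine. It assumes each version lives in its own directory named `rocm-<major>.<minor>.<patch>` under `/opt`; that layout and the helper name `installed_rocm_versions` are assumptions for illustration, not something taken from AMD's installer documentation.

```python
import re
from pathlib import Path

# Assumption: versioned installs sit in directories named rocm-X.Y.Z.
_VERSION_DIR = re.compile(r"^rocm-(\d+)\.(\d+)\.(\d+)$")


def installed_rocm_versions(root="/opt"):
    """Return sorted (major, minor, patch) tuples for rocm-X.Y.Z directories under root."""
    base = Path(root)
    if not base.is_dir():
        return []
    versions = []
    for entry in base.iterdir():
        match = _VERSION_DIR.match(entry.name)
        # Skip plain files and anything that does not look like a versioned directory.
        if match and entry.is_dir():
            versions.append(tuple(int(part) for part in match.groups()))
    return sorted(versions)


if __name__ == "__main__":
    for version in installed_rocm_versions():
        print("rocm-" + ".".join(str(part) for part in version))
```

If the installer places versions somewhere else on your system, pass that directory as `root`.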
With this release, AMD is in a much better place than it was just six months ago, but challenges persist. The most notable is its relatively fragile software recipe, which comes down to communication with OS partners, bug tracking, tester hardware resources, and documentation fragmentation.
To address its documentation issues, AMD has launched a new documentation hub: https://docs.amd.com/. The hub covers ROCm V4.5 and 5.0. For older versions, https://rocmdocs.amd.com/ remains active. While this increases the fragmentation, it is also a fresh start and a renewed focus on accurate documentation.
Each release of ROCm has a supported kernel list, and I have encountered a broken recipe less than a week after previous ROCm releases. Even now, V5.0 lists support for Ubuntu 20.04.3 with the 5.11 HWE kernel, even though Ubuntu has just rolled to 5.13. In the past, this has broken things; today it has not. I did a quick install to test 5.13 HWE as well as RDNA2 support and was pleasantly surprised that I did not have to drop back to the 5.8 kernel, which would be the default recommendation in this situation. HWE kernels are moving targets, so this is a big improvement. It shows growth for AMD in the stability of its recipes, but communication with Canonical is an area AMD still needs to develop.
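Since the kernel list is the part that tends to drift, it can be worth checking the running kernel against it before installing. The following is a minimal Python sketch of that check. The `SUPPORTED_KERNELS` set is a placeholder containing only the 5.11 kernel mentioned above, and the function names are hypothetical; the real list should come from AMD's release notes for the ROCm version in question.

```python
import platform
import re

# Assumption: illustrative placeholder, not AMD's full supported kernel list.
SUPPORTED_KERNELS = {(5, 11)}


def kernel_major_minor(release):
    """Extract (major, minor) from a release string such as '5.13.0-28-generic'."""
    match = re.match(r"^(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release string: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_listed_kernel(release, supported=SUPPORTED_KERNELS):
    """Return True if the kernel's major.minor appears in the supported set."""
    return kernel_major_minor(release) in supported


if __name__ == "__main__":
    running = platform.release()
    try:
        status = "listed" if is_listed_kernel(running) else "not listed"
    except ValueError:
        status = "could not be parsed"
    print(f"kernel {running}: {status}")
```

A "not listed" result does not mean the install will fail, as the 5.13 HWE test above shows. It only flags that you are off the documented recipe.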
Add another reason to be excited. AMD dropped this easter egg in the documentation.
With the upcoming acquisition of Xilinx comes FPGAs and they appear to be driving to support them under the same ROCm framework as their GPU-based accelerators. Truly exciting.
Evolving ROCm is part of AMD’s charge when it comes to getting its accelerators adopted in the data center. With two Exascale systems coming from AMD and using its GPUs, AMD needs to keep investing in a software platform that scales.
In a few days, the AMD-Xilinx acquisition will close after a statutory waiting period. At that point, AMD will have CPU, GPU, and FPGA compute resources, along with a number of acceleration technologies. NVIDIA has its CUDA base, and Intel has its oneAPI. The big question is how AMD will integrate ROCm and Xilinx Vitis going forward.