Gigabyte G242-P32 Topology and Running
Here is the system topology. One can see that a number of the x16 slots are actually only x8 electrical. Still, the Ampere Altra provides more PCIe connectivity than the Ice Lake generation Xeons in single socket configurations.
This topology looks more like an AMD EPYC server than an Intel Xeon server with a PCH. The CPU is the center of the whole system rather than offloading low-value functionality to a PCH.
Once the server is running, here is the lscpu output with the Arm Neoverse-N1 cores. Since we are using the Ampere Altra Max 128-core CPU, we get 128 cores. These are all physical cores, without the use of SMT/Hyper-Threading.
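For readers who want to check the same thing on their own system, here is a minimal Python sketch that parses lscpu-style "Key: value" text and reports whether SMT appears to be active. The function names (parse_lscpu, smt_enabled, read_lscpu) are hypothetical helpers for illustration, and the field names assume an English-locale lscpu from util-linux. The sample values in the usage note simply mirror what is described above and are not a copy of our actual output.

```python
import subprocess


def parse_lscpu(text):
    """Turn lscpu-style 'Key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no 'Key: value' shape
            fields[key.strip()] = value.strip()
    return fields


def smt_enabled(fields):
    """True if lscpu reports more than one hardware thread per core.

    Assumption: a missing 'Thread(s) per core' field is treated as 1.
    """
    return int(fields.get("Thread(s) per core", "1")) > 1


def read_lscpu():
    """Run lscpu on the local machine and return the parsed fields."""
    result = subprocess.run(
        ["lscpu"], capture_output=True, text=True, check=True
    )
    return parse_lscpu(result.stdout)
```

On a system like this one, we would expect parse_lscpu to report "CPU(s)" as 128 and smt_enabled to return False, since each Neoverse-N1 core exposes a single thread.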
Here is a screenshot of nvidia-smi with a single GPU.
As a quick note, we simply had to download and install drivers, similar to how we do for NVIDIA x86 systems, and features like nvidia-smi worked immediately. We also installed NVIDIA's Docker support and had NVIDIA NGC containers running quickly.
Gigabyte G242-P32 Performance
A big part of this is the question, “why Ampere?” The Ampere Altra and Altra Max are very interesting parts. In some tasks, software is well optimized and Ampere is able to take advantage of its high core counts. At the same time, the hyper-scale deployments at Oracle Cloud, Google Cloud, and Microsoft Azure have used the Altra, not the Altra Max, with lower core counts than we are using here. In general, the Ampere Altra line is designed around smaller cores that prioritize fairness for cloud VMs over absolute single-core performance.
NVIDIA came out with a number of benchmarks showing Arm v. x86 with the NVIDIA A100.
Part of the Ampere Altra CPU's value proposition is a focus on integer performance. After all, if one wants faster floating point performance for AI or HPC, one may as well offload to the GPU at that point.
We came pretty close to NVIDIA’s ResNet-50 numbers, so our results seem reasonable. NVIDIA is probably better at tuning its MLPerf submissions than we are, but our results are close even on the Arm platform.
The Ampere platform also uses roughly 80W less power than we would expect to see from an EPYC platform in this configuration.
Ubuntu installed without a hitch. We could compile and run benchmarks like STREAM quickly. There is a bit more tuning to do here, but decent results for sure.
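For context on what STREAM measures, here is a small Python sketch of its Triad kernel and the bandwidth arithmetic behind the reported figure. This is an illustration only, not the benchmark we ran. The real STREAM is compiled C or Fortran, and a pure-Python loop says nothing about the memory subsystem. The function names are hypothetical, and the sketch assumes 8-byte double-precision elements, which is the STREAM default.

```python
import time

BYTES_PER_ELEMENT = 8  # assumption: double-precision values, as in stock STREAM


def triad(b, c, q):
    """STREAM Triad kernel: a[i] = b[i] + q * c[i]."""
    return [bi + q * ci for bi, ci in zip(b, c)]


def triad_bandwidth_gbs(n_elements, seconds):
    """Bandwidth in GB/s using STREAM's Triad accounting.

    Each iteration reads b[i] and c[i] and writes a[i], so three
    elements of traffic are counted per iteration.
    """
    bytes_moved = 3 * BYTES_PER_ELEMENT * n_elements
    return bytes_moved / seconds / 1e9


def time_triad(n_elements, q=3.0):
    """Time one pure-Python Triad pass (illustrative, not representative)."""
    b = [1.0] * n_elements
    c = [2.0] * n_elements
    start = time.perf_counter()
    triad(b, c, q)
    elapsed = time.perf_counter() - start
    return triad_bandwidth_gbs(n_elements, elapsed)
```

As a worked example of the arithmetic, a Triad pass over 10 million elements that completes in 0.01 seconds moves 240 MB and would be reported as 24 GB/s.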
When I discuss the Ampere Altra Max CPUs, I often say they are somewhere between an AMD EPYC 7763 and 7773X. Here are a few examples of that. NGINX is a highly optimized workload. Better put, this is perhaps one of the best workloads for Arm, which is why we see it in every benchmark. This is running STH’s website access traces, and one can see that for our hosting, Ampere would be great.
In our MariaDB pricing analytics workload (deal desk analytics via a sanitized transaction list from a major enterprise OEM) we see that the AMD EPYC 7773X performs exceedingly well because of its larger caches. Usually this is a workload that the Arm part performs well in, but the AMD 3D V-Cache is a huge benefit. As a result, the Altra Max falls between the two 64-core AMD parts.
Since the real focus here is on GPU/PCIe performance, here is the Ampere Altra Max performance compared to other PCIe Gen4 platforms we have tested with a Kioxia CM6 PCIe Gen4 NVMe SSD. One can see better performance than we saw on the Huawei Ascend 910 with the Kunpeng 920 Arm CPUs.
The best benchmark we found was actually our c-ray 8K benchmark where simply having 128 cores allowed the Ampere Altra Max to be a big winner.
With all of that said, a pretty large portion of our traditional benchmarks are x86-only or x86-optimized. That is something we are going to fix in the next-generation benchmarks we will unveil with Genoa later this quarter. Still, these are not anemic CPUs by any means, and the single socket platform actually has some advantages over the dual socket Altra configuration.
Next, I wanted to discuss the market impact of this kind of system.