Intel AMX (Advanced Matrix Extensions) with Sapphire Rapids
While we cannot share the full lscpu output, we can say that the test system exposes three primary Intel AMX flags: amx_bf16, amx_tile, and amx_int8. We expect these additional matrix math instructions, which go beyond what 3rd Gen Xeon Scalable offers, to be a differentiator compared to what AMD will offer with Genoa. Still, we are in a bit of a tough spot since we cannot show AMD EPYC Genoa numbers either. Instead, we had to use an AMD EPYC 7763 machine for ResNet-50.
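For readers who want to check their own systems, these flags show up in the kernel's CPU feature list like any other ISA extension. Here is a minimal sketch (Linux only, purely illustrative) that looks for the same three flags in /proc/cpuinfo:

```python
# Minimal sketch: check whether the kernel reports the Intel AMX CPU flags (Linux only).
AMX_FLAGS = {"amx_bf16", "amx_tile", "amx_int8"}

with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            present = AMX_FLAGS & set(line.split())
            print("AMX flags found:", sorted(present) or "none")
            break
```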
As a quick refresher, Intel’s strategy with on-chip acceleration is not new. Over the past few years, Intel has been working on getting AI acceleration into frameworks like TensorFlow so that, instead of spending a lot of time on software enablement, using VNNI just works.
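As a hedged illustration of what "just works" looks like, assuming a recent stock TensorFlow build with oneDNN support: the oneDNN backend picks the best ISA it finds at runtime, and its verbose log shows which kernels were dispatched. This is not our benchmark harness, just a sketch of the knobs involved:

```python
import os

# Assumption: TensorFlow 2.9+ with oneDNN support built in.
# TF_ENABLE_ONEDNN_OPTS turns the oneDNN paths on (the default in newer x86 builds);
# ONEDNN_VERBOSE=1 makes oneDNN log each primitive along with the ISA it used.
os.environ.setdefault("TF_ENABLE_ONEDNN_OPTS", "1")
os.environ.setdefault("ONEDNN_VERBOSE", "1")

import tensorflow as tf  # import after setting the env vars so oneDNN sees them

a = tf.random.uniform((1024, 1024))
b = tf.random.uniform((1024, 1024))
tf.linalg.matmul(a, b)
# The verbose output names the ISA behind each primitive; the vnni/amx kernels
# are what show up on int8 and bf16 workloads on capable CPUs.
```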
To show the impact of AI inference acceleration using VNNI and why it matters for CPUs, we can look back to our Stop Leaving Performance on the Table with AWS EC2 M6i Instances piece from 10 months ago. The first chart looks at the impact of transitioning from a base case using FP32 without Intel’s AI Kit integration, to enabling the software optimizations, and then to switching to INT8. This is the impact of just using VNNI in low batch size inference.
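The INT8 step is just quantization: weights and activations are mapped to 8-bit integers with a scale factor, and VNNI then handles the int8 dot products natively. Here is a minimal numpy sketch of symmetric per-tensor INT8 quantization, purely to illustrate the idea rather than the AI Kit's actual calibration flow:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: x is approximated by scale * q."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - q.astype(np.float32) * scale).max()
print(f"scale={scale:.5f}  max abs quantization error={err:.5f}")
```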
Imagine an application where you are doing various other work but then need to run one or a handful of AI inference tasks. This does not remove the need for dedicated AI inference accelerators entirely, but it covers lower-end inference loads.
Aside from the latency benefit of not having to go out over the PCIe bus to an accelerator, we can also get more throughput with VNNI. Here is an example using BS=16.
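For context on how we think about the two metrics, here is a minimal sketch of a latency (BS=1) versus throughput (BS=16) measurement with a stock Keras ResNet-50. This is not our exact harness, and weights=None is used only to skip the download since we just care about timing the compute:

```python
import time
import numpy as np
import tensorflow as tf

# Stock Keras ResNet-50; weights=None because we only time the compute path.
model = tf.keras.applications.ResNet50(weights=None)

def time_batch(batch_size: int, iters: int = 20) -> float:
    """Return average seconds per batch for the given batch size."""
    x = np.random.rand(batch_size, 224, 224, 3).astype(np.float32)
    model.predict(x, verbose=0)              # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        model.predict(x, verbose=0)
    return (time.perf_counter() - start) / iters

lat = time_batch(1)                           # latency-oriented, BS=1
thr = time_batch(16)                          # throughput-oriented, BS=16
print(f"BS=1:  {lat * 1000:.1f} ms/inference")
print(f"BS=16: {16 / thr:.1f} images/sec")
```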
Of course, AMD’s Zen 4 has VNNI, but we cannot share Genoa numbers yet. What we will say is that one can expect VNNI to be the baseline; AMX is the next step.
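One way to see that baseline-versus-next-step framing on a Sapphire Rapids box is oneDNN's ISA cap. Assuming a oneDNN-backed TensorFlow build (whether a given op actually lands on AMX depends on that build), the ONEDNN_MAX_CPU_ISA environment variable limits which instruction sets the library will dispatch to, so the same script can be run once without AMX and once with it allowed:

```python
# Sketch: run this script twice with the oneDNN ISA cap set differently, e.g.
#   ONEDNN_MAX_CPU_ISA=AVX512_CORE_BF16 python bench.py   (bf16, but no AMX)
#   ONEDNN_MAX_CPU_ISA=AVX512_CORE_AMX  python bench.py   (AMX allowed)
# The cap must be set before the framework initializes oneDNN.
import os
import time

print("ISA cap:", os.environ.get("ONEDNN_MAX_CPU_ISA", "default (highest available)"))

import tensorflow as tf

a = tf.cast(tf.random.uniform((4096, 4096)), tf.bfloat16)
b = tf.cast(tf.random.uniform((4096, 4096)), tf.bfloat16)
tf.linalg.matmul(a, b)                        # warm-up / kernel selection
start = time.perf_counter()
tf.linalg.matmul(a, b)
print(f"bf16 4096x4096 matmul: {time.perf_counter() - start:.3f} s")
```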
We have numbers for AMD EPYC Milan on this one, but it felt strange to include them: on Milan, which does not have VNNI, going from FP32 to INT8 gives a bit over 2x the performance in terms of throughput and latency, while INT8 with VNNI is more like a 3-4x gain. Genoa is coming out soon, after all. Instead, we are just going to show what we got with Sapphire Rapids so you can compare it to what we got with Ice Lake above.
On latency, AMX actually did better than we expected, but our base case and VNNI results looked a lot closer together than what we saw on the AWS M6i instances.
In terms of throughput, we get much better performance with AMX.
On the flip side, chasing maximum AMX throughput is probably not what everyone is after, since AMX runs on the cores, like AVX-512. That means one is using the CPU just for AI inference, and it is probably cheaper to do that on accelerators if bulk AI acceleration is what you are after. Still, Intel would argue that at least one does not need an accelerator, although Intel sells AI inference hardware as well.
To be clear, from my discussions with AMD, this is intentional. AMD’s strategy is to let Intel be first with features like VNNI and AMX. Intel does the heavy lifting on the software side, then AMD brings those features into its chips and takes advantage of the more mature software ecosystem. Frankly, it feels strange showing off the AMX feature when AMD will be shipping Genoa with VNNI before AMX-enabled parts are out.
Next, let us get to the big one: Intel QuickAssist Technology.