Intel Xeon Max 9480 Performance
Presenting the performance of the Intel Xeon Max 9480 gave us fits because, at one point, we were trying to boil the ocean. The reason for that is what we saw in our previous section: there are six different configurations, and using the HBM Flat Mode options often requires playing with data placement to deal with the different memory pools.
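To make the Flat Mode data placement point concrete, here is a minimal sketch of how one might steer an application into the HBM pool with standard Linux NUMA tooling. In Flat Mode, the DDR5 and HBM2e pools typically show up as separate NUMA nodes; the node IDs and the application name below are illustrative assumptions, not a tested recipe for this platform.

```shell
# Inspect the NUMA layout first; in Flat Mode, the HBM2e pool appears
# as one or more additional memory-only NUMA nodes alongside DDR5.
numactl --hardware

# Bind an application's memory to an HBM node (node ID "2" is an
# assumption for illustration; check your own topology first):
numactl --membind=2 ./my_hpc_app

# Or prefer HBM but fall back to DDR5 if the HBM pool fills up:
numactl --preferred=2 ./my_hpc_app
```

This is exactly the kind of "playing with data placement" the six-configuration matrix implies: Cache Mode does it transparently, while Flat Mode leaves it to the administrator or the application.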
Six options may not sound bad, but realistically, this is an HPC-focused part. In the HPC realm, many codes do not benefit from, and can potentially run slower with, SMT or Intel Hyper-Threading. Each of the six cases in the matrix above has another dimension of Hyper-Threading ON or OFF, bringing our total to at least 12.
Here are Intel’s official numbers for the Xeon Max on a number of popular workloads. Intel’s performance team does a lot of tuning to get these numbers, so look to these if you want a more official performance expectation for a given workload.
Instead, we just wanted to show something cool beyond running STREAM and at least provide some thoughts.
First, let us point out the obvious. If your workload runs almost entirely in cache, then the HBM2e memory onboard is not a benefit. HBM2e packages use power, so clock speeds on the Xeon Max part are slightly lower than on the Intel Xeon Platinum 8480+. Of note, the AMD EPYC 9684X runs into a similar challenge: when the extra L3 cache does not matter, its lower clock speeds hurt performance.
Since we know folks are going to ask, here are the SPEC CPU2017 figures for the parts. Intel focused the Xeon Max on floating point performance, not integer performance, and that shows. Something like the SPECrate2017_int_base score is not going to be helped by HBM2e, but it is impacted by the slightly lower clock speeds.
The other takeaway here is that this is a great example of where a widely used benchmark in server RFPs will not even utilize the big performance boost of HBM2e memory.
We left Hyper-Threading on, ran a few different workloads, and found some really interesting results. For several of them, our results are directionally similar to what Intel saw, but since Intel does more tuning, we would use Intel’s numbers.
Running with HBM2e only, or adding DDR5 and running the chips in caching mode, certainly had an impact. There was more to this than one might think. Something that we have not heard many others talk about, but that we found fairly quickly, was the impact of problem size. A great example is our pricing analytics workload, which builds discounted deal pricing for a given deal, according to regional revenue recognition rules, based on any BOM that a data center OEM may use. We found that running the application on these chips really did not provide a huge difference. Then, we changed the test and ran four copies, one for each compute tile, and the results were much better.
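The "four copies, one per compute tile" approach can be sketched with standard Linux tooling. With sub-NUMA clustering enabled, each of the four compute tiles appears as its own NUMA node, so one copy can be pinned to each. The workload name and the assumption of nodes 0-3 are illustrative; check `numactl --hardware` on your own system first.

```shell
# Launch one copy per compute tile, pinning both CPUs and memory to
# that tile's NUMA node (node numbering is an assumption here):
for node in 0 1 2 3; do
  numactl --cpunodebind=$node --membind=$node ./pricing_workload &
done
wait   # wait for all four copies to finish
```

The design point is that each copy's working set stays local to one tile's slice of HBM2e, instead of a single large instance straddling all four tiles.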
As a quick aside, this is not uncommon these days. In our primary server CPU reviews, we had to split our Linux Kernel Compile benchmark into multiple concurrent copies per CPU in order to keep cores utilized.
We even ran our KVM virtualization testing on these parts and saw something similar. Despite the lower clock speeds, and despite this being an integer performance-dominated test, the Xeon Max actually did well when it hit a sweet spot of HBM2e usage in caching mode.
Of course, the elephant in the room is AMD EPYC. There are many workloads where having chips like the AMD EPYC 9684X with huge caches and 96 cores is very good. There are others where having HBM2e helps keep cores fed.
Perhaps the best way to think about this, especially at the same core counts, is that the working data set size, and perhaps more importantly the size of the hot portion of the data set, is an enormous determining factor here. The chart above includes a different OpenFOAM test case that, due to its problem size, is less sensitive to HBM and more sensitive to the Genoa-X cache, just to show how sensitive this stuff is. In most cases for that workload, HBM helps a lot, but we wanted to show some of the sensitivity. Note: we swapped this chart to show both cases to make this a bit clearer. It shows the problem set sensitivity better anyway.
The other side goes back to our original discussion in this section. There are effectively twelve primary configurations: HT ON/OFF, HBM2e-only / Cache Mode / Flat Mode, and sub-NUMA clustering off or on (SNC4). For AMD, one can turn SMT ON/OFF and split the CPU into quadrants as well, so the number of permutations for AMD is exciting too.
The bottom line is that adding HBM2e to a CPU is not always going to make it the fastest, but sometimes it can add an enormous amount of performance. Our ending recommendation here is simple: if you are in the market, go try the Xeon Max. That is not just for HPC customers, either. There are likely a number of non-HPC customers that could benefit from HBM2e onboard and have no idea these chips even exist, even though they are drop-in replacements for many existing Xeon sockets. A great example: Numenta recently discussed doing AI inference on the Xeon Max faster than on GPUs using its software and Intel AMX. Servers are becoming like cars or mattresses, where it pays to try before you buy.
Next, let us talk about power consumption.