Update 2019-11-06 3:50PM Pacific: Update to the Intel Xeon Platinum 9282 GROMACS Benchmarks Piece – Please Read.
Today something happened that many may not have seen. Intel published a set of benchmarks showing the advantage of a dual Intel Xeon Platinum 9282 system over the AMD EPYC 7742. Vendors present benchmarks from time to time to show that their products are good. There is one difference in this case: we checked Intel’s work and found that it presented a number to intentionally mislead would-be buyers as to the company’s relative performance versus AMD.
For years, even through the 2017 introduction of Skylake Xeon and Naples EPYC parts, Intel has been relatively good about presenting a balanced view on the server side. In late 2018, the company brought on a new team, allegedly to look over the performance benchmarks the company produced. That has culminated in the “Performance at Intel” Medium blog. This is described as:
“Intel’s blog to share timely and candid information specific to the performance of Intel technologies and products.” In only its seventh post, it has betrayed that motto.
Here is the post in question. HPC Leadership Where it Matters — Real-World Performance
Just to be clear, I know and personally like the Intel performance labs folks as well as the folks on their new performance strategy team. This is just a gaffe that needed to be pointed out since, in theory, Intel is doing more diligence now than it was in 2017, when it was doing a good job.
Misleading with Benchmarks and Footnotes
First, here is a chart Intel produced as part of the story to show that it has superior performance to AMD, and we are going to highlight one of the results, the GROMACS result:
We highlighted this result because it looked off to us. It seemed a bit strange that a 400W, 56-core part was 20% faster here.
We followed the footnote on configuration details, which eventually led us to the referenced #31 corresponding to this result. Here is the configuration for the test:
GROMACS 2019.3: Geomean (5 workloads: archer2_small, ion_channel_pme, lignocellulose_rf, water_pme, water_rf):
Intel® Xeon® Platinum 9282 processor: Intel® Compiler 2019u4, Intel® Math Kernel Library (Intel® MKL) 2019u4, Intel MPI 2019u4, AVX-512 build, BIOS: HT ON, Turbo OFF, SNC OFF, 2 threads per core;
AMD EPYC™ 7742: Intel® Compiler 2019u4, Intel® MKL 2019u4, Intel MPI 2019u4, AVX2 build, BIOS: SMT ON, Boost ON, NPS 4, 1 threads per core. (Source: Intel)
We split the paragraph on the source page into three lines and will discuss the first, followed by the last two.
Using a Zen 2 Disadvantaged GROMACS Version
The first line is damning. Intel used GROMACS 2019.3. To be fair, it used the same version on both systems, which makes it a valid test. GROMACS 2019.3 was released on June 14, 2019, just after the 2nd Gen Intel Xeon Scalable series. On October 2, 2019, the GROMACS team released GROMACS 2019.4. Keep in mind that this was over a month before Intel published its article.
The AMD Zen 2 architecture is now detected as different from Zen 1 and uses 256-bit wide AVX2 SIMD instructions (GMX_SIMD=AVX2_256) by default. Also the non-bonded kernel parameters have been tuned for Zen 2. This has a significant impact on performance. (Source: GROMACS Manual)
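To make the AVX2 versus AVX-512 build distinction a bit more concrete, here is a minimal Python sketch that maps the CPU feature flags Linux reports in /proc/cpuinfo to a GMX_SIMD value. The helper names suggest_gmx_simd and read_cpu_flags are our own and hypothetical. This is not GROMACS's detection logic, which also looks at the CPU vendor and model, and that model-specific step is what changed for Zen 2 in 2019.4.

```python
def suggest_gmx_simd(cpu_flags):
    """Pick a GMX_SIMD value from CPU feature flags (simplified, widest first).

    Hypothetical helper for illustration only. It ignores vendor and model,
    so it cannot reproduce GROMACS's per-architecture defaults.
    """
    flags = {flag.lower() for flag in cpu_flags}
    if "avx512f" in flags:
        return "AVX_512"    # e.g. the Xeon build in Intel's footnote
    if "avx2" in flags:
        return "AVX2_256"   # the Zen 2 default as of GROMACS 2019.4
    if "avx" in flags:
        return "AVX_256"
    if "sse4_1" in flags:
        return "SSE4.1"
    return "SSE2"


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the feature flags of the first CPU listed (x86 Linux only)."""
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()
```

On a Linux host, suggest_gmx_simd(read_cpu_flags()) gives a starting point. The value would then be passed to the GROMACS CMake configuration as -DGMX_SIMD=<value>.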
In the industry, it is or should have been well known that older versions of GROMACS were not properly supporting the new “Rome” EPYC architecture. We cited this specifically in our launch piece since we found the issue. We even specifically call it out on every EPYC 7002 v. Xeon chart we have produced for GROMACS since the results did not meet expectations:
That is just one of our test cases, which is considered a “small” case and is frankly too small for nodes of this size. Still, it was very easy to spot from the data that something was awry.
What Intel perhaps did not know, is that we also had one of the lead developers on GROMACS, a popular HPC tool, on our dual AMD EPYC 7002 system to address some of the very basic optimizations for the 2019.4 release. I believe there may be more coming, but this is one where we found the lack of optimization, and actually helped ever so slightly in getting it fixed.
Intel used a version of GROMACS released after the 2nd generation Intel Xeon Scalable launch but before the AMD EPYC 7002 series optimizations, even though the updated version had been out for over a month. That highly skews the numbers in favor of the Platinum 9282, which even then only has a 20% lead.
Again, technically this was a valid test by using the same version. On the other hand, Intel specifically used a version that was prior to the package getting any AMD Zen 2 optimizations.
Test Configuration Discussion
Moving to the test configuration lines for Intel and AMD, here are the lines in table form for easier comparison:
Intel used its compiler, MKL, and MPI for this test. In the 2017 era, Intel tested Xeon and EPYC with a variety of compilers and picked the best one. We are going to give the lab team that runs the tests the benefit of the doubt here that Intel’s compiler and MKL/MPI implementation yield the best results. Indeed, for Intel, it is better that AMD does well than Arm, since a customer staying on x86 is a much easier TAM for Intel to win back in 2021.
The AVX status we addressed in the section above. Using AVX2 on GROMACS 2019.3 would have disadvantaged the AMD EPYC parts.
On both CPUs we see that HT/SMT is on, so there are two hardware threads per core. That means 56 cores/112 threads on each Platinum 9282 and 64 cores/128 threads on each of the two AMD EPYC 7742 CPUs.
Then things change. Turbo was enabled on the EPYC 7742, but not on the Xeon Platinum 9282. In GROMACS, transitions in and out of AVX-512 code can lead to differences in boost clocks which can impact performance. We are just going to point out the delta here.
SNC is off for Intel but NPS=4 is set for AMD. Sub-NUMA clustering (SNC) allows the memory controllers on a die to be split into two domains. On a standard Xeon die, that means two NUMA nodes per CPU. Assuming it works the same on the dual-die Platinum 9282, SNC would give four NUMA nodes per package. With SNC off, as tested, each die is a single NUMA node, so there are two per package.
The AMD EPYC NPS setting is very similar, as it allows one to go from one NUMA node per socket to two or four. Here, Intel is running the AMD system with four NUMA nodes per socket, or eight total for the dual AMD EPYC 7742 system, versus only two NUMA nodes per socket, or four total, on the Platinum 9282 system. SNC/NPS usually increases memory bandwidth to cores by localizing memory access. What is slightly interesting here is how Intel characterizes GROMACS as being compute-bound versus memory-bound.
Finally, threads per core. On the Intel platform, it is 2. On AMD, it is 1. That means Intel is using 224 threads on 112 cores for the dual Xeon Platinum 9282 system, but only 128 threads on 128 cores for the dual EPYC 7742 system, which has 256 threads available. Translating the configuration wording into a table, this is what Intel did with the test configurations:
What we do not know is whether Intel needed to do this due to problem sizes. GROMACS can error out if you have too many threads, which is why we have an STH Small Case that will not run on many 4P systems and is struggling, as shown above, on even the dual EPYC 7742 system. Running one GROMACS thread on each two-thread core also requires very solid thread pinning; otherwise, performance can degrade quickly in this configuration.
Even assuming that Intel software tooling is superior, Intel changed the boost setting, added more NUMA nodes for AMD, and used fewer threads per core than with Intel. Perhaps that is how Intel got the best results using an older version of GROMACS, but that is a fair number of changes.
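The thread and NUMA arithmetic above can be checked with a short Python sketch. The NodeConfig class and its field names are our own, hypothetical constructs. The inputs come from Intel's footnote as we read it: HT/SMT on for both systems, two threads per core used on Intel and one on AMD, SNC off giving two NUMA nodes per Platinum 9282 package, and NPS 4 on EPYC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    """Hypothetical container for one benchmark node's topology settings."""
    name: str
    sockets: int
    cores_per_socket: int
    hw_threads_per_core: int    # 2 when HT/SMT is enabled in BIOS
    used_threads_per_core: int  # threads per core the benchmark actually ran
    numa_nodes_per_socket: int

    @property
    def cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def hw_threads(self) -> int:
        return self.cores * self.hw_threads_per_core

    @property
    def used_threads(self) -> int:
        return self.cores * self.used_threads_per_core

    @property
    def numa_nodes(self) -> int:
        return self.sockets * self.numa_nodes_per_socket

    @property
    def cores_per_numa_node(self) -> int:
        return self.cores // self.numa_nodes


# Values as read from Intel's configuration footnote (#31).
xeon = NodeConfig("2x Xeon Platinum 9282", 2, 56, 2, 2, 2)
epyc = NodeConfig("2x EPYC 7742", 2, 64, 2, 1, 4)

for node in (xeon, epyc):
    print(
        f"{node.name}: {node.used_threads}/{node.hw_threads} threads used, "
        f"{node.numa_nodes} NUMA nodes, "
        f"{node.cores_per_numa_node} cores per NUMA node"
    )
```

Under those assumptions, the Xeon system runs on all 224 of its 224 hardware threads across 4 NUMA nodes of 28 cores each, while the EPYC system runs on 128 of its 256 hardware threads across 8 NUMA nodes of 16 cores each.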
Interesting Test Configuration Points
One of the other, very interesting points here is that Intel tested on a Naples generation test platform.
Here is the Intel configuration:
Intel® Xeon® Platinum 9282 processor configuration: Intel “Walker Pass” S9200WKL platform with 2-socket Intel® Xeon® Platinum 9282 processors (2.6GHz, 56C), 24x16GB DDR4-2933, 1 SSD, BIOS: SE5C620.86B.2X.01.0053, Microcode: 0x5000029, Red Hat Enterprise Linux* 7.7, kernel 3.10.0-1062.1.1. (Source: Intel)
Here is the AMD EPYC configuration:
AMD EPYC™ 7742 processor configuration: Supermicro AS-2023-TR4 (HD11DSU-iN) with 2-socket AMD EPYC™ 7742 “Rome” processors (2.25GHz, 64C), 16x32GB DDR4-3200, 1 SSD, BIOS: 2.0 CPLD 02.B1.01, Microcode: 830101C, CentOS* Linux release 7.7.1908, kernel 3.10.0-1062.1.1.el7.crt1.x86_64. (Source: Intel)
Intel is using 16GB DIMMs versus 32GB DIMMs for EPYC. They have different numbers of memory channels so we can let that pass. One item was very interesting, the test server. Intel used its S9200WKL which we covered in Intel Xeon Platinum 9200 Formerly Cascade Lake-AP Launched.
What is more interesting is the AMD EPYC 7742 configuration. Here, Intel is using the Supermicro AS-2023-TR4, which it shows is built upon the Supermicro HD11DSU-iN. That is similar to the motherboard in the server we covered in our Supermicro AS-1123US-TR4 Server Review.
Most likely, it has to be a Revision 2.0 motherboard to support the EPYC 7002 generation and DDR4-3200 speeds. Being a H11 platform, it will only support PCIe Gen3, not Gen4. Again, this is an off-the-shelf configurable system that the socketed EPYC 7742 allows for versus the Intel-only Platinum 9200 solution. We explained why that is an extraordinarily important nuance in Why the Intel Xeon Platinum 9200 Series Lacks Mainstream Support.
The base cTDP of the EPYC 7742 is 225W, as Intel notes in its article. Technically, the Supermicro server using a Rev 2.0 board is capable of running the AMD EPYC 7742 at a cTDP of 240W. If someone were to compare, on a socket-to-socket basis, an EPYC 7742 to a 400W Platinum 9282, one may expect that pushing the cTDP up to 240W would be a common setting. Of note, in our testing, even with a cTDP of 240W, the EPYC 7742's power consumption stays much closer to its TDP than the Platinum 8280’s power consumption does under AVX-512. Extrapolating, there is a 100% chance that the EPYC 7742 is using considerably less power here, to the point that cTDP should have been set to 240W.
In the main article’s text, Intel states 225W for the part and cTDP was not mentioned in the #31 configuration details. We are also going to note that there is a 280W AMD EPYC 7H12 part available, but it is unlikely Intel had access to this for internal lab testing at this point (STH has not been able to get a pair either.) That would have at least been a somewhat better comparison.
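For a rough sense of scale, here is a minimal Python sketch that normalizes Intel's claimed 20% GROMACS lead by the TDP figures above. The function name perf_per_tdp_watt is our own and hypothetical. TDP is a design figure and not measured power, and the 240W line assumes no performance gain from the higher cTDP, so treat this as back-of-the-envelope arithmetic and not a measurement.

```python
def perf_per_tdp_watt(relative_perf, tdp_watts_per_socket, sockets=2):
    """Relative performance divided by total rated TDP (hypothetical helper)."""
    return relative_perf / (tdp_watts_per_socket * sockets)


# Intel's chart: dual Platinum 9282 about 1.20x the dual EPYC 7742 baseline.
xeon_9282 = perf_per_tdp_watt(1.20, 400)
epyc_225w = perf_per_tdp_watt(1.00, 225)
epyc_240w = perf_per_tdp_watt(1.00, 240)  # assumes no speedup at 240W cTDP

# EPYC advantage in performance per rated TDP watt.
ratio_225w = epyc_225w / xeon_9282
ratio_240w = epyc_240w / xeon_9282

print(f"EPYC 7742 @225W vs Platinum 9282: {ratio_225w:.2f}x per TDP watt")
print(f"EPYC 7742 @240W vs Platinum 9282: {ratio_240w:.2f}x per TDP watt")
```

Even taking Intel's number at face value, the EPYC 7742 comes out roughly 1.4x to 1.5x ahead per rated TDP watt in this simple model.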
Finally, Intel is using a 3.10.0-1062.1.1 CentOS/ RHEL kernel. Newer Linux kernels tend to perform better with the newer EPYC chips but it is valid that Intel is being consistent even if it potentially disadvantages AMD.
That was around 1800 words on a single benchmark that Intel presented. Should the text leave any doubt, personally I tend to give the benefit of the doubt to the folks in Intel’s performance labs since they did a fairly good job in the 2017 era. However, now those folks have a performance strategy team sitting above them that is publishing articles like this that have misses that a reasonably prudent performance arbiter should see.
One can only conclude that Intel’s “Performance at Intel” blog is not a reputable attempt to present factual information. It is simply a way for Intel to publish misinformation to the market in the hope that people do not do the diligence to see what is backing the claims. Once one does the diligence, things fall apart quickly.
The fact that Intel documented its procedures means that it had valid tests. It is just that the tests presented, and validated by the Performance at Intel team, were clearly conducted in a way to misinform a potential customer about the current state of performance. This GROMACS issue has been publicly known for almost four months and has not reflected the current state for over a month. AMD does the same things as part of marketing, e.g. AMD EPYC Rome NAMD and the Intel Xeon Response at Computex 2019, so it is, perhaps, par for the course. So perhaps the best course of action is to ignore these claims.
It was once told to me, “you only lose your reputation once in the valley.” With this, the “Performance at Intel” blog just had that moment. Perhaps in marketing, one gets multiple attempts.