4th Gen Intel Xeon Scalable Sapphire Rapids: Performance
We are in a transition period in how we benchmark CPUs. We still have some single-instance workloads, but realistically, fewer and fewer workloads run as a single instance across 60- or 96-core processors. HPC may be an exception, but in virtualization, a majority of VMs are 8 vCPUs or less. Containers and microservices have further increased diversity. Later in 2023, we will switch over to a more mixed environment than we even have today, as scaling single workloads is a challenge as they get more complex.
Intel is highlighting a lot of accelerated workloads in its marketing materials for Sapphire Rapids. We wanted to see what the performance looks like on our existing workloads since the majority of workloads that will transition to Sapphire Rapids in the next 24 months will not take full advantage of new instructions and accelerators.
Python Linux 4.4.2 Kernel Compile Benchmark
This is one of the most requested benchmarks for STH over the past few years. The task is simple: we take a standard configuration file and the Linux 4.4.2 kernel from kernel.org, then build the standard auto-generated configuration utilizing every thread in the system. We are expressing results in terms of compiles per hour to make the results easier to read.
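For readers curious about the unit conversion, a minimal sketch of how a timed build becomes a compiles-per-hour figure (the 120-second build time below is a hypothetical example, not one of our measured results):

```python
def compiles_per_hour(build_seconds: float) -> float:
    """Convert one timed kernel build into the compiles-per-hour figure we chart."""
    return 3600.0 / build_seconds

# Hypothetical example: a build that finishes in 120 seconds
print(compiles_per_hour(120))  # 30.0
```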
Here, the DDR5 transition, along with the newer core architectures, are helping Intel a lot. This is a workload where we were a bit surprised with how well the Intel parts fared. Something that is going to be a theme is that we have the dual 96-core AMD EPYC 9654 figures in our charts. This is a benchmark example of where we had to split the workload up above 64 cores into two instances because huge portions of the chip were sitting idle. That goes back to the need to run a mix of workloads on modern chips to really show what they can do.
c-ray 1.1 Performance
We have been using c-ray for our performance testing for years now. It is a ray tracing benchmark that is extremely popular to show differences in processors under multi-threaded workloads. Here are the 8K results:
Our c-ray 8K benchmark behaves much like a Cinebench on Windows. AMD typically performs very well here due to the way its cache is structured. The new Intel P-cores seem to be handling this slightly better.
7-zip Compression Performance
7-zip is a widely used compression/decompression program that works cross-platform. We started using it during our early days of Windows testing, and it is now part of Linux-Bench. We are using our legacy runs here to show scaling even without hitting accelerators.
One fun aspect is that we have some quad-CPU benchmarks in these charts. One of Intel’s main avenues for selling new Xeons is to its existing Xeon installed base. Even those with quad 3rd Generation Intel Xeon Scalable “Cooper Lake” systems can get 2:1 consolidation ratios. That is very powerful.
SPEC CPU2017 Results
First, we are going to show the most commonly used enterprise and cloud benchmark, SPEC CPU2017’s integer rate performance.
Here, Intel is trailing AMD by quite a bit, even in the 60-64 core range. Our estimated figures tend to be lower than server OEMs that post official numbers because OEMs have teams that spend more time optimizing. Our tested numbers are directionally in line with what we have seen previously. Published dual-socket numbers are 1790 for the EPYC 9654 and 991 for the Platinum 8490H. That has Intel at about 55-56% of AMD in officially published SPEC results on the integer side.
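As a sanity check on those published numbers, the ratio works out as follows (simple arithmetic, not an official SPEC computation):

```python
# Published SPECrate2017_int_base results for dual-socket systems
epyc_9654_2p = 1790
platinum_8490h_2p = 991

ratio = platinum_8490h_2p / epyc_9654_2p
print(f"Intel at {ratio:.1%} of AMD")  # Intel at 55.4% of AMD
```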
Here is the floating point chart:
Something really interesting is just how close the Platinum 8490H and Platinum 8480 are on both SPECrate2017_int_base and SPECrate2017_fp_base. On the floating point side, Intel can close the gap significantly between its 60-core part and AMD’s 64-core part in this generation. Again, we see more performance from 2x 56 = 112 cores of Platinum 8480 than we did with 4x 28 = 112 cores of Platinum 8380H “Cooper Lake” parts here, at a net lower TDP and system power.
While Intel is saying its customer workloads no longer mirror SPEC CPU2017, these are still important figures. In the enterprise and government space, this is a common RFP metric, so even as Intel pushes away from it, it remains widely used.
Just as a quick note, we had been running systems 24×7, but given how long the benchmark scripts take to run after warming up servers, this chart was not in the article when it was published. We had to update the article with these numbers several hours after the piece went live once the runs were complete.
STH nginx CDN Performance
On the nginx CDN test, we are using an old snapshot and access patterns from the STH website, with DRAM caching disabled, to show what the performance looks like fetching data from disks. This requires low latency nginx operation but an additional step of low-latency I/O access, which makes it interesting at a server level. Here is a quick look at the distribution:
Transitioning to a more real-world workload, this is one that hits home for us. We are using STH’s actual website data to see how new chips perform. With this test, we can see the impact of DDR5 and the newer architecture. Intel is moving far ahead of its previous generation.
MariaDB Pricing Analytics
This is a personally very interesting one for me. The origin of this test is a workload that runs deal management pricing analytics on a set of data anonymized from a major data center OEM. The application is effectively looking for pricing trends across product lines, regions, and channels to determine good-deal/bad-deal guidance based on market trends to inform real-time BOM configurations. If this seems very specific, the big difference between this and something deployed at a major vendor is the data we are using. This is the kind of application that has moved to AI inference methodologies, but it is a great real-world example of something a business may run in the cloud.
Here is a really interesting view. Intel is doing very well against AMD in this example, with one outlier: Milan-X. The enormous caches of the AMD EPYC 7773X really help that part in this test.
Something to keep in mind here is that in 2017 when Skylake came out and EPYC 7001 Naples also arrived, AMD was expecting to launch the EPYC 7003 series “Milan” against a Sapphire Rapids generation. That is one of the big reasons the parts look closer than one may expect.
STH STFB KVM Virtualization Testing
One of the other workloads we wanted to share is from one of our DemoEval customers. We have permission to publish the results, but the application being tested is closed source. This is a KVM virtualization-based workload where our client is testing how many VMs it can have online at a given time while completing work under the target SLA. Each VM is a self-contained worker. This is very akin to VMware VMmark in terms of what it is doing, just using KVM to be more general.
Many of the larger VM sizes are actually memory bound. Since we fill all memory channels for each platform, Intel is at a disadvantage on the larger VM sizes simply due to having 1TB of memory using 64GB DIMMs (8 channels, 2 processors, 64GB per channel), whereas AMD has 1.5TB in 12 channels.
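The capacity gap is straightforward arithmetic at one 64GB DIMM per channel:

```python
dimm_gb = 64
sockets = 2

intel_gb = 8 * sockets * dimm_gb    # 8 DDR5 channels per socket, 1 DIMM per channel
amd_gb = 12 * sockets * dimm_gb     # 12 DDR5 channels per socket, 1 DIMM per channel

print(intel_gb / 1024, amd_gb / 1024)  # 1.0 1.5 (TB)
```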
When we get to the smaller VM sizes, Intel actually performs quite well, to the point that we can consolidate sockets at a >2:1 clip over quad 3rd Generation “Cooper Lake” platforms even at the same core count with the Xeon Platinum 8480. That result alone is quite astonishing.
The gap between the Platinum 8380 and the Platinum 8480H is enormous in many of these, owing to 50% more cores, an updated microarchitecture, and the DDR5 transition.
We are leaving the 96-core EPYC 9654 on the chart because it is a modern CPU, but it is distorting the view of these chips. The Platinum 8480H with 60 cores is outpacing the 64-core EPYC 7773X and EPYC 7763 parts. Remember, Intel is focused more on per-core performance.
At this point, some are probably wondering what is going on with the Xeon Platinum 8490H and EPYC 9554 not just here, but as an overall theme. Here is the easy explanation:
- The EPYC 9554 has four more cores than the Platinum 8490H, a 6.7% core count advantage
- The EPYC 9554’s base clock is 3.1GHz. The Platinum 8490H is a 1.9GHz base CPU with an all-core turbo of 2.9GHz. AMD is running its cores at a substantially higher frequency. We validated this trend on three different OEM platforms for both Intel and AMD just to ensure it was not a platform-specific feature
- The EPYC 9554 has 10W more TDP headroom. We are using the default 360W TDP, but if a server supports it, the EPYC can go up to 400W for a ~3% performance increase over what we are showing here
- The EPYC 9554 has 12x DDR5-4800 memory channels versus 8x for the Platinum 8490H
- AMD still has 2x the L3 cache of Intel, but that is also helping increase its IPC a lot
The sum of all of those parts is that while the EPYC 9554 may seem like it is “only” a four-core upgrade over the Platinum 8490H, there is a lot more going on than a 6.7% core count increase.
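The 6.7% figure in the first bullet is simply:

```python
epyc_9554_cores = 64
platinum_8490h_cores = 60

# Relative core count advantage of the EPYC 9554
delta = epyc_9554_cores / platinum_8490h_cores - 1
print(f"{delta:.1%}")  # 6.7%
```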
PCIe Gen4 NVMe SSD Performance
In 2023, we expect most SSDs on new servers to be PCIe Gen4. While PCIe Gen5 is supported, most vendors are telling us that PCIe Gen4 will still be the dominant NVMe SSD interface. Since we had data from several other platforms and had tested this with Genoa, we wanted to test Sapphire Rapids as well.
Intel seems to have done a good job here. We are getting slightly better performance with our Kioxia CM6 NVMe SSD on the Intel platform. Intel spends a lot of time on its PCIe validation, so this should perhaps be expected. On the other hand, it was good to see that we did not lose a lot of performance here with the Gen5 lanes clocking down to Gen4.
That brings us to the next section, acceleration.
Wow … that’s a lot of caveats. Thanks for detailing the issues. Intel could really do with simplifying their SKU stack!
Not sure what to think about power consumption.
Phoronix has average power consumption reported by sensors that is ridiculously high, but here the peak power plug consumption is slightly less than Genoa.
Someone needs to test average plug power on comparable systems (e.g. comparable nvme back-end).
This is like BMW selling all cars with heated seats built into them and only enabling it if you pay extra.
Intel On Demand is a waste of engineering, of silicon, of everything, to please shareholders.
I’ve only made it to the second page but that SKU price list is straight up offensive. It feels like Intel is asking the customer to help offset the costs of their foundry’s missteps for the past four years.
The segmentation is equally out of control. Was hoping Gelsinger was going to rein it in after Ice Lake but I got my answer loud and clear.
New York Times: “Inside Intel’s Delays in Delivering a Crucial New Microprocessor
The company grappled with missteps for years while developing a microprocessor code-named Sapphire Rapids. It comes out on Tuesday.”
– NOT how you want to get free publicity for a new product!
I was so focused on Intel having fewer cores than AMD with only 60 that I forgot there’s still a big market for under-205W TDP CPUs. That’s a good callout STH
Intel did similar things when they lost ground versus RISC/AMD back in the day. Itanium, Pentium 4 (NetBurst), MMX, and SSE were the answers they used to stay relevant.
P4s overheated all the time (I think they have this solved today with better cooling, but power is still a heavy draw).
MMX and SSE were good accelerations, complicating compilers’ and developers’ lives, but they existed on every Intel CPU, so you had a guaranteed baseline for all Intel chips. Not like this mess of SKUs and lack of predictability. QAT has been around a while, and has lots of software support, but the fact it’s not in every CPU holds it back.
The one accelerator that doesn’t need special software is HBM yet they limit that to too few SKUs and the cost is high on those.
This is not a win for Intel…this is a mess.
I’ve just finished reading this after 90min.
THAT is someone who’s got a STRONG understanding of the market. Bravo.
Where’s the video tho?
There is something wrong with the pricing for these products.
Especially with the accelerators, there is a pricing problem going on:
- QAT can’t compete with DPUs; as you mentioned, those cost $300 more than a NIC
- AMX on $10k+ CPUs (with 56 or 60 cores) can’t compete with a $1500 GPU, while consuming much more power than a lightly loaded CPU plus the GPU
These sticker prices might not be end prices. High-core Genoa is also available now ~20% under MSRP from European retailers. I don’t really trust MSRP for this generation.
@Lasertoe – What we’re seeing here is the first step towards the death of the DPU. What is going to end it is Intel integrating networking fabrics on package, so you can dynamically allocate cores towards DPU tasks. That provides the flexibility, bandwidth, and latency that will make dedicated external cards quickly disappear.
Intel isn’t doing themselves a favor by having their on-die accelerators behind the On-Demand paywall.
I suspect you will earn lots of money if you could monetize your Intel SKU excel sheet 🙂
How on Earth can I pick the best CPU for my workloads?
Are there any tools that could identify which accelerators might be helpful for my workloads?
The whole concept of On Demand is kinda rotten.
I deploy the platform, I migrate the workloads, I realize that maybe some additional accel will be beneficial (how?), I purchase the extra feature (and it won’t be cheaper than if purchased from the get-go), and then I need to trigger a workload-wide software refresh to an acceleration-enabled version?
Hard to see that.
Sorry, if the accelerators are meant to be decision factors, they need to be widely adopted; they need to be a must, a no-brainer. And they need to have a guaranteed future.
I’m extremely confused how NONE of the “Max” SKUs are being offered with ANY of the onboard accelerators! (other than DSA, which seems like the least helpful accelerator by far.)
Is that a typo? The Max SKUs don’t even offer “on demand”?
I don’t think that will happen. I think Intel and AMD will both integrate DPU-like structures into their server CPUs.
Allocating cores “towards DPU tasks” is already possible when you have an abundance of cores like Genoa (and even more with Bergamo). The DPU advantage is that those (typically ARM) cores are more efficient, don’t need a lot of die area, and don’t share many resources with the CPU (like caches and DRAM).
I can see a future where efficient cores with smaller die area like Zen 4c or Atom (or even ARM/RISC-V) work alongside high-performance cores for DPU tasks, but they need independent L3 caches and maybe DRAM.
Well, have to admit, I didn’t think there would be anything below the $1,500 mark. Granted, there’s not much, but a few crumbs. Now to see if you can actually get those SKUs.
Not buying the power levels until I see some actual test results. Frankly, the lack of accelerators on so many of the high-end SKUs definitely raises a few doubts as well. Why leave the thing you’ve been hyping up all this time off so many SKUs, and does this mean there are 4-5 different chip lines being manufactured? I thought one of the main angles was that they could just make a single line, bin those to make the variations, and offer the unlocks to all the models?
Just waiting for all the “extras” to become a recurring subscription. You want the power efficiency mode turned on? That’s $9.99/hr/core.
“4th Gen Intel Xeon Scalable Sapphire Rapids Leaps Forward in Lead Times” Fixed the title for you 😉
Can anyone explain the difference between the Gold 5000 and Gold 6000 series? I can’t find any rhyme or reason to the distinction.
Adding to the confusion, the Gold 5415+ actually appears to be substantially worse than the Silver 4416+, and the Silver 4416+ costs $110 more. Why would a Silver processor cost more than a Gold processor and be better? There’s a pretty meaningless-looking distinction in base clocks, but given where the all-core turbo is at, I would bet that loading 8 cores on the 4416+ would yield clock speeds that aren’t far off from the all-core turbo clock speed of the 5415+… and then you still have another 12 cores you can choose to load up on the 4416+, with over 50% more cache!
The SKU matrix doesn’t seem very well considered. I also agree with Patrick’s comments on the confusing state of the accelerators; I think Intel should have enabled 1 of every accelerator on every single SKU, at a minimum. If they still wanted to do “On Demand”, that could allow users to unlock the additional accelerators of each type, but even having 1 would make a significant performance difference in workloads that can use them, and it would be an effective way to draw the customer into buying the licenses for additional accelerators once they are already using them.
Long. Superior article.
Intel should hire you to re-do its Xeon products.
Will be interesting to see the HEDT platform later, how it will perform compared to Raptor Lake, Ryzen, and of course Threadripper, and also if they have some new things outside of PCIe 5 and DDR5, or if they cripple it as they did with x266.
What an absolute mess. The naming has been awful since the whole “Scalable” marketing debacle but this is taking it to the next level. Was hoping they would sort it this generation. Sigh.
Patrick, any chance of testing a “fully populated” supermicro SYS-681E-TR? The mind boggles…
Accelerators have a chicken vs. egg adoption challenge. Intel hedged its bet with “on demand,” which makes adoption failure a self-fulfilling prophecy.
I don’t know if anyone noticed, but in the chart on page 12 where Intel basically denounces the SPEC benchmarks they put “Gaming” twice in the “Customer workloads” set in relation to the release of a Xeon line.
A lot of games require servers for multiplayer gaming, don’t they? Then of course you have cloud gaming, which is much smaller, I’d imagine.
It does seem odd that they selected two customers with gaming workloads when there aren’t so many total.
“On Demand” is bullshit. It’s nothing more than artificial scarcity, a.k.a the Comcast model. I would be very angry if I paid for all of those transistors and over half of them were locked behind an additional paywall.
Thanks for the nice article. Unfortunately, in general-purpose computing it seems Intel is still trying to catch AMD, and not successfully.
I’m using the geometric means of Phoronix’s benchmark set divided by the specified CPU TDP, i.e., benchmark number / TDP = X. This basically shows processing efficiency relative to the declared TDP. Higher number, better efficiency.
Intel 8280: 1.35
Intel 8380: 1.46 — looks like 14nm -> 10nm transition was moderately successful
Intel 8490H: 1.7 — again 10nm -> Intel 7; although that should be basically the same process, it looks like Intel did their homework and improved quite a lot.
AMD 9554: 2.3 — and this is from completely different league. TSMC simply rocks and AMD is not even using their most advanced process node.
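The commenter’s efficiency metric is easy to sketch; the geomean value below is hypothetical, chosen only to reproduce the quoted 2.3 figure for the 360W EPYC 9554:

```python
def perf_per_tdp(geomean_score: float, tdp_watts: float) -> float:
    """Efficiency as a benchmark geometric mean divided by the rated TDP."""
    return geomean_score / tdp_watts

# Hypothetical geomean of 828 on a 360W part reproduces the quoted 2.3
print(round(perf_per_tdp(828, 360), 1))  # 2.3
```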
Not sure if I get it right. It does seem like the 8490H and 8468H had all accelerators enabled, from the table you compiled.
I don’t find these particularly compelling vs. AMDs offerings. The SKU stack is of course super complicated, and the accelerator story doesn’t sound very compelling – also raises the question if one can even use these with virtualization. And I don’t think most software supports the accelerators out of the box with the possible exception of QAT. The on-demand subscription model also bears the risk that Intel might not renew your subscription at some point.
Those SPECint numbers are ******* BRUTAL for Intel. I’m sure that’s really why they’re saying it’s not a good benchmark. If it’d been reversed, Intel will say it’s the best.
I’d agree on the speccpu #’s.
I read this. It took like 2hrs. I couldn’t decide if you’re an Intel shill or being really critical of Intel. I then watched the video, and had the same indecision.
I’d say that means you did well presenting both sides. There was so much garbage out there at least there’s one place taking up the Anandtech mantle.
Amazing review. It’s by far the most balanced on the Internet on these. I’ll add my time, it took me about 1.25 hours over 3 days to get through. That isn’t fast, but it’s like someone sat and thought about the Xeon line and the market and provided context.
Thx for this.
I think Intel is on the wrong path.
They should be making lower powered CPU’s.
Their lowest-TDP CPU is 125W, and it’s a measly 8-core with a 1.9GHz max boost frequency – I think something is wrong in Intel’s development department.
A 1.9GHz boost frequency should not require a 125W TDP.
What a hack it was back then when Intel’s server market share was something like 97%
- what nonsense, it is still more than 90%
So how does that affect anything?
Data Center and AI (DCAI) $4.3 billion
Data Center $1.7 billion
That is maximum copium, Miikka.
Patrick’s SKU tables show the 8452Y as MCC, but that’s clearly impossible since it has 36 cores. It should be XCC (which would also match Intel’s table).
I didn’t try to check all the others. 🙂