A Quick Word on Power
As a quick note, these are the power ranges we typically use for lab planning, based on what we observed under load with just memory and a single SSD installed:
- Dual Intel Xeon Gold 6330: 700-745W
- Dual Intel Xeon Platinum 8352Y: 725-750W
- Dual Intel Xeon Platinum 8380: 900-950W
We may not have hit maximum power, but we could reach these ranges in different systems, and we have confirmed them on platforms from at least two vendors.
The key takeaway is that, technically, we would say a dual Xeon Platinum 8380 uses less power than an AMD EPYC 7763. However, at a similar general-purpose performance/core-count level, Intel uses (very) roughly 30-40% more power because it has to scale to additional nodes. AMD also has chips such as the EPYC 75F3, a high-TDP 32-core part.
We did not test the Platinum 8362, which is Intel's closest competitor.
A Challenging View of Performance
We are going to get to this later, but Intel offered an update comparing the AMD EPYC 7763 to the Xeon Platinum 8380. Effectively, it says that if AMD EPYC 7003 "Milan" parts find data in first- or second-level caches, AMD will be faster. If AMD cores find data in their local L3 caches, AMD is again faster, but if they have to go to remote dies, they are potentially slower.
Likewise, Intel says it has better memory latency.
We just wanted to call this out since our readers may see it on other sites after Intel distributed it. The challenge with this way of thinking is simple. In most AMD EPYC 7003 SKUs, any given core has 32MB of local L3 cache. There is additionally up to 256MB of L3 cache per socket, or 512MB across both sockets on full SKUs. That 256MB figure can go down, but on the performance SKUs AMD even offers 8-core parts with the full 256MB, where each core has a dedicated 32MB of L3 cache.
Saying a cache miss can be slower, when AMD has 8-core parts with 256MB of cache and 4MB per core or more across the line, is strange when Intel tops out at 60MB of L3 cache, and that figure scales down with core count. Intel's 8-core parts have 12MB of L3 cache, with the Gold 6334 at 18MB. Comparing latencies when a single AMD core can have roughly 2-3x the entire cache of Intel's 8-core chips seems odd.
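To make the cache-per-core gap concrete, here is a minimal sketch of the arithmetic using only the figures cited above. The "8-core performance SKU" label is generic rather than a specific part number, and the figures are the nominal totals from the text, not measurements.

```python
# Illustrative arithmetic only, using the L3 cache figures cited above.
# SKU -> (cores, total L3 cache in MB)
skus = {
    "AMD EPYC 8-core performance SKU": (8, 256),
    "AMD EPYC 7763": (64, 256),
    "Intel 8-core Xeon (typical)": (8, 12),
    "Intel Xeon Gold 6334 (8-core)": (8, 18),
    "Intel Xeon Platinum 8380 (40-core)": (40, 60),
}

for name, (cores, l3_mb) in skus.items():
    print(f"{name}: {l3_mb / cores:.1f} MB of L3 per core")

# One AMD core with a dedicated 32MB of L3 can hold more than the
# entire 12-18MB L3 of Intel's 8-core parts:
print(f"{32 / 12:.1f}x vs 12MB, {32 / 18:.1f}x vs 18MB")
```

Even AMD's 64-core 7763 ends up at 4MB of L3 per core, more than double the Platinum 8380's 1.5MB per core.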
That also brings us to the bigger challenge with this mental model. AMD can scale to 128 cores and 160 PCIe Gen4 lanes per system. If Intel-based systems need to scale out to make up these large deficits, then the latency incurred is not local within a box, but the latency of going to the network card, over a cable, to a switch, and then to another node, which is an order of magnitude slower. In Intel's example, cores 1-80 may follow the above, but cores 81-120 would add a hop to an external node, and cores 121-128 would likely add a hop to an external node plus a cross-socket hop.
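The scale-out penalty above can be sketched as a toy model. All latency constants here are assumed, illustrative round-trip figures we chose for the example, not measurements; only the rough orders of magnitude matter.

```python
# Toy model of the scale-out latency penalty: a hypothetical 128-core
# job spread across two dual-socket 40-core Xeon nodes.
# All constants are ASSUMED illustrative values, not measurements.
LOCAL_NS = 50          # assumed: access served within the local box
CROSS_SOCKET_NS = 150  # assumed: one extra cross-socket hop
NETWORK_NS = 5_000     # assumed: NIC -> cable -> switch -> remote node

def added_latency_ns(core: int) -> int:
    """Rough latency for a given core to reach shared data."""
    if core <= 80:    # cores 1-80: both sockets of the first node
        return LOCAL_NS
    if core <= 120:   # cores 81-120: add a network hop to node two
        return LOCAL_NS + NETWORK_NS
    # cores 121-128: network hop plus a cross-socket hop on node two
    return LOCAL_NS + NETWORK_NS + CROSS_SOCKET_NS

for core in (40, 100, 125):
    print(f"core {core}: ~{added_latency_ns(core)} ns")
```

With these assumed numbers, any core beyond the first box pays roughly 100x the in-box latency, which is why a cache-hierarchy microbenchmark misses the scale-out cost.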
With microbenchmarks, the model Intel offers can make some sense, but it is being compared against AMD chips designed to have a bigger radix, which is their true value.
Intel also offers that it has new instructions, which is fair.
We will note that the "Blockchain, Bitcoin" claim may be fair, but AMD EPYC CPUs are far superior for CPU-based miners, even those that currently utilize AVX-512 (and it is not close).
On HPC, cloud, and AI performance, we can see that Intel focuses on comparisons where it uses AVX-512, DL Boost, and crypto accelerators.
Again, if one is not changing code to utilize accelerators, then the story flips, so one must keep that in mind. One can see that CloudXPRT is being accelerated by AVX-512. That benchmark is created by Principled Technologies, which Intel funds. Workloads such as NGINX run (very) well on Arm processors that do not have AVX-512, and they are a key reason cloud providers are designing their own Arm chips or using those from Ampere.
Next, we are going to discuss the market impact, followed by our final words.