Advertisement


Home News SPEC Consortium Releases SPEC CPU 2026 Benchmark Suite: The Next Decade of...

SPEC Consortium Releases SPEC CPU 2026 Benchmark Suite: The Next Decade of CPU Benchmarking

4

What Is New in SPEC CPU 2026

General SPEC CPU information aside, what is new in SPEC CPU 2026? In short: a lot, but also not as much as you might think.

In the last 9 years, since SPEC CPU 2017 was released, computers have continued to scale up in performance and memory capacity. And in the server space, Intel’s x86 monopoly has been broken by AMD, hyperscalers, and others developing their own Arm-based chips. Even RISC-V has a college science project that has evolved into a complete ISA now extensive enough to build high-performance processors. So there has been a lot of change in which architectures are driving the world’s computers, never mind the evolution of those architectures.

From a high-level perspective, it is a period of very limited change. While SPEC CPU 2017 needed to address all the changes in computing hardware over the previous decade, mainly the end of Dennard Scaling and the resulting shift towards CPUs with more cores rather than just faster cores, the nine years between 2017 and 2026 have not seen any similar shift. As a result, while the consortium had previously needed to retarget large aspects of SPEC CPU to keep up with changes in CPU design, that has not been the case for SPEC CPU 2026. So while the benchmark suite has been modernized in multiple ways, it has not undergone the same kind of major changes that underpinned the release of SPEC CPU 2017.

For SPEC CPU 2026, the focus is on a broader set of benchmarks that reflect modern workloads in 2026, while also keeping up in terms of size and compatibility.

SPEC_CPU_2026_Whats_New
SPEC_CPU_2026_Whats_New

The 2026 edition of the benchmark suite has 52 benchmarks in all, 9 more than the 2017 suite. Of those, 38 are all new benchmarks. Only 14 benchmarks were kept from the 2017 suite, particularly evergreen software such as GCC, LLVM, and various data compression utilities, and even then, all of those benchmarks have been updated to both use their latest code and to use newer and deeper workloads.

With 52 benchmarks in all, there is more to cover here than there is time to cover them. Notably, Perl, x264, and Blender have all been removed from the 2026 suite. In its place are new benchmarks such as CPython, FLAC, and SQLite. There are also plenty of computational science workloads, as well as some new industry workloads, such as FPGA place and routing (VPR).

The total number of lines of code has more than doubled, going from around 7.1 million to about 16.7 million. Most of that code belongs to GCC, LLVM, and FemFlow, a finite-element fluid-dynamics simulation.

SPEC CPU 2026 Integer Benchmarks
SPECrate SPECspeed Languages Description
801.xz_s C++,C Data compression
706.stockfish_r C++ Game / AI (chess)
707.ntest_r 807.ntest_s C++ Game / AI (othello)
708.sqlite_r C SQL compiler/interpreter and database
710.omnetpp_r C++,C Discrete event modeling
714.cpython_r C Python interpreter
817.flac_s C++,C Lossless audio compression
721.gcc_r 821.gcc_s C++,C C language optimizing compiler
723.llvm_r 823.llvm_s C++,C C/C++ language optimizing compiler
727.cppcheck_r 827.cppcheck_s C++ Static analysis of C/C++ code
729.abc_r 829.abc_s C++,C Sequential logic synthesis and formal verification
734.vpr_r 834.vpr_s C++,C FPGA place and route
735.gem5_r 835.gem5_s C++,C Computer architecture simulation
838.diamond_s C++,C Bioinformatics – metagenomics and protein sequencing
846.minizinc_s C++,C Constraint programming
750.sealcrypto_r C++,C Homomorphically Encrypted (HE) query
753.ns3_r 853.ns3_s C++ Discrete event network simulator for internet systems
854.graph500_s C Graph analytics
777.zstd_r C Data compression/decompression
SPEC CPU 2026 Floating Point Benchmarks
SPECrate SPECspeed Languages Description
800.pot3d_s Fortran Solar physics: finite difference method, conjugate gradient solver
803.sph_exa_s C++ Astrophysics – Smoothed Particle Hydrodynamics (SPH)
709.cactus_r 809.cactus_s C++,C Astrophysics – relativity, finite difference method, time integration
811.tealeaf_s C High energy physics
816.nab_s C Molecular modeling
820.cloverleaf_s Fortran Explicit hydrodynamics
722.palm_r 822.palm_s Fortran Atmospheric science
731.astcenc_r C++ Image compression – Adaptive Scalable Texture Compression (ASTC)
736.ocio_r C++ Color management for visual effects and animation
737.gmsh_r C++,C Finite element mesh generation
748.flightdm_r C++ Flight dynamics models for aeronautics
749.fotonik3d_r 849.fotonik3d_s Fortran Computational Electromagnetics (CEM)
857.namd_s C++ Classical molecular dynamics simulation
765.roms_r 865.roms_s Fortran Regional ocean modeling
766.femflow_r C++ Fluid dynamics: high-order finite element method
767.nest_r 867.nest_s C++ Neuroscience simulator for spiking neural network models
772.marian_r 872.marian_s C++ Neural machine translation for written language
782.lbm_r C Computational fluid dynamics, Lattice Boltzmann Method
881.neutron_s C Physics simulation of neutron transport in nuclear reactors

As you might expect, the latest edition of the suite updates the benchmark suite to use much newer language standards as well. Whereas SPEC CPU 2017 was based around C99, C++03, and Fortran 2003, SPEC CPU 2026 benchmarks are based around C18, C++17, and Fortran 2018 – all of which are around 15 to 20 years newer in age. So the constituent benchmarks all have access to many newer language features, most notably C++ threading (std::thread) and Fortran concurrency (DO_CONCURRENT). The latter changes primarily impact the SPECspeed benchmarks, as SPECrate explicitly runs multiple copies of a single program rather than using multithreading within a program.

SPEC_CPU_2026_Languages
SPEC_CPU_2026_Languages

The hardware requirements have also increased somewhat, largely to keep pace with the increasing amount of RAM available within a system. SPECrate still requires 2GB of RAM per instance, which means the memory requirements for that suite of benchmarks increase rapidly with the number of CPU cores/threads in play. In practice, this means that a modern, high-end desktop CPU needs 64GB of RAM (enough to cover all 24 cores of Arrow Lake or all 32 SMT threads of Granite Ridge). Coincidentally, the memory requirements for SPECspeed have also jumped to 64GB, reflecting the workload sizes and the heavier use of multithreading there. Just as a note, we tried running it on a 128GB AMD Ryzen Threadripper 9980X system, and the run failed due to running out of memory.

Finally, it is interesting to note that the SPEC CPU group has yet again been able to maintain its penchant for selecting unusual architectures for its reference scores. For SPEC CPU 2026, the reference machine is a Lenovo ThinkSystem HR330A, which is based around a 3.0GHz Ampere eMAG 8180, a 32-core ARMv8 AArch64 processor from 2018 that uses the Skylark CPU core. This ends the long run of SPARC processors as the reference CPU, though it continues the trend of using something other than a widely adopted CPU core (e.g. Intel or AMD x86, Arm Cortex) as the reference.

And with the highlights of SPEC CPU 2026 out of the way, let us go ahead and take a look at the benchmark performance.

4 COMMENTS

  1. If I understand the “Scaling, fprate” chart correctly, the “ideal Nx” line is nonsense for the Intel and nVidia systems. The *actual* ideals for those systems is variously 8x, 16x, or 10x for their different core types. And by adding the aggregates from both P and E cores, you have a good number for total system performance, but an utterly useless number for measuring scaling for a particular core type, which is what a first glance at that chart suggests (incorrectly) it’s about. Which is too bad, because there is an interesting idea there.

    I mean, really, in what way is it useful to know that 8P+16E cores can achieve 6.68x the performance of a single E core? It would be much more interesting to know how 16 E cores measure up against 1 E core.

  2. Hey, justsomeguy – I had a similar thought. We have updated charts, but they did not make it into this before it went live because it was a bit of a reflection that led to the thought. Hopefully, it makes it into the next set of content. Here is the idea:

    Really, the ideal for heterogeneous processors is the maximum performance of each type multiplied by the number of cores in that type, and summing those. That is still not perfect, but it is closer to what you are thinking.

    One other one to call out is that on the AMD side, you have 1c/1t and then multiply it by the number of threads, so 16 cores, but 32 threads. I do not think anyone expects SMT to be the same performance as a physical core, so saying 32 times that 1c/1t result is also a bit odd for “ideal” if you start going down a weighted ideal for heterogeneous cores. Is the base then the 1c/2t number? Is it then 1c/2t minus 1c/1t for an ideal SMT thread and calling that a “core” for our weighting?

    All good thoughts and feedback. Another very valid point is that we are running these at lower compiler optimization levels than in the official runs. That is actually by design, but there is always a fair question of whether we should do more optimization.

  3. Patrick – My first message originally had some thoughts about threads vs. cores but I removed them before posting as I wanted my main point to be clear and unelaborated. But since you bring it up…

    I like your notion of 1c2t-1c1t as a base, but I think that you need to take a step back first and ask, what are you actually trying to do? Do you want to understand an architecture and maybe learn something about whether the choices made in it were good ones? Or do you want to build as good a picture as you reasonably can of what a machine’s performance profile is? These are two not entirely compatible goals, though there is clearly a lot of overlap. Once you decide, that answers some of these questions, or at least narrows down the answers.

    In particular, in the second case, it really doesn’t matter what those computations come out to. All that really matters is what you actually measure.

  4. Hey justsomeguy and Patrick, I think in the end what matters is what performance one should expect from a processor and also what one expects from a certain architecture. And here one should add with really bold letters – at what price and at what power level. Considering the last remark, the scores depicted here should also account for the fact that you tested thermally constrained systems – you didn’t test open air systems with an unrestricted thermal solution. Therefore you also test the capabilities of an overall system manufacturing and its cooling solution. To go back to the conversation of the performance expectations, one should also consider why we have p and e cores and of course multithreading. The concept of heterogeneity is to better handle some aspects indirectly related to performance – it is all about efficiency. If you take efficiency (perf/watt) out of the scope, then I find the expected scaling basically wrong. If you take a workload that has the characteristics mostly suited for a p-core and then use it in multiple cores, it does not make real sense to me to consider the scaling beyond the maximum of p-cores – what you may gain from a number of e-cores running alongside is just an added bonus (which may also not be ideal in certain use cases). In the same context, multithreading is used to exploit certain characteristics of the workloads to use the same underutilised resources for multiple processes/threads. In that context, I would consider the 1c/1t as the baseline, the theoritical maximum to be the number of cores and if a workload manages to achieve performance above that level, then this is to be a welcome bonus. Anyway, since you provide the raw data, anyone can make his/her own readings and take your analysis as one aspect that he/she may accept or not. In the end, taking out the dell laptop from the equation, I find that the other two platforms are generally not…. general purpose. So one may focus on the exact applications that they are intended for and not look at the overall results to make what the numbers mean. For example, I would not expect anyone to buy an nVidia Spark system to run gem5 or use it as a database server, although I would expect that the performance of Python and AI related stuff are very important.

LEAVE A REPLY

Please enter your comment!
Please enter your name here

This site uses Akismet to reduce spam. Learn how your comment data is processed.