At Hot Chips 29 (2017), the Qualcomm Centriq 2400 was a hot topic. For those who have not been following since Q4 2016, Qualcomm is making its way into the mainstream server business with an ARMv8 64-bit CPU. We saw a live demo of the CPU and saw hardware in person from Microsoft at the OCP Summit 2017. At Hot Chips 29 we received a fairly in-depth architectural overview, but no performance or power information.
Qualcomm Centriq 2400 Overview at Hot Chips 29
Here is the high-level SoC overview. Key aspects are that it has up to 48 cores, 6 channels of DDR4 (up to 2 DIMMs per channel), 32 PCIe Gen 3 lanes, and a bidirectional ring bus. Qualcomm did not disclose L3 cache sizes, clock speeds, power consumption and so forth. On the other hand, Qualcomm gave a fairly deep architecture overview considering that it did not disclose those critical facts.
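To put the I/O and memory figures in context, here is a back-of-the-envelope sketch of theoretical peak bandwidth. The PCIe Gen 3 signalling rate and 128b/130b encoding are standard. The DDR4 data rate is our assumption for illustration only, since Qualcomm did not state memory speeds in this overview. The function names are ours.

```python
# Back-of-the-envelope peak bandwidth for the disclosed configuration.
# These are theoretical ceilings, not measured or vendor-stated numbers.

PCIE_GEN3_GT_PER_S = 8.0         # PCIe Gen 3 signalling rate per lane
PCIE_GEN3_ENCODING = 128 / 130   # 128b/130b line encoding efficiency
PCIE_LANES = 32                  # from the overview slide

DDR4_CHANNELS = 6                # from the overview slide
DDR4_BUS_BYTES = 8               # 64-bit data bus per DDR4 channel
ASSUMED_DDR4_MT_PER_S = 2666     # ASSUMPTION: data rate was not disclosed


def pcie_gen3_gbytes_per_s(lanes: int) -> float:
    """Peak bandwidth in one direction, in GB/s, before protocol overhead."""
    return lanes * PCIE_GEN3_GT_PER_S * PCIE_GEN3_ENCODING / 8


def ddr4_gbytes_per_s(channels: int, mt_per_s: int) -> float:
    """Peak DRAM bandwidth in GB/s across all channels."""
    return channels * DDR4_BUS_BYTES * mt_per_s / 1000


if __name__ == "__main__":
    print(f"PCIe: {pcie_gen3_gbytes_per_s(PCIE_LANES):.1f} GB/s per direction")
    print(f"DDR4: {ddr4_gbytes_per_s(DDR4_CHANNELS, ASSUMED_DDR4_MT_PER_S):.1f} GB/s")
```

With those inputs, the 32 lanes work out to roughly 31.5 GB/s in each direction, and six channels at the assumed DDR4-2666 would give roughly 128 GB/s. The memory number moves with whatever DDR4 speed the shipping parts support.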
L3 QoS is software that monitors the use of the L3 cache and determines how the L3 can be optimally partitioned.
One of the more intriguing features is memory bandwidth compression. The Qualcomm Centriq 2400 memory controller can inspect writes and reads going to and from memory and determine whether a line can be compressed.
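As a rough illustration of the idea, here is a toy sketch of a per-line "can this be compressed?" decision. Qualcomm did not describe its algorithm in what we saw, so everything here is an assumption: the line size, the compressed size, and the use of zlib as a stand-in for whatever lightweight scheme the hardware uses.

```python
import zlib

LINE_BYTES = 128        # ASSUMPTION: hypothetical uncompressed line size
COMPRESSED_BYTES = 64   # ASSUMPTION: hypothetical size of a compressed line


def can_compress(line: bytes) -> bool:
    """Return True if the line fits in the smaller transfer size.

    zlib is only a software stand-in; a real memory controller would use a
    much simpler fixed-latency scheme.
    """
    if len(line) != LINE_BYTES:
        raise ValueError(f"expected {LINE_BYTES} bytes, got {len(line)}")
    return len(zlib.compress(line)) <= COMPRESSED_BYTES


def bytes_transferred(lines: list[bytes]) -> int:
    """Total bytes moved to or from memory for a sequence of lines."""
    return sum(COMPRESSED_BYTES if can_compress(l) else LINE_BYTES for l in lines)


if __name__ == "__main__":
    zero_line = bytes(LINE_BYTES)              # highly repetitive data
    distinct_line = bytes(range(LINE_BYTES))   # no repeated bytes
    print(can_compress(zero_line), can_compress(distinct_line))
    print(bytes_transferred([zero_line, distinct_line]))
```

The point of the sketch is that the saving depends on the data: lines full of repeated values move fewer bytes, while lines with little redundancy are sent at full size.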
The last point is around Secure Boot. In this latest generation of processors, this type of feature is table stakes. Each implementation will have some differences that a customer may prefer, but it is an item that Centriq needs.
The next section will be a deeper dive into Qualcomm Centriq 2400 architecture. We know that many of our readers care about performance, cores, RAM and sockets so we wanted to get those details up front. After the architecture section, we cover what we saw of the hardware at OCP 2017, the competitive landscape, and our closing thoughts.
Deeper Dive into Qualcomm Centriq 2400 Architecture
At Hot Chips 29, Qualcomm gave an unexpectedly deep dive into the architecture. We had pre-read the slides twice before the talk, listened to the talk, and reread the slides after. The slides cover the vast majority of the details and we know many of our readers care more about TCO and total system performance. Therefore, we are going to present the slides with only a bit of commentary.
There are a few big points on this slide. First, this is a design meant to deliver high IPC performance. We have seen some ARM server designs that focused on I/O and power but decided not to compete in single thread performance. Qualcomm stressed that it is a high performance per thread design (without revealing performance numbers).
Another key aspect here is the AArch64-only support. Qualcomm eschews legacy 32-bit execution in order to extract better efficiency. On the x86 side, there is a huge demand to run legacy 32-bit code so that trade-off is not possible. Using ARM, most users will need to either start with new development projects or port existing x86 data center code anyway. Putting a stake in the ground and saying it is 2017, use 64-bit in your go-forward data center efforts makes sense in the high-performance segment.
Qualcomm focused a good portion of the discussion on power management, down to how different choices in cache and topology impact power consumption. Coming from a power-rationing mobile background, one would expect Qualcomm to bring the knowledge and perspectives from low power design into its server chips. Qualcomm claims that this is going to be the first 10nm server CPU and that it has the experience to extract maximum benefit out of that process.
The Falkor Duplex has two Falkor ARMv8 cores, a shared L2 cache, and a bidirectional ring bus architecture.
The Qualcomm Falkor pipeline overview was a bit shocking in that we were not expecting this level of detail. One of the more interesting aspects of the pipeline architecture is that it is fully heterogeneous. If you trace through the pipeline diagram, no two paths are exactly alike.
As you would expect, Qualcomm has what it claims is a strong branch predictor. We were pleasantly surprised that Qualcomm did not label this an “AI” feature like some other manufacturers.
One of the more intriguing pieces of the Falkor cache design is that there is an exclusive L0 and L1 I-cache. We were told this is a trick to further reduce power consumption.
Here, the part that we wanted to see was the out of order execution. Some architectures, e.g. the Cavium ThunderX (1), have been in-order designs.
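For readers less familiar with the term, "exclusive" generally means a line lives in one level or the other, never both. Here is a toy model of that behavior. The class name, the capacities, and the LRU replacement policy are our assumptions for illustration; Qualcomm did not describe its policy at this level.

```python
from collections import OrderedDict


class ExclusiveICache:
    """Toy two-level exclusive cache: a line is in L0 or L1, never both."""

    def __init__(self, l0_lines: int, l1_lines: int) -> None:
        self.l0_cap = l0_lines    # hypothetical capacities, in lines
        self.l1_cap = l1_lines
        self.l0: OrderedDict[int, bool] = OrderedDict()  # oldest entry first
        self.l1: OrderedDict[int, bool] = OrderedDict()

    def fetch(self, addr: int) -> str:
        """Return where the line was found: 'L0', 'L1', or 'miss'."""
        if addr in self.l0:
            self.l0.move_to_end(addr)   # refresh LRU position
            return "L0"
        if addr in self.l1:
            del self.l1[addr]           # promoted lines leave L1
            source = "L1"
        else:
            source = "miss"             # would be filled from L2 and beyond
        self._install_in_l0(addr)
        return source

    def _install_in_l0(self, addr: int) -> None:
        if len(self.l0) >= self.l0_cap:
            victim, _ = self.l0.popitem(last=False)  # evict LRU line from L0
            if len(self.l1) >= self.l1_cap:
                self.l1.popitem(last=False)          # make room in L1
            self.l1[victim] = True                   # victim moves down to L1
        self.l0[addr] = True


if __name__ == "__main__":
    cache = ExclusiveICache(l0_lines=2, l1_lines=4)
    print([cache.fetch(a) for a in (1, 2, 3, 1, 3)])
```

Because no line is duplicated, the two levels together hold as many distinct lines as their capacities combined.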
On the integer and branch execution, one can clearly see the heterogeneous execution pipeline.
We double-checked the presentation we received and the green segment VX/VY was not covered.
Here you can see that the L2 cache is shared between two Falkor cores. A slide earlier discussed the distributed L3 cache and Qualcomm’s advancements in that area.
Overall, for an initial entry in the market we are excited. Although we are still well before general availability, we did get to see the chips in person and in action a few months ago at OCP.
Qualcomm Centriq 2400 from OCP
The last time we saw Qualcomm Centriq 2400 was at OCP Summit 2017. We were able to see one of the Microsoft OCP sleds in person. Here it is:
During the OCP Summit, Microsoft showed off an internal only Windows Server version running on ARMv8:
You can check out that piece for more information.
Qualcomm has the advantage of having far-reaching relationships with various vendors. Also, the fact that we saw a major customer running hardware earlier this year is a good sign. At the same time, Centriq 2400 will be competing in the mainstream market with products like the AMD EPYC 7000 series, Intel Xeon Scalable Processor Family, and the Cavium ThunderX2 (now Broadcom Vulcan based).
All three of its competitors run in either dual socket or single socket modes. Offerings from AMD and Intel are x86 compatible and will run years of software out-of-the-box without emulation. As we noted with the original ThunderX (1), switching to 64-bit ARM is not a pain-free experience.
Whereas we evaluated ThunderX in a world where the only other viable option was Intel, times have changed. First, the ARM infrastructure on the server side is light years better than what it was in Q1/Q2 2016. Second, AMD has a viable x86 alternative, allowing companies that want an alternative supplier to source one without major software changes. Third, Cavium itself has had a major change. By bringing in the Broadcom Vulcan program, Cavium has gone from a lower performance ARMv8 core with lots of networking in 2016 to a high performance (read HPC focused) ARM design in 2017. AMD, Cavium, and Intel are also shipping their next-gen platforms several months before Qualcomm.
Qualcomm has two more legitimate competitors in the market aside from Intel, even before getting to lower visibility players such as APM X-Gene 3. We do think that Qualcomm’s brand recognition, partner ecosystem, executive team and ability to execute in the ARM ecosystem will help it find success in the data center.
Qualcomm Centriq 2400 is going to make waves. Key to gaining market adoption is going to be getting the product out quickly. It is also going to have to find specific niches that it can play in. Hearing from customers of ThunderX (1) and AMD EPYC shows that there is pent-up ABI (Anything But Intel) market demand. While large customers and appliance manufacturers will support ARM in addition to x86, many enterprise vendors will restrict purchasing based on “what can I live migrate VMware VMs to/from.”
The bottom line here is that the Qualcomm Centriq 2400 looks promising. The fact that large hyper-scale customers like Microsoft are publicly showing it off is a great sign. Now all that is left is launching this generation and getting a clear roadmap out with future generations so customers can plan purchases accordingly. We cannot wait to see more from Qualcomm.