On the calendar for this week, we had a piece set highlighting the complexity of Intel Optane DIMMs. As many have seen, at STH we have been using Optane SSDs as database storage for years, and the next step was going to be Intel Optane DIMMs. As we started going from slides and presentations to actually operating Intel Optane DCPMM (or PMem 100), and with PMem 200 being a key driver for Cooper Lake, it was time to bridge slideware to practical experience. Then, hours before that piece was going live, Micron announced that it was exiting the 3D XPoint business and selling its Lehi, Utah fab that makes the media.
As a result, we pulled the original piece to augment it with what we think is behind Micron's announcement and why. In hindsight, it probably should have been two articles. At the same time, we are getting into a level of detail around Intel Optane Persistent Memory that is beyond most slideware, but also above many detailed whitepapers. As such, the complexity we planned to show in the original piece actually helps explain part of Micron's decision.
Since we have a lot to share, we also have a video version accompanying this article. A few mid-week edits later, and it mostly matches what we will cover in this piece, just in video form for those who prefer to listen.
As always, we suggest opening this in a separate browser tab to get the best viewing experience.
The Quick Intel Optane / 3D XPoint Background
For a bit of background, with Intel Optane (the company's marketing name for 3D XPoint-based products) we effectively get a cross somewhere between DRAM and NAND. Typically, this is described as byte-addressable non-volatile memory. The basic idea is that it is a persistent storage solution sitting somewhere between the NAND flash in your SSD and the DRAM that is the RAM in your PC. When you hear a PC has, say, 16GB of RAM and a 256GB SSD, you can think of Optane as a cross between the two, but that is a very simplistic view, so let us get into a bit of what is going on. In practice, it operates differently than many who have not used Optane PMem expect, so we wanted to clear up how it actually works.
First off, the 3D XPoint media has a cell density between that of DRAM and NAND. As a result, the idea is that we can get higher DIMM capacities than with DRAM. At the same time, the structures storing data are larger than NAND, so we get less capacity than NAND. This has a direct relationship with the manufacturing cost. As an in-between product, 3D XPoint is not always superior to DRAM (DRAM is faster) and not always superior to NAND (NAND is higher capacity). As a result, as a direct replacement for either, 3D XPoint needs to be priced somewhere in the middle unless it offers a unique value proposition.
Second, in terms of endurance and performance, Intel's Optane-based data center products are rated for effectively unlimited endurance. This is an exaggeration in some ways, but the endurance ratings are so high that, unlike with NAND, endurance is not a concern. Furthermore, an Optane SSD typically writes directly to media, instead of a write going first to DRAM and then to media. That intermediate write to DRAM is basically why the industry cares about power-loss protection for data center SSDs. If the power fails mid-write, the DRAM loses its contents, which means data written to a NAND SSD is not safe until it makes it to the NAND. For write logs, such as a ZFS ZIL/ SLOG and databases, this is absolutely magic.
The byte-addressability is a big difference compared to today's common QLC NAND. In terms of performance, compared to NAND, 3D XPoint is much faster in real-world solutions, especially at low queue depths, and has better quality of service because it is byte-addressable.
At the same time, performance-wise, 3D XPoint is still an in-between solution, so not everything is perfect. It is slower in two ways compared to the DRAM typically found in your DIMM sockets. First, it has higher latency because it is writing to the persistent 3D XPoint media instead of DRAM. The second is one that not many discuss. The first two generations of Intel Optane PMem, PMem 100 (DCPMM) and PMem 200, operate at DDR4-2666 speeds with Cascade Lake and Cooper Lake.
That is extremely important. Once you add PMem to a server, the memory speed drops to DDR4-2666. So on Cascade Lake, or the 2nd Generation Intel Xeon Scalable, that means we go from 6x DDR4-2933 per socket to 6x DDR4-2666. The same happens with the DDR4-3200 3rd Gen Intel Xeon Scalable CPUs as well. For comparison, modern CPUs such as the AMD EPYC 7002/ 7003 series and the Ampere Altra both offer 8x DDR4-3200 per socket. So theoretically, even comparing 2nd Gen Xeon Scalable with PMem 100 to the same CPU without it, we only lose around 9-10% of bandwidth in our conceptual model in the table below:
Theoretical bandwidth plummets when using DCPMM/ PMem compared to the 8-channel DDR4-3200 solutions that have been in-market since at least 2019 with the AMD EPYC 7002 "Rome". Effectively, the Intel-only DCPMM/ PMem modules added to a system create a memory bandwidth deficit with Cascade Lake/ Cooper Lake CPUs. This, combined with almost dizzying memory channel population requirements (so long that we removed them from this piece, along with an 8-minute segment from the already long video), means that just adding PMem to a system has impacts beyond those of DRAM and NAND.
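As a quick back-of-the-envelope check on those figures: each DDR4 channel moves 8 bytes per transfer, so six channels of DDR4-2933 peak at roughly 6 x 2933 MT/s x 8B, or about 140.8GB/s, while six channels of DDR4-2666 peak at about 128GB/s, which is the roughly 9% drop noted above. An 8-channel DDR4-3200 platform peaks around 204.8GB/s, or about 60% more theoretical bandwidth than a PMem-populated Cascade Lake socket.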
Very few folks discuss the memory speed impact, but it is real, and it is very easy to understand what is happening.
So why would one give up memory bandwidth, and accept less storage capacity than NAND, to add Optane? The answer is to get a mixture of both. To look at this, we need to understand what Optane is effectively doing when we add PMem modules.
Intel Optane DC Persistent Memory (PMem) Real-Life Modes
The name of this section is chosen specifically because this is how Intel often simplifies Optane to show what is going on. Intel shows Memory Mode and App Direct Mode. That sounds great, but in application, it is full of glorious complexity. When I first saw the slides, then saw the configuration settings on systems, it took a while to figure out what was happening. To Intel's credit, they do have documentation on all of this, and it is more complex than what we are going into here, but I wanted to give some sense of the next layer down of complexity.
First, one can see Memory Mode and App Direct Mode. Memory Mode is the one that is perhaps the easiest to understand and sell. Indeed, I heard a lot of the early adoption was for Memory Mode, and we are now using this in the STH hosting cluster, but we are going to come back to that.
Memory Mode: Conceptually Easy, Complex Implementation
Intel says Memory Mode uses the Optane PMem as memory, effectively as a substitute for adding more expensive DRAM. That is a high-level abstraction, and it somewhat works. I am going to offer another model which fits with the slide above but may seem strange to those who have not gotten into the details. Instead of thinking of PMem as DRAM, think of PMem like a storage array where you have data on SSDs, but cache the frequently accessed data in the system's DRAM or the RAID controller's DRAM. That means frequently accessed data hits fast DRAM and is periodically flushed to the underlying storage. If you are worried at this point about storing data on "memory modules", Intel's answer is that there is a cryptographic key that is reset on every reboot, so the data on the modules is encrypted in such a way that one cannot simply take the modules out of a system and access the data. Everything is effectively gone after every reboot.
As a result, one needs to add memory alongside Optane. For performance, Intel recommends a 4:1 ratio, so effectively there is around 25% as much DRAM as Optane. It turns out that there are a number of research papers by hyper-scalers showing that servers often have only maybe 20-40% of their data hot, while the rest is relatively cold. In other words, around a quarter of the data needs to be in DRAM while the remainder can be in something a bit slower, in this case, Optane PMem.
That ratio is a big challenge though. For example, if one wants to use 128GB PMem modules at a 4:1 ratio, then one needs to populate a 32GB RDIMM and a 128GB PMem module in the same memory channel. As a result, the cost of the 32GB RDIMM plus the 128GB PMem module needs to be less than 2x 64GB RDIMMs, which would offer the same capacity and higher performance. In our hosting nodes, we tend to use 16GB + 128GB because that tends to work well and is a lot less expensive when scaled across 24 DIMM slots in a system. Just for a sense, we save a few thousand dollars per node with little impact on our hosting performance with this solution.
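For those curious what provisioning Memory Mode actually looks like from the OS side, here is a minimal sketch using Intel's ipmctl tool, assuming it is installed and the modules are already populated; most BIOSes expose equivalent options, so the exact flow on a given platform may differ:

```
# Put 100% of the Optane PMem capacity into volatile Memory Mode
# (the installed DRAM then acts as a cache in front of it)
ipmctl create -goal MemoryMode=100

# A reboot is required for the new goal to take effect, then verify:
ipmctl show -memoryresources
```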
Intel's challenge, and this gets to profitability, is that Memory Mode pricing is constantly pressured by DRAM pricing, especially when DRAM prices fall. When memory pricing is volatile, Optane PMem pricing needs to be volatile as well. Memory Mode may be easy to understand, and easy to adopt, but it is constantly bound by needing to provide a discount versus DRAM-only configurations.
The Persistent Bit: App Direct Mode
The way Intel gets to more value is with App Direct Mode. The easiest way to think of App Direct Mode is that you are turning each of the PMem modules into an SSD. Then comes the complexity. One way App Direct Mode can be used is via APIs with applications that support PMem. If an application is aware, such as SAP HANA, then the application can write to the PMem modules as if they were a super-fast storage tier. If the application is unaware, then there is a fallback.
When one sets up Optane PMem modules and wants to expose them as normal block devices in Storage Mode, it is actually quite analogous to an SSD. Each CPU has its DCPMM/ PMem Set of modules. The modules in each Set can be grouped into Regions. Namespaces can then be created, either through CLI tools or, usually, through BIOS settings. Those namespaces can then be exposed to the OS as block devices, but there is another option for direct access (DAX) in some OSes that bypasses the traditional block stack. You can look up terms such as ndctl and fsdax/ devdax for a bit more on this, as the documentation is fairly good and those search terms will get you started.
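To make that concrete, here is a minimal sketch of turning an existing region into a usable block device on Linux with ndctl, assuming region0 already exists and the resulting device comes up as /dev/pmem0 (names will vary on your system):

```
# Create an fsdax namespace on an existing App Direct region
ndctl create-namespace --mode=fsdax --region=region0

# Format the resulting block device and mount it with DAX so
# mapped file I/O can bypass the page cache
mkfs.xfs /dev/pmem0
mkdir -p /mnt/pmem
mount -o dax /dev/pmem0 /mnt/pmem
```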
The tricky part is that there is another layer of complexity that does not get discussed often: how you create the regions and namespaces. Optane for App Direct is roughly organized into a hierarchy (there is a short inspection example after this list):
- Set – this is a “set” of DCPMM/ PMem modules that sit on a CPU socket
- Region – definable grouping across one or more Optane modules
- Namespace – this is similar to an NVMe SSD namespace, but it sits atop a region and defines the storage the OS actually transacts on
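As a sketch of how those layers appear in practice, the stock tools can show each level of the hierarchy (assuming ipmctl and ndctl are installed):

```
ipmctl show -dimm      # the individual DCPMM/ PMem modules on each socket
ipmctl show -region    # the regions carved from each socket's Set of modules
ndctl list -N          # the namespaces created on top of those regions
```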
Your options are basically that you can use individual PMem modules or you can interleave them into a region. If you interleave the modules, it is like creating a large RAID 0 array of Optane P4800X/ P5800X SSDs. That is again a bit simplified, but the challenge is that data is striped across modules to gain performance, like a RAID 0 array. If one module fails, then you lose data.
To counteract this, there is another option. One can expose each PMem module as its own region/ namespace/ block device, and then do things such as create RAID 1 arrays with them as if they were SSDs, which provides in-system redundancy. Of course, doing this has a performance penalty versus the interleaved mode. Also, if one mirrors across sockets, then every write has to traverse the UPI links, adding latency until it is complete.
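Here is a minimal sketch of both approaches, again assuming ipmctl and mdadm are available and that two non-interleaved modules end up exposed as /dev/pmem0 and /dev/pmem1 (hypothetical device names):

```
# Option 1: interleave all modules on each socket into one fast region
# (RAID 0-like performance, RAID 0-like risk)
ipmctl create -goal PersistentMemoryType=AppDirect

# Option 2: one region per module, then mirror the block devices yourself
ipmctl create -goal PersistentMemoryType=AppDirectNotInterleaved
# ...after a reboot and creating fsdax namespaces on each region:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/pmem0 /dev/pmem1
```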
So the challenge is that to keep the modules fast, we would want to interleave using a region across a CPU's Set of PMem modules. Then, we are at risk of losing data, so we need to drop to a lower performance state to get redundancy within a system. We can mirror data and transactions across a network to other systems, but that is another layer of complexity, especially at the speed of Optane DIMMs.
In App Direct/ Storage Mode, the DRAM functions as normal DRAM, and we get fairly pure access to the Optane DIMMs. We get persistent storage (App Direct) capacity close to the Optane DIMM capacity, and memory capacity close to the DRAM capacity in the system. There are overheads, but that is the conceptual model you can use. Then, things get more complex.
Glorious Complexity: Mixed Mode
There is one other mode called "Mixed Mode." This is exactly how it sounds and allows one to access all of the complexity of Memory Mode while also not missing out on the complexity of App Direct Mode. In other words, you can do both, on the same system, at the same time. One can provision, say, a third of the capacity for App Direct/ Storage while the remaining two-thirds are used for Memory Mode. In our 12x 16GB + 12x 128GB DCPMM example, we would then have around 0.5TB per socket reserved for Memory Mode, for around a 5:1 Optane to DRAM ratio. The remaining 0.25TB per CPU can then be used to create regions and namespaces.
How one does this is by setting a "slider" where one chooses how much of the capacity is set aside for Memory Mode, with the remainder being set aside for App Direct use. Software determines how much goes to each pool and takes care of the alignment/ rounding. One will note here that we are showing this using the Supermicro BIOS, but there are CLI tools that handle all of this and can do more granular controls (you can see "Create goal config for Platform" as an example.)
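From the CLI side, the equivalent of that slider is a single goal command; here is a minimal sketch assuming ipmctl and the two-thirds/ one-third split from our example above:

```
# Reserve ~66% of PMem capacity for Memory Mode; the remainder becomes
# an interleaved App Direct pool for regions/ namespaces
ipmctl create -goal MemoryMode=66 PersistentMemoryType=AppDirect

# After a reboot, confirm how the capacity was split
ipmctl show -memoryresources
```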
Mixed Mode is a cool feature, but now we are using the Optane DIMMs for the cold portion of main memory plus the storage, so there can be a performance impact. The first application where we used this was to carve out ZIL/ SLOG space, which had a small capacity impact but meant we could keep another drive free and still get the benefits of a fast ZIL/ SLOG device. Since that has modest capacity requirements, it not only freed a drive bay but also saved us from having to spend hundreds of dollars per machine on drives like the Intel Optane DC P4801X 100GB U.2 NVMe SSD Log Option. We have found that Optane, in general, is an awesome SLOG/ ZIL solution for those running ZFS arrays.
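For ZFS users, attaching that carved-out PMem capacity as a log device looks the same as adding any other SLOG; a minimal sketch, assuming a pool named tank and two per-module namespaces exposed as /dev/pmem0 and /dev/pmem1 (both hypothetical names):

```
# Attach a mirrored SLOG built from two PMem block devices to an existing pool
zpool add tank log mirror /dev/pmem0 /dev/pmem1

# Verify the log vdev shows up
zpool status tank
```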
Sure it is complex, but it is also extremely useful. Using Mixed Mode to add mirrored write cache devices while simultaneously getting lower-cost memory expansion is immensely powerful, leading to our statement around “glorious complexity.”
With that background, let us get to the Micron announcement that it is discontinuing its 3D XPoint program.