ZFS is the popular storage system that was born out of Sun Microsystems (now Oracle). If you are looking for a piece of software with both zealots for and against it, ZFS should be at the top of your list. Today we are seeing many storage systems predicated on flash storage, but ZFS was born in an era of disk-based storage. That is an important piece of background because with disk-based arrays, performance is always a concern due to the slow media. While reads can be easily cached in RAM, writes need to be cached on persistent storage to maintain the integrity of the storage. The ZFS ZIL SLOG is essentially a fast persistent (or essentially persistent) write cache for ZFS storage.
In this article, we are going to discuss what the ZIL and SLOG are. We are then going to discuss what makes a good device and some common pitfalls to avoid when selecting a drive. As we have hinted at on STH, we just did a round of benchmarking on current options, so we will have data for you in follow-up pieces. We are going to try to keep this at a high enough level that a broad audience can understand what is going on. If you want to help by coding new features for OpenZFS, this is not the ultra-technical guide you need.
Background: What happens when you write data to storage?
In the original draft of this article, we started with ZFS. Instead, we wanted to provide a bit of background on what happens when you write data, at least at a high level. These absolute basics are required to understand what the ZFS ZIL SLOG does. We are going to use a simple example of a client machine, say a virtual machine host, writing to a ZFS storage server. We are going to exclude all of the fun network stack bits and the impact of technologies like RDMA. Remember, high-level.
Let us say that you have a VM running on a VM host. That host needs to save data to its network storage so it can be accessed later by that VM or another VM host. Essentially, three major operations need to happen. First, the data is transmitted from the client VM host. Once the data reaches the other side of the network, it needs to be received by the storage server. Finally, the storage server needs to acknowledge that the data has been received. Here is the illustration:
That acknowledgment is an important step in the process. Until the client receives the acknowledgment, it does not know that the data has been successfully received and is safe on storage. That acknowledgment is important because it can have a dramatic impact on synchronous write performance.
Synchronous v. Asynchronous Writes
Synchronous and asynchronous writes may seem simple, yet the difference has a profound impact. At its essence, a system will send data to storage to be written; with a synchronous write, it will wait until it receives the acknowledgment from the target storage. With an asynchronous write, a system will send data and then move on to the next task before it receives the acknowledgment.
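To make the difference concrete, here is a minimal Python sketch at the file API level (the file paths are just placeholders): a buffered write that returns before the data reaches stable media, versus a write that blocks until the device acknowledges it.

```python
import os

DATA = b"x" * 4096  # one 4KiB block; the file paths below are placeholders

# Asynchronous-style write: write() returns once the data is in the OS page
# cache, and the application moves on before the data is on stable media.
with open("/tmp/async_example.bin", "wb") as f:
    f.write(DATA)  # returns quickly; the data may still only be in RAM
    # no fsync() here, so a crash at this point can lose the "written" data

# Synchronous-style write: O_SYNC (or an explicit fsync) makes the call block
# until the device reports the data is on persistent storage.
fd = os.open("/tmp/sync_example.bin", os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
try:
    os.write(fd, DATA)  # does not return until the storage acknowledges it
finally:
    os.close(fd)
```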
On the data security side of things, asynchronous writes are generally considered less “safe” because a disruption in the storage or network can mean that the system doing the write thinks the data is written and safe on persistent storage when in fact it is not. If you write a check thinking you have money in the bank and you do not, bad things happen (e.g., your check bounces). Modern systems are generally reliable, but asynchronous writes can cause issues. If you look at client systems such as laptops, it is not uncommon to see asynchronous writes, along with the occasional data loss or corruption that comes with them. On the server side, this is not a desirable scenario for anyone who wants to keep a job.
Synchronous writes (also known as sync writes) are safer because the client system waits for the acknowledgment before it continues on. The price of this safety is often performance, especially when writing to slow arrays of hard drives. If you have to wait for data to be written to a slow array of disks and get a response back, it can feel like timing storage with a sundial.
Given its Sun heritage, ZFS is designed to keep data secure and provide storage to many clients. You get a bad reputation as a server/storage vendor if power and network disruptions cause thousands of machines to lose data. As a result, the ZFS engineers implemented the ability to have a fast write cache. This write cache allows data to land on the target system's persistent storage, and an acknowledgment to be sent back, faster than if the chain were waiting on a slow pool of hard drives to confirm the write occurred.
What is the ZFS ZIL?
ZIL stands for ZFS Intent Log. The purpose of the ZIL in ZFS is to log synchronous operations to disk before they are written to your array. That synchronous part is essentially how you can be sure an operation is completed and the write is safe on persistent storage instead of cached in volatile memory. The ZIL in ZFS acts as a write cache ahead of the spa_sync() operation that actually writes data to the array. Since spa_sync() can take considerable time on a disk-based storage system, ZFS has the ZIL, which is designed to quickly and safely handle synchronous operations before spa_sync() writes data to disk.
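As a rough mental model only, not a reflection of how the OpenZFS code is actually structured, the flow looks something like this Python sketch: a sync write is logged to a fast persistent device and acknowledged right away, while the slower main pool is updated later by a periodic flush standing in for spa_sync().

```python
# Conceptual sketch only -- this is not the OpenZFS implementation. It models
# the idea that a sync write lands on a fast persistent log first and is
# acknowledged immediately, while the slow main pool is updated later by a
# periodic flush standing in for spa_sync().
class ConceptualZIL:
    def __init__(self, fast_log, slow_pool):
        self.fast_log = fast_log    # fast persistent device (the ZIL / SLOG)
        self.slow_pool = slow_pool  # large, slow array of disks
        self.pending = []           # writes logged but not yet on the pool

    def sync_write(self, record):
        self.fast_log.append(record)  # safe on persistent media quickly
        self.pending.append(record)
        return "ack"                  # the client can be acknowledged now

    def flush(self):
        # Analogous to spa_sync(): push the batched writes to the main pool.
        self.slow_pool.extend(self.pending)
        self.pending.clear()
        self.fast_log.clear()         # log records are no longer needed

zil = ConceptualZIL(fast_log=[], slow_pool=[])
print(zil.sync_write(b"block 1"))     # acknowledged before the pool is touched
zil.flush()                           # later, the data reaches the slow pool
```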
What is the ZFS SLOG?
In ZFS, people commonly refer to adding a write cache SSD as adding an “SSD ZIL.” Colloquially, that has become like using the phrase “laughing out loud.” Your English teacher may have corrected you to say “aloud,” but nowadays people simply accept LOL (yes, we found a way to fit another acronym in the piece!) What would be more correct is to say it is a SLOG, or Separate intent LOG, SSD. In ZFS, the SLOG will cache synchronous ZIL data before flushing to disk. When added to a ZFS array, it is essentially meant to be a high-speed write cache.
There is a lot more going on with data stored in RAM, but this is a decent conceptual model of the process.
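For reference, adding a SLOG to an existing pool is a one-line operation. The example below assumes a hypothetical pool named "tank" and placeholder NVMe device paths, and simply wraps the standard zpool add ... log syntax.

```python
# Hypothetical example: the pool name "tank" and the device paths are
# placeholders. This simply wraps the standard zpool syntax for adding a
# mirrored log vdev and then checking the pool layout.
import subprocess

subprocess.run(
    ["zpool", "add", "tank", "log", "mirror",
     "/dev/disk/by-id/nvme-slog0", "/dev/disk/by-id/nvme-slog1"],
    check=True,
)
subprocess.run(["zpool", "status", "tank"], check=True)
```

Mirroring the log vdev is a common choice so that losing a single SLOG device during a power event does not cost you in-flight sync writes.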
What is commonly used as a SLOG device?
Traditionally, there have been solutions using small RAM-based drives to act as the ZFS ZIL / SLOG device. For example, the SAS-based 8GB ZeusRAM was the device to get for years. That changed with NVMe SSDs. Once the Intel DC P3700 hit, it was clear RAM + NAND devices were going to take over the market. When Intel Optane SSDs came out in early 2017, they quickly became a solid option. Intel Optane drives combine low latency and high bandwidth at low queue depths, behaving more like RAM, but with data persistence like NAND.
Here are the three scenarios we are going to discuss in terms of SLOG device options:
- RAM-based drives such as the ZeusRAM and DDRdrive
- Enterprise NAND SSDs with power loss protection
- Intel Optane SSDs
We are going to use a base assumption that for any write cache device you want something with high durability and high reliability, and oftentimes you will want to mirror devices. These are base assumptions for any device in this role.
In the case of products like the ZeusRAM and DDRdrive, a sync write happens and the SLOG device stores the data in DRAM. DRAM is not persistent storage, which means that upon power failure you would lose the data stored, much like main system memory. What these devices generally do is have batteries or capacitors, along with onboard NAND, that allow the RAM contents to either persist through a short outage or be written out to the onboard NAND in the event of a power emergency.
The ZeusRAM went through a SAS controller, so there was a PCIe to SAS controller hop as well as one or more SAS controller to SAS device hops, which add latency.
Other options are enterprise SSDs. To maximize the life of NAND cells, NAND-based SSDs typically have a DRAM write cache. Data is received, batched, and then written to NAND in an orderly fashion. This brings up the same concern we saw with the RAM-based devices: power loss while data is in DRAM can cause data loss. In enterprise drives with Power Loss Protection, or PLP, onboard capacitors give the SSD enough power on power loss to write the data out from the DRAM cache to the NAND.
Since these enterprise drives can treat data as “safe” once it is in the DRAM cache, they can acknowledge that data is saved securely for sync write operations. Most consumer drives use a DRAM write cache to achieve high performance but do not have this power loss protection. That is why their sync write speeds are generally low. Here is the internal view of an Intel DC P3700 SSD. You can see the large capacitor on the right side of the image.
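If you want to see this effect on your own hardware, a quick way is to time repeated write-plus-fsync operations against the drive in question. Here is a rough Python sketch, with a placeholder mount point and an arbitrary iteration count; drives without power loss protection will typically show dramatically higher latency here.

```python
# Rough sketch: time repeated write-plus-fsync operations on a device mounted
# at a placeholder path. The iteration count is arbitrary.
import os
import time

PATH = "/mnt/test_device/fsync_probe.bin"  # placeholder mount point
BLOCK = b"x" * 4096
ITERATIONS = 1000

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
try:
    start = time.perf_counter()
    for _ in range(ITERATIONS):
        os.write(fd, BLOCK)
        os.fsync(fd)  # wait for the device to acknowledge the write
    elapsed = time.perf_counter() - start
finally:
    os.close(fd)

print(f"average sync write latency: {elapsed / ITERATIONS * 1e6:.1f} us")
```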
Those capacitors are not included in consumer SSDs, as there is a race to have the lowest BOM cost.
Intel Optane is the relative newcomer and perhaps the most interesting. Remember, Intel Optane stores data on its media packages without requiring a RAM-based write cache buffer. There are other benefits, such as not needing the same garbage collection algorithms we see on NAND-based SSDs. As a result, writes to the drive go directly to the persistent storage media.
Architecturally, Optane is fascinating, and there are performance benefits as well. Optane handles mixed workloads extremely well. Likewise, it performs well at low queue depths. Finally, it has high endurance, all of which makes it ideal as a ZFS ZIL SLOG device.
In this article, we hope you learned the basics of what the ZFS ZIL does, what a SLOG is, and why you would use one. You should also have learned about some of the common ZFS ZIL SLOG devices. As you delve deeper into what is happening, there is a lot more going on in terms of when things hit RAM, how flushes happen, and so on. On the flip side, this should give a good enough overview to understand why one may want a SLOG device in a ZFS array.
It is also important to note that these conceptual models apply elsewhere. For example, RAID cards such as the New Microsemi Adaptec Smart Storage Adapter SAS3 Controllers can use their DRAM write cache and a capacitor to flush data to NAND on a power loss event.
In a follow-up piece, we have results for the Intel Optane products performing the writes and flushes they may experience as a SLOG device. This is different from a typical mixed workload or a 100% write workload figure. We have numbers not just for the Intel DC P4800X, but also for lower-end products including the Intel Optane 900p and Optane Memory M.2 devices. We also have Intel NVMe SSDs, along with a few devices from other vendors across the NVMe, SAS3, and SATA ranges to compare. The data is generated, and the charts and article are in progress. If you cannot tell by the general tone of this article, the Intel Optane drives are a new category killer.