Intel Optane: A Practical Application Review

Intel Optane 900p AIC In Hosting Node

A few weeks ago, we performed a fun experiment. Bolstered by our Intel Optane ZFS findings, we went one step further. We hosted the STH main site on Optane for an entire day to see if it is a worthwhile upgrade. In short, we were pleasantly surprised to see that the solution worked extremely well.

The Path to Hosting STH on Optane

Changing hosting infrastructure to a new technology can be scary, but we knew from our Intel Optane P4800X testing in early 2017 that we could expect great things from the technology.

STH WP NGINX CDN Intel Optane Performance 1

In that earlier testing, we ran a synthetic test on an internal network using a CDN node to access files we host but we left the database to purely synthetic testing.

Going into this experiment, we had a reasonable assumption that Optane would perform well. It is one thing to test in the lab. It is another thing to test it in production.

We put copies of the entire STH main site on a dual Intel Xeon E5-2698 V4 system identical to another system in our hosting cluster, except that this node had 2x Intel Optane 900p 280GB AICs, 2x Intel DC P3600 1.6TB NVMe AICs, and 2x Intel DC P3700 400GB 2.5″ SSDs. In our standard hosting cluster’s NVMe tier we also use Samsung XS1715 drives, but we were limited by drive bays.

Intel Optane 900p AIC In Hosting Node

As anyone who has hosted largely static web pages can attest, disk I/O is generally not something one cares too much about if one can cache content in memory. That is generally what we do with STH. On the test server, we turned off the vast majority of this caching except for PHP opcode caching. The goal was to ensure that the database and all of the image files were hitting NVMe storage instead of being cached in RAM. This is similar to the setup you would have if you had a lot of content and decided to get faster primary storage instead of more RAM.

The next step was to change our HAProxy instances to use the Optane-backed copy of the STH main site as the primary, and to fail over to our regular hosting infrastructure only in the event of a failure. Again, we were taking measured steps.

We first hosted the site for a day on the Intel DC P3700 400GB SSDs, then for a second day entirely on the Intel Optane 900p 280GB SSDs.

There are certainly a few imperfect aspects to this type of test. We are using a live website rather than a scripted ab or siege run, which means that the access patterns are not identical between the two days. To somewhat mitigate this, we scheduled the test during a period when we did not publish new content so that traffic was relatively similar in terms of day-over-day pageviews, both per content piece and overall. The Optane-hosted day was also slightly busier than the P3700-hosted day.

Real-world Optane Impact

Instead of testing raw throughput, we are looking at latency. Although STH generates millions of page views each month, each page view is composed of several dozen individual requests. There are multiple PHP components, several database hits to get content for each page, and multiple images that must be requested per page, to name a few. The net result is around three dozen requests served from STH directly to browsers, with much more happening behind the scenes.

For a site like STH, this is easy enough to cache in RAM and minimize disk I/O even over hundreds of millions of requests. Once you move outside of RAM, strange things happen. For example, in the early days of STH, disk-based AWS EC2 VMs without proper RAM caching would seg fault and take the site down. Luckily, the Intel DC P3700 and Intel Optane 900p drives can keep the site up even without heavy RAM caching, making this experiment possible.

Looking at latency, we have a distribution of requests serviced based on percentile latencies.
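
To make the "nines" on that scale concrete, here is a minimal Python sketch of how percentile latencies could be pulled from a day of request timings. It is an illustration only and not the tooling behind our chart: the function names and the synthetic sample data are hypothetical, and it uses the simple nearest-rank definition of a percentile.

```python
def latency_at_nines(sorted_latencies_ms, nines):
    """Nearest-rank percentile for a given number of nines.

    nines=2 means the 99th percentile, nines=3 means 99.9th, and so on.
    Integer arithmetic is used so that float rounding cannot shift the rank.
    """
    n = len(sorted_latencies_ms)
    if n == 0:
        raise ValueError("no latency samples")
    d = 10 ** nines                       # 1 request in d may be slower
    rank = (n * (d - 1) + d - 1) // d     # ceil(n * (d - 1) / d)
    return sorted_latencies_ms[max(rank, 1) - 1]


def latency_by_nines(latencies_ms, max_nines=5):
    """Return {nines: latency} for two through max_nines nines."""
    ordered = sorted(latencies_ms)
    return {k: latency_at_nines(ordered, k) for k in range(2, max_nines + 1)}


if __name__ == "__main__":
    # Hypothetical sample: 1,000 requests taking 1 ms, 2 ms, ..., 1000 ms.
    sample = range(1, 1001)
    for nines, value in latency_by_nines(sample).items():
        print(f"{nines} nines: {value} ms")
```

With real logs, the input would be the per-request service times recorded for each test day, and the two resulting tables could be compared side by side. Note that the higher nines only become meaningful with enough samples: five nines needs well over 100,000 requests before it is anything other than the single slowest request.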

Intel Optane V P3700 STH Web Hosting Latencies

Years ago, folks looked at 99th percentile latency for web hosting, but that is no longer good enough given the sheer number of requests that modern web pages make. As we progress from two to five nines on our scale, we can see that the Intel Optane 900p simply obliterates the Intel DC P3700. This is relatively low queue depth work, so that is expected. At the same time, the Optane-backed hosting delivers a latency distribution that is more than a full nine better.
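
A quick back-of-the-envelope calculation shows why the 99th percentile falls short. Assuming roughly three dozen requests per page view and, as a simplification, that each request independently has the same chance of landing in the slow tail, the Python sketch below estimates how often a page view includes at least one tail-latency request. The function name is hypothetical and the independence assumption is a rough model, not a measurement.

```python
def chance_of_tail_hit(requests_per_page, nines):
    """Probability that at least one request in a page view is slower than
    the percentile with the given number of nines (2 -> 99%, 3 -> 99.9%).

    Assumes requests are independent, which real traffic only approximates.
    """
    slow_fraction = 10.0 ** -nines
    return 1.0 - (1.0 - slow_fraction) ** requests_per_page


if __name__ == "__main__":
    requests_per_page = 36  # "around three dozen" requests per page view
    for nines in range(2, 6):
        p = chance_of_tail_hit(requests_per_page, nines)
        print(f"{nines} nines: {p:.3%} of page views hit the tail")
```

Under these assumptions, roughly 30% of page views would include at least one request slower than the 99th percentile, versus about 3.5% at three nines. That is why the higher nines matter for what a visitor actually experiences.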

Looking at the worst case over the day, NAND-based NVMe simply is not keeping up with Optane. Here is the QoS chart Intel put out at the P4800X launch:

Intel Optane SSD DC P4800X Predictable Fast Service

While our results were less dramatic, they are no less impressive. Looking at our maximum request times, the Intel Optane 900p was several milliseconds faster, which is an eternity.

Final Words

If you cannot tell by the tone of this article, changing things in our hosting cluster is not done lightly. It took us months of hands-on lab testing with multiple Intel DC P4800X units, about a dozen Optane 900p drives, and even several Optane Memory M.2 modules to get to the point where we were willing to add Optane-backed hosting to our cluster pool on a temporary test basis.

The results were simply shocking. Optane SSDs are a significant step up over NAND SSDs should RAM caching become infeasible. We do not upgrade the hosting cluster often, but after months of working with the drives, we have decided that the next-generation STH hosting cluster will have Optane alongside traditional NAND SSDs. We test a lot of hardware at STH and are very infrequently able to make black and white recommendations as we can with Optane.

7 COMMENTS

  1. We did this after your first review on some SQL databases and saw even more dramatic results. It’s like the best kept secret in tech because nobody else is shipping these drives so everyone still uses NAND and performance together.

  2. It would be interesting to see what happens when you put the mysql doublewrite buffer and undo/redo logs on an optane device while keeping the data files on a regular SSD. Could also be done with PostgreSQL WAL.

  3. Hello. Could you tell us how your cluster was implemented? Or write an article based on your experience with the cluster solutions integrated with you. My request is rather aimed at exchanging technological ideas, and/or used free software. I’m trying to make a cluster of three bare-metal servers each with Intel® SSD DC P4600 3.2TB, as a DRBD9 shared storage, and all three are active.

  4. What about post spectre & meltdown results? If certain reviews are to be believed, then the faster the drive – the harder is the hit to its performance numbers. I’d assume the optane would still be faster – but how much now?

  5. Ramsay – you are right. We are going to publish spectre & meltdown results as the ecosystem settles. There are a lot of folks tuning applications for patches, and when patching does not go well (e.g. the initial Intel microcode patches), numbers change.

    To give you an idea, Optane post-patch was still better than NVMe pre-patch although not to the same degree.
