In the early hours of 2018, a blogger going by python sweetness on Tumblr penned The mysterious case of the Linux Page Table Isolation patches. That article has spread like wildfire, with The Register picking it up. The Register’s piece here is worth a read. I may be a bit biased in favor of El Reg’s piece, as I have had the pleasure of knowing one of the authors, Chris Williams, for some time. Whenever I read his work I understand the passion that goes into it. Google has since owned up to finding the vulnerability. All that aside, read those three articles for more in-depth information about KPTI. Today we are going to cover the salient points you need to be armed with to start discussing this amongst your colleagues.
What you need to know about the Page Table / KPTI workaround bug
Unless you are a kernel developer, the low-level details are not going to be overly interesting. Here are the bullet points:
- We have heard that this bug impacts more than just Intel.
- It is primarily a security bug that leaves kernel memory potentially vulnerable to a userspace attack.
- Linux, Windows, and other teams are patching kernels to mitigate the impacts of the Intel silicon implementation.
- Linux and Windows are not the only OSes that will be impacted. As an example, Apple OSX was patched in 10.13.2 (early December 2017) for this.
- AMD maintains its silicon does not use the same implementation as Intel, and so it is not impacted by the bug. Going forward, we expect most Linux installations to run the patched kernel, so we also expect some performance degradation on the AMD side.
- Those especially concerned about this are providers of public multi-tenant infrastructure, where users can potentially launch an attack: AWS, Google, Microsoft, and others.
- This is not something that will be fixed via an Intel microcode patch.
- Unless a specific attack is developed, the as-is state is not inherently unstable. This is not like the Intel Atom C2000 series bug.
- Like that Intel Atom C2000 series bug, most of the folks that are in-the-know are under NDA/embargo.
- There is a performance impact. Most numbers we have seen peg it at under 0.5%. A near worst-case scenario can be 30%, and you should view anyone leading with that figure as sensationalist. For most consumer workloads, note that after the OSX 10.13.2 patch almost a month ago, there was little discussion of negative performance.
- If you have an Intel chip in production, you are impacted. We have heard that other offerings such as Qualcomm’s ARM server cores and others are impacted as well.
- Intel offered a rebuttal that you can read here.
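For readers who want to see what a given Linux box reports, here is a minimal sketch. It assumes a kernel new enough to expose a `vulnerabilities` directory in sysfs, which is not something covered above and will not be present on older or unpatched kernels. The function names `classify` and `read_status` are our own for illustration.

```python
from pathlib import Path

# Assumption: the running kernel exposes this sysfs file. Older or
# unpatched kernels may not have it, in which case we report "unknown".
MELTDOWN_STATUS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")


def classify(status_text):
    """Map the kernel's one-line status string to a short label."""
    text = status_text.strip().lower()
    if text.startswith("not affected"):
        return "not affected"
    if text.startswith("mitigation"):
        return "mitigated"
    if text.startswith("vulnerable"):
        return "vulnerable"
    return "unknown"


def read_status(path=MELTDOWN_STATUS):
    """Read and classify the status file; a missing file counts as unknown."""
    try:
        return classify(path.read_text())
    except OSError:
        return "unknown"
```

Calling `read_status()` on a host returns one of the four labels. An "unknown" result only means the kernel did not report a status, not that the system is safe.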
That should give STH readers a few talking points on the issue. Now for what STH is going to do.
What STH is doing
We did not want to sensationalize this too much. It is one of the more significant silicon bugs lately. At the same time, it has been an issue for a decade.
For us, there is an additional impact. We provide performance numbers for our readers and this is a case where performance is going to change between what we have published previously and what will be the go-forward reality. As a result, we need to address this.
Our current benchmark script takes several days to run but produces results that are extremely reliable because we keep the exact same stack for every run. Our current plan is as follows:
- Continue publishing our backlog of data on Ubuntu 16.04.3 LTS between now and April
- Work on a preview of legacy vs. patched results between now and April once patches mature. One example of why we are waiting is that Apple fixed this in OSX 10.13.2 but we have heard 10.13.3 has additional tweaks. We do not want to publish numbers until we achieve close to steady state for go-forward performance. While this is still in a high rate of change, we are abstaining from publishing formal numbers.
- Go-forward efforts, including some backtesting, will run on Ubuntu 18.04 LTS starting in April and will be treated as a new dataset.
We already have one DemoEval customer testing their web stack with the new kernel in our lab, and they are seeing a sub-1% performance delta, which is within expected test run variation. We have also heard and seen that heavy database applications are going to be impacted considerably more. If you need a few systems, we have the capabilities.
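To make "within expected test run variation" concrete, here is a rough sketch of one way to compare legacy and patched runs. It is not our benchmark script. The function names are hypothetical, the throughput figures are invented purely for illustration, and using the coefficient of variation as the noise floor is a simplifying assumption.

```python
from statistics import mean, stdev


def percent_delta(legacy, patched):
    """Percent change in mean result from legacy to patched (negative = slower)."""
    base = mean(legacy)
    return (mean(patched) - base) / base * 100.0


def run_variation(samples):
    """Run-to-run variation as the coefficient of variation, in percent."""
    return stdev(samples) / mean(samples) * 100.0


def within_noise(legacy, patched):
    """True when the delta is no larger than the noisier set's variation."""
    noise = max(run_variation(legacy), run_variation(patched))
    return abs(percent_delta(legacy, patched)) <= noise


# Hypothetical throughput figures (requests/sec), invented for illustration.
legacy_runs = [1000.0, 1010.0, 990.0, 1005.0]
patched_runs = [995.0, 1002.0, 988.0, 999.0]

print(f"delta: {percent_delta(legacy_runs, patched_runs):.2f}%")
print(f"within run-to-run noise: {within_noise(legacy_runs, patched_runs)}")
```

A delta that sits inside the spread of repeated runs on the same kernel is not a result you can attribute to the patch. A delta several times larger than that spread is worth investigating.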
If you are talking around the water cooler and hear people throw around 30% performance hits, take that with a grain of salt until you test. In fact, if someone states a blanket 5-30% performance loss, do not treat them as a reliable source. We expect most users to see a fairly minimal impact. At STH, we are going to provide a picture of performance deltas after the patches get a bit more mature since, from what we understand, there may be future performance mitigations available. If you want to test your software, we can help with Intel, AMD, and some ARM environments through DemoEval.
Ninja edit: AMD posted their official response here after this piece went live.