Addressing the CPU Kernel Page Table KPTI Workaround Fervor

Intel Xeon Silver 4110 Top And Bottom

In the early hours of 2018, a blogger going by python sweetness on Tumblr penned The mysterious case of the Linux Page Table Isolation patches. That article has spread like wildfire, with The Register picking it up. The Register’s piece here is worth a read. I may be a bit biased in favor of El Reg’s piece, as I have had the pleasure of knowing one of the authors, Chris Williams, for some time. Whenever I read his work I understand the passion that goes into it. Google has since acknowledged that it found the vulnerability. All that aside, read those three articles for more in-depth information about KPTI. Today we are going to talk about some of the salient points you need to be armed with to start discussing this with your colleagues.

What you need to know about the Page Table / KPTI workaround bug

Unless you are a kernel developer, the low-level details are not going to be overly interesting. Here are the bullet points:

  • We have heard that this bug impacts more than just Intel.
  • It is primarily a security bug that leaves kernel memory potentially vulnerable to a userspace attack.
  • Linux, Windows, and other teams are patching kernels to mitigate the impacts of the Intel silicon implementation.
  • Linux and Windows are not the only OSes that will be impacted. As an example, Apple OSX was patched in 10.13.2 (early December 2017) for this.
  • AMD maintains its silicon does not use the same implementation as Intel, and so it is not impacted by the bug. In the future, we expect most Linux installations will use the patched kernel so we expect some performance degradation on the AMD side.
  • Those especially concerned about this are providers of public multi-tenant infrastructure, where users can potentially launch an attack: AWS, Google, Microsoft, and others.
  • This is not something that will be fixed via an Intel microcode patch.
  • Unless a specific attack is developed, the as-is state is not inherently unstable. This is not like the Intel Atom C2000 series bug.
  • Like that Intel Atom C2000 series bug, most of the folks that are in-the-know are under NDA/embargo.
  • There is a performance impact. Most numbers we have seen peg it at under 0.5%. Some who favor sensational headlines like to point out that a near worst-case scenario can be 30%; treat the 30% figure as sensationalist. For most consumer workloads, note that after the OSX 10.13.2 patch almost a month ago, there was little discussion of negative performance.
  • If you have an Intel chip in production, you are impacted. We have heard that other offerings such as Qualcomm’s ARM server cores and others are impacted as well.
  • Intel offered a rebuttal that you can read here.
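
For readers who want to check where a given Linux box stands, here is a minimal sketch in Python. It assumes a kernel new enough to expose the `/sys/devices/system/cpu/vulnerabilities/meltdown` file (older or unpatched kernels will not have it) and looks for the `nopti` and `pti=off` boot parameters in `/proc/cmdline`. The function names are ours, chosen for illustration, and this is not an exhaustive audit of every way the mitigation can be configured.

```python
from pathlib import Path

# Assumption: these paths exist only on Linux, and the sysfs file
# only on kernels that ship the vulnerabilities reporting interface.
MELTDOWN_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")
CMDLINE = Path("/proc/cmdline")


def pti_disabled_by_cmdline(cmdline):
    """Return True if the boot command line asks the kernel to turn PTI off."""
    params = cmdline.split()
    return "nopti" in params or "pti=off" in params


def read_text_or_none(path):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def report():
    """Collect a short, human-readable summary of the local KPTI state."""
    lines = []
    status = read_text_or_none(MELTDOWN_SYSFS)
    if status is None:
        lines.append("Kernel does not report Meltdown status (file not found).")
    else:
        lines.append("Kernel reports: " + status)
    cmdline = read_text_or_none(CMDLINE)
    if cmdline is not None and pti_disabled_by_cmdline(cmdline):
        lines.append("PTI was disabled on the boot command line.")
    return lines
```

Calling `report()` and printing each returned line gives a quick answer before and after a kernel update. On non-Linux systems both files are absent, so the sketch simply reports that no status is available.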

That should give STH readers a few talking points on the issue. Now for what STH is going to do.

What STH is doing

We did not want to sensationalize this too much. It is one of the more significant silicon bugs lately. At the same time, it has been an issue for a decade.

For us, there is an additional impact. We provide performance numbers for our readers and this is a case where performance is going to change between what we have published previously and what will be the go-forward reality. As a result, we need to address this.

Our current benchmark script takes several days to run but produces results that are extremely reliable because we keep the exact same stack for every run. Our current plan is as follows:

  • Continue publishing our backlog of data on Ubuntu 16.04.3 LTS between now and April
  • Work on a preview of legacy vs. patched results between now and April once patches mature. One example of this is that Apple fixed this in OSX 10.13.2 but we have heard 10.13.3 has additional tweaks. We do not want to publish numbers until we achieve close to steady state for go-forward performance. While this is still in a high rate of change, we are abstaining from publishing formal numbers.
  • Go-forward efforts, including some backtesting, will start on Ubuntu 18.04 LTS in April and will be treated as a new dataset.

We already have one DemoEval customer testing their web stack with the new kernel in our lab, and they are seeing a sub-1% performance delta, which is within expected test run variation. We have also heard/seen that heavy database applications are going to be impacted considerably more. If you need a few systems, we have the capabilities.
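
If you are running your own before/after comparison, the arithmetic is simple enough to sketch in Python. The snippet below computes the percent change between a baseline and a patched set of runs and compares it to run-to-run variation. The function names and the requests-per-second figures are hypothetical, purely for illustration; they are not STH or DemoEval data, and a real study would want more runs and a proper statistical test.

```python
from statistics import mean, stdev


def percent_delta(baseline, patched):
    """Percent change in mean score from baseline to patched (negative = slower)."""
    base_mean = mean(baseline)
    return (mean(patched) - base_mean) / base_mean * 100.0


def run_to_run_noise(samples):
    """Coefficient of variation across repeated runs, as a percentage."""
    return stdev(samples) / mean(samples) * 100.0


def is_within_noise(baseline, patched):
    """True if the delta is no larger than the noisier of the two sets of runs."""
    noise = max(run_to_run_noise(baseline), run_to_run_noise(patched))
    return abs(percent_delta(baseline, patched)) <= noise


# Hypothetical requests-per-second results, higher is better.
baseline_rps = [1000.0, 1010.0, 990.0, 1005.0, 995.0]
patched_rps = [995.0, 1002.0, 988.0, 1001.0, 994.0]

delta = percent_delta(baseline_rps, patched_rps)
print(f"delta: {delta:.2f}%  within noise: {is_within_noise(baseline_rps, patched_rps)}")
```

With these made-up numbers the patched runs come out about 0.4% slower, which is smaller than the run-to-run spread, so the difference would not be distinguishable from normal variation.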

Final Words

If you are talking around the water cooler and hear people throw around 30% performance hits, take that with a grain of salt until you test. In fact, if someone blanket states a 5-30% performance loss, disregard them as a reliable source. We expect most users to see a fairly minimal impact. At STH, we are going to provide a picture of performance deltas after the patches get a bit more mature since, from what we understand, there may be future performance mitigations available. If you want to test your software, we can help with Intel, AMD, and some ARM environments through DemoEval.

Ninja edit: AMD posted their official response here after this piece went live.

8 COMMENTS

  1. “In the future, we expect most Linux installations will use the patched kernel so we expect some performance degradation on the AMD side.”

    Sorry, but this is pure fearmongering. First, the KPTI option can be disabled by means of a boot-time parameter. Second, and I’m certain that AMD will insist on that, the kernel can recognize the chip and only activate it for affected chips.

  2. Brian Krzanich will get a medal and a bonus of 1 billion USD from the US corporate/government for keeping his mouth shut for more than a decade about this issue.

  3. Linus Torvalds looks pretty angry

    “A *competent* CPU engineer would fix this by making sure speculation doesn’t happen across protection domains. Maybe even a L1 I$ that is keyed by CPL.

    I think somebody inside of Intel needs to really take a long hard look at their CPU’s, and actually admit that they have issues instead of writing PR blurbs that say that everything works as designed.

    .. and that really means that all these mitigation patches should be written with “not all CPU’s are crap” in mind.

    Or is Intel basically saying “we are committed to selling you shit forever and ever, and never fixing anything”?

    Because if that’s the case, maybe we should start looking towards the ARM64 people more.

    Please talk to management. Because I really see exactly two possibibilities: Intel never intends to fix anything OR these workarounds should have a way to disable them. Which of the two is it?
    Linus Torvalds

  4. Patrick, I know you wrote this article very early on, but as it stands it only adds to the confusion out there.

    The most important thing most people should know is that we are talking about two different bugs: Meltdown and Spectre.
    And that Spectre has 3 different attack vectors and they are not equally practical to exploit and not all CPUs are vulnerable to all 3 attack vectors. Spectre is the least known bug (most details are being kept secret for now).

    Intel and ARM A75 (not shipping yet) seem to be affected by Meltdown which is the most serious bug, but also the easiest to fix.

    Everyone else seems to be vulnerable to Spectre, but then not all CPUs are vulnerable to all 3 attack vectors. Here, it seems that vector 1 is the most serious and affects Intel but not AMD (maybe some ARM CPUs also).
    Vectors 2 and 3 seem to find much wider range of CPUs that are vulnerable, but it is yet unclear how dangerous and practical these attack vectors are.
    Unfortunately, mitigating Spectre is harder but again very few people right now understand what Spectre really means/is.

    So, different CPUs are vulnerable to different bugs and attack vectors. But it seems only Intel has CPUs vulnerable to Meltdown and all 3 attack vectors for Spectre (vector 1 being the most serious and easiest to exploit).

    I am sure this info will also become outdated soon, but the article here gives the impression of one bug, when we actually have several.

    Also, keep in mind that Spectre is being regarded as a new attack method and not a bug. It is as if we had all been investing in very thick front doors and locks but then someone figured out there is an unprotected side window that can be used to break in.

    Spectre being a new attack method, the 3 vectors known today could become 100 tomorrow. So Spectre is the wildcard.
