Several STH readers will have noticed that we had downtime last evening. It sucks. Almost everything is back up and performing well. While we are still finishing the process, we wanted to give the STH community a mini-postmortem on what went wrong, along with a quick update on the next-generation environment that we are bringing online soon.
The Bad: Power Outage in the Data Center
This happened. We are in the same data center suite (for now) as Linode, Android Police, and others. Last evening, the power to the suite was cut. We do not have official word on what happened, but it took a few hours to resolve. When this suite goes down, we are not the only ones impacted:
— Android Police (@AndroidPolice) June 21, 2018
And another example:
Connectivity Issues – Fremont https://t.co/S2LYPd3TUY
— Linode Status (@linodesys) June 21, 2018
The outage covered well over an hour, so this is not something that one can realistically fix with in-rack battery backups.
Normally, once power resumes, we are back online within 2-3 minutes while everything boots. We had a 2011 article on how to do this: Keeping Servers On: BIOS Setup for Availability (throwback Thursday!).
This time was different. There were three additional failures that we found after the power came back on.
The Worse: Failures Two through Four
After we got confirmation that power was back on, and that Linode and others with larger support teams and status pages were online, we realized something was wrong. STH was still down. As power was restored, we actually had three failures to deal with:
- The primary firewall would not boot. It seems dead due to the AVR54 bug.
- A switch connected to the emergency firewall was not powered on. We did not diagnose this on-site this morning as it will be replaced with a new 10GbE/40GbE switch next week anyway. A potential cause may be a dead PDU port. Update: It was the PDU port failure. You can see how we upgraded power here.
- A dual Intel Xeon E5-2699 V3 server decided it did not want to finish booting due to SATA SSD linkages.
It took about 30 minutes to design and build an emergency firewall appliance and get software loaded, then another 2 minutes to make coffee, followed by a 15-minute drive to the data center. Once on-site, we spent a few minutes diagnosing the issue to confirm the firewall was indeed the problem before replacing it.
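As a rough sanity check on the response time, the durations quoted above can be added up. This is only a sketch: the step names are illustrative labels, the 30-minute figure is approximate, and the on-site diagnosis ("a few minutes") is left out because no number was given for it.

```python
# Durations in minutes, taken from the paragraph above.
# The step names are illustrative labels, not anything official.
response_steps = {
    "design and build emergency firewall": 30,  # "about 30 minutes"
    "make coffee": 2,
    "drive to data center": 15,
}

# Minimum time from starting the build to arriving on-site.
# On-site diagnosis is excluded because its duration was not stated.
minutes_to_arrival = sum(response_steps.values())

print(f"Time until on-site: about {minutes_to_arrival} minutes")
```

So a little under 50 minutes passed between starting the build and arriving at the rack, before any diagnosis began.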
Intel Atom C2000 AVR54 Bug Strikes STH
Readers of STH may remember our pieces on the Intel Atom C2000 Series Bug and Intel Atom C2000 C0 Stepping Fixing the AVR54 Bug. In Intel’s Atom line of that generation, there is a nasty bug that, over time, degrades clock signals to the point that systems will not boot. The Intel Atom C2000 series is the most famous of these, but there are others of this era, as we covered in our piece Another Atom Bomb Intel Atom E3800 Bay Trail VLI89 Bug. We are not going to name the vendor (it was not Supermicro despite the images we are using) because it is a known Intel Atom issue.
At STH, we have an in-line firewall that is designed to cut traffic when it dies. There are other options for this such as the bypass designs that we covered in our Supermicro A1SRM-LN7F-2758 Review. Given we do not sell products or services on STH, a few minutes of downtime is essentially a rounding error for our application so the safer option is to endure a few minutes of downtime during maintenance.
This time, we simply swapped the old unit out for a Xeon D-1500 based solution and got everything running. It was faster than bringing everything up in another data center, then migrating everything back to the original infrastructure.
On the Mend: Cleaning-up and the Future
The STH hosting cluster is actually smaller in anticipation of updating over the next ten days or so. We normally only visit the hosting racks once every other quarter, usually when we find something, as in What is the ZFS ZIL SLOG and what makes a good one, that gets us to upgrade immediately. The dual Intel Xeon E5-2699 V3 node was offline while the site was still running, but when it came back online and everything was balanced again, STH page load times went from about 2.85s to 1.42s, roughly half, which is a big improvement.
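To put the page load change in perspective, here is a quick calculation using only the two figures quoted above. The variable names are just for illustration.

```python
# Page load times quoted above, in seconds, before and after the
# dual Xeon E5-2699 V3 node rejoined the cluster.
before_s = 2.85
after_s = 1.42

# Relative reduction in load time and the equivalent speedup factor.
reduction_pct = (before_s - after_s) / before_s * 100
speedup = before_s / after_s

print(f"Load time reduction: {reduction_pct:.1f}%")
print(f"Speedup: {speedup:.2f}x")
```

That works out to roughly a 50% reduction, or about a 2x speedup, from bringing one node back into the pool.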
What is next for our environment? CPUs and systems are in. We are preparing for an EPYC upgrade next. Stay tuned. The switch that went down was going to be replaced anyway, so that is getting a significant upgrade next time.
Indeed. If you buy a retail AMD EPYC 7000 series processor, there is an enormous case badge in the package that will not fit a rackmount server bezel.
Want to Know More About Hosting?
We keep tabs on our maintenance needs for our Falling From the Sky Part 4 Leaving the Cloud 5 Years Later series, as well as any drive failures for pieces like Used enterprise SSDs: Dissecting our production SSD population. Our failures this week are now in our tracking sheets.
There are certainly things that can be designed better in our infrastructure. We have plenty of bandwidth, colo space, and all of the hardware one could want. Part of STH’s early mission was to find the balance between overbuilding (and seeing failures from complexity) and underbuilding (and seeing failures like these). That is still a journey. We should not have left the firewall with the AVR54 bug installed. This was a risk that we took, and it bit us.