Earlier this month, I tweeted a fun little snippet early on a Monday morning. Several people on our team had been talking about it, we received a few external requests, and people found it interesting: we had an AMD EPYC 7601 server that had been up for almost two years.
Finding an AMD EPYC server that has been up for 725 days at around ~70% load average is like an archeological find. This has been running since only about a quarter after EPYC was in solid supply. Alas, time to reboot. pic.twitter.com/XnFc9URKiJ
— Patrick J Kennedy (@Patrick1Kennedy) December 2, 2019
The 725 Day AMD EPYC Server
Here is a quick shot of the uptime report before rebooting the server.
The story behind the Tyan Transport SX AMD EPYC-powered server, a predecessor to the Tyan Transport SX TS65A-B8036 2U 28-bay AMD EPYC server we just reviewed, is fairly simple. At STH, we run our own infrastructure. As a result, I have a strong belief that if STH is going to recommend something, I should be willing to use it.
During the AMD EPYC 7001 launch, AMD AGESA, the basic code underpinning EPYC systems, was rough to say the least. To give an example, we rebooted a test system the day before the launch event and it ran into an AGESA bug that required pulling the CMOS battery. That would be annoying on a consumer platform, but it is an extra level of annoyance on a server. Pulling a CMOS battery is not a simple remote hands job like a power cycle or drive swap when you are 1,500 miles away. That is why we pushed back our Dual AMD EPYC 7601 review. I refused to review a server or a chip when the AMD AGESA was that rough, even if it meant we would not be first to publish benchmarks.
The Tyan Transport SX server was our primary 1P benchmarking system for some time. It then transitioned into another role: we hammered the box in the lab to see how long it would take to fail. After the AGESA updates in 2017, we looped through different Docker benchmark containers constantly to see when it would fail.
By mid-2018, we started to deploy EPYC in our production web hosting setup. First, we utilized the systems for a backup server, then more nodes came online for hosting and some of the other bits. As I am writing this, we are around 50/50 Intel Xeon and AMD EPYC. This server went a long way in proving that the new platform can be trusted.
On December 2, 2019, we needed to physically move the box. When we logged back in, the server had been dutifully looping benchmarks for over 725 days (we let it hit 726 days before moving it). It had zero errors in that time, which is impressive, but it is also what one would expect from an Intel Xeon server.
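For readers who want to sanity-check the timeline, here is a minimal sketch of the date arithmetic. It assumes the 725-day figure was read on December 2, 2019 (the date of the tweet); the helper names `boot_date` and `uptime_in_years` are made up for illustration and are not part of any tooling we used.

```python
from datetime import date, timedelta


def boot_date(observed_on: date, uptime_days: int) -> date:
    """Return the calendar date on which the machine last booted.

    Assumption: uptime_days is a whole number of days, as shown in the
    summary line of an uptime report.
    """
    return observed_on - timedelta(days=uptime_days)


def uptime_in_years(uptime_days: int) -> float:
    """Convert an uptime in days to years, using 365.25 days per year."""
    return uptime_days / 365.25


# Assumed observation date: the day the tweet was posted.
observed = date(2019, 12, 2)

print(boot_date(observed, 725))        # 2017-12-07
print(round(uptime_in_years(726), 2))  # 1.99
```

Under that assumption, the last boot lands in early December 2017, which lines up with the AGESA updates in 2017 mentioned above.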
The AMD EPYC 7002 series launch was completely different. Everything just worked for us. A huge amount of maturation happened on the EPYC 7001 platforms. In interviews, the AMD team talks about getting the ecosystem primed. This is a great example of how far that has come.
For those wondering whether the AMD EPYC platform can run 24×7, this is a great example of a system that ran, without issue, longer than most servers practically will. Today, servers are getting more frequent security updates and patches. Even with live patching, sometimes a reboot is required. 726 days, or essentially two years, is a great result that frankly exceeded our team’s expectations.