Thanks to a reader who sent in this Reddit post, we were alerted that cores in the AMD EPYC 7002 “Rome” series can hang after just under 3 years of uptime, or around 1044 days. While processors have many bugs given their complexity, this one is particularly interesting.
AMD EPYC 7002 Rome CPUs Hang After Less Than 3 Years of Uptime
This is not just speculation. It is an official AMD erratum, Erratum 1474, documented in 56323-PUB_1.01.
Description (Source: AMD Revision Guide for AMD Family 17h Models 30h-3Fh Processors)
A core will fail to exit CC6 after about 1044 days after the last system reset. The time of failure may vary depending on the spread spectrum and REFCLK frequency.
Potential Effect on System
A core will hang.
Suggested Workaround
Either disable CC6 or reboot system before the projected time of failure.
Fix Planned
No fix planned
For most of our readers, machines are rebooted every so often for security patches or other maintenance windows. Still, this is a fairly big deal since, short of disabling CC6, the remedy is effectively rebooting the system.
We checked the STH lab and found that an HPE AMD EPYC 7002 “Rome” system we had forgotten about reached 2 years and 261 days, or 991 days, of total uptime running Proxmox VE before it was decommissioned. It had such high uptime because it was part of a lab project outside our normal management tools, and we apparently forgot it was there.
If a typical server lifecycle is 5 years these days, then one might need a minimum of a single reboot over a server's lifetime to avoid this bug, so long as that reboot happens between days 913 and 1044. Then again, a number of our readers, used to rebooting for regular security patches, are going to think this is silly. Others are going to think this bug is a major pain to track and deal with. If you are the type of admin who keeps a server up for around three years, then this might impact you.
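The single-reboot arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1044-day figure is treated as a hard limit (the erratum says it is approximate), a year is taken as 365 days, and the function name `single_reboot_window` is our own. Under those assumptions, any reboot that leaves both uptime stretches shorter than 1044 days works, so the 913 to 1044 range quoted above falls inside a somewhat wider window.

```python
ERRATUM_DAYS = 1044   # approximate figure from the erratum text, treated as exact here
DAYS_PER_YEAR = 365   # assumption: leap days ignored


def single_reboot_window(lifetime_days, limit_days=ERRATUM_DAYS):
    """Return (earliest, latest) day for one reboot that keeps both
    uptime stretches under limit_days, or None if one reboot is not enough."""
    # The stretch after the reboot lasts lifetime_days - reboot_day days.
    earliest = max(1, lifetime_days - limit_days + 1)
    # The stretch before the reboot lasts reboot_day days.
    latest = min(lifetime_days, limit_days - 1)
    if earliest > latest:
        return None
    return earliest, latest


if __name__ == "__main__":
    print(single_reboot_window(5 * DAYS_PER_YEAR))
```

For a 5-year lifecycle this gives a window of day 782 through day 1043. Once the planned lifetime reaches twice the limit, the function returns `None` because a single reboot can no longer cover it.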
Perhaps the next week should be dedicated to looking for older AMD EPYC “Rome” systems and seeing if any have more than 900 days of uptime.
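For anyone doing that hunt on Linux hosts, a minimal sketch of an uptime check might look like the following. It assumes that uptime since boot matches the erratum's "last system reset", it reads the standard `/proc/uptime` file, and it does not verify that the CPU is actually an EPYC 7002 part. The function names and the three labels are hypothetical, and the 900-day warning threshold is simply the figure mentioned above.

```python
SECONDS_PER_DAY = 86_400
ERRATUM_DAYS = 1044   # approximate; the erratum says it varies with REFCLK and spread spectrum
WARN_DAYS = 900       # warning threshold taken from the text above


def read_uptime_days(path="/proc/uptime"):
    """Linux only: the first field of /proc/uptime is seconds since boot."""
    with open(path) as f:
        return float(f.read().split()[0]) / SECONDS_PER_DAY


def classify(uptime_days):
    """Map an uptime in days to a coarse, hypothetical status label."""
    if uptime_days >= ERRATUM_DAYS:
        return "past-limit"
    if uptime_days >= WARN_DAYS:
        return "reboot-soon"
    return "ok"


def days_remaining(uptime_days):
    """Days left before the approximate 1044-day mark (never negative)."""
    return max(0.0, ERRATUM_DAYS - uptime_days)
```

On a host, `classify(read_uptime_days())` gives the label for the local machine. The forgotten lab system at 991 days would have come back as `reboot-soon`, with roughly 53 days to spare.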