The Intel Atom C2000 Series Bug – Why it is so quiet

Intel Rangeley Avoton CPU Package

We have now received responses from eight different vendors (with more inquiries outstanding) on the Intel Atom C2000 bug. Intel noted in its latest earnings call that it needed to set aside a reserve for this bug. At STH we were the first to review Avoton and Rangeley products, so we started digging. We also want to note that we now have 22 Rangeley server and networking products in production. The newest is 20 months old and the oldest is over 3 years old, and we have had zero failures thus far. We expect to see more vendors announce solutions as early as this week, but we wanted to provide a bit of insight into what is happening. We also believe (unconfirmed) that this may be a reason we recently saw yet another Denverton delay.

The Bug – A few non-NDA Sources

You can see a new erratum in the latest Intel Atom C2000 family spec update. You can also see Cisco’s page on the bug and equipment replacement here. Here is AVR54, which popped up in the latest spec update:

AVR54. System May Experience Inability to Boot or May Cease Operation

Problem: The SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (Low Pin Count bus clock outputs) may stop functioning.

Implication: If the LPC clock(s) stop functioning the system will no longer be able to boot.

Workaround: A platform level change has been identified and may be implemented as a workaround for this erratum.

(Source: Intel Atom C2000 family spec update dated January 2017)

From Cisco’s FAQ:

Q: Do you expect these products to fail at 18 months in operation?
Although the issue may occur beginning at 18 months in operation, based on information provided by the supplier, we don’t expect an unusual spike in failures until year three of runtime.

(Source: Cisco Clock Signal Component Issue FAQ)

Responses from Vendors

Cisco was perhaps the most aggressive in getting platforms RMA’d and fixed and has been very public about it. Other vendors have been slower to announce plans.

  • Cisco: Cisco has been extremely proactive on this. There is a full support page on the issue. They are replacing products for customers under warranty/support contract as of November 16, 2016. My first tech job was in a Cisco factory, in a department working on refurbished boards. Twenty years later, I would expect them to perform platform-level fixes on the current boards that are RMA’d.
  • HPE: HPE is stating that although it uses the Atom C2000 series (e.g. in the Moonshot), it is not seeing customers with problems. Update June 2017: HPE has issued a customer advisory: “To the best of our knowledge, there have been no failures of any HPE product due to this issue.” (from the advisory update dated 2017-06-15, accessed 2017-06-22)
  • Juniper: Several parts seem to be impacted per the current TSB: MPC7E-10G, MPC7E-MRATE, MX2K-MPC8E, MX2K-MPC9E, EX9200-12QS, EX9200-40Xs, FPC3-PTX-U2, FPC3-PTX-U3, FPC3-SFF-PTX-U1, CB2-PTX, PTX-IPLC-B-32, PTX-ILA-M-AC, PTX-ILA-M-CHAS
  • Fortinet: Fortinet has a Bulletin on the parts.
  • Supermicro: RMA for platform-level workaround available for concerned customers. We also did confirm that Supermicro has implemented the platform level workaround in products shipped from January 2017 onwards.
  • Netgate: Netgate, the company behind pfSense firewalls, just (7 Feb 2017) posted information about a component issue and how they are handling replacements.
  • QCT: QCT confirmed that the C2758 CPUs in its switch control planes are impacted by a clock signal component issue, for example in QCT’s Broadcom Trident II switches such as the bare metal switch version of our T3048-LY8 lab switches. Note that ARM variants like the ones we are using are not impacted.
  • iXsystems: FreeNAS Mini and FreeNAS Mini XL are impacted. The company has a FAQ here. Any units produced after February 2017 will have the fix, and warranties are being extended as a result.
  • Synology: The company is extending the warranty of Intel Atom C2000 series models. It says it is working on a fix even after having shipped “hundreds of thousands” of devices based on the processor. Synology is also the first vendor to publicly acknowledge Intel’s Atom C2000 is faulty.
  • ASRock Rack: Not seeing major issues from customers yet. Working with a vendor on this issue. Current course of action is to replace boards under three-year warranty.
  • Netgear: The company has a KB article addressing NAS units that all contain Intel Atom C2000 parts. The company is proactively reaching out to registered users of the products to determine how to address concerns.
  • Sangoma: Sangoma publicly stated that the C2000 bug impacts several of their UC and PBX systems. Here is the customer letter. The company will extend warranties to 5 years for this specific issue.
  • Others: Virtually every vendor uses Avoton/Rangeley in some capacity. There is a good chance that if you are running an SDN switch with something like Cumulus Linux on it, you have a Rangeley chip inside. These things are everywhere. Most vendors cited NDA and declined to comment.

If you are from a company or know of a company that has publicly disclosed their plans, please let me know (e-mail patrick at this domain) and we will update this post.

Industry-wide NDA?

If you notice from Cisco’s FAQ, Cisco is declining to provide specifics:

Q: Who supplies the impacted component?
As a matter of policy, Cisco stands behind the reputation of our products. We do not intend to publicly name the supplier.

(Source: Cisco Clock Signal Component Issue FAQ)

Thus far (February 7, 2017) we have received responses from eight vendors who supply Rangeley products, excluding Cisco. Every single vendor has declined to discuss specifics, citing NDA; some called me directly to say they were not responding due to NDA concerns.

No vendor confirmed this. However, putting the pieces together, we can see that all of the vendors are giving us the “cannot talk about this due to NDA restrictions” response. We can also see that Intel set aside a large reserve. Our educated guess is that Intel may have tied access to those reserve funds to signing an NDA that prohibits discussing the issue.

How bad is it?

Since we now have almost two dozen Intel Atom C2000 series machines deployed for 20-40 months, from 6 different vendors, we feel that this is not going to be a case where every machine fails immediately at 18 months or 36 months. While this is a small sample size, we can at least rule out “every” device failing. We confirmed this with a few of STH’s web hosting industry readers who we know have Avoton/Rangeley deployed in much greater quantities.
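
To put a rough number on what a small sample with zero failures can and cannot tell us, here is a minimal sketch in Python. It assumes 22 units (the count of machines we have in production), that units fail independently with the same probability over the observed period, and that zero failures have been seen. The function names are our own illustration, and the independence assumption is a simplification. The result says nothing about failures that may occur after the observed runtime.

```python
import math


def zero_failure_upper_bound(n_units: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the per-unit failure probability
    after observing 0 failures among n_units.

    Assumes independent units with an identical failure probability p.
    Seeing zero failures has probability (1 - p)^n; we solve
    (1 - p)^n = 1 - confidence for p.
    """
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - math.exp(math.log(alpha) / n_units)


def rule_of_three(n_units: int) -> float:
    """Common shortcut for the 95% bound with zero observed failures: 3 / n."""
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    return 3.0 / n_units


if __name__ == "__main__":
    n = 22  # assumption: all 22 units are past the observed runtime, none failed
    print(f"exact 95% upper bound: {zero_failure_upper_bound(n):.1%}")
    print(f"rule-of-three estimate: {rule_of_three(n):.1%}")
```

Under those assumptions, the exact bound comes out to roughly 12.7% and the rule-of-three shortcut to roughly 13.6%. A sample this size is therefore consistent with anything from a negligible failure rate up to about one unit in eight failing over the observed period. It rules out "every device fails" but cannot rule out a meaningful failure rate, which is why the reports from much larger deployments matter.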

UPDATE 2017-02-09: Online.net just posted that they are aware of a vendor component issue but are not seeing high failure volumes, even across almost 68,000 nodes. Assuming this is related to the C2000 bug, this would support what we have seen in our small deployments and heard from other web hosts with thousands installed themselves.

At the same time, we do urge manufacturers to have longer-term replacement plans available in the event failures do occur. We also hope to see vendors clearly spell out what their replacement policies are on this issue in a centralized place, similar to how Cisco has done.

Please do let us know if you hear of any programs or responses to this Intel Atom C2000 series bug and we will update this article accordingly.

Discuss this article on the STH forums.

7 COMMENTS

  1. There are expenses related that aren’t readily obvious…you are a company that has a support contract with an authorized third party supplier of Cisco hardware. That support contract included the replacement of failed hardware, or in this case, basically a hardware recall…now those third parties have to make an onsite visit to each location to replace the hardware they have insured…that cost goes beyond the “ship it to us at no cost to you, and we’ll replace the unit”…the value added reseller is going to take a hit because they will be making onsite replacement visits to replace units they just deployed under contracts based on not needing to visit a location to replace failed hardware for years…instead they are expending their resources in less than a year to replace the defective units…this is why Cisco jumped onboard first…because they will need to reimburse their authorized value added resellers who now are going back to sites at their own cost to replace the defective units, and with the 4300ISR line being affected and really getting a HORRIBLE name by Cisco’s own fault, the last thing Cisco wanted was to ignore a hardware timebomb failure in the 4300, since the 4300 (XE IOS as a whole) has been a joke.

  2. Late to the game but just as a note to people reading this in the future… this problem is/was very real. I have seen a few boards fail and now my personal NAS board, an Asrock Rack C2550D4i has just packed in. The system seized up after a few days of slowing down (I noticed system clock drift too). Now it doesn’t POST at all, despite IPMI working fine.

    This was after 20 months of usage.

    Got a new board which hopefully has the fix applied but it’s definitely one to look out for.

  3. Ben – there is also an issue with the C2550/C2750 with FreeNAS and the watchdog timer causing too many writes to some flash on the MB – it could be that too.

  4. Does anyone know if Intel has actually fixed their CPUs, or are they expecting everyone to put a resistor between two of the pins to keep this defective product from toasting itself?

    I’m curious because I was actively looking at a Synology, but won’t if the CPUs haven’t actually been fixed.

    Synology extended their warranty by a year, but after that if your unit dies you’re basically SOL.
