Are RAID controller lithium-ion batteries ticking time bombs within your server? Over this past weekend, we saw an IBM ServeRAID battery pack that looked ominous. Everyone knows that RAID controller batteries need to be replaced at regular intervals. That is one of the key drivers towards supercapacitors on RAID cards. Before this weekend, our assumption was that battery replacement was a matter of keeping data safe in the event of a power failure. After this weekend, we now think it may be a safety issue.
Background: Why RAID Controllers Have Batteries
These days, the vast majority of RAID controllers are SAS (or specifically SAS3) based. There are some low-end offerings from companies like Marvell that offer SATA III RAID 0 or 1 as an example, but these are meant more for boot drives, simple NAS units, and similar applications. For parity RAID levels such as RAID 5 and RAID 6, the controller must do XOR calculations to figure out what data to write on each device anytime data is written to an array. By modern processor standards, this is not an enormously difficult calculation even on 8x 10TB hard drives. However, it still takes time.
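To make the XOR step concrete, here is a minimal Python sketch of RAID 5-style single parity. It is only an illustration: the function names and the tiny 4-byte blocks are our own assumptions, real controllers do this in hardware on much larger stripes, and RAID 6 adds a second, more involved parity calculation that is not shown here.

```python
def xor_parity(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks byte by byte to produce one parity block."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    parity = bytearray(length)  # starts as all zero bytes
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover one lost data block by XORing the survivors with the parity."""
    return xor_parity(surviving + [parity])


# Hypothetical stripe: three data blocks, one per drive.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(stripe)

# Simulate losing the second drive and rebuilding its block.
recovered = rebuild_missing([stripe[0], stripe[2]], parity)
assert recovered == stripe[1]
```

The same XOR routine handles both generating parity and rebuilding a lost block, which is why every write to a parity array carries this extra work.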
For many generations, companies that produce RAID controllers have solved the challenge of getting better write performance by adding DRAM. The host system can then write data to the RAID controller. This data is temporarily stored in DRAM until the parity calculations are complete and the data is flushed to disk.
We first covered the basics of writing data to storage in our piece: What is the ZFS ZIL SLOG and what makes a good one.
That is a good article to read. Although it is focused on ZFS write caching, it is similar conceptually to write back cache on a RAID controller where you would typically have a battery back up unit (BBU) installed. By having the BBU installed, in the event of power loss, the RAID controller can keep data held in onboard DRAM awaiting a write for several hours. In the case of flash-backed write cache (FBWC) RAID controllers, the data in DRAM is written to flash during a power failure. That operation is powered by the BBU.
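As a rough illustration of why the BBU matters for write-back caching, here is a toy Python model. It is a sketch under our own assumptions, not how any controller firmware actually works: the `WriteBackCache` class and its method names are hypothetical, and the "disk" and "DRAM" are plain dictionaries.

```python
class WriteBackCache:
    """Toy model of a RAID controller's DRAM write-back cache."""

    def __init__(self, has_bbu: bool):
        self.has_bbu = has_bbu
        self.dirty = {}  # block address -> data waiting in DRAM
        self.disk = {}   # block address -> data persisted to disk

    def write(self, lba: int, data: bytes) -> str:
        # The host gets its acknowledgement as soon as the data is in DRAM.
        self.dirty[lba] = data
        return "ack"

    def flush(self) -> None:
        # Later, the cached data is written out to the drives.
        self.disk.update(self.dirty)
        self.dirty.clear()

    def power_loss(self) -> None:
        # Without a BBU, DRAM contents vanish with the power.
        if not self.has_bbu:
            self.dirty.clear()

    def power_restored(self) -> None:
        # Whatever survived in DRAM is flushed once power is back.
        self.flush()


protected = WriteBackCache(has_bbu=True)
protected.write(0, b"payload")
protected.power_loss()
protected.power_restored()
assert protected.disk == {0: b"payload"}

unprotected = WriteBackCache(has_bbu=False)
unprotected.write(0, b"payload")
unprotected.power_loss()
unprotected.power_restored()
assert unprotected.disk == {}  # the acknowledged write was lost
```

In both cases the host received an acknowledgement for the write. Only the cache with a working battery actually delivers the data to disk after the outage.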
For those in the RAID industry, or operations teams with RAID cards deployed, RAID controllers with BBUs have been around for a long time and are essentially a known quantity. The BBUs need to be replaced at regular intervals as a loss in battery capacity can lead to unsuccessful backup during a power event. This weekend, we saw another reason to replace batteries.
A RAID Controller Battery Time Bomb Discovered
This weekend we had a major refresh of the lab since we have another 12kW of servers that just arrived or are inbound. It was time to clear out any server from the Intel Xeon E5-2600 V1/ V2 generation. Most data centers have already done this, but we keep machines in the lab simply to perform backtesting. These servers stay offline and are largely forgotten until the need arises to back-test a benchmark on an older generation CPU.
We pulled one of the servers to ready it for final retirement. This was a 2011/2012 era server that had a (Broadcom) LSI SAS 2108 RAID controller, the IBM ServeRAID M5014, inside. Since the Intel Xeon E5-2600 V3 Haswell-EP launch, the server has been powered off save for a benchmarking run once a quarter or so. We also did not utilize the RAID controller, as our benchmarking standard is a directly attached SATA drive. It was physically in the server, but we had not used the IBM M5014 in years. When we pulled the RAID controller, the LSI BAT1S1P battery pack, part of the LSIiBBU08 kit, looked different.
We pulled the sticker off and there was indeed a large bulge underneath. The battery pack has deformed over the years.
The LSIiBBU08 and LSI BAT1S1P were extremely popular. They were used on IBM ServeRAID cards like the M5014 we have and the higher memory M5015. They were also used for just about every LSI SAS 9200 series RAID controller with onboard memory. These included popular cards like the LSI SAS 9260-8i, 9261-8i, and 9280-8i, as well as countless OEM versions.
While the LSIiBBU08 is the lithium polymer version, the LSIiBBU07 was the lithium-ion variant with a 1-year replacement cycle. At this point, the controllers that both kits serve are nearing the end of their useful life. The SAS2 generation was PCIe 2.0 only and designed for a hard drive world. Still, our battery was manufactured in 2011 and there are many organizations with Sandy Bridge and Ivy Bridge generation Intel Xeon E5-2600 V1/ V2 servers in the field. If our battery looks like this, there is a good chance there are others in the field that look the same.
This Should Not be an Issue
To be fair, we should point out that a BBU kit should never be left in service for seven years. With these BBU kits, the 12-hour data retention time setting should yield a 3-5 year replacement cycle. Proper maintenance would have meant replacing the battery. At the same time, had we not pulled the server, we would not have known that the BBU was in such a physically degraded shape.
These days, many RAID controllers utilize supercapacitors that should not have the same replacement needs as the LSI BBU units. At the same time, we just got a new HPE Gen10 server in the lab with a lithium-ion battery, so these types of concerns are going to continue for years to come.
Everyone likes to think that their products are being used and maintained properly. With RAID controllers that utilize battery back up units, there is extra maintenance required to refresh batteries. This is less critical for servers that are expected to last 36 months. Most organizations have a server that sits somewhere in their infrastructure that is well over 3-5 years old. For those servers, if you use traditional RAID cards, ensure that you have a plan to regularly replace batteries. It is not enough to stop using the BBU feature for power loss protection. Bulging batteries can be a precursor to larger failures on a range of devices. You do not want a battery bursting in a server.
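A replacement plan can start as something as simple as checking battery manufacture dates against the 3-5 year cycle mentioned earlier. The Python sketch below is one hypothetical way to do that. The function names, the status labels, and the exact dates are our assumptions (we only know our battery was made in 2011), and in practice the manufacture date would come from your own asset records or controller tooling.

```python
from datetime import date


def battery_age_years(manufactured: date, today: date) -> float:
    """Approximate battery age in years."""
    return (today - manufactured).days / 365.25


def replacement_status(
    manufactured: date,
    today: date,
    min_years: float = 3.0,  # start of the 3-5 year replacement window
    max_years: float = 5.0,  # end of the 3-5 year replacement window
) -> str:
    """Classify a BBU as 'ok', 'due', or 'overdue' for replacement."""
    age = battery_age_years(manufactured, today)
    if age >= max_years:
        return "overdue"
    if age >= min_years:
        return "due"
    return "ok"


# Assumed dates for illustration: a battery made in mid-2011, checked in mid-2018.
status = replacement_status(date(2011, 6, 1), date(2018, 6, 1))
print(status)
```

A battery like ours, roughly seven years old, lands well past the end of the window.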
If you are buying a new server, get a supercap solution or rely upon an HBA with software RAID or software-defined storage.