Hardware RAID controller cache battery failure frequency/lifetime?
Solution 1
I suspect your Supermicros are broken one way or the other - possibly the battery packs are overheating. Most recent LSIs would report the temperature through MegaCLI - you might want to monitor this value on servers which needed replacement.
root@host:~/SOLARIS# ./MegaCli -AdpBbuCmd -GetBbuStatus -aALL
BBU status for Adapter: 0
BatteryType: BBU
[...]
Temperature: 41 C
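To track that value across many servers, the text output can be scraped. The following is a minimal sketch, assuming the output keeps the `BBU status for Adapter: N` and `Temperature: NN C` line shapes shown above; the function names and the 45 C warning threshold are hypothetical choices for illustration, not vendor limits.

```python
import re

# Sample text modeled on the excerpt above; real output contains many more fields.
MEGACLI_SAMPLE = """\
BBU status for Adapter: 0
BatteryType: BBU
Temperature: 41 C
"""

# Hypothetical alert threshold, not a vendor-specified limit.
WARN_CELSIUS = 45


def bbu_temperatures(text):
    """Return {adapter_number: temperature_in_C} parsed from GetBbuStatus text."""
    temps = {}
    adapter = None
    for line in text.splitlines():
        header = re.match(r"\s*BBU status for Adapter:\s*(\d+)", line)
        if header:
            adapter = int(header.group(1))
            continue
        # Only numeric readings match; lines such as "Temperature : OK" are skipped.
        reading = re.match(r"\s*Temperature:\s*(\d+)\s*C\b", line)
        if reading and adapter is not None and adapter not in temps:
            temps[adapter] = int(reading.group(1))
    return temps


def hot_adapters(temps, limit=WARN_CELSIUS):
    """Return the sorted adapter numbers whose BBU temperature is at or above limit."""
    return sorted(a for a, t in temps.items() if t >= limit)


if __name__ == "__main__":
    temps = bbu_temperatures(MEGACLI_SAMPLE)
    print(temps, hot_adapters(temps))
```

Feeding it the captured output of the command above from each host would give a per-adapter reading that can be logged and compared against the servers that needed replacements.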
I have seen a couple of Dell and Fujitsu systems with LSI BBU controllers, and none of them needed yearly battery pack replacement (unless the pack had been ruined by deep discharge). The typical lifetime has been around 3 to 5 years.
Solution 2
My experience with the IBM versions of the LSI platforms, over a few hundred installs, is that the average battery barely makes 2 years, and supercaps aren't any better. Some of the failures can be fixed with a firmware update, but LSI just hasn't got it right. I have had about 75% supercap failures in the first 2 years.
Solution 3
Average battery life should be 3-5 years. And don't forget that flash-based FBWC also fails. I don't know why or how, but we were replacing them fairly regularly on our HP servers. It should last longer than the battery, but I don't have statistics from our individual servers.
The standard way to prevent the effects of a failed battery and of battery learning is to have multiple batteries. This is how HP storage (like the HP EVA) does it: you have 2 hot-plug batteries, and while one is low on charge or being replaced, the controller works with the remaining one. I'm not sure if it is possible to have multiple batteries connected to a Smart Array, but the hpacucli diag output suggests it should be supported:
Battery 1 firmware is up to date.
Battery 2 not present.
Battery 3 not present.

Battery Status:   Battery 1       Battery 2  Battery 3
---------------   ---------       ---------  ---------
Present:          YES             NO         NO
Responding:       YES             N/A        N/A
PIC Revision:     52              .          .
Status:           0x80            .          .
Extra Status:     0x01            .          .
Enabled:          FALSE           .          .
Charging:         FALSE           .          .
Good:             TRUE            .          .
Open:             FALSE           .          .
Shorted:          FALSE           .          .
Sample Err:       FALSE           .          .
Control:          0x00            .          .
Load Current:     (0x70) 24.6mA   .          .
Per Memory Chip:  4920uA          .          .
Voltage:          (0xae) 5640mV   .          .
Capacity:         100%            .          .
Depletion count:  0x00            .          .
ewwhite
Updated on September 18, 2022
Comments
-
ewwhite over 1 year
I'm in an environment that contains many Supermicro servers equipped with Adaptec and LSI MegaRAID hardware RAID controllers. These controllers contain battery-backed cache modules to help boost write performance and protect data in-transit.
A frequent support issue is RAID controller battery failure. This shifts the array from write-back to write-through mode. There's clearly a negative performance impact as the system runs with degraded write speed. This persists until a downtime window can be established to power the system down and replace the battery.
This is a very routine operation for us; almost weekly across several thousand physical servers... We even have charging stations in place to prep replacement batteries so that they can be swapped in without a charge cycle.
Perhaps I'm spoiled by a long history with HP ProLiant servers and Smart Array RAID controllers, but HP systems typically had battery lifetimes of 4-6 years. HP eventually eliminated the use of RAID batteries around 2009, replacing them with supercapacitor-backed memory modules (flash-backed write cache, or FBWC) that don't require replacement, disposal or a lengthy initial charge cycle.
Since I see the Adaptec and LSI controller battery failures sometimes occurring on systems that have been in service for less than 12 months, I wonder if this is common in other environments.
If this is common, how do other large server environments handle this?
- Any tips or tricks to handling RAID battery replacements?
- Are there any configuration parameters that can help?
- How disruptive is this to operations in your environment?
- Could poor chassis cooling and temperature be a factor?
- Are we doing something wrong?
- Dell PERC controllers are made by LSI. Do Dell environments experience the same short battery lifetimes?
LSI product literature outlining a new-generation battery that can last longer in service than 1 year.
HP ProLiant DL585 G2 server with 1000+ day uptime and a happy RAID battery...
# uptime
05:38:08 up 1031 days, 44 min, 31 users, load average: 0.49, 0.64, 0.99
# hpacucli
Cache Board Present: True
Cache Status: OK
Accelerator Ratio: 50% Read / 50% Write
Total Cache Size: 512 MB
Battery Pack Count: 1
Battery Status: OK
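For anyone who wants to check these fields across a fleet rather than by eye, here is a rough sketch that parses `Key: Value` lines like the hpacucli fields shown above. It assumes the output has already been captured as text; the helper names are hypothetical, and the "healthy" rule (both `Cache Status` and `Battery Status` equal to `OK`) is my own assumption, not something HP documents here.

```python
# Sample text copied from the hpacucli fields shown above.
HPACUCLI_SAMPLE = """\
Cache Board Present: True
Cache Status: OK
Accelerator Ratio: 50% Read / 50% Write
Total Cache Size: 512 MB
Battery Pack Count: 1
Battery Status: OK
"""


def parse_fields(text):
    """Split each 'Key: Value' line at the first colon and return a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def cache_is_healthy(fields):
    """Assumed rule: cache and battery must both report OK."""
    return fields.get("Cache Status") == "OK" and fields.get("Battery Status") == "OK"


if __name__ == "__main__":
    fields = parse_fields(HPACUCLI_SAMPLE)
    print(fields["Battery Status"], cache_is_healthy(fields))
```

A controller that has dropped to write-through because of a failed battery would presumably show a different status string, which this check would flag without needing to know the exact wording.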
-
Admin almost 11 years
Just a hint: The last generation of Adaptec controllers use supercaps/flash instead of batteries as well.
-
Admin almost 11 years
Oh, I'm aware that all of the manufacturers have supercap-based solutions now, but given the existing installation footprint, it's hard to make a broad change across the infrastructure.
-
Admin almost 11 years
Well, given that batteries last 2-3 years in this scenario before you are in a low-power situation - guess what ;) With thousands of servers you have to do what you have to do. Simple as that. The HP server you mention may simply not realize the battery no longer has the power it once had... you know. As in: in a failure, it may not last as long as you want ;)
-
Admin almost 11 years
@TomTom The HPs lose about 20% of their charge capacity after 5 years. They do fail but it takes a while. For the LSI and Adaptec, is this failure rate common? Just plan to take systems down when it happens?
-
Admin almost 11 years
I have never done this (probably because it sounds like a bad idea and I haven't had the issue as frequently as you have), but you could try replacing a RAID battery on a test server while it is on. Slide it out, take the cover off, disconnect the bad battery, connect the good one, then put it back in the rack... If all goes well, you have a new battery replacement process that doesn't involve downtime.
-
Admin almost 11 years
@August Uhm, as risky procedures go, this sounds pretty high on the "OMG WHERE DID MY DATA GO" list.
-
Admin almost 11 years
@ewwhite - I have 3 Adaptecs here; all started being unreliable and are now in for a replacement (with capacitor) after 4 years. Yes. Those are normally not hot swappable.
-
Admin almost 11 years
Yep, it sure does... I agree it sounds like a horrible idea, but given the situation and the requirement for no downtime, it might be worth a shot on a test server (or thirty test servers...) to see if it is possible. What is another option besides redoing the infrastructure to not rely on individual RAID batteries in thousands of servers?
-
Admin over 7 years
My experience with IBM OEM'd LSI is similar. Batteries used to last barely a year, and supercaps are no better (sample from > 150 servers). Many of the documented "fixes" would indicate poor design. Then, to add further insult, they just make them a consumable item. The supercap issues I have tried to fix were in the battery controller module, not the capacitor.
-
voretaq7 almost 11 years
I would add that unless the system EXPLICITLY authorizes hot replacement of the RAID BBU I would not attempt it. I've never seen a system require annual replacement of the RAID cache battery. 3-5 years is a typical service life.
-
ewwhite almost 11 years
I think you got it!