Non-volatile cache RAID controllers: what kind of protection is there against NVCACHE failure?

raid hardware-raid cache battery

8,443

Solution 1

You're over-thinking this.

Of course, this depends slightly on the manufacturer's specific implementation, but having deployed thousands of HP ProLiant servers over 10 years, I've experienced hundreds of RAID controller battery failures. I replaced the bad units, knowing that sudden power-loss or a system crash would result in some level of data corruption if I didn't have a healthy battery in place.

I was happy to see the move to flash-backed write caches in recent years. The flash units on HP ProLiant systems are a separate super-capacitor that attaches to the controller RAM module. I suppose they can fail. I've not experienced one yet. The HP Smart Array RAID controllers can be set to keep write caching enabled regardless of the battery/supercap health. This presumes facility protection against sudden power-loss. You still have to worry about application stability and system crashes.

It sounds as though you're referring to Dell PERC controllers and their NVCACHE implementation. It's a similar design. Dell explains in their guidebook...

4.5.1 Non-Volatile Cache
Dell PERC controllers with non-volatile (NV) cache use the standard battery as contained in the Dell 
PERC controllers with a battery back-up unit (BBU). The difference is in battery implementation:

- The battery in the BBU offering retains the data in cache in the event of a power cycle for a 
guaranteed period of 24 hours (typically up to 72 hours).

- The battery in the NV cache offering will transfer the data from cache to flash in the event of 
a power cycle, where the data will be retained for up to ten years.

Think about your application and your storage access patterns. Are you really writing to the array fast enough and with an amount of data that cannot be flushed to disk effectively? Is your application unable to recover from a crash or sudden reboot?

If you're really concerned about application availability, focus on protecting facility power (healthy UPS + generator) and bolstering your systems with redundant components (power supplies, fans, etc.)

Edit:

I'm looking at an HP Smart Array P410 RAID controller with a flash-backed write cache onboard. There are health LEDs for the flash module and the older external battery (HP equipment doesn't do the relearn cycle).

A dedicated battery microcontroller continuously monitors the HP Smart Array battery pack for signs of damage, including an open battery terminal, partial battery short, charge timeouts, and over discharge conditions.

For the super-capacitor, its health is monitored, but the LED indicator is located on the flash module. The RAM is ECC error-correcting, so that's also another level of defense. Both are reported to the host server, via SNMP traps and can be viewed through diagnostic utilities.

From HP's Smart Array technology guide.

The Super-cap sub-assembly consists of two 35-Farad 2.7V capacitors, configured in series, providing 17 Farads at up to 5.4V. The charger maintains the Super-cap at 4.8V, providing the required amount of power to complete backup operations while extending the life of the Super-cap. The charger monitors Super-cap health and activates LED indicators on the FBWC module to warn of impending failure. The Super-Cap module uses the same form factor and housing as the HP 650 mAh P-Series battery used in the HP BBWC.

My point with all of this is that the manufacturers have engineered solutions to make the flash cache solution work and become a viable replacement for the older battery-based technology. It's in their interest to provide proper monitoring facilities.

As a note, check the visual indicators for the HP's newest-generation flash-modules. You can be sure that all of these checks are integrated into the alerting and diagnostics system for the controller.

enter image description here

Solution 2

Presumably, the server itself would fail to boot if the BIOS on the RAID controller encountered a failure during tests. It would check the onboard memory just as the main server BIOS checks its own memory. If you want details on this, your best bet is to call the manufacturer of your RAID card.

8,443

Skoua

Updated on September 18, 2022

Comments

Skoua over 1 year
The battery back-up (BBU) model:
- admin enables write-back cache with BBU
- writes are cached to the RAID controller's RAM (major performance benefit)
- the battery saves uncommitted and cached data in the event of a power loss (reliability)
If I lose power and come back within a day or so, my data should be both complete and uncorrupted.

The downside to this is that, if the battery is dead or low, OR EVEN IF IT IS IN A RELEARN CYCLE (drain/charge loops to ensure the battery's health), the controller reverts to write-through mode and performance will suffer. What's more, the relearn cycles are usually automated on a schedule which may or may not happen in the middle of big traffic. So, that has to be manually disabled and manually scheduled for off-hours if it's a concern. Annoying either way.

NV caches have capacitors with a sufficient charge to commit any uncommitted-to-disk data to flash. Not only is that more survivable in longer loss situations, but you don't have to concern yourself with battery death, wear-out, or relearning.

All of that sounds great to me. What doesn't sound great to me is the prospect of that flash module having an issue, though. What if it's completely hosed? What if it's only partially hosed? A bit corrupted at the edges? Relearn cycles can tell when something like a simple battery is failing, but is there a similar process to verify that the flash is functional? I'm just far more trusting of a battery, warts and all.

I know the card's RAM can fail, the card itself can fail - that's common territory, though.

In case you didn't guess, yeah, I've experienced a shocking-to-me amount of flash/SSD/etc. failure :)
- user9517 almost 12 years
  
  This seems like the kind of thing you should ask the manufacturers - they'll know for sure what happens, what alerts are raised and how it all hangs together.
- Skoua almost 12 years
  
  I could end up polling multiple manufacturers about it in order to deduce an industry standard. But if nobody knows so far, this seems to me a good question to ask here in order for the community to have an answer. If I find out offline, I'll report back here.
- pauska almost 12 years
  
  Either contact a manufacturer and get the answer you want, or disable write caching all together. This is going nowhere.
- Skoua almost 12 years
  
  The question was asked on a Friday evening. It is currently Sunday afternoon. I already said that I would report back here if I get an answer offline, but weekends aren't a great time to contact card manufacturers. Your response is unwarranted, and unhelpful.
- ewwhite almost 12 years
  
  What's the solution?
Skoua almost 12 years

"I've experienced hundreds of RAID controller battery failures" Because of proactive relearn cycles, battery failures are detectable and their would-be problems avoidable. "I suppose they can fail. I've not experienced one yet." Absent the equivalent of a relearn cycle, how would you know if the flash on a running server's NV card was faulty? Your response doesn't address the subject.
Skoua almost 12 years

If the flash failed after a sudden outage, uncommitted data would be lost whether or not the card booted. I'm asking about proactive fault detection, along the lines of a BBU card's relearn cycle. I currently own BBU cards, and I'm asking about the NV space in general.
ewwhite almost 12 years

I address the use of the battery with NVCACHE, as per Dell's description. I explain how to engineer around your concerns by making facility power more robust. And as far as detecting the health of the RAID controllers and flash modules, internal health checks are available. I reference HP because that's what's I can physically view at the moment, but Dell should have the same checks. See my edits above.
Skoua almost 12 years

Well, I already summarized BBU and NV operation in the OP. I also happen to use Fujitsu servers, not Dell, with LSI BBU RAID, redundant PSUs, and UPSes, racked in a data center with backup power. I've still seen servers have unexpected outages in environments like this. It even happens at EC2, with EC2's resources! And I'm still curious about the subject. Telling me to use a safer airplane doesn't help me detect faults in my parachute, which is the only specific question I've asked.
Skoua almost 12 years

"It's in their interest to provide proper monitoring facilities." - I agree, of course, but that doesn't inform me about what they are or how they work. The LED array you pasted covers some major failures (no power, dead controller), but I don't see anything specific about read/write integrity, an insidious failure I've experienced many times with NV storage. As BBU cards in general do relearn cycles, NV cards in general may well (and hopefully do!) perform some kind of routine random read/write checksums. But unlike BBU cycles or ECC, it's apparently not a well-known process. So I ask.
user9517 almost 12 years

@astrostl: Like I said above, ask the manufacturers, they'll know for sure. As it is it looks like all you really want is an argument, and this isn't the right room for it.
Skoua almost 12 years

And like I responded above, I will if nothing bubbles up here. If I'd asked about BBU integrity, many of us would have the answer off the top of our heads. I'd love to hear that on NV. But if not, a public Q&A creates a venue to share manufacturer-derived knowledge, to later come off the top of our heads. And I'd prefer answers to arguments, but I'm not going to pretend a response like this is helpful: I don't know, you shouldn't care, here's something you already said, here's something you already know, here are some text/picture pastes. It looks good on a skim, though, and gets points.
Skoua almost 12 years

First-order failures should be uncommon, so don't worry about second-order failures? Do you have a problem with test restores? I know relearn cycles harm performance, but I don't disable them because I don't expect a sudden outage. You didn't expect sudden outages on your known-expired BBU cards, but you took their servers offline to replace the batteries anyway. I'm not concerned about lifecycle overruns, I'm concerned about random flash failure. If you don't have an answer, don't respond. You're assuming, accusing, and even discouraging technical curiosity. Knowing how it works can't hurt.