RAID5 2-disk failure - what steps to take?

Solution 1

How appropriate, that this should come on the heels of "Backup Appreciation Week" (or whatever it's called).

The problem with trying to do anything yourself is that you're just increasing the amount of degradation on the drives whenever you're running them. Decide now if you're going to send it to the pros, and if so, just do it. Presumably if this data is important enough to spend thousands of dollars recovering, it's probably stuff you want sooner rather than later, so just send it off now.

Oh, and backups. Make good backups. RAID isn't a backup, and RAID 5 barely even counts as redundancy these days, given the size of drives (and hence the time required to rebuild a large array).

Solution 2

Short Answer: Build a non-RAID 5 array that can hold the data, then restore from backup.

If you don't have a backup, "You're Doing it Wrong."

Longer Version:

Consider RAID 10. If space is a concern, buy more disks and go to RAID 6 if your controller supports it, or buy even more disks and do RAID 10 anyway. Build your RAID array(s) and then restore your data from the last backup.
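
Since the asker is on Linux software RAID (mdadm), here is a rough sketch of what building the replacement array might look like. The device names (/dev/sd[b-g]1), the array name /dev/md0, and the filesystem choice are all placeholders for your own setup:

    # RAID 10 across six hypothetical disks:
    mdadm --create /dev/md0 --level=10 --raid-devices=6 /dev/sd[b-g]1

    # ...or RAID 6 across the same six disks (survives any two failures):
    # mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[b-g]1

    # Watch the initial sync, make a filesystem, then restore from backup:
    cat /proc/mdstat
    mkfs.ext4 /dev/md0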

Solution 3

You might find useful information using RAID Reconstructor, which is read-only and will scan the drives to determine what's up with them. You'll obviously need to be able to connect the drives to another system, not through a RAID controller. Evaluating your drives won't cost you anything.

Solution 4

Cry. We had this happen twice in two weeks. Our AC units were on the fritz and the temperature monitors did not report it. The heat killed a lot of our drives.

Amusingly, our brand-new data center was getting ready for an expansion, and the joyous facilities group said: no worries, you're running at 46% of capacity.

Later we found out they had written down the AC units' sizes wrong by a factor of 2, and we were actually at 97% of capacity. Oopsy.

Then we added a whole stack of new servers, thinking we had tons of buffer space on the AC capacity.

So we had heat issues for months while we waited for them to get us bigger compressors for the AC, which was scheduled to take 18 months.

What else is new in the world?

Solution 5

womble's answer covers the worst-case scenario, but there is a decent chance that one or both of the disks are perfectly fine. If you want to try recovering the data yourself, I'd recommend using only one of the failed drives in your recovery attempt and setting the other aside, in case you eventually do need to send the entire RAID5 to a data recovery firm.

With inexpensive SATA cards, it was not uncommon for us to lose two drives from our RAID5 at once even though only one of them was defective. We also had a couple of occasions in which neither drive was bad, and we couldn't reliably pinpoint the cause of the RAID5 failures. We've since switched to larger drives in a RAID1 configuration, and are considering switching to ZFS on a raidz2 or raidz3.

As someone else mentioned, the recovery service won't be able to recover data from just the failed drives. You'll have to send in all the disks from your RAID5.

You should be aware that there are varying levels of failure. If there is severe physical damage due to a crashed head, your only hope lies with a recovery service, but the chances are, your data is gone.

If you can't justify the cost of sending all the drives to a data recovery service, you may be able to duplicate the drive's contents onto a good drive using dd or dd_rescue, then perform additional diagnostics on the failed drive while you reassemble your RAID and run a full backup. Unfortunately, you may not be able to determine whether your files are okay or if they are corrupt, unless you have a recent list of checksums or existing backup to compare them against.
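
If you go that route, GNU ddrescue (or the older dd_rescue) is usually a better choice than plain dd, because it keeps a map of bad regions and retries them instead of hammering the drive in one long pass. A minimal sketch, assuming the suspect drive is /dev/sdc and /mnt/spare has room for a full image (both placeholders):

    # First pass: copy everything easily readable, skip scraping the bad areas,
    # and record progress in a map file so the copy can be resumed.
    ddrescue -n /dev/sdc /mnt/spare/sdc.img /mnt/spare/sdc.map

    # Second pass: go back and retry the bad areas a few times.
    ddrescue -r3 /dev/sdc /mnt/spare/sdc.img /mnt/spare/sdc.map

    # Plain dd alternative (no retry logic; read errors are padded with zeros):
    # dd if=/dev/sdc of=/mnt/spare/sdc.img bs=64k conv=noerror,sync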

If you can determine that Sector 0 is bad (usually indicated by repeated clicking after power-on), you're hopelessly out of luck. An Ontrack recovery agent told me they could not recover any data from a drive I sent in, because they absolutely need to be able to write to Sector 0. I was a little irked, because I had already determined that Sector 0 was bad before I sent the drive in, and Ontrack wasn't upfront with their capabilities.

You might be able to tell if the disks were erroneously marked as failed by reviewing the system logs and/or using smartctl (from the smartmontools package) to view the SMART diagnostic information stored on the drives. If smartmontools reports good drive health and you don't have any reallocated sectors (under "reallocated sector count"), then your drive may be fine and you can try reassembling the RAID and backing it up.
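
A minimal sketch of what that check might look like, assuming the two kicked-out drives are /dev/sdc and /dev/sdd and the array is /dev/md0 (all placeholders; repeat the drive commands for the second disk):

    # SMART health summary and full attribute table; pay attention to
    # Reallocated_Sector_Ct, Current_Pending_Sector and UDMA_CRC_Error_Count.
    smartctl -H /dev/sdc
    smartctl -a /dev/sdc

    # Optionally run a short self-test and check the result afterwards:
    smartctl -t short /dev/sdc
    smartctl -l selftest /dev/sdc

    # What md recorded about the members and the array, plus the kernel's view:
    mdadm --examine /dev/sdc1
    mdadm --detail /dev/md0
    dmesg | grep -iE 'ata|md0' | tail -n 50

    # Only if both drives look healthy: try to force-assemble the array,
    # then back it up immediately.
    # mdadm --assemble --force /dev/md0 /dev/sd[a-f]1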

For the future, you might also want to consider setting up an OpenSolaris box with ZFS on raidz2 or raidz3. These will give you double- or triple- parity, respectively, allowing you to lose 2 (raidz2) or 3 (raidz3) drives before losing your data. In addition, ZFS checksums everything, so your filesystem won't be prone to silent data corruption, as it is with other single-disk or RAID configurations.
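
For illustration, creating such a pool is close to a one-liner. The pool name "tank" and the OpenSolaris-style disk names below are placeholders for your own devices:

    # Six-disk raidz2 pool: any two disks can fail without losing data.
    zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0

    # For triple parity, substitute "raidz3" for "raidz2" above.

    # Check the layout, and scrub periodically so ZFS verifies every checksum:
    zpool status tank
    zpool scrub tank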

Having at least double-parity in any RAID configuration is desirable, because you still have redundancy while you're in the process of replacing and rebuilding the first failed disk. (Of course, you shouldn't wait until 2 disks have failed before replacing the first failed disk.)

Comments

  • HopelessN00b
    HopelessN00b almost 2 years

    I've got a 6-disk RAID5 array on a gentoo server. mdadm is reporting that two of the disks have failed. In the event that the disks are actually gone, I'm prepared to send the drives to professionals for recovery, but I don't want to have to do that unless it's necessary, and I don't want my own recovery attempts to make life harder for them. That said, if I can get the array back up and running myself, I'd prefer to do that.

    1) What steps should I take immediately to reduce the risk of data loss?

    2) What's the best way to tell if the drives are actually dead or have just been marked as failed erroneously?

    3) Is there any risk in rebooting the machine and/or attempting to rebuild the array myself?

    • Erik Nijland
      Erik Nijland over 14 years
      If you plan to reuse the remaining 4 drives, consider RAID 10. If space is a concern, buy more disks and go to RAID 6 if your controller supports it, or buy even more disks and do RAID 10 anyway. baarf
  • Maxwell
    Maxwell over 14 years
    +1 You need to ensure a proper backup policy before dealing with large sets of data. Redundant arrays of disks are not backups.
  • rob
    rob over 14 years
    This looks pretty slick, but it appears to be Windows-only. (The asker said he's running Linux software RAID on Gentoo.) Good find, though.
  • rob
    rob over 14 years
    There is a workaround for the long rebuild time, at least in cases where the failures are localized to one area on the disk (as is usually the case with bad sectors). When we were using RAID5, we partitioned the big drives into smaller chunks and created multiple RAID5s, then LVMed those together. This allowed us to keep failures localized on the disk, as well as dramatically speed up the rebuild, since we could rebuild one part of the LVM (one RAID5) at a time, starting with the failed region. (A rough sketch of this layout appears after the comments below.)
  • NuckinFutz
    NuckinFutz over 14 years
    Using something like dd on a failed drive can do more harm to that drive creating a situation where you'll lose even more data. If you're ready to send it to the pros, send it to the pros.
  • womble
    womble over 14 years
    @rob: You still need to rebuild all those arrays when you replace the disk, and the added complexity of your solution sounds like a bit of a nightmare to manage. I note, also, that you apparently don't use RAID 5 any more, which suggests that you found RAID 5 to be ultimately insufficient.
  • rob
    rob over 14 years
    @womble: Forgot to mention before, +1. :) The rebuild workaround wasn't a hassle, and did allow us to rebuild the failed regions first, and manually fail/rebuild the rest of the RAIDs on our own schedule. Yes, we did stop using RAID5 for multiple reasons. We always seemed to lose a pair of disks simultaneously instead of just the single bad disk, which caused us extra downtime and hassle--even with backups. Second, 1 redundant disk is insufficient, since a failure during rebuild means more downtime. Now we're using 3-disk RAID1 (2 mirrors), and are looking at moving to ZFS on raidz2 or raidz3.
  • Rowan Hawkins
    Rowan Hawkins about 6 years
    For future people: this probably wasn't the OP's issue. TLER (WD's name for it) existing would make a drive that hasn't been marked failed degrade the performance of the array. Once the drive is marked failed, the array no longer issues it requests. RAID 5 makes sense where you need 'some' high availability and you are using 3-4 smaller-capacity drives. Once you reach 2 TB drives, the rebuild time and the chance of URE failures impact your ability to successfully rebuild the array as a whole.
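
For anyone curious what the partitioned-RAID5-plus-LVM layout rob describes above might look like, here is a rough sketch. Every device name, the volume group name bigvg, and the three-partition split are hypothetical:

    # Each of six disks is split into matching partitions (e.g. sda1..sda3),
    # and one RAID5 is built per partition "slice" across all six disks:
    mdadm --create /dev/md1 --level=5 --raid-devices=6 /dev/sd[a-f]1
    mdadm --create /dev/md2 --level=5 --raid-devices=6 /dev/sd[a-f]2
    mdadm --create /dev/md3 --level=5 --raid-devices=6 /dev/sd[a-f]3

    # LVM glues the small arrays into one big volume; a failure in one slice
    # only requires rebuilding that slice's RAID5.
    pvcreate /dev/md1 /dev/md2 /dev/md3
    vgcreate bigvg /dev/md1 /dev/md2 /dev/md3
    lvcreate -l 100%FREE -n data bigvg
    mkfs.ext4 /dev/bigvg/data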