How to recover from a drive failure in a RAID 5 configuration?

Solution 1

The system is running very slowly because it has to reconstruct the missing data which involves additional CPU and I/O.
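
To make the reconstruction cost concrete, here is a minimal sketch of how a missing block can be recomputed from parity. It assumes plain XOR parity over a single stripe of a 3-disk array; the block contents and the helper name `xor_blocks` are made up for illustration, and a real controller does this in firmware with its own layout.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 3-disk array: two data blocks plus parity.
data_a = b"ACCOUNTS"
data_b = b"PAYMENTS"
parity = xor_blocks([data_a, data_b])

# Simulate losing the disk that held data_b. Every read of that block now
# needs a read from both surviving disks plus an XOR pass.
rebuilt_b = xor_blocks([data_a, parity])
print(rebuilt_b)
```

Each read that lands on the dead disk turns into reads on all the surviving disks plus the XOR work, which is where the extra CPU and I/O comes from.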

If you have a missing disk in a RAID-5 configuration you have no recovery strategy. If another disk goes down you will lose your data. Run, don't walk, to the nearest vendor from which you can get a compatible part covered by manufacturer's warranty shipped by a same-day urgent courier. If the vendor you bought the array from is already in the process of getting the part, get both parts and stash the other one away as a spare.

If you have a RAID-5 being used for a production system you should consider leaving a spare disk in the array as a hot spare.

Added - If your logs are not on a separate volume (physically separate disks) move them to a separate set of disks, even just a single mirrored pair. This will also be a performance win if your database has any significant load as contention on log volumes has a disproportionately bad effect on performance.

If this is possible you can also make your database more robust by doing the following:

  1. Shut down the database.
  2. Back up the database.
  3. Move the logs to a physically separate set of disks (make sure you reconfigure the database so it knows where the logs have been moved to).
  4. Restart the database and application.

If you have the logs on a separate volume you can restore and roll forward from the backup if and only if a disk failure does not compromise the logs. Database logs should be on a separate disk volume for (amongst others) the following reasons:

  • Log usage patterns are predominantly sequential, appending log entries onto the end of the file (the file is in effect a ring buffer). This means that a large number of log entries can be written out quickly as there is little disk head seek activity.

  • If they are sharing physical disks with a heavily random access workload (e.g. transactional tables and indexes) they will be slowed down disproportionately as the head seek activity disrupts the sequential writes.

  • Having the logs on a separate volume is almost always a performance win and only needs a single mirrored pair for logs to support quite a heavy workload. This means that the hardware to do it is quite cheap, so there is a small cost for a big performance and reliability win.

  • If your data array goes down the logs are not lost. If you have a proper backup strategy you can restore from the backup and roll forward from the logs. This means that a whole array can go down on the server without being a single point of failure. Both the log and data arrays have to fail simultaneously to cause data loss.
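
As a rough illustration of "restore from the backup and roll forward from the logs", here is a toy model. Everything in it is hypothetical: the dict standing in for the database, the `(lsn, key, value)` log format and the function name are assumptions for the sketch, and a real database engine's log and restore tooling are far more involved.

```python
def restore_and_roll_forward(backup, log, backup_lsn):
    """Rebuild current state from a backup plus the log entries written after it.

    backup     -- dict snapshot of the data as of the backup
    log        -- list of (lsn, key, value) entries in write order
    backup_lsn -- sequence number of the last entry already in the backup
    """
    state = dict(backup)
    for lsn, key, value in log:
        if lsn > backup_lsn:  # skip changes the backup already contains
            state[key] = value
    return state


# The backup was taken after log entry 2; entries 3 and 4 exist only in the log.
backup = {"alice": 100, "bob": 50}
log = [
    (1, "alice", 100),
    (2, "bob", 50),
    (3, "alice", 70),
    (4, "carol", 30),
]

recovered = restore_and_roll_forward(backup, log, backup_lsn=2)
print(recovered)
```

The point of the sketch is that the work done after the backup survives only if the log survives, which is why the log should not live on the same array as the data.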

Solution 2

1) Backup.

Right now no data has been lost. If your backups are not up to date, back up now.

2) Read the manual, call the vendor etc.

Different RAID systems have different steps for replacing a disk, and done wrong you risk destroying the whole array. Without knowing what sort of RAID hardware/software you have we can only guess at the steps needed.

Also, the slow performance is because RAID 5 in a degraded state (i.e. one disk dead) has horrible read performance. How horrible depends on how the parity is stored and which disk died, but the "good" news is that slow performance with one disk gone is a known issue and not cause for panic.
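
For a feel of the numbers, the sketch below counts physical reads per logical data block with one disk dead. It assumes a simple rotating-parity layout (stripe `s` keeps its parity on disk `s % num_disks`), which is an illustrative choice and not necessarily what your controller does, and it ignores caching, writes and rebuild traffic.

```python
def degraded_read_cost(num_disks, failed_disk, stripes=1200):
    """Average physical reads per logical data block with one disk dead.

    Assumes stripe s stores its parity block on disk s % num_disks.
    """
    total_reads = 0
    data_blocks = 0
    for stripe in range(stripes):
        parity_disk = stripe % num_disks
        for disk in range(num_disks):
            if disk == parity_disk:
                continue  # parity blocks hold no user data
            data_blocks += 1
            if disk == failed_disk:
                # Read every surviving block in the stripe and XOR them.
                total_reads += num_disks - 1
            else:
                total_reads += 1
    return total_reads / data_blocks


cost = degraded_read_cost(num_disks=3, failed_disk=0)
print(f"{cost:.2f} physical reads per logical read")
```

Under these assumptions a 3-disk array needs about a third more physical reads than a healthy one, and the surviving disks also carry the load the dead disk used to share, so the slowdown users see can be noticeably worse than the raw read count suggests.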

Solution 3

First I would read the manual for the hardware/software that you're using - the section for failure recovery :)

Should be a simple matter of replacing the disk and rebuilding the array though.

The most important point in such cases is that the disk should be replaced as soon as possible since if another disk fails you will probably lose data. Also you should address the cause of failure - was it because the disk was getting old? Should you replace the other ones too? Or was it because of a power surge, heat or vibration?

Solution 4

As far as I understand RAID5, when you replace the failed drive, it is automatically rebuilt from information stored on the other two. Whether you can 'hot-swap' the new drive into place does depend on your system - you may have to power down first. Either way, considering the relatively low cost of drives, and the importance of your data (reflected by your decision to use RAID5 in the first place), you really ought to have a spare drive, sat in a drawer, ready for such an eventuality.

I've recently built a new development PC for myself, and set up the main data drives under RAID5. I ordered one more drive than necessary, so that I've got the spare ready for that emergency moment (that I'm hoping won't happen).

Now you've asked the question, I suppose I'd better read up on the subject some more.

Author: Philip Fourie

Works for Theta

Updated on September 17, 2022

Comments

  • Philip Fourie
    Philip Fourie almost 2 years

    This morning a drive failed on our database server. The drive array (3 disks) is set up in a RAID 5 configuration.

    While we wait for a drive replacement we are preparing for a recovery strategy. Users are continuing to work on the system, albeit very slowly (don't know why??).

    How does one install the new drive - will the data for this drive automatically be rebuilt from the parity or is there another process we should follow?

    Edit: This is a hardware RAID controller. (Thanks for the answers so far, appreciated)

    • David Schwartz
      David Schwartz almost 13 years
      By the way, the time to decide what to do if a drive fails on a critical server is before a drive fails on a critical server.
  • Philip Fourie
    Philip Fourie almost 16 years
    Thanks for the answer especially explaining why the system is running slowly.
  • Mike Broughton
    Mike Broughton almost 16 years
    Spot on. I would even suggest shutting it down until you get that replacement drive in place. Like Nigel says, you have no recovery strategy. Lose another drive, lose it all.
  • Philip Fourie
    Philip Fourie almost 16 years
    Hi Nigel, thanks for taking the time and sharing your expertise. It is indeed great advice. I'll report back later on the outcome of the recovery.
  • ConcernedOfTunbridgeWells
    ConcernedOfTunbridgeWells almost 16 years
    For small data volumes a mirrored pair is better as it typically has better sequential access speed than a small RAID-5. If you want hot-swap, look at some of the hot-swap bay systems on somewhere like scsi4me.com