3Ware 9650SE RAID-6, two degraded drives, one ECC, rebuild stuck

I managed to get the RAID to rebuild, without pulling any drives or rebooting the system, by issuing the following command in tw_cli:

/c2/u1 set ignoreECC=on

The rebuild didn't start immediately, but at 2 AM the morning after I made this change it kicked off, and about six hours later it was complete. The drive with ECC errors had 24 bad sectors, which have now been overwritten and reallocated by the drive (according to the SMART data). The filesystem seems intact, but I won't be surprised if I hit errors when I get to whatever data was on those sectors.
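
For anyone who hits the same stall: the 2 AM start most likely lines up with the controller's scheduled rebuild window rather than anything I did at that hour. A rough way to check the schedule, watch the rebuild, and confirm the reallocations afterward; the subcommand names are from memory and the smartctl port number and device node are specific to my layout (Linux host, 9650SE showing up as /dev/twa0), so treat this as a sketch:

    # Rebuild task schedule; with the schedule enabled, rebuilds only run
    # inside the scheduled windows, which would explain the 2 AM start.
    tw_cli /c2 show rebuild

    # Watch the unit while it rebuilds.
    tw_cli /c2/u1 show

    # Confirm the WARNING drive reallocated its bad sectors. smartmontools
    # reaches drives behind the 9650SE with -d 3ware,N against /dev/twa0,
    # where N is the controller port (p9 for my WARNING drive).
    smartctl -a -d 3ware,9 /dev/twa0 | \
        grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'

The tw_cli commands can also be typed at the interactive tw_cli prompt without the leading tw_cli.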

In any case, I'm much better off than I was before, and will likely be able to recover the majority of the data. Once I've gotten what I can, I'll pop out the drive that's failing and have the array rebuild onto a hot spare.
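
A rough sketch of the swap I have in mind, assuming p9 is the WARNING drive and guessing that one of the hot spares sits on port 8 (the disk= syntax is taken from the controller's own help text quoted in the question below, so double-check everything against your help output before running any of it):

    # Turn ignoreECC back off once the important data is copied, so future
    # rebuilds don't silently skip over unreadable sectors.
    tw_cli /c2/u1 set ignoreECC=off

    # Export the failing WARNING drive (port p9) from the controller,
    # then physically pull it.
    tw_cli /c2/p9 remove

    # With a hot spare already configured, the controller should start
    # rebuilding onto it by itself. If not, kick it off using the form
    # from the help text, /c2/u1 start rebuild disk=<p:-p...> [ignoreECC],
    # where 8 is my guess at the spare's port number.
    tw_cli /c2/u1 start rebuild disk=8

    # Once a replacement drive is in the empty bay, make the controller
    # see it and designate it as the new hot spare.
    tw_cli /c2 rescan
    tw_cli /c2 add type=spare disk=8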

Comments

  • cswingle almost 2 years

    This morning I came into the office to discover that two of the drives in a RAID-6 unit on a 3ware 9650SE controller were marked as degraded and the controller was rebuilding the array. After getting to about 4%, it hit ECC errors on a third drive (this may have happened when I attempted to access the filesystem on this RAID and got I/O errors from the controller). Now I'm in this state:

    > /c2/u1 show
    
    Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
    ------------------------------------------------------------------------
    u1       RAID-6    REBUILDING     4%(A)   -       -     64K     7450.5    
    u1-0     DISK      OK             -       -       p5    -       931.312   
    u1-1     DISK      OK             -       -       p2    -       931.312   
    u1-2     DISK      OK             -       -       p1    -       931.312   
    u1-3     DISK      OK             -       -       p4    -       931.312   
    u1-4     DISK      OK             -       -       p11   -       931.312   
    u1-5     DISK      DEGRADED       -       -       p6    -       931.312   
    u1-6     DISK      OK             -       -       p7    -       931.312   
    u1-7     DISK      DEGRADED       -       -       p3    -       931.312   
    u1-8     DISK      WARNING        -       -       p9    -       931.312   
    u1-9     DISK      OK             -       -       p10   -       931.312   
    u1/v0    Volume    -              -       -       -     -       7450.5    
    

    Examining the SMART data on the three drives in question shows that the two marked DEGRADED are in good shape (they PASS, with no Current_Pending_Sector or Offline_Uncorrectable errors), but the drive listed as WARNING has 24 uncorrectable sectors.

    And, the "rebuild" has been stuck at 4% for ten hours now.

    So:

    How do I get it to start actually rebuilding? This particular controller doesn't appear to support /c2/u1 resume rebuild, and the only rebuild command that appears to be an option is one that wants to know which disk to add (/c2/u1 start rebuild disk=<p:-p...> [ignoreECC], according to the help). I have two hot spares in the server, and I'm happy to engage them, but I don't understand what the controller would do with that information in its current state.

    Can I pull out the drive that is demonstrably failing (the WARNING drive), when I have two DEGRADED drives in a RAID-6? It seems to me that the best scenario would be for me to pull the WARNING drive and tell it to use one of my hot spares in the rebuild. But won't I kill the thing by pulling a "good" drive in a RAID-6 with two DEGRADED drives?

    Finally, I've seen references in other posts to a bad bug in this controller that causes good drives to be marked as bad, and suggestions that upgrading the firmware may help. Is flashing the firmware a risky operation given the situation? Is it likely to help or hurt with respect to the rebuilding-but-stuck-at-4% RAID? Am I experiencing this bug in action? (A controller-wide listing that would capture the spare ports and the current firmware version is sketched just below.)

    Advice outside the spiritual would be much appreciated. Thanks.
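
    Since the hot spares and the firmware version both come up above, a controller-wide listing would pin down which ports the spares actually sit on and what firmware and driver the card is running. Both are plain tw_cli queries; this is just a sketch for controller 2:

        # List every unit, drive and hot spare attached to controller 2.
        tw_cli /c2 show

        # Full controller details, including firmware and driver versions,
        # to compare against 3ware's release notes before any flash.
        tw_cli /c2 show all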

    • David Schwartz about 12 years
      Not to state the obvious, but this is precisely what backups are for. You can try to read any critical data that you might not have backed up first. RAID is not backup; a single failure in the controller or host OS can take out an entire array.
    • cswingle about 12 years
      David, indeed you are correct. We do have backups of some of it, but much of the data is publicly available and we made a decision not to back that up. Maybe the wrong decision, but here I am: recover the data, or spend weeks redownloading it in the background. I am hoping someone with 3ware experience can help me identify the safest next course of action.
    • HopelessN00b about 12 years
      Well, you are correct that the array will fail if you pull the drive that's in the WARNING state, so don't do that... I'm not sure what you should do, though. Can you get access to the volume and try copying off/backing up your data? That's probably what I'd do: pray I get the data off before the array fails, because once the data is off, it's no big deal if the array fails.
    • cswingle about 12 years
      HopelessN00b: It was mounted when I started this whole process but threw I/O errors almost immediately when I tried accessing the PostgreSQL databases on it. I then tried an xfs_repair, which failed. Today I was able to mount it and I'm carefully copying off the most important stuff. So far so good. Once I've gotten everything I can, I'll feel more comfortable exploring the available tw_cli options.
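
      For what it's worth, the copy-off itself is nothing fancy; roughly the following, with the paths standing in for the real ones:

          # Most important trees first; --partial keeps whatever made it
          # across if a read error kills a transfer mid-file.
          rsync -a --partial /mnt/raid/pgdata /mnt/salvage/

          # For individual files rsync keeps tripping over, ddrescue copies
          # around bad reads instead of giving up (one retry pass here).
          ddrescue -r1 /mnt/raid/bigfile /mnt/salvage/bigfile /mnt/salvage/bigfile.map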