What does 3Ware's tw_cli mean by a "DEGRADED" disk vs "ECC-ERROR"?
Solution 1
An ECC error means that there is at least one unreadable sector on the drive. If you are lucky, that sector is not actually used by the filesystem on that volume, so you might still be able to copy your data off the array in this state.
There are also some options to ignore ECC errors during rebuild:
/cx/ux start rebuild disk=p [ignoreECC]
/cx/ux set ignoreECC=on|off
However, using these options means that the RAID stripe affected by a bad sector will be corrupted. I am not sure what exactly the card does in this case; it might replace the whole stripe with zeros, or even with random data. The “recovered” array might therefore have undetectable corruption if the affected stripe falls in the middle of a data file. Copying your data from the array to some other place before trying to rebuild might be safer, since you should at least get errors when trying to read the bad area.
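To see why one unreadable sector on a surviving drive corrupts the rebuilt stripe, here is a minimal sketch of RAID 5 style XOR parity. It is an illustration only: the block contents are made up, `xor_blocks` is a hypothetical helper, and the zero-fill behaviour is an assumption rather than documented 3ware firmware behaviour.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a 4-drive RAID 5: three data blocks plus one parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# The drive holding d0 fails. With every surviving block readable,
# d0 is recovered exactly from the other three.
rebuilt_ok = xor_blocks([d1, d2, parity])

# Now suppose d1 sits on an unreadable sector and zeros are substituted
# for it (an assumption; the real controller behaviour is unknown here).
rebuilt_bad = xor_blocks([bytes(4), d2, parity])

print(rebuilt_ok == d0)   # the clean rebuild matches the lost block
print(rebuilt_bad == d0)  # the rebuild with a substituted block does not
```

Nothing in the parity math flags the second result as wrong, which is why the corruption is silent from the array's point of view.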
You should set up a scheduled verify of the array to catch unreadable sectors earlier, so that you can replace a drive as soon as it starts going bad.
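Alongside a scheduled verify, it can help to check the port status regularly so a drive that leaves the OK state gets noticed quickly. The following is a minimal sketch, assuming the port listing has the same columns as the `/c0 show all` output quoted in the question; `unhealthy_ports` is a hypothetical helper, not part of tw_cli, and the sample text is copied from that output.

```python
import re

# Matches port rows such as "p1  ECC-ERROR  u0  931.51 GB  1953525168  5QJ0DCW9".
PORT_ROW = re.compile(r"^(p\d+)\s+(\S+)\s+(\S+)")

def unhealthy_ports(listing):
    """Return (port, status, unit) for every port row whose status is not OK."""
    problems = []
    for line in listing.splitlines():
        match = PORT_ROW.match(line.strip())
        if match and match.group(2) != "OK":
            problems.append(match.groups())
    return problems

sample = """\
Port   Status      Unit   Size        Blocks       Serial
---------------------------------------------------------------
p0     DEGRADED    u0     931.51 GB   1953525168   5QJ07MAH
p1     ECC-ERROR   u0     931.51 GB   1953525168   5QJ0DCW9
p2     OK          u0     931.51 GB   1953525168   5QJ0DW9C
p3     OK          u0     931.51 GB   1953525168   5QJ0CKXJ
"""

for port, status, unit in unhealthy_ports(sample):
    print(f"{port} in {unit} is {status}")
```

A cron job could feed the saved listing to a function like this and send mail when the result is non-empty.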
Solution 2
I have never seen a physical drive (p0) go into status DEGRADED. However, you might be able to get back the ECC-ERROR drive, or even the DEGRADED drive, by removing them via
/c0/p1 remove
and then issuing a rescan
/c0 rescan
and finally putting them back into the RAID unit via
maint rebuild c0 u0 p1
SATA drives that failed on me with ECC-ERROR I was able to resurrect this way, if only for a few hours before they failed again.
Solution 3
It's very likely your data is gone. ECC error means an unrecoverable error while reading from this disk.
If you don't have a backup, you can try to dump the current state of the array. This might work because the controller doesn't know whether the unreadable sector held data or just an empty area (it has no insight into the file system).
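Dumping the array while tolerating read errors could look roughly like the sketch below. It assumes a Unix-like system where the array is exposed as a block device; `dump_skipping_errors`, the block size, and the example paths are all hypothetical choices for illustration.

```python
import os

def dump_skipping_errors(src_path, dst_path, block_size=64 * 1024):
    """Copy src to dst block by block, zero-filling blocks that fail to read.

    Returns the byte offsets of the blocks that could not be read in full.
    """
    bad_offsets = []
    fd = os.open(src_path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        with open(dst_path, "wb") as dst:
            offset = 0
            while offset < size:
                want = min(block_size, size - offset)
                try:
                    data = os.pread(fd, want, offset)
                except OSError:
                    data = b""
                if len(data) < want:
                    # Record the bad block and pad it so that later
                    # offsets in the image stay aligned with the source.
                    bad_offsets.append(offset)
                    data = data.ljust(want, b"\0")
                dst.write(data)
                offset += want
    finally:
        os.close(fd)
    return bad_offsets

# Example call (hypothetical device and image paths):
# bad = dump_skipping_errors("/dev/sdX", "/mnt/backup/array.img")
```

The returned offsets tell you which parts of the image are zero-filled, so you can later check which files overlap them. A dedicated tool such as GNU ddrescue handles retries and logging far more carefully than this sketch.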
Bill Weiss
Updated on September 18, 2022
Comments
-
Bill Weiss almost 2 years
I've got a sad RAID array on a 3ware 9650SE-16ML card. What I can't tell is if I've just suffered a double-disk failure (bummer!) or if I'm reading this wrong. The relevant output of
/c0 show all
is:
Port   Status      Unit   Size        Blocks       Serial
---------------------------------------------------------------
p0     DEGRADED    u0     931.51 GB   1953525168   5QJ07MAH
p1     ECC-ERROR   u0     931.51 GB   1953525168   5QJ0DCW9
p2     OK          u0     931.51 GB   1953525168   5QJ0DW9C
p3     OK          u0     931.51 GB   1953525168   5QJ0CKXJ
And the failure is (from
show alarms
):
Ctl  Date                        Severity  Alarm Message
------------------------------------------------------------------------------
c0   [Sun Nov 20 07:47:23 2011]  INFO      Rebuild started: unit=0
c0   [Sun Nov 20 08:20:12 2011]  ERROR     Drive ECC error reported: port=1, unit=0
c0   [Sun Nov 20 08:20:12 2011]  ERROR     Source drive error occurred: port=1, unit=0
c0   [Sun Nov 20 08:20:12 2011]  ERROR     Rebuild failed: unit=0
c0   [Sun Nov 20 08:20:12 2011]  INFO      Rebuild paused: unit=0
I think that what happened is p0 failed, and then p1 had an ECC error (aka, my data is gone). But... maybe not? It stays at 97% rebuilt, but can't get past this error.
As far as I can tell, a previous admin turned off the periodic verify, which is what got us into this state. This isn't something most people should worry about with their 3Ware RAIDs!
Update
After beating on it for a couple of days, I did the IgnoreECC bit and it rebuilt, but my data is hosed. Bummer.
-
Admin over 12 years
Try the Freezer Recovery method if there's any important data on it.
-
-
Bill Weiss over 12 years
I did this with the p0 drive (on the assumption that it was the bad one) and it's trying to rebuild, but it marked the drive as DEGRADED almost immediately. Bummer.
-
Sergey Vlasov over 12 years
AFAIR, the drive is kept marked as DEGRADED during the rebuild (see, e.g., here). What is important is the array status (REBUILDING or something else?).
-
Bill Weiss over 12 years
Hm. It is in fact rebuilding... All four drives are flashing a lot, that's a good sign, right?
-
Bill Weiss over 12 years
Stiiiiiil rebuilding... it's at 37% after 4 hours. Bummer.
-
Bill Weiss over 12 years
I'm doing the ignoreECC bit now. Not looking great for my data.
-
Bill Weiss over 12 years
No luck, it got to 97% and hung again. I tried swapping p0 for another drive, same issue. I'm trying the ignoreECC suggestion now.
-
user728702 over 12 years
Which RAID level is it, btw?
-
Bill Weiss over 12 years
RAID5 (15 characters needed)
-
Bill Weiss over 12 years
Well, that got it through the rebuild, but nommed on my data. Bummer. That'll teach us to turn off verify...