Which drive in RAID has bad sectors?

Solution 1

Smartmontools has extensions that let it poll a drive for SMART data through an LSI RAID controller (among several other controller families). Normally this isn't something you can do, as the RAID abstraction obscures the direct interfaces to the drives.

Smartmontools might not be installed on your machine, but it is available in the main repositories of most distributions, and there is even a Windows build at: http://sourceforge.net/projects/smartmontools/files/

It can be used to poll a drive behind an LSI MegaRAID controller like so:

smartctl -a -d megaraid,N /dev/sdX

Here -a means "display all disk data", -d specifies the device type (megaraid in your case), and N is the drive number on that controller: to query the drive in slot 0, use 0 here. To poll all four of your drives, run the command four times, replacing N with 0 through 3. sdX is the RAID volume itself, as the operating system normally presents it; yours is probably sda.
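
For example, a small loop polls all four slots in one pass. This is only a sketch: it assumes the volume appears as /dev/sda and that the drives are numbered 0 through 3, which may not match your controller's actual device IDs (see the comments below, where the indexes turned out to be 2, 3, 4 and 6):

# Query SMART data for each drive behind the MegaRAID controller.
# Adjust the drive numbers and /dev/sda to match your system.
for N in 0 1 2 3; do
    echo "=== megaraid drive $N ==="
    smartctl -a -d megaraid,"$N" /dev/sda
done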

You will see a long output from each drive. What you're looking for is either a reported overall SMART failure (which you may not find, since your controller isn't rejecting any drives) or non-zero "offline uncorrectable sectors" or "pending sectors" counts. Any drive with a value above 0 in either field is bad. Show those fields no mercy, as it takes a LOT of failed reads to increment either value by one.
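
If you don't want to scan the full output by eye, filter for the relevant attributes. The names below are the usual ones reported by SATA drives behind the controller; SAS drives print a different format (a grown defect list and per-operation error counters), so adjust the pattern as needed:

smartctl -a -d megaraid,N /dev/sda | grep -Ei 'Current_Pending_Sector|Offline_Uncorrectable|Reallocated_Sector'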

You can also run a short or long self-test like so (the same rules as above apply):

smartctl -t [long|short] -d megaraid,N /dev/sdX
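
The long test can take several hours on a large drive. Once it completes, the results are written to the drive's self-test log, which you can read back with:

smartctl -l selftest -d megaraid,N /dev/sdX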

Solution 2

If the RAID passes the errors on to you, then obviously something is wrong that cannot be silently corrected.

If you get read errors, that means that all redundant copies of these blocks have been destroyed. The faulty drives are not ejected, because there are no spares.

If you get write errors, that means that one drive continues to report write errors, and the RAID cannot eject it because it is not currently redundant. You should never see a write error in a redundant setup, so if you do, replace the controller.

If you can add more disks, create a third mirror -- recovery will complain, and you will need to check the file systems, but you should end up with as much of your data intact as possible, and I'd expect any good controller to then kick out all broken disks.

Once you are back on a clean setup, set up scheduled checks to catch these errors before they become a problem.
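
One way to schedule such checks is smartd, which ships with smartmontools and understands the same megaraid device type. A minimal sketch of an /etc/smartd.conf entry, assuming drive number 0 and a placeholder email address (add one line per drive):

# Monitor all attributes (-a), run a long self-test every Sunday
# at 03:00 (-s), and mail warnings to the admin (-m).
/dev/sda -d megaraid,0 -a -s L/../../7/03 -m admin@example.com

Also consider enabling the controller's own periodic consistency check (patrol read), so failing sectors are found before a rebuild depends on them.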

Solution 3

If you are using Linux or Windows, boot your system and use the megacli utility.

megacli -pdlist -aALL

In the output, check the "Firmware state" line for each drive. A degraded disk will show as:

Firmware state: Offline
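
With four drives the full listing is long; a quick filter shows just the state of each slot. This assumes the stock megacli output, where each drive prints a "Slot Number" and a "Firmware state" line:

megacli -pdlist -aALL | grep -E 'Slot Number|Firmware state'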

Comments

  • James
    James almost 2 years

    I have 4 physical drives in a single virtual drive using an LSI MegaRaid SAS controller. It seems (at least) one of the drives has bad sectors because:

    • I/O errors occur when attempting to back up some files
    • running badblocks reports some bad sectors

    I'm hoping that resolving the issue will be as simple as swapping out the problematic disk(s) and rebuilding the RAID array. I thought the LSI MegaRAID WebBIOS would allow me to identify the problematic disk(s), but I can't find any options to check for bad sectors.

    Below is a screenshot of the WebBIOS: [screenshot]

    Could anyone offer any advice as to how the problematic disk(s) can be identified?

    • David Schwartz
      David Schwartz over 7 years
      What RAID level is the array?
    • James
      James over 7 years
      @DavidSchwartz It's a RAID10 array
  • HBruijn
    HBruijn over 7 years
    AFAIK megacli also exists for Windows.
  • Vikelidis Kostas
    Vikelidis Kostas over 7 years
    @HBruijn I wasn't aware of that. Thanks for mentioning it.
  • Spooler
    Spooler over 7 years
    While a disk may not be degraded in an array, it can still be bad. It takes quite a bit for controllers to eject a drive in some cases, and if they are not "manufacturer certified" drives, they won't get automatically ejected unless they have a total SMART failure. In the meantime, they will still negatively impact the array.
  • James
    James over 7 years
    You're right, it is sda. Unfortunately, when running the command I get: Smartctl open device: /dev/sda [megaraid_disk_01] failed: INQUIRY failed
  • Spooler
    Spooler over 7 years
    Silly question: are you running it as root?
  • James
    James over 7 years
    Yes, running as root. I think I've got it working - the indexes are 2, 3, 4, 6 rather than 0, 1, 2, 3 as I'd assumed. I found this out by running MegaCli -LdPdInfo -a0, which shows each index on the "Device Id: XXX" line (see the sketch after these comments for a way to combine this with smartctl).
  • James
    James over 7 years
    Two of the disks have non-zero values for 'read' under 'Total uncorrected errors' and 'Non-medium error count'. Are these the values I should be looking at? One of them is 'DiskGroup: 0, Span: 0, Arm: 1' and the other 'DiskGroup: 0, Span: 1, Arm: 0'. Any advice what to do next?
  • Spooler
    Spooler over 7 years
    Non-medium error count means anything other than read, write, or verify errors. They typically (if uncorrected) involve the drive "dropping out" of a controller for a time, or resetting without being signaled to. Since you also have an uncorrected error count, you are running into multi-bit and many-bit errors, which are damning but very hard to track with anything other than SMART. You should replace those drives immediately.
  • James
    James over 7 years
    Just to confirm, all firmware states are 'Online, Spun Up'
  • James
    James over 7 years
    Am I right in thinking that both can be replaced since they are on separate spans? Should they be replaced one at a time - i.e. replace a drive, let it finish rebuilding and then replace the next drive?
  • Spooler
    Spooler over 7 years
    Always replace one drive at a time, unless you're really good at (and enjoy) restoring broken arrays. This is true of pretty much any by-disk array membership.
  • user121391
    user121391 over 7 years
    Couldn't write errors also just be that - write attempts that were finally unsuccessful?
  • ddm-j
    ddm-j over 7 years
    @user121391, the drives are supposed to remap bad sectors on write, silently. If a drive reports a write error that means it has run out of sectors to remap to, so a large number of sectors has gone bad. That is usually reason to immediately kick the drive out. Propagating a write error upwards means that none of the drives could write to that sector. That is either the controller being broken and writing to an invalid sector (-> replace the controller), or all of your drives have severe problems and the entire setup needs to be investigated.
  • ddm-j
    ddm-j over 7 years
    @user121391, disk failures are either gradual and only detectable on access, or sudden and global. That is why you need to read and compare the data across all disks periodically -- any drive reporting a read failure is given a good copy by rewriting the sectors, that the drive should store in one of the remapped sectors, and the error is logged for the admin. If a drive fails to read the same sector again on the next check, throw it out and never buy from the same vendor again.
  • user121391
    user121391 over 7 years
    I think I misunderstood your initial answer in the sense of "get write errors" as "get a report that write errors have occurred in a drive in the array" vs. "receive write error from the disk directly", that was the reason for my confusion. Now it makes sense.
  • Guntram Blohm
    Guntram Blohm over 7 years
    +1 for "If the RAID passes the errors on to you, then obviously something is wrong that cannot be silently corrected.". If the RAID reports errors while being used normally, it's already too late. (Of course, a special RAID utility reporting an error is something different).
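
Pulling the comment thread together: since the megaraid drive numbers are not necessarily 0 to N-1, you can let megacli enumerate them and feed each ID to smartctl. A sketch, assuming adapter 0, the volume at /dev/sda, and the stock "Device Id:" lines in the megacli output:

# List the physical drives behind the logical drives on adapter 0,
# extract each Device Id, and ask smartctl for its health verdict.
megacli -LdPdInfo -a0 | awk '/Device Id:/ {print $3}' | while read -r id; do
    echo "=== megaraid device $id ==="
    smartctl -H -d megaraid,"$id" /dev/sda
done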