Does a bad sector indicate a failing disk?

ubuntu hard-disk fsck sata badblocks

21,189

Solution 1

Bad sectors are always an indication of a failing HDD. In fact, by the time you see an I/O error such as this, you have probably already lost or corrupted some data. Make a backup if you don't have one already, run a self-test with smartctl -t long /dev/disk, and check the SMART data with smartctl -a /dev/disk. Get a replacement if you can.
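When reading the SMART data, the attributes that matter most for bad sectors are usually the reallocated, pending, and offline-uncorrectable counts. The following is a minimal sketch that pulls their raw values out of the attribute table. It assumes the common ten-column smartctl -A layout and the attribute names shown, which can differ between vendors; the function name and /dev/sdX are placeholders.

```shell
#!/bin/sh
# Print the raw values of the SMART attributes most relevant to bad sectors.
# Reads `smartctl -A` attribute-table text on stdin, so it also works on saved output.
# Assumption: ATTRIBUTE_NAME is column 2 and RAW_VALUE is column 10.
smart_sector_summary() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable" { print $2 "=" $10 }'
}

# Typical use (placeholder device name, needs root):
#   smartctl -A /dev/sdX | smart_sector_summary
```

Non-zero values for any of these are worth watching over time; a count that keeps rising is a stronger warning sign than a single stable value.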

Bad sectors can't be repaired, only replaced by reserve sectors, which harms HDD performance, as they require additional seeks to the reserve sectors every time they are accessed. Marking such sectors as bad on the filesystem layer helps, as they won't ever be accessed then; however it's hard to determine which sectors were already reallocated by the disk, so chances are the filesystem won't know to avoid the affected region.

Solution 2

To make the drive reallocate the sectors, you usually need to write something to them. However, dd (Disk Destroyer) does not always work and is very unsafe: if you confuse the skip and seek options, you can easily shoot yourself in the foot by skipping the first N blocks of /dev/zero and writing a block from that "offset" over sector 0 of your hard disk.
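The skip/seek mix-up can be reproduced safely on scratch files. This is only an illustrative sketch: the file names are made up, and conv=notrunc is used so that the regular file behaves like a fixed-size device instead of being truncated.

```shell
#!/bin/sh
# Safe demonstration on scratch files only; no real disk is touched.
tmpdir=$(mktemp -d)
src="$tmpdir/src.img"
dst="$tmpdir/dst.img"

# Source: four 512-byte blocks filled with A, B, C, D.
for c in A B C D; do
    dd if=/dev/zero bs=512 count=1 2>/dev/null | tr '\0' "$c"
done > "$src"

# Destination: four blocks of zeros.
dd if=/dev/zero of="$dst" bs=512 count=4 2>/dev/null

# seek=N skips N blocks of the OUTPUT: input block 0 ("A") lands on output block 2.
dd if="$src" of="$dst" bs=512 count=1 seek=2 conv=notrunc 2>/dev/null
seek_result=$(tr '\0' '.' < "$dst" | cut -c1,513,1025,1537)

# skip=N skips N blocks of the INPUT: input block 2 ("C") lands on output block 0.
dd if=/dev/zero of="$dst" bs=512 count=4 2>/dev/null
dd if="$src" of="$dst" bs=512 count=1 skip=2 conv=notrunc 2>/dev/null
skip_result=$(tr '\0' '.' < "$dst" | cut -c1,513,1025,1537)

# One character per block, "." meaning an untouched (zero) block.
echo "seek=2: $seek_result"   # prints "seek=2: ..A."
echo "skip=2: $skip_result"   # prints "skip=2: C..."
rm -rf "$tmpdir"
```

With skip in place of seek, the write lands on block 0 of the output, which on a real disk is where the partition table lives.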

If you are sure you want to force the sector to be overwritten with zeroes, use hdparm:

% sudo hdparm --read-sector 833192656 /dev/sda
/dev/sda:
reading sector 833192656: FAILED: Input/output error

Yes, sector 833192656 was failing in the SMART self-tests, too. To write zeroes to it, use --write-sector:

% sudo hdparm --write-sector 833192656 /dev/sda
/dev/sda:
Use of --write-sector is VERY DANGEROUS.
You are trying to deliberately overwrite a low-level sector on the media.
This is a BAD idea, and can easily result in total data loss.
Please supply the --yes-i-know-what-i-am-doing flag if you really want this.
Program aborted.

As a safeguard, hdparm does not write anything unless you also pass the --yes-i-know-what-i-am-doing switch:

% sudo hdparm --yes-i-know-what-i-am-doing --write-sector 833192656 /dev/sda
/dev/sda:
re-writing sector 833192656: succeeded
% sudo hdparm --read-sector 833192656 /dev/sda
/dev/sda:
reading sector 833192656: succeeded
0000 0000 0000 0000 0000 0000 0000 0000
[      ... more zeroes here...        ]
0000 0000 0000 0000 0000 0000 0000 0000
%

Solution 3

No, bad sectors are not always an indication of a failing drive. Sometimes if a write is in progress at the time of a power failure, the data in the sector will be corrupted, resulting in an error when you try to read it. Attempting to write new data to the sector may work just fine since there's nothing physically wrong with it.

You can run badblocks -n on the drive to read and rewrite every sector, or, since you already know the number of the sector in question, you can use dd to write zeros to it. You can check the SMART stats with smartctl -a. The pending sector count (Current_Pending_Sector) indicates how many sectors have failed to read; after you write to the sector, this count should go down. If the reallocated sector count goes up, the sector was physically bad and has been remapped to the spare pool, which may be a sign that the drive is on its way out. If it does not, the sector was just scrambled and should be fine now.
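The before/after reasoning can be written down as a small decision helper. This is a sketch only: the function name and the wording of the messages are made up for illustration, and the four arguments are raw SMART counts that you note down yourself before and after the write.

```shell
#!/bin/sh
# Interpret SMART raw counts taken before and after rewriting a sector.
# Arguments: pending_before pending_after realloc_before realloc_after
classify_rewrite() {
    pending_before=$1
    pending_after=$2
    realloc_before=$3
    realloc_after=$4
    if [ "$realloc_after" -gt "$realloc_before" ]; then
        # The drive had to use a spare sector.
        echo "remapped: sector was physically bad and now uses a spare"
    elif [ "$pending_after" -lt "$pending_before" ]; then
        # The write succeeded in place and the sector reads back fine.
        echo "recovered: sector rewritten in place, no remap"
    else
        echo "unchanged: the write did not clear the pending sector"
    fi
}

# Example: pending count dropped from 1 to 0, reallocated count stayed at 0.
classify_rewrite 1 0 0 0
```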

Try reading the sector first:

dd count=1 if=/dev/sda of=/dev/null skip=nnnn

If that fails, you have the right number, and you can zero it out with:

dd count=1 if=/dev/zero of=/dev/sda seek=nnnn

Double-check that you typed the command exactly before hitting Enter.
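One way to make that check easier is to have a helper print the command for review instead of running it. The sketch below also addresses a plausible reason a one-sector dd write can fail: on drives with 4096-byte physical sectors, a buffered 512-byte write may first have to read the surrounding data, and that read hits the same bad sector. It assumes 512-byte logical sectors, 4096-byte physical sectors, and GNU dd (for oflag=direct); the function name and /dev/sdX are placeholders. Note that the printed command zeroes the whole 4096-byte block, that is, eight logical sectors rather than one.

```shell
#!/bin/sh
# Build (but do not run) a dd command that zeroes the whole 4096-byte physical
# block containing a given 512-byte logical sector.
# Assumptions: 512-byte logical sectors, 4096-byte physical sectors, GNU dd.
zero_block_cmd() {
    dev=$1
    lba=$2
    case $lba in
        ''|*[!0-9]*)
            echo "sector must be a decimal number" >&2
            return 1
            ;;
    esac
    block=$((lba / 8))    # 8 logical sectors per 4096-byte block
    echo "dd if=/dev/zero of=$dev bs=4096 count=1 seek=$block oflag=direct"
}

# Print the command for the sector from the kernel log; nothing is written.
zero_block_cmd /dev/sdX 2944614392
# prints: dd if=/dev/zero of=/dev/sdX bs=4096 count=1 seek=368076799 oflag=direct
```

Read the printed line carefully, replace the placeholder device, and only then paste it into a root shell.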


MrNorm

Updated on September 18, 2022

Comments

MrNorm 9 months

My Ubuntu 13.10 system has been performing very poorly over the last day or so. Looking at the kernel logs, it appears that the <1yr old 3TB SATA disk is having issues with a particular sector:

Nov  4 20:54:04 mediaserver kernel: [10893.039180] ata4.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov  4 20:54:04 mediaserver kernel: [10893.039187] ata4.01: BMDMA stat 0x65
Nov  4 20:54:04 mediaserver kernel: [10893.039193] ata4.01: failed command: READ DMA EXT
Nov  4 20:54:04 mediaserver kernel: [10893.039202] ata4.01: cmd 25/00:08:f8:3f:83/00:00:af:00:00/f0 tag 0 dma 4096 in
Nov  4 20:54:04 mediaserver kernel: [10893.039202]          res 51/40:00:f8:3f:83/40:00:af:00:00/10 Emask 0x9 (media error)
Nov  4 20:54:04 mediaserver kernel: [10893.039207] ata4.01: status: { DRDY ERR }
Nov  4 20:54:04 mediaserver kernel: [10893.039211] ata4.01: error: { UNC }
Nov  4 20:54:04 mediaserver kernel: [10893.148527] ata4.00: configured for UDMA/133
Nov  4 20:54:04 mediaserver kernel: [10893.180322] ata4.01: configured for UDMA/133
Nov  4 20:54:04 mediaserver kernel: [10893.180345] sd 3:0:1:0: [sdc] Unhandled sense code
Nov  4 20:54:04 mediaserver kernel: [10893.180349] sd 3:0:1:0: [sdc]
Nov  4 20:54:04 mediaserver kernel: [10893.180353] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov  4 20:54:04 mediaserver kernel: [10893.180356] sd 3:0:1:0: [sdc]
Nov  4 20:54:04 mediaserver kernel: [10893.180359] Sense Key : Medium Error [current] [descriptor]
Nov  4 20:54:04 mediaserver kernel: [10893.180371] Descriptor sense data with sense descriptors (in hex):
Nov  4 20:54:04 mediaserver kernel: [10893.180373]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Nov  4 20:54:04 mediaserver kernel: [10893.180384]         af 83 3f f8
Nov  4 20:54:04 mediaserver kernel: [10893.180389] sd 3:0:1:0: [sdc]
Nov  4 20:54:04 mediaserver kernel: [10893.180393] Add. Sense: Unrecovered read error - auto reallocate failed
Nov  4 20:54:04 mediaserver kernel: [10893.180396] sd 3:0:1:0: [sdc] CDB:
Nov  4 20:54:04 mediaserver kernel: [10893.180398] Read(16): 88 00 00 00 00 00 af 83 3f f8 00 00 00 08 00 00
Nov  4 20:54:04 mediaserver kernel: [10893.180412] end_request: I/O error, dev sdc, sector 2944614392
Nov  4 20:54:04 mediaserver kernel: [10893.180431] ata4: EH complete

The kern.log file is around 33MB, mostly filled with the above error repeated, and the sector number does not appear to differ between the repeated messages.
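Rather than eyeballing 33MB of log, the sector numbers can be counted. This is a rough sketch that assumes the "I/O error, dev ..., sector N" line format shown above; the function name is made up, and /var/log/kern.log is the usual Ubuntu location but is an assumption here.

```shell
#!/bin/sh
# Count how often each sector number appears in kernel "I/O error" lines.
# Reads log text on stdin and prints "sector count", one sector per line.
failing_sectors() {
    awk '/I\/O error/ && match($0, /sector [0-9]+/) {
        # "sector " is 7 characters long; keep only the digits after it.
        n[substr($0, RSTART + 7, RLENGTH - 7)]++
    }
    END { for (s in n) print s, n[s] }' | sort -n
}

# Typical use:
#   failing_sectors < /var/log/kern.log
```

A single line of output means every error refers to the same sector; several lines would suggest the damage is spread more widely.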

I'm currently running the following command on the now unmounted disk to test and attempt to sort out any issues the disk might have. I'm around 12hrs in and expect it to take another 24/48 hours as the disk is so large:

e2fsck -c -c -p -v /dev/sdc1

My question is: Is this drive failing, or am I looking at a common issue here? I'm wondering whether there is any point in repairing or ignoring the bad sectors, or whether I should replace the disk under warranty whilst it's still covered. My knowledge of the above command is somewhat lacking, so I'm sceptical as to whether it'll help or not.

Quick update!

e2fsck finally finished after 2 days with lots of 'multiply-claimed block(s) in inode'. Trying to mount the filesystem resulted in an error, forcing it to drop back to read-only:

Nov 11 08:29:05 mediaserver kernel: [211822.287758] EXT4-fs (sdc1): warning: mounting fs with errors, running e2fsck is recommended
Nov 11 08:29:05 mediaserver kernel: [211822.301699] EXT4-fs (sdc1): mounted filesystem with ordered data mode. Opts: errors=remount-ro

Trying to read the sector manually:

sudo dd count=1 if=/dev/sdc of=/dev/null skip=2944614392
dd: reading ‘/dev/sdc’: Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 5.73077 s, 0.0 kB/s

Trying to write to it:

sudo dd count=1 if=/dev/zero of=/dev/sdc seek=2944614392
dd: writing to ‘/dev/sdc’: Input/output error
1+0 records in
0+0 records out
0 bytes (0 B) copied, 2.87869 s, 0.0 kB/s

On both counts, the Reallocated_Sector_Ct remained 0.

The drive does go into a sleep state quite often. I'm now thinking this could be a filesystem issue? I'm not 100%.

MrNorm over 9 years

Thanks. Really helpful to know as it's always been a grey area for me. I'm going to zero the drive and send it back, as it's within warranty.
RobotHumans over 9 years

Not so. Bad sectors just indicate grossly high traffic to a sector. In MOST cases, it does indicate a failing disk. You can tune your seek speed to mark slower responses as bad though... It's too complex to say always though.
MrNorm over 9 years

It's interesting you say that, because I got some interesting information following your commands. I have amended my question above.
frostschutz over 9 years

Does your drive not support SMART for some reason or why is it that you haven't checked that yet?
user over 9 years

@frostschutz "On both counts, the Reallocated_Sector_Ct remained 0." Seems the OP has checked SMART.
psusi over 9 years

@MrNorm, please add the full smartctl -a output to your question.
Antti Haapala almost 8 years

Please don't use this (it doesn't even always work), and if you confuse skip and seek you will overwrite your MBR instead. See my answer
aircraft over 5 years

@frostschutz What do you mean by "Get a replacement if you can"? Do you mean replacing the disk?
Owais F over 5 years

@AnttiHaapala +1, I did exactly what you are warning about. Luckily, this disk actually had a GPT, so I only damaged the protective MBR which gdisk knows how to repair. Use of hdparm seems much safer -- TIL about that, upvoted.
psusi over 5 years

You certainly do have to double-check your commands (as noted in my original answer two years before Antti's) before you foul things up, but it doesn't seem any easier to me to mess up the arguments to dd than those to hdparm.
TooTea over 4 years

Although this is an ancient answer, I'm really wondering what you mean by "dd doesn't always work". Are you suggesting that it might fail to write data as instructed? It's not doing anything particularly prone to failure, just copying data around. You could get the same result using two lines in almost any programming language.