Is my hard drive about to die?


It's not about to die. It's already dead.

Replace it ASAP, and restore from backups if you lose any data.
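
Before pulling a drive, it helps to know which physical disk the kernel's `ata1.00` label refers to. One way to map ATA port names to `/dev/sdX` devices (a sketch; sysfs layouts vary slightly between kernels, and the device names below are whatever your system has):

```shell
# Map each sdX block device to the ATA port it hangs off. The sysfs
# path for a SATA disk contains the "ataN" port name that appears in
# the kernel log, so this shows which /dev/sdX is the failing one.
for d in /sys/block/sd?; do
  port=$(readlink -f "$d" | grep -o 'ata[0-9]*' | head -n 1)
  printf '%s is on port %s\n' "${d##*/}" "${port:-unknown}"
done
```

`smartctl -i /dev/sdX` then prints the drive's serial number, which you can match against the label on the physical drive before you pull it.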

Author: Hristo Deshev

Updated on September 18, 2022

Comments

  • Hristo Deshev
    Hristo Deshev over 1 year

    I have two hard drives set up as a RAID 1 array on my server (Linux, software RAID using mdadm) and one of them just got me this "present" in syslog:

    Nov 23 02:05:29 h2 kernel: [7305215.338153] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
    Nov 23 02:05:29 h2 kernel: [7305215.338178] ata1.00: irq_stat 0x40000008
    Nov 23 02:05:29 h2 kernel: [7305215.338197] ata1.00: failed command: READ FPDMA QUEUED
    Nov 23 02:05:29 h2 kernel: [7305215.338220] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
    Nov 23 02:05:29 h2 kernel: [7305215.338221]          res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
    Nov 23 02:05:29 h2 kernel: [7305215.338287] ata1.00: status: { DRDY ERR }
    Nov 23 02:05:29 h2 kernel: [7305215.338305] ata1.00: error: { UNC }
    Nov 23 02:05:29 h2 kernel: [7305215.358901] ata1.00: configured for UDMA/133
    Nov 23 02:05:32 h2 kernel: [7305218.269054] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
    Nov 23 02:05:32 h2 kernel: [7305218.269081] ata1.00: irq_stat 0x40000008
    Nov 23 02:05:32 h2 kernel: [7305218.269101] ata1.00: failed command: READ FPDMA QUEUED
    Nov 23 02:05:32 h2 kernel: [7305218.269125] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
    Nov 23 02:05:32 h2 kernel: [7305218.269126]          res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
    Nov 23 02:05:32 h2 kernel: [7305218.269196] ata1.00: status: { DRDY ERR }
    Nov 23 02:05:32 h2 kernel: [7305218.269215] ata1.00: error: { UNC }
    Nov 23 02:05:32 h2 kernel: [7305218.341565] ata1.00: configured for UDMA/133
    Nov 23 02:05:35 h2 kernel: [7305221.193342] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
    Nov 23 02:05:35 h2 kernel: [7305221.193368] ata1.00: irq_stat 0x40000008
    Nov 23 02:05:35 h2 kernel: [7305221.193386] ata1.00: failed command: READ FPDMA QUEUED
    Nov 23 02:05:35 h2 kernel: [7305221.193408] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
    Nov 23 02:05:35 h2 kernel: [7305221.193409]          res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
    Nov 23 02:05:35 h2 kernel: [7305221.193474] ata1.00: status: { DRDY ERR }
    Nov 23 02:05:35 h2 kernel: [7305221.193491] ata1.00: error: { UNC }
    Nov 23 02:05:35 h2 kernel: [7305221.388404] ata1.00: configured for UDMA/133
    Nov 23 02:05:38 h2 kernel: [7305224.426316] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
    Nov 23 02:05:38 h2 kernel: [7305224.426343] ata1.00: irq_stat 0x40000008
    Nov 23 02:05:38 h2 kernel: [7305224.426363] ata1.00: failed command: READ FPDMA QUEUED
    Nov 23 02:05:38 h2 kernel: [7305224.426387] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
    Nov 23 02:05:38 h2 kernel: [7305224.426388]          res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
    Nov 23 02:05:38 h2 kernel: [7305224.426459] ata1.00: status: { DRDY ERR }
    Nov 23 02:05:38 h2 kernel: [7305224.426478] ata1.00: error: { UNC }
    Nov 23 02:05:38 h2 kernel: [7305224.498133] ata1.00: configured for UDMA/133
    Nov 23 02:05:41 h2 kernel: [7305227.400583] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
    Nov 23 02:05:41 h2 kernel: [7305227.400608] ata1.00: irq_stat 0x40000008
    Nov 23 02:05:41 h2 kernel: [7305227.400627] ata1.00: failed command: READ FPDMA QUEUED
    Nov 23 02:05:41 h2 kernel: [7305227.400649] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
    Nov 23 02:05:41 h2 kernel: [7305227.400650]          res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
    Nov 23 02:05:41 h2 kernel: [7305227.400716] ata1.00: status: { DRDY ERR }
    Nov 23 02:05:41 h2 kernel: [7305227.400734] ata1.00: error: { UNC }
    Nov 23 02:05:41 h2 kernel: [7305227.472432] ata1.00: configured for UDMA/133
    

    From what I've read so far, I'm not sure whether read errors alone mean a hard drive is dying on me (there have been no write errors yet). I've had hard drive failures in the past, and those always logged errors about failing to write to specific sectors. Not this time.

    Should I be replacing the drive? Could something else be causing the problem?

    I've scheduled a smartctl -t long test that will finish in a couple of hours. I hope this will give me some more info.
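
    While the long self-test runs, the SMART attribute table already tells you a lot. A sketch of what to look at, using abbreviated made-up sample output (on a real system the command is `smartctl -A /dev/sdX`):

    ```shell
    # Abbreviated, made-up sample of `smartctl -A /dev/sda` output.
    sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline     -       8'

    # Non-zero raw values (last column) on these attributes are the
    # classic signs of failing media: sectors the drive can no longer read.
    echo "$sample" | awk '$NF + 0 > 0 { print $2, "raw =", $NF }'
    ```

    A `Current_Pending_Sector` count that keeps growing is the strongest "replace me" signal; a one-off UNC error whose sector later reallocates cleanly is less alarming.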


    UPDATE: Something like a miracle happened. Details below:

    I was backing up some files off that machine, preparing to replace the faulty drive. Then, as I was copying those huge files, I got this logcheck email:

    Security Events for kernel
    =-=-=-=-=-=-=-=-=-=-=-=-=-
    Nov 23 17:16:24 h2 kernel: [7359837.963597] end_request: I/O error, dev sdb, sector 1202093816
    Nov 23 17:16:41 h2 kernel: [7359855.196334] end_request: I/O error, dev sdb, sector 1202093816
    
    System Events
    =-=-=-=-=-=-=
    Nov 23 17:14:06 h2 kernel: [7359700.193114] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
    Nov 23 17:14:06 h2 kernel: [7359700.193139] ata2.00: irq_stat 0x40000008
    Nov 23 17:14:06 h2 kernel: [7359700.193158] ata2.00: failed command: READ FPDMA QUEUED
    Nov 23 17:14:06 h2 kernel: [7359700.193180] ata2.00: cmd 60/08:00:58:03:aa/00:00:47:00:00/40 tag 0 ncq 4096 in
    Nov 23 17:14:06 h2 kernel: [7359700.193181]          res 41/40:08:58:03:aa/00:00:47:00:00/00 Emask 0x409 (media error) <F>
    Nov 23 17:14:06 h2 kernel: [7359700.193247] ata2.00: status: { DRDY ERR }
    Nov 23 17:14:06 h2 kernel: [7359700.193265] ata2.00: error: { UNC }
    Nov 23 17:14:06 h2 kernel: [7359700.194458] ata2.00: configured for UDMA/133
    

    Oops! My hair, if I had any on my shaved head, would have stood up. See, these are real, effing bad sectors, and on the second drive this time. Now what? With two faulty drives, what do I do?

    I gave it some thought and realized that I had:

    • one drive that I merely suspected to be faulty, and
    • another that I was 100% sure was faulty, given the bad-sector complaints in the log.

    So I replaced the second one, not the one I originally posted the question about. I had several partitions, each in its own RAID array, and I was hoping I'd be able to resync at least the root and boot ones so that I wouldn't have to reinstall everything on the server. I'd probably still have to restore the huge data partition from backup, but that would save me some work.

    Replaced the drive, started the resyncs. Root and boot partitions (about 50GB) resynced really fast. No errors. I'm a happy camper!

    Just for kicks, let's try resyncing the huge data partition -- it's about 2TB with 500GB of data on it. I started the resync and watched it for a while. It seemed to take forever, so I brought the server back online and let users get at their stuff, with the resync happening in the background. And, what do you know, about 18 hours later the resync finished with no errors. The server is fully alive now.
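
    For anyone following along, the replacement dance above boils down to a few mdadm commands. `/dev/md0` and `/dev/sdb1` are placeholder names here; substitute your own array and member partition:

    ```shell
    # Mark the failing disk's member as faulty and pull it from the array:
    mdadm --manage /dev/md0 --fail /dev/sdb1
    mdadm --manage /dev/md0 --remove /dev/sdb1

    # After physically swapping the disk and partitioning it to match the
    # surviving drive, add the new member; the resync starts automatically:
    mdadm --manage /dev/md0 --add /dev/sdb1

    # Watch the rebuild progress:
    cat /proc/mdstat
    ```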

    I wonder if I should be replacing the original drive now. I'm sure the server god of hard drives is laughing his butt off at me.

  • user
    user over 11 years
    Isn't the point of RAID 1 to survive the loss of all but one of the disks in the array without loss of data? Let's just hope that the only remaining disk doesn't break down under the load of rebuilding the array once the dead drive has been replaced.
  • Tom O'Connor
    Tom O'Connor over 11 years
    Oh, in theory, that's the plan. But it depends on the grade of the drives. If they're WD Blacks, it should be fine. If they're Greens or Blues... well, YMMV, but I'd be checking backup integrity.
  • Tom O'Connor
    Tom O'Connor over 11 years
    I'm sorry, but after 10 years of experience with hard disk failures on Linux, I can say this is pretty much always a dead drive. It's either the disk platters failing or the drive electronics.
  • user
    user over 11 years
    Semi-related: shouldn't a scrub catch any data integrity issues regardless of drive make and model? (Hopefully before one of the drives in the array fails...) Of course, that depends on scrubbing before things go awry. Scrubbing a degraded array would probably make things worse, and certainly not better.
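    For reference, with Linux software RAID a scrub is kicked off through the md sysfs interface. `md0` is a placeholder array name; many distros schedule this monthly via a `checkarray` cron job:

    ```shell
    # Ask md to read and compare every sector of all mirrors:
    echo check > /sys/block/md0/md/sync_action

    # Once the check finishes, a non-zero count here means the
    # mirrors disagreed somewhere:
    cat /sys/block/md0/md/mismatch_cnt
    ```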
  • Tom O'Connor
    Tom O'Connor over 11 years
    Again: in theory. It might not catch it if the drive electronics have gone nuts. I'd be (somewhat) less scared if it were hardware RAID. The make-and-model thing is really just... what's the word... anecdotal evidence.
  • Tom O'Connor
    Tom O'Connor over 11 years
    But there is some truth to enterprise drives (the SAS-controlled ones, and the higher-end Black ones) having a higher MTBF and the ability to recover from more read errors before going entirely loopy.
  • Hristo Deshev
    Hristo Deshev over 11 years
    Tom, thanks a million! I'm scheduling the HDD replacement time with the hosting technicians.
  • Hristo Deshev
    Hristo Deshev over 11 years
    Well, it turned out the drive was healthy enough to survive a RAID resync. My other hard drive on the server died, and I could resync the array off the first one... I've updated the question text with details.
  • Tom O'Connor
    Tom O'Connor over 11 years
    Yes. You should be replacing the drive that failed previously and now works. It's trolling you.