Possibly a dying hard drive, but reads, writes work - unsure about log entries

linux hard-drive sata samba smart

5,391

Solution 1

The salient log entries are:

kernel: ata4.00: error: { ICRC ABRT }
kernel: ata4: SError: { UnrecovData 10B8B BadCRC }

These log entries indicate an error is occurring on the SATA interface between the PC and HDD.
The SATA interface carries frames (Frame Information Structures, FIS) for data, commands and status reports, each verified using a CRC (Cyclic Redundancy Check) code.
The ICRC ABRT message indicates an "Interface CRC error" event and that the command was aborted. The other log entries are ancillary information relating to the command that was aborted.
This is not reporting an error relating to the R/W heads or platters of the HDD, since sectors on the media are verified using ECC, not the weaker CRC.
More detailed information about these messages is on this libata wiki page.
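
As an illustration of that ancillary information, the cdb[0]=0x28 line in the question's log is a SCSI READ(10) command block, and its bytes can be decoded by hand. The following is a minimal sketch, assuming the standard READ(10) layout (opcode, flags, 4-byte big-endian LBA, group, 2-byte transfer length, control); decode_read10 is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# decode_read10: hypothetical helper; takes the ten hex bytes of a SCSI
# READ(10) CDB (opcode 0x28) and prints the starting LBA and block count.
decode_read10() {
    [ "$1" = "28" ] || { echo "not a READ(10) CDB" >&2; return 1; }
    lba=$((0x$3$4$5$6))      # bytes 2-5: logical block address, big-endian
    count=$((0x$8$9))        # bytes 7-8: transfer length in blocks
    echo "lba=$lba count=$count"
}

# The CDB bytes copied from the question's log:
decode_read10 28 00 a9 40 0b 20 00 00 f0 00
# prints: lba=2839546656 count=240
```

The decoded LBA matches the "sector 2839546656" reported in the end_request line, so these entries all describe the same aborted 240-sector read.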

See this similar question on "SATA drives or chipset throwing DRDY ERR and ICRC ABRT", where the source of the problem was attributed to the host side of the SATA interface and not the HDD.

Note that an occasional SATA interface error is not considered problematic:

   For SATA drives, occasional transmission problems are expected even on
   otherwise pretty healthy systems. No need to worry about it too much
   unless the problem repeats itself a lot.

quoted from this Linux post.
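
To judge whether the errors are occasional or "repeat a lot", it can help to tally them per ATA device. The following is a minimal sketch, assuming the kernel messages look like the ones in the question; count_icrc is a hypothetical helper name.

```shell
#!/bin/sh
# count_icrc: hypothetical helper; reads kernel log lines on stdin and
# prints "<device> <count>" for each ATA device that logged an ICRC error.
count_icrc() {
    awk '/error:.*ICRC/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^ata[0-9]+(\.[0-9]+)?:$/) {
                     sub(/:$/, "", $i)   # strip the trailing colon
                     n[$i]++
                     break
                 }
         }
         END { for (p in n) print p, n[p] }' | sort
}

# Usage (assumes systemd's journalctl; -k selects kernel messages):
#   journalctl -k --no-pager | count_icrc
```

A count that keeps growing on one device points at that port, cable or drive; a handful of entries over a long period fits the "occasional" case quoted above.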

smartctl -t extended (S.M.A.R.T. long (maximum) scan) says nothing three times already.

The Extended S.M.A.R.T. test is a self-test that is performed local to the drive, and apparently does not stress the SATA interface. Hence it doesn't help resolve the issue, but does reinforce the notion that the issue is on the interface rather than the media.

You need to look for a disk diagnostic or exerciser that executes from the host PC.
Since the Extended S.M.A.R.T. test can evidently read every sector without error, a near-identical test to read every sector and transfer that sector to the PC over the SATA bus is:

dd if=/dev/sdc of=/dev/null
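
A slightly more practical variant of the same read test is sketched below. It assumes GNU dd (for the bs=1M spelling) and systemd's journalctl; read_all is a hypothetical helper name, and /dev/sdc is simply the device name from the question.

```shell
#!/bin/sh
# read_all: hypothetical helper; reads every byte of the given device or
# file and discards it, so the data must cross the SATA link to the host.
# bs=1M only reduces per-call overhead compared with dd's 512-byte default.
read_all() {
    dd if="$1" of=/dev/null bs=1M 2>/dev/null
}

# Usage (as root):
#   read_all /dev/sdc && echo "read completed without I/O errors"
#   journalctl -k --since "1 hour ago" --no-pager | grep -E 'ICRC|BadCRC'
```

Checking the kernel log afterwards matters because a CRC error that is recovered on retry may not make dd fail, yet it would still be logged.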

There are three possible sources of hardware failure on the SATA interface:

The SATA cable (see, e.g., "Is my drive dying?").
Simple test: replace the cable.
The motherboard's SATA interface.
Test: use a different SATA port, or install an alternate interface, such as a PCI or USB to SATA adapter, with a new cable.
The drive's SATA interface.
Test: install the HDD in another PC with a new cable, and see if the errors follow the drive.
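
One way to tell whether a cable or port swap helped is to compare the drive's own interface CRC counter before and after the change. The following is a minimal sketch, assuming the drive reports SMART attribute 199 (which smartmontools names UDMA_CRC_Error_Count; not every drive exposes it) and that smartctl -A prints its usual ten-column attribute table. crc_count is a hypothetical helper name.

```shell
#!/bin/sh
# crc_count: hypothetical helper; reads `smartctl -A` output on stdin and
# prints the raw value of the UDMA_CRC_Error_Count attribute, if present.
crc_count() {
    awk '$2 == "UDMA_CRC_Error_Count" { print $10 }'
}

# Usage (as root; /dev/sdc is the device name from the question):
#   smartctl -A /dev/sdc | crc_count    # note the value
#   ...swap the cable or port, exercise the drive, then check again...
#   smartctl -A /dev/sdc | crc_count    # a higher value means new CRC errors
```

The counter is cumulative over the drive's lifetime, so only an increase between the two readings is meaningful, not the absolute value.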

Besides a hardware fault, there have also been reports that implicated the Linux kernel as the cause of SATA errors.

Bottom Line

If you're only seeing these ICRC ABRT entries in the log at an infrequent "time to time" rate, then you may no longer have a problem. The original issues may have been caused by kernel bugs that were eliminated when you updated the system.

Try using the system, and back up diligently.

Solution 2

Regardless of OS, I always find that once anything strange like this starts to happen with a given HDD, it will almost certainly break within the next few months. If possible, I would recommend replacing the HDD with a new one. Other symptoms of a failing HDD include unusable files that you can still copy and move around, and programs that suddenly develop quirks.

In one of my laptop computers, the HDD was on its way out. I could boot the OS just fine, but error messages suddenly started to appear about the strangest OS processes when doing things that had worked a minute before: one of the OS system files was semi-corrupt due to the failing HDD. After I replaced the HDD, this stopped completely, and the system has run fine for 4 years to date.

You can also try to run a full S.M.A.R.T. scan of the HDD. You can find tools for this on the manufacturer's website. Seagate and Western Digital, at least, have one, but I'm not sure whether they are available for Linux. Sometimes the full scan will reveal a broken drive that a quick scan during POST will not catch.

Edit: I found this one for Linux, but I have no personal experience with it: http://sourceforge.net/apps/trac/smartmontools/wiki

tomsseisums

Updated on September 18, 2022

Comments

tomsseisums over 1 year

I recently received a Linux box that had problems with its Samba share: first, I couldn't connect to it, and second, ls -la showed an I/O error (close to what can be seen below) with no listing.

Now I've fully updated the box, and since the update the RAID is OK, all the data is accessible and Samba works like a charm. Unfortunately, I didn't save the previous logs.

Even though everything works, from time to time this pops up in my journalctl:

kernel: ata4: EH complete
kernel: end_request: I/O error, dev sdc, sector 2839546656
kernel: cdb[0]=0x28: 28 00 a9 40 0b 20 00 00 f0 00
kernel: sd 3:0:0:0: [sdc] CDB:
kernel: ASC=0x47 ASCQ=0x0
kernel: sd 3:0:0:0: [sdc]
kernel:         a9 40 0b a0
kernel:         72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
kernel: Descriptor sense data with sense descriptors (in hex):
kernel: Sense Key : 0xb [current] [descriptor]
kernel: sd 3:0:0:0: [sdc]
kernel: Result: hostbyte=0x00 driverbyte=0x08
kernel: sd 3:0:0:0: [sdc]
kernel: ata4.00: configured for UDMA/133
kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 330)
kernel: ata4: hard resetting link
kernel: ata4.00: error: { ICRC ABRT }
kernel: ata4.00: status: { DRDY ERR }
kernel: [145B blob data]
kernel: ata4.00: failed command: READ DMA EXT
kernel: ata4: SError: { UnrecovData 10B8B BadCRC }
kernel: ata4.00: BMDMA stat 0x26
kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6

smartctl -t extended (S.M.A.R.T. long (maximum) scan) says nothing three times already.

By "everything works", I mean:

# Read from drive, write to drive.
find > files.txt

# Another read->write.
du -bc > sizes.txt

# 100 GiB random writer (512 bytes x 209715200 blocks)
dd if=/dev/urandom of=fillerd bs=512 count=209715200

The files end up uncorrupted and fully readable.

What does the error indicate? Should I be worried? How do I fix it?

tvdo over 10 years

Addressing completely random error messages and/or crashes: this could also be indicative of other failures, such as RAM.

Juha Untinen over 10 years

Random Exception Memory ;) It is also possible, so running Memtest86 (memtest86.com/download.htm) or similar is also a good thing to do.

tomsseisums over 10 years

I already ran a S.M.A.R.T. scan with the smartmontools.

Juha Untinen over 10 years

Ah, ok. Then the memory test would be the next thing to do.

tomsseisums over 10 years

This appeared to be the real issue. Something wrong with the connectors, because after some messing around with hardware, the errors stopped.