Possibly a dying hard drive, but reads, writes work - unsure about log entries
Solution 1
The salient log entries are:
- kernel: ata4.00: error: { ICRC ABRT }
- kernel: ata4: SError: { UnrecovData 10B8B BadCRC }
These log entries indicate an error is occurring on the SATA interface between the PC and HDD.
The SATA interface carries data, commands and status reports in frames, each of which is verified using a CRC (Cyclic Redundancy Check) code.
The ICRC ABRT message indicates an "Interface CRC error" event and that the "Command aborted". The other log entries are ancillary information relating to the command that was aborted.
These messages do not report an error relating to the R/W heads or platters of the HDD, since sectors on the media are verified using ECC, not the weaker CRC.
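The flag names in the SError line are a decoded view of the SErr register value that the kernel prints on the exception line (SErr 0x280100 in the question's log). As a rough sketch, the three flags seen here can be recovered from that value. The function name decode_serr is made up for this example, and only the three bits that appear in this log are handled; the register defines more bits than that.

```shell
# Decode a SATA SError register value into the flag names libata prints.
# Sketch only: covers just the three bits seen in this log.
decode_serr() {
    value=$(( $1 ))          # accepts hex input such as 0x280100
    flags=""
    if [ $(( value & (1 << 8) )) -ne 0 ]; then flags="$flags UnrecovData"; fi
    if [ $(( value & (1 << 19) )) -ne 0 ]; then flags="$flags 10B8B"; fi
    if [ $(( value & (1 << 21) )) -ne 0 ]; then flags="$flags BadCRC"; fi
    echo "${flags# }"        # strip the leading space
}

decode_serr 0x280100
```

All three flags describe trouble on the link itself (a data integrity error, a 10b-to-8b decode error and a CRC error), which is consistent with an interface problem rather than a media problem.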
More detailed information about these messages is at this libata wiki page
See this similar question on "SATA drives or chipset throwing DRDY ERR and ICRC ABRT", where the source of the problem was attributed to the host side of the SATA interface and not the HDD.
Note that an occasional SATA interface error is not considered problematic:
For SATA drives, occasional transmission problems are expected even on
otherwise pretty healthy systems. No need to worry about it too much
unless the problem repeats itself a lot.
quoted from this Linux post.
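To judge whether the problem "repeats itself a lot", one option is to count the errors per day. The sketch below assumes the kernel log is read with journalctl in its short-iso output format, so that the first field of each line is an ISO timestamp. The function name crc_errors_per_day is invented for this example.

```shell
# Count kernel log lines mentioning ICRC, grouped by calendar day.
# Expects `journalctl -o short-iso` style lines on stdin, where the
# first field looks like 2022-09-18T10:15:02+0300.
crc_errors_per_day() {
    awk '/ICRC/ { count[substr($1, 1, 10)]++ }
         END { for (day in count) print day, count[day] }' | sort
}

# Typical use on a live system (kernel messages only):
#   journalctl -k -o short-iso | crc_errors_per_day
```

A handful of entries spread over weeks fits the "occasional" case quoted above; many entries per day would suggest checking the hardware.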
smartctl -t extended (S.M.A.R.T. long (maximum) scan) says nothing three times already.
The Extended S.M.A.R.T. test is a self-test that is performed local to the drive, and apparently does not stress the SATA interface. Hence it doesn't help resolve the issue, but does reinforce the notion that the issue is on the interface rather than the media.
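Although the self-test does not exercise the link, many drives keep their own count of interface CRC errors as S.M.A.R.T. attribute 199, commonly named UDMA_CRC_Error_Count. This is an assumption about the drive: not every model reports the attribute, and the name varies by vendor. A minimal sketch for pulling the raw value out of the smartctl -A attribute table (crc_error_count is a name invented here):

```shell
# Print the raw value of SMART attribute 199 from `smartctl -A` output
# read on stdin. In the attribute table, RAW_VALUE is the 10th column.
crc_error_count() {
    awk '$1 == 199 { print $10 }'
}

# Typical use (needs root):
#   smartctl -A /dev/sdc | crc_error_count
```

If the drive reports the attribute, a raw value that keeps rising between checks points at the cable or one of the two SATA ports rather than the platters.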
You need to look for a disk diagnostic or exerciser that executes from the host PC.
Since the Extended S.M.A.R.T. test can evidently read every sector without error, a near-identical test to read every sector and transfer that sector to the PC over the SATA bus is:
dd if=/dev/sdc of=/dev/null
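A slightly more convenient variant of the same read test is sketched below. It uses a larger block size so the pass finishes sooner, and the usage notes compare the number of ICRC entries in the kernel log before and after the run. The function name read_whole_device is made up, bs=1M assumes GNU dd, and the dmesg comparison assumes the kernel ring buffer does not wrap during the test.

```shell
# Read every block of a device (or file) and discard the data, so that
# each sector has to travel over the SATA link to the host.
read_whole_device() {
    dd if="$1" of=/dev/null bs=1M
}

# Typical use (needs root for a real device):
#   before=$(dmesg | grep -c ICRC)
#   read_whole_device /dev/sdc
#   after=$(dmesg | grep -c ICRC)
#   echo "new ICRC entries: $((after - before))"
```

If the whole drive reads cleanly but new ICRC entries appear during the run, the interface is the likely suspect.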
There are three possible sources of hardware failure on the SATA interface:
- the SATA cable, e.g. Is my drive dying?
  Simple test: replace the cable.
- the motherboard's SATA interface.
  Test: use a different SATA port, or install an alternate interface, such as a PCI or USB to SATA adapter with a new cable.
- the drive's SATA interface.
  Test: install the HDD in another PC with a new cable, and see if errors follow the drive.
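Before swapping cables or ports, it helps to confirm which block device sits behind the ata4 name in the log. On kernels that expose libata ports in sysfs, the resolved path of /sys/block/sdX usually contains an ataN component. The sketch below relies on that layout, which is an assumption about the system; both function names are invented for this example.

```shell
# Extract the first "ataN" component from a resolved sysfs device path.
ata_port_from_path() {
    printf '%s\n' "$1" | grep -oE '/ata[0-9]+/' | head -n 1 | tr -d '/'
}

# Resolve a block device name such as "sdc" to its ATA port name.
ata_port_of() {
    ata_port_from_path "$(readlink -f "/sys/block/$1")"
}

# Typical use:
#   ata_port_of sdc
```

After moving the drive to another port, running the same check again shows whether the errors in the log follow the drive or stay with the old port.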
Besides a hardware fault, there have been reports that implicated the Linux kernel as the cause of SATA errors:
- [SOLVED] DRDY ERR and ICRC ABRT in dmesg and console
- Repeated DRDY ERR / ICRC ABRT msgs on 2.6.31-19-server
Bottom Line
If you're only seeing these ICRC ABRT entries in the log at an infrequent "time to time" rate, then you may no longer have a problem. The original issues may have been caused by kernel bugs that were eliminated when you updated the system.
Try using the system, and back up diligently.
Solution 2
Regardless of OS, I always find that after anything strange like this starts to happen with a given HDD, it will almost certainly break within the next few months. If possible, I would recommend replacing the HDD with a new one. Other symptoms of a failing HDD are unusable files that you can still copy and move around, and programs that suddenly develop quirks.
In one of my laptop computers, the HDD was on its way out. I could boot to the OS just fine, but error messages suddenly started to appear about the strangest OS processes when doing actions that had worked just fine a minute before: one of the OS system files was semi-corrupt due to the failing HDD. After replacing the HDD, this stopped completely, and the system has run fine for 4 years to date.
You can also try to run a full S.M.A.R.T. scan of the HDD. Tools for this are available from the manufacturer's website. Seagate and Western Digital, at least, have one, but I'm not sure if they are available for Linux. Sometimes the full scan will reveal a broken drive that a quick scan during POST will not catch.
Edit: I found this one for Linux, but I have no personal experience with it: http://sourceforge.net/apps/trac/smartmontools/wiki
tomsseisums
Updated on September 18, 2022
Comments
-
tomsseisums over 1 year
I recently received a Linux box having problems with a Samba share: first off, I couldn't connect; second, ls -la showed some I/O error (close to what can be seen below) with no listing.
Now, I've fully updated the box, and after the update, the RAID is OK, all the data is accessible and Samba works like a charm. Apparently, I didn't save the previous logs.
Now, even if everything works, from time to time this pops up in my journalctl:
kernel: ata4: EH complete
kernel: end_request: I/O error, dev sdc, sector 2839546656
kernel: cdb[0]=0x28: 28 00 a9 40 0b 20 00 00 f0 00
kernel: sd 3:0:0:0: [sdc] CDB:
kernel: ASC=0x47 ASCQ=0x0
kernel: sd 3:0:0:0: [sdc]
kernel: a9 40 0b a0
kernel: 72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
kernel: Descriptor sense data with sense descriptors (in hex):
kernel: Sense Key : 0xb [current] [descriptor]
kernel: sd 3:0:0:0: [sdc]
kernel: Result: hostbyte=0x00 driverbyte=0x08
kernel: sd 3:0:0:0: [sdc]
kernel: ata4.00: configured for UDMA/133
kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 330)
kernel: ata4: hard resetting link
kernel: ata4.00: error: { ICRC ABRT }
kernel: ata4.00: status: { DRDY ERR }
kernel: [145B blob data]
kernel: ata4.00: failed command: READ DMA EXT
kernel: ata4: SError: { UnrecovData 10B8B BadCRC }
kernel: ata4.00: BMDMA stat 0x26
kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6
smartctl -t extended (S.M.A.R.T. long (maximum) scan) says nothing three times already.
By "everything works", I mean:
# Read from drive, write to drive.
find > files.txt
# Another read->write.
du -bc > sizes.txt
# 100 GB random writer
dd if=/dev/urandom of=fillerd bs=512 count=209715200
The files end up uncorrupted and fully readable.
What does the error indicate? Should I be worried? How do I fix it?
-
tvdo over 10 years
Addressing completely random error messages and/or crashes: this could also be indicative of other failures, such as RAM.
-
Juha Untinen over 10 years
Random Exception Memory ;) It is also possible, so running Memtest86 (memtest86.com/download.htm) or similar is also a good thing to do.
-
tomsseisums over 10 years
I already ran a S.M.A.R.T. scan with the smartmontools.
-
Juha Untinen over 10 years
Ah, ok. Then the memory test would be the next thing to do.
-
tomsseisums over 10 years
This appeared to be the real issue. Something wrong with the connectors, because after some messing around with hardware, the errors stopped.