HDD & SSD Linux: Hard resetting link

Solution 1

You do have a question here. If I understand correctly, it is: what is the process for determining what is causing this failure?

I'm a Network Security Engineer, so understand I'm cringing while typing this. Eliminate encryption as the cause: decrypt the drives and see if you still have the problem. The downside is that you'll need to run them decrypted for several months, since that is how long the errors take to show up.

Cables are a simple test, and you should start there. Swap them out, but I have a hard time believing that's the problem unless you have neon lights inside your case.

That leaves the mobo. If it's not the other two...

I'm sure someone will chime in if they disagree with my troubleshooting. It's not costly to change the cables, and disabling encryption temporarily is a security risk that only you can determine if you're willing to accept.
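If decrypting the real arrays is not acceptable, a rough alternative is to exercise a spare, unencrypted disk on the same ports and cables and watch whether the link errors still appear. Here is a minimal sketch in Python, assuming the spare disk is mounted somewhere writable. The function name, file name, mount point, and sizes are made up for illustration, and the cache-dropping call is only a hint to the kernel, so treat this as a starting point rather than a proper burn-in tool.

```python
import hashlib
import os
import tempfile

CHUNK = 1024 * 1024  # write and read in 1 MiB pieces


def write_and_verify(path, size_mib=64):
    """Write random data to `path`, read it back, and compare SHA-256 digests."""
    written = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(size_mib):
            block = os.urandom(CHUNK)
            written.update(block)
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push the data out to the device
        if hasattr(os, "posix_fadvise"):
            # Ask the kernel to drop cached pages so the read-back is more
            # likely to come from the disk. This is advisory only.
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            read_back.update(block)
    return written.hexdigest() == read_back.hexdigest()


if __name__ == "__main__":
    # Demo against a temporary directory. For a real run, point the path at
    # the test disk's mount point (hypothetical example: /mnt/scratch).
    with tempfile.TemporaryDirectory() as d:
        print(write_and_verify(os.path.join(d, "stress.bin"), size_mib=4))
```

Run from cron a few times a day, this keeps traffic flowing over the link. A digest mismatch or an `OSError` during the run is worth noting, but the kernel log is still the main place to look for "hard resetting link".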

Solution 2

It looks like you have a lot of errors on your SATA link. As a result, the host cannot get commands reliably across the link, and when it does sometimes the data returned is corrupted.

You can see that in the messages saying that the speed is being limited, or that the expected drive identifier was not received. You are also seeing confusing messages from different layers of the driver which don't necessarily reflect what is going on at the hardware level of SATA. For example, "limiting speed to UDMA/133:PIO3" strictly applies only to parallel ATA drives (it just means the driver is trying a slower interface speed to see if the errors clear up). The other error messages, however, clearly indicate that the lowest level, which actually deals with the hardware, understands it's talking to a SATA drive.

Your thought that it might be the SATA cables is a good one. Try replacing them, and make sure they're rated for SATA 3.0 Gb/sec (also called "SATA 2" or "SATA II"). I don't think your drives are the problem. Why does it take several months for the errors to show up after you replace the drive? Maybe the cables are coming loose somehow and replacing the drive reseats them. Or maybe it's just random chance.
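To see whether the errors follow a particular port (and so a particular cable or controller channel) rather than a particular drive, it can help to tally the link messages per `ataN` port. The following is a rough sketch in Python that only recognizes the message shapes visible in the dmesg output posted with the question. The function name and the sample are my own, and other kernels may word these messages differently.

```python
import re
from collections import Counter, defaultdict

# Matches the "ataN" or "ataN.MM" prefix that libata puts on its messages.
PORT_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (.*)$")
# Matches lines such as "SError: { 10B8B Dispar }" or "error: { ICRC ABRT }".
BRACES_RE = re.compile(r"^(SError|error): \{ (.*?) \}")


def summarize(lines):
    """Return {port: Counter} with reset counts and error flags per port."""
    summary = defaultdict(Counter)
    for line in lines:
        m = PORT_RE.search(line)
        if not m:
            continue
        port, msg = m.groups()
        if msg.startswith("hard resetting link"):
            summary[port]["hard resets"] += 1
        elif msg.startswith("limiting SATA link speed"):
            summary[port]["speed downgrades"] += 1
        else:
            flags = BRACES_RE.match(msg)
            if flags:
                for name in flags.group(2).split():
                    summary[port][f"{flags.group(1)}:{name}"] += 1
    return summary


# A few lines copied from the log in the question.
SAMPLE = """\
[43161.734130] ata3: SError: { 10B8B Dispar }
[43161.734150] ata3.00: error: { ICRC ABRT }
[43161.734155] ata3: hard resetting link
[43167.220123] ata3: hard resetting link
[43179.570220] ata3: limiting SATA link speed to 1.5 Gbps
[43249.120059] ata4: hard resetting link
""".splitlines()

if __name__ == "__main__":
    for port, counts in sorted(summarize(SAMPLE).items()):
        print(port, dict(counts))
```

If the counts stay with the same port after you swap cables and move drives between ports, that points at the controller. If they move with the cable, that points at the cable.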


shanet
Author by

shanet

Updated on September 18, 2022

Comments

  • shanet
    shanet almost 2 years

    My current storage setup consists of two traditional HDDs and two SSDs in my Linux box, each pair on its own RAID 1 array, which is encrypted via LUKS. I have a story of sorts, rather than a concrete question.

    For over a year now, I've randomly gotten "hard resetting link" errors in the kernel log from some of my drives. I would RMA the problem drive, and the new drive would cause the problem to stop. A few months later, I would start seeing the same error again at seemingly random times. The drive would be marked as failed in the RAID array and would no longer show up in fdisk -l. I would reboot the computer, the drive would show up again, and I could re-add it to the array and it would rebuild. Sooner or later the problem would happen again, usually a few hours later.

    About six months ago, I replaced two of the traditional HDDs with SSDs in the hope that they wouldn't have nearly as high a failure rate as the traditional drives. However, over the past few days I started having problems with both one of the new SSDs and one of the traditional drives.

    I'm starting to see a pattern emerge. I get a new drive, and a few months later I start having problems with it. I always assumed it was due to HDDs having a high failure rate, but now it's happening with SSDs, so I'm thinking it isn't the drives' fault. What else could be the problem? I've had multiple OSes installed since I started having the problem, so I want to rule out a software issue. This leaves either the SATA cables or the motherboard. Could the disk encryption be putting too much stress on the drives? Is there anything I can do to get more information? Thanks as always.

    Below is the dmesg output of the problem from a question I asked a few months ago when I was having the same problem.

    [43161.734107] ata3: ATA_REG 0x41 ERR_REG 0x84
    [43161.734110] ata3: tag : dhfis dmafis sdbfis sactive
    [43161.734113] ata3: tag 0x0: 1 1 0 1  
    [43161.734123] ata3.00: exception Emask 0x1 SAct 0x1 SErr 0x180000 action 0x6 frozen
    [43161.734127] ata3.00: Ata error. fis:0x21
    [43161.734130] ata3: SError: { 10B8B Dispar }
    [43161.734134] ata3.00: failed command: READ FPDMA QUEUED
    [43161.734142] ata3.00: cmd 60/08:00:a8:03:00/00:00:00:00:00/40 tag 0 ncq 4096 in
    [43161.734144]          res 41/84:04:a8:03:00/84:00:00:00:00/40 Emask 0x10 (ATA bus error)
    [43161.734148] ata3.00: status: { DRDY ERR }
    [43161.734150] ata3.00: error: { ICRC ABRT }
    [43161.734155] ata3: hard resetting link
    [43161.734158] ata3: nv: skipping hardreset on occupied port
    [43162.220095] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [43162.260202] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0                   �'
    [43162.260206] ata3.00: revalidation failed (errno=-19)
    [43162.260211] ata3.00: limiting speed to UDMA/133:PIO2
    [43167.220123] ata3: hard resetting link
    [43167.220127] ata3: nv: skipping hardreset on occupied port
    [43167.710060] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [43167.750228] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0                   �'
    [43167.750232] ata3.00: revalidation failed (errno=-19)
    [43167.750236] ata3.00: disabled
    [43172.710100] ata3: hard resetting link
    [43173.620110] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [43173.640455] ata3.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
    [43178.620116] ata3: hard resetting link
    [43179.530113] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [43179.550748] ata3.00: ATA-8: WDC WD2002FAEX-007BA0, 05.01D05, max UDMA/133
    [43179.550753] ata3.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 31/32)
    [43179.570208] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0                   �'
    [43179.570213] ata3.00: revalidation failed (errno=-19)
    [43179.570220] ata3: limiting SATA link speed to 1.5 Gbps
    [43179.570224] ata3.00: limiting speed to UDMA/133:PIO3
    [43184.530066] ata3: hard resetting link
    [43184.530070] ata3: nv: skipping hardreset on occupied port
    [43185.020091] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [43185.060949] ata3.00: configured for UDMA/133
    [43185.060969] sd 2:0:0:0: [sdd]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [43185.060974] sd 2:0:0:0: [sdd]  Sense Key : Aborted Command [current] [descriptor]
    [43185.060980] Descriptor sense data with sense descriptors (in hex):
    [43185.060983]         72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
    [43185.060995]         00 00 03 a8 
    [43185.061000] sd 2:0:0:0: [sdd]  Add. Sense: Scsi parity error
    [43185.061006] sd 2:0:0:0: [sdd] CDB: Read(10): 28 00 00 00 03 a8 00 00 08 00
    [43185.061017] end_request: I/O error, dev sdd, sector 936
    [43185.061023] Buffer I/O error on device sdd, logical block 117
    [43185.061044] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061048] sd 2:0:0:0: killing request
    [43185.061062] ata3: EH complete
    [43185.061075] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061123] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061134] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061140] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061145] sd 2:0:0:0: [sdd] READ CAPACITY(16) failed
    [43185.061147] sd 2:0:0:0: [sdd]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    [43185.061152] sd 2:0:0:0: [sdd] Sense not available.
    [43185.061155] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061166] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061175] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061185] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061193] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061198] sd 2:0:0:0: [sdd] READ CAPACITY failed
    [43185.061202] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061209] sd 2:0:0:0: [sdd]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    [43185.061215] sd 2:0:0:0: [sdd] Sense not available.
    [43185.061226] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061235] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061245] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061254] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061263] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061274] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061280] sd 2:0:0:0: [sdd] Asking for cache data failed
    [43185.061283] sd 2:0:0:0: [sdd] Assuming drive cache: write through
    [43185.061289] sdd: detected capacity change from 2000398934016 to 0
    [43185.061610] ata3.00: detaching (SCSI 2:0:0:0)
    [43185.062444] sd 2:0:0:0: [sdd] Stopping disk
    [43249.120042] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    [43249.120046] ata4.00: failed command: FLUSH CACHE EXT
    [43249.120051] ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
    [43249.120052]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [43249.120054] ata4.00: status: { DRDY }
    [43249.120059] ata4: hard resetting link
    [43249.120060] ata4: nv: skipping hardreset on occupied port
    [43249.610042] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [43249.650323] ata4.00: configured for UDMA/133
    [43249.650326] ata4.00: retrying FLUSH 0xea Emask 0x4
    [43249.650452] ata4.00: device reported invalid CHS sector 0
    [43249.650458] ata4: EH complete
    
  • shanet
    shanet almost 12 years
    Thanks for the reply. Yes, my question is how can I figure out why all these drives keep failing on me. My mobo has been with me since 2008 when I built this system. I wonder if it's feeling the effects of old age. Four years isn't that old though. I do have three cold cathode lights in my case. I've never heard of those causing problems with cables though. More info on this? I have a few spare SATA cables lying around. I'll swap them out and change the SATA ports on my mobo. I can turn off the cold cathodes as well.
  • shanet
    shanet almost 12 years
    I'd really like to avoid decrypting the drives, especially if I have to RMA them in the future. Although, I could take a spare drive, put an unencrypted filesystem on it, and have a cron job write random data to it for a while each day and see what happens.
  • Everett
    Everett almost 12 years
    I worked in a shop where we installed cold cathode lights in machines (all the kewl kids did it ;) One day we set one of the egg timers we used next to one of the lights that was on. To say the timer went bat shiat insane crazy would be an understatement. We discovered the lights were throwing off huge amounts of RF. This was causing some of the problems we were seeing. Could be a faulty connection, or a poorly made product, or maybe it's just old... I can say that since then I've never put one in a computer...
  • shanet
    shanet almost 12 years
    Interesting. I'm a sucker for case lighting so I've had these guys in here since 2008 also when I built the system. I'll leave them off for a while and see what happens. Thanks for your help, I never would have thought of possible interference from the cold cathodes.
  • Everett
    Everett almost 12 years
    Note, I'm not guaranteeing it's them, just saying it's possible, might as well eliminate it. Glad to be of service.
  • shanet
    shanet almost 11 years
    Thanks for the info. I've since upgraded to a new motherboard and new SATA cables and have no problems anymore. I think the problem was a bad SATA controller on my old mobo: I had problems with some SATA ports, but not others.