Can the hard drive timeout be disabled in Linux (attempting task abort)


I have found a timeout that appears to default to 30 seconds on most systems. I'm not completely sure that this is the relevant timeout, but I've increased it on some VMs and put the system under significant load, and I've not had any HDD timeouts in those VMs so far.

Also, some of the comments express confusion about which HDD I've configured in the VM, so I have added that information to the question. I have several Linux VMs running at the same time, and the errors do not appear in just a single VM.

Timeout setting (e.g., in /etc/rc.local):

Linux:

TIMEOUT=86400   # seconds (24 hours)
for f in /sys/block/sd?/device/timeout; do
    [ -w "$f" ] || continue   # skip an unmatched glob or a non-writable file
    echo "$TIMEOUT" >"$f"
done

If this pattern (sd?) does not match your hardware, search for timeouts and check them manually:

find /sys/ -name timeout
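
To check the values rather than just locate the files, the per-device timeouts can be printed next to their paths. This is a minimal sketch: `show_timeouts` is a hypothetical helper name, and its optional argument (a sysfs root other than /sys) is an assumption added only so the function can be tried against a scratch directory.

```shell
# Print every block device's command timeout (in seconds) with its path.
# show_timeouts is a hypothetical helper, not part of any standard tool.
show_timeouts() {
    root="${1:-/sys}"   # sysfs root; defaults to the real /sys
    for f in "$root"/block/*/device/timeout; do
        [ -r "$f" ] || continue   # unmatched glob or unreadable file
        printf '%s %s\n' "$f" "$(cat "$f")"
    done
}

show_timeouts
```

Running it before and after the loop above should show whether the new value was actually applied to each disk.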

Debian GNU/kFreeBSD (9.0-2-amd64):

sysctl kern.cam.da.default_timeout=86400

(I've significantly increased the timeout rather than disabling it; if it turns out to be the culprit, a more appropriate value can be chosen later.)

Again, I've not confirmed that this is exactly the timeout my VMs are running into (or that it is the only one). But I've put the system under high load, the kind of load that used to trigger HDD timeouts, and no HDD timeout has occurred yet (although network timeouts have, as before), so it seems that this is at least part of the solution.
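
One way to check for new HDD timeouts after a load test is to count the relevant kernel messages. The following is a rough sketch, not a complete detector: `count_aborts` is a hypothetical helper name, and the two patterns are simply taken from the dmesg excerpts in the question, so other drivers may log different text.

```shell
# Count abort/timeout events in kernel log text read from stdin.
# count_aborts is a hypothetical helper; the patterns come from the
# mptscsih and libata messages quoted in the question.
count_aborts() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at 0 when nothing matches (grep itself would return 1).
    grep -c -E 'attempting task abort|Emask 0x4 \(timeout\)' || true
}
```

Usage would be something like `dmesg | count_aborts`, run before and after the load test and compared; on some systems reading the kernel log requires root.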

Author: basic6

Updated on September 18, 2022

Comments

  • basic6
    basic6 almost 2 years

    Unfortunately, when a hard drive (usually a virtual drive) is slow, Linux aborts requests to that drive after a timeout, possibly causing data corruption.

    Last time this happened to me, I had 2 VMs running (Linux and FreeBSD) on a storage system that had connectivity issues and was frozen for over an hour. The storage itself is fine, no errors there, and after fixing the connection, the VMs (which obviously were frozen as well) seemed to be working again.

    However, the Linux VM had decided to abort requests, rendering that system unusable (ls got stuck on most directories, as did mount without options, and many other things no longer worked); a reboot was necessary. These are the errors (dmesg):

    ...
    [86707.916728] Write(10): 2a 00 02 4c 9e 38 00 03 c0 00
    [86707.916732] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880036865500)
    [86707.916734] mptscsih: ioc0: attempting task abort! (sc=ffff880036866100)
    [86707.916735] sd 2:0:0:0: [sda] CDB: 
    [86707.916736] Write(10): 2a 00 02 4c a1 f8 00 03 c0 00
    [86707.916739] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880036866100)
    [86707.916741] mptscsih: ioc0: attempting task abort! (sc=ffff880036865c80)
    [86707.916742] sd 2:0:0:0: [sda] CDB: 
    [86707.916743] Write(10): 2a 00 02 4c a5 b8 00 03 c0 00
    [86707.916746] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880036865c80)
    [86707.916748] mptscsih: ioc0: attempting task abort! (sc=ffff880036864300)
    [86707.916749] sd 2:0:0:0: [sda] CDB: 
    [86707.916750] Write(10): 2a 00 02 4c a9 78 00 02 b0 00
    [86707.916753] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880036864300)
    

    It's interesting that the FreeBSD vm has no errors logged and is working fine. So apparently, only FreeBSD worked as expected, not aborting anything (although I think I've seen similar kernel messages on FreeBSD systems).

    I don't know why the kernel is killing pending write requests after a timeout. It probably makes sense in some cases, but it certainly does not in my case; it's actually an unnecessary risk. Without that timeout, the Linux VM would have continued normally after the connection had been restored, and everything would have worked again.

    How can the Linux kernel timeout (vm) for frozen hard drives be DISABLED?


    Edit:

    The Linux vm has 1 hard drive (/dev/sda) only, which should look like a regular (SCSI type of) physical drive to it.
    lspci lists this controller: "SCSI storage controller [0100]: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI [1000:0030] (rev 01)".

    Here's another example (different vm, same host, also Linux) (in this case, the storage wasn't gone, but the host was under heavy load):

    [1179039.664031] ata2: lost interrupt (Status 0x18)
    [1179039.727188] ata2: drained 8 bytes to clear DRQ
    [1179039.727272] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    [1179039.740720] sr 1:0:0:0: CDB:
    [1179039.740759] Get event status notification: 4a 01 00 00 10 00 00 00 08 00
    [1179039.740768] ata2.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
             res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
    [1179039.740770] ata2.00: status: { DRDY }
    [1179039.748067] ata2: soft resetting link
    [1179039.937757] ata2.00: configured for UDMA/33
    [1179039.943435] ata2: EH complete
    

    Edit:

    And this is what the timeout errors look like on a Debian GNU/kFreeBSD (FreeBSD kernel) system (same host, same situation, different VM):

    mpt0: request 0xffffff80007305d0:62955 timed out for ccb 0xfffffe000a3bb800 (req->ccb 0xfffffe000a3bb800)
    mpt0: request 0xffffff800072fa90:62956 timed out for ccb 0xfffffe000a3d1000 (req->ccb 0xfffffe000a3d1000)
    mpt0: request 0xffffff8000726070:62962 timed out for ccb 0xfffffe000a428000 (req->ccb 0xfffffe000a428000)
    mpt0: attempting to abort req 0xffffff80007305d0:62955 function 0
    mpt0: completing timedout/aborted req 0xffffff8000726070:62962
    mpt0: completing timedout/aborted req 0xffffff80007305d0:62955
    mpt0: completing timedout/aborted req 0xffffff800072fa90:62956
    mpt0: abort of req 0xffffff80007305d0:0 completed
    mpt0: request 0xffffff8000726190:64136 timed out for ccb 0xfffffe000a3d1800 (req->ccb 0xfffffe000a3d1800)
    mpt0: attempting to abort req 0xffffff8000726190:64136 function 0
    mpt0: completing timedout/aborted req 0xffffff8000726190:64136
    mpt0: abort of req 0xffffff8000726190:0 completed
    mpt0: request 0xffffff8000721990:50970 timed out for ccb 0xfffffe00024bf800 (req->ccb 0xfffffe00024bf800)
    mpt0: attempting to abort req 0xffffff8000721990:50970 function 0
    mpt0: completing timedout/aborted req 0xffffff8000721990:50970
    mpt0: abort of req 0xffffff8000721990:0 completed
    mpt0: request 0xffffff80007279c0:61393 timed out for ccb 0xfffffe000a3cf000 (req->ccb 0xfffffe000a3cf000)
    mpt0: request 0xffffff8000732550:61395 timed out for ccb 0xfffffe000a428000 (req->ccb 0xfffffe000a428000)
    mpt0: attempting to abort req 0xffffff80007279c0:61393 function 0
    mpt0: completing timedout/aborted req 0xffffff80007279c0:61393
    mpt0: completing timedout/aborted req 0xffffff8000732550:61395
    mpt0: abort of req 0xffffff80007279c0:0 completed
    
    • psusi
      psusi over 9 years
      Those error messages do not look like what you describe. Normally a vm has an emulated virtio or ide disk, but those messages appear to be coming from the mpt scsi controller driver.
    • psusi
      psusi over 9 years
      Don't have it emulate a scsi disk?
    • Bratchley
      Bratchley over 9 years
      I'm also doubtful that your primary hdd is showing up as MPT. What are you looking at that makes you think this is the only storage device the VM sees? Also, what kind of virtualization are you running on?
    • basic6
      basic6 over 9 years
      This is VMware Workstation and I have configured 1 SCSI hard drive (so I'm looking at the vm's settings). And the kernel errors mention sda, which is exactly this hdd.
    • basic6
      basic6 over 9 years
      And / is mounted from /dev/sda3 (ext4). There is no additional disk layer (like RAID) in the vm.