hard resetting link exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen

Solution 1

According to Supermicro Support, the defect lies with the board:

Quote:

This board may need ECO 16238 update.

Solution 2

What your server is experiencing is basically a SATA renegotiation at a lower link speed after a problem communicating with the drives.

The following factors may be at work here (ordered by probability):

  1. Very high-latency I/O operations (e.g. caused by the SSD controller's garbage collection) resulting in SATA command timeouts. Do your drives support the SATA TRIM command? If so, try running fstrim /. Does it change anything?
  2. Bad motherboard/memory: is your memory ECC protected? If not, and if you can, run an extended (2+ hours) memtest86+ session.
  3. Hardware/software driver incompatibility.
  4. Bad SATA controller: while quite unlikely, you cannot completely exclude it.
  5. Bad SATA cables/drives: as all four drives are giving you problems, this is very unlikely.
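
The TRIM check from item 1 can be sketched as follows. This is a minimal sketch, assuming hdparm -I reports TRIM support on a line containing "Data Set Management TRIM supported"; the helper name supports_trim and the device /dev/sda are illustrative assumptions, not taken from the question.

```shell
#!/bin/sh
# Hypothetical helper: reads `hdparm -I` output on stdin and succeeds
# only if the drive advertises the ATA TRIM (Data Set Management) feature.
supports_trim() {
    grep -q 'Data Set Management TRIM supported'
}

# Usage (as root), assuming the SSD is /dev/sda and / lives on it:
#   hdparm -I /dev/sda | supports_trim && fstrim -v /
```

If the check succeeds, fstrim -v reports how many bytes were trimmed, which gives a rough idea of how much stale data the SSD controller had been carrying.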

Author: Dennis Nolte

Yet another Sys Admin

Updated on September 18, 2022

Comments

  • Dennis Nolte
    Dennis Nolte almost 2 years

    The situation is as follows:

    A production Debian 7 Linux server with kernel 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u2 x86_64 GNU/Linux

    Manufacturer: Supermicro Product Name: X10SLL-F Version: 1.02

    SATA controller: Intel Corporation Lynx Point 6-port SATA Controller 1 [AHCI mode] (rev 04)

    2x SSD, 2x HDD

    Each drive supports SATA Rev 3.0 (6.0 Gb/s):

    hdparm -I /dev/sd[a-d]|egrep "Model|speed|Transport"
        Model Number:       TOSHIBA THNSNH128GBST                   
        Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
           *    Gen1 signaling speed (1.5Gb/s)
           *    Gen2 signaling speed (3.0Gb/s)
           *    Gen3 signaling speed (6.0Gb/s)
           *    SMART Command Transport (SCT) feature set
        Model Number:       TOSHIBA THNSNH128GBST                   
        Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
           *    Gen1 signaling speed (1.5Gb/s)
           *    Gen2 signaling speed (3.0Gb/s)
           *    Gen3 signaling speed (6.0Gb/s)
           *    SMART Command Transport (SCT) feature set
        Model Number:       ST2000VX000-1CU164                      
        Transport:          Serial, SATA Rev 3.0
           *    Gen1 signaling speed (1.5Gb/s)
           *    Gen2 signaling speed (3.0Gb/s)
           *    Gen3 signaling speed (6.0Gb/s)
           *    SMART Command Transport (SCT) feature set
        Model Number:       ST2000VX000-1CU164                      
        Transport:          Serial, SATA Rev 3.0
           *    Gen1 signaling speed (1.5Gb/s)
           *    Gen2 signaling speed (3.0Gb/s)
           *    Gen3 signaling speed (6.0Gb/s)
           *    SMART Command Transport (SCT) feature set
    

    The kernel messages suggest (to me at least) an issue with all 4 drives, which leads me to believe that the SATA controller might be at fault.

    ata1: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
    ata1: irq_stat 0x00400040, connection status changed
    ata1: SError: { HostInt PHYRdyChg 10B8B DevExch }
    ata1: hard resetting link
    ata2: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
    ata2: irq_stat 0x00400040, connection status changed
    ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }
    ata2: hard resetting link
    ata4: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
    ata4: irq_stat 0x00400040, connection status changed
    ata4: SError: { HostInt PHYRdyChg 10B8B DevExch }
    ata4: hard resetting link
    ata3: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
    ata3: irq_stat 0x00400040, connection status changed
    ata3: SError: { HostInt PHYRdyChg 10B8B DevExch }
    ata3: hard resetting link
    ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
    ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
    ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
    ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
    ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata2.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata3.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata3.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata2.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata2.00: configured for UDMA/33
    ata2: EH complete
    ata1.00: configured for UDMA/33
    ata1: EH complete
    ata3.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata3.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata3.00: configured for UDMA/33
    ata3: EH complete
    ata4.00: configured for UDMA/33
    ata4: EH complete
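
As a cross-check on the log above, the SErr value can be decoded by hand. The following is a minimal sketch, assuming the flag names that libata uses when it prints the "SError: { ... }" line; decode_serr is a hypothetical helper, not part of any existing tool.

```shell
#!/bin/sh
# Decode a SATA SError register value into the flag names libata prints.
# Each entry is "bit:name"; bit positions follow the SError register layout.
decode_serr() {
    serr=$1
    out=""
    for entry in \
        0:RecovData 1:RecovComm 8:UnrecovData 9:Persist 10:Proto \
        11:HostInt 16:PHYRdyChg 17:PHYInt 18:CommWake 19:10B8B \
        20:Dispar 21:BadCRC 22:Handshk 23:LinkSeq 24:TrStaTrns \
        25:UnrecFIS 26:DevExch
    do
        bit=${entry%%:*}
        name=${entry#*:}
        # Append the name if this bit is set in the register value.
        if [ $(( ($serr >> $bit) & 1 )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"
}

decode_serr 0x4090800
```

For 0x4090800 this prints HostInt PHYRdyChg 10B8B DevExch, matching the SError lines in the log. PHYRdyChg and DevExch suggest that the link itself dropped and came back, rather than a single command failing.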
    

    What I have already figured out (or believe I have figured out):

    The commands SECURITY FREEZE LOCK and DEVICE CONFIGURATION OVERLAY are not relevant to the issue.

    I read about 20 bug reports and a lot of documentation (a few are linked below); some suggested disabling NCQ, which I did.

    I first disabled it for one device. After waiting a day to see whether the error would recur, it happened again, so I disabled it for all 4 devices:

    echo "1" >/sys/block/sdc/device/queue_depth
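
Applying the same setting to all four devices can be sketched as a loop. This is a minimal sketch, assuming the disks are sda through sdd; disable_ncq and the SYSFS_BLOCK override are hypothetical names introduced here so that the loop can be tried against a scratch directory first.

```shell
#!/bin/sh
# Set queue_depth to 1 (which effectively disables NCQ) for each named disk.
# SYSFS_BLOCK is a hypothetical override; it defaults to the real sysfs path.
disable_ncq() {
    root=${SYSFS_BLOCK:-/sys/block}
    for disk in "$@"; do
        f="$root/$disk/device/queue_depth"
        if [ -w "$f" ]; then
            echo 1 > "$f"
            echo "$disk: queue_depth=$(cat "$f")"
        else
            echo "$disk: $f not writable, skipped" >&2
        fi
    done
}

# Usage (as root): disable_ncq sda sdb sdc sdd
```

The setting does not survive a reboot, so it is safe to experiment with on a running system.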
    

    No obvious change in the situation.

    https://ata.wiki.kernel.org/index.php/Libata_error_messages

    https://wiki.archlinux.org/index.php/Solid_State_Drives#Resolving_NCQ_errors

    Others suggest a bad SATA cable or even an incompatibility between the board and the drives.

    However, since the issue either starts on one drive and propagates to all 4, or hits all 4 devices at once, I am unable to narrow it down further.

    As this is a production server, taking it down for maintenance (i.e. BIOS or kernel parameter changes) is possible, but I would like to avoid that if I can.

    According to the hoster, this might be power-management related:

    https://bugzilla.kernel.org/show_bug.cgi?id=74961 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1318218

    echo "medium_power" >/sys/class/scsi_host/host0/link_power_management_policy 
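
The command above changes host0 only. With AHCI, each SATA port usually appears as its own hostN entry, so it may be worth listing the policy for every host before drawing conclusions. A minimal sketch, assuming that layout; show_lpm_policy and the SCSI_HOST_ROOT override are hypothetical names used for illustration.

```shell
#!/bin/sh
# Print the link power management policy of every SCSI host that exposes one.
# SCSI_HOST_ROOT is a hypothetical override; it defaults to the real sysfs path.
show_lpm_policy() {
    root=${SCSI_HOST_ROOT:-/sys/class/scsi_host}
    for f in "$root"/host*/link_power_management_policy; do
        # Skip hosts without the attribute (and the unmatched glob itself).
        [ -r "$f" ] || continue
        host=${f%/link_power_management_policy}
        echo "${host##*/}: $(cat "$f")"
    done
}

show_lpm_policy
```

On a kernel of this vintage the accepted values are min_power, medium_power and max_performance; max_performance is the one that turns link power management off.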
    

    Before the change this was set to max_performance.

    This did not help either.

    The SMART values of the HDDs/SSDs are OK; nothing obviously wrong.

    Note that the drives are now configured for UDMA/33 only.

    At boot, these were the SATA link speeds:

    [    3.161850] ata6: SATA link down (SStatus 0 SControl 300)
    [    3.161867] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [    3.161882] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [    3.161894] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [    3.161907] ata5: SATA link down (SStatus 0 SControl 300)
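
The SStatus and SControl numbers in these lines encode the link speed in bits 4 to 7. A minimal sketch for reading that field, assuming the values are hexadecimal as libata prints them; sata_spd is a hypothetical helper name.

```shell
#!/bin/sh
# Decode the SPD field (bits 4-7) of a SATA SStatus or SControl value,
# given as hex without the 0x prefix (the form libata prints).
# In SStatus it is the negotiated speed; in SControl it is the speed limit.
sata_spd() {
    spd=$(( (0x$1 >> 4) & 0xf ))
    case $spd in
        0) echo "none" ;;       # no link (SStatus) or no limit (SControl)
        1) echo "1.5 Gbps" ;;
        2) echo "3.0 Gbps" ;;
        3) echo "6.0 Gbps" ;;
        *) echo "unknown ($spd)" ;;
    esac
}

sata_spd 133   # SStatus at boot
sata_spd 113   # SStatus after the hard reset
sata_spd 310   # SControl after the hard reset
```

Read this way, SControl went from 300 (no speed limit) at boot to 310 (limited to 1.5 Gbps) after the resets, which is consistent with the kernel having lowered the allowed link speed in response to the errors.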
    

    The problem might only occur under high load on the HDDs; I have not tested that yet, as it would obviously impact the server's performance.

    There is no load on the SSDs; they are mounted but not used by any process.

    The RAM is ECC as far as I can tell:

    dmidecode -t 17
    # dmidecode 2.11
    SMBIOS 2.7 present.
    
    Handle 0x0023, DMI type 17, 34 bytes
    Memory Device
        Array Handle: 0x0022
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMA1
        Bank Locator: P0_Node0_Channel0_Dimm0
        Type: DDR3
        Type Detail: Synchronous
        Speed: 1600 MHz
        Manufacturer: Samsung
        Serial Number: 373A6427
        Asset Tag: 9876543210
        Part Number: M391B1G73QH0-CK0  
        Rank: 2
        Configured Clock Speed: 1600 MHz
    

    Please let me know if I can provide additional information, as I am out of ideas about what to do next.

    • Dennis Nolte
      Dennis Nolte over 8 years
      I am asking the vendor, Supermicro, directly; possibly they can help if the hoster does not.
    • user
      user over 8 years
      Notice that the system is renegotiating at 1.5 Gbps. Try forcing 1.5 Gbps and see if that makes the system stable. It's a data point. Try askubuntu.com/a/146290/11751 for a short writeup on how to.
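
The suggestion to force 1.5 Gbps can be tried with the libata.force kernel parameter. A minimal sketch of the relevant line, assuming a stock Debian GRUB 2 setup where /etc/default/grub is the right place for it; any options already present on that line should be kept.

```shell
# /etc/default/grub (fragment): cap every SATA link at 1.5 Gbps.
# Assumes no other options are needed on this line; merge with existing ones.
GRUB_CMDLINE_LINUX="libata.force=1.5Gbps"
```

The change only takes effect after running update-grub and rebooting, so it does require a maintenance window.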
  • Dennis Nolte
    Dennis Nolte over 8 years
    The SSD(s) are currently not in use, and it seems ECC is used. From dmidecode -t 17: Total Width: 72 bits, Data Width: 64 bits.
  • EM0
    EM0 over 2 years
    What is an "ECO 16238 update"? Is this some kind of a firmware update or does it mean replacing the board? I'm running into the same problem now with a Supermicro X10SRi-F machine.
  • Dennis Nolte
    Dennis Nolte over 2 years
    Supermicro support told me back then that the board should be replaced and that I should contact my provider to get it changed.
  • Dennis Nolte
    Dennis Nolte over 2 years
    As for what the ECO 16238 update is: no idea, sorry.