hard resetting link exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
Solution 1
According to Supermicro Support, the defect lies with the board:
Quote:
This board may need ECO 16238 update.
Solution 2
What your server is experiencing is basically a SATA renegotiation at a lower link speed after some problem communicating with the drives.
These factors can be at work here (ordered by probability):
- very high-latency I/O operations (e.g. caused by the SSD controller's garbage collection) resulting in SATA command timeouts. Do your drives support the SATA TRIM command? If so, try running
fstrim /
Does it change anything?
- Bad motherboard/memory: is your memory ECC protected? If not, and if you can, run an extended (2+ hours) memtest86+ test session
- hardware/driver incompatibility
- Bad SATA controller: while quite unlikely, you cannot completely exclude it
- Bad SATA cables/drives: as all four drives give you problems, this is very unlikely
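To check the TRIM question from the first bullet before running fstrim, something like the following might help. It is only a sketch: `has_trim` is a made-up helper name, and it assumes the drive advertises TRIM with the usual "Data Set Management TRIM supported" line in the `hdparm -I` output.

```shell
# Read an `hdparm -I` identify dump on stdin and report whether it advertises TRIM.
# has_trim is a hypothetical helper name, not part of hdparm.
has_trim() {
    if grep -qi 'TRIM supported'; then
        echo "TRIM supported"
    else
        echo "TRIM not reported"
        return 1
    fi
}
```

Possible usage (as root): `hdparm -I /dev/sda | has_trim && fstrim -v /`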
Comments
-
Dennis Nolte almost 2 years
The situation is as follows:
A production Debian 7 Linux server with kernel
3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u2 x86_64 GNU/Linux
Manufacturer: Supermicro
Product Name: X10SLL-F
Version: 1.02
SATA controller:
Intel Corporation Lynx Point 6-port SATA Controller 1 [AHCI mode] (rev 04)
2x SSD, 2x HDD
Each drive supports SATA Rev 3 (6.0 Gb/s):
hdparm -I /dev/sd[a-d]|egrep "Model|speed|Transport"
Model Number: TOSHIBA THNSNH128GBST
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Gen3 signaling speed (6.0Gb/s)
* SMART Command Transport (SCT) feature set
Model Number: TOSHIBA THNSNH128GBST
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Gen3 signaling speed (6.0Gb/s)
* SMART Command Transport (SCT) feature set
Model Number: ST2000VX000-1CU164
Transport: Serial, SATA Rev 3.0
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Gen3 signaling speed (6.0Gb/s)
* SMART Command Transport (SCT) feature set
Model Number: ST2000VX000-1CU164
Transport: Serial, SATA Rev 3.0
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Gen3 signaling speed (6.0Gb/s)
* SMART Command Transport (SCT) feature set
The kernel messages suggest (to me at least) an issue with all 4 drives, which leads me to believe the SATA controller might be at fault.
ata1: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata1: irq_stat 0x00400040, connection status changed
ata1: SError: { HostInt PHYRdyChg 10B8B DevExch }
ata1: hard resetting link
ata2: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata2: irq_stat 0x00400040, connection status changed
ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }
ata2: hard resetting link
ata4: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata4: irq_stat 0x00400040, connection status changed
ata4: SError: { HostInt PHYRdyChg 10B8B DevExch }
ata4: hard resetting link
ata3: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata3: irq_stat 0x00400040, connection status changed
ata3: SError: { HostInt PHYRdyChg 10B8B DevExch }
ata3: hard resetting link
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata2.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata3.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata3.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata2.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata2.00: configured for UDMA/33
ata2: EH complete
ata1.00: configured for UDMA/33
ata1: EH complete
ata3.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata3.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata3.00: configured for UDMA/33
ata3: EH complete
ata4.00: configured for UDMA/33
ata4: EH complete
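For reference, the SErr value in these messages is a bitmask, and the names in the "SError: { ... }" line are just its set bits spelled out. A minimal sketch of that decoding, assuming the bit positions libata uses for the SATA SError register (`decode_serr` is a made-up helper name):

```shell
# Decode a libata SErr value (e.g. 0x4090800) into the flag names the kernel prints.
# decode_serr is a hypothetical helper; the bit:name pairs follow libata's SError layout.
decode_serr() (
    serr=$(( $1 ))
    out=""
    for pair in 0:RecovData 1:RecovComm 8:UnrecovData 9:Persist 10:Proto \
                11:HostInt 16:PHYRdyChg 17:PHYInt 18:CommWake 19:10B8B \
                20:Dispar 21:BadCRC 22:Handshk 23:LinkSeq 24:TrStaTrns \
                25:UnrecFIS 26:DevExch; do
        bit=${pair%%:*}     # part before the colon: bit position
        name=${pair#*:}     # part after the colon: flag name
        if [ $(( (serr >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out }$name"
        fi
    done
    printf '%s\n' "$out"
)
```

`decode_serr 0x4090800` gives "HostInt PHYRdyChg 10B8B DevExch", matching the log. The PHYRdyChg and DevExch bits indicate that the PHY lost and regained its link on every port at the same moment.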
What I have already figured out (or believe I have figured out):
The commands SECURITY FREEZE LOCK and DEVICE CONFIGURATION OVERLAY are not important to the issue.
While reading about 20 bug reports and a lot of documentation (a few are linked below), I found that some suggest disabling NCQ, which I did.
I first disabled it for one device. After waiting 1 day to check whether the error repeats, it happened again, so I disabled it for all 4 devices:
echo "1" >/sys/block/sdc/device/queue_depth
No obvious change in the situation.
https://ata.wiki.kernel.org/index.php/Libata_error_messages
https://wiki.archlinux.org/index.php/Solid_State_Drives#Resolving_NCQ_errors
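A sketch of how the queue depth change could be applied to several disks in one go, since the echo above only covers sdc. `disable_ncq` and the `SYSFS_ROOT` override are assumptions added here for illustration (the override only exists so the function can be tried against a scratch directory). The setting does not survive a reboot.

```shell
# Set queue_depth to 1 (which effectively disables NCQ) for each named disk.
# disable_ncq and SYSFS_ROOT are hypothetical names; SYSFS_ROOT defaults to /sys.
disable_ncq() (
    root=${SYSFS_ROOT:-/sys}
    for dev in "$@"; do
        f="$root/block/$dev/device/queue_depth"
        if [ -w "$f" ]; then
            echo 1 > "$f"
            # Read the value back so the result is visible per device
            printf '%s: queue_depth=%s\n' "$dev" "$(cat "$f")"
        else
            printf '%s: cannot write %s\n' "$dev" "$f" >&2
        fi
    done
)
```

Possible usage (as root): `disable_ncq sda sdb sdc sdd`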
Others suggest a bad SATA cable or even an incompatibility between board and drives.
However, as the issue seems to either start on one drive and propagate to all 4, or hit all 4 devices directly, I am unable to pinpoint it further.
As this is a production server, taking it down for maintenance (i.e. BIOS or kernel parameter changes) is possible, but I would like to avoid that if I can.
According to the hoster this might be power management related:
https://bugzilla.kernel.org/show_bug.cgi?id=74961
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1318218
echo "medium_power" >/sys/class/scsi_host/host0/link_power_management_policy
Before the change this was set to max_performance. This did not help either.
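One thing that may be worth checking: libata normally registers one SCSI host per SATA port, so writing to host0 probably changes the policy for a single port only. The sketch below lists the policy of every host. `show_lpm_policy` and the `SYSFS_ROOT` override are made-up names for illustration.

```shell
# Print the link power management policy of every SCSI host that exposes one.
# show_lpm_policy and SYSFS_ROOT are hypothetical names; SYSFS_ROOT defaults to /sys.
show_lpm_policy() (
    root=${SYSFS_ROOT:-/sys}
    for f in "$root"/class/scsi_host/host*/link_power_management_policy; do
        [ -r "$f" ] || continue          # skip hosts without the attribute
        host=${f%/link_power_management_policy}
        printf '%s: %s\n' "${host##*/}" "$(cat "$f")"
    done
)
```

If the output shows different values per host, the earlier change most likely did not reach all four drives.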
SMART values of the HDDs/SSDs are OK, nothing too obvious.
Note that the UDMA mode now seems to be only 33.
On boot of the server, these were the SATA link speed values:
[ 3.161850] ata6: SATA link down (SStatus 0 SControl 300)
[ 3.161867] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 3.161882] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 3.161894] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 3.161907] ata5: SATA link down (SStatus 0 SControl 300)
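The SStatus and SControl numbers in these lines are printed in hex and pack three 4-bit fields each (DET, SPD, IPM). A minimal sketch of decoding them, assuming the standard SATA register layout (`decode_sstatus` and `scontrol_speed_limit` are made-up helper names):

```shell
# Decode SStatus as printed by libata (hex without 0x prefix, e.g. "113").
# decode_sstatus and scontrol_speed_limit are hypothetical helper names.
decode_sstatus() (
    val=$(( 0x$1 ))
    det=$(( val & 0xf ))            # device detection state
    spd=$(( (val >> 4) & 0xf ))     # negotiated link speed generation
    ipm=$(( (val >> 8) & 0xf ))     # interface power management state
    case $spd in
        0) speed="no link" ;;
        1) speed="1.5 Gbps" ;;
        2) speed="3.0 Gbps" ;;
        3) speed="6.0 Gbps" ;;
        *) speed="unknown" ;;
    esac
    printf 'DET=%d SPD=%d (%s) IPM=%d\n' "$det" "$spd" "$speed" "$ipm"
)

# Report the speed limit encoded in the SPD field of SControl (e.g. "310").
scontrol_speed_limit() (
    spd=$(( (0x$1 >> 4) & 0xf ))
    case $spd in
        0) echo "no limit" ;;
        1) echo "limited to 1.5 Gbps" ;;
        2) echo "limited to 3.0 Gbps" ;;
        3) echo "limited to 6.0 Gbps" ;;
        *) echo "unknown" ;;
    esac
)
```

At boot, SControl 300 means no speed limit was set. The "SControl 310" in the error log means the link was explicitly limited to 1.5 Gbps after the hard reset, which fits the drop to UDMA/33.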
The situation might only happen under high load on the HDDs; I have not tested that yet, as it would obviously impact server performance.
There is no load on the SSDs; they are mounted but not used by any of the processes.
The RAM is ECC as far as I can tell.
dmidecode -t 17
# dmidecode 2.11
SMBIOS 2.7 present.
Handle 0x0023, DMI type 17, 34 bytes
Memory Device
  Array Handle: 0x0022
  Error Information Handle: Not Provided
  Total Width: 72 bits
  Data Width: 64 bits
  Size: 8192 MB
  Form Factor: DIMM
  Set: None
  Locator: P1-DIMMA1
  Bank Locator: P0_Node0_Channel0_Dimm0
  Type: DDR3
  Type Detail: Synchronous
  Speed: 1600 MHz
  Manufacturer: Samsung
  Serial Number: 373A6427
  Asset Tag: 9876543210
  Part Number: M391B1G73QH0-CK0
  Rank: 2
  Configured Clock Speed: 1600 MHz
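The ECC conclusion comes from comparing the two width fields: a module whose Total Width is larger than its Data Width carries extra bits for error correction. A rough sketch of that check (`ecc_from_dmidecode` is a made-up helper name, and it assumes Total Width appears before Data Width, as in the output above):

```shell
# Read `dmidecode -t 17` output on stdin and classify each memory device
# by comparing Total Width with Data Width.
# ecc_from_dmidecode is a hypothetical helper name.
ecc_from_dmidecode() {
    awk -F': *' '
        /Total Width:/ { total = $2 + 0 }   # "72 bits" converts to 72
        /Data Width:/  {
            data = $2 + 0
            if (total > data) print "ECC (" total "/" data " bits)"
            else              print "non-ECC (" total "/" data " bits)"
        }
    '
}
```

Possible usage (as root): `dmidecode -t 17 | ecc_from_dmidecode`. Note that this only shows what the module provides, not whether the board and BIOS actually have ECC enabled.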
Please let me know if I can provide additional information, as I am out of ideas about what to do next.
-
Dennis Nolte over 8 years
Asking the vendor Supermicro directly; possibly they can help if the hoster does not.
-
user over 8 years
Notice that the system is renegotiating at 1.5 Gbps. Try forcing 1.5 Gbps and see if that makes the system stable. It's a data point. See askubuntu.com/a/146290/11751 for a short writeup on how to do that.
-
Dennis Nolte over 8 years
The SSD(s) are currently not in use, and it seems ECC is used. From dmidecode -t 17: Total Width: 72 bits, Data Width: 64 bits
-
EM0 over 2 years
What is an "ECO 16238 update"? Is this some kind of a firmware update or does it mean replacing the board? I'm running into the same problem now with a Supermicro X10SRi-F machine.
-
Dennis Nolte over 2 years
Supermicro support answered me back then that the board should be replaced and that I should contact my provider to get it changed.
-
Dennis Nolte over 2 years
As for what the ECO 16238 update is: no idea, sorry.