Given a kernel ATA exception, how to determine which physical disk is affected?

24,496

Solution 1

I wrote a one-liner based on Tobi Hahn's answer.

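For example, to find out which device corresponds to ata3:

ata=3; ls -l /sys/block/sd* | grep "/$(grep -lx "$ata" /sys/class/scsi_host/host*/unique_id | awk -F'/' '{print $5}')/"

The -x makes grep match the whole line, so ata=3 does not also pick up a host whose unique_id is 13 or 30. It will produce something like this: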

lrwxrwxrwx 1 root root 0 Jan 15 15:30 /sys/block/sde -> ../devices/pci0000:00/0000:00:1f.5/host2/target2:0:0/2:0:0:0/block/sde
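
The same lookup can be written as a small function that avoids parsing ls output. This is only a sketch: the name ata_to_block and its optional second argument (an alternate sysfs root, useful for trying it against a test directory) are made up for illustration, and it relies on the same assumption as the one-liner above, namely that unique_id under /sys/class/scsi_host holds the ATA port number on your kernel.

```shell
#!/bin/sh
# Sketch: print the sd* device(s) behind a given ATA port number.
# ata_to_block and the optional sysfs-root argument are illustrative
# names, not part of any existing tool.
ata_to_block() {
    ata=$1
    sysfs=${2:-/sys}
    for id_file in "$sysfs"/class/scsi_host/host*/unique_id; do
        [ -r "$id_file" ] || continue
        # Compare the whole value so that 3 does not also match 13 or 30.
        [ "$(cat "$id_file")" = "$ata" ] || continue
        host=$(basename "$(dirname "$id_file")")
        for dev in "$sysfs"/block/sd*; do
            [ -e "$dev" ] || [ -L "$dev" ] || continue
            # Each /sys/block entry is a symlink; its target names the SCSI host.
            case "$(readlink "$dev")" in
                */"$host"/*) basename "$dev" ;;
            esac
        done
    done
}

# Usage (on a real system): ata_to_block 3
```

Calling ata_to_block 3 prints the matching device name (such as sde), or nothing if no host has that unique_id.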

Solution 2

Use this command:

ls -l /sys/block/sd* | sed 's/.*\(sd.*\) -.*\(ata.*\)\/h.*/\2 => \1/'

On my system this produces the output:

ata1 => sda
ata2 => sdb
ata3 => sdc
ata4 => sdd
ata7 => sde
ata8 => sdf

This works even if all disks are the same model (those six disks are of only two different models). Note that it depends on sysfs naming; it works on my kernel, 3.10.17. I know that at some point in the past the mappings weren't this clean to retrieve, but I'm not sure which is the earliest kernel version this will work on.

If it doesn't work for you, see this link for a more roundabout way of determining the mappings: http://www.miriup.de/index.php?option=com_content&view=article&id=84:mapping-linux-kernel-ata-errors-to-a-device&catid=8:linux&Itemid=25
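
As a variation on the ls | sed approach, the symlink targets can be read with readlink, which sidesteps locale-dependent ls formatting. The sketch below is not the original author's method; ata_map and its optional sysfs-root argument are hypothetical names. It prints "?" for devices whose sysfs path has no ataN component (older kernels, USB or SAS disks).

```shell
#!/bin/sh
# Sketch: print "ataN => sdX" for every sd* block device.
# ata_map and the optional sysfs-root argument are illustrative names.
ata_map() {
    sysfs=${1:-/sys}
    for dev in "$sysfs"/block/sd*; do
        [ -e "$dev" ] || [ -L "$dev" ] || continue
        target=$(readlink "$dev")
        # Extract a path component of the form ataN, if the kernel exposes one.
        port=$(printf '%s\n' "$target" | sed -n 's|.*/\(ata[0-9][0-9]*\)/.*|\1|p')
        echo "${port:-?} => $(basename "$dev")"
    done
}

# Usage (on a real system): ata_map
```

If every line comes out as "? => sdX", the kernel's sysfs paths do not contain the ataN component, and one of the other approaches is needed.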

Solution 3

Turns out doing the mapping was easier than I realized.

dmesg | grep ata2 | head gives the kernel's mapping of the drive during the boot process. Or you could grep for ata2.00 right away.

[    2.448300] ata2: SATA max UDMA/133 abar m1024@0xfeb0b000 port 0xfeb0b180 irq 19
[    2.940139] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    2.942143] ata2.00: ATA-8: ST31000340NS, SN05, max UDMA/133
[    2.942149] ata2.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 31/32)
[    2.944573] ata2.00: configured for UDMA/133
  (and some stuff I'd rather never have to see about drive errors)

As you can see, one of those lines contains my drive model number (ST31000340NS) which I can then use to map to a /dev file:

$ readlink /dev/disk/by-id/*ST31000340NS* | head -n1
../../sda
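
When several drives share a model, the by-id names can help tell them apart, since udev normally builds them from model and serial number. The sketch below lists each whole-disk entry next to the device it points to. The function name disk_ids, its optional directory argument, and the assumption that ATA disks show up with an ata- prefix are illustrative; check what your system's /dev/disk/by-id actually contains.

```shell
#!/bin/sh
# Sketch: list "sdX by-id-name" pairs for whole disks (partitions skipped).
# disk_ids and the optional directory argument are illustrative names;
# the "ata-" prefix is an assumption about udev's naming on your system.
disk_ids() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/ata-*; do
        [ -L "$link" ] || continue
        # Skip partition entries such as ...-part1.
        case "$link" in
            *-part[0-9]*) continue ;;
        esac
        echo "$(basename "$(readlink "$link")") $(basename "$link")"
    done
}

# Usage (on a real system): disk_ids
```

This only ties model and serial to a /dev name; the dmesg lines above show the model but not the serial, so with identical models a sysfs-based mapping is still needed to reach the ataN port.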

Solution 4

Here's a one-liner to map all sd* devices:

LC_ALL=C ls -l /sys/block/sd* | perl -npe 's#^.*?block/(sd[^/ ]+).*?/pci0000:00/0000:([^/]+/(?:ata[0-9]+|usb[0-9]+/[^/]+/[^/]+|[0-9:.]+/[^/]+/[^/]+)).*#$1 = $2#'

For me, the output looks like

sda = 00:01.0/0000:01:00.0/host0/port-0:0
sdb = 00:01.0/0000:01:00.0/host0/port-0:1
sdc = 00:01.0/0000:01:00.0/host0/port-0:2
sdd = 00:01.0/0000:01:00.0/host0/port-0:3
sde = 00:1d.0/usb2/2-1/2-1.5
sdf = 00:1f.2/ata3
sdg = 00:1f.2/ata4
sdh = 00:1f.2/ata6

The output should be pretty readable, and PCI Express and USB devices get something sane, too. You can then use the start of the device path to figure out the actual hardware connection. For example, in the output above, 01:00.0 is an Intel SSD 910 PCI Express card with four 200 GB SSD sub-devices. The lspci -nn | grep -F 01:00.0 output for the same hardware is

01:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)

So the kernel thinks that sda...sdd are attached to an LSI Logic PCI-Express SAS-2 controller. Sadly, there does not seem to be any easy way to "know" that this device is really the Intel SSD 910 PCI Express card.

Author by

user

Updated on September 18, 2022

Comments

  • user
    user almost 2 years

    I woke up this morning to a notification email with some rather disturbing system log entries.

    Dec  2 04:27:01 yeono kernel: [459438.816058] ata2.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x6 frozen
    Dec  2 04:27:01 yeono kernel: [459438.816071] ata2.00: failed command: WRITE FPDMA QUEUED
    Dec  2 04:27:01 yeono kernel: [459438.816085] ata2.00: cmd 61/08:00:70:0d:ca/00:00:08:00:00/40 tag 0 ncq 4096 out
    Dec  2 04:27:01 yeono kernel: [459438.816088]          res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
    Dec  2 04:27:01 yeono kernel: [459438.816095] ata2.00: status: { DRDY }
      (the above five lines were repeated a few times at a short interval)
    Dec  2 04:27:01 yeono kernel: [459438.816181] ata2: hard resetting link
    Dec  2 04:27:02 yeono kernel: [459439.920055] ata2: SATA link down (SStatus 0 SControl 300)
    Dec  2 04:27:02 yeono kernel: [459439.932977] ata2: hard resetting link
    Dec  2 04:27:09 yeono kernel: [459446.100050] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
    Dec  2 04:27:09 yeono kernel: [459446.314509] ata2.00: configured for UDMA/133
    Dec  2 04:27:09 yeono kernel: [459446.328037] ata2.00: device reported invalid CHS sector 0
      ("reported invalid CHS sector 0" repeated a few times at a short interval)
    

    I make full nightly backups of my entire system to an external (USB-connected) drive, and the above happened right in the middle of that backup run. (The backup starts at 04:00 through cron, and tonight's run logged completion just before 04:56.) The backup process itself claims to have completed without any errors.

    There are two internally connected SATA drives and two externally (USB) connected drives on my system; one of the external drives is currently dormant. I don't recall off the top of my head which physical SATA ports are used for which of the internal drives.

    When googling, I found the AskUbuntu question "Is this drive failure or something else?", which indicates that a very similar error occurred after 8-10 GB had been copied to a drive, but the actual failure mode was different, as the drive switched to a read-only state. The only real similarity is that I did add on the order of 7-8 GB of data to my main storage last night, which would have been backed up around the time that the error occurred.

    smartd is not reporting anything out of the ordinary on either of the internal drives. Unfortunately smartctl doesn't speak the language of the external backup drive's USB bridge, and simply complains about Unknown USB bridge [0x0bc2:0x3320 (0x100)]. Googling for that specific error was distinctly unhelpful.

    My main data storage as well as the backup is on ZFS and zpool status reports 0 errors and no known data errors. Nevertheless I have initiated a full scrub on both the internal and external drives. It is currently slated to complete in about six hours for the internal drive (main storage pool) and 13-14 hours for the backup drive.

    It seems that the next step should be to determine which drive was having trouble, and possibly replace it. The ata2.00 part probably tells me which drive was having problems, but how do I map that identifier to a physical drive?

    • user
      user over 8 years
      @don_crissti Mine came later, and the two do indeed seem to ask about exactly the same thing, so I'd argue that mine is the duplicate. Good find.
  • terdon
    terdon over 10 years
    I edited to add the way I came up with to use the output you show to get a /dev name. Is that what you had in mind? Do you have a better way? What if you have multiple drives of the same model?
  • casey
    casey over 10 years
    @terdon Take a look at my answer, which works with drives of the same model by examining the symlinks in /sys/block.
  • user
    user over 10 years
    @terdon Actually in my case I have only a single physical drive of the model in question, so I stopped there. You do make a good point, though. The very best would actually be if there was a way to get the drive serial number as well as the model directly from the kernel, to map against the ATA port identifier.
  • user
    user over 10 years
    Unfortunately, for me, your ls -l | sed gives me the plain, unedited ls -l output (confirmed via diff), and I don't seem to be getting much useful out of adapting the commands at the link you gave either. I'm running Debian's Linux 3.2.46 at the moment (downgraded because I thought the errors were caused by a kernel problem, which looks to have turned out not to be the case).
  • casey
    casey over 10 years
    @MichaelKjörling can you give me an example of what /sys/block/sda points at for your kernel?
  • user
    user over 10 years
    @casey Certainly. ls -l /sys/block/sda outputs: lrwxrwxrwx 1 root root 0 Dec 4 16:37 /sys/block/sda -> ../devices/pci0000:00/0000:00:11.0/host1/target1:0:0/1:0:0:0/block/sda/
  • Mikko Rantalainen
    Mikko Rantalainen over 3 years
    Seems to work with kernel 5.4, too. Note that since you're parsing the output of ls, you should start the ls command with LC_ALL=C ls ... to keep the parsing from failing due to unexpected locale formatting.