ZFS pool reports a missing device, but it is not missing


Solution 1

Just run zpool clear solaris, then post the result of zpool status -v.

It would be nice to know the hardware involved and what controller you're using.


Looking at your blkid output, you have remnants of a previous Linux software RAID. You'll need to mdadm --zero-superblock /dev/sdb1 to clear that.
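As a rough sketch of how to spot such remnants, the helper below filters blkid-style output for partitions tagged linux_raid_member. The function name find_raid_members is made up for illustration, and the mdadm commands are left as comments because zeroing a superblock is destructive.

```shell
#!/bin/sh
# Print the device name of every blkid line tagged as a Linux software RAID
# member. Reads blkid-style text on stdin, so it can be tried on saved output.
# (find_raid_members is a hypothetical helper name, not part of any tool.)
find_raid_members() {
    grep 'TYPE="linux_raid_member"' | cut -d: -f1
}

# Typical use as root, shown as comments because the last step is destructive:
#   blkid | find_raid_members
#   mdadm --examine /dev/sdb1          # inspect the stale md superblock first
#   mdadm --zero-superblock /dev/sdb1  # erase it; only if no md array uses it
```

Run against the blkid output in the question, this should single out /dev/sdb1, the one partition not tagged zfs_member.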

Solution 2

After searching the internet, Server Fault, and Stack Overflow for over a day and finding nothing, I asked this question, and the answer showed up in the related questions on the right side. So I found the answer in this question:

Upgraded Ubuntu, all drives in one zpool marked unavailable

For some reason, mdadm runs at startup and starts md0. Even though md0 does not contain any disks (as the errors show), it does cause this error.

So a simple

mdadm --stop /dev/md0

Did the trick, and now my disks are resilvering. Case closed.
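To check whether an md array has been assembled before stopping it, something like the following may help. It is only a sketch: list_md_arrays is a hypothetical helper that pulls array names out of /proc/mdstat-style text, and it assumes the usual "mdN : ..." line layout.

```shell
#!/bin/sh
# List md arrays named in /proc/mdstat-style text read from stdin.
# Assumes the usual "md0 : active ..." / "md0 : inactive ..." line layout.
# (list_md_arrays is a hypothetical helper name.)
list_md_arrays() {
    awk '/^md[0-9]+ :/ { print $1 }'
}

# Typical use as root:
#   list_md_arrays < /proc/mdstat   # see what mdadm assembled at boot
#   mdadm --stop /dev/md0           # release the member disk
#   zpool status -v                 # confirm ZFS can see the device again
```

Stopping the array only lasts until the next boot. As the question's update notes, the issue came back after a reboot until the stale superblock itself was cleared.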

Solution 3

I know this is a five-year-old question, and your immediate problem was solved. But this is one of the few specific search results that come up in a web search about missing ZFS devices (at least with the keywords I used), and it might help others to know this:

This specific problem of devices going "missing" is a known problem with ZFS on Linux. (Specifically on Linux.) The problem, I believe, is two-fold, and although the ZOL team could themselves fix it (probably with a lot of work), it's not entirely a ZOL problem:

  1. While no OS has a perfectly stable way of referring to devices, for this specific use case, Linux is a little worse than, say, Illumos, BSD, or Solaris. Sure, we have device IDs, GUIDs, and even better--the newer 'WWN' standard. But the problem is, some storage controllers--notably some USB (v3 and 4) controllers, eSATA, and others, as well as many types of consumer-grade external enclosures--either can't always see those, or worse, don't pass them through to the OS. Merely plugging a cable into the "wrong" port of an external enclosure can trigger this problem in ZFS, and there's no getting around it.

  2. ZOL for some reason can't pick up that the disks do actually exist and are visible to the OS, just not at any of the previous locations ZFS knew before (e.g. /dev, /dev/disk/by-id, by-path, by-guid, etc.) Or the one specific previous location, more to the point. This happens even if you do a proper zpool export before moving anything around, which is particularly frustrating. (I remember this problem even on Solaris, but granted that was a significantly older version of ZFS that would lose the entire pool if the ZIL went missing...which once cost me everything [but I had backups].)

The obvious workaround is to not use consumer-grade hardware with ZFS, especially consumer-grade external enclosures that use some consumer-level protocol like USB, Firewire, eSATA, etc. (External SAS should be fine.)

That specifically--consumer grade external enclosures--has caused me unending headaches. While I did occasionally have this specific problem with slightly more "enterprise"-grade LSI SAS controllers and rackmount chassis with a 5x4 bay, moving to a more portable solution with three external bays pretty much unleashed hell. Thankfully my array is a stripe of three-way mirrors, because at one point it literally lost track of 8 drives (out of 12 total), and the only solution was to resilver them. (Which was mostly reads at GBs/s so at least it didn't take days or weeks.)

So I don't know what the long-term solution is. I wouldn't blame the volunteers working on this mountain of code, if they felt that covering all the edge cases of consumer-grade hardware, for Linux specifically, was out of scope.

But I think that if ZFS did a more exhaustive search of the metadata that ZFS itself manages on each disk, it would fix many related problems. (Btrfs, for example, doesn't suffer from this problem at all. I can move stuff around willy-nilly, completely at random, and it has never once complained. Granted, Btrfs has other shortcomings compared to ZFS (the list of pros and cons is endless), and it's also native Linux--but it at least goes to show that the problem can, in theory, be solved, at least on Linux, specifically by the software itself.)

I've cobbled together a workaround to this problem, and I've now implemented it on all my ZFS arrays, even at work, even on enterprise hardware:

  1. Turn the external enclosures off, so that ZFS doesn't automatically import the pool. (It is frustrating that there still seems to be no way to tell ZFS not to do this. Renaming the cachefile or setting it to "none" doesn't work. Even without the addressing problems, I almost never want the pools to auto-mount but would rather an automatic script do it.)

  2. Once the system is up and settled down, then turn on the external enclosures.

  3. Run a script that exports and imports the pool a few times in a row (frustratingly sometimes necessary for it to see even legit minor changes). The most important thing here is to import in read-only mode to avoid an automatic resilver kicking off.

  4. The script then shows the user the output of zpool status for the read-only pool, and prompts the user to confirm that it's OK to go ahead and import in full read-write mode.
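The answer does not include the script itself, so the following is only a minimal sketch of steps 3 and 4 under stated assumptions: the function name import_readonly_then_confirm is invented here, the repeated export/import loop is left out, and the search directory defaults to /dev/disk/by-path.

```shell
#!/bin/sh
# Sketch of a read-only-first import; not the answer author's actual script.
# $1 = pool name, $2 = device directory to search (default: /dev/disk/by-path).
import_readonly_then_confirm() {
    pool=$1
    dir=${2:-/dev/disk/by-path}

    # Import read-only so nothing on disk changes and no resilver can start.
    zpool import -o readonly=on -d "$dir" "$pool" || return 1

    # Let the user inspect the vdev layout before committing.
    zpool status "$pool"

    printf 'Re-import %s read-write? [y/N] ' "$pool"
    read -r answer
    case $answer in
        y|Y)
            # A read-only import cannot be switched in place, so export
            # and import again without the readonly property.
            zpool export "$pool" && zpool import -d "$dir" "$pool"
            ;;
        *)
            echo "Leaving $pool imported read-only."
            ;;
    esac
}
```

For the pool in the question this would be called as import_readonly_then_confirm solaris, or with a second argument such as /dev/disk/by-id to try a different addressing method.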

Doing this has saved me (or my data) countless times. Usually it means I have to move drives, or more often just cables, around until the addressing gets back to where it was. It also gives me the opportunity to try different addressing methods with the -d switch. Some combination of that, and changing cables/locations, has solved the problem a few times.

In my particular case, importing with -d /dev/disk/by-path is usually the optimal choice, because my former favorite, -d /dev/disk/by-id, is actually fairly unreliable with my current setup. Usually a whole bay of drives is simply missing from the /dev/disk/by-id directory. (And in this case it's hard to blame even Linux. It's just a wonky setup that further aggravates the existing shortcomings previously noted.)
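One way to see which addressing directory is actually complete before choosing a -d argument is to list where each symlink in it points. This is a small sketch; list_links is a hypothetical helper, and it assumes a readlink that supports -f (as GNU coreutils does).

```shell
#!/bin/sh
# Print "name -> resolved target" for every symlink in the given directory.
# (list_links is a hypothetical helper name; requires readlink -f.)
list_links() {
    for link in "$1"/*; do
        [ -L "$link" ] || continue   # skip anything that is not a symlink
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}

# Compare what each naming scheme currently exposes:
#   list_links /dev/disk/by-id
#   list_links /dev/disk/by-path
```

If a bay of drives shows up under by-path but not under by-id, that suggests by-path is the safer directory to pass to zpool import -d on that setup.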

Sure, it means the server can't be relied upon to come up automatically without manual intervention. But considering 1) it runs full-time on a big battery backup, 2) I've knowingly made that tradeoff for the benefit of being able to use consumer-grade hardware that doesn't require two people and a dolly to move... that's an OK tradeoff.

(Edit: corrections.)


Author by

Trausti Thor

All around nice guy who likes developing stuff. Every job has different requirements with different tools. There is no one solution.

Updated on September 18, 2022


  • Trausti Thor almost 2 years

    I am running the latest Debian 7.7 x86 and ZFS on Linux.

    After moving my computer to a different room, zpool status gives me this:

      pool: solaris
     state: DEGRADED
    status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
    action: Replace the device using 'zpool replace'.
    see: http://zfsonlinux.org/msg/ZFS-8000-4J
    scan: none requested
    NAME                                            STATE     READ WRITE CKSUM
    solaris                                         DEGRADED     0     0     0
      raidz1-0                                      DEGRADED     0     0     0
        11552884637030026506                        UNAVAIL      0     0     0  was /dev/disk/by-id/ata-Hitachi_HDS723020BLA642_MN1221F308BR3D-part1
        ata-Hitachi_HDS723020BLA642_MN1221F308D55D  ONLINE       0     0     0
        ata-Hitachi_HDS723020BLA642_MN1220F30N4JED  ONLINE       0     0     0
        ata-Hitachi_HDS723020BLA642_MN1220F30N4B2D  ONLINE       0     0     0
        ata-Hitachi_HDS723020BLA642_MN1220F30JBJ8D  ONLINE       0     0     0

    The disk it says is unavailable is /dev/sdb1. After a bit of investigating, I found that ata-Hitachi_HDS723020BLA642_MN1221F308BR3D-part1 is just a symlink to /dev/sdb1, and it does exist:

    lrwxrwxrwx 1 root root 10 Jan  3 14:49 /dev/disk/by-id/ata-Hitachi_HDS723020BLA642_MN1221F308BR3D-part1 -> ../../sdb1

    If I check smart status, like :

    # smartctl -H /dev/sdb
    smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
    SMART overall-health self-assessment test result: PASSED

    The disk is there. I can do fdisk on it, and everything else.

    If I try to detach it, like :

    zpool detach solaris 11552884637030026506
    cannot detach 11552884637030026506: only applicable to mirror and replacing vdevs

    I also tried with /dev/sdb, /dev/sdb1 and the long by-id name. Same error every time.

    I can't replace it either, or do anything else, it seems. I have even tried turning the computer off and on again, to no avail.

    Unless I actually replace the hard disk itself, I can't see any solution to this problem.

    Ideas ?

    [update] blkid output

    # blkid 
    /dev/mapper/q-swap_1: UUID="9e611158-5cbe-45d7-9abb-11f3ea6c7c15" TYPE="swap" 
    /dev/sda5: UUID="OeR8Fg-sj0s-H8Yb-32oy-8nKP-c7Ga-u3lOAf" TYPE="LVM2_member" 
    /dev/sdb1: UUID="a515e58f-1e03-46c7-767a-e8328ac945a1" UUID_SUB="7ceeedea-aaee-77f4-d66d-4be020930684" LABEL="q.heima.net:0" TYPE="linux_raid_member" 
    /dev/sdf1: LABEL="solaris" UUID="2024677860951158806" UUID_SUB="9314525646988684217" TYPE="zfs_member" 
    /dev/sda1: UUID="6dfd5546-00ca-43e1-bdb7-b8deff84c108" TYPE="ext2" 
    /dev/sdd1: LABEL="solaris" UUID="2024677860951158806" UUID_SUB="1776290389972032936" TYPE="zfs_member" 
    /dev/sdc1: LABEL="solaris" UUID="2024677860951158806" UUID_SUB="2569788348225190974" TYPE="zfs_member" 
    /dev/sde1: LABEL="solaris" UUID="2024677860951158806" UUID_SUB="10515322564962014006" TYPE="zfs_member" 
    /dev/mapper/q-root: UUID="07ebd258-840d-4bc2-9540-657074874067" TYPE="ext4" 

    After disabling mdadm and rebooting, this issue is back. Not sure why sdb is marked as linux_raid_member. How to clear that?

    • ewwhite over 9 years
      Were you using partitions and not full disks?
    • Trausti Thor over 9 years
      When I created the raidz, I used only whole disks, like /dev/sdb, /dev/sdc and so forth. This is something the driver did.
    • Brian Thomas over 6 years
      I can corroborate this; the driver must have done that. I am running into the same issue: I also see -part1, and it's currently UNAVAIL. It apparently was always called that, but it didn't go UNAVAIL until I replaced another REMOVED drive, which wasn't showing in blkid. I disconnected that drive fully but couldn't reboot, so I disconnected all ZFS drives, rebooted successfully, reconnected them all, saw that, and ran zpool replace on the removed drive, but I should have looked into this first. It's making me nervous that another drive might crash before the resilver completes.
  • Trausti Thor over 9 years
    It showed the same errors. I had already done that.
  • ewwhite over 9 years
    This is messed up. Please post the output of blkid.
  • ewwhite over 9 years
    You need to get rid of the mdadm signature. mdadm --zero-superblock on /dev/sdb1
  • Trausti Thor over 9 years
    That did the trick. Cleared the type, and set zfs_member on the disk. Thank you so much. Can you add this as an answer ?