How do I recover from a faulted zpool where one device is OK, but was temporarily offline?

Solution 1

Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.

Looks like after the initial temporary failure, you may only have needed to do a zpool clear to clear the errors.

If you want to pretend that it's a drive replacement, you probably need to clear the data off the drive first before you try re-adding it to the pool.
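
One way to read "clear the data off the drive" is sketched below. It is an illustration under assumptions, not a tested procedure: the `dd` invocation is the one the asker proposes in the comments, the pool and device names (`farcryz1`, `da4`, `da5`) come from the question, and the `run` and `wipe_and_replace` helpers are hypothetical names introduced here. Because zeroing a disk is destructive, the sketch only prints the commands unless `EXECUTE=1` is set.

```shell
#!/bin/sh
# Sketch only. Commands are printed, not executed, unless EXECUTE=1 is set.
# 'run' and 'wipe_and_replace' are hypothetical helper names.

run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# $1 = pool, $2 = name the pool knows the vdev by,
# $3 = device node the disk currently appears as (defaults to $2).
wipe_and_replace() {
    pool=$1
    old=$2
    new=${3:-$2}

    # DESTRUCTIVE: overwrites the whole disk with zeros.
    # dd typically stops with an end-of-device error once the disk is
    # full, so its exit status is deliberately not checked here.
    run dd if=/dev/zero of="/dev/$new" bs=1M || true

    if [ "$new" = "$old" ]; then
        run zpool replace "$pool" "$old"
    else
        run zpool replace "$pool" "$old" "$new"
    fi
}
```

Calling `wipe_and_replace farcryz1 da4 da5` prints the `dd` command followed by `zpool replace farcryz1 da4 da5`. According to the asker's follow-up comment, a reboot was needed before the replace succeeded, so treat the order of steps as a starting point rather than a guarantee.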

Solution 2

If zpool clear doesn't fix it, you can make zfs forget the disk using zpool labelclear <partition> (available in ZFS on Linux, http://zfsonlinux.org, since zfs-v0.6.2).

Be aware that even if you created the zpool using a whole device, e.g. /dev/sda, you have to specify the partition that zfs created on it, e.g. /dev/sda1.

(Credits go to DeHackEd, https://github.com/zfsonlinux/zfs/issues/2076)

From the zpool man page:

zpool labelclear [-f] device

Removes ZFS label information from the specified device. The device
must not be part of an active pool configuration.

  -f     Treat exported or foreign devices as inactive.
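
The man page excerpt above can be turned into a small wrapper, sketched here under assumptions: `/dev/sda1` is the example partition from this answer, and `run` and `clear_zfs_label` are hypothetical helper names. The sketch prints the command instead of running it unless `EXECUTE=1` is set, since clearing a label cannot be undone.

```shell
#!/bin/sh
# Sketch only. Commands are printed, not executed, unless EXECUTE=1 is set.
# 'run' and 'clear_zfs_label' are hypothetical helper names.

run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# $1 = partition holding the ZFS label (e.g. /dev/sda1, not /dev/sda)
# $2 = optional "-f" to treat exported or foreign devices as inactive
clear_zfs_label() {
    part=$1
    if [ "${2:-}" = "-f" ]; then
        run zpool labelclear -f "$part"
    else
        run zpool labelclear "$part"
    fi
}
```

For example, `clear_zfs_label /dev/sda1` prints `zpool labelclear /dev/sda1`. The device must not be part of an active pool configuration when the real command runs, as the man page states.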

Solution 3

What were the outputs of the various commands you tried? Did you try the -f switch on any of them?

Did you run zpool clear poolname device-name?

In your case, that would be zpool clear farcryz1 da4, which should have gotten the resilvering process underway.
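
A minimal sketch of that per-device clear, followed by a status check to see whether a resilver has started, might look like this. The pool and device names are the ones from the question; `run` and `clear_and_check` are hypothetical helper names, and the sketch prints the commands unless `EXECUTE=1` is set.

```shell
#!/bin/sh
# Sketch only. Commands are printed, not executed, unless EXECUTE=1 is set.
# 'run' and 'clear_and_check' are hypothetical helper names.

run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# $1 = pool name, $2 = device whose error counters should be cleared
clear_and_check() {
    pool=$1
    dev=$2
    # Clear the error counters for this one device only.
    run zpool clear "$pool" "$dev" || return 1
    # Show the pool state afterwards, including any resilver in progress.
    run zpool status -v "$pool"
}
```

Running `clear_and_check farcryz1 da4` prints the two commands in order. Whether the clear actually triggers a resilver depends on the state the device is in, as the asker's comments show.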

Author by

Josh

I am Josh Gitlin, CTO and co-founder of Digital Fruition a software as a service eCommerce company. Currently serving as Principal DevOps Engineer at Pinnacle 21, and hacking away at Cinc Server, the free-as-in-beer rebranded distribution of Chef Server.

Updated on September 18, 2022

Comments

  • Josh
    Josh over 1 year

    I have a zpool with 4 2TB USB disks in a raidz config:

    [root@chef /mnt/Chef]# zpool status farcryz1
      pool: farcryz1
     state: ONLINE
     scrub: none requested
    config:
    
        NAME        STATE     READ WRITE CKSUM
        farcryz1    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
    

    In order to test the pool, I simulated a drive failure by pulling the USB cable from one of the drives without taking it offline:

    [root@chef /mnt/Chef]# zpool status farcryz1
      pool: farcryz1
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-9P
     scrub: none requested
    config:
    
        NAME        STATE     READ WRITE CKSUM
        farcryz1    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da4     ONLINE      22     4     0
            da3     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
    
    errors: No known data errors
    

    Data's still there, pool still online. Great! Now let's try to restore the pool. I plugged the drive back in, and issued the zpool replace command as I was instructed to above:

    [root@chef /mnt/Chef]# zpool replace farcryz1 da4
    invalid vdev specification
    use '-f' to override the following errors:
    /dev/da4 is part of active pool 'farcryz1'
    

    Um... That's not helpful. So I tried a zpool clear farcryz1, but that didn't help at all. I still couldn't replace da4. So I tried a combination of onlining, offlining, clearing, replacing, and scrubbing. Now I am stuck here:

    [root@chef /mnt/Chef]# zpool status -v farcryz1
      pool: farcryz1
     state: DEGRADED
    status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
    action: Replace the device using 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-4J
     scrub: scrub completed after 0h2m with 0 errors on Fri Sep  9 13:43:34 2011
    config:
    
        NAME        STATE     READ WRITE CKSUM
        farcryz1    DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            da4     UNAVAIL      9     0     0  experienced I/O failures
            da3     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
    
    errors: No known data errors
    [root@chef /mnt/Chef]# zpool replace farcryz1 da4
    cannot replace da4 with da4: da4 is busy
    

    How can I recover from this situation, where one device in my zpool was unexpectedly disconnected (but is not a failed device) and is now back again, ready to be resilvered?


    EDIT: As requested, a tail of dmesg:

    (ses3:umass-sim4:4:0:1): removing device entry
    (da4:umass-sim4:4:0:0): removing device entry
    ugen3.2: <Western Digital> at usbus3
    umass4: <Western Digital My Book 1140, class 0/0, rev 3.00/10.03, addr 1> on usbus3
    da4 at umass-sim4 bus 4 scbus6 target 0 lun 0
    da4: <WD My Book 1140 1003> Fixed Direct Access SCSI-6 device 
    da4: 400.000MB/s transfers
    da4: 1907697MB (3906963456 512 byte sectors: 255H 63S/T 243197C)
    ses3 at umass-sim4 bus 4 scbus6 target 0 lun 1
    ses3: <WD SES Device 1003> Fixed Enclosure Services SCSI-6 device 
    ses3: 400.000MB/s transfers
    ses3: SCSI-3 SES Device
    GEOM: da4: partition 1 does not start on a track boundary.
    GEOM: da4: partition 1 does not end on a track boundary.
    GEOM: da4: partition 1 does not start on a track boundary.
    GEOM: da4: partition 1 does not end on a track boundary.
    ugen3.2: <Western Digital> at usbus3 (disconnected)
    umass4: at uhub3, port 1, addr 1 (disconnected)
    (da4:umass-sim4:4:0:0): lost device
    (da4:umass-sim4:4:0:0): removing device entry
    (ses3:umass-sim4:4:0:1): lost device
    (ses3:umass-sim4:4:0:1): removing device entry
    ugen3.2: <Western Digital> at usbus3
    umass4: <Western Digital My Book 1140, class 0/0, rev 3.00/10.03, addr 1> on usbus3
    da4 at umass-sim4 bus 4 scbus6 target 0 lun 0
    da4: <WD My Book 1140 1003> Fixed Direct Access SCSI-6 device 
    da4: 400.000MB/s transfers
    da4: 1907697MB (3906963456 512 byte sectors: 255H 63S/T 243197C)
    ses3 at umass-sim4 bus 4 scbus6 target 0 lun 1
    ses3: <WD SES Device 1003> Fixed Enclosure Services SCSI-6 device 
    ses3: 400.000MB/s transfers
    ses3: SCSI-3 SES Device
    
  • Josh
    Josh over 12 years
    I tried zpool clear farcryz1 da4, and it produces no output and no change at all. I just did that and I am now seeing da4 UNAVAIL 3 0 0 experienced I/O failures.
  • ewwhite
    ewwhite over 12 years
    And can you reboot? What does a tail of dmesg say?
  • Josh
    Josh over 12 years
    "you may only have needed to do a zpool clear to clear the errors" -- so the extra commands I ran probably caused the situation I'm in now. I suspected that as well. "you probably need to clear the data off the drive first before you try re-adding it to the pool" -- So, dd if=/dev/zero of=/dev/da4 bs=1M?
  • Steve Townsend
    Steve Townsend over 12 years
    Sure, blow it away, pretend it's a brand-new drive.
  • Josh
    Josh over 12 years
    This did it, thanks! Before this would work, I had to reboot, and after doing so, zpool replace farcryz1 da4 responded with cannot replace da4 with da4: no such pool or dataset. But hooking up another USB drive as da4 and then the newly-zeroed 2TB drive after as da5 allowed me to zpool replace farcryz1 da4 da5. Thanks!
  • barrymac
    barrymac almost 9 years
    Do you have to zero the entire drive?
  • barrymac
    barrymac almost 9 years
    I was in a similar situation and was able to replace the drive after zeroing it, without a reboot.