Replacing a dead disk in a zpool

Solution 1

After digging all night I finally found the solution. The short answer is that you can pass a disk's GUID (which persists even after the drive is disconnected) to the zpool command.

Long answer: I got the disk's GUID from the zdb command, which gave me the following output:

root@zeus:/dev# zdb
hermes:
    version: 28
    name: 'hermes'
    state: 0
    txg: 162804
    pool_guid: 14829240649900366534
    hostname: 'zeus'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 14829240649900366534
        children[0]:
            type: 'raidz'
            id: 0
            guid: 5355850150368902284
            nparity: 1
            metaslab_array: 31
            metaslab_shift: 32
            ashift: 9
            asize: 791588896768
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 11426107064765252810
                path: '/dev/disk/by-id/ata-ST3300620A_5QF0MJFP-part2'
                phys_path: '/dev/gptid/73b31683-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 15935140517898495532
                path: '/dev/disk/by-id/ata-ST3300831A_5NF0552X-part2'
                phys_path: '/dev/gptid/746c949a-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 7183706725091321492
                path: '/dev/disk/by-id/ata-ST3200822A_5LJ1CHMS-part2'
                phys_path: '/dev/gptid/7541115a-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[3]:
                type: 'disk'
                id: 3
                guid: 17196042497722925662
                path: '/dev/disk/by-id/ata-ST3200822A_3LJ0189C-part2'
                phys_path: '/dev/gptid/760a94ee-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
    features_for_read:
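
If you would rather not read the GUID off by eye, here is a minimal sketch that pulls it out of output shaped like the listing above. The helper name guid_for_disk is made up for illustration, and it assumes each child's guid line comes before its path line, as it does in this listing.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): print the GUID of the vdev whose
# path contains the string given as $1. Reads zdb output on stdin.
# Assumption: within each child, "guid:" is printed before "path:".
guid_for_disk() {
    awk -v needle="$1" '
        $1 == "guid:" { guid = $2 }    # remember the most recent guid seen
        $1 == "path:" && index($0, needle) { print guid; exit }
    '
}

# Usage, assuming plain zdb prints the pool configuration:
#   zdb | guid_for_disk ata-ST3300831A_5NF0552X
```

The GUID is kept as text all the way through, so awk never rounds the large number.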

The GUID I was looking for is 15935140517898495532 (children[1], whose path matches the dead drive ata-ST3300831A_5NF0552X), which enabled me to do

root@zeus:/dev# zpool offline hermes 15935140517898495532
root@zeus:/dev# zpool status
  pool: hermes
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 2h4m with 0 errors on Sun Jun  9 00:28:24 2013
config:

        NAME                         STATE     READ WRITE CKSUM
        hermes                       DEGRADED     0     0     0
          raidz1-0                   DEGRADED     0     0     0
            ata-ST3300620A_5QF0MJFP  ONLINE       0     0     0
            ata-ST3300831A_5NF0552X  OFFLINE      0     0     0
            ata-ST3200822A_5LJ1CHMS  ONLINE       0     0     0
            ata-ST3200822A_3LJ0189C  ONLINE       0     0     0

errors: No known data errors

and then

root@zeus:/dev# zpool replace hermes 15935140517898495532 /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
root@zeus:/dev# zpool status
  pool: hermes
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jun  9 01:44:36 2013
    408M scanned out of 419G at 20,4M/s, 5h50m to go
    101M resilvered, 0,10% done
config:

        NAME                            STATE     READ WRITE CKSUM
        hermes                          DEGRADED     0     0     0
          raidz1-0                      DEGRADED     0     0     0
            ata-ST3300620A_5QF0MJFP     ONLINE       0     0     0
            replacing-1                 OFFLINE      0     0     0
              ata-ST3300831A_5NF0552X   OFFLINE      0     0     0
              ata-ST3500320AS_9QM03ATQ  ONLINE       0     0     0  (resilvering)
            ata-ST3200822A_5LJ1CHMS     ONLINE       0     0     0
            ata-ST3200822A_3LJ0189C     ONLINE       0     0     0

errors: No known data errors

After the resilver completed, everything worked again. It would be nice if the zpool manpage mentioned that a disk's GUID, obtained through zdb, can be used with the zpool command.
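
For reference, the two steps above can be sketched as a small dry-run helper that only prints the commands instead of running them. The name replace_plan is hypothetical. The numeric check is just a guard against pasting a device name where the GUID belongs.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): print, but do not run, the commands
# for replacing a vanished disk by GUID.
# Arguments: pool name, GUID of the dead disk, path of the new disk.
replace_plan() {
    pool=$1 guid=$2 new_dev=$3
    # A vdev GUID is a plain decimal number; reject anything else.
    case $guid in
        ''|*[!0-9]*) echo "not a numeric GUID: $guid" >&2; return 1 ;;
    esac
    echo "zpool offline $pool $guid"
    echo "zpool replace $pool $guid $new_dev"
    echo "zpool status $pool"
}

# Usage with the values from this answer:
#   replace_plan hermes 15935140517898495532 /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
```

Review the printed lines, then run them by hand as root.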

Edit

As pointed out by durval below, plain zdb may not output anything. In that case you can try

zdb -l /dev/<name-of-device>

to explicitly list information about the device (even if it already is missing from the system).
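
A minimal sketch for pulling the GUID out of that label dump follows. The helper name label_guid is made up, and it assumes the label prints the device's own "guid:" line before the nested vdev_tree entries, which is how the labels I have seen are laid out. Check the result against the full output before using it.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): print the device GUID from
# `zdb -l` output read on stdin.
# Assumption: the device's own "guid:" line appears before the nested
# vdev_tree entries. "pool_guid:" and "top_guid:" are different keys and
# therefore do not match the comparison below.
label_guid() {
    awk '$1 == "guid:" { print $2; exit }'
}

# Usage, with the placeholder from the text left as-is:
#   zdb -l /dev/<name-of-device> | label_guid
```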

Solution 2

The issue is that the disks are referenced by ID and not by device name.

Here is a workaround that should work:

ln -s /dev/null /dev/ata-ST3300831A_5NF0552X
zpool export hermes
zpool import hermes
zpool status
# note the new device name that should appear here
zpool offline hermes xxxx
zpool replace hermes xxxx /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ

Edit: I was 30 seconds late ...

Solution 3

@Marcus: Thanks for posting this excellent answer to your own question, it helped me a lot.

The other day I found a twist that might interest you (and anyone else that comes here a-googling in the future): I had a cache device that was dropped from the pool (and marked as UNAVAIL) due to this same error (ZFS-8000-4J, label is missing or invalid), and trying to offline/remove/replace it failed with exactly the same "no such device in pool" message.

BUT, when I tried to apply your solution, plain zdb (with no arguments) did not list the device, much less its GUID.

After some digging, I found that zdb -l /dev/DEVICENAME listed the GUID (taking it directly from the device, and not from the pool records), and using that GUID enabled me to do the replacement (actually I did a zpool offline followed by a zpool remove and then a zpool add, which worked perfectly).

Author by

Marcus

Updated on September 18, 2022

Comments

  • Marcus
    Marcus almost 2 years

    I'm running Ubuntu Server 13.04 64-bit using native ZFS. I have a zpool consisting of 4 hard drives of which one died yesterday and now is not being recognized by the OS or the BIOS anymore.

    Unfortunately I saw the problem only after the next reboot so now the drive label is missing and I can't replace the disk using the official instructions here and here.

    zpool status hermes -x
    

    prints

    root@zeus:~# zpool status hermes -x
      pool: hermes
     state: DEGRADED
    status: One or more devices could not be used because the label is missing or
            invalid.  Sufficient replicas exist for the pool to continue
            functioning in a degraded state.
    action: Replace the device using 'zpool replace'.
       see: http://zfsonlinux.org/msg/ZFS-8000-4J
      scan: scrub repaired 0 in 2h4m with 0 errors on Sun Jun  9 00:28:24 2013
    config:
    
            NAME                         STATE     READ WRITE CKSUM
            hermes                       DEGRADED     0     0     0
              raidz1-0                   DEGRADED     0     0     0
                ata-ST3300620A_5QF0MJFP  ONLINE       0     0     0
                ata-ST3300831A_5NF0552X  UNAVAIL      0     0     0
                ata-ST3200822A_5LJ1CHMS  ONLINE       0     0     0
                ata-ST3200822A_3LJ0189C  ONLINE       0     0     0
    
    errors: No known data errors
    

    I already replaced the drive with a new one (which got the label /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ).

    Any one of the commands

    zpool replace hermes /dev/disk/by-id/ata-ST3300831A_5NF0552X /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
    zpool offline hermes /dev/disk/by-id/ata-ST3300831A_5NF0552X
    zpool detach hermes /dev/disk/by-id/ata-ST3300831A_5NF0552X
    

    fails with

    root@zeus:~# zpool offline hermes /dev/disk/by-id/ata-ST3300831A_5NF0552X
    cannot offline /dev/disk/by-id/ata-ST3300831A_5NF0552X: no such device in pool
    

    because the label of the drive that died no longer exists in the system. I also tried the commands above without the path to the drive's label, to no avail.

    How can I replace the "ghost" disk?

  • jlliagre
    jlliagre about 11 years
    My suggestion is almost identical to what you did. The only difference is the way to get the device guid. After creating a symlink to /dev/null (which is different from an empty link) and export/importing the pool, it appears in zpool status
  • Marcus
    Marcus about 9 years
    Thanks! I added a hint in my own accepted answer with a reference to your comment.
  • Serrano Pereira
    Serrano Pereira almost 8 years
    Using this method I actually managed to get the "defect" drive back online without replacing it (so I skipped offline and replace steps). I imported the pool a second time after removing the null link. Maybe it was just an issue with the drive label? In fact, the drive name remained the same. I did a complete scrub afterwards and no errors were found.
  • Brian Thomas
    Brian Thomas over 7 years
    Cool, then before running add, use the -n switch; the -g switch will also grab the uuid that way.
  • xamox
    xamox over 7 years
    Thanks, this was quite helpful. Poking around the web, I couldn't find info on gleaning this from zdb.
  • Matt
    Matt about 7 years
    I've been searching for weeks and finally this answer did the trick. But the IDs listed by zpool status (names like sdab) were NOT the same as the paths in /dev/disk/by-id (crazy long ID names). But ls -la /dev/disk/by-id reveals that they are all links to /dev/... so I found the one pointing to my UNAVAIL (and subsequently OFFLINE) disk, and I was able to complete these steps successfully. It is now resilvering. Thank you!
  • Matt
    Matt about 7 years
    For me, zdb -l /dev/... always showed "failed to unpack label".
  • StarNamer
    StarNamer over 6 years
    An alternative shorter way to get the GUID is zpool status -g which shows the status using GUIDs for each device. Also, for @Matt, zpool status -L will show the status using the basic device names instead of the long /dev/disk/by-id names.
  • extracrispy
    extracrispy about 5 years
    You're a real MVP coming back with your solution. This worked for me.
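
Following up on StarNamer's zpool status -g tip, here is a minimal sketch that lists the GUIDs of member devices that are not healthy. The helper name unhealthy_guids is made up, and it assumes the NAME column holds plain numeric GUIDs when -g is given. The set of states it matches is my own choice.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): print the GUIDs of devices whose
# state is UNAVAIL, OFFLINE, FAULTED or REMOVED.
# Reads `zpool status -g` output on stdin.
# Assumption: with -g the NAME column contains numeric GUIDs.
unhealthy_guids() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^(UNAVAIL|OFFLINE|FAULTED|REMOVED)$/ { print $1 }'
}

# Usage:
#   zpool status -g hermes | unhealthy_guids
```

A DEGRADED raidz vdev is skipped on purpose, since the disk underneath it is what needs replacing.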