mdadm: drive replacement shows up as spare and refuses to sync

After hours of Googling and some extremely wise help from JyZyXEL in the #linux-raid Freenode channel, we have a solution! There was not a single interruption to the RAID array during this process - exactly what I needed and expected from mdadm.

For some (currently unknown) reason, the RAID state became frozen. The winning command to figure this out is cat /sys/block/md0/md/sync_action:

root@galaxy:~# cat /sys/block/md0/md/sync_action
frozen

Simply put, that is why it was not using the available spares. All my hair is gone at the cost of a simple cat command!
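
For anyone wanting to spot this state from /proc/mdstat alone, here is a rough sketch. It flags arrays that are degraded, hold a spare marked (S), and show no recovery or resync line, which is the combination shown in the question. The function name stuck_spares is made up for illustration, and the patterns assume md names of the form mdN and the mdstat layout shown in this post; other layouts may need adjusting.

```shell
#!/bin/sh
# stuck_spares: hypothetical helper, not part of mdadm.
# Reads /proc/mdstat-style text on stdin and prints the name of every
# array that is degraded, has a spare (S), and is not rebuilding.
stuck_spares() {
    awk '
        function report() {
            if (name != "" && spare && degraded && !rebuilding) print name
        }
        # Header line of an array, e.g. "md0 : active raid6 sdm[6](S) ..."
        /^md[0-9]+ :/ {
            report()
            name = $1
            spare = ($0 ~ /\(S\)/)
            degraded = 0
            rebuilding = 0
            next
        }
        # Member map such as [UU_UUU]; an underscore marks a missing slot
        /\[[U_]*_[U_]*\]/ { degraded = 1 }
        # Progress line such as "recovery =  0.0% (...)"
        /(recovery|resync) *=/ { rebuilding = 1 }
        END { report() }
    '
}
```

Run it as `stuck_spares < /proc/mdstat`. If it prints an array name, the sync_action file for that array is the next thing to look at.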

So, just unfreeze the array:

root@galaxy:~# echo idle > /sys/block/md0/md/sync_action
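
If you would rather not write to sysfs blindly, the check and the write can be wrapped together. This is only a minimal sketch: unfreeze_if_frozen is a hypothetical name, and the path argument is assumed to be the array's sync_action file (/sys/block/md0/md/sync_action in my case). It needs root to work on the real file.

```shell
#!/bin/sh
# unfreeze_if_frozen: hypothetical helper, not part of mdadm.
# $1 = path to an array's sync_action file.
# Writes "idle" only when the current state is "frozen".
unfreeze_if_frozen() {
    sync_file=$1
    state=$(cat "$sync_file") || return 1
    if [ "$state" = "frozen" ]; then
        echo idle > "$sync_file" || return 1
        echo "was frozen, wrote idle"
    else
        echo "state is $state, nothing to do"
    fi
}
```

For example: `unfreeze_if_frozen /sys/block/md0/md/sync_action`. Any state other than frozen is left untouched, so a recovery that is already running is not disturbed.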

And you're away!

root@galaxy:~# cat /sys/block/md0/md/sync_action
recover
root@galaxy:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdm[6] sdb[5] sda[0] sde[4] sdd[3] sdc[1]
      15627548672 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UU_UUU]
      [>....................]  recovery =  0.0% (129664/3906887168) finish=4016.8min speed=16208K/sec
      bitmap: 17/30 pages [68KB], 65536KB chunk

unused devices: <none>
root@galaxy:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jul 30 13:17:25 2014
     Raid Level : raid6
     Array Size : 15627548672 (14903.59 GiB 16002.61 GB)
  Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Mar 17 22:05:30 2015
          State : active, degraded, recovering
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 0% complete

           Name : eclipse:0
           UUID : cc7dac66:f6ac1117:ca755769:0e59d5c5
         Events : 73562

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       32        1      active sync   /dev/sdc
       6       8      192        2      spare rebuilding   /dev/sdm
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       16        5      active sync   /dev/sdb

Bliss :-)
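
To keep an eye on the rebuild, the progress line in /proc/mdstat can be picked apart with a couple of small helpers. This is a sketch with made-up function names. It assumes the progress line looks like the one above and that the block counts and the speed are both in units of 1K, which is my understanding of the mdstat format.

```shell
#!/bin/sh
# Hypothetical helpers, not part of mdadm.

# recovery_progress: reads mdstat-style text on stdin and prints the
# operation and its percentage, e.g. "recovery 0.0%".
recovery_progress() {
    sed -nE 's/.*(recovery|resync) *= *([0-9.]+%).*/\1 \2/p'
}

# rebuild_eta_minutes: $1 = blocks done, $2 = blocks total,
# $3 = speed in K/sec. Prints whole minutes remaining (integer
# arithmetic, rounded down at each step).
rebuild_eta_minutes() {
    echo $(( ($2 - $1) / $3 / 60 ))
}
```

With the numbers from the output above, `rebuild_eta_minutes 129664 3906887168 16208` gives 4017, which lines up with the finish=4016.8min that mdstat reported. `recovery_progress < /proc/mdstat` prints just the percentage line.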

Author: Milos Ivanovic

Updated on September 18, 2022

Comments

  • Milos Ivanovic almost 2 years

    Prelude

    I had the following devices in my /dev/md0 RAID 6: /dev/sd[abcdef]

    The following drives were also present, unrelated to the RAID: /dev/sd[gh]

    The following drives were part of a card reader that was connected, again, unrelated: /dev/sd[ijkl]

    Analysis

    sdf's SATA cable went bad (you could say it was unplugged while in use), and sdf was subsequently rejected from the /dev/md0 array. I replaced the cable and the drive was back, now at /dev/sdm. Please do not challenge my diagnosis, there is no problem with the drive.

    mdadm --detail /dev/md0 showed sdf(F), i.e., that sdf was faulty. So I used mdadm --manage /dev/md0 --remove faulty to remove the faulty drives.

    Now mdadm --detail /dev/md0 showed "removed" in the space where sdf used to be.

    root@galaxy:~# mdadm --detail /dev/md0
    /dev/md0:
            Version : 1.2
      Creation Time : Wed Jul 30 13:17:25 2014
         Raid Level : raid6
         Array Size : 15627548672 (14903.59 GiB 16002.61 GB)
      Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
       Raid Devices : 6
      Total Devices : 5
        Persistence : Superblock is persistent
    
      Intent Bitmap : Internal
    
        Update Time : Tue Mar 17 21:16:14 2015
              State : active, degraded
     Active Devices : 5
    Working Devices : 5
     Failed Devices : 0
      Spare Devices : 0
    
             Layout : left-symmetric
         Chunk Size : 512K
    
               Name : eclipse:0
               UUID : cc7dac66:f6ac1117:ca755769:0e59d5c5
             Events : 67205
    
        Number   Major   Minor   RaidDevice State
           0       8        0        0      active sync   /dev/sda
           1       8       32        1      active sync   /dev/sdc
           4       0        0        4      removed
           3       8       48        3      active sync   /dev/sdd
           4       8       64        4      active sync   /dev/sde
           5       8       16        5      active sync   /dev/sdb
    

    For some reason the RaidDevice of the "removed" device now matches one that is active. Anyway, let's try adding the previous device (now known as /dev/sdm), because that was the original intent:

    root@galaxy:~# mdadm --add /dev/md0 /dev/sdm
    mdadm: added /dev/sdm
    root@galaxy:~# mdadm --detail /dev/md0
    /dev/md0:
            Version : 1.2
      Creation Time : Wed Jul 30 13:17:25 2014
         Raid Level : raid6
         Array Size : 15627548672 (14903.59 GiB 16002.61 GB)
      Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
       Raid Devices : 6
      Total Devices : 6
        Persistence : Superblock is persistent
    
      Intent Bitmap : Internal
    
        Update Time : Tue Mar 17 21:19:30 2015
              State : active, degraded
     Active Devices : 5
    Working Devices : 6
     Failed Devices : 0
      Spare Devices : 1
    
             Layout : left-symmetric
         Chunk Size : 512K
    
               Name : eclipse:0
               UUID : cc7dac66:f6ac1117:ca755769:0e59d5c5
             Events : 67623
    
        Number   Major   Minor   RaidDevice State
           0       8        0        0      active sync   /dev/sda
           1       8       32        1      active sync   /dev/sdc
           4       0        0        4      removed
           3       8       48        3      active sync   /dev/sdd
           4       8       64        4      active sync   /dev/sde
           5       8       16        5      active sync   /dev/sdb
    
           6       8      192        -      spare   /dev/sdm
    

    As you can see, the device shows up as a spare and refuses to sync with the rest of the array:

    root@galaxy:~# cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4]
    md0 : active raid6 sdm[6](S) sdb[5] sda[0] sde[4] sdd[3] sdc[1]
          15627548672 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UU_UUU]
          bitmap: 17/30 pages [68KB], 65536KB chunk
    
    unused devices: <none>
    

    I have also tried using mdadm --zero-superblock /dev/sdm before adding, with the same result.

    The reason I am using RAID 6 is to provide high availability. I will not accept stopping /dev/md0 and re-assembling it with --assume-clean or similar as workarounds to resolve this. This needs to be resolved online, otherwise I don't see the point of using mdadm.

  • Richard Gomes over 4 years
    Thanks for that. Deserved to be bookmarked :-) I'm not having this kind of problem in particular but something similar. I've bought a pair of disks and I'm trying to add them to an existing RAID6 array with two faulty disks. No data loss at this time! :-) ... One of the disks was added OK but the other one is reported as faulty and automagically removed from the array. S.M.A.R.T. does not report anything wrong with the brand new disk... so, I'm still trying to figure out why the disk is refused. I'm doing full tests with the new disk in order to stress it and see if SMART reports anything.