Repair MD (mdadm) RAID

I just happened to notice one day that 2 drives in my RAID 6 configuration had died. Repairing a RAID array works the disks pretty heavily. When you consider that disk drives have a rated MTTF, and that an array is usually built from disks of the same type and age, another drive failing while a repair is underway is uncomfortably plausible. This is why I really like RAID 6: it gives me some peace of mind rather than that walking-on-eggshells feeling I often get repairing a RAID volume (particularly one whose backups are questionable). Unfortunately, I cashed in this insurance policy without realising it. Although mdadm sends an email when a RAID becomes degraded, an improperly configured firewall rule prevented me from ever seeing it.
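
The real lesson is to test the notification path, not just configure it. On a Debian/Ubuntu style setup, something along these lines confirms mdadm has an address to mail to and sends a test alert for each array (the config path is an assumption about your distribution):

root@host:~# grep MAILADDR /etc/mdadm/mdadm.conf
root@host:~# mdadm --monitor --scan --oneshot --test

The --test flag generates a TestMessage alert for every array mdadm finds, which is a far cheaper way to discover a broken firewall rule than a double drive failure.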

Luckily, all hope was not lost, as I caught the problem before another drive died and the RAID failed.

Intuition (or paranoia) caused me to print out the status of the RAID array using the mdadm tool.

root@host:~# mdadm --detail /dev/md2
/dev/md2:
        Version : 00.90
  Creation Time : Thu Apr 28 13:37:35 2011
     Raid Level : raid6
     Array Size : 1949216512 (1858.92 GiB 1996.00 GB)
  Used Dev Size : 487304128 (464.73 GiB 499.00 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Fri Sep 21 13:46:56 2012
          State : clean, degraded
 Active Devices : 4
Working Devices : 4
 Failed Devices : 2
  Spare Devices : 0

     Chunk Size : 64K

           UUID : 929bc0fb:73d94f86:fe36db6a:e236fe49
         Events : 0.366048

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       0        0        1      removed
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8       97        4      active sync   /dev/sdg1
       5       8      113        5      active sync   /dev/sdh1
       6       8       49        -      faulty spare   /dev/sdd1
       7       8       33        -      faulty spare   /dev/sdc1

As you can see, both /dev/sdd1 and /dev/sdc1 had fallen out of the array. Checking the spare drive cupboard, I was relieved to find 2 compatible drives ready to go. Before I could swap the drives out, however, I had to identify the dead drives. This server holds 8 drives in a 4×2 configuration, and the backplane was wired up to the disk controllers in an odd way. The bottom-left disk is /dev/sda; moving right across the bottom row we go through /dev/sdb, /dev/sdc, and /dev/sdd. Moving back to the left side, now on the top row, we resume with /dev/sde, then move right to /dev/sdf, and so on. This is not the layout I would have assumed, so I was glad I took a couple of steps to make sure I wasn’t replacing the wrong drives.

The hdparm command can show the serial number of a drive, which can then (hopefully) be matched to the serial number printed on the label of the disk.

root@host:~# hdparm -I /dev/sda |head

/dev/sda:

ATA device, with non-removable media
        Model Number:       WDC WD20EFRX-68EUZN0
        Serial Number:      WD-WCC4MFLFBEEF7
        Firmware Revision:  82.00A82
        Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
        Supported: 9 8 7 6 5

Unfortunately, we aren’t as lucky when trying to read information from a dead drive.

root@host:~# hdparm -i /dev/sdd

/dev/sdd:
 HDIO_DRIVE_CMD(identify) failed: Invalid exchange
 HDIO_GET_IDENTITY failed: No message of desired type

The logical thing to do here would be to print out the serial number of every working drive and compare those numbers to the disk labels. Ideally this would leave us with 2 unidentified drives, which would be our problem drives.
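
In command form, that might look something like the loop below (a sketch rather than something I ran), with the dead drives simply failing to answer:

root@host:~# for d in {a..h}; do echo -n "sd$d: "; hdparm -I /dev/sd$d 2>/dev/null |grep 'Serial Number' || echo 'no response'; done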

That seemed like too much work, plus I didn’t have a pen, so I used the following command instead to cause heavy disk activity on each drive (sda – sdh). CAUTION: Read the explanation below before running this command.

root@host:~# echo {a..h} |xargs -n1 |xargs -i dd if=/dev/sd{} of=/dev/null bs=1M count=500

You can watch in the video as the activity light moves through the drives in order, /dev/sda through /dev/sdh. The two drives on the bottom right turned out to be our dead drives.

Attempts to read the dead drives fail without delay.

root@host:~# dd if=/dev/sdc of=/dev/null
dd: reading `/dev/sdc': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0109307 s, 0.0 kB/s

Although these disks are part of an array, they can still be accessed individually. Writing to a drive directly, bypassing the array, would be very bad, but reading from a drive should be safe. It’s best to do something like this when the system is idle: MD can decide to throw a drive that isn’t responding fast enough out of the array, and a big sequential read competes with whatever data the RAID is requesting of the drive, which might make MD think something isn’t right with it. Although I think it is relatively safe to do this on an idle system, use this technique at your own discretion. In retrospect, a safer idea that would probably also work is to read from the array device itself with the following command and note which activity lights do not light up.

root@host:~# dd if=/dev/md2 of=/dev/null bs=1M count=500

So that gave me the logical names of the failed drives (/dev/sdc and /dev/sdd) as well as their physical positions (bottom right). I could have used the mdadm tool to remove those drives from the array, but since I was about to physically remove them anyhow, I figured MD would sort it out on its own (which it did). The following commands are therefore probably optional.

root@host:~# mdadm --manage /dev/md2 -r /dev/sdc1
root@host:~# mdadm --manage /dev/md2 -r /dev/sdd1

If you are replacing drives and your drives, controllers, and/or backplane do not support hot-swapping, you should shut down the computer before swapping the failed drive(s).
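
If your hardware does support hot-swapping, it’s polite to tell the kernel to let go of a dead disk before pulling it. For SATA/SAS drives exposed through the SCSI layer, something like the following should do it (I shut my machine down instead, so treat this as an untested sketch; the comment at the end of this post covers the rescan needed after inserting the replacement):

root@host:~# echo 1 > /sys/block/sdc/device/delete
root@host:~# echo 1 > /sys/block/sdd/device/delete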

After replacing the drives and powering the computer back up, I checked out the new status.

root@host:~# mdadm --detail /dev/md2
/dev/md2:
        Version : 00.90
  Creation Time : Thu Apr 28 13:37:35 2011
     Raid Level : raid6
     Array Size : 1949216512 (1858.92 GiB 1996.00 GB)
  Used Dev Size : 487304128 (464.73 GiB 499.00 GB)
   Raid Devices : 6
  Total Devices : 4
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Sat Sep 22 16:00:03 2012
          State : clean, degraded
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

           UUID : 929bc0fb:73d94f86:fe36db6a:e236fe49
         Events : 0.389418

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       0        0        1      removed
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8       97        4      active sync   /dev/sdg1
       5       8      113        5      active sync   /dev/sdh1

Well, it’s true, those drives were removed, but you might be surprised not to see the replacement drives show up. A quick check of /dev for my replacement drives showed that they were available.

root@host:~# ls /dev/sdc
/dev/sdc
root@host:~# ls /dev/sdd
/dev/sdd
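
Another quick sanity check is /proc/partitions, which at this point should list sdc and sdd as whole disks with no sdc1 or sdd1 partitions on them:

root@host:~# grep 'sd[cd]' /proc/partitions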

MD did not see my drives yet because I hadn’t partitioned them or marked the partition type for use with Linux software RAID. The replacement drives need a capacity that matches or exceeds that of the other drives in the array. Using the fdisk command, I listed the partition table of one of the working drives belonging to the array (md2).

root@host:~# fdisk -l -u /dev/sde

Disk /dev/sde: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000cd2a3

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1            2048   974610431   487304192   fd  Linux raid autodetect

The -l switch tells fdisk to list the partition information and exit, and the -u switch switches the units to sectors rather than cylinders. This matters because cylinders are a pretty antiquated way to provision a partition and can be problematic with newer “Advanced Format” drives.
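
As an aside, parted (which I didn’t use here) has an align-check command that can confirm an existing partition starts on a sensible boundary for Advanced Format drives:

root@host:~# parted /dev/sde align-check optimal 1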

To set up /dev/sdc to match the settings of /dev/sde, I ran the following commands.

root@host:~# fdisk -u -c /dev/sdc

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First sector (2048-976773167, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-976773167, default 976773167): 974610431

Command (m for help): t
Selected partition 1
Hex code (type L to list codes): fd
Changed system type of partition 1 to fd (Linux raid autodetect)

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

The -c flag turns off DOS compatibility, a legacy feature that needs to be disabled to work with sectors rather than cylinders. For the ending sector I typed in 974610431, which was the ending sector of /dev/sde1 from the fdisk listing of /dev/sde above.
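
If you would rather not eyeball that number, the ending sector can be pulled straight out of the earlier listing with awk (this assumes no boot flag is set on the partition, which would shift the columns):

root@host:~# fdisk -l -u /dev/sde |awk '$1 == "/dev/sde1" {print $3}'
974610431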

Using fdisk to list /dev/sdc, I could see that the new /dev/sdc1 matched the details of /dev/sde1.

root@host:~# fdisk -l -u /dev/sdc

Disk /dev/sdc: 500.1 GB, 500107862016 bytes
177 heads, 54 sectors/track, 102194 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xf73c7e1d

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1            2048   974610431   487304192   fd  Linux raid autodetect

If you have sfdisk installed, a much easier way to clone partition information is to use the -d option of sfdisk with the source drive as an argument. This outputs the partition table of a good drive in a manner similar to fdisk -l, with the key distinction that the output can be read back (through a file or a pipe) by another instance of sfdisk and applied to a new target drive. I used sfdisk as shown below to clone the partition information from /dev/sde to /dev/sdd.

root@host:~# sfdisk -d /dev/sde |sfdisk --force /dev/sdd
Checking that no-one is using this disk right now ...
OK

Disk /dev/sdd: 60801 cylinders, 255 heads, 63 sectors/track
Old situation:
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sdd1          0       -       0          0    0  Empty
/dev/sdd2          0       -       0          0    0  Empty
/dev/sdd3          0       -       0          0    0  Empty
/dev/sdd4          0       -       0          0    0  Empty
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sdd1          2048 974610431  974608384  fd  Linux raid autodetect
/dev/sdd2             0         -          0   0  Empty
/dev/sdd3             0         -          0   0  Empty
/dev/sdd4             0         -          0   0  Empty
Warning: partition 1 does not end at a cylinder boundary
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)

Pretty neat trick: sfdisk did all the work for me.
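
The same dump also works through a file, which doubles as a handy backup of the partition table (sde.parts is just a name I made up):

root@host:~# sfdisk -d /dev/sde > sde.parts
root@host:~# sfdisk --force /dev/sdd < sde.parts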

For GPT disks, sfdisk won’t do, but sgdisk (available in the gdisk package on Ubuntu) can handle them easily.

apt-get -y install gdisk
sgdisk /dev/sde -R /dev/sdd
sgdisk -G /dev/sdd

This copies the partition information from /dev/sde to /dev/sdd and creates unique GUIDs for /dev/sdd so that it doesn’t get confused with /dev/sde.
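
If you did go the GPT route, sgdisk can also print the resulting table to verify the copy (my disks here are plain MBR, so this is just for reference):

root@host:~# sgdisk -p /dev/sdd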

To be safe, I checked to make sure the partitions were all set up the same.

root@host:~# echo {c..e} | xargs -n1 echo |xargs -i bash -c "fdisk -l -u /dev/sd{} |tail -n1"
/dev/sdc1            2048   974610431   487304192   fd  Linux raid autodetect
/dev/sdd1            2048   974610431   487304192   fd  Linux raid autodetect
/dev/sde1            2048   974610431   487304192   fd  Linux raid autodetect

Everything looked good, and as you can see, sfdisk also copied the partition type ‘fd’, marking /dev/sdd1 for use with software RAID.

Using the mdadm command, I added the new devices to the array.

root@host:~# mdadm --manage /dev/md2 -a /dev/sdc1
mdadm: added /dev/sdc1
root@host:~# mdadm --manage /dev/md2 -a /dev/sdd1
mdadm: added /dev/sdd1

The recovery process takes a while; the time depends on the drive sizes and several other factors. You can watch the progress through the /proc filesystem.

root@host:~# watch grep recovery /proc/mdstat
Every 2.0s: grep recovery /proc/mdstat                  Sat Sep 22 19:09:34 2012

      [>....................]  recovery =  1.9% (9723136/487304128) finish=150.0
min speed=53058K/sec

The progress shown represents the recovery of one drive back into the array; it will repeat for subsequent drives. Hit Ctrl+C to exit the watch command.
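
If the rebuild seems unreasonably slow, the kernel’s md speed limits (in KB/s) are worth a look; raising the minimum trades foreground I/O performance for a faster rebuild. I didn’t need to touch these, but for reference:

root@host:~# cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
root@host:~# echo 100000 > /proc/sys/dev/raid/speed_limit_min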

You can also watch the repair status through mdadm, although it is slightly less exciting.

root@host:~# mdadm --detail /dev/md2
/dev/md2:
        Version : 00.90
  Creation Time : Thu Apr 28 13:37:35 2011
     Raid Level : raid6
     Array Size : 1949216512 (1858.92 GiB 1996.00 GB)
  Used Dev Size : 487304128 (464.73 GiB 499.00 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Sat Sep 22 20:05:09 2012
          State : clean, degraded, recovering
 Active Devices : 4
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 2

     Chunk Size : 64K

 Rebuild Status : 39% complete

           UUID : 929bc0fb:73d94f86:fe36db6a:e236fe49
         Events : 0.389454

    Number   Major   Minor   RaidDevice State
       7       8       33        0      spare rebuilding   /dev/sdc1
       1       0        0        1      removed
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8       97        4      active sync   /dev/sdg1
       5       8      113        5      active sync   /dev/sdh1

       6       8       49        -      spare   /dev/sdd1

Eventually, after both of the disks were initiated into the array, the state changed to “clean” as shown below.

root@host:~# mdadm --detail /dev/md2
/dev/md2:
        Version : 00.90
  Creation Time : Thu Apr 28 13:37:35 2011
     Raid Level : raid6
     Array Size : 1949216512 (1858.92 GiB 1996.00 GB)
  Used Dev Size : 487304128 (464.73 GiB 499.00 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Sun Sep 23 01:06:19 2012
          State : clean
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

           UUID : 929bc0fb:73d94f86:fe36db6a:e236fe49
         Events : 0.389514

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8       97        4      active sync   /dev/sdg1
       5       8      113        5      active sync   /dev/sdh1
root@host:~#

The whole process was pretty painless, and it certainly helps solidify my opinion that in today’s computing landscape, proprietary hardware RAID controllers are no longer the most desirable way of doing RAID.


One Response to Repair MD (mdadm) RAID

  1. Cameron Nicholson says:

    Turns out you can sort of bypass the non-hotswap problem thusly:

    echo "- - -" > /sys/class/scsi_host/host[0-N]/scan

    This will force a re-scan of the SCSI host in question (which is the interface that SATA and SAS use). Presto! No reboot required.
