Fixing software RAID

The primary drive failed in one of my servers. Luckily, since it was using mirrored drives, the system carried on running until I could get PoundHost to replace it. However, because it was the primary drive that was replaced, the system wouldn’t boot and they had to use a rescue cd. Here’s what was done to fix it…

This assumes that you’re booting from the Gentoo live cd, but it would probably work with Knoppix or some other cd. At this point, the failed drive (/dev/sda) has been replaced with a new one. You will also need to be logged in as root rather than a normal user.
First of all, get the partition table from the working drive (/dev/sdb)
$ fdisk -l /dev/sdb
Then create an identical partition table on /dev/sda. Make sure the partitions are type ‘fd’ (Linux raid autodetect), apart from the partition used for /boot (/dev/sda1), which should be ‘83’ (Linux).
You also need to change the type of /dev/sdb1 to 83; be careful, since that’s the drive with the valid data on it! The reason for changing the partition type is that GRUB doesn’t seem to recognise type ‘fd’ as a valid filesystem type, so it won’t let you do the grub section below.
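If the live cd has sfdisk on it, one quick way to clone the layout from the good drive is something like this (double-check which drive is which first, since writing the table to the wrong disk would be disastrous):
$ sfdisk -d /dev/sdb | sfdisk /dev/sda
That copies the partition types verbatim too, so you’d still use fdisk afterwards to set /dev/sda1 and /dev/sdb1 to type 83 as described above.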
Carefully format what will be /boot on the new drive.
$ mke2fs -j /dev/sda1
The next step is to copy the working /boot onto the new disk. This is how I did it; obviously your mileage may vary and there are several other ways to do it. I like cp -a as root because it preserves permissions and ownership.
$ mkdir /system
$ mount -t ext3 /dev/sdb3 /system
$ mount -t ext3 /dev/sdb1 /system/boot
$ mkdir /newboot
$ mount -t ext3 /dev/sda1 /newboot
$ cp -a /system/boot/* /newboot/
$ umount /newboot
$ rmdir /newboot
Now we should have two /boot partitions, one on each drive, and we’re ready to install grub onto the MBR of the first drive, as well as onto the MBR of the second drive in case there are any failures in future! Obviously this depends on /boot/grub/grub.conf actually containing entries for booting from both drives (there’s a sketch of what that looks like after the grub commands below).
$ /system/sbin/grub
This will give you a grub prompt (grub>) – the next two lines may not be needed if grub has already detected the drives properly.
grub> device (hd0) /dev/sda
grub> device (hd1) /dev/sdb
Install it onto /dev/sda so that it looks at /dev/sda1 for /boot
grub> root (hd0,0)
grub> setup (hd0)
And the same for /dev/sdb looking at /dev/sdb1 for /boot
grub> root (hd1,0)
grub> setup (hd1)
Type ‘quit’ to exit grub.
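For reference, the sort of grub.conf that allows booting from either drive looks roughly like this; the kernel filename and the root= device are placeholders, so substitute whatever your system actually uses:
# kernel filename and root= device below are placeholders for your own setup
default 0
fallback 1
timeout 10
title Gentoo Linux (primary drive)
root (hd0,0)
kernel /vmlinuz root=/dev/md2
title Gentoo Linux (secondary drive)
root (hd1,0)
kernel /vmlinuz root=/dev/md2
The fallback line tells grub to try the second entry automatically if the first one fails to boot.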
Now you need to change the type of /dev/sda1 and /dev/sdb1 back to ‘fd’ using fdisk so that when the machine reboots it can actually detect the array properly.
The fdisk key sequence to do this is as follows; don’t forget to do /dev/sdb afterwards as well.
$ fdisk /dev/sda
t, 1, fd, w
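Inside fdisk that exchange goes roughly like this (exact prompts vary a little between fdisk versions, and the partition number range depends on your layout):
Command (m for help): t
Partition number (1-6): 1
Hex code (type L to list codes): fd
Command (m for help): w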
Next we need to get the raid arrays assembled again so that we can add the new partitions back in, having first loaded the kernel modules required to be able to see the arrays.
$ modprobe md
$ modprobe raid1
-c partitions tells mdadm to ignore the config file /etc/mdadm.conf and just scan the disk partitions looking for the special signatures it uses. In my case I had 5 arrays to assemble. Make sure the number after -m is the same as the number in the device name, otherwise things will go a bit screwy. Trust me on this, you need to be careful here! I accidentally made /dev/md4 assemble with the same devices as md0 and had to explicitly tell it to use the right partitions for md0 to get it back to normal. Luckily I’d made some notes of which partitions belonged to which array before the machine had been shut down.
$ mdadm -Ac partitions /dev/md0 -m 0
$ mdadm -Ac partitions /dev/md1 -m 1
$ mdadm -Ac partitions /dev/md2 -m 2
$ mdadm -Ac partitions /dev/md3 -m 3
$ mdadm -Ac partitions /dev/md4 -m 4
If any of these mdadm commands fail, stop and check the manual as something has gone wrong and you need to resolve it before going any further. They should respond with something along the lines of:
mdadm: /dev/md4 has been started with 1 drive (out of 2).
Next you add the new partitions from /dev/sda into the arrays and start the rebuilds. Before that, this command will show you which partitions are currently assigned to which arrays; handy if you don’t have it documented anywhere.
$ cat /proc/mdstat
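If you’d rather ask the disks themselves, mdadm can read the raid superblock on an individual partition and report which array it belongs to, e.g. for the first partition on the surviving drive:
$ mdadm --examine /dev/sdb1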
Now add the relevant partitions from the new drive into the arrays. Each command will queue up a resync, but only one resync will run at any time because the arrays are on the same physical drives.
$ mdadm --add /dev/md0 /dev/sda1
$ mdadm --add /dev/md1 /dev/sda2
$ mdadm --add /dev/md2 /dev/sda3
$ mdadm --add /dev/md3 /dev/sda5
$ mdadm --add /dev/md4 /dev/sda6
$ cat /proc/mdstat
This should now show you that it’s rebuilding one of the arrays and the others are “delayed” because the drives are already in use. As a rough idea, 160GB of raid array took about 40-45 minutes to fully resync.
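If you want to keep an eye on the rebuild as it goes, something like this works (assuming watch is on the live cd):
$ watch -n 10 cat /proc/mdstat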
In theory you can reboot whilst it’s doing the resync and it’ll just carry on afterwards, but it’s probably safer to let it resync first and then make sure everything is unmounted properly and reboot (without the live cd in the drive).
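In my case, making sure everything was unmounted just meant undoing the mounts from earlier, something along these lines:
$ umount /system/boot
$ umount /system
$ reboot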
I hope this information comes in handy for someone. It took me a while with Google to find enough scraps of information to figure it out.