Thursday, 9 July 2009

you never stop learning

I just had to write about today's antics. Quite sad, I know, but I had one of those days that started off horrible and then had a lovely breakthrough around midday, which made sitting on my arse waiting for a non-existent engineer to turn up that much easier.

So, DL140s. Interesting piece of kit which, by all accounts, likes to go down more times a week than Jordan. Luckily, I am blessed with only a few in our cage, and what we do have are inherited DL140 G3s. I must confess, I largely followed the rule of 'if it ain't broke, don't fix it' and only gave the configuration a cursory glance before getting on with more Interesting Things.

Bad Me. Of course, the server decides to fall over. Of course, this means I have to wait $millions of years for the $supplier engineer to turn up and change the dodgy SATA controller.

But it still doesn't want to play with me. It's resolutely *refusing* to do anything post-POST. It's almost like a sulking lover, not giving me an inch and silently blinking its baleful cursor.

So in goes the rescue disk... Interesting note for anyone who's wondering why they can't 'chroot /mnt/sysimage' and get a '/bin/sh exec format error' instead: if you use OS installation disc 1 as a rescue disk, it has to be the same architecture as the installed operating system, i.e. i386 vs x86_64 (an i386 rescue kernel can't execute x86_64 binaries, hence the error). (I am the queen of assumption when it comes to this kind of stuff and will throw any old thing in, assuming it's bootable!)
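If you want to check before you chroot, the architecture is written in the first few bytes of any ELF binary. Here's a minimal sketch of reading it, assuming you have Python to hand in whatever environment you booted; the function names `elf_arch` and `file_arch` are my own, and the table only covers the two architectures mentioned above.

```python
import struct

# e_machine values from the ELF specification (only the two we care about here).
MACHINES = {3: "i386", 62: "x86_64"}


def elf_arch(header: bytes) -> str:
    """Return the architecture named in an ELF header, or 'not ELF'."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return "not ELF"
    # Byte 5 (EI_DATA) gives the byte order: 1 = little endian, 2 = big endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18 in both 32-bit and 64-bit ELF.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return MACHINES.get(machine, f"unknown ({machine})")


def file_arch(path: str) -> str:
    """Read just enough of a file to identify its ELF architecture."""
    with open(path, "rb") as f:
        return elf_arch(f.read(20))
```

Comparing `file_arch("/mnt/sysimage/bin/sh")` against `file_arch("/bin/sh")` from the rescue environment should show up a mismatch before chroot gets the chance to complain.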

So I chrooted into the root filesystem and everything looks gravy. Brilliant, so I'm starting to think perhaps I won't be reinstalling after all. Ah, but fdisk is reporting that there are no valid partitions on md0...

md0? Have I gone mad? wtf? Quick call to $bossman...

Me: Why are we using software RAID on a production machine with a hardware RAID controller?
Him: Whut!?!wtf? etc etc
Me: I thought as much...

So it looks like this install is older than I thought. Though still on a fairly recent OS/kernel, it managed to predate both myself and $bossman, and neither of us thought to do more than scratch the surface: we looked at the processes relevant to our application and nothing more.

I haven't used software RAID in about 4 years, and even then it was on suitably shaky software at an old music/mobile place where I was learning the ropes.

Given that the array was RAID 1 and I could still access the filesystem, this meant that we had somehow failed over onto the mirror. Lovely, so the problem was in the booting.

Linux software RAID 1 doesn't seem to mirror the boot sector, so there was no grub to boot off.
Interestingly, this kind of problem only happens if you lose the primary drive. If you lose the secondary drive it will still boot. Of course, whoever set this up should have copied it over, but I guess they were busy or something ;)
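My understanding is that when the array is built from partitions, sector 0 of each disk sits outside it, so nothing ever copies the boot code across. A rough way to see which disks actually carry a boot loader is to look at that first sector. This is only a sketch: the function names are mine, and the check for the string "GRUB" is a heuristic that assumes grub's first stage embeds its name in the boot code, so treat a negative result with suspicion.

```python
SECTOR_SIZE = 512
BOOT_CODE_SIZE = 440  # bytes of boot code before the disk signature and partition table


def inspect_boot_sector(sector: bytes) -> dict:
    """Summarise what the first sector of a disk appears to contain."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least one full 512-byte sector")
    boot_code = sector[:BOOT_CODE_SIZE]
    return {
        # The last two bytes of a valid MBR are 0x55 0xAA.
        "has_boot_signature": sector[510:512] == b"\x55\xaa",
        # An all-zero code area means no boot loader was ever written here.
        "has_boot_code": any(boot_code),
        # Heuristic only: assumes the loader embeds the string "GRUB".
        "mentions_grub": b"GRUB" in boot_code,
    }


def read_boot_sector(device: str) -> bytes:
    """Read the first sector of a block device (needs root)."""
    with open(device, "rb") as disk:
        return disk.read(SECTOR_SIZE)
```

Running `inspect_boot_sector(read_boot_sector("/dev/sdb"))` on each member disk would, in a situation like mine, presumably have shown one disk with boot code and one without.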

A quick and dirty man (hurr hurr) showed me how to copy it over:

run 'grub' from the command line

device (hd0) /dev/sda
root (hd0,0)
setup (hd0)
device (hd1) /dev/sdb
root (hd1,0)
setup (hd1)
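If you have more than a couple of mirrored disks, typing that out by hand gets old. Here's a small sketch that generates the same command sequence for a list of disks; `grub_setup_commands` is a made-up name, and it assumes, like the commands above, that /boot lives on the first partition of every disk.

```python
def grub_setup_commands(disks):
    """Build the grub shell commands that install the boot loader on each disk.

    Assumes the boot files are on the first partition of every disk,
    matching the (hdN,0) used in the hand-typed version.
    """
    lines = []
    for index, disk in enumerate(disks):
        lines.append(f"device (hd{index}) {disk}")
        lines.append(f"root (hd{index},0)")
        lines.append(f"setup (hd{index})")
    lines.append("quit")
    return "\n".join(lines) + "\n"
```

The output of `grub_setup_commands(["/dev/sda", "/dev/sdb"])` can then be fed to `grub --batch` on standard input rather than typed at the prompt.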


Pretty nifty, and neither $bossman nor I had ever heard of it before. But then we don't normally use software RAID. I'm sure someone out there will find this useful.

Rebooted, and the system comes up lovely. Fixed all the partition tables, but mdstat was still showing a degraded array, so I rebuilt that and the machine has been happily whirring away ever since.

Of course I can't help myself checking the Crackberry every few hours for an alert in case it has died again... I'm going to change the disks just to be sure, but it really was an excellent learning experience.

2 comments:

  1. It's possible it was set up with software raid because whoever set it up didn't want to keep a spare hardware raid card lying around. If the raid card burns out, you're pretty much screwed unless there are replacement cards with the same raid implementation available (Which there often aren't).

    Mind you, since grub wasn't set up on each physical hard drive, maybe whoever set it up just wasn't thinking too hard...

  2. On a supported server I shouldn't have to worry about that :)
