In light of the fact that this weekend is Halloween, I figured this would be a great time to start this meme: give us your best database/IT horror story to date.
I’ve been fortunate so far: the databases I’ve dealt with haven’t had any crazy problems. For that I’m thankful. Given that fact, my story is more of a general IT horror story. It was a dark and stormy night (actually it was a clear, humid, hot day, but those don’t work as well for these). I woke up this fine morning to a call with the two words every IT pro dreads to hear: major outage. As I got into work, fueled up on coffee, I got details of what happened that fateful morning.
Every month our operations staff does a generator load test wherein we switch from commercial power to generator power for testing. On this day, however, the generator felt saucy and fate gave us the finger. They threw the switch as they had done so many times before when “something happened” and the generator suffered a major failure. Normally this wouldn’t be too bad as you can switch right back to commercial power but, nay, not this day. For some reason the switch was unable to cut back, so our whole data center went down faster than Balloon Boy’s family credibility. Like over-caffeinated monkeys on speed, everyone leapt into action to find out the extent of the affected systems and implement the appropriate DR plans. After some scrambling the picture looked bleak. Despite having an alternate data center, it turned out some of the systems on that side relied on the SAN…in the data center…that was now down and out. Awesome. Over the next few hours meetings were held to determine which systems needed to come back up and in what order (yes, I know, this should have already been established but, as we soon discovered, our DR plans were dated). Power was restored by noon and that’s when the real work began.
As we began bringing systems back online a flurry of disk checks and fixes began. Things slowly returned to normal as everyone hunkered down and brought everything back up. But not all was well in Whoville. Ripping out a SAN from underneath servers is not the greatest thing to happen. To make matters really awesome, we’re a heavy VMware shop, and guess where our VMDK files are? Yeah…well, in the midst of the madness we lost 2 LUNs due to corruption. Couple this with the fact that some of those servers turned out not to be backed up and, needless to say, you have a recipe for pure FUN! The good news is we have a good staff of dedicated folks who stayed as long as it took to get as many systems as possible back online and working again. By 2:00 am (the failure occurred around 7:00 am) we were 95% back up and running with no major losses of data. Over the next few weeks I got the pleasure of working the ever-living hell out of the restore feature of Arcserve as well as checking and double-checking that servers were being backed up.
Moral of the story is:
Have an up-to-date DR plan; you never know when disaster is going to strike. Jonathan Kehayias wrote a great article recently about this.
Time to do some tagging:
Jonathan Kehayias (since I mentioned him already)
Jennifer McCown aka MidnightDBA (let’s put that new netbook to work ;-D )
Happy Halloween everyone!