IO You an Explanation

This blog post is going to be relatively short and sweet as my expertise in the storage realm is limited. Yesterday I had the pleasure of learning some new and interesting things about NetApp’s SAN technology that I thought was interesting and would share with you.

For  this month’s T-SQL Tuesday #004 meme is hosted by Mike Walsh of StraightPath Solutions (@Mike_walsh on Twitter). For my post I don’t have a solution so much as a nugget of information to pass along I found interesting. I was speaking to a consultant yesterday about a few things and the topic of his experiences with Oracle DBAs-vs-SQL Server DBAs in terms of his experience with them in regards to storage (SAN) consultations. Clearly this perked my ears up and I asked him to explain. He went on to tell me that in his experience he’s seen that the Oracle DBAs he’s come across come across as rather paranoid and never believe anything he tells them despite showing them whitepapers direct from storage vendor on the matter. On this particular matter we were talking about how NetApp has a best-practice recommendation that seems rather contradictory and (rightfully so) the DBAs were skeptical and would continue asking the same question over and over again despite having it already answered…over and over again. What’s that recommendation you ask? Well in NetApp world they have what are called Aggregates which are  nothing more multiple RAID groups. Here’s the excerpt from Wikipedia about it:

NetApp supports either SATA, Fibre Channel, or SAS disk drives, which it groups into RAID (Redundant Array of Inexpensive Disks or Redundant Array of Independent Disks) groups of up to 28 (26 data disks plus 2 parity disks). Multiple RAID groups form an “aggregate”; and within aggregates Data ONTAP operating system sets up “flexible volumes” to actually store data that users can access. An alternative is “Traditional volumes” where one or more RAID groups form a single static volume. Flexible volumes offer the advantage that many of them can be created on a single aggregate and resized at any time. Smaller volumes can then share all of the spindles available to the underlying aggregate. Traditional volumes and aggregates can only be expanded, never contracted. However, Traditional volumes can (theoretically) handle slightly higher I/O throughput than flexible volumes (with the same number of spindles), as they do not have to go through an additional viritualisation layer to talk to the underlying disk.

Ok, so what’s so different about that? Well that’s not the part that’s interesting. What’s interesting is NetApp’s Best Practices for Oracle 11g explicitly states:

For Oracle databases it is recommended that you pool all your disks into a single large aggregate and use FlexVol volumes for your database datafiles and logfiles as described below. This provides the benefit of much simpler administration, particularly for growing and reducing volume sizes without affecting performance. For more details on exact layout recommendations, refer to [2].

Now think about that for a minute. As a SQL Server DBA you’re probably having a mental breakdown as I did when first slapped with this one as they’re essentially telling you throw all your eggs in the same basket, its better for you. Well this is where our conversation got interesting as he started breaking down for me exactly how Aggregates worked, how NetApp’s algorithms function, and WHY this best practice exists and isn’t as bad as it appears at first glance. Apparently because of the way NetApp’s Aggregates work the more you expand your Aggregate (read also: add more disks) you’re actually helping improve performance as you’re adding more spindles to it and will help performance along. At this point of this post you storage guys are probably ready to tear me a new one as I may or may not be explaining this correctly/accurate to which I re-state, “I’m not a storage guy, I’m a DBA learning something new and attempting to relay this information as best as I understood it.”

Which brings me to the point of my post. As a DBA crazy things like a best practice recommendation that doesn’t make sense can and will come up in your career. Should you question them? Without a doubt! After all, it’s your bacon on the line after these guys are gone. The important part however is the learning. Ask questions, realize the differences between technologies and understand the how’s and whys. In this post I talked about NetApp’s solution but EMC works differently as well as has different terminology. It may not be your job to be a SAN admin but as a DBA I think its essential to understand all the technologies involved in your configuration and work with those responsible to come up with the best solution that works for you. There are plenty of resources out there to garner knowledge from, they’re just a quick Boogle search away

Follow the hashtag #TSQL2sDay on Twitter to check out everyone’s posts.

DBA Horror Stories

Psycho-shower-scream In light of the fact that this weekend is Halloween I figured this would be a great time to start this meme: Give us your best database/IT horror story to date.

I’ve been fortunate to date as far as the databases I’ve dealt with not having any crazy problems. For that I’m thankful. Given that fact my story is more of a general IT horror story. It was a dark and stormy night (actually it was clear, humid, hot day day but those don’t work as well for these). I woke up this fine morning to hear the call with the two words every IT pro dreads to hear: major outage. As I got into work, fueled up on coffee I got details of what happened that fateful morning.

Every month our operations staff does a generator load test wherein we switch from commercial power to generator power for testing. On this day, however, the generator felt saucy and fate gave us the finger. They threw the switch as they had done so many times before when “something happened” and a major failure happened in the generator. Normally this wouldn’t be too bad as you can switch right back to commercial power but, nay, not this day. For some reason the switch was unable to cut back so our whole data center went down faster than Balloon Boy’s family credibility. Like over-caffeinated monkeys on speed everyone leapt to action to find out the extent of the affected systems and implement the appropriate DR plans. After some scrambling the picture looked bleak. Despite having an alternate data center it turned out some of the systems on that side relied on the SAN…in the datacenter…that was now down and out. Awesome. Over the next few hours meetings were held to determine which systems needed to come back up, in what order (yes, I know, this should have already been established but as we soon discovered our DR plans were dated). Power was restored by noon and that’s when the real work began.

As we began bringing systems back online a flurry of disk checks and fixes began. Things slowly began shaping back to normal as everyone hunkered down and brought everything back up. But not all was well in Whoville. Ripping out a SAN from underneath servers is not the greatest thing to happen. To make matters really awesome we’re a heavy VMware shop and guess where our VMDK files are? Yeah…well in the midst of the madness we lost 2 LUNs due to corruption. Couple this with the fact that some of those servers turned out not to be backed up and needless to say you have a recipe for pure FUN! The good news is we have a good staff of dedicated folks who stayed as long as it took to get as many systems back online and working again. By 2:00 am (the failure occurred around 7:00 am) we were 95% back up and running with no major losses of data. Over the next few weeks I got the pleasure of working the every living hell out of the restore feature of Arcserve as well as check and double-check servers were being backed up.

Moral of the story is:

Have an up-t0-date DR plan, you never know when disaster is going to strike. Jonathan Kehayias wrote a great article recently about this.

Time to do some tagging:

Jonathan Kehayias (since I mentioned him already)

Kendal Van Dyke

Jennifer McCown aka MidnightDBA (let’s put that new netbook to work ;-D )

Happy Halloween everyone!