I abuse RAID arrays, and I'm not ashamed to admit it, Episode 1, LSI 9265-8i

Posted by Paul Braren on Feb 3 2012 (updated on Jun 20 2014) in
  • Storage
  • raidabuse

    Before I trust my real data to a RAID array, I do what many otherwise sane "IT Professionals" do. I actually look for trouble, by messing around, inducing failures intentionally.

    In other words, I break things. For fun. There, I said it.

    Well, it's also for learning. I find that RAID abuse tends to reveal a lot about a RAID controller. If the controller, and my data, survive my abuse, it has earned my trust.

    So, if you'd like to witness my irresponsible behavior, watch the video below, which includes a look at the effect of these bloopers:

    • pulling the SATA cable out of a live spinning drive, divorcing it from its RAID5 array (reconciliation happens later)
    • struggling to remember how to get the drive back online, without reading the LSI manual
    • struggling to keyboard-shortcut my way around the WebBIOS menus (mouse jumpy due to recording device I was using)
    • pulling out a BBU (Battery Backup Unit) while it's running
      (not a great idea, it turns out, go figure: it ruins one VM's filesystem, spares another, and leaves the rest of VMFS intact)
    • doing disk benchmark testing while the array is rebuilding
      cranking the background rebuild up from 30% to 90%, what kind of fool does that?
    • after the array's health was restored, working for hours to get the loud alarm to stop beeping for more than just a few minutes of peace
      (most of that useless video is on the "cutting room floor", evidence destroyed)

    Lessons learned?

    Adding the driver to get ESXi 5.0 rollup1 to "see" array health wasn't very difficult, using the same procedure I already documented in Step 2 here: TinkerTry.com/lsi92658iesxi5.

    Yeah, the vSphere Client's ESXi Health pane isn't as fancy a readout (fans, temperatures, etc.) as you'd get with a server-class motherboard from Supermicro, Tyan, and others, which servethehome.com does such an excellent job covering. But it's also generally considerably cheaper to buy a Z68/Core i7/memory combination.

    After popping the drive's SATA connector, then reattaching it, I did get the array's health restored, once I found the menu (see also photo gallery and video):
    MegaRAID BIOS/Config Utility/Drive Groups/Make Unconfigured Good

    Having done some rebuilds in the past (RAID abuse without witnesses), I knew that even the powerful dual-core LSI 9265-8i could take a day or so to finish a rebuild at its default 30% rebuild rate. So cranking it up to 90%, as I did during the video, wasn't such a bad idea after all: it then took only about 5 hours to bring drive 1 (of 5) back online, and the array was usable during those 5 hours.
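
    For the curious, here's a back-of-the-napkin sketch of those numbers in Python. It assumes "a day or so" means roughly 24 hours, and it compares the observed speedup against what you'd expect if rebuild time simply scaled inversely with the rebuild rate setting. That scaling is my own simplification, not anything LSI documents, and the earlier rebuilds may well have run under different load, so treat the result as a rough comparison only.

```python
# Rough figures from the post. "A day or so" is ASSUMED to be about 24 hours.
DEFAULT_RATE_PCT = 30    # controller's default rebuild rate
DEFAULT_HOURS = 24.0     # assumption: "a day or so"
BOOSTED_RATE_PCT = 90    # rebuild rate used in the video
BOOSTED_HOURS = 5.0      # roughly what the rebuild took at 90%


def speedup(before_hours, after_hours):
    """How many times faster the second rebuild was."""
    return before_hours / after_hours


def naive_scaled_hours(base_hours, base_rate_pct, new_rate_pct):
    """Hours expected IF rebuild time were inversely proportional to the
    rebuild rate setting (a simplifying assumption, not documented behavior)."""
    return base_hours * base_rate_pct / new_rate_pct


rate_ratio = BOOSTED_RATE_PCT / DEFAULT_RATE_PCT
observed = speedup(DEFAULT_HOURS, BOOSTED_HOURS)
predicted = naive_scaled_hours(DEFAULT_HOURS, DEFAULT_RATE_PCT, BOOSTED_RATE_PCT)

print(f"rebuild rate raised {rate_ratio:.1f}x")
print(f"observed speedup about {observed:.1f}x")
print(f"naive scaling predicts about {predicted:.1f} hours, observed about {BOOSTED_HOURS:.1f}")
```

    With these rough inputs, tripling the rate setting lines up with a rebuild that finished somewhat faster than simple scaling would suggest, which hints that the setting is a priority knob rather than a strict throughput percentage. That's a guess on my part, not a measurement.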

    You'll see I confirmed RAID health at boot time when the LSI 9265-8i BIOS starts, inside WebBIOS, and inside the ESXi Health view. But a few minutes after I silenced the alarm, it would start a slow beep again, annoying everyone else in the home, a little too datacenter-like. Turning off the alarm for good (well, for this event anyway) was harder than it should be, given that WebBIOS doesn't seem to offer a way to do it, and I don't have MegaCLI or the MegaRAID UI in ESXi 5.0 yet. Not a big deal: I can always shut down and dual boot to native Windows 7 for now, and/or revisit getting MSM (MegaRAID Storage Manager) working in a VM again at some later date. It's just not a priority for me.

    Next stops, abusing those external MediaSonic RAID arrays, and testing site-to-site replication over VPN. Two personal clouds, doing differential daily syncs, for far more disaster resilience...isn't that the whole idea, after all?

    Please don't report me. Honestly, I'm done with the RAID abuse. For tonight, anyway. And now I rest easier, far more confident that ESXi can alert me to array issues when they inevitably arise in the future.