51 Hours in 3 Days

51 hours: that's the amount of time I spent at work the first three days of this year. Tuesday morning started out normal enough; I got up and drove to work. The commute was a little more difficult than before the holiday but nothing too bad. I got the morning update from my staff, which was slightly diminished due to a couple of guys being out sick. We had a brief staff meeting at 10:00 AM, then at about 10:30 the proverbial *#@& hit the fan.

We had a major server crash. The crash was similar to one we experienced back in September. It involved our second Dell PowerVault 220. We are still not sure why the array initially went offline (Dell tech support wants to blame a cable, but that is really just a bunch of hogwash), but we subsequently found three drives with media errors, so they shipped us three new ones. (After the fact we determined that we were one revision behind in firmware on the array. Guess what that update fixes? It corrects problems with the array timing out under heavy loads, just what we were experiencing.) The array tried to fail over to the hot spare, but the server crashed completely. Oh great, we've been here before. We tried to bring it up and run a check disk. That failed (just as we expected). So our only recourse was to rebuild the array and restore from backup.

Given our problems last September, we knew that this was not going to be easy. And it wasn't. Once we got the array rebuilt (with the 3 new disks from Dell), we started the restore process. We use a wonderful backup product from Symantec (formerly Veritas), BackupExec, and it works great. All they need to do now is release a RestoreExec product to go along with it.

The restore process was horrible. Even though the backups were good, the BackupExec Remote Agent would crash periodically. At times it would run for 2 hours and at others for 5 minutes. So even though we've never had good luck with it, we called tech support. After a six-hour phone call to India, where we talked to Dale or was it Devon (like that was really his name), we decided that their ideas were complete hogwash and they had no clue what the problem was. At this point we just plunged ahead with the restore. We turned on detailed logging, and whenever the agent crashed, we would skip the offending file and start the restore up again.

The only good thing to come out of all of this was that the firm decided to spend the money to bring in a storage specialist to analyze our entire storage and backup setup as well as our data workflow. Needless to say, it was not the kind of start I wanted for the New Year. I crashed really hard when I finally got home on Friday night. I still felt really lousy on Saturday, kind of like a bad hangover with the nausea. We're still dealing with minor repercussions this week as the users encounter corrupted files.
