The 4th Time (Why I Hate Dell)

Well, we’re at it again. For the 4th time in six months our lovely Dell PowerVault 220s drive array has died on us. What a piece of junk these things are. The pattern is the same as before (even with the patch that was supposed to fix our problems). A drive goes bad in the array. Now, I thought these drives were supposed to have 100,000+ hours MTBF, so I don’t know why we’ve had so many fail. Instead of seamlessly going into a degraded RAID 5 state and starting to rebuild on the hot spare, the entire array crashes and the server blue screens.
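For what it’s worth, an MTBF figure is an average across a whole population of drives, not a promise about any single one. Here is a rough sketch of the arithmetic, assuming a constant failure rate and a made-up count of 28 drives across the two arrays (the drive count and the six-month window in hours are illustrative assumptions, not measured numbers):

```python
import math

HOURS_PER_YEAR = 8760


def expected_failures(num_drives, mtbf_hours, period_hours):
    """Expected number of drive failures over a period, assuming a
    constant failure rate of 1/MTBF per drive-hour."""
    return num_drives * period_hours / mtbf_hours


def prob_at_least_one(num_drives, mtbf_hours, period_hours):
    """Chance of seeing one or more failures in the period, treating
    failures as a Poisson process (a simplifying assumption)."""
    return 1.0 - math.exp(-expected_failures(num_drives, mtbf_hours, period_hours))


if __name__ == "__main__":
    drives = 28                    # hypothetical drive count, for illustration only
    mtbf = 100_000                 # hours, the figure quoted for the drives
    six_months = HOURS_PER_YEAR / 2

    print(f"expected failures: {expected_failures(drives, mtbf, six_months):.2f}")
    print(f"P(at least one):   {prob_at_least_one(drives, mtbf, six_months):.2f}")
```

Under those assumptions, a drive failing now and then is not shocking. The real problem is that the array falls over instead of degrading gracefully when one does.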

We placed another call to Dell, and this time we’re getting a new drive and a new backplane. That will be followed up with an entirely new unit in a couple of days. They have no idea why these things keep failing, and can offer no guarantee that it won’t happen again. (Note that they are only replacing 1 of my 2 PowerVaults.)

We have a storage consultant coming in next week to analyze our entire network and storage systems. From that they will prepare a strategic 3-5 year plan for our storage and disaster recovery needs. This will be one more thing to add to their list of items to review. I think we will need to implement some short-term, stop-gap measure to help reduce the outages until we can get a new system on-line.

51 Hours in 3 Days

51 hours: that’s the amount of time I spent at work the first three days of this year. Tuesday morning started out normal enough; I got up and drove to work. The commute was a little more difficult than before the holiday, but nothing too bad. I got the morning update from my staff, which was slightly diminished due to a couple of guys being out sick. We had a brief staff meeting at 10:00 AM, then at about 10:30 the proverbial *#@& hit the fan.

We had a major server crash. The crash was similar to one we experienced back in September. It involved our second Dell PowerVault 220. We are not sure why the array initially went offline (Dell tech support wants to blame a cable, but that is really just a bunch of hogwash), but we subsequently found three drives with media errors, so they shipped us three new ones. (After the fact we determined that we were one revision behind in firmware on the array. Guess what that update fixes? It corrects problems with the array timing out under heavy loads, just what we were experiencing.) The array tried to fail over to the hot spare, but the server crashed completely. Oh great, we’ve been here before. We tried to bring it up and do a check disk. That failed (just as we expected). So our only recourse was to rebuild the array and restore from backup.

Given our problems last September, we knew that this was not going to be easy. And it wasn’t. Once we got the array rebuilt (with the 3 new disks from Dell), we started the restore process. We use a wonderful backup product from Symantec (formerly Veritas) called BackupExec. It works great. All they need to do now is release a RestoreExec product to go along with it.

The restore process was horrible. Even though the backups were good, the BackupExec Remote Agent would crash periodically. At times it would run for 2 hours and at others for only 5 minutes. So even though we’ve never had good luck with them, we called tech support. After a six-hour phone call to India where we talked to Dale, or was it Devon (like that was really his name), we decided that their ideas were complete hogwash and they had no clue what the problem was. At this point we just plunged ahead with the restore. We turned on detailed logging, and whenever it crashed, we would skip the offending file and start it up again.

The only good thing to come out of all of this was that the firm decided to spend the money to bring in a storage specialist to analyze our entire storage and backup systems as well as our data workflow. Needless to say, it was not the kind of start I wanted for the New Year. I crashed really hard when I finally got home on Friday night. I still felt really lousy on Saturday, kind of like a bad hangover with the nausea. We’re still dealing with minor repercussions this week as the users encounter corrupted files.