Goodbye Dell, Hello NetApp (almost)

I’ve been writing for quite a while now about how Dell’s PowerVault 220S arrays are junk. We experienced our third major crash of the year this week. We had everything restored and back in operation 23 hours later.

In the meantime we brought in a data storage consultant to analyze our storage infrastructure. They presented a detailed report about a month ago, and this past Monday the partners decided that it was time to change the way we store data. The decision was made to stop using Dell for our storage systems (we will still keep using them for our application servers) and instead move to a Network Attached Storage appliance. In our case we’ll start by moving our headquarters office to a FAS3020c enterprise storage appliance from Network Appliance (NetApp).

We are still working out the final details and pricing (that’s another story for another time), and hopefully will be placing an order within the next couple of weeks. Then the real fun begins as we get to install and configure the new system.


The 4th Time (Why I Hate Dell)

Well, we’re at it again. For the fourth time in six months our lovely Dell PowerVault 220S drive array has died on us. What a piece of junk these things are. The pattern is the same as before (even with the patch that supposedly would fix our problems): a drive goes bad in the array. Now, these drives are supposed to have a 100,000+ hour MTBF, so I don’t know why we’ve had so many fail. Instead of seamlessly dropping into a degraded RAID 5 state and starting to rebuild on the hot spare, the entire array crashes and the server blue screens.

We placed another call to Dell; this time we’re getting a new drive and a new backplane. That will be followed by an entirely new unit in a couple of days. They have no idea why these things keep failing, and no guarantee that it won’t happen again. (Note that they are only replacing one of my two PowerVaults.)

We have a storage consultant coming in next week to analyze our entire network and storage systems. From that they will prepare a strategic 3-5 year plan for our storage and disaster recovery needs. This will be one more thing to add to their list of items to review. I think we will need to implement some short-term, stop-gap measures to help reduce the outages until we can get a new system online.

51 Hours in 3 Days

51 hours: that’s the amount of time I spent at work during the first three days of this year. Tuesday morning started out normally enough; I got up and drove to work. The commute was a little more difficult than before the holiday, but nothing too bad. I got the morning update from my staff, which was slightly diminished due to a couple of guys being out sick. We had a brief staff meeting at 10:00 AM, then at about 10:30 the proverbial *#@& hit the fan.

We had a major server crash, similar to one we experienced back in September. It involved our second Dell PowerVault 220S. We are not sure why the array initially went offline (Dell tech support wants to blame a cable, but that is really just a bunch of hogwash), but we subsequently found three drives with media errors, so they shipped us three new ones. (After the fact we determined that we were one firmware revision behind on the array. Guess what that update fixes: problems with the array timing out under heavy load, exactly what we were experiencing.) The array tried to fail over to the hot spare, but the server crashed completely. Oh great, we’ve been here before. We tried to bring it up and run a check disk. That failed (just as we expected), so our only recourse was to rebuild the array and restore from backup.

Given our problems last September, we knew that this was not going to be easy. And it wasn’t. Once we had the array rebuilt (with the three new disks from Dell), we started the restore process. We use a wonderful backup product from Symantec (formerly Veritas) called BackupExec; it works great. All they need to do now is release a RestoreExec product to go along with it.

The restore process was horrible. Even though the backups were good, the BackupExec Remote Agent would crash periodically; sometimes it would run for two hours, sometimes for five minutes. Even though we’ve never had good luck with tech support, we called them anyway. After a six-hour phone call to India, where we talked to Dale, or was it Devon (like that was really his name), we decided that their ideas were complete hogwash and they had no clue what the problem was. At that point we just plunged ahead with the restore. We turned on detailed logging, and whenever the agent crashed, we would skip the offending file and start it up again.

The only good thing to come out of all of this was that the firm decided to spend the money to bring in a storage specialist to analyze our entire storage and backup systems, as well as our data workflow. Needless to say, it was not the kind of start I wanted for the new year. I crashed really hard when I finally got home on Friday night, and I still felt lousy on Saturday, kind of like a bad hangover, nausea included. We’re still dealing with minor repercussions this week as users encounter corrupted files.

Monitoring Dell Hardware with Nagios

We use the excellent Nagios network, host, and service monitoring software at the office to track the status of our servers, routers, and network devices and connections. The program works great and we love it. However, the one area we still wanted to cover was the hardware status of our Dell PowerEdge servers, particularly those running Windows Server 2003. We’ve installed Dell’s OpenManage software on all the boxes, and it works great, but we were not getting notified when something on a server failed (a power supply, a fan, or a disk in an array).
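
For context, here’s roughly how a hardware check plugin gets wired into Nagios: a command definition pointing at the plugin, plus a service that applies it to a host. The host name, the plugin flags, and the generic-service template below are made up for illustration, so adjust them to your own setup.

    define command {
        command_name    check_dell_hw
        command_line    $USER1$/check_om.py -H $HOSTADDRESS$ -C $ARG1$
    }

    define service {
        use                     generic-service
        host_name               fileserver01
        service_description     Dell Hardware
        check_command           check_dell_hw!public
    }

Passing the SNMP community string as $ARG1$ lets the same command definition be reused across hosts.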

A server’s status can be read through SNMP from the OpenManage agent, so I knew it could be done; I just didn’t want to have to reinvent the wheel. I did some searching and came across three plugins. The first is simply called check_dell.pl. It checks the overall health of both the system and the array, and if either is non-OK it gives a warning. It is simple, quick, and effective, but I wanted additional reporting so that I would know which component was actually faulty.
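
At its core, that style of check boils down to reading one SNMP value and mapping it to a Nagios exit code. Here is a minimal Python sketch of the idea (my illustration, not check_dell.pl itself). It shells out to Net-SNMP’s snmpget, and the OID is the OpenManage global system status as I read Dell’s 10892 MIB, so verify it against your MIB version before relying on it.

    #!/usr/bin/env python
    # Minimal sketch of an "overall health" check: read the OpenManage
    # global system status over SNMP and map it to a Nagios exit code.
    # The OID is my reading of Dell's 10892 MIB; verify before use.
    import subprocess
    import sys

    GLOBAL_STATUS_OID = "1.3.6.1.4.1.674.10892.1.200.10.1.2.1"

    # Dell status values: 3 = ok, 4 = nonCritical, 5 = critical, 6 = nonRecoverable
    DELL_TO_NAGIOS = {3: (0, "OK"), 4: (1, "WARNING"),
                      5: (2, "CRITICAL"), 6: (2, "CRITICAL")}

    def snmp_get(host, community, oid):
        """Fetch a single integer value using Net-SNMP's snmpget."""
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid])
        return int(out.strip())

    def main():
        host, community = sys.argv[1], sys.argv[2]
        try:
            status = snmp_get(host, community, GLOBAL_STATUS_OID)
        except Exception as err:
            print("UNKNOWN - SNMP query failed: %s" % err)
            sys.exit(3)
        code, label = DELL_TO_NAGIOS.get(status, (3, "UNKNOWN"))
        print("%s - OpenManage global status is %d" % (label, status))
        sys.exit(code)

    if __name__ == "__main__":
        main()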

The second plugin is called check_om.py, and it checks the overall chassis status. If that is non-OK, it then checks the other status indicators in order to build an error message that points to where the problem lies. It can check for power supply, voltage, cooling device, temperature, memory, and intrusion issues. It works great, and we now use it!
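
The drill-down approach looks something like the sketch below (again my illustration, not check_om.py’s actual code). Every component table OID here is an assumption based on my reading of the 10892 MIB; confirm them against the MIB that ships with your OpenManage version.

    #!/usr/bin/env python
    # Sketch of the drill-down idea (my illustration, not check_om.py itself):
    # check the overall status first, and only when it is non-OK, walk each
    # component status table to name the failing group. All OIDs are my
    # reading of Dell's 10892 MIB -- verify them against your MIB version.
    import subprocess
    import sys

    GLOBAL_STATUS_OID = "1.3.6.1.4.1.674.10892.1.200.10.1.2.1"

    # Component status tables (status column assumed to be .5 in each table).
    COMPONENT_TABLES = {
        "power supply":   "1.3.6.1.4.1.674.10892.1.600.12.1.5",
        "voltage probe":  "1.3.6.1.4.1.674.10892.1.600.20.1.5",
        "cooling device": "1.3.6.1.4.1.674.10892.1.700.12.1.5",
        "temperature":    "1.3.6.1.4.1.674.10892.1.700.20.1.5",
        "memory device":  "1.3.6.1.4.1.674.10892.1.1100.50.1.5",
        "intrusion":      "1.3.6.1.4.1.674.10892.1.300.70.1.5",
    }

    def snmp_values(tool, host, community, oid):
        """Run snmpget/snmpwalk and return the integer values printed."""
        out = subprocess.check_output(
            [tool, "-v2c", "-c", community, "-Oqv", host, oid])
        return [int(v) for v in out.split()]

    def main():
        host, community = sys.argv[1], sys.argv[2]
        overall = snmp_values("snmpget", host, community, GLOBAL_STATUS_OID)[0]
        if overall == 3:  # 3 = ok in Dell's status enumeration
            print("OK - chassis status is OK")
            sys.exit(0)
        # Non-OK: figure out which component group is unhappy.
        faults = []
        for name, oid in sorted(COMPONENT_TABLES.items()):
            bad = [s for s in snmp_values("snmpwalk", host, community, oid)
                   if s != 3]
            if bad:
                faults.append("%s (status %s)" % (name, ",".join(map(str, bad))))
        detail = "; ".join(faults) if faults else "no component fault identified"
        code = 2 if overall >= 5 else 1  # 4 = warning; 5/6 = critical
        print("%s - %s" % ("CRITICAL" if code == 2 else "WARNING", detail))
        sys.exit(code)

    if __name__ == "__main__":
        main()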

Now I needed a way to report on the status of the drive arrays, because check_om.py doesn’t do that. I found a couple of plugins that would check the RAID controller locally, or would do it for Linux servers. Then I finally found the check_win_perc plugin posted on a Dell mailing list site. It has a number of really good features, like telling you which drive in the RAID array is having problems, but it also has some quirks. For one thing, it stores baseline information in a temp file that must be manually deleted. To work in our environment it needed some cleanup and modification.

I modified the plugin to better handle the passing of SNMP community strings. As originally written, it reported all the disks and their status, no matter which array controller they were attached to. I modified the code so that you can select which of two controllers you want to monitor and report on only those disks, as sketched below. Because my coding skills are close to non-existent, it still has some unresolved quirks; for example, it still counts Global Hot Spares across all controllers, which is wrong.
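
To show the filtering idea (and what counting hot spares per controller would look like), here is a toy Python sketch with made-up data. It is not the plugin’s code, just the logic described above.

    # Toy illustration of per-controller filtering (not the plugin's code).
    # Assume the disk list was already pulled from SNMP as
    # (controller, disk_name, state, is_hot_spare) tuples.
    def report_for_controller(disks, controller):
        mine = [d for d in disks if d[0] == controller]
        bad = [d for d in mine if d[2] != "online"]
        # Counting spares on the selected controller only is the behavior
        # the unresolved quirk mentioned above still lacks.
        spares = sum(1 for d in mine if d[3])
        return bad, spares

    # Example: two controllers, one failed disk on controller 1.
    disks = [
        (0, "0:0", "online", False),
        (0, "0:1", "online", True),   # hot spare on controller 0
        (1, "1:0", "failed", False),
        (1, "1:1", "online", True),   # hot spare on controller 1
    ]
    bad, spares = report_for_controller(disks, 1)
    print("controller 1: %d failed disk(s), %d hot spare(s)" % (len(bad), spares))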

My modified code is listed below. Please use it at your own risk! If you make any modifications or enhancements, please let me know.
