We broke the clustered servers

Man, what a rough couple of weeks. After fighting with two Dell PowerEdge 2650s and a PowerVault 220S configured as a clustered file server, we finally had to give in and break the cluster today. Almost a year ago, we installed the hardware and configured the two servers as a clustered file and print server with Windows Server 2003. Everything worked reasonably well (except for one lingering and annoying problem) up until last week. We had a couple of drive failures and replaced a handful of the disks.

We had been getting an increasing number of drive problems. In particular, whenever we failed the cluster over, we got an error reporting a hard drive problem on drive 0:1 (the drive in slot 1 in the PowerVault). After we replaced the drive once about 5 months ago, Dell said that we were just getting a false positive message. During the past two weeks that same drive has failed and been replaced 3 times. A week ago today we were down for the entire day, as we replaced the drive and struggled to get the cluster back online. The drive failed again, and was replaced.

Given that this is not normal behavior, we scheduled downtime last night and were going to replace both the backplane and the drive at 5:30 PM. At 5:00 the drive failed again, and things went downhill after that. Long story short, after replacing everything, the drive array would not come up. So we gave up on the clustered file server concept and split it into two different file servers. As I write this, we’re restoring over 700GB of data from tape. After two complete days of lost productivity we could not take it anymore.

I still think there are a number of advantages to clustered servers, but in order to do it right you really need a Storage Area Network (SAN). I also think that apps like database and email servers that are truly cluster aware are much better choices for clustering. By 8:00 AM tomorrow everything should be back online, just in time for our bi-monthly Computer Committee meeting, where I will get the opportunity to explain to everyone what went wrong and what we’re going to do to prevent these types of problems (or at least minimize our exposure to them) in the future.
