Virtualization Begins

For about the past nine months I’ve been working on growing the use of virtualization within my firm. We had dabbled with virtualizing a couple of miscellaneous application/development servers with the first release of VMware Server, but I knew that in order to consolidate the rest of the environment, as well as to better prepare us for disaster recovery, we needed to expand our server virtualization strategy. The solution that made the most sense at the time was VMware Infrastructure, built on their ESX platform.

The question was how to move forward without spending tons of money (and without all of the politics involved in deploying a new solution). The answer came from the timing of our computer leases. We lease the majority of our computer equipment for three years, and it just so happened that as I was looking to move forward, one of our major leases came up for replacement. To make it even better, not only did I have three servers that were due for swapping (and great candidates for virtualizing), but three years earlier we had been forced into ordering a large batch of workstation-class computers. This time around, with the advancement of technology, we no longer needed that class of machine.

With my good fortune, I was able to replace the three servers with two new servers (albeit much beefier units) along with the full ESX suite for both, and to replace the workstations with much better desktop-class units (thanks to Intel’s Core 2 Duo chips), all for basically the same monthly payment we were already making. It was a real win-win for the company.

In some upcoming posts I plan to highlight our journey, cover some of the sites that helped us get where we are, discuss where we are going, and finally touch on some of the difficulties and frustrations that we still face.


The Meaning of Mean Time Between Failures (MTBF)

One of my guys emailed me this; I wish I knew where it came from so I could give credit. It goes a long way toward explaining why we’ve seen so many problems with supposedly good hardware.

You’ve probably seen hardware manufacturers talk about the Mean Time Between Failures (MTBF) for their equipment. For example, disk drive manufacturers often claim MTBFs of several hundred thousand hours. This statistic sounds great (and it is, compared with hardware MTBFs from 10 or 15 years ago), but if you have more than a handful of servers or disks, you’ll quickly find that the MTBFs have to be carefully considered.

In fact, it helps to know how disk manufacturers calculate the MTBF in the first place—they take a batch of drives (several hundred to a few thousand) and test them under fixed environmental conditions. When the first drive in the batch fails, they use the test run time to calculate the MTBF. Let’s say that there are 1250 disks in the batch, and they run for 28 days before the first one fails. The MTBF can thus be cited as 28 * 24 * 1250 = 840,000 hours. Remember, this is the average time between failures; it’s not a guarantee.
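
Here is that arithmetic spelled out as a tiny Python snippet, using the numbers from the example above:

# MTBF as the drive manufacturers calculate it in the example above.
drives_in_batch = 1250
days_until_first_failure = 28

hours_per_drive = days_until_first_failure * 24    # 672 hours of run time per drive
mtbf_hours = hours_per_drive * drives_in_batch     # 672 * 1250 = 840,000 hours
print("Quoted MTBF: %d hours" % mtbf_hours)        # Quoted MTBF: 840000 hours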

What does this mean for your availability? Say that you have ten servers, and each server has a dozen disks with an MTBF of 100,000 hours apiece. Simply put, that means you can expect any individual disk to last about 100,000 hours, or roughly 11 years. However, the failure rates of a server’s components add up: twelve disks, each with a 100,000-hour MTBF, mean that on average you can expect one drive in that server to fail every 100,000/12 ≈ 8,333 hours, or a little more often than once a year. As you increase the number of servers and disks (and as you factor in other electromechanical components, such as fans and power supplies), the expected time between failures somewhere in your environment keeps shrinking.
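
To put numbers on that, here is the same arithmetic extended from one server to the whole ten-server example. It assumes independent failures and a constant failure rate, and it ignores fans, power supplies, and everything else, so treat it as a rough estimate rather than a prediction:

# Expected time between disk failures as the number of disks grows,
# assuming independent failures and a constant failure rate.
disk_mtbf_hours = 100000
disks_per_server = 12
servers = 10

per_server = disk_mtbf_hours / float(disks_per_server)            # ~8,333 hours, a bit under a year
fleet_wide = disk_mtbf_hours / float(disks_per_server * servers)  # ~833 hours, roughly every 35 days

print("One server : a disk failure about every %.0f hours" % per_server)
print("Ten servers: a disk failure about every %.0f hours" % fleet_wide)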

Monitoring Dell Hardware with Nagios

We use the excellent Nagios network, host, and service monitoring software at the office to track the status of our servers, routers, and network devices and connections. The program works great and we love it. However, the one area we had not been able to track was the hardware status of our Dell PowerEdge servers, particularly those running Windows Server 2003. We’ve installed Dell’s OpenManage software on all the boxes, and that works great, but we were not getting notified when something on a server failed (a power supply, a fan, or a disk in an array).

The server’s status can be queried over SNMP from the OpenManage agent, so I knew it could be done; I just didn’t want to reinvent the wheel. I did some searching and came across three plugins. The first is simply called check_dell.pl. It checks the overall health of both the system and the array, and if either is non-OK it gives a warning. It is simple, quick, and effective, but I wanted additional reporting so that I would know which component was actually faulty.
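
For anyone curious what a check like that boils down to, here is a minimal sketch in Python: query OpenManage’s global system status over SNMP and translate it into a Nagios exit code. This is not the actual check_dell.pl code, and the OID and status-value mapping are my own reading of the Dell OpenManage MIB, so verify them against your MIB files before relying on it:

#!/usr/bin/env python
# Minimal sketch of an OpenManage health check for Nagios.
# Assumes net-snmp's snmpget is installed and that the OID below
# (the global system status from the Dell OpenManage MIB) is right for your agent.
import subprocess, sys

host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
community = sys.argv[2] if len(sys.argv) > 2 else "public"
GLOBAL_STATUS_OID = ".1.3.6.1.4.1.674.10892.1.200.10.1.2.1"   # assumed OID -- verify

result = subprocess.run(
    ["snmpget", "-v2c", "-c", community, "-Oqv", host, GLOBAL_STATUS_OID],
    capture_output=True, text=True)

if result.returncode != 0:
    print("UNKNOWN - SNMP query failed: %s" % result.stderr.strip())
    sys.exit(3)                                   # Nagios UNKNOWN

status = int(result.stdout.strip())
if status == 3:                                   # 3 = OK in the Dell status enum (assumed)
    print("OK - OpenManage reports global status OK")
    sys.exit(0)                                   # Nagios OK
elif status == 4:                                 # 4 = non-critical (assumed)
    print("WARNING - OpenManage reports a non-critical problem")
    sys.exit(1)                                   # Nagios WARNING
else:
    print("CRITICAL - OpenManage reports status code %d" % status)
    sys.exit(2)                                   # Nagios CRITICAL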

The second plugin is called check_om.py, and it checks the overall chassis status. If that is non-OK, it then checks the other status indicators in order to build an error message that indicates where the problem lies. It can check for power supply, voltage, cooling device, temperature, memory, and intrusion issues. It works great, and we now use it!
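
Wiring a plugin like this into Nagios is just the usual command and service definition pair. Something along these lines does the job; the host name is made up and the -H/-C options are my assumption, so check check_om.py’s own usage output for its real flags:

# commands.cfg -- $USER1$ is the standard Nagios macro for the plugin directory
define command{
        command_name    check_om
        command_line    $USER1$/check_om.py -H $HOSTADDRESS$ -C $ARG1$
        }

# services.cfg -- one service per PowerEdge box
define service{
        use                     generic-service
        host_name               poweredge01
        service_description     Dell OpenManage Hardware
        check_command           check_om!public
        }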

Now I needed a way to report on the status of the drive arrays, because check_om.py doesn’t do that. I found a couple of plugins that would either check the RAID controller locally or only do it for Linux servers. Then I finally found the check_win_perc plugin posted on a Dell mailing list site. It has a number of really good features, like telling you which drive in the RAID array is having problems, but it also has some quirks. For one thing, it stores baseline information in a temp file that must be manually deleted. To work in our environment it needed some cleanup and modification.
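
The idea the plugin is built on is that OpenManage’s storage agent exposes a table of physical disks over SNMP, so a check can walk the disk-state column and name the exact drive that has gone bad. Here is a stripped-down sketch of just that piece; the storage OID and the “healthy” state value are my assumptions from the Dell storage MIB, and the real check_win_perc does considerably more (including the baseline temp file mentioned above):

#!/usr/bin/env python
# Walk the (assumed) Dell storage disk-state column and report any disk
# that is not in a healthy state. A sketch only -- not the real check_win_perc.
import subprocess, sys

host, community = sys.argv[1], sys.argv[2]
ARRAY_DISK_STATE_OID = ".1.3.6.1.4.1.674.10893.1.20.130.4.1.4"   # assumed -- verify against your MIB
HEALTHY_STATES = (3,)       # assumed: 3 = online; adjust to match the MIB

result = subprocess.run(
    ["snmpwalk", "-v2c", "-c", community, "-Oqn", host, ARRAY_DISK_STATE_OID],
    capture_output=True, text=True)

bad_disks = []
for line in result.stdout.splitlines():
    oid, value = line.split(None, 1)        # "-Oqn" prints "OID value" on each line
    disk_index = oid.split(".")[-1]         # last sub-identifier names the disk row
    if int(value) not in HEALTHY_STATES:
        bad_disks.append("disk %s (state %s)" % (disk_index, value.strip()))

if bad_disks:
    print("CRITICAL - " + ", ".join(bad_disks))
    sys.exit(2)                             # Nagios CRITICAL
print("OK - all physical disks report healthy")
sys.exit(0)                                 # Nagios OK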

I modified the plugin to better handle the passing of SNMP community strings. As originally written, it reported every disk and its status, no matter which array controller it was attached to; I changed the code so that you can select which of the two controllers you want to monitor and report on only those disks. Because my coding skills are pretty much non-existent, it still has some unresolved quirks: for example, it still counts Global Hot Spares across all controllers, which is wrong.
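
For what it’s worth, the shape of the change was nothing fancy: a proper option for the community string, plus a flag that picks one controller and is then used to filter the rows that get reported. A rough Python sketch of that idea (not the actual modified plugin; the field names are hypothetical, and it assumes you can tell from the SNMP data which controller each disk belongs to):

# Sketch of the two additions: a real community-string option and a
# controller filter. Field names here are hypothetical.
import argparse

def disks_for_controller(all_disks, controller):
    # all_disks: list of dicts such as {"controller": 0, "index": 3, "state": 3}
    return [d for d in all_disks if d["controller"] == controller]

parser = argparse.ArgumentParser(description="hypothetical PERC disk check")
parser.add_argument("-H", "--host", required=True)
parser.add_argument("-C", "--community", default="public")
parser.add_argument("--controller", type=int, choices=[0, 1], default=0,
                    help="which of the two controllers to report on")
args = parser.parse_args()

# ...walk the disk table as in the sketch above, then report only on:
#     disks_for_controller(walked_disks, args.controller)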

My modified code is listed below. Please use at your own risk! If you make any modifications or enhancements please let me know.

(more…)