Cacti’s Painless Network Monitoring


For the past week I’ve immersed myself in the world of Cacti, and I’ve been having a lot of fun making cool graphs. As my staff will attest, I’m really big on monitoring anything and everything on our network. I find it very helpful to be able to track usage, capacity, growth, and a bunch of other things. Without some kind of baseline, how do you know if things are operating as they should?

Oh, so you’re wondering what Cacti is? Well, here is the developers’ description:

Cacti is a complete network graphing solution designed to harness the power of RRDTool’s data storage and graphing functionality. Cacti provides a fast poller, advanced graph templating, multiple data acquisition methods, and user management features out of the box. All of this is wrapped in an intuitive, easy to use interface that makes sense for LAN-sized installations up to complex networks with hundreds of devices.

Anyway, I’ve been using MRTG for the last 8+ years to graph utilization and other metrics. It was a great product, and I’ve built up a number of useful scripts and hacks to monitor all kinds of things, from Windows boxes to printers to email queues. I even built a neat menu system, but it was a real hack: hard to manage, hard to add devices to, and hard to change. I’ve followed the RRDTool world for a while (and even moved my MRTG configs over to RRD), but I never found a solution that was easy to use and had the flexibility I wanted and needed. That was until I stumbled across Cacti.

Cacti has a templating system that makes adding new devices easy, and it has an active user community that shares templates for graphing and device monitoring. It is really powerful and actually quite easy to use. It even integrates with Nagios, although I have yet to accomplish that integration. In the coming weeks I’ll be sharing my adventures with the installation and configuration, as well as some of the templates that I have used, created, or modified. So stay tuned for further posts about Cacti.

Monitoring Dell Hardware with Nagios

We use the excellent Nagios network, host, and service monitoring software at the office to track the status of our servers, routers, and network devices and connections. The program works great and we love it. However, the one area we still wanted to track was the health of our Dell PowerEdge servers, particularly those running Windows Server 2003. We’ve installed Dell’s OpenManage software on all the boxes, and that works great, but we were not getting notified when something on a server failed (a power supply, a fan, or a disk in an array).

The status of a server can be retrieved from OpenManage through SNMP, so I knew it could be done; I just didn’t want to reinvent the wheel. I did some searching and came across three plugins. The first is simply called check_dell.pl. It checks the overall health of both the system and the array; if either is non-OK, it gives a warning. It is simple, quick, and effective, but I wanted additional reporting so that I would know which component was actually faulty.
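To make the idea concrete, here is a minimal sketch of that kind of overall-health check written as a Nagios plugin. This is not check_dell.pl itself: I’ve sketched it in Python around net-snmp’s snmpget, and the two OIDs are my recollection of the OpenManage MIB’s global system and storage status, so verify them against your OMSA install before trusting it.

    #!/usr/bin/env python
    # Sketch of an overall-health check in the spirit of check_dell.pl.
    # Assumed OIDs (verify against your OpenManage MIB): global system
    # status and global storage status. Dell status values: 3 = OK,
    # 4 = non-critical, 5 = critical, 6 = non-recoverable.
    import subprocess
    import sys

    HOST, COMMUNITY = sys.argv[1], sys.argv[2]
    SYSTEM_STATUS  = ".1.3.6.1.4.1.674.10892.1.200.10.1.2.1"  # assumed
    STORAGE_STATUS = ".1.3.6.1.4.1.674.10892.1.200.10.1.3.1"  # assumed

    def snmp_get(oid):
        # -Oqv makes snmpget print just the value, e.g. "3"
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, oid])
        return int(out.strip())

    try:
        system, storage = snmp_get(SYSTEM_STATUS), snmp_get(STORAGE_STATUS)
    except Exception as exc:
        print("UNKNOWN - SNMP query failed: %s" % exc)
        sys.exit(3)  # Nagios exit codes: 0 OK, 1 WARN, 2 CRIT, 3 UNKNOWN

    if system == 3 and storage == 3:
        print("OK - system and storage both report healthy")
        sys.exit(0)
    print("WARNING - system status %d, storage status %d" % (system, storage))
    sys.exit(1)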

The second plugin is called check_om.py, and it checks the overall chassis status. If that is non-OK, it then checks the other status indicators in order to build an error message that indicates where the problem lies. It can check for power supply, voltage, cooling device, temperature, memory, and intrusion issues. It works great, and we now use it!
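The drill-down logic is easy to picture. Here is a rough sketch of the approach (again in Python around net-snmp, not the actual check_om.py code); every OID below is an assumption from memory of the OMSA systemState table, so check them against the MIB:

    #!/usr/bin/env python
    # Sketch of check_om.py-style drill-down: check the chassis first,
    # and only query the subsystems when the chassis is non-OK.
    import subprocess
    import sys

    HOST, COMMUNITY = sys.argv[1], sys.argv[2]
    BASE = ".1.3.6.1.4.1.674.10892.1.200.10.1"  # assumed systemState table

    CHASSIS = BASE + ".4.1"        # assumed overall chassis status
    SUBSYSTEMS = {                 # assumed "combined status" columns
        "power supply": BASE + ".9.1",
        "voltage":      BASE + ".12.1",
        "cooling":      BASE + ".21.1",
        "temperature":  BASE + ".24.1",
        "memory":       BASE + ".27.1",
        "intrusion":    BASE + ".30.1",
    }

    def snmp_get(oid):
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, oid])
        return int(out.strip())

    if snmp_get(CHASSIS) == 3:     # 3 = OK in Dell's status convention
        print("OK - chassis status is healthy")
        sys.exit(0)

    # Chassis is non-OK, so figure out where the problem actually lies.
    faults, worst = [], 0
    for name, oid in sorted(SUBSYSTEMS.items()):
        status = snmp_get(oid)
        if status != 3:
            faults.append("%s=%d" % (name, status))
            worst = max(worst, status)

    if worst >= 5:                 # 5 = critical, 6 = non-recoverable
        print("CRITICAL - " + ", ".join(faults))
        sys.exit(2)
    print("WARNING - " + ", ".join(faults or ["chassis non-OK"]))
    sys.exit(1)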

Now I needed a way to report on the status of the drive arrays, because check_om.py doesn’t do that. I found a couple of plugins that would check the RAID controller locally, or that would do it for Linux servers. Then I finally found the check_win_perc plugin posted on a Dell mailing list site. It has a number of really good features, like telling you which drive in the RAID array is having problems, but it also has some quirks. For one thing, it stores baseline information in a temp file that must be manually deleted. To work in our environment it needed some cleanup and modification.

I modified the plugin to better handle the passing of SNMP community strings. As originally written, it reported every disk and its status, no matter which array controller the disk was attached to. I changed the code so that you can select which of two controllers you want to monitor and report on only those disks. Because my coding skills are limited, it still has some unresolved quirks: for example, it still counts Global Hot Spares across all controllers, which is wrong.
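For anyone curious what that controller filter looks like, here is a minimal sketch of the idea, not my actual modified plugin. The table OIDs, the column numbers (the disk-to-controller column in particular is hypothetical), and the state value are all assumptions standing in for the real Dell storage MIB entries, so substitute the correct ones from your OMSA install:

    #!/usr/bin/env python
    # Sketch: walk the (assumed) arrayDisk table and report only the
    # disks attached to the controller given on the command line.
    import subprocess
    import sys

    HOST, COMMUNITY, CONTROLLER = sys.argv[1], sys.argv[2], sys.argv[3]

    DISK_TABLE = ".1.3.6.1.4.1.674.10893.1.20.130.4.1"  # assumed
    DISK_NAME  = DISK_TABLE + ".2"  # assumed arrayDiskName column
    DISK_CTRL  = DISK_TABLE + ".3"  # hypothetical disk-to-controller column
    DISK_STATE = DISK_TABLE + ".4"  # assumed arrayDiskState column

    def snmp_walk(oid):
        # Return {row_index: value} for one table column via snmpwalk.
        out = subprocess.check_output(
            ["snmpwalk", "-v2c", "-c", COMMUNITY, "-Oq", HOST, oid])
        rows = {}
        for line in out.decode().splitlines():
            key, _, value = line.partition(" ")
            rows[key.split(".")[-1]] = value.strip().strip('"')
        return rows

    names  = snmp_walk(DISK_NAME)
    ctrls  = snmp_walk(DISK_CTRL)
    states = snmp_walk(DISK_STATE)

    bad = []
    for idx, name in sorted(names.items()):
        if ctrls.get(idx) != CONTROLLER:
            continue                 # skip disks on the other controller
        if states.get(idx) != "3":   # assumed 3 = online
            bad.append("%s state=%s" % (name, states.get(idx)))

    if bad:
        print("CRITICAL - controller %s: %s" % (CONTROLLER, ", ".join(bad)))
        sys.exit(2)
    print("OK - all disks on controller %s online" % CONTROLLER)
    sys.exit(0)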

My modified code is listed below. Please use it at your own risk! If you make any modifications or enhancements, please let me know.


Nagiosgraph with Windows support

After reviewing the four main tools for graphing performance with Nagios (APAN, Nagiosgraph, Nagiostat, and PerfParse), I decided that Nagiosgraph was the easiest for me to get up and running. Out of the box it worked great for my Linux systems and my network tests, but I needed to add support for monitoring my Windows servers.

I have used APAN in the past, but it was really tough to configure. I also tried PerfParse, and it worked great, but it required far more database resources than I was prepared to commit, and I could probably only have kept about 30 days of data.

To make things easier I installed the latest CVS nightly of the 1.4.0alpha Nagios plugins. As of 20040817, these plugins support performance data output for the check_nt plugin (the one that works with the NSClient service). Once the plugins were compiled and installed, I updated the nagiosgraph map file, which is used to parse plugin output when generating the stats.
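To give a flavor of what a map entry looks like: the map file is a set of Perl rules, each matching against the service output and perfdata and pushing data sets for RRD. The rule below is a sketch rather than a line from my actual map file, and the perfdata format in the comment is an assumption about what the alpha check_nt returns for a CPULOAD check, so adjust the regex to whatever your plugin actually emits:

    # Service type: check_nt CPULOAD (sketch; verify the perfdata format)
    #   perfdata: '5 min avg Load'=8%;80;90;0;100   (assumed)
    /perfdata:.*avg Load'=(\d+)%/
    and push @s, [ 'ntcpuload',
                   [ 'load', GAUGE, $1 ],
                 ];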
