Monitoring Dell Hardware with Nagios

We use the excellent Nagios network, host, and service monitoring software at the office to track the status of our servers, routers, network devices, and connections. The program works great and we love it. However, the one area we had not been able to track was the hardware status of our Dell PowerEdge servers, particularly those running Windows Server 2003. We’ve installed Dell’s OpenManage software on all the boxes and that works great, but we were not getting notified when something on a server failed (a power supply, a fan, or a disk in an array).

The status of a server can be read over SNMP from OpenManage, so I knew that it could be done; I just didn’t want to reinvent the wheel. I did some searching and came across three plugins. The first is simply called check_dell.pl. It checks the overall health of both the system and the array. If either is non-OK, it gives a warning. It is simple, quick, and effective, but I wanted additional reporting so that I would know which component was actually faulty.
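
As a rough illustration of what such an SNMP query looks like from the Nagios host, the sketch below wraps a single snmpget call and keeps only the last field of the reply, which is the same approach the script further down takes. It assumes the net-snmp command-line tools are installed. The function names, the host address, and the "public" community string are placeholders chosen for this example and are not part of any of the plugins.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative only.

# Keep only the last whitespace-separated field of each reply line,
# e.g. "... = INTEGER: 3" becomes "3".
snmp_value() {
   awk '{print $NF}'
}

# Query one OID with SNMP v1 and print just its value.
# $1 = host, $2 = community string, $3 = OID
snmp_get_value() {
   snmpget -v1 -c "$2" "$1" "$3" | snmp_value
}

# Example call (placeholder host and community string):
#   snmp_get_value 192.0.2.10 public 1.3.6.1.4.1.674.10893.1.1.130.1.1.5.1
```

If the call comes back with a noSuchName error, the OpenManage SNMP agent is probably not answering for that OID, and it is worth checking the SNMP service and firewall on the server first.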

The second plugin is called check_om.py and it checks the overall chassis status. If it is non-OK, it will then check other status indicators in order to create an error message that indicates where the problem lies. It has the ability to check for power supply, voltage, cooling device, temperature, memory, and intrusion issues. It works great, and we now use it!

Now I needed to find a way to report on the status of the drive arrays, because check_om.py doesn’t do that. I found a couple of plugins that would check the RAID controller locally or would do it for Linux servers. Then I finally found this check_win_perc plugin posted on a Dell mailing list site. It has a number of really good features, like telling which drive in the RAID array is having problems, but it also has some quirks. For one thing, it stores baseline information in a temp file that must be manually deleted. In order to work in our environment it needed some cleanup and modification.
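
The modified script in this post names its baseline file /tmp/<host>_<controller>_<serial>.txt, where the serial number is read from the server over SNMP. A minimal sketch of resetting that baseline by hand might look like the following. The two function names are made up for this example, and the host, controller number, and serial number have to be supplied by you exactly as the script would see them.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative only.

# Print the baseline file path for one host/controller pair.
# $1 = host as passed to the plugin, $2 = controller number, $3 = system serial number
baseline_path() {
   echo "/tmp/${1}_${2}_${3}.txt"
}

# Remove the baseline so that the next check writes a fresh one.
reset_baseline() {
   rm -f "$(baseline_path "$1" "$2" "$3")"
}

# Example call (placeholder values):
#   reset_baseline 192.0.2.10 1 ABC1234
```

After the file is removed, the next run of the plugin writes a new baseline and returns a warning for that one check.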

I modified the plugin to better handle passing of SNMP community strings. As originally written, it reported all the disks and their status, no matter which array controller they were attached to. I modified the code so that you can select which of two controllers you want to monitor and report on only those disks. Because my coding skills are non-existent, it still has some unresolved quirks; for example, the number of Global Hot Spares it reports is still counted across all controllers, which is wrong.
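
The script decides which disks belong to a controller by walking an SNMP table and counting the reply lines that match the controller number. One possible way to make that count stricter is sketched below: it compares the last field for equality instead of using a pattern match, so a value such as 10 or 21 is not counted for controller 1. The function name is invented for this example, and the sample input in the comment is made up rather than captured from a real server.

```shell
#!/bin/bash
# Hypothetical helper; the name is illustrative only.

# Count the lines on stdin whose last field is exactly the given number.
# $1 = controller number
count_for_controller() {
   awk -v n="$1" '$NF == n {++x} END {print x + 0}'
}

# Example with made-up snmpwalk-style output:
#   printf 'x = INTEGER: 1\nx = INTEGER: 2\n' | count_for_controller 1
```

The same idea could be applied to the Global Hot Spare count, provided the spare entries can be tied back to a controller.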

My modified code is listed below. Please use at your own risk! If you make any modifications or enhancements please let me know.

#!/bin/bash
#
# Script to check the Windows Dell-PERC for current status
#
# Original by:  Lewis Getschel
# Modified by:  Ken Nerhood
# Date:         05/11/2005
# Parameters:   1 - the IP address of the system to check
#               2 - snmp community string
#               3 - controller num (from .1.3.6.1.4.1.674.10893.1.1.130.1.1.1)
#
# Version History:
# 12/29/2004    Keeping a temp file seemed the best way to go on this. This
# LG            allows seeing changes. I initially didn't show the number of
#               Global/Dedicated HotSpares, but I realized that since each
#               "at-that-time-purchased" group had different standards for how
#               they were configured I needed to see the actual numbers of spares
#
# Notes:        The "baseline" (the temp file) is never actually replaced
#               anywhere in this code. If a new baseline is desired, then
#               simply delete the appropriate temp file. This routine will
#               create a NEW baseline (/tmp) file, and use that onward.
#
# Additional note:
#               Whenever something changes on the array (ready to offline, etc)
#               2 things happen:
#               1) Nagios goes to critical state
#               2) Nagios will STAY that way until you delete (or rename) the
#                  'baseline' file in /tmp
#                  I just leave it that way until the new drive arrives, then
#                  I delete the file. I let the "new config" be the Warning
#                  state for the 1st check, that way it shows up better in the
#                  event log.
#
#
# 05/11/2005    Added additional parameters to allow for easier configuration.
# KBN           You need to specify which array controller you want to monitor,
#               currently the script will only handle 2 controllers.
#               The script will now return a warning state if the controller
#               reports a severity level different than OK. This is to handle
#               the case where the baseline matches, but controller is not yet
#               OK (i.e. when rebuilding)
#
#
# =================================== Script starts below ================================
#
systemdifferences=0
hostnam=$1
communitystring=$2
arraynum=$3
# echo $1 >> /tmp/nagios_event_debug.txt
# echo --- `date` --- >> /tmp/nagios_event_debug.txt

if [ "$#" -lt "3" ]; then
   echo "Usage: check_win_perc host community arraynumber"
   exit 3
fi

# these system statuses don't hold after a reboot!
currentsystemstatus=`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.1.1.5.$arraynum | awk '{print $NF}'`
previoussystemstatus=`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.1.1.5.$arraynum | awk '{print $NF}'`
system_serial_number=`snmpwalk -v 1 -c $communitystring $hostnam .1.3.6.1.4.1.674.10892.1.300.10.1.11 | awk '{print $NF}' | sed 's/\"//g'`

if [ $arraynum -eq "1" ]; then
   contl1severity=`/usr/bin/snmpwalk -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.1.1.6.$arraynum | awk '{print $NF}'`
   contl1drives=`/usr/bin/snmpwalk -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.5.1.7 | awk '{print $NF}' | awk 'BEGIN {x=0} /'$arraynum'/ {++x} END {print x}'`
   contl1name=`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.1.1.2.$arraynum | awk -F\" '{print $2}'`
   for ((a=1; a <= contl1drives ; a++))  # Double parentheses, and "contl1drives" with no "$".
   do
      current_disks_state[${a}]=`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.4.${a} | awk '{print $NF}'`
   done                           # A construct borrowed from 'ksh93'.

   # === if there is a previousdata file for previous run, read it in.
   if [ -e /tmp/${hostnam}_${arraynum}_$system_serial_number.txt ]; then
      for ((a=1; a <= contl1drives ; a++))  # Double parentheses, and "contl1drives" with no "$".
      do
         previous_disks_state[${a}]=`/bin/sed -ne ${a}p /tmp/${hostnam}_${arraynum}_$system_serial_number.txt`
      done
      previousdata=1
   else # no previous file data, make it now from current (or should I make it manually as 4 3 3 3 1 ..??)
      currentdrive=1
      previousdata=0
      /bin/touch /tmp/${hostnam}_${arraynum}_$system_serial_number.txt
      while [ $currentdrive -le $contl1drives ]
      do
         echo ${current_disks_state[$currentdrive]} >> /tmp/${hostnam}_${arraynum}_$system_serial_number.txt
         currentdrive=`expr $currentdrive + 1`
      done
      echo "WARNING - PERC array wrote first status file for /tmp/${hostnam}_${arraynum}_$system_serial_number"
      exit 1
   fi

   totalhotspares=`/usr/bin/snmpwalk -c $communitystring -v 1 $hostnam 1.3.6.1.4.1.674.10893.1.1.130.4.1.22 | awk '{print $NF}'| awk 'BEGIN {x=0} /3/ {++x} END {print x}'`
   #totaldedicatedspares=`/usr/bin/snmpwalk -c $communitystring -v 1 $hostnam 1.3.6.1.4.1.674.10893.1.1.130.4.1.22 | awk '{print $NF}'| awk 'BEGIN {x=0} /4/ {++x} END {print x}'`

   # ========= If current status != previous status then it's Broken, figure out where =============
   # except for the FIRST time this script runs, this code only runs because of a mismatch in states
   # it seems safe to assume that I should check each array position for where the problem is.
   currentdrive=1
   while [ $currentdrive -le $contl1drives ]
   do
      if [ ${current_disks_state[$currentdrive]} -ne ${previous_disks_state[$currentdrive]} ]; then
         systemdifferences=1
         echo -n `/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.2.$currentdrive | awk -F\" '{print $2}'`" "
         case "${current_disks_state[$currentdrive]}" in
            "0" )
               echo -n "Unknown";;
            "1" )
               echo -n "Ready"
               case "`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.22.$currentdrive | awk '{print $NF}'`" in
                  "1" )
                     echo -n "-member of virtual disk.";;
                  "2" )
                     echo -n "-member of disk group.";;
                  "3" )
                     echo -n "-global hot spare.";;
                  "4" )
                     echo -n "-dedicated hot spare.";;
                   * )
                     echo -n "Bad_ERROR_Code.";;
               esac;;
            "2" )
               echo -n "Failed";;
            "3" )
               echo -n "Online";;
            "4" )
               echo -n "Offline";;
            "6" )
               echo -n "Degraded";;
            "7" )
               echo -n "Recovering";;
            "11" )
               echo -n "Removed";;
            "15" )
               echo -n "Resyncing";;
            "24" )
               echo -n "Rebuild";;
            "25" )
               echo -n "No Media";;
            "26" )
               echo -n "Formatting";;
            "28" )
               echo -n "Diagnostics";;
            "35" )
               echo -n "Initializing";;
            * )
               echo -n "Bad_ERROR_Code";;
         esac
         echo -n " Was: "
         case "${previous_disks_state[$currentdrive]}" in
            "0" )
               echo -n "Unknown. ";;
            "1" )
               echo -n "Ready. ";;
            "2" )
               echo -n "Failed. ";;
            "3" )
               echo -n "Online. ";;
            "4" )
               echo -n "Offline. ";;
            "6" )
               echo -n "Degraded. ";;
            "7" )
               echo -n "Recovering. ";;
            "11" )
               echo -n "Removed. ";;
            "15" )
               echo -n "Resyncing. ";;
            "24" )
               echo -n "Rebuild. ";;
            "25" )
               echo -n "No Media. ";;
            "26" )
               echo -n "Formatting. ";;
            "28" )
               echo -n "Diagnostics. ";;
            "35" )
               echo -n "Initializing. ";;
            * )
               echo -n "Bad_ERROR_Code. ";;
         esac
      fi
      currentdrive=`expr $currentdrive + 1`
   done
   if [ $systemdifferences -eq 0 ];
   then
      case $contl1severity in
         "0" )
            echo "OK - $contl1name Drives=$contl1drives, Global HotSpares=$totalhotspares"; exit 0;;
         "1" )
            echo "Warning - $contl1name Controller"; exit 1;;
         "2" )
            echo "Error  - $contl1name Controller"; exit 2;;
         "3" )
            echo "Failure - $contl1name Controller"; exit 2;;
         * )
            # Any other (or empty) severity value: report UNKNOWN to Nagios
            echo "Unknown - $contl1name Controller severity=$contl1severity"; exit 3;;
      esac
   else
      echo ""
      exit 2
   fi

else
   contl2severity=`/usr/bin/snmpwalk -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.1.1.6.$arraynum | awk '{print $NF}'`
   contl2name=`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.1.1.2.$arraynum | awk -F\" '{print $2}'`
   contl1drives=`/usr/bin/snmpwalk -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.5.1.7 | awk '{print $NF}' | awk 'BEGIN {x=0} /1/ {++x} END {print x}'`
   contl2drives=`/usr/bin/snmpwalk -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.5.1.7 | awk '{print $NF}' | awk 'BEGIN {x=0} /'$arraynum'/ {++x} END {print x}'`
   d=$contl1drives
   for ((a=1; a <= contl2drives ; a++))  # Double parentheses, and "contl2drives" with no "$".
   do
      let d=$contl1drives+$a
      current_disks_state[${a}]=`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.4.$d | awk '{print $NF}'`
   done                           # A construct borrowed from 'ksh93'.

   # === if there is a previousdata file for previous run, read it in.
   if [ -e /tmp/${hostnam}_${arraynum}_$system_serial_number.txt ]; then
      for ((a=1; a <= contl2drives ; a++))  # Double parentheses, and "contl2drives" with no "$".
      do
         previous_disks_state[${a}]=`/bin/sed -ne ${a}p /tmp/${hostnam}_${arraynum}_$system_serial_number.txt`
      done
      previousdata=1
   else # no previous file data, make it now from current (or should I make it manually as 4 3 3 3 1 ..??)
      currentdrive=1
      previousdata=0
      /bin/touch /tmp/${hostnam}_${arraynum}_$system_serial_number.txt
      while [ $currentdrive -le $contl2drives ]
      do
         echo ${current_disks_state[$currentdrive]} >> /tmp/${hostnam}_${arraynum}_$system_serial_number.txt
         currentdrive=`expr $currentdrive + 1`
      done
      echo "WARNING - PERC array wrote first status file for /tmp/${hostnam}_${arraynum}_$system_serial_number"
      exit 1
   fi

   totalhotspares=`/usr/bin/snmpwalk -c $communitystring -v 1 $hostnam 1.3.6.1.4.1.674.10893.1.1.130.4.1.22 | awk '{print $NF}'| awk 'BEGIN {x=0} /3/ {++x} END {print x}'`
   #totaldedicatedspares=`/usr/bin/snmpwalk -c $communitystring -v 1 $hostnam 1.3.6.1.4.1.674.10893.1.1.130.4.1.22 | awk '{print $NF}'| awk 'BEGIN {x=0} /4/ {++x} END {print x}'`

   currentdrive=1
   while [ $currentdrive -le $contl2drives ]
   do
      if [ ${current_disks_state[$currentdrive]} -ne ${previous_disks_state[$currentdrive]} ]; then
         systemdifferences=1
         let c2currentdrive=$contl1drives+$currentdrive
         echo -n `/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.2.$c2currentdrive | awk -F\" '{print $2}'`" "
         case "${current_disks_state[$currentdrive]}" in
            "0" )
               echo -n "Unknown";;
            "1" )
               echo -n "Ready"
               case "`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.22.$c2currentdrive | awk '{print $NF}'`" in
                  "1" )
                     echo -n "-member of virtual disk.";;
                  "2" )
                     echo -n "-member of disk group.";;
                  "3" )
                     echo -n "-global hot spare.";;
                  "4" )
                     echo -n "-dedicated hot spare.";;
                   * )
                     echo -n "Bad_ERROR_Code.";;
               esac;;
            "2" )
               echo -n "Failed";;
            "3" )
               echo -n "Online";;
            "4" )
               echo -n "Offline";;
            "6" )
               echo -n "Degraded";;
            "7" )
               echo -n "Recovering";;
            "11" )
               echo -n "Removed";;
            "15" )
               echo -n "Resyncing";;
            "24" )
               echo -n "Rebuild";;
            "25" )
               echo -n "No Media";;
            "26" )
               echo -n "Formatting";;
            "28" )
               echo -n "Diagnostics";;
            "35" )
               echo -n "Initializing";;
            * )
               echo -n "Bad_ERROR_Code";;
         esac
         echo -n " Was: "
         case "${previous_disks_state[$currentdrive]}" in
            "0" )
               echo -n "Unknown. ";;
            "1" )
               echo -n "Ready. ";;
            "2" )
               echo -n "Failed. ";;
            "3" )
               echo -n "Online. ";;
            "4" )
               echo -n "Offline. ";;
            "6" )
               echo -n "Degraded. ";;
            "7" )
               echo -n "Recovering. ";;
            "11" )
               echo -n "Removed. ";;
            "15" )
               echo -n "Resyncing. ";;
            "24" )
               echo -n "Rebuild. ";;
            "25" )
               echo -n "No Media. ";;
            "26" )
               echo -n "Formatting. ";;
            "28" )
               echo -n "Diagnostics. ";;
            "35" )
               echo -n "Initializing. ";;
            * )
               echo -n "Bad_ERROR_Code. ";;
         esac
      fi
      currentdrive=`expr $currentdrive + 1`
   done
   if [ $systemdifferences -eq 0 ];
   then
      case $contl2severity in
         "0" )
            echo "OK - $contl2name Drives=$contl2drives, Global HotSpares=$totalhotspares"; exit 0;;
         "1" )
            echo "Warning - $contl2name Controller"; exit 1;;
         "2" )
            echo "Error  - $contl2name Controller"; exit 2;;
         "3" )
            echo "Failure - $contl2name Controller"; exit 2;;
         * )
            # Any other (or empty) severity value: report UNKNOWN to Nagios
            echo "Unknown - $contl2name Controller severity=$contl2severity"; exit 3;;
      esac
   else
      echo ""
      exit 2
   fi
fi
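
The script repeats the same disk state lookup four times, once each for the current and previous state of each controller. A possible cleanup, sketched here and not part of the script as posted, is to move the lookup into one function. The function name is invented for this example; the numeric codes and labels are copied from the case statements above.

```shell
#!/bin/bash
# Hypothetical helper; the name is illustrative only.

# Translate a numeric disk state into the label used by the script.
# $1 = state code as returned over SNMP
disk_state_name() {
   case "$1" in
      0)  echo "Unknown" ;;
      1)  echo "Ready" ;;
      2)  echo "Failed" ;;
      3)  echo "Online" ;;
      4)  echo "Offline" ;;
      6)  echo "Degraded" ;;
      7)  echo "Recovering" ;;
      11) echo "Removed" ;;
      15) echo "Resyncing" ;;
      24) echo "Rebuild" ;;
      25) echo "No Media" ;;
      26) echo "Formatting" ;;
      28) echo "Diagnostics" ;;
      35) echo "Initializing" ;;
      *)  echo "Bad_ERROR_Code" ;;
   esac
}

# Possible use inside the comparison loop:
#   echo -n "$(disk_state_name "$now") Was: $(disk_state_name "$before"). "
```

The extra detail the script prints for a Ready disk (member of a virtual disk, hot spare, and so on) would still need its own lookup.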

12 Comments

  1. Raymond

     /  April 4, 2006

    Great post, very nice plugins! Thanks man, you saved me at least a few hours of searching 🙂

    I also really like the post on Nagiosgraph… You totally changed my opinion on blogs 😉

  2. I’m glad that you found what I did helpful. In both cases (with the Dell plugins and the Nagiosgraph) it has been a long time since I’ve even looked at the code. It just works for us. Hopefully it will work as reliably for you. If you make any modifications please let me know.

  3. Alex

     /  May 3, 2006

    Hi,
    I’d like to use the check_dell.pl but unfortunately I can’t download the plugin from SourceForge.
    Can you please e-mail the plugin to me?

    Thank you very much.

    Alex

  4. Alex,
    The plugin is not on SourceForge, but on my site. It is listed above at the very bottom of the post, but I’ll list it here as well: check_win_perc. I’ll also email you the code.

    I hope it works for you.

  5. electro93

     /  February 20, 2007

    I’m having issues running the code above. For some reason version 4.x of the Dell OpenManage software doesn’t seem to like the SNMP OID .1.3.6.1.4.1.674.10893.1.1.130.1.1.1

    Has this been updated to reflect the newer version of Dell OpenManage?

    Thanks!

  6. Electro,

    I’m successfully running this on new (last 6 months) Dell PowerEdge 1850s w/ OM version 4.5 without any problems. It has been well over a year since I’ve looked at this code, so it’s all a little rusty for me, but it has been fine across my 18 different servers.

    Can you get anything off the Dell MIB variables when querying them? Let me ask a dumb question: if you are running Windows on the box, do you have the SNMP service started and configured to allow requests from your management/Nagios station? Do you have the firewall running, and if so, are the right ports open?

    Let me know how it goes.

    –ken

  7. Oscar

     /  February 21, 2007

    Hi, I would like to use the check_dell.pl script, but unfortunately, I cannot find it. The link above (in the post) points to source forge – which is inaccessible to non-members of the project.

    Could you kindly mail it to me, or point me to an alternative resource from where I could get it?

    Thanks,

    Oscar

  8. nerdybails

     /  October 7, 2007

    Hi, to second that.
    The scripts are no longer available. Are you able to mirror them or send them to me?

    Cheers.

  9. If you’re looking for some of the scripts that I mentioned in the original article, you may want to check out NagiosExchange. They have tons of stuff, including an entire section on Dell hardware monitoring.

  10. Hi Ken,

    I just came across your code. I have a Nagios server running on Linux and put your code there. My client is a Dell 2650 running Win2k3 Enterprise with SNMP enabled, OpenManage Server running and everything. When I run your code against it I get:
    Error in packet
    Reason: (noSuchName) There is no such variable name in this MIB.
    Failed object: SNMPv2-SMI::enterprises.674.10893.1.1.130.1.1.5.1

    Any ideas?

  11. Matthew

     /  November 26, 2010

    Any chance you can repost/host the scripts? The links are mostly broken now, including the nagiosexchange ones.
