Monitoring NetApp with Nagios and Nagiosgraph

With the installation of our new Network Appliance (NetApp) filers, I needed to be able to monitor them. Yes, I know that they have an AutoSupport feature that emails both you and NetApp whenever anything happens, but I still like to do my own monitoring.

The first thing I did was check the Nagios Exchange for NetApp plugins. There were two. The first worked and the second didn’t, although the second did offer a failed disk check. So I modified the first to add that check, and because I knew I was going to be using Nagiosgraph, I also corrected the performance data output of two of the checks to comply with the Nagios Plugin Development Guidelines.

Download my modified check_netapp Nagios plugin.

To graph the NetApp data with Nagiosgraph you will need my modified check_netapp plugin, so please download and test it before proceeding.
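For reference, the Nagios Plugin Development Guidelines lay out each performance data item as label=value[UOM];warn;crit;min;max, with labels that contain spaces wrapped in single quotes. The sketch below builds one such item. The `format_perfdata` helper and its named arguments are illustrative assumptions and are not part of check_netapp. Note also that the plugin as posted prints its DISKUSED label unquoted, and the map entries in this post match that unquoted form, so a quoted label would need a different regular expression.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: builds one perfdata item in the guideline layout
#   label=value[UOM];warn;crit;min;max
# Labels that contain whitespace are wrapped in single quotes.
sub format_perfdata {
    my (%p) = @_;
    my $label = $p{label};
    $label = "'$label'" if $label =~ /\s/;
    my $uom = defined $p{uom} ? $p{uom} : '';
    # Undefined thresholds are left empty rather than printed as 0.
    my @rest = map { defined $_ ? $_ : '' } @p{qw(warn crit min max)};
    return sprintf '%s=%s%s;%s', $label, $p{value}, $uom, join( ';', @rest );
}

# Same values as the CPULOAD sample in this post.
print format_perfdata(
    label => 'netapp-cpuload', value => 1, uom => '%',
    warn  => 75, crit => 90, min => 0, max => 100,
), "\n";
```

Leaving an undefined threshold empty keeps the field count stable, which matters for any parser that splits on the semicolons.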

The following nagiosgraph map entries will allow you to graph both CPU Load and Disk Space Used per volume (by name):

# Service type: netapp-cpuload
#   check command: check_netapp -H Address -C community -v CPULOAD -w 75 -c 90
#   output: CPULOAD OK - CPU load: 1%
#   perfdata: netapp-cpuload=1%;75;90;0;100
/perfdata:netapp-cpuload=(\d+)%/
and push @s, [ netappcpuload,
	[ cpuload, GAUGE, $1 ] ];

# Service type: netapp-disk-used
#   check command:  check_netapp -H Address -C community -v DISKUSED -o /vol/volume/ -w 75 -c 90
#   output: DISKUSED OK - /vol/volume/ - total: 33554432 Kb - used 190692 Kb (1%) - free: 33363740 Kb
#   perfdata: NetApp /vol/root/ Used Space=190692KB;25165824;30198988;0;33554432
/perfdata:NetApp.*Space=(\d+)KB;(\d+);(\d+);\d+;(\d+)/
and push @s, [ netappdisk,
	[ diskused, GAUGE, $1*1024 ],
	[ diskwarn, GAUGE, $2*1024 ],
	[ diskcrit, GAUGE, $3*1024 ],
	[ diskmaxi, GAUGE, $4*1024 ] ];
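
Before editing the map file, it can help to run the two regular expressions above against the sample perfdata lines from the comments. This is a minimal standalone sketch: the `parse_perfdata` helper and the hash it returns are illustrative assumptions and are not part of nagiosgraph, which evaluates the map rules itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: applies the same two patterns used in the map
# entries to one "perfdata:..." line and returns the captured values.
sub parse_perfdata {
    my ($line) = @_;
    if ( $line =~ /perfdata:netapp-cpuload=(\d+)%/ ) {
        return { db => 'netappcpuload', cpuload => $1 };
    }
    if ( $line =~ /perfdata:NetApp.*Space=(\d+)KB;(\d+);(\d+);\d+;(\d+)/ ) {
        # The plugin reports KB; the map multiplies by 1024 to store bytes.
        return {
            db       => 'netappdisk',
            diskused => $1 * 1024,
            diskwarn => $2 * 1024,
            diskcrit => $3 * 1024,
            diskmaxi => $4 * 1024,
        };
    }
    return;    # no rule matched
}

my $cpu  = parse_perfdata('perfdata:netapp-cpuload=1%;75;90;0;100');
my $disk = parse_perfdata(
    'perfdata:NetApp /vol/root/ Used Space=190692KB;25165824;30198988;0;33554432');

print "cpuload=$cpu->{cpuload}\n";
print "diskused=$disk->{diskused} diskmaxi=$disk->{diskmaxi}\n";
```

If a line from your own plugin output returns nothing here, the corresponding map entry would not match it either.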

Here are the entries that need to be created in your serviceextinfo.cfg file to produce the corresponding graphs:

define serviceextinfo {
  service_description  NetApp-Load
  host_name       netapp1,netapp2
  notes_url       /nagiosgraph/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$&db=netappcpuload,cpuload
  icon_image      graph.png
  icon_image_alt  View graphs
}

define serviceextinfo {
  service_description  NetApp-DiskUsed-/vol/volume
  host_name       netapp1,netapp2
  notes_url       /nagiosgraph/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$&db=netappdisk,diskused,diskwarn,diskcrit,diskmaxi
  icon_image      graph.png
  icon_image_alt  View graphs
}

Here is my modified check_netapp Nagios plugin; you will need to upload it to your Nagios server.


#!/usr/bin/perl -w

# Copyright (c) 2006 Dy 4 Systems Inc.
#
# Parameter checks and SNMP v3 based on code by Christoph Kron
#  and S. Ghosh (check_ifstatus)
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
#
#
# Report bugs to ken.mckinlay@curtisswright.com, nagiosplug-help@lists.sf.net
#
# 2006.05.01 Version 1.0
#
#
#############################################################
#
# Updated by Ken Nerhood - http://nerhood.wordpress.net/
# 2006.06.19
#
# Added check for Failed Disks
# Corrected perfdata output for CPULOAD and DISKUSED
#    to make it compliant with Nagios Plugin Guidelines
#
#############################################################
#
# $Id: check_netapp,v 1.2 2006/05/01 13:44:16 root Exp root $

use strict;
use lib "/usr/local/nagios/libexec";
use utils qw($TIMEOUT %ERRORS &print_revision &support);
use Net::SNMP;
use Getopt::Long;
Getopt::Long::Configure('bundling');

my $PROGNAME = 'check_netapp';
my $PROGREVISION = '1.0';

sub print_help ();
sub usage ();
sub process_arguments ();

my ($status,$timeout,$answer,$perfdata,$hostname,$volume);
my ($seclevel,$authproto,$secname,$authpass,$privpass,$snmp_version);
my ($auth,$priv,$session,$error,$response,$snmpoid,$variable);
my ($warning,$critical,$opt_h,$opt_V);
my %snmpresponse;

my $state = 'UNKNOWN';
my $community='public';
my $maxmsgsize = 1472; # Net::SNMP default is 1472
my $port = 161;

my $snmpFailedFanCount = '.1.3.6.1.4.1.789.1.2.4.2';
my $snmpFailPowerSupplyCount = '.1.3.6.1.4.1.789.1.2.4.4';
my $snmpFailedDiskCount = '.1.3.6.1.4.1.789.1.6.4.7';
my $snmpUptime = '.1.3.6.1.2.1.1.3';
my $snmpcpuBusyTimePerCent = '.1.3.6.1.4.1.789.1.2.1.3';
my $snmpenvOverTemperature = '.1.3.6.1.4.1.789.1.2.4.1';
my $snmpnvramBatteryStatus = '.1.3.6.1.4.1.789.1.2.5.1';
my $snmpfilesysvolTable = '.1.3.6.1.4.1.789.1.5.8';
my $snmpfilesysvolTablevolEntryOptions = '.1.3.6.1.4.1.789.1.5.8.1.7';
my $snmpfilesysvolTablevolEntryvolName = '.1.3.6.1.4.1.789.1.5.8.1.2';
my $snmpfilesysdfTabledfEntry = '.1.3.6.1.4.1.789.1.5.4.1';
my $snmpfilesysdfTabledfEntrydfFileSys = '.1.3.6.1.4.1.789.1.5.4.1.2';
my $snmpfilesysdfTabledfEntrydfKBytesTotal = '.1.3.6.1.4.1.789.1.5.4.1.3';
my $snmpfilesysdfTabledfEntrydfKBytesUsed = '.1.3.6.1.4.1.789.1.5.4.1.4';
my $snmpfilesysdfTabledfEntrydfKBytesAvail = '.1.3.6.1.4.1.789.1.5.4.1.5';
my $snmpfilesysdfTabledfEntrydfPercentKBytesCapacity = '.1.3.6.1.4.1.789.1.5.4.1.6';

my %nvramBatteryStatus = (
        1 => 'ok',
        2 => 'partially discharged',
        3 => 'fully discharged',
        4 => 'not present',
        5 => 'near end of life',
        6 => 'at end of life',
        7 => 'unknown',
        8 => 'over charged',
        9 => 'fully charged',
);

# Just in case of problems, let's not hang Nagios
$SIG{'ALRM'} = sub {
        print "ERROR: No snmp response from $hostname (alarm timeout)\n";
        exit $ERRORS{'UNKNOWN'};
};

$status = process_arguments();
if ( $status != 0 ) {
        print_help();
        exit $ERRORS{'OK'};
}

alarm($timeout);

# do the query
if ( ! defined ( $response = $session->get_table($snmpoid) ) ) {
        $answer=$session->error;
        $session->close;
        $state = 'CRITICAL';
        print "$state:$answer for $snmpoid with snmp version $snmp_version\n";
        exit $ERRORS{$state};
}
$session->close;
alarm(0);

foreach my $snmpkey (keys %{$response} ) {
        my ($oid,$key) = ( $snmpkey =~ /(.*)\.(\d+)$/ );
        $snmpresponse{$oid}{$key} = $response->{$snmpkey};
}

if ( $variable eq 'FAN' ) {
        $state = 'OK';
        $state = 'WARNING' if ( ( defined $warning ) && ( $snmpresponse{$snmpFailedFanCount}{0} >= $warning ) );
        $state = 'CRITICAL' if ( ( defined $critical ) && ( $snmpresponse{$snmpFailedFanCount}{0} >= $critical ) );
        $answer = sprintf("Fans failed: %d",$snmpresponse{$snmpFailedFanCount}{0});
        $perfdata = sprintf("failedfans=%d",$snmpresponse{$snmpFailedFanCount}{0});
} elsif ( $variable eq 'UPTIME' ) {
        $state = 'OK';
        $answer = sprintf("System Uptime: %s",$snmpresponse{$snmpUptime}{0});
        $perfdata = sprintf("uptime=%s",$snmpresponse{$snmpUptime}{0});
} elsif ( $variable eq 'FAILEDDISK' ) {
        $state = 'OK';
        $state = 'WARNING' if ( ( defined $warning ) && ( $snmpresponse{$snmpFailedDiskCount}{0} >= $warning ) );
        $state = 'CRITICAL' if ( ( defined $critical ) && ( $snmpresponse{$snmpFailedDiskCount}{0} >= $critical ) );
        $answer = sprintf("Disks failed: %d",$snmpresponse{$snmpFailedDiskCount}{0});
        $perfdata = sprintf("faileddisks=%d",$snmpresponse{$snmpFailedDiskCount}{0});
} elsif ( $variable eq 'PS' ) {
        $state = 'OK';
        $state = 'WARNING' if ( ( defined $warning ) && ( $snmpresponse{$snmpFailPowerSupplyCount}{0} >= $warning ) );
        $state = 'CRITICAL' if ( ( defined $critical ) && ( $snmpresponse{$snmpFailPowerSupplyCount}{0} >= $critical ) );
        $answer = sprintf("Power supplies failed: %d",$snmpresponse{$snmpFailPowerSupplyCount}{0});
        $perfdata = sprintf("failedpowersupplies=%d",$snmpresponse{$snmpFailPowerSupplyCount}{0});
} elsif ( $variable eq 'CPULOAD' ) {
        $state = 'OK';
        $state = 'WARNING' if ( ( defined $warning ) && ( $snmpresponse{$snmpcpuBusyTimePerCent}{0} >= $warning ) );
        $state = 'CRITICAL' if ( ( defined $critical ) && ( $snmpresponse{$snmpcpuBusyTimePerCent}{0} >= $critical ) );
        $answer = sprintf("CPU load: %d%%",$snmpresponse{$snmpcpuBusyTimePerCent}{0});
        #$perfdata = sprintf("cpuload=%d",$snmpresponse{$snmpcpuBusyTimePerCent}{0});
        $perfdata = sprintf("netapp-cpuload=%d%%;%d;%d;0;100",$snmpresponse{$snmpcpuBusyTimePerCent}{0},$warning,$critical);
} elsif ( $variable eq 'TEMP' ) {
        $state = 'OK';
        $state = 'CRITICAL' if ( $snmpresponse{$snmpenvOverTemperature}{0} ==  2 );
        $answer = sprintf ("Over temperature: %s",($snmpresponse{$snmpenvOverTemperature}{0} == 1 ? 'no':'yes'));
        $perfdata = sprintf("overtemperature=%d",$snmpresponse{$snmpenvOverTemperature}{0});
} elsif ( $variable eq 'NVRAM' ) {
        $state = 'OK';
        $state = 'CRITICAL' if (( $snmpresponse{$snmpnvramBatteryStatus}{0} > 1 ) && ( $snmpresponse{$snmpnvramBatteryStatus}{0} < 9 ));
        $answer = sprintf ("NVRAM battery status: %s",$nvramBatteryStatus{$snmpresponse{$snmpnvramBatteryStatus}{0}});
        $perfdata = sprintf("nvrambatterystatus=%d",$snmpresponse{$snmpnvramBatteryStatus}{0});
} elsif ( $variable eq 'SNAPSHOT' ) {
        $state = 'OK';
        $answer = 'Snapshot status:';
        foreach my $key ( keys %{$snmpresponse{$snmpfilesysvolTablevolEntryOptions}} ) {
                if ( defined $volume ) {
                        if ( $snmpresponse{$snmpfilesysvolTablevolEntryvolName}{$key} eq $volume ) {
                                if ( $snmpresponse{$snmpfilesysvolTablevolEntryOptions}{$key} !~ /nosnap=off/ ) {
                                        $state = 'CRITICAL';
                                        $answer = sprintf ("%s %s Snapshots disabled;",
                                                        $answer,
                                                        $snmpresponse{$snmpfilesysvolTablevolEntryvolName}{$key});
                                } else {
                                        $answer = sprintf ("%s volume %s enabled",$answer,$snmpresponse{$snmpfilesysvolTablevolEntryvolName}{$key}) if $state ne 'CRITICAL';
                                }
                                last;
                        }
                } else {
                        if ( $snmpresponse{$snmpfilesysvolTablevolEntryOptions}{$key} !~ /nosnap=off/ ) {
                                $state = 'CRITICAL';
                                $answer = sprintf ("%s %s Snapshots disabled;",$answer,$snmpresponse{$snmpfilesysvolTablevolEntryvolName}{$key});
                        }
                }
        }
        $answer = sprintf ("%s all enabled",$answer) if $answer eq 'Snapshot status:';
        $perfdata = sprintf("");
} elsif ( $variable eq 'DISKUSED' ) {
        $state = 'OK';
        foreach my $key ( keys %{$snmpresponse{$snmpfilesysdfTabledfEntrydfFileSys}} ) {
                if ( defined $volume ) {
                        if ( $snmpresponse{$snmpfilesysdfTabledfEntrydfFileSys}{$key} eq $volume ) {
                                my $volume = $snmpresponse{$snmpfilesysdfTabledfEntrydfFileSys}{$key};
                                my $used = $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesUsed}{$key};
                                my $total = $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesTotal}{$key};
                                my $avail = $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesAvail}{$key};
                                my $percent = $snmpresponse{$snmpfilesysdfTabledfEntrydfPercentKBytesCapacity}{$key};
                                $answer = sprintf("%s - total: %d Kb - used %d Kb (%d%%) - free: %d Kb",$volume,$total,$used,$percent,$avail);
                                $perfdata = sprintf("NetApp %s Used Space=%dKB;%d;%d;0;%d",$volume,$used,$total*$warning/100,$total*$critical/100,$total);
                                $state = 'WARNING' if ( ( defined $warning ) && ( $percent >= $warning ) );
                                $state = 'CRITICAL' if ( ( defined $critical ) && ( $percent >= $critical ) );
                                last;
                        }
                } else {
                        my $volume = $snmpresponse{$snmpfilesysdfTabledfEntrydfFileSys}{$key};
                        my $used = $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesUsed}{$key};
                        my $total = $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesTotal}{$key};
                        my $avail = $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesAvail}{$key};
                        my $percent = $snmpresponse{$snmpfilesysdfTabledfEntrydfPercentKBytesCapacity}{$key};
                        $answer .= sprintf("%s - total: %d Kb - used %d Kb (%d%%) - free: %d Kb\n",$volume,$total,$used,$percent,$avail);
                        $perfdata .= sprintf("NetApp %s Used Space=%dKB;%d;%d;0;%d",$volume,$used,$total*$warning/100,$total*$critical/100,$total);
                        $state = 'WARNING' if ( ( defined $warning ) && ( $percent >= $warning ) && ( $state ne 'CRITICAL') );
                        $state = 'CRITICAL' if ( ( defined $critical ) && ( $percent >= $critical ) );
                }
        }
        if ( ( ! defined $answer ) && ( defined $volume ) ) {
                $state = 'UNKNOWN';
                $answer = "unknown volume: $volume";
                $perfdata = '';
        }
}

print "$variable $state - $answer|$perfdata\n";
exit $ERRORS{$state};

sub usage () {
        print "\nMissing arguments!\n\n";
        print "check_netapp -H <ip_address> -v variable [-w warn_range] [-c crit_range]\n";
        print "             [-C community] [-t timeout] [-p port-number]\n";
        print "             [-P snmp version] [-L seclevel] [-U secname] [-a authproto]\n";
        print "             [-A authpasswd] [-X privpasswd] [-o volume]\n\n";
        support();
        exit $ERRORS{'UNKNOWN'};
}

sub print_help () {
        print "check_netapp plugin for Nagios monitors the status\n";
        print "of a NetApp system\n\n";
        print "Usage:\n";
        print "  -H, --hostname\n\thostname to query (required)\n";
        print "  -C, --community\n\tSNMP read community (defaults to public)\n";
        print "  -t, --timeout\n\tseconds before the plugin times out (default=$TIMEOUT)\n";
        print "  -p, --port\n\tSNMP port (default 161)\n";
        print "  -P, --snmp_version\n\t1 for SNMP v1 (default), 2 for SNMP v2c\n\t\t3 for SNMP v3 (requires -U)\n";
        print "  -L, --seclevel\n\tchoice of \"noAuthNoPriv\", \"authNoPriv\", \"authPriv\"\n";
        print "  -U, --secname\n\tuser name for SNMPv3 context\n";
        print "  -a, --authproto\n\tauthentication protocol (MD5 or SHA1)\n";
        print "  -A, --authpass\n\tauthentication password\n";
        print "  -X, --privpass\n\tprivacy password in hex with 0x prefix generated by snmpkey\n";
        print "  -V, --version\n\tplugin version\n";
        print "  -w, --warning\n\twarning level\n";
        print "  -c, --critical\n\tcritical level\n";
        print "  -v, --variable\n\tvariable to query, can be:\n";
        print "\t\tCPULOAD - CPU load\n";
        print "\t\tDISKUSED - disk space used\n";
        print "\t\tFAILEDDISK - failed disks\n";
        print "\t\tFAN - fail fan state\n";
        print "\t\tNVRAM - nvram battery status\n";
        print "\t\tPS - power supply\n";
        print "\t\tSNAPSHOT - volume snapshot status\n";
        print "\t\tTEMP - over temperature check\n";
        print "\t\tUPTIME - up time\n";
        print "  -o, --volume\n\tvolume to query (defaults to all)\n";
        print "  -h, --help\n\tusage help\n\n";
        print_revision($PROGNAME,"\$Revision: 1.2 $PROGREVISION\$");
}

sub process_arguments () {
        $status = GetOptions (
                'V' => \$opt_V, 'version' => \$opt_V,
                'h' => \$opt_h, 'help' => \$opt_h,
                'P=i' => \$snmp_version, 'snmp_version=i' => \$snmp_version,
                'C=s' => \$community, 'community=s' => \$community,
                'L=s' => \$seclevel, 'seclevel=s' => \$seclevel,
                'a=s' => \$authproto, 'authproto=s' => \$authproto,
                'U=s' => \$secname, 'secname=s' => \$secname,
                'A=s' => \$authpass, 'authpass=s' => \$authpass,
                'X=s' => \$privpass, 'privpass=s' => \$privpass,
                'H=s' => \$hostname, 'hostname=s' => \$hostname,
                't=i' => \$timeout, 'timeout=i' => \$timeout,
                'v=s' => \$variable, 'variable=s' => \$variable,
                'w=i' => \$warning, 'warning=i' => \$warning,
                'c=i' => \$critical, 'critical=i' => \$critical,
                'o=s' => \$volume, 'volume=s' => \$volume,
                'p=i' => \$port, 'port=i' => \$port,
        );

        if ( $status == 0 ) {
                print_help();
                exit $ERRORS{'OK'};
        }

        if ( $opt_V ) {
                print_revision($PROGNAME,"\$Revision: 1.2 $PROGREVISION\$");
                exit $ERRORS{'OK'};
        }

        if ( ! utils::is_hostname($hostname) ) {
                usage();
                exit $ERRORS{'UNKNOWN'};
        }

        unless ( defined $timeout ) {
                $timeout = $TIMEOUT;
        }

        if ( ! $snmp_version ) {
                $snmp_version = 1;
        }

        if ( $snmp_version =~ /3/ ) {
                if ( defined $seclevel && defined $secname ) {
                        unless ( $seclevel =~ /^(?:noAuthNoPriv|authNoPriv|authPriv)$/ ) {
                                usage();
                                exit $ERRORS{'UNKNOWN'};
                        }

                        if ( $seclevel =~ /^(?:authNoPriv|authPriv)$/ ) {
                                unless ( defined $authproto && $authproto =~ /^(?:MD5|SHA1)$/ ) {
                                        usage();
                                        exit $ERRORS{'UNKNOWN'};
                                }
                                if ( ! defined $authpass ) {
                                        usage();
                                        exit $ERRORS{'UNKNOWN'};
                                } else {
                                        if ( $authpass =~ /^0x/ ) {
                                                $auth = "-authkey => $authpass";
                                        } else {
                                                $auth = "-authpassword => $authpass";
                                        }
                                }
                        }

                        if ( $seclevel eq 'authPriv' ) {
                                if ( ! defined $privpass ) {
                                        usage();
                                        exit $ERRORS{'UNKNOWN'};
                                } else {
                                        if ( $privpass =~ /^0x/ ) {
                                                $priv = "-privkey => $privpass";
                                        } else {
                                                $priv = "-privpassword => $privpass";
                                        }
                                }
                        }
                } else {
                        usage();
                        exit $ERRORS{'UNKNOWN'};
                }
        }

        # create the SNMP session
        if ( $snmp_version =~ /[12]/ ) {
                ($session,$error) = Net::SNMP->session(
                                        -hostname => $hostname,
                                        -community => $community,
                                        -port => $port,
                                        -version => $snmp_version,
                );
                if ( ! defined $session ) {
                        $state = 'UNKNOWN';
                        $answer = $error;
                        print "$state:$answer\n";
                        exit $ERRORS{$state};
                }
        } elsif ( $snmp_version  =~ /3/ ) {
                if ( $seclevel eq 'noAuthNoPriv' ) {
                        ($session,$error) = Net::SNMP->session(
                                                -hostname => $hostname,
                                                -community => $community,
                                                -port => $port,
                                                -version => $snmp_version,
                                                -username => $secname,
                        );
                } elsif ( $seclevel eq 'authNoPriv' ) {
                        ($session,$error) = Net::SNMP->session(
                                                -hostname => $hostname,
                                                -community => $community,
                                                -port => $port,
                                                -version => $snmp_version,
                                                -username => $secname,
                                                -authprotocol => $authproto,
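                                                # $auth holds "-authkey => value" as one string; split it into a key/value pair
                                                split( / => /, $auth, 2 )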
                        );
                } elsif ( $seclevel eq 'authPriv' ) {
                        ($session,$error) = Net::SNMP->session(
                                                -hostname => $hostname,
                                                -community => $community,
                                                -port => $port,
                                                -version => $snmp_version,
                                                -username => $secname,
                                                -authprotocol => $authproto,
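                                                # $auth and $priv each hold "-key => value" as one string; split them into pairs
                                                split( / => /, $auth, 2 ),
                                                split( / => /, $priv, 2 )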
                        );
                }
                if ( ! defined $session ) {
                        $state = 'UNKNOWN';
                        $answer = $error;
                        print "$state:$answer\n";
                        exit $ERRORS{$state};
                }
        } else {
                $state = 'UNKNOWN';
                print "$state: No support for SNMP v$snmp_version\n";
                exit $ERRORS{$state};
        }

        # check the supported variables
        if ( ! defined $variable ) {
                print_help();
                exit $ERRORS{'UNKNOWN'};
        } else {
                if ( $variable eq 'UPTIME' ) {
                        $snmpoid = $snmpUptime;
                } elsif ( $variable eq 'FAN' ) {
                        $snmpoid = $snmpFailedFanCount;
                } elsif ( $variable eq 'FAILEDDISK' ) {
                        $snmpoid = $snmpFailedDiskCount;
                } elsif ( $variable eq 'PS' ) {
                        $snmpoid = $snmpFailPowerSupplyCount;
                } elsif ( $variable eq 'CPULOAD' ) {
                        $snmpoid = $snmpcpuBusyTimePerCent;
                } elsif ( $variable eq 'TEMP' ) {
                        $snmpoid = $snmpenvOverTemperature;
                } elsif ( $variable eq 'NVRAM' ) {
                        $snmpoid = $snmpnvramBatteryStatus;
                } elsif ( $variable eq 'SNAPSHOT' ) {
                        $snmpoid = $snmpfilesysvolTable;
                } elsif ( $variable eq 'DISKUSED' ) {
                        $snmpoid = $snmpfilesysdfTabledfEntry;
                } else {
                        print_help();
                        exit $ERRORS{'UNKNOWN'};
                }
        }

        return $ERRORS{'OK'};
}


23 Comments on “Monitoring NetApp with Nagios and Nagiosgraph”

  1. Ian Collier Says:

    Most helpful.

    BTW, I realised today that you can monitor the snapshots by using something like:

    check_netapp-2 -H netapp -C public -v DISKUSED -o /vol/volume/.snapshot -w 10 -c 5

  2. David Says:

    I have found NFS and CIFS ops/s to be very helpful to check.

  3. Willem Says:

    check_netapp reports something about SNMP v1 back to me. I am running W2K3 servers. Is NetApp not compatible with other SNMP versions? I am not familiar with SNMP at all; I only report what I see in Nagios. I thought I had the right stuff when I saw NetApp… Or is there a little trick?

  4. Ken M Says:

    I like the addition you made to the script I originally wrote when I first started playing with Nagios and plugins. I was planning on doing the same but just never got around to it.

  5. Ian Collier Says:

    Hi,

    Discovered an interesting problem. If volumes are over 2TB then netapp’s snmp reports negative values – approx the right magnitude, but negative. So, inserting ‘abs’ ahead of the appropriate values in the diskused section ensures that you get something that makes sense in nagios – and more importantly doesn’t make nagiosgraph barf.

    –Ian

  6. Ian Collier Says:

    Hi,

    Discovered an interesting problem. If volumes are over 2TB then netapp’s snmp reports negative values – approx the right magnitude, but negative. So, inserting ‘abs’ ahead of the appropriate values in the diskused section ensures that you get something that makes sense in nagios – and more importantly doesn’t make nagiosgraph barf.

    eg

    my $used = abs $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesUsed}{$key};

    –Ian

  7. Brian Says:

    I’m trying to get this going. Is there any updated info for Nagios 3?

    How do I define the NetApp host configs?
    Do I add the “service types” to commands.cfg?

    Thanks
    brian

  8. kbn Says:

    Brian,

    I’ve been on holiday for the past week. I will have to look at this when I get back to the office.

    As for Nagios 3 I have not looked at that yet and probably won’t for a couple of months. In meantime check out the Nagios Exchange version for the service config definitions. If I remember correctly I didn’t really change any of them.

    –ken

  9. Charles Richmond Says:

    Below is the diff of the check-netapp that corrects the negative numbers with volumes over 2TB and which converts output to GB for readability. The complete text can be found at http://www.iisc.com/check_netapp . The original check_netapp is by Ken Mckinlay with additional Nagios compatibility work by Ken Nerhood. My change is a relatively minor incremental one.

    The result of the change can be seen in this output:
    OLD/check_netapp -H x.x.x.x -C public -v DISKUSED -o /vol/vol_bpdimage/ -w 80 -c 90
    DISKUSED OK - /vol/vol_bpdimage/ - total: 1090519040 Kb - used 589970728 Kb (54%) - free: 500548312 Kb|NetApp /vol/vol_bpdimage/ Used Space=589970728KB;872415232;981467136;0;1090519040

    libexec/check_netapp -H x.x.x.x -C public -v DISKUSED -o /vol/vol_bpdimage/ -w 80 -c 90
    DISKUSED OK - /vol/vol_bpdimage/ - total: 1040 Gb - used 562 Gb (54%) - free: 477 Gb|NetApp /vol/vol_bpdimage/ Used Space=562GB;832;936;0;1040

    Note: I am calculating real GB using 1024*1024. If you want manufacturer’s GB then change the '1048576' to '1000000' in the diff below.

    [nagios@lrdcsvcdsk1 libexec]$ diff check_netapp ../OLD/check_netapp
    39,46d38
    < # Updated by Charles Richmond - http://www.iisc.com/
    < # 2008.09.11
    < #
    < # Modified to correct negative values for volumes larger than
    < # 2TB and modified 'sprintf' output to be Gb instead of Kb
    < #
    < #############################################################
    < #
    203,205c195,197
    < my $used = abs $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesUsed}{$key};
    < my $total = abs $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesTotal}{$key};
    < my $avail = abs $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesAvail}{$key};
    ---
    > my $used = $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesUsed}{$key};
    > my $total = $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesTotal}{$key};
    > my $avail = $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesAvail}{$key};
    207,208c199,200
    < $answer = sprintf("%s - total: %d Gb - used %d Gb (%d%%) - free: %d Gb",$volume,$total/1048576,$used/1048576,$percent,$avail/1048576);
    < [modified $perfdata line lost when the comment was posted]
    ---
    > $answer = sprintf("%s - total: %d Kb - used %d Kb (%d%%) - free: %d Kb",$volume,$total,$used,$percent,$avail);
    > $perfdata = sprintf("NetApp %s Used Space=%dKB;%d;%d;0;%d",$volume,$used,$total*$warning/100,$total*$critical/100,$total);
    215,217c207,209
    < my $used = abs $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesUsed}{$key};
    < my $total = abs $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesTotal}{$key};
    < my $avail = abs $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesAvail}{$key};
    ---
    > my $used = $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesUsed}{$key};
    > my $total = $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesTotal}{$key};
    > my $avail = $snmpresponse{$snmpfilesysdfTabledfEntrydfKBytesAvail}{$key};
    219,220c211,212
    < $answer .= sprintf("%s - total: %d Gb - used %d Gb (%d%%) - free: %d Gb\n",$volume,$total/1048576,$used/1048576,$percent,$avail/1048576);
    < [modified $perfdata line lost when the comment was posted]
    ---
    > $answer .= sprintf("%s - total: %d Kb - used %d Kb (%d%%) - free: %d Kb\n",$volume,$total,$used,$percent,$avail);
    > $perfdata .= sprintf("NetApp %s Used Space=%dKB;%d;%d;0;%d",$volume,$used,$total*$warning/100,$total*$critical/100,$total);
    450d441
    <
    [nagios@lrdcsvcdsk1 libexec]$

    Charles Richmond http://www.iisc.com
    VDR2 Pit-os Talamban, Cebu City, RP


  10. Looks like Ian found the ‘abs’ before I did…

  11. andrewrivett Says:

    I’m trying to get this working but the script seems to be missing some lines around 166. Perhaps something got lost in adding to this site?

    andrew.

  12. kbn Says:

    Andrew,

    Yes there was a problem with the script when I moved it to WordPress.com from my other host. I think I’ve fixed it now, so please try again and let me know if you have any problems.

    –ken


  13. Hi, I’m Thomas and I’m a developer of a new open source project named BrainPDM. As you can see from our web site, this open source application can store performance data from Nagios and graph the values as hourly, daily, weekly, monthly and yearly charts. If you want, you can try it and give us some feedback.

  14. Ian Collier Says:

    Of course, what I never got round to adding was that although wrapping that result in abs gets rid of the negative sign and so stops Nagios returning errors, the overflow still means that the reported sizes are wrong. The reported size starts going down between 2 and 4 TB and then goes up again from 4 TB, so 4.5 TB, for example, reports as 0.5 TB. I suspect this is a limitation of the information available via SNMP.

    –Ian


  15. [...] Monitoring NetApp with Nagios and Nagiosgraph Potential hooks into GW? (tags: monitor netapp nagios rrd cacti) [...]


  16. A quick commercial plug, so take with a grain of salt, but if you wish to solve your NetApp monitoring issues with no configuration or coding, take a look at http://www.logicmonitor.com. It provides complete performance and fault monitoring (including per volume latency and IO operations), it requires no configuration, and deals with volume instance renumbering.
    Not free like Nagios, but requires no investment of time.

  17. Chris Wicklein Says:

    I don’t think the use of abs to correct negative numbers is correct. It looks like the problem is unsigned 32-bit ints being misinterpreted as signed 32-bit ints. An easy way to fix this is with pack/unpack:

    $used = unpack("I", pack("i", $used));
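
To see why abs and the pack/unpack approach give different answers, here is a minimal sketch for a hypothetical 3 TB volume. The `to_unsigned32` helper name is an assumption made for illustration. It uses the "l"/"L" templates, which are always exactly 32 bits wide, instead of "i"/"I", which follow the platform's C int; on most systems the two behave the same.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reinterpret a signed 32-bit value as an unsigned 32-bit value.
sub to_unsigned32 {
    my ($value) = @_;
    return unpack( 'L', pack( 'l', $value ) );
}

# A 3 TB volume is 3 * 2**30 KB = 3221225472 KB, which does not fit in a
# signed 32-bit integer and wraps around to a negative number.
my $reported = 3221225472 - 4294967296;    # -1073741824

my $with_abs    = abs $reported;              # 1073741824 KB (1 TB, wrong)
my $with_unpack = to_unsigned32($reported);   # 3221225472 KB (3 TB, correct)

print "abs():    $with_abs KB\n";
print "unpack(): $with_unpack KB\n";
```

This only recovers values below 4 TB (2**32 KB); anything larger has already lost its high bits in a single 32-bit value.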

  18. morbid Says:

    Latency and time outs

    The systems that are <10ms away checks are coming back just fine
    Anything further away I’m doing -t 150.
    During manual check checks come back fine, but automatic checks still come back timed out.
    Any ideas?

    Also volumes over 10TB come back with funky numbers, free space is OK, but the totals are incorrect.

  19. Scott Murphy Says:

    The problem is that the SNMP v1 and v2c MIBs have 32-bit integers and v3 has 64-bit integers. You need DOT 7.3 to use the v3 MIB.

    To get valid data between 2 and 4TB, you need to use unsigned integers. Larger volume/aggregate sizes are also supported in the v1 and v2c MIB but they are split into a low and high value, so you need to combine them. If you look through the MIB, you will see entries like:

    dfHighAvailKBytes
    dfLowAvailKBytes

    in the dfTable section so you need to shift the first value left 32 bits and add them together.

    I was hoping someone had already put this into the plugin but I guess not yet. I only started looking at this a couple of days ago when I got an insane response for a 6TB volume.

  20. Scott Murphy Says:

    slight modification to that, the high order value is the number of times to multiply by 2^32 and add the low order value to it.


  21. To be accurate, given that NetApp returns signed integers with unsigned content, you have to be careful, otherwise you end up with adding the raw dfLowAvailKBytes, instead of the sign corrected version, or multiplying the negative dfHighAvailKBytes, and ending up with crazy stuff.
    In RPN form, what you want is:
    dfLowTotalKBytes,0,LT,4294967296,dfLowTotalKBytes,+,dfLowTotalKBytes,IF,dfHighTotalKBytes,4294967296,*,+,1024,*

    Another NetApp gotcha is that the units metrics are reported in change between releases. For example, read and write latency is reported in milliseconds before 7.3 and in microseconds after, and it should not be collected at all on early 7.3 code because of bugs.
    This is some of the stuff that LogicMonitor automates and saves lots of time on.


  22. Hm. My RPN got truncated:
    it should be (all one line):
    dfLowTotalKBytes,0,LT,4294967296,dfLowTotalKBytes,+,dfLowTotalKBytes,IF,
    dfHighTotalKBytes,4294967296,*,+,1024,*
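
The high/low combination described in the comments above can be sketched in Perl as follows. The helper names `unsign32` and `combine_kbytes` are illustrative assumptions, and the sketch assumes a Perl built with 64-bit integers. Unlike the RPN expression, it sign-corrects the high word as well as the low word, which should only matter for extremely large values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use constant TWO32 => 4294967296;    # 2**32

# Undo the signed interpretation of a 32-bit value.
sub unsign32 {
    my ($v) = @_;
    return $v < 0 ? $v + TWO32 : $v;
}

# Combine the high and low 32-bit halves (for example dfHighTotalKBytes
# and dfLowTotalKBytes) into a single size in KB.
sub combine_kbytes {
    my ( $high, $low ) = @_;
    return unsign32($high) * TWO32 + unsign32($low);
}

# A 6 TB volume is 6 * 2**30 KB = 6442450944 KB: the high word is 1 and
# the low word is 2147483648, which arrives as -2147483648 when signed.
my $kb = combine_kbytes( 1, -2147483648 );
print "total: $kb KB\n";
print "bytes: ", $kb * 1024, "\n";
```

Multiplying the result by 1024 gives bytes, matching the final `1024,*` step of the RPN expression.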

  23. morbid Says:

    any idea how to fix latency?
    I’ve removed

    # # Just in case of problems, let's not hang Nagios
    # $SIG{'ALRM'} = sub {
    # print "ERROR: No snmp response from $hostname (alarm timeout)\n";
    # exit $ERRORS{'UNKNOWN'};

    and it seems to work while doing the manual checks. Since it’s a distributed system when DMS sends the info to CMS it still errors out

