graphios's Introduction

Graphios

Oct 15, 2014

New graphios 2.0!

What's new?

  • Support for multiple backends (graphite, statsd, librato) (and multiples of each backend if you want)
  • Support for using your service descriptions instead of custom variables
  • Install options (pip, setup.py, rpms)
  • Bugfixes
    • multiple perfdata in 1 line sometimes did weird things
    • quotes in your labels/metrics were sometimes passed through to carbon
    • labels containing multiple '::' could be parsed incorrectly

Introduction

Graphios is a script to emit Nagios perfdata to various upstream metrics processing and time-series (graphing) systems. It's currently compatible with graphite, statsd, Librato and InfluxDB, with Heka and RRDTool support possibly coming soon. Graphios can emit Nagios metrics to any number of supported upstream metrics systems simultaneously.

Requirements

  • A working nagios / icinga / naemon server
  • A functional carbon or statsd daemon, and/or Librato credentials
  • Python 2.6 or later (but not python 3.x) (Is anyone still using 2.4? Likely very little work to make this work under 2.4 again if so. Let me know)

License

Graphios is released under the GPL v2.

Documentation

The goal of graphios is to get nagios perf data into a graphing system like graphite (carbon). Systems like these typically use a dot-delimited metric name to store each metric hierarchically, so it can be easily located later.

Graphios creates these metric names one of two ways.

  1. by reading a pair of custom variables that you configure for services and hosts called _graphiteprefix and _graphitepostfix. Together, these custom variables enable you to control the metric name that gets sent to whatever back-end metrics system you're using. You don't have to set them both, but things will certainly be less confusing for you if you set at least one or the other.

  2. by using your service description in the format:

_graphiteprefix.hostname.service-description._graphitepostfix.perfdata

so if you didn't feel like setting your graphiteprefix and postfix, it would just use:

hostname.service-description.perfdata

If you are using option 2, that means EVERY service will be sent to graphite. You will also want to make sure your service descriptions are consistent, or your backend naming will be really weird.

I think most people will use the first option, so let's work with that for a bit. What gets sent to graphite is this:

graphiteprefix.hostname.graphitepostfix.perfdata

The specific content of the perfdata section depends on each particular Nagios plugin's output.
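
The assembly itself is just dotted string concatenation. Here is a minimal illustrative Python sketch of the naming scheme described above (not graphios' actual code; the function name is made up for the example):

    def metric_path(prefix, hostname, postfix, label):
        # join the non-empty pieces of the metric name with dots
        parts = [p for p in (prefix, hostname, postfix, label) if p]
        return ".".join(parts)

    print(metric_path("graphiteprefix", "hostname", "graphitepostfix", "rta"))
    # -> graphiteprefix.hostname.graphitepostfix.rta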

Simple Example

A simple example is the check_host_alive command (which calls the check_icmp plugin by default). The check_icmp plugin returns the following perfstring:

rta=4.029ms;10.000;30.000;0; pl=0%;5;10;; rtmax=4.996ms;;;; rtmin=3.066ms;;;;

If we configured a host with a custom graphiteprefix variable like this:

define host {
    host_name                   myhost
    check_command               check_host_alive
    _graphiteprefix             ops.nagios01.pingto
}

Graphios will construct and emit the following metric name to the upstream metric system:

ops.nagios01.pingto.myhost.rta 4.029 nagios_timet
ops.nagios01.pingto.myhost.pl 0 nagios_timet
ops.nagios01.pingto.myhost.rtmax 4.996 nagios_timet
ops.nagios01.pingto.myhost.rtmin 3.066 nagios_timet

Where nagios_timet is a Unix epoch timestamp from when the plugin results were received by Nagios core. Your prefix is, of course, entirely up to you. In our example, our prefix refers to the team that created the metric (Ops), because our upstream metrics system is used by many different teams. After the team name, we've identified the specific Nagios host that took this measurement, because we actually have several Nagios boxes, and finally, 'pingto' is the name of this specific metric: the ping time from nagios01 to myhost.
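
For reference, here is roughly how a perfstring like the one above breaks down into label/value pairs. This is a simplified Python sketch of the parsing (graphios' real code also keeps the unit of measure and handles more edge cases; a snippet of it is quoted in the issues below):

    import re

    perfstring = "rta=4.029ms;10.000;30.000;0; pl=0%;5;10;; rtmax=4.996ms;;;; rtmin=3.066ms;;;;"

    for metric in perfstring.split():
        label, data = metric.split("=", 1)
        raw_value = data.split(";")[0]               # the value is the first ';'-separated field
        value = re.sub("[a-zA-Z%]", "", raw_value)   # strip the unit of measure (ms, %, ...)
        print(label, value)
    # rta 4.029
    # pl 0
    # rtmax 4.996
    # rtmin 3.066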

Another example

Let's take a look at the check_load plugin, which returns the following perfdata:

load1=8.41;20;22;; load5=6.06;18;20;; load15=5.58;16;18

Our service is defined like this:

define service {
    service_description         Load
    host_name                   myhost
    _graphiteprefix             datacenter01.webservers
    _graphitepostfix            nrdp.load
}

With this configuration, graphios generates the following metric names:

datacenter01.webservers.myhost.nrdp.load.load1 8.41 nagios_timet
datacenter01.webservers.myhost.nrdp.load.load5 6.06 nagios_timet
datacenter01.webservers.myhost.nrdp.load.load15 5.58 nagios_timet

As you can probably guess, our custom prefix in this example identifies the specific data center and server type from which these metrics originated, while our postfix refers to the check_nrdp plugin (the means by which we collected the data), followed finally by the metric type.

You should think carefully about how you name your metrics, because later on, these names will enable you to easily combine metrics (like load1) across various sources (like all webservers).

Using metric_base_path to add a universal prefix

In an environment where multiple things are feeding metrics into your backend service, it can be handy to differentiate by source. Normally, you would need to prepend the same string to the graphiteprefix of every service and host, but in some cases this isn't possible or feasible.

When you want everything to be prepended with the same string, use the metric_base_path setting:

metric_base_path	= mycorp.nagios

Note that quotes will be preserved. Also, _graphiteprefix and _graphitepostfix will be applied in addition to this string, so if you are already adding mycorp.nagios to your prefix, you will end up with mycorp.nagios.mycorp.nagios.metricname

A few words on Naming things for Librato

The default configuration that works for Graphite also does what you'd expect for Librato, so if you're just getting started and want to check out Librato, don't worry about it: ignore this section and forge ahead.

But if you're a power user, you should be aware that the Librato backend actually generates a different metric name than the other backends. Librato is a very metrics-centric platform. Metrics are the first-class entity, and sources (like hosts) are actually a separate dimension in their system. This is very cool when you're monitoring ephemeral things that aren't hosts, like threads or worker processes, but it slightly complicates things here.

So, for example, where the Graphite plugin generates a name like this (from the example above):

datacenter01.webservers.myhost.nrdp.load.load1

The Librato plugin will generate a name that omits the hostname:

datacenter01.webservers.nrdp.load.load1

And then it will automatically send the hostname as the source dimension when it emits the metric to Librato. For 99% of everyone, this is exactly what you want. But if you're a 1%'er, you can influence this behavior by modifying the "namevals" and "sourcevals" lists in the librato section of graphios.cfg.
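
To illustrate the difference, here is a small Python sketch (not the Librato backend's actual code) of how the same metric ends up shaped for each backend:

    prefix, hostname, postfix, label = "datacenter01.webservers", "myhost", "nrdp.load", "load1"

    # Graphite: hostname is embedded in the dotted metric name
    graphite_name = ".".join([prefix, hostname, postfix, label])
    # -> datacenter01.webservers.myhost.nrdp.load.load1

    # Librato: hostname is split out as the 'source' dimension
    librato_name = ".".join([prefix, postfix, label])
    librato_source = hostname
    # -> name: datacenter01.webservers.nrdp.load.load1    source: myhost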

Automatic names

Version 2.0: Graphios now supports automatic names, because custom variables are hard. :)

This is an all or nothing setting, meaning if you turn this on, all services will now be sent to your backends (instead of just the ones with the prefix and postfix set up). This will work fine, so long as you have very consistent service descriptions.

To turn this on, modify the graphios.cfg and change:

use_service_desc = False

to use_service_desc = True

You can still use the graphite prefix and postfix variables but you don't have to.

Big Fat Warning

Graphios assumes each of your checks always reports in the same unit of measurement. Most plugins do this; some do not. check_icmp, for example, always reports in ms.
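
The reason this matters is that the unit of measure gets stripped from the value when the metric is built, so samples reported in different units become bare numbers that only look comparable. A quick illustration (assuming the same kind of stripping shown in the parsing snippet quoted later in the issues):

    import re

    def strip_uom(raw):
        # drop the unit suffix, keep only the number
        return float(re.sub("[a-zA-Z%]", "", raw))

    # the same duration reported in two different units graphs as two wildly different values
    print(strip_uom("0.5s"), strip_uom("500ms"))   # 0.5 500.0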

Installation

This is recommended for intermediate+ Nagios administrators. If you are just learning Nagios this might be a difficult pill to swallow depending on your experience level.

Hundreds of people have emailed me their success stories on getting graphios working. I have been using this in production on a medium size nagios installation for a couple years.

There are now a few ways to get graphios installed.

1 - Use pypi

    pip install graphios
NOTE: This will attempt to find your nagios.cfg and add configuration
steps 1 and 2 for you (don't worry, we back up the file before touching it).

NOTE2: If you get the error:
Could not find a version that satisfies the requirement graphios
this is because graphios is still in the beta category. I will remove
this in a few weeks, so until then you need to:
    pip install --pre graphios

2 - Clone it yourself

    git clone https://github.com/shawn-sterling/graphios.git
    cd graphios

Then do one of the following three things (depending on what you like best):

1 - Python setup

    python setup.py install

2 - Create + Install RPM

    python setup.py bdist_rpm
    yum localinstall bdist/graphios-$version.rpm

3 - Copy the files where you want them to be

    cp graphios*.py /my/dir
    cp graphios.cfg /my/dir

Configuration

Setting this up on the nagios front is very much like pnp4nagios with npcd. (You do not need to have any pnp4nagios experience at all.) If you are already running pnp4nagios, check out my pnp4nagios notes (below).

Steps:

(1) graphios.cfg

The default location for graphios.cfg is /etc/graphios/graphios.cfg; graphios also checks the directory that graphios.py is in.

Your graphios.cfg can live anywhere you want, but if it's not in the above locations you will need to modify your init script to match.

Out of the box, it enables the carbon back-end and sends pickled metrics to 127.0.0.1:2004. It also specifies the location of the graphios log and spool directories, and controls things like log levels, sleep intervals, and of course, backends like carbon, statsd, and librato.

The config file is well commented; adding/changing backends is very simple.

(2) nagios.cfg

Your nagios.cfg is going to need to be modified so that Nagios writes perfdata out to files that graphios can read. Depending on how you installed graphios, this step may have been done for you.

The following needs to be put into your nagios.cfg

service_perfdata_file=/var/spool/nagios/graphios/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tGRAPHITEPREFIX::$_SERVICEGRAPHITEPREFIX$\tGRAPHITEPOSTFIX::$_SERVICEGRAPHITEPOSTFIX$

service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=graphite_perf_service

host_perfdata_file=/var/spool/nagios/graphios/host-perfdata
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tGRAPHITEPREFIX::$_HOSTGRAPHITEPREFIX$\tGRAPHITEPOSTFIX::$_HOSTGRAPHITEPOSTFIX$

host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=graphite_perf_host

This references some custom variable macros, specifically:

for services: $_SERVICEGRAPHITEPREFIX$ and $_SERVICEGRAPHITEPOSTFIX$

for hosts: $_HOSTGRAPHITEPREFIX$ and $_HOSTGRAPHITEPOSTFIX$

The prepended HOST and SERVICE is just the way Nagios expands custom variables: _HOSTGRAPHITEPREFIX means it's the _GRAPHITEPREFIX variable from the host configuration.
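
Each line Nagios writes with these templates is a series of tab-separated KEY::VALUE pairs (a raw example line is quoted in the issues further down). A minimal Python sketch of how such a line could be split back into fields (illustrative only, not graphios' actual parser):

    line = ("DATATYPE::SERVICEPERFDATA\tTIMET::1410870970\tHOSTNAME::localhost\t"
            "SERVICEDESC::PING\tSERVICEPERFDATA::rta=0.060000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0\t"
            "GRAPHITEPREFIX::nagios.pingto\tGRAPHITEPOSTFIX::$_SERVICEGRAPHITEPOSTFIX$")

    fields = dict(pair.split("::", 1) for pair in line.split("\t"))
    print(fields["HOSTNAME"], fields["TIMET"])
    print(fields["SERVICEPERFDATA"])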

(3) nagios commands

There are 2 commands we set up in the nagios.cfg which, if you used pip or the rpm/deb, may have already been set up for you. We need:

graphite_perf_service
graphite_perf_host

Which we now need to define:

I use include dirs, so I make a new file called graphios_commands.cfg inside my include dir. Do that, or add the below commands to one of your existing nagios config files.

NOTE: Your spool directory may be different; this is set up in step (2) via service_perfdata_file and host_perfdata_file.

define command {
    command_name            graphite_perf_host
    command_line            /bin/mv /var/spool/nagios/graphios/host-perfdata /var/spool/nagios/graphios/host-perfdata.$TIMET$

}

define command {
    command_name            graphite_perf_service
    command_line            /bin/mv /var/spool/nagios/graphios/service-perfdata /var/spool/nagios/graphios/service-perfdata.$TIMET$
}

All these commands do is move the current file to a timestamped filename that we can process without interrupting nagios. This way nagios doesn't have to sit around waiting for us to process the results.

(4) Run it!

We recommend running graphios.py from the console the first time; this will make sure things are sending the way you think they are. A good example would be:

./graphios.py --spool-directory /var/spool/nagios/graphios \
--log-file /tmp/graphios.log \
--backend carbon \
--server 127.0.0.1:2004 \
--test

and if there are problems add

--verbose

Other command line options:

Usage: graphios.py [options]
sends nagios performance data to carbon.

Options:
  -h, --help            show this help message and exit
  -v, --verbose         sets logging to DEBUG level
  --spool-directory=SPOOL_DIRECTORY
                        where to look for nagios performance data
  --log-file=LOG_FILE   file to log to
  --backend=BACKEND     sets which storage backend to use
  --config=CONFIG       set custom config file location
  --test                Turns on test mode, which won't send to backends
  --replace_char=REPLACE_CHAR
                        Replacement Character (default '_'
  --sleep_time=SLEEP_TIME
                        How much time to sleep between checks
  --sleep_max=SLEEP_MAX
                        Max time to sleep between runs
  --server=SERVER       Server address (for backend)
  --no_replace_hostname
                        Replace '.' in nagios hostnames, default on.
  --reverse_hostname    Reverse nagios hostname, default off.


NOTE: If you use --config on the command line, every other command line option is ignored; your --config will overwrite everything else.

(5) Optional init script: graphios

Remember: screen is not a daemon management tool.

If you installed with pip/setup.py/rpm this part should be done for you!

Take a look in the init/ directory and find your OS of choice.

For debian/ubuntu:

    cp init/debian/graphios /etc/init.d/
    cp init/debian/graphios.conf /etc/init
    chmod 755 /etc/init.d/graphios

For rhel/centos/sl < 6:

    cp init/rhel/graphios /etc/init.d
    chmod 755 /etc/init.d/graphios

For systems with systemd:

    cp init/systemd/graphios.service /usr/lib/systemd/system

NOTE: You may need to change the location and username that the script runs as. This varies slightly depending on where you decided to put graphios.py.

The lines you will likely have to change:

prog="/opt/nagios/bin/graphios.py"
# or use the command line options:
#prog="/opt/nagios/bin/graphios.py --log-file=/dir/mylog.log --spool-directory=/dir/my/spool"
GRAPHIOS_USER="nagios"

(6) Your host and service configs

Once you have done the above you need to add a custom variable to the hosts and services that you want sent to graphite. (Unless you are using service descriptions, in which case you can skip this step)

The format that will be sent to carbon is:

_graphiteprefix.hostname._graphitepostfix.perfdata

You do not need to set both graphiteprefix and graphitepostfix. Just one or the other will do. If you do not set at least one of them, the data will not be sent to graphite at all (unless you are using the service descriptions)
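
Put another way, whether a given host or service gets emitted at all comes down to a rule like this (a sketch of the documented behaviour, not the actual graphios function):

    def will_emit(graphiteprefix, graphitepostfix, use_service_desc=False):
        # sent upstream only if at least one custom variable is set,
        # or if automatic names (use_service_desc) are enabled
        return bool(graphiteprefix or graphitepostfix or use_service_desc)

    print(will_emit("monitoring.nagios01.pingto", ""))  # True
    print(will_emit("", ""))                            # False -- nothing is sent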

Examples:

define host {
    name                        myhost
    check_command               check_host_alive
    _graphiteprefix             monitoring.nagios01.pingto
}

Which would create the following graphite entries with data from the check_host_alive plugin:

monitoring.nagios01.pingto.myhost.rta
monitoring.nagios01.pingto.myhost.rtmin
monitoring.nagios01.pingto.myhost.rtmax
monitoring.nagios01.pingto.myhost.pl

define service {
    service_description         MySQL threads connected
    host_name                   myhost
    check_command               check_mysql_health_threshold!threads-connected!3306!1600!1800
    _graphiteprefix             monitoring.nagios01.mysql
}

Which gives us:

monitoring.nagios01.mysql.myhost.threads_connected

See the Documentation (above) for more explanation on how this works.

Upgrading

To upgrade from the old version of graphios, you need to:

  1. Look at the things you changed in the old graphios.py (carbon_server, spool_directory, log_file location, etc)
  2. Edit your new graphios.cfg and put those options there instead. You should NOT have to modify the new graphios.py.

Why Upgrade?

The new version fixes some bugs, has cooler optional backends, and supports multiple backends, including multiple carbon servers. I don't think any major performance increases have been made, so if it isn't broken, don't fix it.

PNP4Nagios Notes:

Are you already running pnp4nagios? And want to just try this out and see if you like it? Cool! This is very easy to do without breaking your PNP4Nagios configuration (but do a backup just in case).

Steps:

(1) In your nagios.cfg:

Add the following at the end of your host_perfdata_file_template:

\tGRAPHITEPREFIX::$_HOSTGRAPHITEPREFIX$\tGRAPHITEPOSTFIX::$_HOSTGRAPHITEPOSTFIX$

and at the end of your service_perfdata_file_template:

\tGRAPHITEPREFIX::$_SERVICEGRAPHITEPREFIX$\tGRAPHITEPOSTFIX::$_SERVICEGRAPHITEPOSTFIX$

This will add the variables to your check results, and will be ignored by pnp4nagios.

(2) Change your commands:

(find your command names under host_perfdata_file_processing_command and service_perfdata_file_processing_command in your nagios.cfg)

You likely have 2 commands setup that look something like these two:

define command{
       command_name    process-service-perfdata-file
       command_line    /bin/mv /usr/local/pnp4nagios/var/service-perfdata /usr/local/pnp4nagios/var/spool/service-perfdata.$TIMET$
}

define command{
       command_name    process-host-perfdata-file
       command_line    /bin/mv /usr/local/pnp4nagios/var/host-perfdata /usr/local/pnp4nagios/var/spool/host-perfdata.$TIMET$
}

Instead of just moving the file, move it and then copy it, so we can point graphios at the copy.

You can do this by either:

(1) Change the command_line to something like:

command_line    "/bin/mv /usr/local/pnp4nagios/var/host-perfdata /usr/local/pnp4nagios/var/spool/host-perfdata.$TIMET$ && cp /usr/local/pnp4nagios/var/spool/host-perfdata.$TIMET$ /usr/local/pnp4nagios/var/spool/graphios"

OR

(2) Make a script:

#!/bin/bash
# Nagios passes the $TIMET$ macro in as the first argument ($1)
/bin/mv /usr/local/pnp4nagios/var/host-perfdata /usr/local/pnp4nagios/var/spool/host-perfdata.$1
cp /usr/local/pnp4nagios/var/spool/host-perfdata.$1 /usr/local/pnp4nagios/var/spool/graphios

change the command_line to be:
command_line    /path/to/myscript.sh $TIMET$

You should now be able to skip steps 2 and 3 on the configuration instructions.

OMD (Open Monitoring Distribution) Notes:

These instructions are for OMD >= 1.2x (including the current nightly builds).

Note: All steps below are assumed to be carried out under your OMD site's user.

(1) Change PNP4NAGIOS to use "NPCD with Bulk Mode" instead of NPCDMOD. This is done by redirecting the symlink for pnp4nagios.cfg:

ln -sf ~/etc/pnp4nagios/nagios_npcd.cfg ~/etc/nagios/nagios.d/pnp4nagios.cfg

(2) Update ~/etc/pnp4nagios/nagios_npcd.cfg (remember to replace SITENAME).

#
# PNP4Nagios Bulk Mode with npcd
#
process_performance_data=1

#
# service performance data
#
service_perfdata_file=/omd/sites/SITENAME/var/pnp4nagios/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tGRAPHITEPREFIX::$_SERVICEGRAPHITEPREFIX$\tGRAPHITEPOSTFIX::$_SERVICEGRAPHITEPOSTFIX$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=omd-process-service-perfdata-file

#
# host performance data
#
host_perfdata_file=/omd/sites/SITENAME/var/pnp4nagios/host-perfdata
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tGRAPHITEPREFIX::$_HOSTGRAPHITEPREFIX$\tGRAPHITEPOSTFIX::$_HOSTGRAPHITEPOSTFIX$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=omd-process-host-perfdata-file

(3) Update etc/nagios/conf.d/pnp4nagios.cfg (remember to replace SITENAME).

define command{
       command_name    omd-process-service-perfdata-file
       command_line    /bin/mv /omd/sites/SITENAME/var/pnp4nagios/service-perfdata /omd/sites/SITENAME/var/pnp4nagios/spool/service-perfdata.$TIMET$ && cp /omd/sites/SITENAME/var/pnp4nagios/spool/service-perfdata.$TIMET$ /omd/sites/SITENAME/var/graphios/spool/
}

define command{
       command_name    omd-process-host-perfdata-file
       command_line    /bin/mv /omd/sites/SITENAME/var/pnp4nagios/host-perfdata /omd/sites/SITENAME/var/pnp4nagios/spool/host-perfdata.$TIMET$ && cp /omd/sites/SITENAME/var/pnp4nagios/spool/host-perfdata.$TIMET$ /omd/sites/SITENAME/var/graphios/spool/
}

(4) Optional: If you don't want PNP4NAGIOS to ever see perfdata for checks that Graphios is exporting data for, you can modify the ~/etc/nagios/conf.d/pnp4nagios.cfg command lines to remove data with a grep. In the below case, we grep out a specific string (GRAPHITEPREFIX::lustre) to remove perfdata containing that string. This involves a little more moving around of files, but nothing excessive, and stops PNP4NAGIOS from trying to generate RRD files with that data. (Again, remember to change SITENAME.)

define command{
       command_name    omd-process-service-perfdata-file
       #command_line    /bin/mv /omd/sites/SITENAME/var/pnp4nagios/service-perfdata /omd/sites/SITENAME/var/pnp4nagios/spool/service-perfdata.$TIMET$
###GRAPHITE SETTING### ADDED REDIRECTION TO REMOVE exportstats
       command_line    /bin/mv /omd/sites/SITENAME/var/pnp4nagios/service-perfdata /omd/sites/SITENAME/var/pnp4nagios/service-perfdata.$TIMET$ && /bin/cp /omd/sites/SITENAME/var/pnp4nagios/service-perfdata.$TIMET$ /omd/sites/SITENAME/var/graphios/spool/ && grep -v GRAPHITEPREFIX\:\:lustre /omd/sites/SITENAME/var/pnp4nagios/service-perfdata.$TIMET$ > /omd/sites/SITENAME/var/pnp4nagios/spool/service-perfdata.$TIMET$ && /bin/rm /omd/sites/SITENAME/var/pnp4nagios/service-perfdata.*

}

define command{
       command_name    omd-process-host-perfdata-file
       #command_line    /bin/mv /omd/sites/SITENAME/var/pnp4nagios/host-perfdata /omd/sites/SITENAME/var/pnp4nagios/spool/host-perfdata.$TIMET$
####GRAPHITE SETTING### ADDED REDIRECTION TO REMOVE exportstats
       command_line    /bin/mv /omd/sites/SITENAME/var/pnp4nagios/host-perfdata /omd/sites/SITENAME/var/pnp4nagios/host-perfdata.$TIMET$ && /bin/cp /omd/sites/SITENAME/var/pnp4nagios/host-perfdata.$TIMET$ /omd/sites/SITENAME/var/graphios/spool/ && grep -v GRAPHITEPREFIX\:\:lustre /omd/sites/SITENAME/var/pnp4nagios/host-perfdata.$TIMET$ > /omd/sites/SITENAME/var/pnp4nagios/spool/host-perfdata.$TIMET$ && /bin/rm /omd/sites/SITENAME/var/pnp4nagios/host-perfdata.*
}

Check_MK Notes:

How to set custom variables for services and hosts using check_mk config files. (For OMD please don't overlook the notes above).

(1) For host perf data just create a new file named "extra_host_conf.mk" inside your check_mk conf.d dir.

extra_host_conf["_graphiteprefix"] = [
  ( "DESIREDPREFIX.ping", ALL_HOSTS),
]

(2) Run check_mk -O to update and reload Nagios.

(3) Test via "check_mk -N hostname | less" to see if your prefix or postfix is there.

For service perf data create a file called "extra_service_conf.mk". Remember you can use your host tags or any of the usual tricks with check_mk config files.

extra_service_conf["_graphiteprefix"] = [
  ( "DESIREDPREFIX.check_mk", ALL_HOSTS, ["Check_MK"]),
  ( "DESIREDPREFIX.cpu.load", ALL_HOSTS, ["CPU load"]),
]

Tip: An easy way to produce graphite keys in the format: $company.$server.$metric is:

(1) Set metric_base_path to $company in graphios.cfg.

(2) In your 'extra' check_mk config files set your graphitepostfix to $metric, and set no graphiteprefix.

extra_host_conf["_graphitepostfix"] = [
  # e.g. mycompany.server123.ping
  ( "ping", ALL_HOSTS),
]

extra_service_conf["_graphitepostfix"] = [
  # e.g. mycompany.server123.cpu.load
  ( "cpu.load", ALL_HOSTS, ["CPU load"]),
]

Trouble getting it working?

Many people are running graphios now (cool!), but if you are having trouble getting it working, let me know. I am not offering to teach you how to set up Nagios; this is for intermediate+ nagios users. Email me at [email protected] and I will do what I can to help.

Got it working?

Cool! Drop me a line and let me know how it goes.

Find a bug?

Open an Issue on github and I will try to fix it asap.

Contributing

I'm open to any feedback / patches / suggestions.

Special Thanks

Special thanks to Dave Josephsen who added the multiple backend support and worked with me to design and build the new version of graphios.

Shawn Sterling [email protected]

graphios's People

Contributors

adrianlzt, andrewawagner, diyan, gjedeer, gummiboll, hcooper, hufman, jerdmann, kcorupe, meadhikari001, oneingan, richardpoole42, roelvs, ryepup, saz, standalonesa, vlaadbrain

graphios's Issues

Proposal: Move metric manipulation to the backend modules

Upon further reading, backends like influxdb 0.9 can support spaces and other characters that things like graphite can't. It would be useful to have the metric/tag manipulation done in each backend to allow full use of that backend's capabilities.

Optionally, a configuration section could be added to each backend config when character replacement needs to be performed.

Thoughts?

carbon_servers does not work for multiple servers

carbon_servers = server1:2004,server2:2004

throws error:

April 22 19:36:29 graphios_backends.py WARNING Can't connect to carbon: server2:2004 [Errno 9] Bad file descriptor
April 22 19:36:29 graphios_backends.py CRITICAL Can't send message to carbon error:[Errno 9] Bad file descriptor
April 22 19:36:29 graphios.py CRITICAL keeping ********/host-perfdata.1429730053, insufficent metrics sent from carbon.

Sorry not setup for pulls but here's a proposed fix

# diff /usr/lib/python2.6/site-packages/graphios_backends.py ~/git/graphios/graphios_backends.py
363a364
>         sock = socket.socket()
366d366
<             sock = socket.socket()

[Errno 2] No such file or directory - processing perfdata

Graphios works well then this error is showing up:

December 16 16:09:45 graphios.py CRITICAL couldn't remove file /omd/sites/prod/var/pnp4nagios/spool/service-perfdata.1450278571 error:[Errno 2] No such file or directory: '/omd/sites/prod/var/pnp4nagios/spool/service-perfdata.1450278571'
couldn't remove file /omd/sites/prod/var/pnp4nagios/spool/service-perfdata.1450278571 error:[Errno 2] No such file or directory: '/omd/sites/prod/var/pnp4nagios/spool/service-perfdata.1450278571'
Traceback (most recent call last):

File "/usr/local/bin/graphios.py", line 567, in module
main()
File "/usr/local/bin/graphios.py", line 546, in main
process_spool_dir(spool_directory)
File "/usr/local/bin/graphios.py", line 444, in process_spool_dir
if check_skip_file(perfdata_file, file_dir):
File "/usr/local/bin/graphios.py", line 476, in check_skip_file
if os.stat(file_dir)[6] == 0:

OSError: [Errno 2] No such file or directory: '/omd/sites/prod/var/pnp4nagios/spool/host-perfdata.1450278571'

After this error, graphios hangs, no perfdata is processed.

Graphios metrics limitations?

Looks like we're having a problem and I'm wondering if there is a limitation to the number of metrics graphios can handle. We've setup some metric endpoints that return many values and in graphios log we're now getting a lot of:

October 08 08:33:16 graphios.py WARNING message not sent to graphite, file not deleted.
October 08 08:33:16 graphios.py CRITICAL Can't send message to carbon error:[Errno 32] Broken pipe

I'm also wondering if it could be a naming format issue (with a "/" or "" etc). Have you seen this issue before? Any thoughts?

Thanks.

Hostnames should be inverted or flattened

Hi,

The current metric setup causes the wrong ordering of hosts.

Imagine a dns structure like

host.datacenter.company.tld

The current ordering means that the top level of the hierarch is "host".
It would be better to organize the hosts in inverse order (or in the way DNS would resolve them):

tld.company.datacenter.host

Alternately it might make sense to flatten the hosts, e.g. like this

host_datacenter_company_tld

It would be nice if graphios could support those modes.

Refuses to enable InfluxDB

Following the instructions I couldn't get it to connect to influxdb.

Running an InfluxDB 0.8.8 fresh install, no ssl, simple password.

Steps I followed.

  • Cloned repo, copied graphios*py to /opt/graphios, graphios.cfg to /etc/graphios/
  • Set up all the required nagios stuff.
  • Edited cfg file
  • ./graphios.py --verbose

It doesn't show InfluxDB as a backend.

My influxdb settings on graphios.cfg are as follows:

#------------------------------------------------------------------------------
# InfluxDB Details (comment in if you are using InfluxDB)
#------------------------------------------------------------------------------

#enable_influxdb = False

# Comma separated list of server:ports
# defaults to 127.0.0.1:8086 (:8087 if using SSL).
influxdb_servers = 127.0.0.1:8086

# SSL, defaults to False
#influxdb_use_ssl = True

# Database-name, defaults to nagios
influxdb_db = nagios

# Credentials (required)
influxdb_user = root
influxdb_password = notmyrealpasswd

# Max metrics to send / request, defaults to 250
influxdb_max_metrics = 500

# Flag the InfluxDB backend as 'non essential' for the purposes of error checking
#nerf_influxdb = False

Running it:

root@ip-10-0-1-63:/opt/graphios# ./graphios.py --verbose
Enabled backends: []
graphios startup.
Processing spool directory /var/spool/nagios/graphios
Processed 0 files (0 metrics) in /var/spool/nagios/graphios
graphios sleeping.

My /var/log/graphios.log

June 09 14:54:20 graphios.py INFO Enabled backends: []
June 09 14:54:20 graphios.py INFO graphios startup.
June 09 14:54:20 graphios.py DEBUG Processing spool directory /var/spool/nagios/graphios
June 09 14:54:20 graphios.py INFO Processed 0 files (0 metrics) in /var/spool/nagios/graphios
June 09 14:54:20 graphios.py DEBUG graphios sleeping.
June 09 14:54:22 graphios.py INFO ctrl-c pressed. Exiting graphios.

Any ideas?? If I made some dumb mistake please let me know.

graphios config statsd server line uses wrong key

Hello,

I tried to set my statsd server in /etc/graphios/graphios.cfg by changing that line from

statsd_servers = 127.0.0.1:8125

to

statsd_servers = MY_SERVER_IP:8125

it would still send statsd data to localhost. See logs below (debug mode was set to True)

December 01 21:56:42 graphios_backends.py DEBUG sending to statsd at 127.0.0.1:8125
December 01 21:56:42 graphios.py DEBUG deleted /var/spool/nagios/graphios/host-perfdata.1449006992
December 01 21:56:42 graphios.py INFO Processed 2 files (0 metrics) in /var/spool/nagios/graphios
December 01 21:56:42 graphios.py DEBUG graphios sleeping.

I was looking at the source and noticed that the key used to get the statsd server is not the same as in the default config

The following snippet is from graphios_backends.py line 401

class statsd(object):
    def __init__(self, cfg):
        self.log = logging.getLogger("log.backends.statsd")
        self.log.info("Statsd backend initialized")
        try:
            cfg['statsd_servers']
        except:
            self.statsd_servers = '127.0.0.1'
        else:
            self.statsd_servers = cfg['statsd_servers']

Looks like it uses "statsd_servers" instead of "statsd_server" key. Once I changed that in my /etc/graphios/graphios.cfg file, all worked swell.

I installed graphios using option num 1 (pip install method).

-Daniel

Graphios not parsing spaces in perfdata (?)

Shawn, I'd sent you an email about this...but I got a little farther...

failed to parse label: 'physical' part of perfstring 'physical memory %=23%;80;90 physical memory=955.941M;3276.441;3685.996;0;4095.551 virtual memory %=0%;80;90 virtual memory=355.348M;6710886.3;7549747.087;0;8388607.875 paged bytes %=18%;80;90 paged bytes=886.578M;3839.641;4319.596;0;4799.551 page file %=18%;80;90 page file=886.578M;3839.641;4319.596;0;4799.551'
failed to parse label: 'virtual' part of perfstring 'physical memory %=23%;80;90 physical memory=955.941M;3276.441;3685.996;0;4095.551 virtual memory %=0%;80;90 virtual memory=355.348M;6710886.3;7549747.087;0;8388607.875 paged bytes %=18%;80;90 paged bytes=886.578M;3839.641;4319.596;0;4799.551 page file %=18%;80;90 page file=886.578M;3839.641;4319.596;0;4799.551'
failed to parse label: 'memory' part of perfstring 'physical memory %=23%;80;90 physical memory=955.941M;3276.441;3685.996;0;4095.551 virtual memory %=0%;80;90 virtual memory=355.348M;6710886.3;7549747.087;0;8388607.875 paged bytes %=18%;80;90 paged bytes=886.578M;3839.641;4319.596;0;4799.551 page file %=18%;80;90 page file=886.578M;3839.641;4319.596;0;4799.551'
failed to parse label: 'virtual' part of perfstring 'physical memory %=23%;80;90 physical memory=955.941M;3276.441;3685.996;0;4095.551 virtual memory %=0%;80;90 virtual memory=355.348M;6710886.3;7549747.087;0;8388607.875 paged bytes %=18%;80;90 paged bytes=886.578M;3839.641;4319.596;0;4799.551 page file %=18%;80;90 page file=886.578M;3839.641;4319.596;0;4799.551'
failed to parse label: 'paged' part of perfstring 'physical memory %=23%;80;90 physical memory=955.941M;3276.441;3685.996;0;4095.551 virtual memory %=0%;80;90 virtual memory=355.348M;6710886.3;7549747.087;0;8388607.875 paged bytes %=18%;80;90 paged bytes=886.578M;3839.641;4319.596;0;4799.551 page file %=18%;80;90 page file=886.578M;3839.641;4319.596;0;4799.551'
failed to parse label: 'bytes' part of perfstring 'physical memory %=23%;80;90 physical memory=955.941M;3276.441;3685.996;0;4095.551 virtual memory %=0%;80;90 virtual memory=355.348M;6710886.3;7549747.087;0;8388607.875 paged bytes %=18%;80;90 paged bytes=886.578M;3839.641;4319.596;0;4799.551 page file %=18%;80;90 page file=886.578M;3839.641;4319.596;0;4799.551'
failed to parse label: 'paged' part of perfstring 'physical memory %=23%;80;90 physical memory=955.941M;3276.441;3685.996;0;4095.551 virtual memory %=0%;80;90 virtual memory=355.348M;6710886.3;7549747.087;0;8388607.875 paged bytes %=18%;80;90 paged bytes=886.578M;3839.641;4319.596;0;4799.551 page file %=18%;80;90 page file=886.578M;3839.641;4319.596;0;4799.551'

I'm now seeing metrics into Graphite, but anything Windows-related (check_NRPE?) (has a space in the perfdata label?) is broken.

I am guessing it's part of this:

for metric in mobj.PERFDATA.split():
    try:
        nobj = copy.copy(mobj)
        (nobj.LABEL, d) = metric.split('=')
        v = d.split(';')[0]
        u = v
        nobj.VALUE = re.sub("[a-zA-Z%]", "", v)
        nobj.UOM = re.sub("[^a-zA-Z]+", "", u)
        processed_objects.append(nobj)
    except:
        log.critical("failed to parse label: '%s' part of perf"
                     "string '%s'" % (metric, nobj.PERFDATA))
        continue

Here's some relevant raw spool/graphios perfdata:

DATATYPE::SERVICEPERFDATA       TIMET::1418912534       HOSTNAME::servernameHere  SERVICEDESC::Memory Load        SERVICEPERFDATA::physical memory %=30%;80;90 physical memory=1.201G;3.19899;3.599;0;3.999 virtual memory %=0%;80;90 virtual memory=353.613M;6710886.3;7549747.087;0;8388607.875 paged bytes %=10%;80;90 paged bytes=1.01899G;8.13499;9.15199;0;10.168 page file %=10%;80;90 page file=1.01899G;8.13499;9.15199;0;10.168       SERVICECHECKCOMMAND::check_nrpe!alias_mem       HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::OK        SERVICESTATETYPE::HARD GRAPHITEPREFIX::nagios.01.service       GRAPHITEPOSTFIX::$_SERVICEGRAPHITEPOSTFIX$

...because there are spaces in the perfdata labels?

When I look at the Graphite metric names...clearly the delimiter is breaking. e.g.

nagios.01.service.servernameHere.30s

My Nagios template definitions have: (respectively)

        _graphiteprefix                 nagios.01.host
        _graphiteprefix                 nagios.01.service

I found a setting that allows me to get around this for the time being...

# use service description, most people will NOT want this, read documentation!
use_service_desc = True

Now I see valid metrics & selections coming up in Graphite...Shawn, if you have any ideas how to fix the template stuff, that'd be great. If not, this will work, too!

Graphios.py not parsing performance data files

Hello,

I tried to use it, but something is not working as expected:

Processing spool directory /usr/local/nagios/var/graphios
Processed 0 files in /usr/local/nagios/var/graphios

and performance data are in:

:/usr/local/nagios/var/graphios# ls -la
total 2756
drwxr-xr-x 2 nagios nagcmd 4096 Apr 29 14:38 .
drwxrwxr-x 6 nagios nagios 4096 Apr 29 15:52 ..
-rw-rw-r-- 1 nagios nagios 124492 Apr 29 15:51 host-perfdata
-rw-rw-r-- 1 nagios nagios 2675162 Apr 29 15:52 service-perfdata

Can you help me understand what is wrong?

Thank you

_graphiteprefix don't work?

I added _graphiteprefix to a host-pnp object definition, but I don't see GRAPHITEPREFIX:: in the perfdata, so nothing is output to graphite.

I use omd 1.0.

rhel init script uses sudo - cannot be run without a tty

Using a config management tool like puppet, cfengine, etc doesn't work when using the provided init script:

sudo: sorry, you must have a tty to run sudo

I suggest using the daemon function provided by /etc/init.d/functions instead, as it will handle redirecting stdout as well.

pip install makes problematic changes to nagios.cfg

Installing with pip on an openSUSE 13.2 box, these lines were automatically written in the nagios.cfg file:

service_perfdata_file_processing_command=graphios_perf_service
host_perfdata_file_processing_command=graphios_perf_host

They differ from the commands in configuration steps 2 and 3, which are written as "graphite_perf" instead of "graphios_perf," so adding the commands listed in step 3 to the commands.cfg file will produce warnings:

Warning: Service performance file processing command 'graphios_perf_service' was not found - service performance data file will not be processed!
Warning: Host performance file processing command 'graphios_perf_host' was not found - host performance data file will not be processed!

It's easy to alter these, but perhaps either the pip installation or the configuration steps should be changed so that they both reflect the same command names.

Also, the pip install wrote:

cfg_dir=/etc/nagios/objects

at the bottom of my nagios.cfg file. Already having config settings for files in that directory, it produced warnings from Nagios about duplicate commands, as it was trying to read them twice. I'm not really sure it's necessary for graphios to write that line.

pip install Python 2/3 conflicts on openSUSE

I used pip to install graphios on my openSUSE 13.2 box, and it created conflicting files in different locations. graphios.py was installed in two places: /usr/bin and /usr/local/bin. The version installed in /usr/bin has this header:

#!/usr/bin/python3 -tt

when the file is clearly not Python 3 compatible. Running that version at the command line complains of syntax errors due to print statements not using parentheses.

The version of graphios.py in /usr/local/bin is Python 2, but running it fails to import graphios_backends, as graphios_backends.py, which is Python 3, is installed in the Python 3 module directory, but there is no version in the Python 2 module directory.

I copied graphios_backends.py to the Python 2 module directory and the module imports correctly as far as I can tell. I'm still getting errors when running graphios, but they don't seem to be related to that issue.

Config file: log_level not respected.

As far as I can see, the log_level key in the configuration file is not used at all. The level for the python logging module is set to INFO by default, or set to DEBUG if verbose is set to True.

I'd submit a pull request, but was not sure about the best way to go about this. I propose two solutions:

  • A new quiet config key (and maybe a --quiet/-q command line switch) that sets the log level to WARNING. The existing log_level key may be removed/deprecated.
  • Using the log_level key, with the allowed values being the same as the logging levels for the logging module (CRITICAL, ERROR, WARNING, INFO, DEBUG).

After consensus on which way is preferred, I will submit a pull request.

hostgroup as a prefix (or a custom tag)

I am using OMD (Check_mk) with graphios, and able to send the metrics to carbon/graphite.

I am also trying to use grafana for graphing the metrics, and I feel it would help if the metrics have a hostgroup in the name so I can easily fetch them (say .. webservers*).

I am using this in pnp4nagios.cfg file:
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tGRAPHITEPREFIX::$_SERVICEGRAPHITEPREFIX$\tGRAPHITEPOSTFIX::$_SERVICEGRAPHITEPOSTFIX$

graphite tree looks like: mycompany-->myorg-->omd-->10_8_18_2->CPULoad->Load1

Works on Python 2.6.6

I was worried when I saw Python 2.7 as a requirement. But seems to work without issue on RHEL 6.5 w/ Python 2.6.6.

$ cat /etc/redhat-release ; python --version
Red Hat Enterprise Linux Server release 6.5 (Santiago)
Python 2.6.6

remove_character function for Graphios

Hello everybody,

first I will say: a fantastic and good project.

I have a question about modifying the output from graphios to graphite. I use some nagios instances to check windows and linux servers. The perfdata output is sent by graphios to graphite.
For templating of user graphs we use Grafana. The datasource of Grafana is the graphite backend.
The setup works great.

After the last nagios core update we found a problem.
The perfdata from nagios now has apostrophes in the metrics, for example 10m.wsp (CPU metric before the Nagios Core update) versus '10m'.wsp (CPU metric after the Nagios Core update).

In my opinion, it is not advisable to use apostrophes in metrics or file names (of wsp files). Do you agree?
The second problem, based on the apostrophes in the metrics and file names, is the query definition in Grafana. In my tests I can't choose metrics with an apostrophe in the name. Only if I use an asterisk in the Grafana query does it work for me, but that is not user-friendly.
My idea or question is now: why not include in graphios a parameter for removing characters (e.g. the apostrophe)? Similar to the parameter "replacement_character".

At the moment I use a quick and dirty function to solve the problem with the apostrophe in the source code. Here are the modifications in graphios_backends.py (first block modification beginning at line 342 and second block modification beginning at line 428):

    path = "%s%s%s.%s" % (pre, hostname, post, m.LABEL)
    path = re.sub(r"\.$", '', path)  # fix paths that end in dot
    path = re.sub(r"\.\.", '.', path)  # fix paths with double dots
    path = re.sub('\'', '', path) # Remove ' char
    path = self.fix_string(path)
    return path

...

    for m in metrics:
        path = '%s.%s.%s.%s' % (m.GRAPHITEPREFIX, m.HOSTNAME,
                                m.GRAPHITEPOSTFIX, m.LABEL)
        path = re.sub(r'\.$', '', path)  # fix paths that end in dot
        path = re.sub(r'\.\.', '.', path)  # fix paths with empty values
        path = re.sub('\'', '', path) # Remove ' char
        mtype = self.set_type(m)  # gauge|counter|timer|set
        value = "%s|%s" % (m.VALUE, mtype)  # emit literally this to statsd
        metric_tuple = "%s:%s" % (path, value)
        out_list.append(metric_tuple)

A better and more generic solution would be a parameter to remove "not valid" characters from metric names; the default could be the apostrophe.
Or simply including path = re.sub('\'', '', path) in the graphios source code?!

I think it would also be helpful in other monitoring environments using the current Nagios or ICINGA core.

Thanks and best regards
Matthias

Performance bottleneck

I'm running about 30k services and noticing that graphios is falling behind in processing compared to our collectd poller. Graphios is about 10-15m behind while collectd is current. Graphite/carbon does not appear to be the bottleneck. What is graphios' bottleneck? Is there anything that can be done to increase processing speed?

Graphios doesn't send data to graphite

Hello,

thank you for your script.
For now, perfdata are not sent from my Nagios.
They are written in /var/spool/nagios/graphios/ :

#grep GRAPHITEPREFIX::nagios /var/spool/nagios/graphios/service-perfdata.* 
.......................
DATATYPE::SERVICEPERFDATA       TIMET::1410870970       HOSTNAME::localhost     SERVICEDESC::PING       SERVICEPERFDATA::rta=0.060000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0 SERVICECHECKCOMMAND::check_ping!100.0,20%!500.0,60%     HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::OK        SERVICESTATETYPE::HARD       GRAPHITEPREFIX::nagios.pingto   GRAPHITEPOSTFIX::$_SERVICEGRAPHITEPOSTFIX$
DATATYPE::SERVICEPERFDATA       TIMET::1410871270       HOSTNAME::localhost     SERVICEDESC::PING       SERVICEPERFDATA::rta=0.057000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0 SERVICECHECKCOMMAND::check_ping!100.0,20%!500.0,60%     HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::OK        SERVICESTATETYPE::HARD       GRAPHITEPREFIX::nagios.pingto   GRAPHITEPOSTFIX::$_SERVICEGRAPHITEPOSTFIX$
.......................................................

In graphios.log :

graphios.py DEBUG perfdata:rta=0.053000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0
September 16 14:49:12 graphios.py DEBUG parsed_perfdata:[{'perfdata': 'rta=0.053000ms;100.000000;500.000000;0.000000 ', 'value': '0.053000', 'label': 'rta'}, {'perfdata': 'pl=0%;20;60;0', 'value': '0', 'label': 'pl'}]
September 16 14:49:12 graphios.py DEBUG new line = nagios.pingto.localhost.rta 0.053000 1410867970
September 16 14:49:12 graphios.py DEBUG new line = nagios.pingto.localhost.pl 0 1410867970
September 16 14:49:12 graphios.py DEBUG graphite_lines:['nagios.pingto.localhost.rta 0.053000 1410867970', 'nagios.pingto.localhost.pl 0 1410867970']
September 16 14:49:12 graphios.py DEBUG Starting on '/var/spool/nagios/graphios/service-perfdata.1410867171'

Wireshark between nagios & graphite only shows the TCP SYN/ACK handshake; no data is transferred after that.

I'm surely missing something. Could you help me?

support for metric types, statsd

hi shawn,
are there any plans on supporting different metric types like (gauge, counter etc) which could be appended to the performance data and sent to statsd instead of carbon?

Graphios Rearchitecting Proposal

Hi,

I wanted to send a proposal to the core Graphios developer for a bit of re-architecting of the script and see what interest there was. If there is interest to continue, then I'd be happy to fork/do the work/submit a pull request back to upstream; but I did want to check first because it'll be a fairly sweeping change, and if there isn't any interest in receiving the work upstream then I may not bother.

In sum, I like the way that Graphios approaches the problem and fundamentally I like what it does, it just is a little bit different than I would need to support the things that I want to do with it. Specifically, the way that it calculates the graphite metric name isn't very flexible.

What I would like to do is:

  • Replace _graphiteprefix and _graphitepostfix with a single _graphiostemplate Nagios configuration variable that allows interpolation of variables in order to pull in the necessary data/place the metric name.
  • Add _graphiosnamemap, which should end up being some manner of mapping (encoded in a way that makes it work easily) for the perfdata => metric name.

As an example, supposing we have this Nagios configuration:

define service {
  check_command       check_apache
  host_name           www
  process_perf_data   1
  service_description Apache
  _graphiostemplate   Services.Apache.#{host}.#{name}
  _graphiosnamemap    sb-_ => waiting, sb-R => reading, sb-W => writing
}

... and then we have this performance data:

'accesses'=30926c;;;0
'busyworkers'=4;90.0;125.0;0
'bytes'=223452160b;;;0
'bytesperreq'=7225.38;;;0
'bytespersec'=14762;;;0
'reqpersec'=2.04307;;;0
'workers'=12;;;0
'sb-_'=8;;;0
'sb-S'=0;;;0
'sb-R'=2;;;0
'sb-W'=1;;;0
'sb-K'=1;;;0
'sb-D'=0;;;0
'sb-C'=0;;;0
'sb-L'=0;;;0
'sb-G'=0;;;0
'sb-I'=0;;;0
'sb-.'=244;;;0

What I would like is for the following graphite keys to be created:

Services.Apache.www.waiting <value for sb-_>
Services.Apache.www.reading <value for sb-R>
Services.Apache.www.writing <value for sb-W>

Note that in this way fields can actually be DROPPED (i.e. not sent through to Graphite), if they are not listed in _graphiosnamemap. If _graphiosnamemap is not specified, then the current behaviour (send all performance data through on the same names) would hold.

Please also note that the example format that I give above for both _graphiostemplate and _graphiosnamemap are completely arbitrary and only for the sake of this example; I'm not sure exactly what sort of standard would be wanted for the interpolation of _graphiostemplate and the parsing of _graphiosnamemap. Of particular note is that it would be good if there weren't hardcoded field requirements for the performance data that Nagios outputs, so if the template variables were just something like #{fieldX} (where X is a number), that should work fairly well... although beyond that, it would be great if some intelligent "transforms" could be done on the data (i.e. say that the full hostname field is "www.example.com", it would be great if you could separate off the hostname from the FQDN), and that starts to get really complicated when combined with arbitrary field positions.

Anyway, that's an abstract overview of what I would like to do. If there IS interest on behalf of the Graphios project, then I am happy to fork the repo and start work on it, but if there's no interest then I will leave you in peace.

Thanks!

Graphios crashes with OSError No such file or directory

I keep getting the error below:

August 23 11:54:44 graphios.py CRITICAL couldn't remove file /omd/sites/prod/var/pnp4nagios/spool/service-perfdata.1440356069 error:[Errno 2] No such file or directory: '/omd/sites/prod/var/pnp4nagios/spool/service-perfdata.1440356069'
Traceback (most recent call last):
  File "/usr/bin/graphios.py", line 553, in <module>
    main()
  File "/usr/bin/graphios.py", line 532, in main
    process_spool_dir(spool_directory)
  File "/usr/bin/graphios.py", line 430, in process_spool_dir
    if check_skip_file(perfdata_file, file_dir):
  File "/usr/bin/graphios.py", line 462, in check_skip_file
    if os.stat(file_dir)[6] == 0:
OSError: [Errno 2] No such file or directory: '/omd/sites/prod/var/pnp4nagios/spool/host-perfdata.1440356069'

I installed and configured Graphios per the instructions for OMD. I thought maybe it was the version I installed via pip and tried installing from source. Although it took longer for the source install to crash, it did eventually do so with that message. I keep thinking it has something to do with the interval between when the data is generated and when it is removed by graphios. I set the sleep_time to 10 seconds and the service_perfdata_file_processing_interval in nagios_npcdmod.cfg to 15. But it eventually just crashes. Any thoughts on what I am doing wrong here?

Question - check_mp and alternative idea

Hi @shawn-sterling ,

I'm exploring the options to get nagios perfdata -> graphite. So I stumbled upon your project :)
It mentions check_mp to 'normalize' the values of checks. Could you put the code up somewhere to get an idea? Also, what plugins do change their value sizes? I'd be interested to know.

I'm actually thinking of an alternative route putting nagios -> graphite by creating a NEB (nagios event broker) like https://github.com/jedi4ever/nagios-zmq . Saves you from the hassle from moving files and having daemons lying around. Just send UDP to graphite/statsd .

For this approach I'm wondering why the autonaming did not work for you. Was it the cluttering of files, or why do you need to rewrite the names? Thanks for any insight.

Patrick

Unable to see data in griffin UI

Hi There,

I am sending data to cc_relay. I have a grafana interface running over graphite. When I am sending data from nagios, I cannot see anything appearing on the UI. I can see the following lines in the log. Can you please check if the config is correct? If I am convinced that my graphios part is correct, I will put my energy into troubleshooting it from the graphite perspective.

November 16 07:02:45 graphios_backends.py INFO Carbon Backend Initialized
November 16 07:02:45 graphios.py INFO Enabled backends: ['carbon']
November 16 07:02:45 graphios.py INFO graphios startup.
November 16 07:18:30 graphios_backends.py INFO Carbon Backend Initialized
November 16 07:18:30 graphios.py INFO Enabled backends: ['carbon']
November 16 07:18:30 graphios.py INFO graphios startup.
November 16 07:18:30 graphios.py INFO Processed 70 files (0 metrics) in /var/spool/nagios/graphios
November 16 07:18:46 graphios.py INFO Processed 72 files (0 metrics) in /var/spool/nagios/graphios
November 16 07:18:51 graphios.py INFO ctrl-c pressed. Exiting graphios.
November 16 07:37:54 graphios_backends.py INFO Carbon Backend Initialized
November 16 07:37:54 graphios.py INFO Enabled backends: ['carbon']
November 16 07:37:54 graphios.py INFO graphios startup.
November 16 07:37:54 graphios.py INFO Processed 32 files (0 metrics) in /var/spool/nagios/graphios
November 16 07:37:58 graphios.py INFO ctrl-c pressed. Exiting graphios.
/tmp/graphios.log (END)

build_carbon_metric return

Hi,

First of all, thanks for the script and for the good installation documentation.

I'm wondering why you return "" in build_carbon_metric when graphite_prefix == "" and graphite_postfix == ""

The reason why I'm asking is that setting up graphios on a nagios where you don't set any prefix/postfix simply fails to send data to graphite.

So, I've edited the function to << return "%s." % host_name >> instead

Another workaround is to change the service_perfdata_file_template line with the following at the end: \tGRAPHITEPOSTFIX::$SERVICEDESC$

These are not really "Issues"; these are just a few suggestions to make graphios easier to integrate into an existing Nagios setup.

Arnaud

graphios not honoring statsd address in graphios.cfg

Just installed graphios-2.0.3 via pip. Host is ubuntu 14.0.4. It seems to be ignoring the IP address for statsd in /etc/graphios/graphios.cfg.

/etc/graphios/graphios.cfg sets statsd as backend with IP address:

root% cat /etc/graphios/graphios.cfg | grep statsd
# Statsd Details (comment in if you are using statsd)
enable_statsd = true 
# Comma separated list of statsd server IP:Port 's
statsd_server = "10.0.0.19:8125"
#flag the statsd backend as 'non essential' for the purposes of error checking
#nerf_statsd = False

Running graphios either from init script or by hand shows the following:

root% /usr/local/bin/graphios.py --config_file=/etc/graphios/graphios.cfg
Statsd backend initialized
Enabled backends: ['statsd']
graphios startup.
Processing spool directory /var/spool/nagios/graphios
sending to statsd at 127.0.0.1:8125
graphite_lines:6
sending to statsd at 127.0.0.1:8125
graphite_lines:0
sending to statsd at 127.0.0.1:8125
graphite_lines:18
sending to statsd at 127.0.0.1:8125
graphite_lines:12

I need graphios to honor the statsd IP address set in /etc/graphios/graphios.cfg.

OMD python install - 'no post install could be performed'

I get this:

copying build/scripts-2.6/graphios.py -> /usr/bin
changing mode of /usr/bin/graphios.py to 755
Running post install task
sorry I couldn't find the nagios.cfg file
NO POST INSTALL COULD BE PERFORMED

So I edited setup.py (note the added OMD path):

def _post_install():
    """
    tries to find the nagios.cfg and insert graphios perf commands/cfg
    """
    lookin = ['/etc/nagios/', '/opt/nagios/', '/usr/local/nagios',
              '/usr/nagios', '/omd/sites/SITENAMEHERE/etc/nagios']
    nag_cfg = find_nagios_cfg(lookin)
    if nag_cfg is None:
        print("sorry I couldn't find the nagios.cfg file")
        print("NO POST INSTALL COULD BE PERFORMED")

Ran it again: python setup.py install

Running post install task
found nagios.cfg in /omd/sites/SITENAMEHERE/etc/nagios/nagios.cfg
Traceback (most recent call last):
  File "setup.py", line 176, in <module>
    https://github.com/shawn-sterling/graphios'
  File "/usr/lib64/python2.6/distutils/core.py", line 152, in setup
    dist.run_commands()
  File "/usr/lib64/python2.6/distutils/dist.py", line 975, in run_commands
    self.run_command(cmd)
  File "/usr/lib64/python2.6/distutils/dist.py", line 995, in run_command
    cmd_obj.run()
  File "setup.py", line 131, in run
    self.execute(_post_install, [], msg="Running post install task")
  File "/usr/lib64/python2.6/distutils/cmd.py", line 358, in execute
    util.execute(func, args, msg, dry_run=self.dry_run)
  File "/usr/lib64/python2.6/distutils/util.py", line 408, in execute
    func(*args)
  File "setup.py", line 103, in _post_install
    print("parsed nagcfg, nagios_log is at %s" % nconfig['log_file'])
KeyError: 'log_file'

So it's something in here:

 print("found nagios.cfg in %s" % nag_cfg)
        nconfig = parse_nagios_cfg(nag_cfg)
        print("parsed nagcfg, nagios_log is at %s" % nconfig['log_file'])
        if backup_file(nag_cfg):
            add_perfdata_config(nconfig, nag_cfg)
        else:
            print("Backup failed, add modify nagios.cfg manually.")

Checking that nagios.cfg file - it's OMD-generated:

# Nagios main configuration file

# This file will be read in after the files in nagios.d.
# Variables you set here will override settings in those
# files. Better do not edit the files in nagios.d but rather
# copy variables from there to here. That will save you
# trouble when updating your sites to new versions.

#use_regexp_matching=1

So it's expecting to see something there, but OMD uses conf.d and nagios.d for its configs.
I suspect you could just put the settings straight in here... (the OMD install documentation appears a little pieced together: it initially mentions OMD 1.2x, and later mentions OMD 5.6...)

Guessing this function is actually what's failing...

def add_perfdata_config(nconfig, nag_cfg):
    """
    adds the graphios perfdata cfg to the nagios.cfg
    """

I'll play around, see if I can get it working - will update.
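
Here's the kind of fallback I have in mind (a hypothetical sketch, not code from setup.py; it assumes parse_nagios_cfg returns a plain dict of key=value settings):

import glob
import os

# Hypothetical helper, not part of setup.py: if a key such as log_file is
# missing from the main nagios.cfg (as it is on OMD), look for it in the
# nagios.d/*.cfg snippets next to it.
def find_cfg_value(key, nag_cfg, nconfig):
    if key in nconfig:
        return nconfig[key]
    nagios_d = os.path.join(os.path.dirname(nag_cfg), "nagios.d")
    for cfg in sorted(glob.glob(os.path.join(nagios_d, "*.cfg"))):
        with open(cfg) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith("%s=" % key):
                    return line.split("=", 1)[1]
    return None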

Update: Oh yeah, here's the folder structure if you were curious:

[root@server nagios]# pwd
/omd/sites/SITENAMEHERE/etc/nagios
[root@server nagios]# ls -h
apache.conf  cgi.cfg  conf.d  config.inc.php  nagios.cfg  nagios.d  resource.cfg  ssi
[root@server nagios]# ls conf.d/
check_mk_objects.cfg  check_mk_templates.cfg  commands.cfg  jmx4perl_nagios.cfg  notification_commands.cfg  pnp4nagios.cfg  templates.cfg  thruk_templates.cfg  timeperiods.cfg
[root@server nagios]# ls nagios.d
dependency.cfg  eventhandler.cfg  flapping.cfg  freshness.cfg  logging.cfg  misc.cfg  mk-livestatus.cfg  obsess.cfg  omd.cfg  pnp4nagios.cfg  retention.cfg  timing.cfg  tuning.cfg

Gaps in data sent to Carbon

Thanks for your help in getting Graphios set up. I had seen gaps in the data sent to Graphite but ignored it, hoping it was a side effect of running alongside PNP4Nagios. Since we've been very happy with Graphite, I decided to rip out PNP4Nagios and upgrade to the latest version of Graphios. Unfortunately, it appears the gaps have, for some reason, worsened.

For example, the following is combined output from both Nagios and Graphios in debug mode. Lines set off by blank lines before and after are Nagios perfdata that Graphios didn't see/process.

Wed May 9 14:57:17 UTC 2012 GRAPHIOS May 09 14:57:21 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 182
Wed May 9 14:57:17 UTC 2012 production_classic_daemon SQS Messages Processed, 182 messages; 36559 in queue | processed=182 queued=36559n
Wed May 9 15:02:22 UTC 2012 GRAPHIOS May 09 15:02:23 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 147
Wed May 9 15:02:22 UTC 2012 production_classic_daemon SQS Messages Processed, 147 messages; 36412 in queue | processed=147 queued=36412n
Wed May 9 15:07:23 UTC 2012 GRAPHIOS May 09 15:07:24 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 159
Wed May 9 15:07:23 UTC 2012 production_classic_daemon SQS Messages Processed, 159 messages; 36253 in queue | processed=159 queued=36253n

Wed May 9 15:12:21 UTC 2012 production_classic_daemon SQS Messages Processed, 235 messages; 36018 in queue | processed=235 queued=36018n

Wed May 9 15:17:24 UTC 2012 GRAPHIOS May 09 15:17:27 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 246
Wed May 9 15:17:24 UTC 2012 production_classic_daemon SQS Messages Processed, 246 messages; 35772 in queue | processed=246 queued=35772n
Wed May 9 15:22:28 UTC 2012 GRAPHIOS May 09 15:22:43 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 268
Wed May 9 15:22:28 UTC 2012 production_classic_daemon SQS Messages Processed, 268 messages; 35504 in queue | processed=268 queued=35504n
Wed May 9 15:27:24 UTC 2012 GRAPHIOS May 09 15:27:29 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 191
Wed May 9 15:27:24 UTC 2012 production_classic_daemon SQS Messages Processed, 191 messages; 35313 in queue | processed=191 queued=35313n

Wed May 9 15:32:21 UTC 2012 production_classic_daemon SQS Messages Processed, 151 messages; 35162 in queue | processed=151 queued=35162n

Wed May 9 15:37:26 UTC 2012 production_classic_daemon SQS Messages Processed, 202 messages; 34960 in queue | processed=202 queued=34960n

Wed May 9 16:02:00 UTC 2012 production_classic_daemon SQS Messages Processed, 858 messages; 34102 in queue | processed=858 queued=34102n

Wed May 9 16:07:02 UTC 2012 production_classic_daemon SQS Messages Processed, 258 messages; 33844 in queue | processed=258 queued=33844n

Wed May 9 16:35:58 UTC 2012 GRAPHIOS May 09 16:36:01 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 1962
Wed May 9 16:35:58 UTC 2012 production_classic_daemon SQS Messages Processed, 1962 messages; 31882 in queue | processed=1962 queued=31882n
Wed May 9 16:43:58 UTC 2012 GRAPHIOS May 09 16:44:03 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 722
Wed May 9 16:43:58 UTC 2012 production_classic_daemon SQS Messages Processed, 722 messages; 31160 in queue | processed=722 queued=31160n

Wed May 9 16:48:51 UTC 2012 production_classic_daemon SQS Messages Processed, 337 messages; 30823 in queue | processed=337 queued=30823n

Wed May 9 16:53:53 UTC 2012 GRAPHIOS May 09 16:54:05 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 329
Wed May 9 16:53:53 UTC 2012 production_classic_daemon SQS Messages Processed, 329 messages; 30494 in queue | processed=329 queued=30494n
Wed May 9 16:58:55 UTC 2012 GRAPHIOS May 09 16:59:07 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 314
Wed May 9 16:58:55 UTC 2012 production_classic_daemon SQS Messages Processed, 314 messages; 30180 in queue | processed=314 queued=30180n

Wed May 9 17:03:56 UTC 2012 production_classic_daemon SQS Messages Processed, 445 messages; 29735 in queue | processed=445 queued=29735n

Wed May 9 17:08:56 UTC 2012 GRAPHIOS May 09 17:09:09 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 456
Wed May 9 17:08:56 UTC 2012 production_classic_daemon SQS Messages Processed, 456 messages; 29279 in queue | processed=456 queued=29279n
Wed May 9 17:13:57 UTC 2012 GRAPHIOS May 09 17:14:10 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 616
Wed May 9 17:13:57 UTC 2012 production_classic_daemon SQS Messages Processed, 616 messages; 28663 in queue | processed=616 queued=28663n
Wed May 9 17:18:55 UTC 2012 GRAPHIOS May 09 17:18:55 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 370
Wed May 9 17:18:55 UTC 2012 production_classic_daemon SQS Messages Processed, 370 messages; 28293 in queue | processed=370 queued=28293n

Wed May 9 17:23:58 UTC 2012 production_classic_daemon SQS Messages Processed, 443 messages; 27850 in queue | processed=443 queued=27850n

Wed May 9 17:28:56 UTC 2012 GRAPHIOS May 09 17:28:57 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 404
Wed May 9 17:28:56 UTC 2012 production_classic_daemon SQS Messages Processed, 404 messages; 27446 in queue | processed=404 queued=27446n
Wed May 9 17:34:01 UTC 2012 GRAPHIOS May 09 17:34:14 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 380
Wed May 9 17:34:01 UTC 2012 production_classic_daemon SQS Messages Processed, 380 messages; 27066 in queue | processed=380 queued=27066n

Wed May 9 17:39:00 UTC 2012 production_classic_daemon SQS Messages Processed, 273 messages; 26793 in queue | processed=273 queued=26793n

Wed May 9 17:44:07 UTC 2012 GRAPHIOS May 09 17:44:16 graphios.py DEBUG new line = app.backup.sqs.dcc.scheduled.ops01.processed 240
Wed May 9 17:44:07 UTC 2012 production_classic_daemon SQS Messages Processed, 240 messages; 26553 in queue | processed=240 queued=26553n

Wed May 9 17:49:08 UTC 2012 production_classic_daemon SQS Messages Processed, 213 messages; 26340 in queue | processed=213 queued=26340n

The following is my config:
host_perfdata_file=/var/spool/nagios/graphios/host-perfdata
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tGRAPHITEPREFIX::$_HOSTGRAPHITEPREFIX$\tGRAPHITEPOSTFIX::$_HOSTGRAPHITEPOSTFIX$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=process-host-perfdata
service_perfdata_file=/var/spool/nagios/graphios/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tGRAPHITEPREFIX::$_SERVICEGRAPHITEPREFIX$\tGRAPHITEPOSTFIX::$_SERVICEGRAPHITEPOSTFIX$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata
define command{
command_name process-host-perfdata
command_line mv /var/spool/nagios/graphios/host-perfdata /var/spool/nagios/graphios/host-perfdata.$TIMET$
}
define command{
command_name process-service-perfdata
command_line mv /var/spool/nagios/graphios/service-perfdata /var/spool/nagios/graphios/service-perfdata.$TIMET$
}
define service{
host_name ops01
service_description production_classic_daemon SQS Messages Processed
check_command check_sqs_msgs_processed!/XXXXXXXXXXX/production_classic_daemon!1!1
contact_groups sysadmins
servicegroups amazon
use service-mca3-r5min-c5min-n1hr-24x7
_graphiteprefix app.backup.sqs.dcc.scheduled
}

Processed 0 files (0 metrics) in /var/spool/nagios/graphios

My nagios.cfg:

# Auto-generated Graphios configs

process_performance_data=1
service_perfdata_file=/var/spool/nagios/graphios/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tGRAPHITEPREFIX::$_SERVICEGRAPHITEPREFIX$\tGRAPHITEPOSTFIX::$_SERVICEGRAPHITEPOSTFIX$\tMETRICTYPE::$_SERVICEMETRICTYPE$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=graphios_perf_service
host_perfdata_file=/var/spool/nagios/graphios/host-perfdata
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tGRAPHITEPREFIX::$_HOSTGRAPHITEPREFIX$\tGRAPHITEPOSTFIX::$_HOSTGRAPHITEPOSTFIX$\tMETRICTYPE::$_HOSTMETRICTYPE$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=graphios_perf_host

cfg_dir=/usr/local/nagios/etc/objects

graphite_perf_service

graphite_perf_host

And I added the perf commands to graphios_commands.cfg in the include dir:

define command {
command_name graphite_perf_host
command_line /bin/mv /var/spool/nagios/graphios/host-perfdata /var/spool/nagios/graphios/host-perfdata.$TIMET$

}

define command {
command_name graphite_perf_service
command_line /bin/mv /var/spool/nagios/graphios/service-perfdata /var/spool/nagios/graphios/service-perfdata.$TIMET$
}

My log file:
graphios.py INFO ctrl-c pressed. Exiting graphios.
graphios_backends.py INFO Carbon Backend Initialized
graphios.py INFO Enabled backends: ['carbon']
graphios.py INFO graphios startup.
graphios.py INFO Processed 0 files (0 metrics) in /var/spool/nagios/graphios
graphios.py INFO Processed 0 files (0 metrics) in /var/spool/nagios/graphios
graphios.py INFO Processed 0 files (0 metrics) in /var/spool/nagios/graphios

and there were no files inside the spool dir (/var/spool/nagios/graphios).

IOError: [Errno 2] No such file or directory: '/var/log/nagios/graphios.log'

When I try to run the graphios.py script in my Ubuntu VM, I get the following error.

trustytahr@controller-trusty-ref:~/graphios/graphios$ ./graphios.py
Traceback (most recent call last):
  File "./graphios.py", line 490, in <module>
    configure(options)
  File "./graphios.py", line 102, in configure
    opts.log_file, maxBytes=log_max_size, backupCount=4)
  File "/usr/lib/python2.7/logging/handlers.py", line 117, in __init__
    BaseRotatingHandler.__init__(self, filename, mode, encoding, delay)
  File "/usr/lib/python2.7/logging/handlers.py", line 64, in __init__
    logging.FileHandler.__init__(self, filename, mode, encoding, delay)
  File "/usr/lib/python2.7/logging/__init__.py", line 903, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib/python2.7/logging/__init__.py", line 928, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 2] No such file or directory: '/var/log/nagios/graphios.log'

Looking forward to your reply.

rhel init script stop function does not work

The init script starts the process as /usr/bin/graphios, but the stop function then tries to kill graphios.py. The process shows up like this:

nagios 6138 6137 0 15:25 pts/0 00:00:00 /usr/bin/python -tt /usr/bin/graphios

So obviously this doesn't work.

Latest pip version inconsistent - does not implement metrics_base_path

I have installed graphios from pip.
Thanks for your work on this tool, I think it's great.

I have been trying to set up a system that uses use_service_desc = True so that all of my metrics are sent to Graphite.

I have also decided to use a base path for all of my metrics, so I added metric_base_path = icinga to my config file. There was no commented template for this variable in my config file, so I had to add it.

The resulting whisper files are generated without this base path, so I started investigating why.

From my investigation it looks like the latest pip version is inconsistent and only includes some of the changes within this commit. Specifically these changes to graphios_backends.py don't seem to have been included.

2d6c078#diff-52a97f5de4dbf7f87c08baf311d76346L54

In addition, the change to graphios.cfg hasn't been included, which explains not only why I had to add the configuration statement myself but also issue #95.
2d6c078#diff-4df930f201411566ca416b64952a6420R52

I'm not going to make a pull request for it because I'm short of time, but hopefully this will help you to get the pip version up to date.

Once again, many thanks for this tool.

global prefix

It would be nice to be able to set a "global" prefix in the cfg file, i.e.:

[graphios]
global_prefix = 'graphios'

which would then result in all metrics being placed under graphios._graphiteprefix.hostname.service-description._graphitepostfix.perfdata.

Right now I set _graphiteprefix in both my generic Nagios host and service definitions, but it would be nice to be able to set this globally from graphios for all backends.
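
Something as simple as the following would already cover my use case (a sketch only; global_prefix is a proposed option, not an existing graphios.cfg setting):

# Sketch of the idea: prepend a configurable global prefix to every metric
# path before it goes to a backend. "global_prefix" is a proposed option
# here, not something graphios currently reads from graphios.cfg.
def apply_global_prefix(path, global_prefix=""):
    if global_prefix:
        return "%s.%s" % (global_prefix.rstrip("."), path)
    return path

# apply_global_prefix("ops.nagios01.pingto.myhost.rta", "graphios")
#   -> "graphios.ops.nagios01.pingto.myhost.rta"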

Status code: 400: {"error":"missing measurement"}

All data sent to InfluxDB 0.9.0 returns this error:

Status code: 400: {"error":"missing measurement"}

Changing this in graphios_backends.py

            perfdata.append({
                            "timestamp": int(m.TIMET),
                            "name": path,
                            "tags": tags,
                            "fields": {"value": value}})

to

            perfdata.append({
                            "timestamp": int(m.TIMET),
                            "measurement": path,
                            "tags": tags,
                            "fields": {"value": value}})

Fixed it for me.
The thing is, I'm not sure why I should have to change that, since even the InfluxDB docs use name: in their examples...
Is this a case for changing it in graphios, or should I go bug the folks at InfluxDB?

Please add advice on pruning spool directory to README

It would be great if you could mention in the README that pruning the files in the spool directory would be a good idea. My predecessor here clearly forgot, and now we're in this situation:

$ ls -dlh /var/spool/nagios/graphios/
drwxr-xr-x. 2 icinga icinga 58M Sep 30 20:26 /var/spool/nagios/graphios/

That's right, the directory listing alone is 58MB. Why? Because there are 1.2 million files in that directory:

$ find /var/spool/nagios/graphios/ -maxdepth 1| wc -l
1205694

It takes up rather a lot of space too (this is actually how I found it):

$ du -sh /var/spool/nagios/graphios/
4.9G    /var/spool/nagios/graphios/

I don't necessarily need to know exactly how to prune, but some advice on what's safe to prune would be great. Are the files referred to constantly, so that if we remove them we lose history? Or is it safe to remove any but the latest one? Or something in between? I just don't know. (Possibly because I haven't read the full documentation but just skimmed the README.)
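
For what it's worth, if the answer turns out to be "anything graphios has already handled is safe to delete", something like the sketch below run from cron would keep the directory in check. Treat it as an assumption-laden sketch, not a recommendation from the maintainer: it assumes the leftover timestamped files are stale and never need to be replayed.

import os
import time

# Sketch of a cron-able cleanup. Assumes files left behind in the spool
# directory are stale and do not need to be replayed; adjust path and age.
SPOOL_DIR = "/var/spool/nagios/graphios"
MAX_AGE = 24 * 60 * 60  # one day, in seconds

now = time.time()
for name in os.listdir(SPOOL_DIR):
    path = os.path.join(SPOOL_DIR, name)
    # only touch the timestamped host-perfdata.* / service-perfdata.* files
    if os.path.isfile(path) and "-perfdata." in name:
        if now - os.path.getmtime(path) > MAX_AGE:
            os.remove(path)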

librato backend looks broken

Hey man!

I just updated graphios and it looks like some literal quote characters are sneaking into the Librato back-end input, e.g.:

Failed to send metrics to Librato: Code: 400 . Response: {"errors":{"params":{"name":["nagios.cpu.cpuusage.'percent_0' is not a valid metric name"]}},"request_time":1458846729}

I'll dig into it and PR ya something soon.

Replace dots with underscores

I'm using fully qualified hostnames in Nagios, like:
define service {
use my-web
host_name web010.myservice.net
_graphiteprefix servers.myservice.web010.http
}

I've been using the old Graphios (1.x) until today and my metrics are named:
servers.myservice.web010.http.web010_myservice_net.{size,time}

Unfortunately after upgrade to 2.0 my service names are:
servers.myservice.web010.http.web010.myservice.net.{size,time}

(three dummy tree levels - "web010", "myservice", "net" - before I can reach my metrics).

Can you make it an option to replace dots with underscores, like the old version did? I have replacement_character = _ in my config but it doesn't seem to do anything.

Parsing of performance data

Given the perf data format 'label'=value[UOM];[warn];[crit];[min];[max], as described in the Nagios plugin development guidelines, I have a few questions/issues.

  1. Labels with single quotes and/or spaces have varying results.
    • This one throws an error. I'm not sure if it's the spaces or the non-alpha chars:

      • raw perf data:

      'Intel(R) PRO/1000 MT Network Connection-QoS Packet Scheduler-0000_in_prct'=0%;8000;9000;0;100 'Intel(R) PRO/1000 MT Network Connection-QoS Packet Scheduler-0000_out_prct'=0%;9000;10000;0;100 'Intel(R) PRO/1000 MT Network Connection-QoS Packet Scheduler-0000_speed_bps'=1000000000 'Intel(R) PRO/1000 MT Network Connection_in_prct'=0%;8000;9000;0;100 'Intel(R) PRO/1000 MT Network Connection_out_prct'=0%;9000;10000;0;100 'Intel(R) PRO/1000 MT Network Connection_speed_bps'=1000000000

      • graphios.log:

      March 05 22:46:38 graphios.py CRITICAL failed to parse label: 'MT' part of perfstring 'Intel(R) PRO_1000 MT Network Connection-QoS Packet Scheduler-0000_in_prct=0%;8000;9000;0;100 Intel(R) PRO_1000 MT Network Connection-QoS Packet Scheduler-0000_out_prct=0%;9000;10000;0;100 Intel(R) PRO_1000 MT Network Connection-QoS Packet Scheduler-0000_speed_bps=1000000000 Intel(R) PRO_1000 MT Network Connection_in_prct=0%;80

    • When labels have quotes ('somelabel'), the graphite metric keeps the quote.

      • raw perf data:

      'pid'=23466 heap=692439KB;;;;1048576 'heap ratio'=66%;80;90 perm=114665KB;;;;524288 perm_ratio=21%;80;90

      • whisper files. Note that pid keeps its quotes, and 'heap ratio' was split, with the second half keeping the closing quote:
      # ls -l ../MemoryStatistics-Tomcat/
      total 1.5M
      -rw-r--r-- 1 carbon carbon 204K Mar  5 23:05 'pid'.wsp
      -rw-r--r-- 1 carbon carbon 204K Mar  5 23:05 heap.wsp
      -rw-r--r-- 1 carbon carbon 204K Mar  5 23:05 perm.wsp
      -rw-r--r-- 1 carbon carbon 204K Mar  5 23:05 perm_ratio.wsp
      -rw-r--r-- 1 carbon carbon 204K Mar  5 23:05 ratio'.wsp
      
  2. Does graphios just drop UOM, warn, crit, min, max?

The obvious workaround is to not use single quotes, spaces, or non-alpha chars in labels.
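
For reference, here is a quote-aware parse of that format (my own sketch, not graphios's parser), which is roughly the behaviour I'd expect for labels with spaces:

import re

# Sketch of a quote-aware perfdata parser, not graphios's own code. It keeps
# labels with spaces together and strips the trailing unit of measure.
PERF_RE = re.compile(r"('([^']+)'|[^\s=']+)=([^\s;]+)((?:;[^\s;]*){0,4})")

def parse_perfdata(perfstring):
    metrics = []
    for raw_label, quoted_label, value, _thresholds in PERF_RE.findall(perfstring):
        label = quoted_label or raw_label
        num = re.match(r"-?[0-9.]+", value)
        if num:  # skip non-numeric values such as "U"
            metrics.append((label, float(num.group(0))))
    return metrics

print(parse_perfdata("'heap ratio'=66%;80;90 perm=114665KB;;;;524288"))
# [('heap ratio', 66.0), ('perm', 114665.0)]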

BTW, thank you for all the hard work!

Version 2.0 - carbon backend - valid/invalid characters, plus hostname

Hello, we're using graphios to send a fairly large number of metrics to carbon, about 300K/minute. The newer version handles this great.

However, we noticed two new things.

  1. A lot of 'invalid characters' get pulled out now by fix_string, and I think at least some of them are valid. We have used # and @ in the past, at least. As far as I know, any character that is valid in a filename is valid for carbon, but I can't really find any documentation on that.

  2. Some of our hostnames are fully qualified names, and periods are no longer replaced there.

I think the appropriate place to address this is in graphios_backends.py, right before it goes to carbon, so I changed it there and it works fine.

Thanks,
Scott

Here's my diffs:

--- graphios_backends.py.old    2014-10-21 16:52:36.803305932 +0000
+++ graphios_backends.py        2014-10-21 17:00:49.524441129 +0000
@@ -288,6 +288,9 @@
         """
         Builds a carbon metric
         """
+        #replace '.' in hostnames so fqdn works
+        hostname_fixed = re.sub(r'[\s\.:\\]', self.replacement_character,
+                                m.HOSTNAME)
         if m.GRAPHITEPREFIX != "":
             pre = "%s." % m.GRAPHITEPREFIX
         else:
@@ -299,11 +302,11 @@
         if self.use_service_desc:
             # we want: (prefix.)hostname.service_desc(.postfix).perfdata
             service_desc = self.fix_string(m.SERVICEDESC)
-            path = "%s%s.%s%s.%s" % (pre, m.HOSTNAME,
+            path = "%s%s.%s%s.%s" % (pre, hostname_fixed,
                                      service_desc, post,
                                      m.LABEL)
         else:
-            path = "%s%s%s.%s" % (pre, m.HOSTNAME, post, m.LABEL)
+            path = "%s%s%s.%s" % (pre, hostname_fixed, post, m.LABEL)
         path = re.sub(r"\.$", '', path)  # fix paths that end in dot
         path = re.sub(r"\.\.", '.', path)  # fix paths with double dots
         path = self.fix_string(path)
@@ -314,7 +317,9 @@
         takes a string and replaces whitespace and invalid carbon chars with
         the global replacement_character
         """
-        invalid_chars = '~!@#$:;%^*()+={}[]|\/<>'
+        #invalid_chars = '~!@#$:;%^*()+={}[]|\/<>'
+        # @ and # are valid
+        invalid_chars = '~!$:;%^*()+={}[]|\/<>'
         my_string = re.sub("\s", self.replacement_character, my_string)
         for char in invalid_chars:
             my_string = my_string.replace(char, self.replacement_character)
