ceph-nagios-plugins's Introduction

Nagios plugins for Ceph

A collection of nagios plugins to monitor a Ceph cluster.

Authentication

Ceph is normally configured to use cephx to authenticate its clients.

To run check_ceph_health or the other plugins as user nagios, you have to create a special keyring:

root# ceph auth get-or-create client.nagios mon 'allow r' > ceph.client.nagios.keyring

And use this keyring with the plugin:

nagios$ ./check_ceph_health --id nagios --keyring ceph.client.nagios.keyring
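
For example, to expose this check through NRPE, a minimal command definition could look like the following (a sketch; the plugin path and keyring location are assumptions for your installation):

command[check_ceph_health]=/usr/lib/nagios/plugins/check_ceph_health --id nagios --keyring /etc/ceph/ceph.client.nagios.keyring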

check_ceph_health

The check_ceph_health nagios plugin monitors the Ceph cluster and reports its health. It can be filtered to look only at certain health checks.

Usage

usage: check_ceph_health [-h] [-e EXE] [-c CONF] [-m MONADDRESS] [-n NAME] [-i ID] [-k KEYRING] [--check CHECK] [-w WHITELIST] [-d] [-V]

'ceph health' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -e EXE, --exe EXE     ceph executable [/usr/bin/ceph]
  -c CONF, --conf CONF  alternative ceph conf file
  -m MONADDRESS, --monaddress MONADDRESS
                        ceph monitor address[:port]
  -i ID, --id ID        ceph client id
  -n NAME, --name NAME  ceph client name
  -k KEYRING, --keyring KEYRING
                        ceph client keyring file
  --check CHECK         regexp of which check(s) to check (Luminous+). Can be
                        inverted, e.g. '^((?!PG_DEGRADED|OBJECT_MISPLACED).)*$'
  -w WHITELIST, --whitelist WHITELIST
                        whitelist regexp for ceph health warnings
  -d, --detail          exec 'ceph health detail'
  -V, --version         show version and exit

Example

nagios$ ./check_ceph_health --name client.nagios --keyring ceph.client.nagios.keyring
HEALTH WARNING: 1 pgs degraded; 1 pgs recovering; 1 pgs stuck unclean; recovery 4448/28924462 degraded (0.015%); 2/9857830 unfound (0.000%);
nagios$ echo $?
1
nagios$

nagios$ ./check_ceph_health --id nagios --whitelist 'requests.are.blocked(\s)*32.sec'
HEALTH OK

nagios$ ./check_ceph_health --id nagios
WARNING: MON_CLOCK_SKEW( clock skew detected on mon.a )
OBJECT_MISPLACED( 1937172/695961284 objects misplaced (0.278%) )
PG_DEGRADED( Degraded data redundancy: 98/695961284 objects degraded (0.000%), 1 pg degraded )

nagios$ ./check_ceph_health --id nagios --check 'PG_DEGRADED|OBJECT_MISPLACED'
WARNING: OBJECT_MISPLACED( 1937172/695961284 objects misplaced (0.278%) )
PG_DEGRADED( Degraded data redundancy: 98/695961284 objects degraded (0.000%), 1 pg degraded )

nagios$ ./check_ceph_health --id nagios --check '^((?!PG_DEGRADED|OBJECT_MISPLACED).)*$'
WARNING: MON_CLOCK_SKEW( clock skew detected on mon.a )

check_ceph_mon

The check_ceph_mon nagios plugin monitors an individual mon daemon, reporting its status.

Possible results include OK (up) and WARN (missing).

Usage

usage: check_ceph_mon [-h] [-e EXE] [-c CONF] [-m MONADDRESS] [-i ID]
                      [-k KEYRING] [-V] [-I MONID]

'ceph quorum_status' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -e EXE, --exe EXE     ceph executable [/usr/bin/ceph]
  -c CONF, --conf CONF  alternative ceph conf file
  -m MONADDRESS, --monaddress MONADDRESS
                        ceph monitor to use for queries (address[:port])
  -i ID, --id ID        ceph client id
  -k KEYRING, --keyring KEYRING
                        ceph client keyring file
  -V, --version         show version and exit
  -I MONID, --monid MONID
                        mon ID to be checked for availability

Example

nagios$ ./check_ceph_mon -I node1
MON OK

nagios$ ./check_ceph_mon --monid node2
MON WARN: no mon 'node2' found in quorum

check_ceph_osd

The check_ceph_osd nagios plugin monitors an individual osd daemon or host, reporting its status.

Possible results include OK (up) and WARN (down or missing).

Usage

usage: check_ceph_osd [-h] [-e EXE] [-c CONF] [-m MONADDRESS] [-i ID]
                     [-k KEYRING] [-V] -H HOST [-I OSDID] [-o]

'ceph osd' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -e EXE, --exe EXE     ceph executable [/usr/bin/ceph]
  -c CONF, --conf CONF  alternative ceph conf file
  -m MONADDRESS, --monaddress MONADDRESS
                        ceph monitor address[:port]
  -i ID, --id ID        ceph client id
  -k KEYRING, --keyring KEYRING
                        ceph client keyring file
  -V, --version         show version and exit
  -H HOST, --host HOST  osd host
  -I OSDID, --osdid OSDID
                        osd id
  -o, --out             check osds that are set OUT

Example

nagios$ ./check_ceph_osd -H 172.17.0.2 -I 0
OSD OK

nagios$ ./check_ceph_osd -H 172.17.0.2 -I 0
OSD WARN: OSD.0 is down at 172.17.0.2

nagios$ ./check_ceph_osd -H 172.17.0.2 -I 100
OSD WARN: no OSD.100 found at host 172.17.0.2

nagios$ ./check_ceph_osd -H 172.17.0.2
OSD WARN: Down OSD on 172.17.0.2: osd.0

check_ceph_rgw

The check_ceph_rgw nagios plugin monitors a ceph rados gateway, reporting its status and bucket usage.

Possible results include OK (up) and WARN (down or missing).

Usage

usage: check_ceph_rgw [-h] [-d] [-B] [-e EXE] [-c CONF] [-i ID] [-n NAME] [-V]

'radosgw-admin bucket stats' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -d, --detail          output perf data for all buckets
  -B, --byte            output perf data in Byte instead of KB
  -e EXE, --exe EXE     radosgw-admin executable [/usr/bin/radosgw-admin]
  -c CONF, --conf CONF  alternative ceph conf file
  -i ID, --id ID        ceph client id
  -n NAME, --name NAME  ceph client name      
  -V, --version         show version and exit

Example

nagios$ ./check_ceph_rgw
RGW OK: 4 buckets, 102276 KB total | /=102276KB

nagios$ ./check_ceph_rgw --detail --byte
RGW OK: 4 buckets, 102276 KB total | /=104730624B bucket-test1=151552B bucket-test0=12288B bucket-test2=104566784B bucket-test=0B

check_ceph_rgw_api

The check_ceph_rgw_api nagios plugin monitors a ceph rados gateway, reporting its status and bucket usage.

Difference from check_ceph_rgw:

check_ceph_rgw connects to the cluster, while check_ceph_rgw_api talks to radosgw directly via the admin API. You can check each instance of radosgw individually, or only one endpoint via a proxy/balancer (or both).

Possible results

  • OK - bucket info received from radosgw;
  • WARNING - connected, but wrong admin entry or usage caps;
  • UNKNOWN - can't connect to the proxy/balancer or to radosgw directly;

Requirements

  1. Install the requests-aws python library:
pip install requests-aws
  2. Configure the admin entry point (default is 'admin'):
rgw admin entry = "admin"
  3. Enable the admin API (default is enabled):
rgw enable apis = "s3, admin"
  4. Add the capability buckets=read for the user performing the checks, as shown in the example below; see the Admin Guide for more details.
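
For example, the capability can be granted with radosgw-admin (the uid nagios below is a placeholder; use the user that performs the checks):

root# radosgw-admin caps add --uid=nagios --caps="buckets=read"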

Usage

usage: check_ceph_rgw_api [-h] -H HOST [-k] [-e ADMIN_ENTRY] -a ACCESS_KEY
                          -s SECRET_KEY [-d] [-b] [-v]

'radosgw api bucket stats' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -H HOST, --host HOST  Server URL for the radosgw api (example:
                        http://objects.dreamhost.com/)
  -k, --insecure        Allow insecure server connections when using SSL
  -e ADMIN_ENTRY, --admin_entry ADMIN_ENTRY
                        The entry point for an admin request URL [default is
                        'admin']
  -a ACCESS_KEY, --access_key ACCESS_KEY
                        S3 access key
  -s SECRET_KEY, --secret_key SECRET_KEY
                        S3 secret key
  -d, --detail          output perf data for all buckets
  -b, --byte            output perf data in Byte instead of KB
  -v, --version         show version and exit

Example

nagios$ ./check_ceph_rgw_api -H https://objects.dreamhost.com/ -a JXUABTZZYHAFLCMF9VYV -s jjP8RDD0R156atS6ACSy2vNdJLdEPM0TJQ5jD1pw
RGW OK: 1 buckets, 7696 KB total | /=7696KB

nagios$ ./check_ceph_rgw_api -H objects.dreamhost.com -a JXUABTZZYHAFLCMF9VYV -s jjP8RDD0R156atS6ACSy2vNdJLdEPM0TJQ5jD1pw --detail --byte
RGW OK: 1 buckets, 7696 KB total | /=7880704B k0ste=7880704B

check_ceph_df

The check_ceph_df nagios plugin monitors a ceph cluster, reporting its RAW capacity usage percentage, or the usage of a specific pool.

Possible results include OK, WARN and CRITICAL.

Usage

usage: check_ceph_df [-h] [-e EXE] [-c CONF] [-m MONADDRESS] [-i ID] [-n NAME]
                     [-k KEYRING] [-p POOL] [-d] [-W WARN] [-C CRITICAL] [-V]

'ceph df' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -e EXE, --exe EXE     ceph executable [/usr/bin/ceph]
  -c CONF, --conf CONF  alternative ceph conf file
  -m MONADDRESS, --monaddress MONADDRESS
                        ceph monitor address[:port]
  -i ID, --id ID        ceph client id
  -n NAME, --name NAME  ceph client name
  -k KEYRING, --keyring KEYRING
                        ceph client keyring file
  -p POOL, --pool POOL  ceph pool name
  -d, --detail          show pool details on warn and critical
  -W WARN, --warn WARN  warn above this percent RAW USED
  -C CRITICAL, --critical CRITICAL
                        critical alert above this percent RAW USED
  -V, --version         show version and exit

Example

nagios$ ./check_ceph_df -i nagios -k /etc/ceph/ceph.client.nagios.keyring -W 29.12 -C 30.22 -d
RAW usage 28.36%

nagios$ ./check_ceph_df -i nagios -k /etc/ceph/ceph.client.nagios.keyring -W 26.14 -C 30
WARNING: global RAW usage of 28.36% is above 26.14% (783G of 1093G free)

nagios$ ./check_ceph_df -i nagios -k /etc/ceph/ceph.client.nagios.keyring -W 60 -C 70 -p hdd
CRITICAL: Pool 'hdd' usage of 71.71% is above 70.0% (9703G used)

nagios$ ./check_ceph_df -i nagios -k /etc/ceph/ceph.client.nagios.keyring -W 60 -C 70 -p nvme
CRITICAL: Pool 'nvme' usage of 76.08% is above 70.0% (223G used)

nagios$ ./check_ceph_df -i nagios -k /etc/ceph/ceph.client.nagios.keyring -W 26.14 -C 30 -d
WARNING: global RAW usage of 28.36% is above 26.14% (783G of 1093G free)

 POOLS:
     NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
     rbd                 0      96137M      8.59          348G       24441
     cephfs_data         1      61785M      5.52          348G       99940
     cephfs_metadata     2      40380k         0          348G        8037
     libvirt-pool        3         145         0          348G           2

check_ceph_mds

The check_ceph_mds nagios plugin monitors an individual mds daemon, reporting its status.

Possible results include OK, WARN (laggy) and ERROR (not found).

Usage

usage: check_ceph_mds [-h] [-e EXE] [-c CONF] [-m MONADDRESS] [-i ID]
                      [-k KEYRING] [-V] -n NAME -f FILESYSTEM

'ceph mds stat' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -e EXE, --exe EXE     ceph executable [/usr/bin/ceph]
  -c CONF, --conf CONF  alternative ceph conf file
  -m MONADDRESS, --monaddress MONADDRESS
                        ceph monitor to use for queries (address[:port])
  -i ID, --id ID        ceph client id
  -k KEYRING, --keyring KEYRING
                        ceph client keyring file
  -V, --version         show version and exit
  -n NAME, --name NAME  mds daemon name
  -f FILESYSTEM, --filesystem FILESYSTEM
                        mds filesystem name

Example

nagios$ ./check_ceph_mds -f cephfs -n ceph-mds-1
MDS OK: MDS 'ceph-mds-1' is up:active

nagios$ ./check_ceph_mds -f cephfs -n ceph-mds-2
MDS OK: MDS 'ceph-mds-2' is up:standby

nagios$ ./check_ceph_mds -f cephfs -n ceph-mds-1
MDS WARN: MDS 'ceph-mds-1' is up:active (laggy or crashed)

nagios$ ./check_ceph_mds -f cephfs -n ceph-mds-3
MDS ERROR: MDS 'ceph-mds-3' is not found (offline?)

check_ceph_mgr

The check_ceph_mgr nagios plugin monitors the mgr daemons, reporting the active and standby managers.

Usage

usage: check_ceph_mgr [-h] [-e EXE] [-c CONF] [-m MONADDRESS] [-i ID]
                      [-n NAME] [-k KEYRING] [-V]

'ceph mgr dump' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -e EXE, --exe EXE     ceph executable [/usr/bin/ceph]
  -c CONF, --conf CONF  alternative ceph conf file
  -m MONADDRESS, --monaddress MONADDRESS
                        ceph monitor to use for queries (address[:port])
  -i ID, --id ID        ceph client id
  -n NAME, --name NAME  ceph client name
  -k KEYRING, --keyring KEYRING
                        ceph client keyring file
  -V, --version         show version and exit

Example

nagios$ ./check_ceph_mgr
MGR OK: active: zhdk0013, standbys: zhdk0009, zhdk0025

check_ceph_osd_db

The check_ceph_osd_db nagios plugin checks the percentage usage of an OSD's BlueStore DB and reports CRITICAL if it is above the threshold.
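
The full usage is not reproduced here; a plausible invocation, assuming the plugin follows the same conventions as check_ceph_df (the --id, --keyring and -C flags below are assumptions, not confirmed against the script):

nagios$ ./check_ceph_osd_db --id nagios --keyring ceph.client.nagios.keyring -C 80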

ceph-nagios-plugins's People

Contributors

adrianlzt, clempi, dalees, elliot64, gvasilak, haklein, j-licht, janpekar, jeffmbarnes, k0ste, lorenzbausch, maartenbeeckmans, martinseener, melkah, minshenglin, rplessl, rubenk, thirdeyenick, tobias-urdin, valerytschopp

ceph-nagios-plugins's Issues

[check_ceph_osd] says ok when an OSD is Down+out

Hi,

Thanks for this plugin. I have one question about check_ceph_osd: it would be nice if the output below generated a warning. An OSD is down, but the status of the OSDs is OK, so we don't get a warning that an OSD is out. The OSD was faulty and was taken out by Ceph itself.

OSD OK
Up OSDs: osd.1 osd.4 osd.7 osd.10
Down+In OSDs: 
Down+Out OSDs: osd.13

Adding plugin for MGR service monitoring

Hi!

Are there any plans for adding MGR daemon monitoring? It's a separate daemon since the Luminous release and should be monitored separately.

Thanks for these amazing monitoring plugins!

ImportError: cannot import name S3Auth

Installed requests-aws:

Successfully installed requests-aws-0.1.8
Successfully installed awsauth-0.3.3
$ ./check_ceph_rgw_api 
Traceback (most recent call last):
  File "./check_ceph_rgw_api", line 24, in <module>
    from awsauth import S3Auth
ImportError: cannot import name S3Auth

running ubuntu 16.04:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.6 LTS
Release:        16.04
Codename:       xenial

output of 'ceph df' seems to have changed in the Mimic release

I had to change two lines in the check_ceph_df script after upgrading my ceph cluster to Mimic:

Line 140 & 141:

<            global_usage_percent = float(globalvals[3])
<            global_available_space = globalvals[1]
>            global_usage_percent = float(globalvals[6])
>            global_available_space = globalvals[2]

check_ceph_df returns "ValueError: could not convert string to float: TiB" in nautilus

# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED 
    hdd       427 TiB     194 TiB     233 TiB      234 TiB         54.68 
    ssd       2.1 TiB     1.0 TiB     1.1 TiB      1.1 TiB         51.63 
    TOTAL     430 TiB     195 TiB     234 TiB      235 TiB         54.67 
 
POOLS:
    POOL                           ID      STORED      OBJECTS     USED        %USED     MAX AVAIL 
    data                             0      75 TiB      50.38M      75 TiB     45.87        30 TiB 
    metadata                         1      56 GiB       2.47M      56 GiB      0.06        22 TiB 
    libvirt-pool                     4     2.0 GiB         518     2.0 GiB         0        30 TiB 
    djf_tmp                         92     1.4 TiB       8.24M     1.4 TiB      1.60        30 TiB 
    .rgw.root                       93     1.1 KiB           4     1.1 KiB         0        30 TiB 
    default.rgw.control             94         0 B           8         0 B         0        30 TiB 
    default.rgw.meta                95     2.4 KiB           8     2.4 KiB         0        30 TiB 
    default.rgw.log                 96     8.0 MiB         209     8.0 MiB         0        30 TiB 
    default.rgw.buckets.index       97      38 MiB           2      38 MiB         0        30 TiB 
    default.rgw.buckets.data        98     744 GiB     242.06k     744 GiB      0.81        30 TiB 
    default.rgw.buckets.non-ec      99         0 B           0         0 B         0        30 TiB 
    device_health_metrics          100      52 MiB         146      52 MiB         0        30 TiB 


# ceph --version
ceph version 14.2.1-198-g869a6a3 (869a6a3e1140d44523ad1e10239a9c874cce0885) nautilus (stable)

Have a variant of rgw plugin using the restful admin api

We could have a variant of the rgw plugin calling the same command through the restful API. This has the advantage that it can be deployed on the admin node or any other node and doesn't require a cephx keyring. The downside is that a user with the necessary caps has to be created in advance for this to work.

check_ceph_osd fails from remote host : failed to bind the UNIX domain socket

@valerytschopp, first of all thanks for the out-of-the-box working plugins.

I have been trying these plugins in my environment and check_ceph_osd doesn't seem to work from a remote host; however, it works from localhost.

[root@admin libexec]# ./check_nrpe -H storage0111 -c check_ceph_osd -a nagios /etc/ceph/client.nagios.keyring storage0111-ib
OSD ERROR: 2015-05-27 15:57:14.277582 7f810c20b700 -1 asok(0x7f8104000fc0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.nagios.3014269.140192094621760.asok': (13) Permission denied

[root@admin libexec]#

However if i execute it from a localhost , it simply works.


[root@storage0111 libexec]# ./check_ceph_osd --id nagios --keyring /etc/ceph/client.nagios.keyring --host storage0111-ib
OSD OK
Up OSDs: osd.0 osd.6 osd.10 osd.17 osd.21 osd.26 osd.30 osd.35 osd.40 osd.45
Down+In OSDs:
Down+Out OSDs:
[root@storage0111 libexec]#

FYI, check_ceph_health and check_ceph_mon work fine from both the local and remote hosts. The problem is only with the check_ceph_osd script.

Since you are the author of these scripts, any help with this is appreciated.

Passing keyring to a cephadm volume for health check

Trying to use the check_ceph_health script I hit an issue: the keyring is not passed to the cephadm container that runs the ceph command.

I do not know how the current implementation is supposed to be run, but usually there would be an additional switch passed to cephadm that mounts the keyring from the host OS into the container as a volume, for example with:

cephadm shell -v /etc/ceph/client.nagios.keyring:/etc/ceph/client.nagios.keyring:z

Possibly the :z at the end of the volume specification would not work on a system without SELinux, but I do not have anything other than RPM-based systems to test with. In the example code this is not included since I couldn't test it on other systems.

The example I made is very simple and possibly not pretty, but Python is not one of my favorites: bluikko/ceph-nagios-plugins@master...bluikko-cephadm-volume
It works on my system, but since this was not included in the original implementation it makes me question how the cephadm support was supposed to work earlier. Usually (?) the container it runs would not include keyrings, to my knowledge.

I would like feedback on this before I make a PR.

Additionally, running the script without the above patch fails at line 189 on Python 3.9 (the real error is about the missing keyring)...

TypeError: a bytes-like object is required, not 'str'

check_ceph_health --whitelist reports HEALTH_OK when in WARN

# Without whitelist
root@mira021:~# /usr/lib/nagios/plugins/check_ceph_health --name client.nagios -k /etc/ceph/client.nagios.keyring
HEALTH WARNING: noout flag(s) set; 1 backfillfull osd(s); 1 osds down; 1 nearfull osd(s); 14954651/135094260 objects misplaced (11.070%); 12/44437884 objects unfound (0.000%); Reduced data availability: 21 pgs inactive; Degraded data redundancy: 12352137/135094260 objects degraded (9.143%), 3525 pgs unclean, 3169 pgs degraded, 1272 pgs undersized; 10 slow requests are blocked > 32 sec; clock skew detected on mon.mira049, mon.mira060; 

# With whitelist
root@mira021:~# /usr/lib/nagios/plugins/check_ceph_health --name client.nagios -k /etc/ceph/client.nagios.keyring --whitelist 'blocked'
HEALTH OK

# Cluster health
root@mira021:~# ceph health detail | grep -v pg
osd.50 is backfill full
osd.87 (root=default,host=mira122) is down
osd.45 is near full
1 ops are blocked > 1048.58 sec
8 ops are blocked > 524.288 sec
2 ops are blocked > 262.144 sec
osds 7,16,41,55,65,71 have blocked requests > 524.288 sec
osd.39 has blocked requests > 1048.58 sec
mon.mira049 addr 172.21.5.114:6789/0 clock skew 0.632786s > max 0.05s (latency 0.194764s)
mon.mira060 addr 172.21.6.130:6789/0 clock skew 0.195608s > max 0.05s (latency 0.655687s)

Shouldn't the clock skew and osd issues still report as HEALTH_WARN?

check_ceph_osd not working with Proxmox/Ceph Nautilus

I've just upgraded our Proxmox cluster to 6.2 including the upgrade from Ceph Luminous to Nautilus.
All other checks (check_ceph_health, check_ceph_mon, check_ceph_df) still work as expected but check_ceph_osd is refusing to work.

I'm using the following command:
/usr/lib/nagios/plugins/check_ceph_osd -i nagios --key /var/lib/nagios/ceph.client.nagios.keyring --host 10.55.0.1 --out

followed by this error:
OSD ERROR: 2020-08-26 09:32:03.862 7fbe53968700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.nagios.keyring: (2) No such file or directory 2020-08-26 09:32:03.862 7fbe53968700 -1 AuthRegistry(0x7fbe4c081ff8) no keyring found at /etc/pve/priv/ceph.client.nagios.keyring, disabling cephx

I don't know why ceph is looking for a key in /etc/pve/priv/ceph.client.nagios.keyring.
If I copy my key from /var/lib/nagios/ceph.client.nagios.keyring to /etc/pve/priv/ceph.client.nagios.keyring the command works as expected, but only as user root. In Proxmox, /etc/pve/priv is a special cluster file system where all files are owned by root with no read permissions for any other user. Of course I would like to avoid running the check as root.

Keyring has been created with
ceph auth get-or-create client.nagios mon 'allow r' osd 'allow r' > /var/lib/nagios/ceph.client.nagios.keyring

Maybe that's the same problem as in issue #30, but it worked in Luminous?

Move to python3

I believe most distros moved to Python 3 as Python 2 is no longer supported.

It would be great to drop Python 2 support and e.g. fix the shebangs to #!/usr/bin/env python3.

I know that it is possible to run this with a symlink from python to python3. IMHO that isn't a great solution, as it makes it harder to spot problems with applications not yet ported to Python 3.

For legacy systems, it might be easy enough to keep Python 2 support in e.g. a python2 branch.
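
For what it's worth, updating the shebangs could be a one-liner (a sketch; it assumes the plugins live in src/ and currently start with the plain python shebang):

sed -i '1s|^#!/usr/bin/env python$|#!/usr/bin/env python3|' src/check_ceph_*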

./check_ceph_rgw as user nagios

hi,

I got all plugins working as non-root, except check_ceph_rgw.

RGW ERROR:  :: 2017-06-25 17:20:33.839691 7f2f7333c8c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.admin.keyring: (13) Permission denied
2017-06-25 17:20:33.839702 7f2f7333c8c0  0 librados: client.admin initialization error (13) Permission denied
couldn't init storage provider

which is understandable, because /etc/pve/priv/ceph.client.admin.keyring is only readable by root. For the other checks I created a separate keyring, but that option is missing here. So what have I missed to get it working as user nagios, without making /etc/pve/priv/ceph.client.admin.keyring readable by nagios?

cu denny

PS: very big thanks for the great plugins 👍

Repository appears to have no active maintainers anymore

Hello,

we opened a PR implementing an additional check almost two months ago, with no response from any maintainer of this repository.

Since there has been no response or other noticeable activity in this repository, I'd say that this is an issue, and the Ceph project should find new maintainers who'll take care of this repository in the future.

check_ceph_health consistently timing out/dying

As many as 3 times per hour, the script times out and is terminated by signal 9. I was curious whether this is a known issue and/or potentially something wrong with my setup (utilizing Icinga at the moment). Thank you!

When it times out, it stays like this for at least a minute before working again.

check_ceph_mds TypeError: 'NoneType' object is not iterable

Hi,
I'm having some issues getting check_ceph_mds to work correctly. When I supply the minimum required arguments, -n and -f, I get the following error:

./check_ceph_mds --id nagios --keyring /etc/ceph/client.nagios.keyring -f testfs-cephfs-1 -n ceph1
Traceback (most recent call last):
  File "./check_ceph_mds", line 181, in <module>
    sys.exit(main())
  File "./check_ceph_mds", line 103, in main
    return check_target_mds(mds_stat, args.filesystem, args.name)
  File "./check_ceph_mds", line 115, in check_target_mds
    for mds in active_mdss:
TypeError: 'NoneType' object is not iterable

Any thoughts on this? Server is CentOS 7.

Thanks!

Error initializing cluster client

Hi,
I have an error from the nagios server when I try to use the "check_ceph_mon" plugin or any other one. For the moment, only the check_ceph_health plugin is working fine.

The issue is: "MON ERROR: b"Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')\n"
If I run the plugin from the PVE node it works fine. Can someone help me solve the issue?

Thanks in advance.

check_ceph_df checks only first class

Hi,

I've discovered that check_ceph_df uses only the 3rd line of the ceph df output to report usage, ignoring the other classes.

For example my ceph df output is as following:

RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       115 TiB      50 TiB      64 TiB       65 TiB         56.56
    ssd        13 TiB     8.8 TiB     4.1 TiB      4.3 TiB         32.63
    TOTAL     128 TiB      59 TiB      68 TiB       69 TiB         54.10

and the check reports only:

RAW usage 56.56%

ceph version:
ceph version 14.2.16
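
A sketch of a more robust approach, locating the TOTAL row by its label instead of by a fixed line index (based on the RAW STORAGE layout shown above; variable names are illustrative):

import subprocess

# Parse `ceph df` and pick the TOTAL row by name, so extra device
# classes (hdd, ssd, ...) cannot shift the parsed columns.
raw = subprocess.check_output(['ceph', 'df']).decode('utf-8')
for line in raw.splitlines():
    cols = line.split()
    if cols and cols[0] == 'TOTAL':
        print(float(cols[-1]))  # last column is %RAW USED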

check_ceph_df :=> ValueError: could not convert string to float: TiB

check_ceph_df isn't able to convert TiB. I am getting the following error:

/usr/local/libexec/nagios/sl_plugins/check_ceph_df --name client.nagios --keyring /etc/ceph/ceph.nagios.keyring -c /etc/ceph/ceph.conf -W 75 -C 90 -p rbd
Traceback (most recent call last):
  File "/usr/local/libexec/nagios/sl_plugins/check_ceph_df", line 179, in <module>
    sys.exit(main())
  File "/usr/local/libexec/nagios/sl_plugins/check_ceph_df", line 123, in main
    pool_usage_percent = float(poolvals[3])
ValueError: could not convert string to float: TiB

check_ceph_df: `TypeError: a bytes-like object is required, not 'str'`

When monitoring a specific pool with check_ceph_df the following error is thrown:

$ ./check_ceph_df --pool rbd -C 70 -W 50
Traceback (most recent call last):
  File "/usr/lib64/nagios/plugins/ceph-nagios-plugins/check_ceph_df", line 231, in <module>
    sys.exit(main())
  File "/usr/lib64/nagios/plugins/ceph-nagios-plugins/check_ceph_df", line 136, in main
    if args.pool in line:
TypeError: a bytes-like object is required, not 'str'
$ /usr/bin/env python -V
Python 3.6.8

I worked around this by modifying line 136:

# before
if args.pool in line:

# after
if args.pool in str(line):

Not sure if this is an appropriate fix - if so, I can open a PR to fix this issue.
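
An alternative would be to decode the subprocess output once where it is read, so every later comparison works on str (a sketch; the pool name 'rbd' stands in for args.pool):

import subprocess

# Decode the raw bytes returned by `ceph df` once, so Python 3
# substring checks like `args.pool in line` operate on str.
raw = subprocess.check_output(['ceph', 'df'])
for line in raw.decode('utf-8').splitlines():
    if 'rbd' in line:
        print(line)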

Adjust to work with Octopus installed via cephadm

In Octopus the default way to install Ceph is to use cephadm, which runs all the Ceph daemons in Docker containers. As a consequence, the Ceph nodes will have names containing the container ID. In particular, the MDS check no longer works in this case.

As a quick workaround, I suggest a substring check for the hostname instead of an exact string match; this should avoid having to reconfigure the nagios check.

replace line 109 with:
if name in mds.get_name():

and replace line 118 with:
if name not in mds.get_name():

Debian Repository

This package contains a debian directory. Is there a Debian repository already?

Did you try to add the package to the official repository?

check_ceph_mon always show "MON OK"

Hi!
First of all, thanks for this plugin.
I have a ceph cluster with cephx disabled and a ceph monitor named node2 that is down:

root@node1:~# ceph mon stat
e3: 3 mons at {node1=172.16.1.1:6789/0,node2=172.16.1.2:6789/0,node3=172.16.1.3:6789/0}, election epoch 522, quorum 0,2 node1,node3

and also no ping reply from node2:

root@node1:~# ping node2
PING node2 (172.16.1.2) 56(84) bytes of data.
^C
--- node2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1007ms

but the plugin says OK:

root@node1:~# ./ceph-nagios-plugins-master/src/check_ceph_mon -I node2 -H node2
MON OK

What can be wrong?

--id option does not work on radosgw-admin in giant

For me --id does not work with radosgw-admin, but -i does:

/usr/bin/radosgw-admin -v
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)

I know this is a ceph problem, not yours, but maybe using -i instead of --id in check_ceph_rgw is an option.

check_ceph_osd not working after upgrade to ceph version 14.2.20

Hi there,

since the upgrade I get the following error:

check_ceph_osd --id nagios --keyring /etc/icinga/objects/credstore/ceph.client.nagios.keyring -H '10.20.0.XXX' -I '0'
OSD ERROR: 2021-05-05 15:29:38.125135 7f10f0853700  0 monclient: hunting for new mon

Despite the warning about "mons are allowing insecure global_id reclaim" the ceph cluster is healthy. This warning is related to
https://docs.ceph.com/en/latest/security/CVE-2021-20288/

The MON check check_ceph_mon is running without any issues on all hosts:

check_ceph_mon --id nagios --keyring /etc/icinga/objects/credstore/ceph.client.nagios.keyring -I 'XXXXXXX'
MON OK

The check version is 1.5.2

Thank you in advance
Greetings
Leo

OSD Permissions Denied when using OSD Check Plugins with Nagios 4.3.4 / NPRE 3.2.1

When attempting to check an OSD with Nagios using NRPE, I am getting the following error:
OSD ERROR: 2018-01-10 14:18:26.252441 7f67360c7700 -1 asok(0x7f6730001680) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.nagios.1784273.140081163671472.asok': (13) Permission denied .

I have followed the documentation where we create the keyring file for nagios:
ceph auth get-or-create client.nagios mon 'allow r' > /etc/ceph/client.nagios.keyring

thank you

Karl Birkland

check_rgw errors

1st Problem:
There is no support for the --keyfile switch. All other checks support it. Please add the keyfile switch.

2nd Problem:

When running the check with simple options, it fails while trying to fetch the mon config:

./check_ceph_rgw --id nagios -c /etc/ceph/ceph.conf
RGW ERROR:  :: failed to fetch mon config (--no-mon-config to skip)

The option doesn't seem to be supported by the script though:

 ./check_ceph_rgw --id nagios --no-mon-config
usage: check_ceph_rgw [-h] [-d] [-B] [-e EXE] [-c CONF] [-i ID] [-n NAME] [-V]
check_ceph_rgw: error: unrecognized arguments: --no-mon-config

What should I do to check the RGW gateway status?

MGR ERROR: keyring file '/etc/nagios/client.nagios.keyring' doesn't exist

Hey!

I got some error I can't figure out, probably something easy, but here goes:

When I run the command locally, everything works:
sudo -u nrpe /usr/lib64/nagios/plugins/check_ceph_mgr -i nagios -k /etc/ceph/client.nagios.keyring -c /etc/ceph/ceph.conf
MGR OK: active: mgr1, standbys: mgr2

However, on the nagios server I get the error "MGR ERROR: keyring file '/etc/nagios/client.nagios.keyring' doesn't exist":

/usr/local/nagios/libexec/check_nrpe -H mgr1 -c check_ceph_mgr
MGR ERROR: keyring file '/etc/nagios/client.nagios.keyring' doesn't exist

check_ceph_osd Issue

I am using your ceph plugin for nagios on an Ubuntu 14.04 machine. All commands are working fine except the one below:
"/usr/lib/nagios/plugins/check_ceph_osd --host 10.xx.xx.x1 -I -0"

When running the command I get the following message:
OSD ERROR: 2016-12-07 06:14:11.663694 7fa2c44fc700 1 -- :/0 messenger.start

But if I try to run ceph osd tree I get the following output:
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.58893 root default
-2 0.86298 host 10.xx.xx.x1
0 0.43149 osd.0 up 1.00000 1.00000
1 0.43149 osd.1 up 1.00000 1.00000
-3 0.86298 host 10.xx.xx.x2
2 0.43149 osd.2 up 1.00000 1.00000
3 0.43149 osd.3 up 1.00000 1.00000
-4 0.86298 host 10.xx.xx.x3
4 0.43149 osd.4 up 1.00000 1.00000
5 0.43149 osd.5 up 1.00000 1.00000

Kindly help with the above issue.

dpkg: python-support not available in recent debian

The dpkg package cannot be built with current Debian versions (9.6, stretch, stable as of now, 2018):

~/ceph-nagios-plugins# dpkg-buildpackage -us -uc
dpkg-buildpackage: info: source package nagios-plugins-ceph
dpkg-buildpackage: info: source version 1.5.1-1
dpkg-buildpackage: info: source distribution unstable
dpkg-buildpackage: info: source changed by Roman Plessl <[email protected]>
dpkg-buildpackage: info: host architecture amd64
 dpkg-source --before-build ceph-nagios-plugins
dpkg-checkbuilddeps: error: Unmet build dependencies: python-support
dpkg-buildpackage: warning: build dependencies/conflicts unsatisfied; aborting
dpkg-buildpackage: warning: (Use -d flag to override.)

When using the -d flag as suggested, the error below comes up, but I cannot say whether this is a follow-up error or an issue all by itself (removing python-support from the 'control' file results in the same output):

~/ceph-nagios-plugins# dpkg-buildpackage -us -uc -d
dpkg-buildpackage: info: source package nagios-plugins-ceph
dpkg-buildpackage: info: source version 1.5.1-1
dpkg-buildpackage: info: source distribution unstable
dpkg-buildpackage: info: source changed by Roman Plessl <[email protected]>
dpkg-buildpackage: info: host architecture amd64
 dpkg-source --before-build ceph-nagios-plugins
 debian/rules clean
dh clean
dh: Compatibility levels before 9 are deprecated (level 8 in use)
   dh_testdir
   dh_auto_clean
dh_auto_clean: Compatibility levels before 9 are deprecated (level 8 in use)
	make -j1 clean
make[1]: Entering directory '/root/ceph-nagios-plugins'
rm -rf /root/ceph-nagios-plugins/tmp *.tar.gz *.deb *.dsc
make[1]: Leaving directory '/root/ceph-nagios-plugins'
   dh_clean
dh_clean: Compatibility levels before 9 are deprecated (level 8 in use)
 dpkg-source -b ceph-nagios-plugins
dpkg-source: error: can't build with source format '3.0 (native)': native package version may not have a revision
dpkg-buildpackage: error: dpkg-source -b ceph-nagios-plugins gave error exit status 255
