rochaporto / collectd-ceph
collectd plugins and dashboards for ceph
License: GNU General Public License v2.0
Metrics to be collected:
ceph pg dump --format json-pretty
{ "version": 401,
"stamp": "2014-05-19 23:33:26.976176",
"last_osdmap_epoch": 29,
"last_pg_scan": 23,
"full_ratio": "0.950000",
"near_full_ratio": "0.750000",
"pg_stats_sum": { "stat_sum": { "num_bytes": 22908977,
"num_objects": 13,
"num_object_clones": 0,
"num_object_copies": 26,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_unfound": 0,
"num_read": 1350,
"num_read_kb": 1079,
"num_write": 99,
"num_write_kb": 31913,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0},
"stat_cat_sum": {},
"log_size": 127,
"ondisk_log_size": 127},
"osd_stats_sum": { "kb": 6815452,
"kb_used": 2220596,
"kb_avail": 4594856,
"hb_in": [],
"hb_out": [],
"snap_trim_queue_len": 0,
"num_snap_trimming": 0,
"op_queue_age_hist": { "histogram": [],
"upper_bound": 1},
"fs_perf_stat": { "commit_latency_ms": 1021,
"apply_latency_ms": 11153}},
"pg_stats_delta": { "stat_sum": { "num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_unfound": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0},
"stat_cat_sum": {},
"log_size": 0,
"ondisk_log_size": 0},
"pg_stats": [
{ "pgid": "14.31",
"version": "0'0",
"reported_seq": "11",
"reported_epoch": "29",
"state": "active+clean",
"last_fresh": "2014-05-19 23:26:25.549117",
"last_change": "2014-05-16 04:34:11.013010",
"last_active": "2014-05-19 23:26:25.549117",
"last_clean": "2014-05-19 23:26:25.549117",
"last_became_active": "0.000000",
"last_unstale": "2014-05-19 23:26:25.549117",
"mapping_epoch": 23,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 23,
"last_epoch_clean": 25,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "2014-05-16 04:28:54.807245",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2014-05-16 04:28:54.807245",
"last_clean_scrub_stamp": "0.000000",
"log_size": 0,
"ondisk_log_size": 0,
"stats_invalid": "0",
"stat_sum": { "num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_unfound": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0},
"stat_cat_sum": {},
"up": [
0,
1],
"acting": [
0,
1]},
...
"pool_stats": [
{ "poolid": 0,
"stat_sum": { "num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_unfound": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0},
"stat_cat_sum": {},
"log_size": 0,
"ondisk_log_size": 0},
...
"osd_stats": [
{ "osd": 0,
"kb": 2919444,
"kb_used": 1110328,
"kb_avail": 1809116,
"hb_in": [
1],
"hb_out": [],
"snap_trim_queue_len": 0,
"num_snap_trimming": 0,
"op_queue_age_hist": { "histogram": [],
"upper_bound": 1},
"fs_perf_stat": { "commit_latency_ms": 484,
"apply_latency_ms": 4703}},
{ "osd": 1,
"kb": 3896008,
"kb_used": 1110268,
"kb_avail": 2785740,
"hb_in": [
0],
"hb_out": [],
"snap_trim_queue_len": 0,
"num_snap_trimming": 0,
"op_queue_age_hist": { "histogram": [],
"upper_bound": 1},
"fs_perf_stat": { "commit_latency_ms": 537,
"apply_latency_ms": 6450}}]}
Metrics to be collected:
ceph mon dump --format json-pretty
dumped monmap epoch 1
{ "epoch": 1,
"fsid": "95f73d54-8bdc-41d4-a540-fbd4480dc511",
"modified": "0.000000",
"created": "0.000000",
"mons": [
{ "rank": 0,
"name": "ceph",
"addr": "192.168.5.230:6789\/0"}],
"quorum": [
0]}
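From this dump, the obvious monitor metrics are counts. A minimal parsing sketch using the fields shown above (the helper name is an assumption, not the plugin's actual code):

```python
import json

def mon_metrics(dump):
    """Count monitors and quorum members from 'ceph mon dump -f json'."""
    data = json.loads(dump)
    return {
        'num_mons': len(data.get('mons', [])),
        'num_quorum': len(data.get('quorum', [])),
    }
```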
Things like pg_num, pgp_num and size.
Hello,
While experimenting with ceph_pool_plugin I noticed that when I changed the Interval (the plugin parameter, not the collectd global Interval setting) to something like 10 seconds, Graphite was not drawing any data, even though I had already set the Graphite retention period to 10 seconds.
Is the 60-second plugin Interval shown in the example some kind of restriction?
Regards,
Kostis
Two plugin candidates:
https://collectd.org/wiki/index.php/Plugin:Disk
https://github.com/indygreg/collectd-diskstats
We might need to package them in the deb.
@rochaporto A big thanks for sharing these plugins.
Need help: while trying to use ceph_pool_plugin, I got this error in the collectd logs.
[2015-07-09 16:53:04] [error] ceph-pool: failed to ceph pool stats :: 'module' object has no attribute 'check_output' :: Traceback (most recent call last):
File "/usr/lib/collectd/plugins/ceph/ceph_pool_plugin.py", line 54, in get_stats
stats_output = subprocess.check_output('ceph osd pool stats -f json', shell=True)
AttributeError: 'module' object has no attribute 'check_output'
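The `'module' object has no attribute 'check_output'` error means collectd is embedding Python 2.6: `subprocess.check_output` was only added in Python 2.7. A minimal compatibility shim (the helper name is hypothetical, not part of the plugin):

```python
import subprocess

def check_output_compat(cmd, shell=False):
    """subprocess.check_output with a fallback for Python 2.6.

    On 2.6 the function does not exist, so emulate it with Popen.
    """
    if hasattr(subprocess, 'check_output'):
        return subprocess.check_output(cmd, shell=shell)
    proc = subprocess.Popen(cmd, shell=shell, stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return out
```

Alternatively, run collectd against a Python 2.7 interpreter (as another issue below does with the CentOS SCL python27).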
[2015-07-09 16:53:04] [info] ceph: collectd new data from service :: took 0 seconds
[2015-07-09 16:53:04] [error] ceph: failed to retrieve stats
ceph: failed to get stats :: float division by zero :: Traceback (most recent call last):
File "/usr/lib/collectd/plugins/ceph/base.py", line 125, in read_callback
stats = self.get_stats(config)
File "/usr/lib/collectd/plugins/ceph/ceph_pg_plugin.py", line 73, in get_stats
data[ceph_cluster][osd_id]['percent_used'] = 100.0 * (osd['kb_used'] / float(osd['kb']))
ZeroDivisionError: float division by zero
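The division by zero happens when an OSD reports `kb` as 0 (for example while it is down or out). A guarded sketch of the percent-used computation, with a hypothetical helper name:

```python
def percent_used(kb_used, kb):
    """Percent of capacity used, or 0.0 when total kb is zero.

    An OSD that is down/out can report kb == 0, which would otherwise
    raise ZeroDivisionError in the read callback.
    """
    if not kb:
        return 0.0
    return 100.0 * kb_used / float(kb)
```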
max_avail is an important metric gathered by "ceph df", but it's not being collected anymore.
After starting collectd on CentOS 7 (Ceph Giant, now upgraded to Hammer), I'm getting the following log errors using the ceph_pool_plugin.
-- Unit collectd.service has begun starting up.
Apr 15 15:04:18 ceph1.domain systemd[1]: Started Collectd statistics daemon.
-- Subject: Unit collectd.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit collectd.service has finished starting up.
--
-- The start-up result is done.
Apr 15 15:04:18 ceph1.domain collectd[22862]: Initialization complete, entering read-loop.
Apr 15 15:04:18 ceph1.domain python[22874]: detected unhandled Python exception in '/usr/bin/ceph'
Apr 15 15:04:18 ceph1.domain abrt-server[22881]: Package 'ceph-common' isn't signed with proper key
Apr 15 15:04:18 ceph1.domain abrt-server[22881]: 'post-create' on '/var/tmp/abrt/Python-2015-04-15-15:04:18-22874' exited with 1
Apr 15 15:04:18 ceph1.domain abrt-server[22881]: Deleting problem directory '/var/tmp/abrt/Python-2015-04-15-15:04:18-22874'
Apr 15 15:04:18 ceph1.domain collectd[22862]: Traceback (most recent call last):
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 896, in <module>
Apr 15 15:04:18 ceph1.domain collectd[22862]: retval = main()
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 647, in main
Apr 15 15:04:18 ceph1.domain collectd[22862]: conffile=conffile)
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib/python2.7/site-packages/rados.py", line 212, in __init__
Apr 15 15:04:18 ceph1.domain collectd[22862]: library_path = find_library('rados')
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 244, in find_library
Apr 15 15:04:18 ceph1.domain collectd[22862]: return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 237, in _findSoname_ldconfig
Apr 15 15:04:18 ceph1.domain collectd[22862]: f.close()
Apr 15 15:04:18 ceph1.domain collectd[22862]: IOError: [Errno 10] No child processes
Apr 15 15:04:18 ceph1.domain python[22884]: detected unhandled Python exception in '/usr/bin/ceph'
Apr 15 15:04:18 ceph1.domain abrt-server[22891]: Not saving repeating crash in '/usr/bin/ceph'
Apr 15 15:04:18 ceph1.domain collectd[22862]: Traceback (most recent call last):
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 896, in <module>
Apr 15 15:04:18 ceph1.domain collectd[22862]: retval = main()
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 647, in main
Apr 15 15:04:18 ceph1.domain collectd[22862]: conffile=conffile)
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib/python2.7/site-packages/rados.py", line 212, in __init__
Apr 15 15:04:18 ceph1.domain collectd[22862]: library_path = find_library('rados')
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 244, in find_library
Apr 15 15:04:18 ceph1.domain collectd[22862]: return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 237, in _findSoname_ldconfig
Apr 15 15:04:18 ceph1.domain collectd[22862]: f.close()
Apr 15 15:04:18 ceph1.domain collectd[22862]: IOError: [Errno 10] No child processes
Apr 15 15:04:18 ceph1.domain collectd[22862]: ceph: failed to get stats :: No JSON object could be decoded :: Traceback (most recent call last):
File "/usr/lib64/collectd/base.py", line 114, in read_callback
stats = self.get_stats()
File "/usr/lib64/collectd/ceph_pool_plugin.py", line 67, in get_stats
json_stats_data = json.loads(stats_output)
File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python2.7/json/decoder.py", line 365, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python2.7/json/decoder.py", line 383, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Apr 15 15:04:18 ceph1.domain collectd[22862]: Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
Apr 15 15:04:18 ceph1.domain collectd[22862]: read-function of plugin `python.ceph_pool_plugin' failed. Will suspend it for 120.000 seconds.
collectd.conf:
<LoadPlugin python>
Globals true
</LoadPlugin>
<Plugin "python">
ModulePath "/usr/lib64/collectd"
Import "ceph_pool_plugin"
<Module "ceph_pool_plugin">
Verbose "True"
Cluster "ceph"
Interval "60"
TestPool "rbd"
</Module>
</Plugin>
From the release notes:
The rd_kb and wr_kb fields in the JSON dumps for pool stats (accessed via the ceph df detail -f json-pretty and related commands) have been replaced with corresponding *_bytes fields. Similarly, the total_space, total_used, and total_avail fields are replaced with total_bytes, total_used_bytes, and total_avail_bytes fields.
This breaks the plugin.
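One way to keep the plugin working across the rename is to probe for the new `*_bytes` field and fall back to the old `*_kb` one, converting to bytes either way. A sketch with a hypothetical `get_rate` helper:

```python
def get_rate(stats, base):
    """Fetch a pool stat that may be reported in kb or in bytes.

    'stats' is a dict from 'ceph df detail -f json'; 'base' is e.g.
    'rd' or 'wr'. Pre-Hammer output has 'rd_kb'/'wr_kb'; newer output
    has 'rd_bytes'/'wr_bytes'. Returns bytes in both cases.
    """
    if base + '_bytes' in stats:
        return stats[base + '_bytes']
    return stats.get(base + '_kb', 0) * 1024
```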
With something like:
rados -p test bench 10 write -t 1 -b 65536 2>/dev/null | grep -i latency | awk '{print 1000*$3}'
In each plugin we're doing subprocess.check_output(..., shell=False).
This is better, but causes issues when multiple python plugins are loaded. Enabling shell seems to fix it.
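For context, the practical difference: with `shell=False`, `subprocess` expects an argument list, so a single command string must be split first; with `shell=True` the whole string is handed to `/bin/sh`, which does the splitting (and pipes/redirections work). A small illustration (assumed example, not the plugin's code):

```python
import shlex

cmd = 'ceph df -f json --cluster ceph'

# shell=False needs an argument list, not one string:
args = shlex.split(cmd)
# subprocess.check_output(args)            # no shell involved
# subprocess.check_output(cmd, shell=True) # /bin/sh does the splitting
```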
At https://github.com/rochaporto/collectd-ceph I have not found a description of how to install the collectd ceph plugin.
Hello All,
Having had enough of the outstanding bugs, pull requests and forks lying around since the beginning of 2015, I have forked the repo and merged in most of the non-overlapping pull requests and some forks.
https://github.com/grinapo/collectd-ceph
If anyone wants to create pull requests against it, feel free; I'll try to merge them. I don't promise to fix bugs, but I may eventually, since I'm using it as well.
Obviously it would be fine to pull it back here if rochaporto is back again.
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: ceph: failed to get stats :: list index out of range :: Traceback (most recent call last):
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: File "/usr/lib/collectd/plugins/ceph/base.py", line 114, in read_callback
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: stats = self.get_stats()
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: File "/usr/lib/collectd/plugins/ceph/ceph_latency_plugin.py", line 67, in get_stats
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: data[ceph_cluster]['cluster']['stddev_latency'] = results[1]
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: IndexError: list index out of range
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]:
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: read-function of plugin `python.ceph.ceph_latency_plugin' failed. Will suspend it for 120.000 seconds.
I'm running collectd 5.4.1 on CentOS 6.5, with the CentOS SCL python27 embedded into collectd (since this plugin requires Python 2.7 due to its use of subprocess.check_output).
I noticed that any values I set in the config are ignored:
14:13:53 root@sm-sensu /usr/lib/collectd/plugins/ceph-git/plugins $ cat /etc/collectd.d/ceph_latency.conf
<LoadPlugin "python">
Globals true
Interval 10
Debug True
</LoadPlugin>
<Plugin "python">
ModulePath "/usr/lib/collectd/plugins/ceph-git/plugins"
Import "ceph_latency_plugin"
<Module "ceph_latency_plugin">
Verbose "true"
Cluster ceph
Interval 10
TestPool test
</Module>
</Plugin>
I believe this is because the register_read happens before the configure_callback is called.
config key: Cluster - ceph
config key: Interval - 10
config key: TestPool - test
Stopping collectd: [ OK ]
Starting collectd: latency plugin registering with interval: 60.0
To fix this, I moved the register_read inside the callback, and now it works:
def configure_callback(conf):
    """Received configuration information"""
    plugin.config_callback(conf)
    collectd.error("latency plugin registering with interval: %s" % plugin.interval)
    collectd.register_read(read_callback, plugin.interval)
Stopping collectd: [ OK ]
Starting collectd: config key: Verbose - true
config key: Cluster - ceph
config key: Interval - 10.0
config key: TestPool - test
latency plugin registering with interval: 10.0
The current plugin retries forever when rados bench fails (the process just keeps trying).
We need to wrap it in a timeout so that the process is killed when it fails to run (as with a network connectivity issue, for example).
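The Python 2.7 embedded in collectd lacks the `timeout=` argument that `subprocess` gained in Python 3.3, so the kill has to be done by hand. One possible pattern, using `threading.Timer` (the helper name is hypothetical):

```python
import subprocess
import threading

def run_with_timeout(args, timeout):
    """Run a command, killing it after 'timeout' seconds.

    Works on the Python 2.7 embedded in collectd, which has no
    timeout= argument on subprocess calls.
    """
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    timer = threading.Timer(timeout, proc.kill)
    timer.start()
    try:
        out, err = proc.communicate()
    finally:
        timer.cancel()
    if proc.returncode != 0:
        raise RuntimeError('command failed or timed out: %r' % (args,))
    return out
```

A hung `rados bench` would then be killed and surface as a single logged error instead of an ever-growing pile of stuck processes.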
Hi there !
I found a bug in a variable name here:
cephdf_cmdline='ceph df -f json --cluster ' + self.cluster
df_output = subprocess.check_output(ceph_dfcmdline, shell=True)
Just rename ceph_dfcmdline to cephdf_cmdline.
Bye
So that we can reference the plugins location as a package.
Same issue as in collectd-openstack, the interval is not being passed properly to the dispatch call of collectd.
Metrics to be included:
ceph osd dump --format json-pretty
{ "epoch": 29,
"fsid": "95f73d54-8bdc-41d4-a540-fbd4480dc511",
"created": "2014-05-16 04:21:58.549874",
"modified": "2014-05-19 23:26:25.318330",
"flags": "",
"cluster_snapshot": "",
"pool_max": 14,
"max_osd": 2,
"pools": [
{ "pool": 0,
"pool_name": "data",
"flags": 0,
"flags_names": "",
"type": 1,
"size": 2,
"min_size": 2,
"crush_ruleset": 0,
"object_hash": 2,
"pg_num": 64,
"pg_placement_num": 64,
"crash_replay_interval": 45,
"last_change": "3",
"auid": 0,
"snap_mode": "selfmanaged",
"snap_seq": 0,
"snap_epoch": 0,
"pool_snaps": {},
"removed_snaps": "[]",
"quota_max_bytes": 0,
"quota_max_objects": 0,
"tiers": [],
"tier_of": -1,
"read_tier": -1,
"write_tier": -1,
"cache_mode": "none",
"properties": []},
{ "pool": 1,
"pool_name": "metadata",
"flags": 0,
"flags_names": "",
"type": 1,
"size": 2,
"min_size": 2,
"crush_ruleset": 1,
"object_hash": 2,
"pg_num": 64,
"pg_placement_num": 64,
"crash_replay_interval": 0,
"last_change": "8",
"auid": 0,
"snap_mode": "selfmanaged",
"snap_seq": 0,
"snap_epoch": 0,
"pool_snaps": {},
"removed_snaps": "[]",
"quota_max_bytes": 0,
"quota_max_objects": 0,
"tiers": [],
"tier_of": -1,
"read_tier": -1,
"write_tier": -1,
"cache_mode": "none",
"properties": []},
{ "pool": 2,
"pool_name": "rbd",
"flags": 0,
"flags_names": "",
"type": 1,
"size": 2,
"min_size": 2,
"crush_ruleset": 2,
"object_hash": 2,
"pg_num": 64,
"pg_placement_num": 64,
"crash_replay_interval": 0,
"last_change": "2",
"auid": 0,
"snap_mode": "selfmanaged",
"snap_seq": 0,
"snap_epoch": 0,
"pool_snaps": {},
"removed_snaps": "[]",
"quota_max_bytes": 0,
"quota_max_objects": 0,
"tiers": [],
"tier_of": -1,
"read_tier": -1,
"write_tier": -1,
"cache_mode": "none",
"properties": []},
{ "pool": 3,
"pool_name": "images",
"flags": 0,
"flags_names": "",
"type": 1,
"size": 2,
"min_size": 2,
"crush_ruleset": 0,
"object_hash": 2,
"pg_num": 64,
"pg_placement_num": 64,
"crash_replay_interval": 0,
"last_change": "29",
"auid": 0,
"snap_mode": "selfmanaged",
"snap_seq": 5,
"snap_epoch": 29,
"pool_snaps": {},
"removed_snaps": "[1~2,4~1]",
"quota_max_bytes": 0,
"quota_max_objects": 0,
"tiers": [],
"tier_of": -1,
"read_tier": -1,
"write_tier": -1,
"cache_mode": "none",
"properties": []},
{ "pool": 4,
"pool_name": "volumes",
"flags": 0,
"flags_names": "",
"type": 1,
"size": 2,
"min_size": 2,
"crush_ruleset": 0,
"object_hash": 2,
"pg_num": 64,
"pg_placement_num": 64,
"crash_replay_interval": 0,
"last_change": "11",
"auid": 0,
"snap_mode": "selfmanaged",
"snap_seq": 0,
"snap_epoch": 0,
"pool_snaps": {},
"removed_snaps": "[]",
"quota_max_bytes": 0,
"quota_max_objects": 0,
"tiers": [],
"tier_of": -1,
"read_tier": -1,
"write_tier": -1,
"cache_mode": "none",
"properties": []}],
"osds": [
{ "osd": 0,
"uuid": "1e79235a-f094-47e1-80d1-8232d2d475cb",
"up": 1,
"in": 1,
"last_clean_begin": 0,
"last_clean_end": 0,
"up_from": 13,
"up_thru": 24,
"down_at": 0,
"lost_at": 0,
"public_addr": "192.168.5.230:6800\/16168",
"cluster_addr": "192.168.5.230:6801\/16168",
"heartbeat_back_addr": "192.168.5.230:6802\/16168",
"heartbeat_front_addr": "192.168.5.230:6803\/16168",
"state": [
"exists",
"up"]},
{ "osd": 1,
"uuid": "555ed8d0-11da-49b5-8ee9-3887c5937237",
"up": 1,
"in": 1,
"last_clean_begin": 0,
"last_clean_end": 0,
"up_from": 16,
"up_thru": 0,
"down_at": 0,
"lost_at": 0,
"public_addr": "192.168.5.230:6805\/16349",
"cluster_addr": "192.168.5.230:6806\/16349",
"heartbeat_back_addr": "192.168.5.230:6807\/16349",
"heartbeat_front_addr": "192.168.5.230:6808\/16349",
"state": [
"exists",
"up"]}],
"osd_xinfo": [
{ "osd": 0,
"down_stamp": "0.000000",
"laggy_probability": "0.000000",
"laggy_interval": 0},
{ "osd": 1,
"down_stamp": "0.000000",
"laggy_probability": "0.000000",
"laggy_interval": 0}],
"pg_temp": [],
"blacklist": []}
Right now the get_stats functions do not exit on error.
In particular, we should check for empty output when a query to ceph fails, and log and exit immediately (otherwise we get a bunch of nasty stack traces).
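A possible shape for such a guard: return `None` on a failed command, empty output, or undecodable JSON, instead of letting the exception escape the read callback (in the real plugin the failure would also be logged via `collectd.error`). The helper name is an assumption:

```python
import json
import subprocess

def fetch_json(cmdline):
    """Run a ceph command and parse its JSON output.

    Returns None on a failed command, empty output, or bad JSON,
    so callers can bail out early instead of stacktracing on
    'No JSON object could be decoded'.
    """
    try:
        out = subprocess.check_output(cmdline, shell=True)
    except subprocess.CalledProcessError:
        return None
    if not out.strip():
        return None
    try:
        return json.loads(out)
    except ValueError:
        return None
```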
By parsing detailed osd log messages.
https://github.com/cernceph/ceph-scripts/blob/master/tools/rbd-io-stats.pl
Something similar to:
ceph --admin-daemon /var/run/ceph/ceph-mon.ceph.asok perf dump
{ "cluster": { "num_mon": 1,
"num_mon_quorum": 1,
"num_osd": 2,
"num_osd_up": 2,
"num_osd_in": 2,
"osd_epoch": 29,
"osd_kb": 6815452,
"osd_kb_used": 2220596,
"osd_kb_avail": 4594856,
"num_pool": 15,
"num_pg": 960,
"num_pg_active_clean": 960,
"num_pg_active": 960,
"num_pg_peering": 0,
"num_object": 13,
"num_object_degraded": 0,
"num_object_unfound": 0,
"num_bytes": 22908977,
"num_mds_up": 0,
"num_mds_in": 0,
"num_mds_failed": 0,
"mds_epoch": 1},
"leveldb": { "leveldb_get": 35461,
"leveldb_transaction": 2166,
"leveldb_compact": 0,
"leveldb_compact_range": 2,
"leveldb_compact_queue_merge": 0,
"leveldb_compact_queue_len": 0},
"mon": {},
"throttle-mon_client_bytes": { "val": 0,
"max": 104857600,
"get": 2573538,
"get_sum": 187405680,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 0,
"take_sum": 0,
"put": 2573538,
"put_sum": 187405680,
"wait": { "avgcount": 0,
"sum": 0.000000000}},
"throttle-mon_daemon_bytes": { "val": 0,
"max": 419430400,
"get": 11354,
"get_sum": 4799934,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 0,
"take_sum": 0,
"put": 11354,
"put_sum": 4799934,
"wait": { "avgcount": 0,
"sum": 0.000000000}},
"throttle-msgr_dispatch_throttler-mon": { "val": 0,
"max": 104857600,
"get": 2584892,
"get_sum": 192205614,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 0,
"take_sum": 0,
"put": 2584892,
"put_sum": 192205614,
"wait": { "avgcount": 0,
"sum": 0.000000000}}}
Even if some of these metrics are available from other places too, it's still worth it.
Commands like subprocess.check_output(['ceph', 'df', '-f', 'json']) return an empty string when called from collectd.
Other commands in a collectd python plugin work fine (like 'ls' or 'pwd').
A simple python script executing the same command (subprocess.check_output(['ceph', 'df', '-f', 'json'])) works well outside of collectd.
Where is the incompatibility between python/collectd/ceph (user rights, subprocess, ...)?
Any suggestions?
OS: Ubuntu 12.04
Collectd: 5.1.0
Python: 2.7.3
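When a command produces output in a terminal but an empty string under collectd, the cause (keyring permissions for the collectd user, a different $PATH or $HOME, etc.) usually shows up on stderr or in the exit code, which `check_output` discards on success. A diagnostic sketch (hypothetical helper, not a fix):

```python
import subprocess

def run_verbose(args):
    """Run a command capturing stderr too, to see why stdout is empty.

    Returns (returncode, stdout, stderr) so the real failure reason
    can be logged from inside the collectd plugin.
    """
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err
```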
It should be possible to push metrics such as throughput, IOPS, etc. into Graphite, the same way we do for pools.
--- ceph_pool_plugin.py~	2016-03-02 23:35:20.000000000 +0100
+++ ceph_pool_plugin.py	2016-03-03 00:04:04.876014974 +0100
@@ -54,7 +54,7 @@
             osd_pool_cmdline='ceph osd pool stats -f json --cluster ' + self.cluster
             stats_output = subprocess.check_output(osd_pool_cmdline, shell=True)
             cephdf_cmdline='ceph df -f json --cluster ' + self.cluster
-            df_output = subprocess.check_output(ceph_dfcmdline, shell=True)
+            df_output = subprocess.check_output(cephdf_cmdline, shell=True)
         except Exception as exc:
             collectd.error("ceph-pool: failed to ceph pool stats :: %s :: %s"
                 % (exc, traceback.format_exc()))
Hello,
I get this error when I run collectd; in syslog I see the following:
ceph-pool: failed to ceph pool stats :: global name 'ceph_dfcmdline' is not defined :: Traceback (most recent call last):
File "/opt/collectd/plugins/ceph_pool_plugin.py", line 57, in get_stats
df_output = subprocess.check_output(ceph_dfcmdline, shell=True)
NameError: global name 'ceph_dfcmdline' is not defined
I compiled collectd with python and ceph.
Thanks for your help.
How do I use it? Do you have any video tutorial?
Any help would be appreciated.
Heya,
It seems that Ceph changed op_per_sec to read_op_per_sec and write_op_per_sec somewhere in Jewel. I think this is why I'm no longer getting op_per_sec stats in my Graphite.
I might want to venture a pull request to fix this, but I'm unsure how to properly check for Ceph versions. Is there any instance of this somewhere in the code already? Is there a recommended way of doing it?
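Rather than detecting the Ceph version, one option is to probe for whichever keys are actually present in the `client_io_rate` dict. A sketch with a hypothetical `ops_per_sec` helper:

```python
def ops_per_sec(client_io):
    """Total client ops/s from a pool's 'client_io_rate' dict.

    Pre-Jewel output has a single 'op_per_sec'; Jewel and later split
    it into 'read_op_per_sec' and 'write_op_per_sec'. Probing the keys
    avoids any version check.
    """
    if 'op_per_sec' in client_io:
        return client_io['op_per_sec']
    return (client_io.get('read_op_per_sec', 0) +
            client_io.get('write_op_per_sec', 0))
```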
Metrics to be collected:
ceph df --format json-pretty
{ "stats": { "total_space": 6815452,
"total_used": 2220596,
"total_avail": 4594856},
"pools": [
{ "name": "data",
"id": 0,
"stats": { "kb_used": 0,
"bytes_used": 0,
"objects": 0}},
{ "name": "metadata",
"id": 1,
"stats": { "kb_used": 0,
"bytes_used": 0,
"objects": 0}},
{ "name": "rbd",
"id": 2,
"stats": { "kb_used": 0,
"bytes_used": 0,
"objects": 0}},
{ "name": "images",
"id": 3,
"stats": { "kb_used": 22373,
"bytes_used": 22908960,
"objects": 9}},
{ "name": "volumes",
"id": 4,
"stats": { "kb_used": 1,
"bytes_used": 17,
"objects": 4}}]}
I am setting up a box to monitor a Ceph cluster. My client config seems to be fine; I am able to run all the commands that these scripts run internally. But I'm seeing the errors below in the collectd log. I am using Ubuntu 14.04 with collectd 5.4. Please help me with this issue. Thank you!
[2016-02-25 15:27:40] read-function of plugin `python.ceph_pool_plugin' failed. Will suspend it for 240.000 seconds.
[2016-02-25 15:31:40] ceph: failed to get stats :: No JSON object could be decoded :: Traceback (most recent call last):
File "/usr/lib/collectd/plugins/ceph/base.py", line 108, in read_callback
stats = self.get_stats()
File "/usr/lib/collectd/plugins/ceph/ceph_monitor_plugin.py", line 62, in get_stats
json_data = json.loads(output)
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
[2016-02-25 15:31:40] Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
[2016-02-25 15:31:40] read-function of plugin `python.ceph_monitor_plugin' failed. Will suspend it for 480.000 seconds.
[2016-02-25 15:31:40] ceph: failed to get stats :: No JSON object could be decoded :: Traceback (most recent call last):
File "/usr/lib/collectd/plugins/ceph/base.py", line 108, in read_callback
stats = self.get_stats()
File "/usr/lib/collectd/plugins/ceph/ceph_pool_plugin.py", line 67, in get_stats
json_stats_data = json.loads(stats_output)
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
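The `UnboundLocalError` that follows in these logs is a secondary failure: `stats` is first assigned inside the `try` block, so when `get_stats()` raises, the later use of `stats` crashes too and obscures the real error. A sketch of a safer shape (hypothetical signature, not the plugin's actual callback):

```python
def safe_read(get_stats, log_error=lambda msg: None):
    """Read-callback sketch that pre-initializes 'stats'.

    'get_stats' and 'log_error' stand in for the plugin's methods.
    Because 'stats' exists even when get_stats() raises, the original
    error is what gets logged, not an UnboundLocalError.
    """
    stats = None
    try:
        stats = get_stats()
    except Exception as exc:
        log_error('ceph: failed to get stats :: %s' % exc)
    if not stats:
        return None  # skip dispatching this interval
    return stats
```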
Metrics to be collected:
ceph osd pool stats -f json-pretty
[
{ "pool_name": "data",
"pool_id": 0,
"recovery": {},
"recovery_rate": {},
"client_io_rate": {}},
{ "pool_name": "metadata",
"pool_id": 1,
"recovery": {},
"recovery_rate": {},
"client_io_rate": {}},
...
]
client_io_rate in turn includes things like read_bytes_sec, write_bytes_sec and op_per_sec.
Look for a widget that would show the cluster structure: whatever is defined in the crushmap.
Also check if there's a built-in one that does what we need. Otherwise a candidate:
https://github.com/keirans/collectd-iostat
and we might need to package it in the deb.
Please add a ceph client name option.
Ceph allows using different clients (usernames) with different permissions (option '-n'). It is best practice to use an identity with minimal permissions for a given task. The current configuration calls ceph without a client name, which implies 'client.admin'. Such broad permissions for a monitoring service are excessive, IMHO.
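A sketch of how such an option could be threaded into the command lines the plugins build (`client` and `keyring` are hypothetical option names, not existing plugin parameters):

```python
def ceph_cmdline(args, cluster='ceph', client=None, keyring=None):
    """Build a ceph CLI invocation with an optional named client.

    Passing e.g. client='client.monitor' adds '-n client.monitor',
    so a low-privilege identity is used instead of the implicit
    client.admin.
    """
    cmd = ['ceph'] + args + ['--cluster', cluster]
    if client:
        cmd += ['-n', client]
    if keyring:
        cmd += ['--keyring', keyring]
    return cmd
```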
Hi, I'm Taehoon.
I'm deploying a Ceph cluster (Luminous) and building a monitoring system with Graphite, Grafana and collectd, but I'm hitting the trouble below.
------- /var/log/messages -------
Jun 11 10:09:42 ceph-mgr.cdngp.net collectd[57815]: ceph: failed to get stats :: 'fs_perf_stat' :: Traceback (most recent call last):
File "/usr/lib64/collectd/plugins/ceph/base.py", line 114, in read_callback
stats = self.get_stats()
File "/usr/lib64/collectd/plugins/ceph/ceph_pg_plugin.py", line 79, in get_stats
data[ceph_cluster][osd_id]['apply_latency_ms'] = osd['fs_perf_stat']['apply_latency_ms']
KeyError: 'fs_perf_stat'
Jun 11 10:09:42 ceph-mgr.cdngp.net collectd[57815]: Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
Jun 11 10:09:42 ceph-mgr.cdngp.net collectd[57815]: read-function of plugin `python.ceph_pg_plugin' failed. Will suspend it for 120.000 seconds.
Jun 11 10:09:43 ceph-mgr.cdngp.net collectd: dumped fsmap epoch 196
Somebody help me, please. :-<
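The `KeyError: 'fs_perf_stat'` suggests newer Ceph releases no longer emit that key per OSD in `ceph pg dump`. A defensive sketch that falls back to a possible `perf_stat` key and then to zeros rather than crash (key names other than `fs_perf_stat` are assumptions):

```python
def osd_latencies(osd):
    """Pull (commit, apply) latency in ms from one 'osd_stats' entry.

    Older 'ceph pg dump' nests the values under 'fs_perf_stat'; newer
    releases appear to have moved/renamed the block (hence the KeyError
    above), so probe an alternative key and default to zeros.
    """
    perf = osd.get('fs_perf_stat') or osd.get('perf_stat') or {}
    return (perf.get('commit_latency_ms', 0),
            perf.get('apply_latency_ms', 0))
```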
The collectd configuration that matches the Grafana dashboard wasn't easy to guess for me, and probably not for anyone seeing collectd for the first time.
I created a dockerized version of collectd and this plugin: https://github.com/bobrik/ceph-collectd-graphite
At minimum, a collectd configuration (like the one in my repo) should be mentioned in the readme for this plugin. The dockerized version could also be mentioned; it is much easier to deploy from scratch.
Thanks for this plugin!