collectd-ceph's People

Contributors

bobrik, mourgaya, rochaporto, umesecke

collectd-ceph's Issues

placement groups dump (pg stats)

Metrics to be collected:

  • number of pgs in each possible state
  • ? perf stats per pool
  • ? perf stats per osd
ceph pg dump --format json-pretty

{ "version": 401,
  "stamp": "2014-05-19 23:33:26.976176",
  "last_osdmap_epoch": 29,
  "last_pg_scan": 23,
  "full_ratio": "0.950000",
  "near_full_ratio": "0.750000",
  "pg_stats_sum": { "stat_sum": { "num_bytes": 22908977,
          "num_objects": 13,
          "num_object_clones": 0,
          "num_object_copies": 26,
          "num_objects_missing_on_primary": 0,
          "num_objects_degraded": 0,
          "num_objects_unfound": 0,
          "num_read": 1350,
          "num_read_kb": 1079,
          "num_write": 99,
          "num_write_kb": 31913,
          "num_scrub_errors": 0,
          "num_shallow_scrub_errors": 0,
          "num_deep_scrub_errors": 0,
          "num_objects_recovered": 0,
          "num_bytes_recovered": 0,
          "num_keys_recovered": 0},
      "stat_cat_sum": {},
      "log_size": 127,
      "ondisk_log_size": 127},
  "osd_stats_sum": { "kb": 6815452,
      "kb_used": 2220596,
      "kb_avail": 4594856,
      "hb_in": [],
      "hb_out": [],
      "snap_trim_queue_len": 0,
      "num_snap_trimming": 0,
      "op_queue_age_hist": { "histogram": [],
          "upper_bound": 1},
      "fs_perf_stat": { "commit_latency_ms": 1021,
          "apply_latency_ms": 11153}},
  "pg_stats_delta": { "stat_sum": { "num_bytes": 0,
          "num_objects": 0,
          "num_object_clones": 0,
          "num_object_copies": 0,
          "num_objects_missing_on_primary": 0,
          "num_objects_degraded": 0,
          "num_objects_unfound": 0,
          "num_read": 0,
          "num_read_kb": 0,
          "num_write": 0,
          "num_write_kb": 0,
          "num_scrub_errors": 0,
          "num_shallow_scrub_errors": 0,
          "num_deep_scrub_errors": 0,
          "num_objects_recovered": 0,
          "num_bytes_recovered": 0,
          "num_keys_recovered": 0},
      "stat_cat_sum": {},
      "log_size": 0,
      "ondisk_log_size": 0},
  "pg_stats": [
        { "pgid": "14.31",
          "version": "0'0",
          "reported_seq": "11",
          "reported_epoch": "29",
          "state": "active+clean",
          "last_fresh": "2014-05-19 23:26:25.549117",
          "last_change": "2014-05-16 04:34:11.013010",
          "last_active": "2014-05-19 23:26:25.549117",
          "last_clean": "2014-05-19 23:26:25.549117",
          "last_became_active": "0.000000",
          "last_unstale": "2014-05-19 23:26:25.549117",
          "mapping_epoch": 23,
          "log_start": "0'0",
          "ondisk_log_start": "0'0",
          "created": 23,
          "last_epoch_clean": 25,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2014-05-16 04:28:54.807245",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2014-05-16 04:28:54.807245",
          "last_clean_scrub_stamp": "0.000000",
          "log_size": 0,
          "ondisk_log_size": 0,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 0,
              "num_objects": 0,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 0,
              "num_read_kb": 0,
              "num_write": 0,
              "num_write_kb": 0,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 0,
              "num_bytes_recovered": 0,
              "num_keys_recovered": 0},
          "stat_cat_sum": {},
          "up": [
                0,
                1],
          "acting": [
                0,
                1]},
...
  "pool_stats": [
        { "poolid": 0,
          "stat_sum": { "num_bytes": 0,
              "num_objects": 0,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 0,
              "num_read_kb": 0,
              "num_write": 0,
              "num_write_kb": 0,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 0,
              "num_bytes_recovered": 0,
              "num_keys_recovered": 0},
          "stat_cat_sum": {},
          "log_size": 0,
          "ondisk_log_size": 0},
...
  "osd_stats": [
        { "osd": 0,
          "kb": 2919444,
          "kb_used": 1110328,
          "kb_avail": 1809116,
          "hb_in": [
                1],
          "hb_out": [],
          "snap_trim_queue_len": 0,
          "num_snap_trimming": 0,
          "op_queue_age_hist": { "histogram": [],
              "upper_bound": 1},
          "fs_perf_stat": { "commit_latency_ms": 484,
              "apply_latency_ms": 4703}},
        { "osd": 1,
          "kb": 3896008,
          "kb_used": 1110268,
          "kb_avail": 2785740,
          "hb_in": [
                0],
          "hb_out": [],
          "snap_trim_queue_len": 0,
          "num_snap_trimming": 0,
          "op_queue_age_hist": { "histogram": [],
              "upper_bound": 1},
          "fs_perf_stat": { "commit_latency_ms": 537,
              "apply_latency_ms": 6450}}]}

mon dump plugin (monitor stats)

Metrics to be collected:

  • number of monitors: len(data['mons'])
  • current quorum: len(data['quorum'])
ceph mon dump --format json-pretty
dumped monmap epoch 1

{ "epoch": 1,
  "fsid": "95f73d54-8bdc-41d4-a540-fbd4480dc511",
  "modified": "0.000000",
  "created": "0.000000",
  "mons": [
        { "rank": 0,
          "name": "ceph",
          "addr": "192.168.5.230:6789\/0"}],
  "quorum": [
        0]}
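A minimal sketch of collecting these two values (assuming the 'ceph' CLI is on the PATH and the JSON layout matches the dump above; the 'dumped monmap epoch N' line should go to stderr, so stdout parses cleanly):

import json
import subprocess

def mon_stats(cluster='ceph'):
    """Return the number of monitors and the current quorum size."""
    output = subprocess.check_output(
        ['ceph', 'mon', 'dump', '--format', 'json', '--cluster', cluster])
    data = json.loads(output)
    return {'num_mons': len(data['mons']), 'quorum': len(data['quorum'])}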

plugin interval values

Hello,
experimenting with ceph_pool_plugin, I noticed that when changing the Interval (the plugin parameter, not the collectd global Interval setting) to something like 10 seconds, Graphite was not drawing any data, even though I had already set the Graphite retention period to 10 seconds.

Is the 60-second plugin Interval I see in the example some kind of restriction?

Regards,
Kostis

CentOS 6.6: AttributeError: 'module' object has no attribute 'check_output'

@rochaporto A big thanks for sharing these plugins.

Need help: while trying to use ceph_pool_plugin, I got this error in the collectd logs.

[2015-07-09 16:53:04] [error] ceph-pool: failed to ceph pool stats :: 'module' object has no attribute 'check_output' :: Traceback (most recent call last):
  File "/usr/lib/collectd/plugins/ceph/ceph_pool_plugin.py", line 54, in get_stats
    stats_output = subprocess.check_output('ceph osd pool stats -f json', shell=True)
AttributeError: 'module' object has no attribute 'check_output'

[2015-07-09 16:53:04] [info] ceph: collectd new data from service :: took 0 seconds
[2015-07-09 16:53:04] [error] ceph: failed to retrieve stats
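The likely root cause is that CentOS 6 ships Python 2.6, where subprocess.check_output does not exist yet (it was added in 2.7). A rough backport sketch (my suggestion, not part of the plugin) that could be loaded before the plugins are imported:

import subprocess

if not hasattr(subprocess, 'check_output'):
    def _check_output(*popenargs, **kwargs):
        # simplified version of the Python 2.7 implementation
        process = subprocess.Popen(stdout=subprocess.PIPE, *popenargs, **kwargs)
        output, _ = process.communicate()
        retcode = process.poll()
        if retcode:
            raise subprocess.CalledProcessError(retcode, kwargs.get('args', popenargs[0]))
        return output
    subprocess.check_output = _check_output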

ceph: failed to get stats :: float division by zero for ceph OSD percentage used metrics

ceph: failed to get stats :: float division by zero :: Traceback (most recent call last):
  File "/usr/lib/collectd/plugins/ceph/base.py", line 125, in read_callback
    stats = self.get_stats(config)
  File "/usr/lib/collectd/plugins/ceph/ceph_pg_plugin.py", line 73, in get_stats
    data[ceph_cluster][osd_id]['percent_used'] = 100.0 * (osd['kb_used'] / float(osd['kb']))
ZeroDivisionError: float division by zero
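A defensive sketch of the percentage calculation that avoids the division by zero (my own suggestion, not the plugin's current code):

def percent_used(osd):
    """Used capacity in percent; 0.0 when the OSD reports no capacity (e.g. down/out)."""
    kb = float(osd.get('kb', 0))
    if kb <= 0:
        return 0.0
    return 100.0 * osd.get('kb_used', 0) / kb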

CentOS 7 errors on collectd start using ceph_pool_plugin

After starting collectd on CentOS 7 (Ceph Giant, since upgraded to Hammer), I'm getting the following log errors when using ceph_pool_plugin.

-- Unit collectd.service has begun starting up.
Apr 15 15:04:18 ceph1.domain systemd[1]: Started Collectd statistics daemon.
-- Subject: Unit collectd.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit collectd.service has finished starting up.
-- 
-- The start-up result is done.
Apr 15 15:04:18 ceph1.domain collectd[22862]: Initialization complete, entering read-loop.
Apr 15 15:04:18 ceph1.domain python[22874]: detected unhandled Python exception in '/usr/bin/ceph'
Apr 15 15:04:18 ceph1.domain abrt-server[22881]: Package 'ceph-common' isn't signed with proper key
Apr 15 15:04:18 ceph1.domain abrt-server[22881]: 'post-create' on '/var/tmp/abrt/Python-2015-04-15-15:04:18-22874' exited with 1
Apr 15 15:04:18 ceph1.domain abrt-server[22881]: Deleting problem directory '/var/tmp/abrt/Python-2015-04-15-15:04:18-22874'
Apr 15 15:04:18 ceph1.domain collectd[22862]: Traceback (most recent call last):
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 896, in <module>
Apr 15 15:04:18 ceph1.domain collectd[22862]: retval = main()
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 647, in main
Apr 15 15:04:18 ceph1.domain collectd[22862]: conffile=conffile)
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib/python2.7/site-packages/rados.py", line 212, in __init__
Apr 15 15:04:18 ceph1.domain collectd[22862]: library_path  = find_library('rados')
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 244, in find_library
Apr 15 15:04:18 ceph1.domain collectd[22862]: return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 237, in _findSoname_ldconfig
Apr 15 15:04:18 ceph1.domain collectd[22862]: f.close()
Apr 15 15:04:18 ceph1.domain collectd[22862]: IOError: [Errno 10] No child processes
Apr 15 15:04:18 ceph1.domain python[22884]: detected unhandled Python exception in '/usr/bin/ceph'
Apr 15 15:04:18 ceph1.domain abrt-server[22891]: Not saving repeating crash in '/usr/bin/ceph'
Apr 15 15:04:18 ceph1.domain collectd[22862]: Traceback (most recent call last):
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 896, in <module>
Apr 15 15:04:18 ceph1.domain collectd[22862]: retval = main()
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 647, in main
Apr 15 15:04:18 ceph1.domain collectd[22862]: conffile=conffile)
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib/python2.7/site-packages/rados.py", line 212, in __init__
Apr 15 15:04:18 ceph1.domain collectd[22862]: library_path  = find_library('rados')
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 244, in find_library
Apr 15 15:04:18 ceph1.domain collectd[22862]: return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 237, in _findSoname_ldconfig
Apr 15 15:04:18 ceph1.domain collectd[22862]: f.close()
Apr 15 15:04:18 ceph1.domain collectd[22862]: IOError: [Errno 10] No child processes
Apr 15 15:04:18 ceph1.domain collectd[22862]: ceph: failed to get stats :: No JSON object could be decoded :: Traceback (most recent call last):
                                                      File "/usr/lib64/collectd/base.py", line 114, in read_callback
                                                        stats = self.get_stats()
                                                      File "/usr/lib64/collectd/ceph_pool_plugin.py", line 67, in get_stats
                                                        json_stats_data = json.loads(stats_output)
                                                      File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
                                                        return _default_decoder.decode(s)
                                                      File "/usr/lib64/python2.7/json/decoder.py", line 365, in decode
                                                        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
                                                      File "/usr/lib64/python2.7/json/decoder.py", line 383, in raw_decode
                                                        raise ValueError("No JSON object could be decoded")
                                                    ValueError: No JSON object could be decoded
Apr 15 15:04:18 ceph1.domain collectd[22862]: Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
Apr 15 15:04:18 ceph1.domain collectd[22862]: read-function of plugin `python.ceph_pool_plugin' failed. Will suspend it for 120.000 seconds.

collectd.conf:

<LoadPlugin python>
  Globals true
</LoadPlugin>

<Plugin "python">
    ModulePath "/usr/lib64/collectd"

    Import "ceph_pool_plugin"

    <Module "ceph_pool_plugin">
        Verbose "True"
        Cluster "ceph"
        Interval "60"
        TestPool "rbd"
    </Module>
</Plugin>

Pool parsing fails with ceph 0.87

From the release notes:

The rd_kb and wr_kb fields in the JSON dumps for pool stats (accessed via the ceph df detail -f json-pretty and related commands) have been replaced with corresponding *_bytes fields. Similarly, the total_space, total_used, and total_avail fields are replaced with total_bytes, total_used_bytes, and total_avail_bytes fields.

This breaks the plugin
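A compatibility sketch for reading the cluster totals across the rename (field names taken from the release note above; the pre-0.87 values are in KB, as in the 'ceph df' sample later in this list):

def cluster_totals(df_data):
    """Return (total, used, avail) in bytes from 'ceph df' JSON, old or new format."""
    stats = df_data['stats']
    if 'total_bytes' in stats:  # ceph >= 0.87
        return stats['total_bytes'], stats['total_used_bytes'], stats['total_avail_bytes']
    # older releases report KB under the old field names
    return (stats['total_space'] * 1024,
            stats['total_used'] * 1024,
            stats['total_avail'] * 1024)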

io latency plugin

With something like:

rados -p test bench 10 write -t 1 -b 65536 2>/dev/null | grep -i latency | awk '{print 1000*$3}'
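A sketch of how a plugin might wrap that command (the number and order of latency values depends on the rados bench output of the installed release):

import subprocess

def bench_latencies_ms(pool='test'):
    """Run a short rados write benchmark and return its latency lines, in milliseconds."""
    cmd = ("rados -p %s bench 10 write -t 1 -b 65536 2>/dev/null"
           " | grep -i latency | awk '{print 1000*$3}'" % pool)
    output = subprocess.check_output(cmd, shell=True)
    return [float(v) for v in output.split()]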

get plugins to trigger shell processes

In each plugin we're doing subprocess.check_output(..., shell=False).

shell=False is the safer option, but it causes issues when multiple Python plugins are loaded. Enabling the shell seems to fix it.

Fork, merge, stuff

Hello All,

I'd had enough of the outstanding bugs, pull requests, and forks lying around since the beginning of 2015, so I have forked the project and merged in most of the non-overlapping pull requests and some forks.

https://github.com/grinapo/collectd-ceph

If anyone wants to create pull requests against it, feel free; I'll try to merge them. I don't promise to fix bugs, but I may eventually, since I'm using it as well.

Obviously it'd be okay to pull it back here if rochaporto's back again.

list index out of range in latency plugin

Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: ceph: failed to get stats :: list index out of range :: Traceback (most recent call last):
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: File "/usr/lib/collectd/plugins/ceph/base.py", line 114, in read_callback
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: stats = self.get_stats()
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: File "/usr/lib/collectd/plugins/ceph/ceph_latency_plugin.py", line 67, in get_stats
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: data[ceph_cluster]['cluster']['stddev_latency'] = results[1]
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: IndexError: list index out of range
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]:
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: read-function of plugin `python.ceph.ceph_latency_plugin' failed. Will suspend it for 120.000 seconds.

Values in config are ignored

I'm running collectd 5.4.1 on CentOS 6.5, with the CentOS SCL python27 embedded into collectd (since this plugin requires Python 2.7 due to its use of subprocess.check_output).

I noticed that any values I set in the config are ignored.

14:13:53 root@sm-sensu /usr/lib/collectd/plugins/ceph-git/plugins $ cat /etc/collectd.d/ceph_latency.conf
<LoadPlugin "python">
Globals true

Interval 10
Debug True

<Plugin "python">
ModulePath "/usr/lib/collectd/plugins/ceph-git/plugins"

Import "ceph_latency_plugin"

<Module "ceph_latency_plugin">
            Verbose "true"
            Cluster ceph
            Interval 10
            TestPool test
</Module>

I believe this is because the register_read happens before the configure_callback is called.

config key: Cluster - ceph
config key: Interval - 10
config key: TestPool - test
Stopping collectd: [ OK ]
Starting collectd: latency plugin registering with interval: 60.0

To fix this, I moved the register_read inside the callback, and now it works:
def configure_callback(conf):
    """Received configuration information"""
    plugin.config_callback(conf)
    collectd.error("latency plugin registering with interval: %s" % plugin.interval)
    collectd.register_read(read_callback, plugin.interval)

Stopping collectd: [ OK ]
Starting collectd: config key: Verbose - true
config key: Cluster - ceph
config key: Interval - 10.0
config key: TestPool - test
latency plugin registering with interval: 10.0

protect against failures in ceph latency plugin

The current plugin retries forever when rados bench fails (the process just keeps trying).

We need to wrap it in a timeout so that it is guaranteed to go away when it fails to run (as in a network connectivity issue, for example).
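One hedged option is to prefix the benchmark with coreutils 'timeout' and treat empty output as a failed run (a sketch, not the current implementation):

import subprocess

def bench_with_timeout(pool='test', timeout_s=30):
    """Kill a hung rados bench after timeout_s seconds and skip the interval."""
    cmd = ("timeout %d rados -p %s bench 10 write -t 1 -b 65536 2>/dev/null"
           " | grep -i latency | awk '{print 1000*$3}'" % (timeout_s, pool))
    output = subprocess.check_output(cmd, shell=True)
    if not output.strip():
        return None  # the bench was killed or produced nothing
    return [float(v) for v in output.split()]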

Syntax error

Hi there !

I found a bug in a variable name here:

cephdf_cmdline='ceph df -f json --cluster ' + self.cluster 
df_output = subprocess.check_output(ceph_dfcmdline, shell=True)

Just need to rename ceph_dfcmdline to cephdf_cmdline.

bye

osd dump plugin (osd / pool stats)

Metrics to be included:

  • number of osds in each state (in, out, down, up)
  • number of pools
ceph osd dump --format json-pretty

{ "epoch": 29,
  "fsid": "95f73d54-8bdc-41d4-a540-fbd4480dc511",
  "created": "2014-05-16 04:21:58.549874",
  "modified": "2014-05-19 23:26:25.318330",
  "flags": "",
  "cluster_snapshot": "",
  "pool_max": 14,
  "max_osd": 2,
  "pools": [
        { "pool": 0,
          "pool_name": "data",
          "flags": 0,
          "flags_names": "",
          "type": 1,
          "size": 2,
          "min_size": 2,
          "crush_ruleset": 0,
          "object_hash": 2,
          "pg_num": 64,
          "pg_placement_num": 64,
          "crash_replay_interval": 45,
          "last_change": "3",
          "auid": 0,
          "snap_mode": "selfmanaged",
          "snap_seq": 0,
          "snap_epoch": 0,
          "pool_snaps": {},
          "removed_snaps": "[]",
          "quota_max_bytes": 0,
          "quota_max_objects": 0,
          "tiers": [],
          "tier_of": -1,
          "read_tier": -1,
          "write_tier": -1,
          "cache_mode": "none",
          "properties": []},
        { "pool": 1,
          "pool_name": "metadata",
          "flags": 0,
          "flags_names": "",
          "type": 1,
          "size": 2,
          "min_size": 2,
          "crush_ruleset": 1,
          "object_hash": 2,
          "pg_num": 64,
          "pg_placement_num": 64,
          "crash_replay_interval": 0,
          "last_change": "8",
          "auid": 0,
          "snap_mode": "selfmanaged",
          "snap_seq": 0,
          "snap_epoch": 0,
          "pool_snaps": {},
          "removed_snaps": "[]",
          "quota_max_bytes": 0,
          "quota_max_objects": 0,
          "tiers": [],
          "tier_of": -1,
          "read_tier": -1,
          "write_tier": -1,
          "cache_mode": "none",
          "properties": []},
        { "pool": 2,
          "pool_name": "rbd",
          "flags": 0,
          "flags_names": "",
          "type": 1,
          "size": 2,
          "min_size": 2,
          "crush_ruleset": 2,
          "object_hash": 2,
          "pg_num": 64,
          "pg_placement_num": 64,
          "crash_replay_interval": 0,
          "last_change": "2",
          "auid": 0,
          "snap_mode": "selfmanaged",
          "snap_seq": 0,
          "snap_epoch": 0,
          "pool_snaps": {},
          "removed_snaps": "[]",
          "quota_max_bytes": 0,
          "quota_max_objects": 0,
          "tiers": [],
          "tier_of": -1,
          "read_tier": -1,
          "write_tier": -1,
          "cache_mode": "none",
          "properties": []},
        { "pool": 3,
          "pool_name": "images",
          "flags": 0,
          "flags_names": "",
          "type": 1,
          "size": 2,
          "min_size": 2,
          "crush_ruleset": 0,
          "object_hash": 2,
          "pg_num": 64,
          "pg_placement_num": 64,
          "crash_replay_interval": 0,
          "last_change": "29",
          "auid": 0,
          "snap_mode": "selfmanaged",
          "snap_seq": 5,
          "snap_epoch": 29,
          "pool_snaps": {},
          "removed_snaps": "[1~2,4~1]",
          "quota_max_bytes": 0,
          "quota_max_objects": 0,
          "tiers": [],
          "tier_of": -1,
          "read_tier": -1,
          "write_tier": -1,
          "cache_mode": "none",
          "properties": []},
        { "pool": 4,
          "pool_name": "volumes",
          "flags": 0,
          "flags_names": "",
          "type": 1,
          "size": 2,
          "min_size": 2,
          "crush_ruleset": 0,
          "object_hash": 2,
          "pg_num": 64,
          "pg_placement_num": 64,
          "crash_replay_interval": 0,
          "last_change": "11",
          "auid": 0,
          "snap_mode": "selfmanaged",
          "snap_seq": 0,
          "snap_epoch": 0,
          "pool_snaps": {},
          "removed_snaps": "[]",
          "quota_max_bytes": 0,
          "quota_max_objects": 0,
          "tiers": [],
          "tier_of": -1,
          "read_tier": -1,
          "write_tier": -1,
          "cache_mode": "none",
          "properties": []}],
  "osds": [
        { "osd": 0,
          "uuid": "1e79235a-f094-47e1-80d1-8232d2d475cb",
          "up": 1,
          "in": 1,
          "last_clean_begin": 0,
          "last_clean_end": 0,
          "up_from": 13,
          "up_thru": 24,
          "down_at": 0,
          "lost_at": 0,
          "public_addr": "192.168.5.230:6800\/16168",
          "cluster_addr": "192.168.5.230:6801\/16168",
          "heartbeat_back_addr": "192.168.5.230:6802\/16168",
          "heartbeat_front_addr": "192.168.5.230:6803\/16168",
          "state": [
                "exists",
                "up"]},
        { "osd": 1,
          "uuid": "555ed8d0-11da-49b5-8ee9-3887c5937237",
          "up": 1,
          "in": 1,
          "last_clean_begin": 0,
          "last_clean_end": 0,
          "up_from": 16,
          "up_thru": 0,
          "down_at": 0,
          "lost_at": 0,
          "public_addr": "192.168.5.230:6805\/16349",
          "cluster_addr": "192.168.5.230:6806\/16349",
          "heartbeat_back_addr": "192.168.5.230:6807\/16349",
          "heartbeat_front_addr": "192.168.5.230:6808\/16349",
          "state": [
                "exists",
                "up"]}],
  "osd_xinfo": [
        { "osd": 0,
          "down_stamp": "0.000000",
          "laggy_probability": "0.000000",
          "laggy_interval": 0},
        { "osd": 1,
          "down_stamp": "0.000000",
          "laggy_probability": "0.000000",
          "laggy_interval": 0}],
  "pg_temp": [],
  "blacklist": []}

improve error logging / detection

Right now the get_stats functions do not exit on error.

In particular, we should check for empty output when a query to ceph fails, and log and exit immediately (otherwise we get a bunch of nasty stack traces).
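Something along these lines inside get_stats would avoid the stack traces (a sketch; the 'collectd' module is only importable when the code runs inside collectd's python plugin):

import json

import collectd  # available only inside collectd's python plugin

def parse_stats(stats_output):
    """Parse ceph JSON output; log and bail out when the command produced nothing."""
    if not stats_output or not stats_output.strip():
        collectd.error('ceph: command returned no output, skipping this read cycle')
        return None
    return json.loads(stats_output)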

cluster status (ceph perf dump)

Something similar to:

ceph --admin-daemon /var/run/ceph/ceph-mon.ceph.asok perf dump
{ "cluster": { "num_mon": 1,
      "num_mon_quorum": 1,
      "num_osd": 2,
      "num_osd_up": 2,
      "num_osd_in": 2,
      "osd_epoch": 29,
      "osd_kb": 6815452,
      "osd_kb_used": 2220596,
      "osd_kb_avail": 4594856,
      "num_pool": 15,
      "num_pg": 960,
      "num_pg_active_clean": 960,
      "num_pg_active": 960,
      "num_pg_peering": 0,
      "num_object": 13,
      "num_object_degraded": 0,
      "num_object_unfound": 0,
      "num_bytes": 22908977,
      "num_mds_up": 0,
      "num_mds_in": 0,
      "num_mds_failed": 0,
      "mds_epoch": 1},
  "leveldb": { "leveldb_get": 35461,
      "leveldb_transaction": 2166,
      "leveldb_compact": 0,
      "leveldb_compact_range": 2,
      "leveldb_compact_queue_merge": 0,
      "leveldb_compact_queue_len": 0},
  "mon": {},
  "throttle-mon_client_bytes": { "val": 0,
      "max": 104857600,
      "get": 2573538,
      "get_sum": 187405680,
      "get_or_fail_fail": 0,
      "get_or_fail_success": 0,
      "take": 0,
      "take_sum": 0,
      "put": 2573538,
      "put_sum": 187405680,
      "wait": { "avgcount": 0,
          "sum": 0.000000000}},
  "throttle-mon_daemon_bytes": { "val": 0,
      "max": 419430400,
      "get": 11354,
      "get_sum": 4799934,
      "get_or_fail_fail": 0,
      "get_or_fail_success": 0,
      "take": 0,
      "take_sum": 0,
      "put": 11354,
      "put_sum": 4799934,
      "wait": { "avgcount": 0,
          "sum": 0.000000000}},
  "throttle-msgr_dispatch_throttler-mon": { "val": 0,
      "max": 104857600,
      "get": 2584892,
      "get_sum": 192205614,
      "get_or_fail_fail": 0,
      "get_or_fail_success": 0,
      "take": 0,
      "take_sum": 0,
      "put": 2584892,
      "put_sum": 192205614,
      "wait": { "avgcount": 0,
          "sum": 0.000000000}}}

Even if some of these metrics also come from other places, it's still worth collecting them here.
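A sketch of pulling just the 'cluster' section out of the admin socket dump (assuming the monitor socket path used above):

import json
import subprocess

def mon_perf_cluster(socket='/var/run/ceph/ceph-mon.ceph.asok'):
    """Fetch the 'cluster' section of a monitor's perf dump via its admin socket."""
    output = subprocess.check_output(
        ['ceph', '--admin-daemon', socket, 'perf', 'dump'])
    return json.loads(output).get('cluster', {})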

No result from subprocess.check_output(...)

Commands like subprocess.check_output(['ceph', 'df', '-f', 'json']) return an empty string when called from collectd.
Other commands in a collectd Python plugin work fine (like 'ls' or 'pwd').
A simple Python script executing the same command (subprocess.check_output(['ceph', 'df', '-f', 'json'])) works well outside of collectd.
Where is the incompatibility between Python/collectd/Ceph (user rights, subprocess, ...)?
Any suggestions?

OS: Ubuntu 12.04
Collectd: 5.1.0
Python: 2.7.3

typo in pool code

--- ceph_pool_plugin.py~	2016-03-02 23:35:20.000000000 +0100
+++ ceph_pool_plugin.py	2016-03-03 00:04:04.876014974 +0100
@@ -54,7 +54,7 @@
             osd_pool_cmdline='ceph osd pool stats -f json --cluster ' + self.cluster
             stats_output = subprocess.check_output(osd_pool_cmdline, shell=True)
             cephdf_cmdline='ceph df -f json --cluster ' + self.cluster
-            df_output = subprocess.check_output(ceph_dfcmdline, shell=True)
+            df_output = subprocess.check_output(cephdf_cmdline, shell=True)
         except Exception as exc:
             collectd.error("ceph-pool: failed to ceph pool stats :: %s :: %s"
                     % (exc, traceback.format_exc()))

Error ceph_pool

Hello,

I have this error when I execute collectd; in syslog I see this:

ceph-pool: failed to ceph pool stats :: global name 'ceph_dfcmdline' is not defined :: Traceback (most recent call last):
  File "/opt/collectd/plugins/ceph_pool_plugin.py", line 57, in get_stats
    df_output = subprocess.check_output(ceph_dfcmdline, shell=True)
NameError: global name 'ceph_dfcmdline' is not defined

I compiled collectd with python and ceph.

Thanks for your help.

Video tutorial?

How do I use it? Do you have any video tutorial?
Any help would be appreciated.

read_op_per_sec / write_op_per_sec in Jewel

Heya,

It seems that Ceph changed the op_per_sec to read_op_per_sec and write_op_per_sec somewhere in Jewel. I think this is the reason I'm no longer getting op_per_sec stats in my graphite.

I might want to venture a pull request to fix this, but I'm unsure how to properly check for Ceph versions. Is there an example of this somewhere in the code already? Is there a recommended way of doing it?
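Rather than checking the Ceph version explicitly, one option is to key off the fields actually present in client_io_rate (a sketch, not an existing helper in the plugin):

def pool_op_rates(client_io_rate):
    """Op/sec fields for a pool, handling both pre- and post-Jewel key names."""
    if 'op_per_sec' in client_io_rate:
        # pre-Jewel: a single combined counter
        return {'op_per_sec': client_io_rate['op_per_sec']}
    # Jewel and later split reads and writes
    return {'read_op_per_sec': client_io_rate.get('read_op_per_sec', 0),
            'write_op_per_sec': client_io_rate.get('write_op_per_sec', 0)}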

space usage stats (ceph df)

Metrics to be collected:

  • total of space used and available
  • space used per pool
  • number of objects per pool
ceph df --format json-pretty

{ "stats": { "total_space": 6815452,
      "total_used": 2220596,
      "total_avail": 4594856},
  "pools": [
        { "name": "data",
          "id": 0,
          "stats": { "kb_used": 0,
              "bytes_used": 0,
              "objects": 0}},
        { "name": "metadata",
          "id": 1,
          "stats": { "kb_used": 0,
              "bytes_used": 0,
              "objects": 0}},
        { "name": "rbd",
          "id": 2,
          "stats": { "kb_used": 0,
              "bytes_used": 0,
              "objects": 0}},
        { "name": "images",
          "id": 3,
          "stats": { "kb_used": 22373,
              "bytes_used": 22908960,
              "objects": 9}},
        { "name": "volumes",
          "id": 4,
          "stats": { "kb_used": 1,
              "bytes_used": 17,
              "objects": 4}}]

No JSON object could be decoded

I am setting up a box to monitor a Ceph cluster. My client config seems to be fine; I am able to run all the commands that these scripts run internally. But I'm seeing the errors below in the collectd log. I am using Ubuntu 14.04 with collectd 5.4. Please help me with this issue. Thank you!

[2016-02-25 15:27:40] read-function of plugin `python.ceph_pool_plugin' failed. Will suspend it for 240.000 seconds.
[2016-02-25 15:31:40] ceph: failed to get stats :: No JSON object could be decoded :: Traceback (most recent call last):
File "/usr/lib/collectd/plugins/ceph/base.py", line 108, in read_callback
stats = self.get_stats()
File "/usr/lib/collectd/plugins/ceph/ceph_monitor_plugin.py", line 62, in get_stats
json_data = json.loads(output)
File "/usr/lib/python2.7/json/init.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

[2016-02-25 15:31:40] Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
[2016-02-25 15:31:40] read-function of plugin `python.ceph_monitor_plugin' failed. Will suspend it for 480.000 seconds.
[2016-02-25 15:31:40] ceph: failed to get stats :: No JSON object could be decoded :: Traceback (most recent call last):
File "/usr/lib/collectd/plugins/ceph/base.py", line 108, in read_callback
stats = self.get_stats()
File "/usr/lib/collectd/plugins/ceph/ceph_pool_plugin.py", line 67, in get_stats
json_stats_data = json.loads(stats_output)
File "/usr/lib/python2.7/json/init.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

cluster and pools stats (ceph osd pool stats)

Metrics to be collected:

  • Per pool read bytes / sec
  • Per pool write bytes / sec
  • Per pool IOPS
ceph osd pool stats -f json-pretty

[
    { "pool_name": "data",
      "pool_id": 0,
      "recovery": {},
      "recovery_rate": {},
      "client_io_rate": {}},
    { "pool_name": "metadata",
      "pool_id": 1,
      "recovery": {},
      "recovery_rate": {},
      "client_io_rate": {}},
    ...
]

The client_io_rate section also includes fields such as read_bytes_sec, write_bytes_sec, and op_per_sec.
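A small sketch of reading the per-pool rates, treating missing keys as zero (the idle pools above show client_io_rate as an empty object):

def pool_io_rates(pool_entry):
    """Per-pool client IO rates; keys absent for idle pools count as zero."""
    io = pool_entry.get('client_io_rate', {})
    return {'read_bytes_sec': io.get('read_bytes_sec', 0),
            'write_bytes_sec': io.get('write_bytes_sec', 0),
            'op_per_sec': io.get('op_per_sec', 0)}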

Add identity (ceph username) to config

Please add the ceph client name to the options.

Ceph allows using different clients (usernames) with different permissions (the '-n' option). It is best practice to use an identity with minimal permissions for a given task. The current configuration calls ceph without a client name, which implies 'client.admin'. Such wide permissions are too much for a monitoring service, IMHO.
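A sketch of what a hypothetical ClientName module option could look like in the config callback (the option name and default are my assumptions; collectd's python plugin exposes the configuration tree as conf.children):

def configure_callback(conf):
    """Pick up a hypothetical ClientName option; fall back to the implicit client.admin."""
    client_name = 'client.admin'
    for node in conf.children:
        if node.key == 'ClientName':
            client_name = node.values[0]
    # every ceph invocation would then pass the identity, e.g.:
    #   'ceph df -f json --cluster %s -n %s' % (cluster, client_name)
    return client_name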

Error on ceph_pg_plugin.py

Hi, I'm Taehoon.
I'm deploying a Ceph cluster (Luminous) and building a monitoring system with Graphite, Grafana, and collectd.

But I'm running into the trouble below.

-------/var/log/message ------------------------------------------------------------------------
Jun 11 10:09:42 ceph-mgr.cdngp.net collectd[57815]: ceph: failed to get stats :: 'fs_perf_stat' :: Traceback (most recent call last):
  File "/usr/lib64/collectd/plugins/ceph/base.py", line 114, in read_callback
    stats = self.get_stats()
  File "/usr/lib64/collectd/plugins/ceph/ceph_pg_plugin.py", line 79, in get_stats
    data[ceph_cluster][osd_id]['apply_latency_ms'] = osd['fs_perf_stat']['apply_latency_ms']
KeyError: 'fs_perf_stat'

Jun 11 10:09:42 ceph-mgr.cdngp.net collectd[57815]: Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment

Jun 11 10:09:42 ceph-mgr.cdngp.net collectd[57815]: read-function of plugin `python.ceph_pg_plugin' failed. Will suspend it for 120.000 seconds.

Jun 11 10:09:43 ceph-mgr.cdngp.net collectd: dumped fsmap epoch 196


Can somebody help me? :-<
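A guarded sketch of reading the OSD latencies; the assumption (worth verifying against your cluster's 'ceph pg dump' output) is that newer releases nest them under 'perf_stat' rather than 'fs_perf_stat':

def osd_latencies(osd_entry):
    """Commit/apply latency (ms) from a pg dump osd_stats entry, old or new key name."""
    perf = osd_entry.get('fs_perf_stat') or osd_entry.get('perf_stat') or {}
    return perf.get('commit_latency_ms', 0), perf.get('apply_latency_ms', 0)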

Collectd example configuration

The collectd configuration that matches the Grafana dashboard wasn't easy for me to guess, and it's probably not easy for anyone seeing collectd for the first time.

I created a dockerized version of collectd and this plugin: https://github.com/bobrik/ceph-collectd-graphite

At the very least, a collectd configuration (like the one in my repo) should be mentioned in the readme for this plugin. The dockerized version could also be mentioned; it is much easier to deploy from scratch.

Thanks for this plugin!
