rochaporto / collectd-ceph
collectd plugins and dashboards for ceph
License: GNU General Public License v2.0
Metrics to be collected:
ceph pg dump --format json-pretty
{ "version": 401,
"stamp": "2014-05-19 23:33:26.976176",
"last_osdmap_epoch": 29,
"last_pg_scan": 23,
"full_ratio": "0.950000",
"near_full_ratio": "0.750000",
"pg_stats_sum": { "stat_sum": { "num_bytes": 22908977,
"num_objects": 13,
"num_object_clones": 0,
"num_object_copies": 26,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_unfound": 0,
"num_read": 1350,
"num_read_kb": 1079,
"num_write": 99,
"num_write_kb": 31913,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0},
"stat_cat_sum": {},
"log_size": 127,
"ondisk_log_size": 127},
"osd_stats_sum": { "kb": 6815452,
"kb_used": 2220596,
"kb_avail": 4594856,
"hb_in": [],
"hb_out": [],
"snap_trim_queue_len": 0,
"num_snap_trimming": 0,
"op_queue_age_hist": { "histogram": [],
"upper_bound": 1},
"fs_perf_stat": { "commit_latency_ms": 1021,
"apply_latency_ms": 11153}},
"pg_stats_delta": { "stat_sum": { "num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_unfound": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0},
"stat_cat_sum": {},
"log_size": 0,
"ondisk_log_size": 0},
"pg_stats": [
{ "pgid": "14.31",
"version": "0'0",
"reported_seq": "11",
"reported_epoch": "29",
"state": "active+clean",
"last_fresh": "2014-05-19 23:26:25.549117",
"last_change": "2014-05-16 04:34:11.013010",
"last_active": "2014-05-19 23:26:25.549117",
"last_clean": "2014-05-19 23:26:25.549117",
"last_became_active": "0.000000",
"last_unstale": "2014-05-19 23:26:25.549117",
"mapping_epoch": 23,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 23,
"last_epoch_clean": 25,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "2014-05-16 04:28:54.807245",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2014-05-16 04:28:54.807245",
"last_clean_scrub_stamp": "0.000000",
"log_size": 0,
"ondisk_log_size": 0,
"stats_invalid": "0",
"stat_sum": { "num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_unfound": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0},
"stat_cat_sum": {},
"up": [
0,
1],
"acting": [
0,
1]},
...
"pool_stats": [
{ "poolid": 0,
"stat_sum": { "num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_unfound": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0},
"stat_cat_sum": {},
"log_size": 0,
"ondisk_log_size": 0},
...
"osd_stats": [
{ "osd": 0,
"kb": 2919444,
"kb_used": 1110328,
"kb_avail": 1809116,
"hb_in": [
1],
"hb_out": [],
"snap_trim_queue_len": 0,
"num_snap_trimming": 0,
"op_queue_age_hist": { "histogram": [],
"upper_bound": 1},
"fs_perf_stat": { "commit_latency_ms": 484,
"apply_latency_ms": 4703}},
{ "osd": 1,
"kb": 3896008,
"kb_used": 1110268,
"kb_avail": 2785740,
"hb_in": [
0],
"hb_out": [],
"snap_trim_queue_len": 0,
"num_snap_trimming": 0,
"op_queue_age_hist": { "histogram": [],
"upper_bound": 1},
"fs_perf_stat": { "commit_latency_ms": 537,
"apply_latency_ms": 6450}}]}
Metrics to be collected:
ceph mon dump --format json-pretty
dumped monmap epoch 1
{ "epoch": 1,
"fsid": "95f73d54-8bdc-41d4-a540-fbd4480dc511",
"modified": "0.000000",
"created": "0.000000",
"mons": [
{ "rank": 0,
"name": "ceph",
"addr": "192.168.5.230:6789\/0"}],
"quorum": [
0]}
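From this dump, the obvious monitor metrics are counts. A minimal parsing sketch using the fields shown above (the helper name is an assumption, not the plugin's actual code):

```python
import json

def mon_metrics(dump):
    """Count monitors and quorum members from 'ceph mon dump -f json'."""
    data = json.loads(dump)
    return {
        'num_mons': len(data.get('mons', [])),
        'num_quorum': len(data.get('quorum', [])),
    }
```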
Things like pg_num, pgp_num and size.
Hello,
While experimenting with ceph_pool_plugin I noticed that when I changed the Interval (the plugin parameter, not the collectd global Interval setting) to something like 10 seconds, Graphite was not drawing any data, even though I had already set the Graphite retention period to 10 seconds.
Is the 60-second plugin Interval shown in the example some kind of restriction?
Regards,
Kostis
Two plugin candidates:
https://collectd.org/wiki/index.php/Plugin:Disk
https://github.com/indygreg/collectd-diskstats
We might need to package them in the deb.
@rochaporto A big thanks for sharing these plugins.
Need help: while trying to use ceph_pool_plugin, I got this error in the collectd logs.
[2015-07-09 16:53:04] [error] ceph-pool: failed to ceph pool stats :: 'module' object has no attribute 'check_output' :: Traceback (most recent call last):
File "/usr/lib/collectd/plugins/ceph/ceph_pool_plugin.py", line 54, in get_stats
stats_output = subprocess.check_output('ceph osd pool stats -f json', shell=True)
AttributeError: 'module' object has no attribute 'check_output'
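The `'module' object has no attribute 'check_output'` error means collectd is embedding Python 2.6: `subprocess.check_output` was only added in Python 2.7. A minimal compatibility shim (the helper name is hypothetical, not part of the plugin):

```python
import subprocess

def check_output_compat(cmd, shell=False):
    """subprocess.check_output with a fallback for Python 2.6.

    On 2.6 the function does not exist, so emulate it with Popen.
    """
    if hasattr(subprocess, 'check_output'):
        return subprocess.check_output(cmd, shell=shell)
    proc = subprocess.Popen(cmd, shell=shell, stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return out
```

Alternatively, run collectd against a Python 2.7 interpreter (as another issue below does with the CentOS SCL python27).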
[2015-07-09 16:53:04] [info] ceph: collectd new data from service :: took 0 seconds
[2015-07-09 16:53:04] [error] ceph: failed to retrieve stats
ceph: failed to get stats :: float division by zero :: Traceback (most recent call last):
File "/usr/lib/collectd/plugins/ceph/base.py", line 125, in read_callback
stats = self.get_stats(config)
File "/usr/lib/collectd/plugins/ceph/ceph_pg_plugin.py", line 73, in get_stats
data[ceph_cluster][osd_id]['percent_used'] = 100.0 * (osd['kb_used'] / float(osd['kb']))
ZeroDivisionError: float division by zero
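The division by zero happens when an OSD reports `kb` as 0 (for example while it is down or out). A guarded sketch of the percent-used computation, with a hypothetical helper name:

```python
def percent_used(kb_used, kb):
    """Percent of capacity used, or 0.0 when total kb is zero.

    An OSD that is down/out can report kb == 0, which would otherwise
    raise ZeroDivisionError in the read callback.
    """
    if not kb:
        return 0.0
    return 100.0 * kb_used / float(kb)
```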
max_avail is an important metric gathered by "ceph df", but it's not being collected anymore.
After starting collectd on CentOS 7 (Ceph Giant, now upgraded to Hammer), I'm getting the following log errors using the ceph_pool_plugin.
-- Unit collectd.service has begun starting up.
Apr 15 15:04:18 ceph1.domain systemd[1]: Started Collectd statistics daemon.
-- Subject: Unit collectd.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit collectd.service has finished starting up.
--
-- The start-up result is done.
Apr 15 15:04:18 ceph1.domain collectd[22862]: Initialization complete, entering read-loop.
Apr 15 15:04:18 ceph1.domain python[22874]: detected unhandled Python exception in '/usr/bin/ceph'
Apr 15 15:04:18 ceph1.domain abrt-server[22881]: Package 'ceph-common' isn't signed with proper key
Apr 15 15:04:18 ceph1.domain abrt-server[22881]: 'post-create' on '/var/tmp/abrt/Python-2015-04-15-15:04:18-22874' exited with 1
Apr 15 15:04:18 ceph1.domain abrt-server[22881]: Deleting problem directory '/var/tmp/abrt/Python-2015-04-15-15:04:18-22874'
Apr 15 15:04:18 ceph1.domain collectd[22862]: Traceback (most recent call last):
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 896, in <module>
Apr 15 15:04:18 ceph1.domain collectd[22862]: retval = main()
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 647, in main
Apr 15 15:04:18 ceph1.domain collectd[22862]: conffile=conffile)
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib/python2.7/site-packages/rados.py", line 212, in __init__
Apr 15 15:04:18 ceph1.domain collectd[22862]: library_path = find_library('rados')
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 244, in find_library
Apr 15 15:04:18 ceph1.domain collectd[22862]: return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 237, in _findSoname_ldconfig
Apr 15 15:04:18 ceph1.domain collectd[22862]: f.close()
Apr 15 15:04:18 ceph1.domain collectd[22862]: IOError: [Errno 10] No child processes
Apr 15 15:04:18 ceph1.domain python[22884]: detected unhandled Python exception in '/usr/bin/ceph'
Apr 15 15:04:18 ceph1.domain abrt-server[22891]: Not saving repeating crash in '/usr/bin/ceph'
Apr 15 15:04:18 ceph1.domain collectd[22862]: Traceback (most recent call last):
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 896, in <module>
Apr 15 15:04:18 ceph1.domain collectd[22862]: retval = main()
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 647, in main
Apr 15 15:04:18 ceph1.domain collectd[22862]: conffile=conffile)
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib/python2.7/site-packages/rados.py", line 212, in __init__
Apr 15 15:04:18 ceph1.domain collectd[22862]: library_path = find_library('rados')
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 244, in find_library
Apr 15 15:04:18 ceph1.domain collectd[22862]: return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 237, in _findSoname_ldconfig
Apr 15 15:04:18 ceph1.domain collectd[22862]: f.close()
Apr 15 15:04:18 ceph1.domain collectd[22862]: IOError: [Errno 10] No child processes
Apr 15 15:04:18 ceph1.domain collectd[22862]: ceph: failed to get stats :: No JSON object could be decoded :: Traceback (most recent call last):
File "/usr/lib64/collectd/base.py", line 114, in read_callback
stats = self.get_stats()
File "/usr/lib64/collectd/ceph_pool_plugin.py", line 67, in get_stats
json_stats_data = json.loads(stats_output)
File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python2.7/json/decoder.py", line 365, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python2.7/json/decoder.py", line 383, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Apr 15 15:04:18 ceph1.domain collectd[22862]: Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
Apr 15 15:04:18 ceph1.domain collectd[22862]: read-function of plugin `python.ceph_pool_plugin' failed. Will suspend it for 120.000 seconds.
collectd.conf:
<LoadPlugin python>
Globals true
</LoadPlugin>
<Plugin "python">
ModulePath "/usr/lib64/collectd"
Import "ceph_pool_plugin"
<Module "ceph_pool_plugin">
Verbose "True"
Cluster "ceph"
Interval "60"
TestPool "rbd"
</Module>
</Plugin>
From the release notes:
The rd_kb and wr_kb fields in the JSON dumps for pool stats (accessed via the ceph df detail -f json-pretty and related commands) have been replaced with corresponding *_bytes fields. Similarly, the total_space, total_used, and total_avail fields are replaced with total_bytes, total_used_bytes, and total_avail_bytes fields.
This breaks the plugin.
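One way to keep the plugin working across the rename is to probe for the new `*_bytes` field and fall back to the old `*_kb` one, converting to bytes either way. A sketch with a hypothetical `get_rate` helper:

```python
def get_rate(stats, base):
    """Fetch a pool stat that may be reported in kb or in bytes.

    'stats' is a dict from 'ceph df detail -f json'; 'base' is e.g.
    'rd' or 'wr'. Pre-Hammer output has 'rd_kb'/'wr_kb'; newer output
    has 'rd_bytes'/'wr_bytes'. Returns bytes in both cases.
    """
    if base + '_bytes' in stats:
        return stats[base + '_bytes']
    return stats.get(base + '_kb', 0) * 1024
```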
With something like:
rados -p test bench 10 write -t 1 -b 65536 2>/dev/null | grep -i latency | awk '{print 1000*$3}'
In each plugin we're doing subprocess.check_output(..., shell=False).
This is better, but causes issues when multiple python plugins are loaded. Enabling shell seems to fix it.
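For context, the practical difference: with `shell=False`, `subprocess` expects an argument list, so a single command string must be split first; with `shell=True` the whole string is handed to `/bin/sh`, which does the splitting (and pipes/redirections work). A small illustration (assumed example, not the plugin's code):

```python
import shlex

cmd = 'ceph df -f json --cluster ceph'

# shell=False needs an argument list, not one string:
args = shlex.split(cmd)
# subprocess.check_output(args)            # no shell involved
# subprocess.check_output(cmd, shell=True) # /bin/sh does the splitting
```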
At https://github.com/rochaporto/collectd-ceph I have not found a description of how to install the collectd ceph plugin.
Hello All,
Having had enough of the outstanding bugs, pull requests and forks lying around since the beginning of 2015, I have forked the repo and merged in most of the non-overlapping pull requests and some forks.
https://github.com/grinapo/collectd-ceph
If anyone wants to create pull requests against it, feel free; I'll try to merge them. I don't promise to fix bugs, but I may eventually, since I'm using it as well.
Obviously it would be fine to pull it back here if rochaporto is back again.
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: ceph: failed to get stats :: list index out of range :: Traceback (most recent call last):
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: File "/usr/lib/collectd/plugins/ceph/base.py", line 114, in read_callback
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: stats = self.get_stats()
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: File "/usr/lib/collectd/plugins/ceph/ceph_latency_plugin.py", line 67, in get_stats
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: data[ceph_cluster]['cluster']['stddev_latency'] = results[1]
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: IndexError: list index out of range
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]:
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
Jan 18 11:59:11 36s62 docker/9781622b6cd5[13999]: read-function of plugin `python.ceph.ceph_latency_plugin' failed. Will suspend it for 120.000 seconds.
I'm running collectd 5.4.1 on CentOS 6.5, with the CentOS SCL python27 embedded into collectd (since this plugin requires Python 2.7 due to its use of subprocess.check_output).
I noticed that any values I set in the config are ignored:
14:13:53 root@sm-sensu /usr/lib/collectd/plugins/ceph-git/plugins $ cat /etc/collectd.d/ceph_latency.conf
<LoadPlugin "python">
Globals true
Interval 10
Debug True
</LoadPlugin>
<Plugin "python">
ModulePath "/usr/lib/collectd/plugins/ceph-git/plugins"
Import "ceph_latency_plugin"
<Module "ceph_latency_plugin">
Verbose "true"
Cluster ceph
Interval 10
TestPool test
</Module>
</Plugin>
I believe this is because the register_read happens before the configure_callback is called.
config key: Cluster - ceph
config key: Interval - 10
config key: TestPool - test
Stopping collectd: [ OK ]
Starting collectd: latency plugin registering with interval: 60.0
To fix this, I moved the register_read inside the callback, and now it works:
def configure_callback(conf):
    """Received configuration information"""
    plugin.config_callback(conf)
    collectd.error("latency plugin registering with interval: %s" % plugin.interval)
    collectd.register_read(read_callback, plugin.interval)
Stopping collectd: [ OK ]
Starting collectd: config key: Verbose - true
config key: Cluster - ceph
config key: Interval - 10.0
config key: TestPool - test
latency plugin registering with interval: 10.0
The current plugin retries forever when rados bench fails (the process just keeps trying).
We need to wrap it in a timeout so that the process is killed when it fails to run (as with a network connectivity issue, for example).
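The Python 2.7 embedded in collectd lacks the `timeout=` argument that `subprocess` gained in Python 3.3, so the kill has to be done by hand. One possible pattern, using `threading.Timer` (the helper name is hypothetical):

```python
import subprocess
import threading

def run_with_timeout(args, timeout):
    """Run a command, killing it after 'timeout' seconds.

    Works on the Python 2.7 embedded in collectd, which has no
    timeout= argument on subprocess calls.
    """
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    timer = threading.Timer(timeout, proc.kill)
    timer.start()
    try:
        out, err = proc.communicate()
    finally:
        timer.cancel()
    if proc.returncode != 0:
        raise RuntimeError('command failed or timed out: %r' % (args,))
    return out
```

A hung `rados bench` would then be killed and surface as a single logged error instead of an ever-growing pile of stuck processes.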
Hi there !
I found a bug in a variable name here:
cephdf_cmdline='ceph df -f json --cluster ' + self.cluster
df_output = subprocess.check_output(ceph_dfcmdline, shell=True)
Just rename ceph_dfcmdline to cephdf_cmdline.
Bye
So that we can reference the plugins location as a package.
Same issue as in collectd-openstack, the interval is not being passed properly to the dispatch call of collectd.
Metrics to be included:
ceph osd dump --format json-pretty
{ "epoch": 29,
"fsid": "95f73d54-8bdc-41d4-a540-fbd4480dc511",
"created": "2014-05-16 04:21:58.549874",
"modified": "2014-05-19 23:26:25.318330",
"flags": "",
"cluster_snapshot": "",
"pool_max": 14,
"max_osd": 2,
"pools": [
{ "pool": 0,
"pool_name": "data",
"flags": 0,
"flags_names": "",
"type": 1,
"size": 2,
"min_size": 2,
"crush_ruleset": 0,
"object_hash": 2,
"pg_num": 64,
"pg_placement_num": 64,
"crash_replay_interval": 45,
"last_change": "3",
"auid": 0,
"snap_mode": "selfmanaged",
"snap_seq": 0,
"snap_epoch": 0,
"pool_snaps": {},
"removed_snaps": "[]",
"quota_max_bytes": 0,
"quota_max_objects": 0,
"tiers": [],
"tier_of": -1,
"read_tier": -1,
"write_tier": -1,
"cache_mode": "none",
"properties": []},
{ "pool": 1,
"pool_name": "metadata",
"flags": 0,
"flags_names": "",
"type": 1,
"size": 2,
"min_size": 2,
"crush_ruleset": 1,
"object_hash": 2,
"pg_num": 64,
"pg_placement_num": 64,
"crash_replay_interval": 0,
"last_change": "8",
"auid": 0,
"snap_mode": "selfmanaged",
"snap_seq": 0,
"snap_epoch": 0,
"pool_snaps": {},
"removed_snaps": "[]",
"quota_max_bytes": 0,
"quota_max_objects": 0,
"tiers": [],
"tier_of": -1,
"read_tier": -1,
"write_tier": -1,
"cache_mode": "none",
"properties": []},
{ "pool": 2,
"pool_name": "rbd",
"flags": 0,
"flags_names": "",
"type": 1,
"size": 2,
"min_size": 2,
"crush_ruleset": 2,
"object_hash": 2,
"pg_num": 64,
"pg_placement_num": 64,
"crash_replay_interval": 0,
"last_change": "2",
"auid": 0,
"snap_mode": "selfmanaged",
"snap_seq": 0,
"snap_epoch": 0,
"pool_snaps": {},
"removed_snaps": "[]",
"quota_max_bytes": 0,
"quota_max_objects": 0,
"tiers": [],
"tier_of": -1,
"read_tier": -1,
"write_tier": -1,
"cache_mode": "none",
"properties": []},
{ "pool": 3,
"pool_name": "images",
"flags": 0,
"flags_names": "",
"type": 1,
"size": 2,
"min_size": 2,
"crush_ruleset": 0,
"object_hash": 2,
"pg_num": 64,
"pg_placement_num": 64,
"crash_replay_interval": 0,
"last_change": "29",
"auid": 0,
"snap_mode": "selfmanaged",
"snap_seq": 5,
"snap_epoch": 29,
"pool_snaps": {},
"removed_snaps": "[1~2,4~1]",
"quota_max_bytes": 0,
"quota_max_objects": 0,
"tiers": [],
"tier_of": -1,
"read_tier": -1,
"write_tier": -1,
"cache_mode": "none",
"properties": []},
{ "pool": 4,
"pool_name": "volumes",
"flags": 0,
"flags_names": "",
"type": 1,
"size": 2,
"min_size": 2,
"crush_ruleset": 0,
"object_hash": 2,
"pg_num": 64,
"pg_placement_num": 64,
"crash_replay_interval": 0,
"last_change": "11",
"auid": 0,
"snap_mode": "selfmanaged",
"snap_seq": 0,
"snap_epoch": 0,
"pool_snaps": {},
"removed_snaps": "[]",
"quota_max_bytes": 0,
"quota_max_objects": 0,
"tiers": [],
"tier_of": -1,
"read_tier": -1,
"write_tier": -1,
"cache_mode": "none",
"properties": []}],
"osds": [
{ "osd": 0,
"uuid": "1e79235a-f094-47e1-80d1-8232d2d475cb",
"up": 1,
"in": 1,
"last_clean_begin": 0,
"last_clean_end": 0,
"up_from": 13,
"up_thru": 24,
"down_at": 0,
"lost_at": 0,
"public_addr": "192.168.5.230:6800\/16168",
"cluster_addr": "192.168.5.230:6801\/16168",
"heartbeat_back_addr": "192.168.5.230:6802\/16168",
"heartbeat_front_addr": "192.168.5.230:6803\/16168",
"state": [
"exists",
"up"]},
{ "osd": 1,
"uuid": "555ed8d0-11da-49b5-8ee9-3887c5937237",
"up": 1,
"in": 1,
"last_clean_begin": 0,
"last_clean_end": 0,
"up_from": 16,
"up_thru": 0,
"down_at": 0,
"lost_at": 0,
"public_addr": "192.168.5.230:6805\/16349",
"cluster_addr": "192.168.5.230:6806\/16349",
"heartbeat_back_addr": "192.168.5.230:6807\/16349",
"heartbeat_front_addr": "192.168.5.230:6808\/16349",
"state": [
"exists",
"up"]}],
"osd_xinfo": [
{ "osd": 0,
"down_stamp": "0.000000",
"laggy_probability": "0.000000",
"laggy_interval": 0},
{ "osd": 1,
"down_stamp": "0.000000",
"laggy_probability": "0.000000",
"laggy_interval": 0}],
"pg_temp": [],
"blacklist": []}
Right now the get_stats functions do not exit on error.
In particular, we should check for empty output when a query to ceph fails, and log and exit immediately (otherwise we get a bunch of nasty stack traces).
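A possible shape for such a guard: return `None` on a failed command, empty output, or undecodable JSON, instead of letting the exception escape the read callback (in the real plugin the failure would also be logged via `collectd.error`). The helper name is an assumption:

```python
import json
import subprocess

def fetch_json(cmdline):
    """Run a ceph command and parse its JSON output.

    Returns None on a failed command, empty output, or bad JSON,
    so callers can bail out early instead of stacktracing on
    'No JSON object could be decoded'.
    """
    try:
        out = subprocess.check_output(cmdline, shell=True)
    except subprocess.CalledProcessError:
        return None
    if not out.strip():
        return None
    try:
        return json.loads(out)
    except ValueError:
        return None
```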
By parsing detailed osd log messages.
https://github.com/cernceph/ceph-scripts/blob/master/tools/rbd-io-stats.pl
Something similar to:
ceph --admin-daemon /var/run/ceph/ceph-mon.ceph.asok perf dump
{ "cluster": { "num_mon": 1,
"num_mon_quorum": 1,
"num_osd": 2,
"num_osd_up": 2,
"num_osd_in": 2,
"osd_epoch": 29,
"osd_kb": 6815452,
"osd_kb_used": 2220596,
"osd_kb_avail": 4594856,
"num_pool": 15,
"num_pg": 960,
"num_pg_active_clean": 960,
"num_pg_active": 960,
"num_pg_peering": 0,
"num_object": 13,
"num_object_degraded": 0,
"num_object_unfound": 0,
"num_bytes": 22908977,
"num_mds_up": 0,
"num_mds_in": 0,
"num_mds_failed": 0,
"mds_epoch": 1},
"leveldb": { "leveldb_get": 35461,
"leveldb_transaction": 2166,
"leveldb_compact": 0,
"leveldb_compact_range": 2,
"leveldb_compact_queue_merge": 0,
"leveldb_compact_queue_len": 0},
"mon": {},
"throttle-mon_client_bytes": { "val": 0,
"max": 104857600,
"get": 2573538,
"get_sum": 187405680,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 0,
"take_sum": 0,
"put": 2573538,
"put_sum": 187405680,
"wait": { "avgcount": 0,
"sum": 0.000000000}},
"throttle-mon_daemon_bytes": { "val": 0,
"max": 419430400,
"get": 11354,
"get_sum": 4799934,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 0,
"take_sum": 0,
"put": 11354,
"put_sum": 4799934,
"wait": { "avgcount": 0,
"sum": 0.000000000}},
"throttle-msgr_dispatch_throttler-mon": { "val": 0,
"max": 104857600,
"get": 2584892,
"get_sum": 192205614,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 0,
"take_sum": 0,
"put": 2584892,
"put_sum": 192205614,
"wait": { "avgcount": 0,
"sum": 0.000000000}}}
Even if some of these metrics are available from other places too, it's still worth it.
Commands like subprocess.check_output(['ceph', 'df', '-f', 'json']) return an empty string when called from collectd.
Other commands in a collectd python plugin work fine (like 'ls' or 'pwd').
A simple python script executing the same command (subprocess.check_output(['ceph', 'df', '-f', 'json'])) works well outside of collectd.
Where is the incompatibility between python/collectd/ceph (user rights, subprocess, ...)?
Any suggestions?
OS: Ubuntu 12.04
Collectd: 5.1.0
Python: 2.7.3
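When a command produces output in a terminal but an empty string under collectd, the cause (keyring permissions for the collectd user, a different $PATH or $HOME, etc.) usually shows up on stderr or in the exit code, which `check_output` discards on success. A diagnostic sketch (hypothetical helper, not a fix):

```python
import subprocess

def run_verbose(args):
    """Run a command capturing stderr too, to see why stdout is empty.

    Returns (returncode, stdout, stderr) so the real failure reason
    can be logged from inside the collectd plugin.
    """
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err
```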
It should be possible to push metrics such as throughput, IOPS, etc. into Graphite, the same way we do for pools.
--- ceph_pool_plugin.py~	2016-03-02 23:35:20.000000000 +0100
+++ ceph_pool_plugin.py	2016-03-03 00:04:04.876014974 +0100
@@ -54,7 +54,7 @@
             osd_pool_cmdline='ceph osd pool stats -f json --cluster ' + self.cluster
             stats_output = subprocess.check_output(osd_pool_cmdline, shell=True)
             cephdf_cmdline='ceph df -f json --cluster ' + self.cluster
-            df_output = subprocess.check_output(ceph_dfcmdline, shell=True)
+            df_output = subprocess.check_output(cephdf_cmdline, shell=True)
         except Exception as exc:
             collectd.error("ceph-pool: failed to ceph pool stats :: %s :: %s"
                 % (exc, traceback.format_exc()))
Hello,
I get this error when I run collectd; in syslog I see the following:
ceph-pool: failed to ceph pool stats :: global name 'ceph_dfcmdline' is not defined :: Traceback (most recent call last):
File "/opt/collectd/plugins/ceph_pool_plugin.py", line 57, in get_stats
df_output = subprocess.check_output(ceph_dfcmdline, shell=True)
NameError: global name 'ceph_dfcmdline' is not defined
I compiled collectd with python and ceph.
Thanks for your help.
How do I use it? Do you have any video tutorial?
Any help would be appreciated.
Heya,
It seems that Ceph changed op_per_sec to read_op_per_sec and write_op_per_sec somewhere in Jewel. I think this is why I'm no longer getting op_per_sec stats in my Graphite.
I might want to venture a pull request to fix this, but I'm unsure how to properly check for Ceph versions. Is there any instance of this somewhere in the code already? Is there a recommended way of doing it?
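Rather than detecting the Ceph version, one option is to probe for whichever keys are actually present in the `client_io_rate` dict. A sketch with a hypothetical `ops_per_sec` helper:

```python
def ops_per_sec(client_io):
    """Total client ops/s from a pool's 'client_io_rate' dict.

    Pre-Jewel output has a single 'op_per_sec'; Jewel and later split
    it into 'read_op_per_sec' and 'write_op_per_sec'. Probing the keys
    avoids any version check.
    """
    if 'op_per_sec' in client_io:
        return client_io['op_per_sec']
    return (client_io.get('read_op_per_sec', 0) +
            client_io.get('write_op_per_sec', 0))
```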
Metrics to be collected:
ceph df --format json-pretty
{ "stats": { "total_space": 6815452,
"total_used": 2220596,
"total_avail": 4594856},
"pools": [
{ "name": "data",
"id": 0,
"stats": { "kb_used": 0,
"bytes_used": 0,
"objects": 0}},
{ "name": "metadata",
"id": 1,
"stats": { "kb_used": 0,
"bytes_used": 0,
"objects": 0}},
{ "name": "rbd",
"id": 2,
"stats": { "kb_used": 0,
"bytes_used": 0,
"objects": 0}},
{ "name": "images",
"id": 3,
"stats": { "kb_used": 22373,
"bytes_used": 22908960,
"objects": 9}},
{ "name": "volumes",
"id": 4,
"stats": { "kb_used": 1,
"bytes_used": 17,
"objects": 4}}]}
I am setting up a box to monitor a Ceph cluster. My client config seems to be fine; I am able to run all the commands that these scripts run internally. But I'm seeing the errors below in the collectd log. I am using Ubuntu 14.04 with collectd 5.4. Please help me with this issue. Thank you!
[2016-02-25 15:27:40] read-function of plugin `python.ceph_pool_plugin' failed. Will suspend it for 240.000 seconds.
[2016-02-25 15:31:40] ceph: failed to get stats :: No JSON object could be decoded :: Traceback (most recent call last):
File "/usr/lib/collectd/plugins/ceph/base.py", line 108, in read_callback
stats = self.get_stats()
File "/usr/lib/collectd/plugins/ceph/ceph_monitor_plugin.py", line 62, in get_stats
json_data = json.loads(output)
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
[2016-02-25 15:31:40] Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
[2016-02-25 15:31:40] read-function of plugin `python.ceph_monitor_plugin' failed. Will suspend it for 480.000 seconds.
[2016-02-25 15:31:40] ceph: failed to get stats :: No JSON object could be decoded :: Traceback (most recent call last):
File "/usr/lib/collectd/plugins/ceph/base.py", line 108, in read_callback
stats = self.get_stats()
File "/usr/lib/collectd/plugins/ceph/ceph_pool_plugin.py", line 67, in get_stats
json_stats_data = json.loads(stats_output)
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
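The `UnboundLocalError` that follows in these logs is a secondary failure: `stats` is first assigned inside the `try` block, so when `get_stats()` raises, the later use of `stats` crashes too and obscures the real error. A sketch of a safer shape (hypothetical signature, not the plugin's actual callback):

```python
def safe_read(get_stats, log_error=lambda msg: None):
    """Read-callback sketch that pre-initializes 'stats'.

    'get_stats' and 'log_error' stand in for the plugin's methods.
    Because 'stats' exists even when get_stats() raises, the original
    error is what gets logged, not an UnboundLocalError.
    """
    stats = None
    try:
        stats = get_stats()
    except Exception as exc:
        log_error('ceph: failed to get stats :: %s' % exc)
    if not stats:
        return None  # skip dispatching this interval
    return stats
```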
Metrics to be collected:
ceph osd pool stats -f json-pretty
[
{ "pool_name": "data",
"pool_id": 0,
"recovery": {},
"recovery_rate": {},
"client_io_rate": {}},
{ "pool_name": "metadata",
"pool_id": 1,
"recovery": {},
"recovery_rate": {},
"client_io_rate": {}},
...
]
client_io_rate in turn includes things like read_bytes_sec, write_bytes_sec and op_per_sec.
Look for a widget that would show the cluster structure: whatever is defined in the crushmap.
Also check if there's a built-in one that does what we need. Otherwise a candidate:
https://github.com/keirans/collectd-iostat
and we might need to package it in the deb.
Please add a ceph client name option.
Ceph allows using different clients (usernames) with different permissions (option '-n'). It is best practice to use an identity with minimal permissions for a given task. The current configuration calls ceph without a client name, which implies 'client.admin'. Such broad permissions for a monitoring service are excessive, IMHO.
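A sketch of how such an option could be threaded into the command lines the plugins build (`client` and `keyring` are hypothetical option names, not existing plugin parameters):

```python
def ceph_cmdline(args, cluster='ceph', client=None, keyring=None):
    """Build a ceph CLI invocation with an optional named client.

    Passing e.g. client='client.monitor' adds '-n client.monitor',
    so a low-privilege identity is used instead of the implicit
    client.admin.
    """
    cmd = ['ceph'] + args + ['--cluster', cluster]
    if client:
        cmd += ['-n', client]
    if keyring:
        cmd += ['--keyring', keyring]
    return cmd
```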
Hi, I'm Taehoon.
I'm deploying a Ceph cluster (Luminous) and building a monitoring system with Graphite, Grafana and collectd, but I'm hitting the trouble below.
------- /var/log/messages -------
Jun 11 10:09:42 ceph-mgr.cdngp.net collectd[57815]: ceph: failed to get stats :: 'fs_perf_stat' :: Traceback (most recent call last):
File "/usr/lib64/collectd/plugins/ceph/base.py", line 114, in read_callback
stats = self.get_stats()
File "/usr/lib64/collectd/plugins/ceph/ceph_pg_plugin.py", line 79, in get_stats
data[ceph_cluster][osd_id]['apply_latency_ms'] = osd['fs_perf_stat']['apply_latency_ms']
KeyError: 'fs_perf_stat'
Jun 11 10:09:42 ceph-mgr.cdngp.net collectd[57815]: Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
Jun 11 10:09:42 ceph-mgr.cdngp.net collectd[57815]: read-function of plugin `python.ceph_pg_plugin' failed. Will suspend it for 120.000 seconds.
Jun 11 10:09:43 ceph-mgr.cdngp.net collectd: dumped fsmap epoch 196
Somebody help me, please. :-<
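The `KeyError: 'fs_perf_stat'` suggests newer Ceph releases no longer emit that key per OSD in `ceph pg dump`. A defensive sketch that falls back to a possible `perf_stat` key and then to zeros rather than crash (key names other than `fs_perf_stat` are assumptions):

```python
def osd_latencies(osd):
    """Pull (commit, apply) latency in ms from one 'osd_stats' entry.

    Older 'ceph pg dump' nests the values under 'fs_perf_stat'; newer
    releases appear to have moved/renamed the block (hence the KeyError
    above), so probe an alternative key and default to zeros.
    """
    perf = osd.get('fs_perf_stat') or osd.get('perf_stat') or {}
    return (perf.get('commit_latency_ms', 0),
            perf.get('apply_latency_ms', 0))
```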
The collectd configuration that matches the Grafana dashboard wasn't easy to guess for me, and probably not for anyone seeing collectd for the first time.
I created a dockerized version of collectd and this plugin: https://github.com/bobrik/ceph-collectd-graphite
At minimum, a collectd configuration (like the one in my repo) should be mentioned in the readme for this plugin. The dockerized version could also be mentioned; it is much easier to deploy from scratch.
Thanks for this plugin!