
ganglia / monitor-core


Ganglia Monitoring core

License: BSD 3-Clause "New" or "Revised" License

Shell 2.11% Python 19.82% PHP 0.14% Perl 0.77% CSS 0.04% HTML 2.07% JavaScript 18.13% C 52.75% Makefile 0.97% M4 2.47% Roff 0.52% RPC 0.16% VBScript 0.01% NASL 0.02%

monitor-core's People

Contributors

afbjorklund, carenas, cburroughs, comptonqc, dhobsd, dpocock, georgiou, graphaelli, hawson, jbuchbinder, jhatala, jimjcollins, johntconklin, junichi-tanaka, keyurdg, knobi, maxk-fortscale, n0ts, noodlesnz, olahaye74, pabl0, plaguedbypenguins, saaros, satterly, sflow, skemper, tomprince, vuksanv, vvuksan, wartime


monitor-core's Issues

Gmond crashes in mod_disk

This is on Ubuntu 11.04 x86_64. If the user removes mod_disk from gmond.conf, gmond starts normally.

/usr/sbin/gmond -d 2
loaded module: core_metrics
loaded module: cpu_module
loaded module: disk_module
loaded module: load_module
loaded module: mem_module
loaded module: net_module
loaded module: proc_module
loaded module: sys_module
udp_recv_channel mcast_join=239.2.11.71 mcast_if=NULL port=8649 bind=239.2.11.71
tcp_accept_channel bind=NULL port=8649
udp_send_channel mcast_join=239.2.11.71 mcast_if=NULL host=NULL port=8649

metric 'cpu_user' being collected now
metric 'cpu_user' has value_threshold 1.000000
metric 'cpu_system' being collected now
metric 'cpu_system' has value_threshold 1.000000
metric 'cpu_idle' being collected now
metric 'cpu_idle' has value_threshold 5.000000
metric 'cpu_nice' being collected now
metric 'cpu_nice' has value_threshold 1.000000
metric 'cpu_aidle' being collected now
metric 'cpu_aidle' has value_threshold 5.000000
metric 'cpu_wio' being collected now
metric 'cpu_wio' has value_threshold 1.000000
metric 'load_one' being collected now
metric 'load_one' has value_threshold 1.000000
metric 'load_five' being collected now
metric 'load_five' has value_threshold 1.000000
metric 'load_fifteen' being collected now
metric 'load_fifteen' has value_threshold 1.000000
metric 'proc_run' being collected now
metric 'proc_run' has value_threshold 1.000000
metric 'proc_total' being collected now
metric 'proc_total' has value_threshold 1.000000
metric 'mem_free' being collected now
metric 'mem_free' has value_threshold 1024.000000
metric 'mem_shared' being collected now
metric 'mem_shared' has value_threshold 1024.000000
metric 'mem_buffers' being collected now
metric 'mem_buffers' has value_threshold 1024.000000
metric 'mem_cached' being collected now
metric 'mem_cached' has value_threshold 1024.000000
metric 'swap_free' being collected now
metric 'swap_free' has value_threshold 1024.000000
metric 'bytes_out' being collected now

********** bytes_out: 41.888252
metric 'bytes_out' has value_threshold 4096.000000
metric 'bytes_in' being collected now
********** bytes_in: 14.235692
metric 'bytes_in' has value_threshold 4096.000000
metric 'pkts_in' being collected now
********** pkts_in: 0.014141
metric 'pkts_in' has value_threshold 256.000000
metric 'pkts_out' being collected now
********** pkts_out: 0.028995
metric 'pkts_out' has value_threshold 256.000000
metric 'disk_total' being collected now
Counting device /dev/disk/by-uuid/7ffbe477-4849-4058-aca9-4b7fe11ae4dd (12.45 %)
Counting device /dev/mapper/lvg-var (23.30 %)
For all disks: 127.931 GB total, 102.394 GB free for users.
*** stack smashing detected ***: /usr/sbin/gmond terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x37)[0x7f12cff994f7]
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x0)[0x7f12cff994c0]
/usr/lib/ganglia/moddisk.so(disk_free_func+0x0)[0x7f12ceeaeb50]
/usr/lib/ganglia/moddisk.so(disk_total_func+0x20)[0x7f12ceeaeba0]
/usr/lib/ganglia/moddisk.so(+0x220a)[0x7f12ceeac20a]
/usr/sbin/gmond(Ganglia_collection_group_collect+0xa2)[0x407c02]
/usr/sbin/gmond(process_collection_groups+0x52)[0x408202]
/usr/sbin/gmond(main+0x374)[0x403fb4]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f12cfec030d]
/usr/sbin/gmond[0x404319]
======= Memory map: ========
00400000-0040e000 r-xp 00000000 68:01 1576898 /usr/sbin/gmond
0060d000-0060e000 r--p 0000d000 68:01 1576898 /usr/sbin/gmond
0060e000-0060f000 rw-p 0000e000 68:01 1576898 /usr/sbin/gmond
0060f000-00610000 rw-p 00000000 00:00 0
020e0000-02122000 rw-p 00000000 00:00 0 [heap]
7f12cd9ba000-7f12cd9cf000 r-xp 00000000 68:01 394053 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f12cd9cf000-7f12cdbce000 ---p 00015000 68:01 394053 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f12cdbce000-7f12cdbcf000 r--p 00014000 68:01 394053 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f12cdbcf000-7f12cdbd0000 rw-p 00015000 68:01 394053 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f12cdbd0000-7f12cdbdc000 r-xp 00000000 68:01 400959 /lib/x86_64-linux-gnu/libnss_files-2.13.so
7f12cdbdc000-7f12cdddb000 ---p 0000c000 68:01 400959 /lib/x86_64-linux-gnu/libnss_files-2.13.so
7f12cdddb000-7f12cdddc000 r--p 0000b000 68:01 400959 /lib/x86_64-linux-gnu/libnss_files-2.13.so
7f12cdddc000-7f12cdddd000 rw-p 0000c000 68:01 400959 /lib/x86_64-linux-gnu/libnss_files-2.13.so
7f12cdddd000-7f12cdde7000 r-xp 00000000 68:01 400964 /lib/x86_64-linux-gnu/libnss_nis-2.13.so
7f12cdde7000-7f12cdfe7000 ---p 0000a000 68:01 400964 /lib/x86_64-linux-gnu/libnss_nis-2.13.so
7f12cdfe7000-7f12cdfe8000 r--p 0000a000 68:01 400964 /lib/x86_64-linux-gnu/libnss_nis-2.13.so
7f12cdfe8000-7f12cdfe9000 rw-p 0000b000 68:01 400964 /lib/x86_64-linux-gnu/libnss_nis-2.13.so
7f12cdfe9000-7f12ce000000 r-xp 00000000 68:01 400952 /lib/x86_64-linux-gnu/libnsl-2.13.so
7f12ce000000-7f12ce1ff000 ---p 00017000 68:01 400952 /lib/x86_64-linux-gnu/libnsl-2.13.so
7f12ce1ff000-7f12ce200000 r--p 00016000 68:01 400952 /lib/x86_64-linux-gnu/libnsl-2.13.so
7f12ce200000-7f12ce201000 rw-p 00017000 68:01 400952 /lib/x86_64-linux-gnu/libnsl-2.13.so
7f12ce201000-7f12ce203000 rw-p 00000000 00:00 0
7f12ce203000-7f12ce20b000 r-xp 00000000 68:01 400954 /lib/x86_64-linux-gnu/libnss_compat-2.13.so
7f12ce20b000-7f12ce40a000 ---p 00008000 68:01 400954 /lib/x86_64-linux-gnu/libnss_compat-2.13.so
7f12ce40a000-7f12ce40b000 r--p 00007000 68:01 400954 /lib/x86_64-linux-gnu/libnss_compat-2.13.so
7f12ce40b000-7f12ce40c000 rw-p 00008000 68:01 400954 /lib/x86_64-linux-gnu/libnss_compat-2.13.so
7f12ce40c000-7f12ce412000 r-xp 00000000 68:01 1592907 /usr/lib/ganglia/modsys.so
7f12ce412000-7f12ce612000 ---p 00006000 68:01 1592907 /usr/lib/ganglia/modsys.so
7f12ce612000-7f12ce613000 r--p 00006000 68:01 1592907 /usr/lib/ganglia/modsys.so
7f12ce613000-7f12ce624000 rw-p 00007000 68:01 1592907 /usr/lib/ganglia/modsys.so
7f12ce624000-7f12ce62c000 rw-p 00000000 00:00 0
7f12ce62c000-7f12ce632000 r-xp 00000000 68:01 1592906 /usr/lib/ganglia/modproc.so
7f12ce632000-7f12ce831000 ---p 00006000 68:01 1592906 /usr/lib/ganglia/modproc.so
7f12ce831000-7f12ce832000 r--p 00005000 68:01 1592906 /usr/lib/ganglia/modproc.so
7f12ce832000-7f12ce843000 rw-p 00006000 68:01 1592906 /usr/lib/ganglia/modproc.so
7f12ce843000-7f12ce84b000 rw-p 00000000 00:00 0
7f12ce84b000-7f12ce851000 r-xp 00000000 68:01 1592905 /usr/lib/ganglia/modnet.so
7f12ce851000-7f12cea51000 ---p 00006000 68:01 1592905 /usr/lib/ganglia/modnet.so
7f12cea51000-7f12cea52000 r--p 00006000 68:01 1592905 /usr/lib/ganglia/modnet.so
7f12cea52000-7f12cea63000 rw-p 00007000 68:01 1592905 /usr/lib/ganglia/modnet.so
7f12cea63000-7f12cea6b000 rw-p 00000000 00:00 0
7f12cea6b000-7f12cea72000 r-xp 00000000 68:01 1592904 /usr/lib/ganglia/modmem.so
7f12cea72000-7f12cec71000 ---p 00007000 68:01 1592904 /usr/lib/ganglia/modmem.so
7f12cec71000-7f12cec72000 r--p 00006000 68:01 1592904 /usr/lib/ganglia/modmem.so
7f12cec72000-7f12cec83000 rw-p 00007000 68:01 1592904 /usr/lib/ganglia/modmem.so
7f12cec83000-7f12cec8b000 rw-p 00000000 00:00 0
7f12cec8b000-7f12cec91000 r-xp 00000000 68:01 1592902 /usr/lib/ganglia/modload.so
7f12cec91000-7f12cee90000 ---p 00006000 68:01 1592902 /usr/lib/ganglia/modload.so
7f12cee90000-7f12cee91000 r--p 00005000 68:01 1592902 /usr/lib/ganglia/modload.so
7f12cee91000-7f12ceea2000 rw-p 00006000 68:01 1592902 /usr/lib/ganglia/modload.so
7f12ceea2000-7f12ceeaa000 rw-p 00000000 00:00 0
7f12ceeaa000-7f12ceeb0000 r-xp 00000000 68:01 1592903 /usr/lib/ganglia/moddisk.so
7f12ceeb0000-7f12cf0af000 ---p 00006000 68:01 1592903 /usr/lib/ganglia/moddisk.so
7f12cf0af000-7f12cf0b0000 r--p 00005000 68:01 1592903 /usr/lib/ganglia/moddisk.so
7f12cf0b0000-7f12cf0c1000 rw-p 00006000 68:01 1592903 /usr/lib/ganglia/moddisk.so
7f12cf0c1000-7f12cf0c9000 rw-p 00000000 00:00 0
7f12cf0c9000-7f12cf0d0000 r-xp 00000000 68:01 1592901 /usr/lib/ganglia/modcpu.so
7f12cf0d0000-7f12cf2cf000 ---p 00007000 68:01 1592901 /usr/lib/ganglia/modcpu.so
7f12cf2cf000-7f12cf2d0000 r--p 00006000 68:01 1592901 /usr/lib/ganglia/modcpu.so
7f12cf2d0000-7f12cf2e1000 rw-p 00007000 68:01 1592901 /usr/lib/ganglia/modcpu.so
7f12cf2e1000-7f12cf2e9000 rw-p 00000000 00:00 0
7f12cf2e9000-7f12cf86c000 r--p 00000000 68:01 1581347 /usr/lib/locale/locale-archive
7f12cf86c000-7f12cf86e000 r-xp 00000000 68:01 400946 /lib/x86_64-linux-gnu/libdl-2.13.so
7f12cf86e000-7f12cfa6e000 ---p 00002000 68:01 400946 /lib/x86_64-linux-gnu/libdl-2.13.so
7f12cfa6e000-7f12cfa6f000 r--p 00002000 68:01 400946 /lib/x86_64-linux-gnu/libdl-2.13.so
7f12cfa6f000-7f12cfa70000 rw-p 00003000 68:01 400946 /lib/x86_64-linux-gnu/libdl-2.13.so
7f12cfa70000-7f12cfa74000 r-xp 00000000 68:01 394079 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
7f12cfa74000-7f12cfc73000 ---p 00004000 68:01 394079 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
7f12cfc73000-7f12cfc74000 r--p 00003000 68:01 394079 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
7f12cfc74000-7f12cfc75000 rw-p 00004000 68:01 394079 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
7f12cfc75000-7f12cfc9c000 r-xp 00000000 68:01 394232 /lib/x86_64-linux-gnu/libexpat.so.1.5.2
7f12cfc9c000-7f12cfe9c000 ---p 00027000 68:01 394232 /lib/x86_64-linux-gnu/libexpat.so.1.5.2
7f12cfe9c000-7f12cfe9e000 r--p 00027000 68:01 394232 /lib/x86_64-linux-gnu/libexpat.so.1.5.2
7f12cfe9e000-7f12cfe9f000 rw-p 00029000 68:01 394232 /lib/x86_64-linux-gnu/libexpat.so.1.5.2
7f12cfe9f000-7f12d0036000 r-xp 00000000 68:01 400940 /lib/x86_64-linux-gnu/libc-2.13.so
7f12d0036000-7f12d0235000 ---p 00197000 68:01 400940 /lib/x86_64-linux-gnu/libc-2.13.so
7f12d0235000-7f12d0239000 r--p 00196000 68:01 400940 /lib/x86_64-linux-gnu/libc-2.13.so
7f12d0239000-7f12d023a000 rw-p 0019a000 68:01 400940 /lib/x86_64-linux-gnu/libc-2.13.so
7f12d023a000-7f12d0240000 rw-p 00000000 00:00 0
7f12d0240000-7f12d0258000 r-xp 00000000 68:01 400972 /lib/x86_64-linux-gnu/libpthread-2.13.so
7f12d0258000-7f12d0457000 ---p 00018000 68:01 400972 /lib/x86_64-linux-gnu/libpthread-2.13.so
7f12d0457000-7f12d0458000 r--p 00017000 68:01 400972 /lib/x86_64-linux-gnu/libpthread-2.13.so
7f12d0458000-7f12d0459000 rw-p 00018000 68:01 400972 /lib/x86_64-linux-gnu/libpthread-2.13.so
7f12d0459000-7f12d045d000 rw-p 00000000 00:00 0
7f12d045d000-7f12d0495000 r-xp 00000000 68:01 1586085 /usr/lib/libapr-1.so.0.4.5
7f12d0495000-7f12d0694000 ---p 00038000 68:01 1586085 /usr/lib/libapr-1.so.0.4.5
7f12d0694000-7f12d0695000 r--p 00037000 68:01 1586085 /usr/lib/libapr-1.so.0.4.5
7f12d0695000-7f12d0696000 rw-p 00038000 68:01 1586085 /usr/lib/libapr-1.so.0.4.5
7f12d0696000-7f12d06a0000 r-xp 00000000 68:01 1592896 /usr/lib/x86_64-linux-gnu/libconfuse.so.0.0.0
7f12d06a0000-7f12d08a0000 ---p 0000a000 68:01 1592896 /usr/lib/x86_64-linux-gnu/libconfuse.so.0.0.0
7f12d08a0000-7f12d08a1000 r--p 0000a000 68:01 1592896 /usr/lib/x86_64-linux-gnu/libconfuse.so.0.0.0
7f12d08a1000-7f12d08a2000 rw-p 0000b000 68:01 1592896 /usr/lib/x86_64-linux-gnu/libconfuse.so.0.0.0
7f12d08a2000-7f12d08dd000 r-xp 00000000 68:01 394081 /lib/x86_64-linux-gnu/libpcre.so.3.12.1
7f12d08dd000-7f12d0adc000 ---p 0003b000 68:01 394081 /lib/x86_64-linux-gnu/libpcre.so.3.12.1
7f12d0adc000-7f12d0add000 r--p 0003a000 68:01 394081 /lib/x86_64-linux-gnu/libpcre.so.3.12.1
7f12d0add000-7f12d0ade000 rw-p 0003b000 68:01 394081 /lib/x86_64-linux-gnu/libpcre.so.3.12.1
7f12d0ade000-7f12d0af1000 r-xp 00000000 68:01 1592898 /usr/lib/libganglia-3.1.7.so.0.0.0
7f12d0af1000-7f12d0cf0000 ---p 00013000 68:01 1592898 /usr/lib/libganglia-3.1.7.so.0.0.0
7f12d0cf0000-7f12d0cf1000 r--p 00012000 68:01 1592898 /usr/lib/libganglia-3.1.7.so.0.0.0
7f12d0cf1000-7f12d0cf4000 rw-p 00013000 68:01 1592898 /usr/lib/libganglia-3.1.7.so.0.0.0
7f12d0cf4000-7f12d0d15000 r-xp 00000000 68:01 400934 /lib/x86_64-linux-gnu/ld-2.13.so
7f12d0ef5000-7f12d0f0d000 rw-p 00000000 00:00 0
7f12d0f0e000-7f12d0f14000 rw-p 00000000 00:00 0
7f12d0f14000-7f12d0f15000 r--p 00020000 68:01 400934 /lib/x86_64-linux-gnu/ld-2.13.so
7f12d0f15000-7f12d0f17000 rw-p 00021000 68:01 400934 /lib/x86_64-linux-gnu/ld-2.13.so
7fff1fbf6000-7fff1fc17000 rw-p 00000000 00:00 0 [stack]
7fff1fde6000-7fff1fde7000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Aborted

gmond broken on OS X (Mavericks)

brew install ganglia passed, but gmond always fails to start.

Here are my error messages:

Cannot load /usr/local/Cellar/ganglia/3.6.0/lib/ganglia/modcpu.so metric module: dlopen(/usr/local/Cellar/ganglia/3.6.0/lib/ganglia/modcpu.so, 10): Symbol not found: _cpu_steal_func
Referenced from: /usr/local/Cellar/ganglia/3.6.0/lib/ganglia/modcpu.so
Expected in: flat namespace
in /usr/local/Cellar/ganglia/3.6.0/lib/ganglia/modcpu.so

I've made a patch to fix it. This patch was tested on my Mavericks.

See also:
Homebrew/legacy-homebrew#28183

gmetad: bind to a specific interface/IP?

gmond has a well-documented and functional means of binding: 'bind ='. Absent directives in gmetad.conf, I tried 'bind', 'ip:port' for the pair of port directives, and 'inet_addr/sin_addr' (picked up via strace) - none worked, at least on the apparently ancient 3.1.7 release via EPEL. Is there a way in this version, or a future version?

I'm aware of the trust options (and firewalls), but I still like to use binding as my first line of defense.
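
For reference, the gmond side that does work looks like this (a gmond.conf sketch assuming the stock channel syntax; the address is illustrative). The request is an equivalent directive for gmetad's xml_port/interactive_port listeners:

# gmond.conf: both channel types accept a bind directive
udp_recv_channel {
  port = 8649
  bind = 192.168.1.2
}
tcp_accept_channel {
  port = 8649
  bind = 192.168.1.2
}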

Allow writing data to graphite without the cluster prefix

Hello,
It would be great if I could send my collected data from ganglia to graphite without having the cluster prefix. We only have 6 servers, for which the cluster prefix is overkill, and it makes all later analysis longer as we always have to include the cluster prefix.

3.3 RPMs/debs

The latest versions on http://vuksan.com/centos/RPMS/x86_64/ are 3.2.0. I emailed Jeff who maintains https://launchpad.net/~rufustfirefly/+archive/ganglia and he's getting an updated version up there.

Are there any "official" locations for RPMS and debs of ganglia (CentOS/RHEL 5 and Ubuntu 10.04+ x86_64 specifically) that track the releases quickly or would it be better to build from source all the time?

Much appreciation to those hosting these archives of course. I just get impatient when compiling things :)

Support rrd COUNTER submission with gmetric

From the old wishlist on sourceforge:

"Support for counters (metrics with +ve slope)
This shouldn't require much work (from memory, make sure the slope-type information is preserved and patch gmetad to create RRD files with the correct options). Currently Ganglia doesn't actually support custom counter metrics, which is an awkward limitation."
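
For illustration, the difference at RRD-creation time is just the data-source type. A hedged rrdtool sketch: the step, heartbeat, DS name and RRA here are illustrative values, not gmetad's actual ones:

# what gauge-style metrics effectively get today
rrdtool create gauge_metric.rrd --step 15 DS:sum:GAUGE:45:U:U RRA:AVERAGE:0.5:1:244
# what a +ve-slope (counter) metric would need, so RRDtool stores
# the rate of change rather than the raw counter value
rrdtool create counter_metric.rrd --step 15 DS:sum:COUNTER:45:U:U RRA:AVERAGE:0.5:1:244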

Ganglia Build Issue - Solaris

The following DEFINE at the top of the lib/update_pidfile.c causes ganglia 3.3.7 not to build: #define _XOPEN_SOURCE 500 /* for getpgid */. Removing the define fixes the build problem.
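
If other platforms still want the define, one option (a sketch under the assumption that only Solaris objects, which is all this report establishes) is to guard it:

/* lib/update_pidfile.c: sketch of a guarded version. Per this report,
   Solaris builds fine without the define, so only set it elsewhere. */
#if !defined(__sun)
#define _XOPEN_SOURCE 500 /* for getpgid */
#endif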

High CPU usage due to millisecond polling

Not sure if this is a bug or a feature, but I've noticed that @satterly's change to make gmond's TCP accept channel listen in a separate thread (74cee73) appears to poll the listener socket every millisecond. This value is hardcoded to 1000 (µs, not ms):

apr_interval_time_t wait = 1000;

debug_msg("Starting TCP listener thread...");
for(;!done;)
  {
    if(!deaf)
      {
        now = apr_time_now();
        /* Pull in incoming data */
        poll_tcp_listen_channels(wait, now);
      }
    else
      {
        apr_sleep( wait );
      }

  }

(https://github.com/ganglia/monitor-core/blob/master/gmond/gmond.c#L3184)

Now, this could be the intended behaviour - however, it makes each gmond instance use between 2% and 5% CPU at all times on my test machines. Running multiple instances of gmond on one machine (for monitoring multiple clusters, for example) becomes very expensive after this change. Is this intentional?
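
A minimal mitigation sketch, assuming poll_tcp_listen_channels() passes `wait` straight through to the underlying APR poll call (so the thread blocks for up to that long rather than spinning):

apr_interval_time_t wait = 100 * 1000;  /* sketch: 100 ms instead of 1 ms; the value is in microseconds */

This trades up to 100 ms of accept latency for a roughly hundredfold drop in wakeups, which matters when several gmond instances share one machine.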

Possibility of different rrd store path for different data_source

Hi,

Is there a possibility that rrds could be stored in a different hierarchy or on a different disk for each data_source? Where multiple disks or path hierarchies are available, it would help greatly to have the rrds written to different places, and would alleviate disk IO problems when multiple disks exist.

syslog fills with "Error 1 sending the modular data", gmond keeps using socket after EINVAL

Suggested action/solution: if write returns EINVAL, gmond should try to recreate or re-bind the sending socket, rather than continuing to send on a bad socket (and filling logs with errors)

Google reveals this has been discussed several times in the past, and none of the discussions ended with a solution, so I'm presenting some analysis below.

Here is what I did and what I found:

I discovered my gmond PID = 21015 and I checked it with strace:

strace -p 21015 -o /tmp/gmond.errs -v

After about a minute, I had a look inside /tmp/gmond.errs, lots of this:

write(7, "\0\0\0\205\0\0\0\4srv1\0\0\0\fmachine_type\0\0\0\0"..., 52) = 52
write(8, "\0\0\0\205\0\0\0\4srv1\0\0\0\fmachine_type\0\0\0\0"..., 52) = -1 EINVAL (Invalid argument)
write(7, "\0\0\0\200\0\0\0\4srv1\0\0\0\7os_name\0\0\0\0\0\0\0\0\6"..., 164) = 164
write(8, "\0\0\0\200\0\0\0\4srv1\0\0\0\7os_name\0\0\0\0\0\0\0\0\6"..., 164) = -1 EINVAL (Invalid argument)
time([1351418592]) = 1351418592
sendto(9, "<30>Oct 28 11:03:12 /usr/sbin/gm"..., 90, MSG_NOSIGNAL, NULL, 0) = 90

Notice the `sendto' is actually sending the error to syslog, not sending a metric packet.

Ok, the `write' calls show me two file descriptors, 7 and 8. Writes to FD 8 are failing with EINVAL:

write(8, .... ) = -1 EINVAL (Invalid argument)

The file descriptors correspond to two different udp_send_channels in gmond.conf - but which is which? Fortunately, lsof tells me:

lsof -p 21015 -n

gmond 21015 ganglia 7u IPv4 2747622 0t0 UDP 192.168.1.2:44778->239.2.11.71:8649

gmond 21015 ganglia 8u IPv4 2747628 0t0 UDP (VPN address):53976->(remote server address):8649

Notice that FD 7 corresponds to a very standard multicast channel, while FD 8 corresponds to a UDP unicast channel. I have deleted the IP addresses, but this immediately revealed the problem (in my case anyway): the local address (VPN address) existed when gmond started, but no longer exists on this machine (because the VPN is not always up).

I can imagine similar problems would occur for hosts that get an IP by means of DHCP, or hosts that have IPsec tunnels, PPP or some other transient interfaces.

If anyone else sees the problem, it would be interesting to see your strace and lsof output. I believe gmond could be tweaked, for example, to recreate (or re-bind) the socket with FD 8 after such an EINVAL error. Doing so might log a more specific error or might successfully bind to a new local IP.
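
A sketch of that suggestion using plain POSIX sockets; the helper and its shape are hypothetical, not gmond's actual channel code:

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: send on a UDP channel, and on EINVAL (the local
 * address the socket was created with no longer exists, e.g. the VPN
 * went down) close and recreate the socket so it can pick up a
 * currently valid local address, then retry once. */
static ssize_t channel_send(int *fd, const struct sockaddr *dest,
                            socklen_t destlen, const void *buf, size_t len)
{
    ssize_t n = sendto(*fd, buf, len, 0, dest, destlen);
    if (n == -1 && errno == EINVAL) {
        close(*fd);
        *fd = socket(dest->sa_family, SOCK_DGRAM, 0);
        if (*fd == -1)
            return -1;
        n = sendto(*fd, buf, len, 0, dest, destlen);
    }
    return n;
}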

Merge gmond PHP support

I have PHP gmond module support staged in this branch:

https://github.com/ganglia/monitor-core/tree/php-support

It should be able to be easily merged into master, but I wasn't sure whether we were ready to merge a potentially breaking change into the master branch. At the moment, I have the patch set to be disabled unless --enable-php is passed as a configure option.

gmond collector places metrics in wrong bucket(host)

Hi, we just ran into this issue where stats coming from jmxtrans as cassandra-prod were getting deposited into web-prod.

This is what jmxtrans sends:

 (com.googlecode.jmxtrans.model.output.GangliaWriter:253) - Emitted metric cassandra-cfs.MemberTimeline.RecentWriteLatencyMicros, type DOUBLE, value 12.0 for host: 127.0.0.1:cassandra-prod

Restarting gmond fixed the issue. But, just for the record, this is what I saw in the rrds directory.

first:
[root@myhost /mnt/apps/ganglia/rrds/RIQ/web-prod]# ls -l cassandra*
-rw-rw-rw- 1 nobody nobody 630760 Aug 27 20:30 cassandra-cfs.AssetView.RecentReadLatencyMicros.rrd

After restarting
[root@myhost /mnt/apps/ganglia/rrds/RIQ/cassandra-prod]# ls -l cassandra-cfs.A*
-rw-rw-rw- 1 nobody nobody 630760 Aug 27 20:42 cassandra-cfs.Asset.RecentReadLatencyMicros.rrd

As you can see, the "ls" commands were issued at the same time and show different timestamps. The cassandra-cfs.Asset.RecentReadLatencyMicros.rrd metric is now correctly being saved under the cassandra-prod host and not under web-prod anymore.

I'm guessing gmond is failing, or somehow doing the comparison wrong, when interpreting the sender.

On the other hand, metrics that were showing up under web-prod, such as jetty-heap..., stopped showing up after I restarted gmond. It looks like they are competing with each other. But they still show up in the SummaryInfo directory.

 [root@myhost /mnt/apps/ganglia/rrds/RIQ/__SummaryInfo__]# ls -l jetty*
 -rw-rw-rw- 1 nobody nobody 1260992 Aug 27 20:59 jetty-heap.HeapMemoryUsage_committed.rrd
 -rw-rw-rw- 1 nobody nobody 1260992 Aug 27 20:58 jetty-heap.HeapMemoryUsage_init.rrd
 -rw-rw-rw- 1 nobody nobody 1260992 Aug 27 20:58 jetty-heap.HeapMemoryUsage_max.rrd

Could you guys advise?

Thanks

Crash if override_hostname Contains Certain Numeric Values

On a node using override_hostname = web02.example.com (not the real domain; I can give that to you if needed), gmond crashes after running for a short time with a message like this:

        sent message 'proc_run' of length 60 with 0 errors                                                                                                                                                                                                                                                              
*** glibc detected *** /usr/sbin/gmond: malloc(): memory corruption (fast): 0x0000000000ec6b30 ***
======= Backtrace: =========                                                 
/lib/libc.so.6(+0x77806)[0x7f932d096806]                                     
/lib/libc.so.6(+0x7bb39)[0x7f932d09ab39]                                     
/lib/libc.so.6(__libc_malloc+0x6e)[0x7f932d09b7de]                           
/usr/sbin/gmond(Ganglia_collection_group_send+0x145)[0x407335]               
/usr/sbin/gmond(process_collection_groups+0x9b)[0x4077fb]                    
/usr/sbin/gmond(main+0x3c6)[0x4094a6]                                        
/lib/libc.so.6(__libc_start_main+0xfd)[0x7f932d03dc4d]                       
/usr/sbin/gmond[0x4047b9]                                                    
======= Memory map: ========                                                 
00400000-0041b000 r-xp 00000000 08:01 380073                             /usr/sbin/gmond
0061a000-0061b000 r--p 0001a000 08:01 380073                             /usr/sbin/gmond
0061b000-0061c000 rw-p 0001b000 08:01 380073                             /usr/sbin/gmond
0061c000-0061d000 rw-p 00000000 00:00 0                                      
00e85000-00ecd000 rw-p 00000000 00:00 0                                  [heap]
7f9324000000-7f9324021000 rw-p 00000000 00:00 0                              
7f9324021000-7f9328000000 ---p 00000000 00:00 0                              
7f932b365000-7f932b37b000 r-xp 00000000 08:01 344136                     /lib/libgcc_s.so.1
7f932b37b000-7f932b57a000 ---p 00016000 08:01 344136                     /lib/libgcc_s.so.1
7f932b57a000-7f932b57b000 r--p 00015000 08:01 344136                     /lib/libgcc_s.so.1
7f932b57b000-7f932b57c000 rw-p 00016000 08:01 344136                     /lib/libgcc_s.so.1
7f932b57c000-7f932b588000 r-xp 00000000 08:01 346038                     /lib/libnss_files-2.11.1.so
7f932b588000-7f932b787000 ---p 0000c000 08:01 346038                     /lib/libnss_files-2.11.1.so
7f932b787000-7f932b788000 r--p 0000b000 08:01 346038                     /lib/libnss_files-2.11.1.so
7f932b788000-7f932b789000 rw-p 0000c000 08:01 346038                     /lib/libnss_files-2.11.1.so
7f932b789000-7f932b793000 r-xp 00000000 08:01 346028                     /lib/libnss_nis-2.11.1.so
7f932b793000-7f932b992000 ---p 0000a000 08:01 346028                     /lib/libnss_nis-2.11.1.so
7f932b992000-7f932b993000 r--p 00009000 08:01 346028                     /lib/libnss_nis-2.11.1.so
7f932b993000-7f932b994000 rw-p 0000a000 08:01 346028                     /lib/libnss_nis-2.11.1.so
7f932b994000-7f932b99c000 r-xp 00000000 08:01 346027                     /lib/libnss_compat-2.11.1.so
7f932b99c000-7f932bb9b000 ---p 00008000 08:01 346027                     /lib/libnss_compat-2.11.1.so
7f932bb9b000-7f932bb9c000 r--p 00007000 08:01 346027                     /lib/libnss_compat-2.11.1.so
7f932bb9c000-7f932bb9d000 rw-p 00008000 08:01 346027                     /lib/libnss_compat-2.11.1.so
7f932bb9d000-7f932bba4000 r-xp 00000000 08:01 466978                     /usr/lib/ganglia/modsys.so
7f932bba4000-7f932bda3000 ---p 00007000 08:01 466978                     /usr/lib/ganglia/modsys.so
7f932bda3000-7f932bda4000 r--p 00006000 08:01 466978                     /usr/lib/ganglia/modsys.so
7f932bda4000-7f932bda5000 rw-p 00007000 08:01 466978                     /usr/lib/ganglia/modsys.so
7f932bda5000-7f932bda6000 rw-p 00000000 00:00 0                              
7f932bda6000-7f932bdad000 r-xp 00000000 08:01 466974                     /usr/lib/ganglia/modproc.so
7f932bdad000-7f932bfac000 ---p 00007000 08:01 466974                     /usr/lib/ganglia/modproc.so
7f932bfac000-7f932bfad000 r--p 00006000 08:01 466974                     /usr/lib/ganglia/modproc.so
7f932bfad000-7f932bfae000 rw-p 00007000 08:01 466974                     /usr/lib/ganglia/modproc.so
7f932bfae000-7f932bfb5000 r-xp 00000000 08:01 466979                     /usr/lib/ganglia/modnet.so
7f932bfb5000-7f932c1b4000 ---p 00007000 08:01 466979                     /usr/lib/ganglia/modnet.so
7f932c1b4000-7f932c1b5000 r--p 00006000 08:01 466979                     /usr/lib/ganglia/modnet.so
7f932c1b5000-7f932c1b6000 rw-p 00007000 08:01 466979                     /usr/lib/ganglia/modnet.so
7f932c1b6000-7f932c1bd000 r-xp 00000000 08:01 466976                     /usr/lib/ganglia/modmem.so
7f932c1bd000-7f932c3bc000 ---p 00007000 08:01 466976                     /usr/lib/ganglia/modmem.so
7f932c3bc000-7f932c3bd000 r--p 00006000 08:01 466976                     /usr/lib/ganglia/modmem.so
7f932c3bd000-7f932c3be000 rw-p 00007000 08:01 466976                     /usr/lib/ganglia/modmem.so
7f932c3be000-7f932c3bf000 rw-p 00000000 00:00 0                              
7f932c3bf000-7f932c3c6000 r-xp 00000000 08:01 466977                     /usr/lib/ganglia/modload.so
7f932c3c6000-7f932c5c5000 ---p 00007000 08:01 466977                     /usr/lib/ganglia/modload.so
7f932c5c5000-7f932c5c6000 r--p 00006000 08:01 466977                     /usr/lib/ganglia/modload.so
7f932c5c6000-7f932c5c7000 rw-p 00007000 08:01 466977                     /usr/lib/ganglia/modload.so
Aborted

There are instances with names manager00.example.com, ops00.example.com, worker00.example.com, worker01.example.com, web01.example.com, web02.example.com, web03.example.com.

It only fails on web02 and web03, which are identical in every way except the name to web00 and web01.

(It crashes immediately if I include configuration for python modules btw)

Running on an EC2 m1.large instance on Ubuntu 10.04.4 LTS, gmond 3.3.8 (from the package at https://launchpad.net/~rufustfirefly/+archive/ganglia)

diskfree.py python module name_parser

The python module diskfree.py returns information only for filesystems mounted with alphanumeric names ('\w'). Mounted filesystems with hyphens in the name are not reported.

--- diskfree-fix.py 2012-03-14 13:05:12.000000000 -0700
+++ diskfree.py 2012-05-02 13:20:36.000000000 -0700
@@ -39,7 +39,7 @@
     """Return a value for the requested metric"""
 
     # parse unit type and path from name
-    name_parser = re.match("^%s([a-z]+)_(\w+)$" % NAME_PREFIX, name)
+    name_parser = re.match("^%s([a-z]+)_([\w-]+)$" % NAME_PREFIX, name)
     unit_type = name_parser.group(1)
     if name_parser.group(2) == 'rootfs':
         path = '/'

data thread does not always call close, gets stuck in CLOSE_WAIT forever

I've been having intermittent problems (a little under once per day) where gmetad stops reflecting updates for some hosts. To debug, and for my sanity, I have set up two gmetads polling from the same clusters (about a half dozen). One stopped updating on one cluster (i.e. TN kept going up for every host/metric), the other did not (which rules out problems with the gmond side of the house). After much flailing around with strace I discovered that a socket is stuck in CLOSE_WAIT:

gmetad  24838 nobody    7u  IPv4           39448253      0t0      TCP asu0d-dmz.clearspring.local:58371->adm80.clearspring.local:8649 (CLOSE_WAIT)

This is with 3.4.0 on centos6. As I understand it there is one data_thread per source, so if that thread blocks here, nothing else will proceed to update for that cluster. There are also a variety of warnings in syslog, but I think they all pertain to the server thread, not the data thread, and are thus not relevant:

Aug 22 09:14:12 asu0d /usr/sbin/gmetad[24838]: server_thread() 1085028672 unable to write root epilog
Aug 22 10:00:28 asu0d /usr/sbin/gmetad[24838]: server_thread() 1085028672 unable to write root epilog
Aug 22 10:08:27 asu0d /usr/sbin/gmetad[24838]: server_thread() 1085028672 unable to write root epilog
Aug 22 10:09:45 asu0d last message repeated 2 times
Aug 22 10:11:00 asu0d /usr/sbin/gmetad[24838]: server_thread() 1085028672 unable to write root epilog
Aug 22 10:19:41 asu0d last message repeated 2 times
Aug 22 10:21:00 asu0d last message repeated 2 times
Aug 22 10:47:16 asu0d /usr/sbin/gmetad[24838]: server_thread() 1085028672 unable to write root epilog
Aug 22 10:48:37 asu0d /usr/sbin/gmetad[24838]: server_thread() 1085028672 unable to write root epilog
Aug 22 10:55:32 asu0d /usr/sbin/gmetad[24838]: server_thread() 1085028672 unable to write root epilog
Aug 22 10:55:32 asu0d /usr/sbin/gmetad[24838]: server_thread() 1085028672 unable to write root preamble (DTD, etc)
Aug 22 10:55:32 asu0d last message repeated 6 times
Aug 22 10:55:36 asu0d /usr/sbin/gmetad[24838]: server_thread() 1085028672 unable to write root epilog
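
One defensive tweak that would bound the damage (a sketch, not gmetad's actual code): put a receive timeout on the data thread's socket, so a peer stuck in CLOSE_WAIT cannot block a read forever and the thread can drop and re-open the connection.

#include <sys/socket.h>
#include <sys/time.h>

/* Sketch: cap blocking reads on the gmond connection at `seconds`.
 * After the timeout, read()/recv() fail with EAGAIN/EWOULDBLOCK
 * instead of hanging, and the caller can tear the socket down. */
static int set_read_timeout(int fd, long seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
}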

Critical Line NOT visible in Ganglia Views

We are using the ganglia-gweb-3.5.2-1. We want to have a critical line for our view graphs to monitor any spike in the metrics.
Giving critical values from the UI and including individual graphs in a view is easy and it shows the critical line too.

But, we want the critical line for our aggregate graphs in the views. We do it using a 'yaml' file which is processed to generate the 'json' file of the view.

The contents of the yaml file look like:

Example_View_Dashboard:
  graph1:
    metric_regex: '"^(Example_Metric)"'
    host_regex: '"^(Example_Server[0-9])"'
    vertical_label: '"units"'
    critical: '"3000.0"'
    title: '"Example title"'

Now this view-definitions yaml file is processed to generate the actual json file containing the information for the view and its graphs:

{
  "items": [
    {
      "aggregate_graph": "true",
      "critical": "3000.0",
      "graph_type": "line",
      "host_regex": [
        { "regex": "^(Example_Server[0-9])" }
      ],
      "metric_regex": [
        { "regex": "^(Example_Metric)" }
      ],
      "title": "Example title",
      "vertical_label": "units"
    }
  ],
  "view_name": "Example_View_Dashboard",
  "view_type": "standard"
}

Although critical is defined for this graph in the view, the critical line is NOT showing in the web UI. Please help and guide.

UNKNOWN hostname when restarting gmetad

Hi. We are running version 3.3.0-1 of Ganglia. When we restart gmetad we get into a state where a bunch of servers are shown as UNKNOWN.

I wanted to run this by you to ask whether there is a fix for this in later versions, or in case you have some clue about what could be causing this behavior.

Example reported in Nagios looks like this:

UNKNOWN storm-supervisor_status - Hostname info not available. Likely invalid hostname

Please let me know if there is more info I can provide.

Thanks in advance
Patricio

Missing libuuid confuses configure script

The configure script was failing due to it being unable to find "libconfuse". Since I have libconfuse, I dug into this problem and it turns out ld was failing due to it being unable to find "-luuid". Once I installed libuuid, the error regarding libconfuse went away.

Prior to checking for libconfuse, there needs to be a check for libuuid.

serious bug clobbering TN values

I tracked down a nasty little bug in Gmeta/gmetad_data.py in my debugger:

Line 218: if hostNode.lastReportedTime == reportedTime:

Should be:

Line 218: if metricNode.lastReportedTime == reportedTime:

This was clobbering all my TN values with really old numbers.

gmetad: whitelist or filtering for metrics to be aggregated

gmetad currently takes each metric and aggregates the metric from every host to build summary RRDs. These are used to display the summary graphs, for example, the graph that shows the sum of memory capacity in all hosts that are online at a given moment.

For some metrics, these summary graphs may be meaningless or useless.

Therefore, it would be useful if gmetad had a config option (in gmetad.conf) to create a whitelist of metrics that are to be summarized. Even better, this could use regular expressions and maybe have a series of include and exclude rules and finally a default action for any metric name that did not match any rule.
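
A hypothetical gmetad.conf sketch of the request; none of these directive names exist today, they are invented purely to illustrate the include/exclude idea:

# hypothetical directives: first matching rule wins,
# the default applies to any metric no rule matched
summarize_include "^(cpu_|mem_|bytes_|pkts_).*"
summarize_exclude "^jvm_.*"
summarize_default off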

gmetad interactive port stops functioning occasionally

This is with gmetad 3.5; I do not believe it is a new problem, but I had not previously tracked it down to the interactive port. I have not customized the number of server_threads, which I believe should leave me with the default of 4. What I am seeing once a week or so is that the web UI becomes unresponsive (page loads block indefinitely). Data collection and answering non-interactive XML requests are unaffected.

echo "/?filter=summary" |   nc localhost 8652

Hangs indefinitely.

I saw several ESTABLISHED connections to 8652; after restarting httpd (to see if it was at fault) the connections sat in CLOSE_WAIT. After the httpd restart, trying to load the web UI gets "There was an error collecting ganglia data (127.0.0.1:8652): XML error: Invalid document end at 1" instead of a hang. Restarting gmetad fixes the problem.

# lsof -p 2400 | grep -i 8652
gmetad  2400 nobody    1u  IPv4            2388480             TCP *:8652 (LISTEN)
gmetad  2400 nobody    6u  IPv4            6481200             TCP lsu02.clearspring.local:8652->lsu02.clearspring.local:51602 (CLOSE_WAIT)
gmetad  2400 nobody    7u  IPv4            7138517             TCP lsu02.clearspring.local:8652->lsu02.clearspring.local:32786 (CLOSE_WAIT)
gmetad  2400 nobody   11u  IPv4            7136011             TCP lsu02.clearspring.local:8652->lsu02.clearspring.local:60970 (CLOSE_WAIT)

(I am not sure why I only end up with 3 stuck sockets, instead of 4.)

Thread 23 (Thread 0x418a9940 (LWP 2402)):
#0  0x0000003b31c0db3b in accept () from /lib64/libpthread.so.0
#1  0x0000000000405488 in pthread_attr_setdetachstate ()
#2  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 22 (Thread 0x422aa940 (LWP 2403)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000405474 in pthread_attr_setdetachstate ()
#4  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 21 (Thread 0x42cab940 (LWP 2404)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000404999 in pthread_attr_setdetachstate ()
#4  0x0000000000404b45 in pthread_attr_setdetachstate ()
#5  0x0000000000404a3d in pthread_attr_setdetachstate ()
#6  0x0000000000405588 in pthread_attr_setdetachstate ()
#7  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 20 (Thread 0x436ac940 (LWP 2405)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000404999 in pthread_attr_setdetachstate ()
#4  0x0000000000404b45 in pthread_attr_setdetachstate ()
#5  0x0000000000404a3d in pthread_attr_setdetachstate ()
#6  0x0000000000405588 in pthread_attr_setdetachstate ()
#7  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 19 (Thread 0x440ad940 (LWP 2406)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000404999 in pthread_attr_setdetachstate ()
#4  0x0000000000404b45 in pthread_attr_setdetachstate ()
#5  0x0000000000404a3d in pthread_attr_setdetachstate ()
#6  0x0000000000405588 in pthread_attr_setdetachstate ()
#7  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 18 (Thread 0x44aae940 (LWP 2407)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000404999 in pthread_attr_setdetachstate ()
#4  0x0000000000404b45 in pthread_attr_setdetachstate ()
#5  0x0000000000404a3d in pthread_attr_setdetachstate ()
#6  0x0000000000405588 in pthread_attr_setdetachstate ()
#7  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 17 (Thread 0x454af940 (LWP 2408)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 16 (Thread 0x45eb0940 (LWP 2409)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 15 (Thread 0x468b1940 (LWP 2410)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 14 (Thread 0x472b2940 (LWP 2411)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 13 (Thread 0x47cb3940 (LWP 2412)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 12 (Thread 0x486b4940 (LWP 2413)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x490b5940 (LWP 2414)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x49ab6940 (LWP 2415)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x4a4b7940 (LWP 2416)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x4aeb8940 (LWP 2417)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x4b8b9940 (LWP 2418)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x4c2ba940 (LWP 2419)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x4ccbb940 (LWP 2420)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x4d6bc940 (LWP 2421)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000406d9e in pthread_attr_setdetachstate ()
#4  0x0000003b39809bc9 in ?? () from /lib64/libexpat.so.0
#5  0x0000003b3980ab44 in ?? () from /lib64/libexpat.so.0
#6  0x0000003b3980b66a in ?? () from /lib64/libexpat.so.0
#7  0x0000003b3980cc4b in ?? () from /lib64/libexpat.so.0
#8  0x0000003b39803ef1 in XML_ParseBuffer () from /lib64/libexpat.so.0
#9  0x0000000000405920 in pthread_attr_setdetachstate ()
#10 0x0000000000404522 in pthread_attr_setdetachstate ()
#11 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#12 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x4e0bd940 (LWP 2422)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x4eabe940 (LWP 2423)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x00000000004091b7 in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2ae116e772f0 (LWP 2400)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000403348 in pthread_attr_setdetachstate ()
#4  0x0000003b6b00b558 in hash_foreach () from /usr/lib64/libganglia-3.5.0.so.0
#5  0x00000000004030ca in pthread_attr_setdetachstate ()
#6  0x0000003b3141d994 in __libc_start_main () from /lib64/libc.so.6
#7  0x0000000000402b29 in pthread_attr_setdetachstate ()
#8  0x00007fffed188098 in ?? ()
#9  0x0000000000000000 in ?? ()

Grid CPU graph 100%

Hello,

While using gmetad-python, when I added a new cluster to the grid, the Grid CPU graph hit and stayed at 100%.
There are 7 data sources, 28 servers and 246 cpus being monitored in total. I am using unicast on my gmonds.
A screenshot of the problem is viewable here: http://i.imgur.com/1wUn9.png

To fix it I tried:

  • Restarting gmetad-python - No effect.
  • Setting send_metadata_interval = 60 on all gmonds - No effect.
  • Disabling the recently added cluster and restarting gmetad-python - Problem went away.
  • Disabling all other clusters and enabling the recently added one - Problem came back.
  • Blew away the SummaryInfo RRD files for the Grid and restarted gmetad-python - No effect.
  • Removed gmetad-python and installed gmetad instead - Problem was rectified.

I can provide gmetad-python.conf and gmond.conf files if requested.

Regards,

Evan.

version 3.5.0 is slower than 3.4.0

I upgraded ganglia-core from 3.4.0 to 3.5.0 and left ganglia-web at the same version, 3.5.3. After the upgrade, CPU usage is much higher. This machine is dedicated to running nginx + spawn-fcgi (for ganglia-web) and gmetad + rrdcached.

Normally we have nagios use ganglia-web for 1000 checks spread over 5 minutes.

(screenshot: ganglia CPU usage graph)

gmetad threads suddenly die with no log messages

gmetad 3.5.0 on centos 6

Stack

Thread 2 (Thread 0x7f4ede5e9700 (LWP 24693)):
#0  0x00000037f0ae62c3 in epoll_wait () from /lib64/libc.so.6
#1  0x000000385ac1ede4 in apr_pollset_poll () from /usr/lib64/libapr-1.so.0
#2  0x00000000004096dc in tcp_listener ()
#3  0x0000003616e077f1 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037f0ae5ccd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f4ee78a8740 (LWP 24685)):
#0  0x00000037f0ae62c3 in epoll_wait () from /lib64/libc.so.6
#1  0x000000385ac1ede4 in apr_pollset_poll () from /usr/lib64/libapr-1.so.0
#2  0x0000000000408ec9 in main ()

lsof with funky "UDP" connections that should be TCP

gmond   24685 nobody    5u  IPv4 107524166      0t0      UDP *:8649 
gmond   24685 nobody    6u  IPv4 107524167      0t0      TCP *:8649 (LISTEN)
gmond   24685 nobody    7u  IPv4 107524172      0t0      UDP lsu0d-a.clearspring.local:43775->lsu00-a.clearspring.local:8649 
gmond   24685 nobody    8u  IPv4 107524175      0t0      UDP lsu0d-a.clearspring.local:33632->lsu01-a.clearspring.local:8649 
gmond   24685 nobody    9u  IPv4 107524178      0t0      UDP lsu0d-a.clearspring.local:37053->lsu03-a.clearspring.local:8649 

No smoking gun in the log lines:

server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root preamble (DTD, etc)
server_thread() 7f55888ea700 unable to write root preamble (DTD, etc)
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f5587ee9700 unable to write root epilog
server_thread() 7f5587ee9700 unable to write root epilog
server_thread() 7f5587ee9700 unable to write root epilog
server_thread() 7f5587ee9700 unable to write root preamble (DTD, etc)
server_thread() 7f5587ee9700 unable to write root epilog
server_thread() 7f5587ee9700 unable to write root epilog
server_thread() 7f5587ee9700 unable to write root epilog
server_thread() 7f5587ee9700 unable to write root epilog
server_thread() 7f5587ee9700 unable to write root epilog
server_thread() 7f5587ee9700 unable to write root epilog
server_thread() 7f5587ee9700 unable to write root epilog
server_thread() 7f5587ee9700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root preamble (DTD, etc)
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root preamble (DTD, etc)
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog
server_thread() 7f5587ee9700 unable to write root epilog
server_thread() 7f55888ea700 unable to write root epilog

ganglia python plugin path should be prepended to python path

Hi,

I ran into a problem writing a python plugin. It had the same name as a system-wide installed python module.

Since ganglia appends the plugin path to the python path, it will load system-wide installed modules before it loads the actual plugin. Even worse, if you have a plugin called 'foo' and then install, say a month later, a python module 'foo', it will break the plugin.

https://github.com/ganglia/monitor-core/blob/master/gmond/modules/python/mod_python.c#L593 says:

PyObject *sys_path = PySys_GetObject("path");
PyObject *addpath = PyString_FromString(path);
PyList_Append(sys_path, addpath);

It should be easy to change that to prepend, instead of append, the plugin path without any downsides.
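
The proposed change would look roughly like this (a sketch against the Python 2 C API, not a tested patch):

PyObject *sys_path = PySys_GetObject("path");
PyObject *addpath = PyString_FromString(path);
/* prepend instead of append, so a plugin shadows any same-named
   system module rather than the other way around */
PyList_Insert(sys_path, 0, addpath);
Py_DECREF(addpath);  /* the list holds its own reference */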

Suse Startup Files broken

For both gmetad and gmond you find the following code in the Suse specific startup files:

gmetad:

# Determine the base and follow a runlevel link name.
base=${0##*/}
link=${base#*[SK][0-9][0-9]}

# Force execution if not called by a runlevel directory.
test $link = $base && START_GMETAD=yes
test "$START_GMETAD" = yes || exit 0

gmond:

# Determine the base and follow a runlevel link name.
base=${0##*/}
link=${base#*[SK][0-9][0-9]}

# Force execution if not called by a runlevel directory.
test $link = $base && START_GMOND=yes
test "$START_GMOND" = yes || exit 0

The problem is that START_GMOND/START_GMETAD are never defined anywhere before that code runs. As a result, the code prevents startup whenever the script is invoked via a runlevel link.

Ability to limit "spikes" in positive slope/counter metrics

There doesn't seem to be an ability to deal with huge jumps (beyond anything that could be considered reasonable for a metric) in counter/positive slope metrics, as anything that is submitting data for them generally does not track the previous value.

It would be "nice" to have an optional reasonable "maximum delta value" to check positive metric data against, definable per metric, so that any value which exceeded that "maximum delta value" would be ignored.
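
A sketch of the guard being asked for; max_delta would come from a per-metric configuration option that does not exist today:

#include <stdbool.h>

/* Accept a new raw reading for a positive-slope metric only if the
 * jump from the previous sample is plausible. A negative delta
 * (counter reset) is left to the existing handling. */
static bool counter_delta_ok(double prev, double cur, double max_delta)
{
    double delta = cur - prev;
    if (delta < 0)
        return true;
    return delta <= max_delta;
}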

gmetad poll() timeout

We setup a cluster in Amazon EC2.

  • One instance in US region running gmetad
  • Four instance in EU region running gmond
    • One of them running as gmond master
  • Use public unicast IPs to transmit data among gmond and gmetad

Sometimes, gmetad fails to receive data from the gmond master for 15 minutes. The error log is:
Aug 15 08:30:08 /usr/sbin/gmetad[3253]: poll() timeout from source 0 for [RS EU] data source after 2531 bytes read
Aug 15 08:30:22 /usr/sbin/gmetad[3253]: poll() timeout from source 0 for [RS EU] data source after 0 bytes read
Aug 15 08:30:37 /usr/sbin/gmetad[3253]: poll() timeout from source 0 for [RS EU] data source after 0 bytes read
Aug 15 08:30:51 /usr/sbin/gmetad[3253]: poll() timeout from source 0 for [RS EU] data source after 0 bytes read
...

Please let me know if there is more info I need to provide.
Thanks a lot.

ganglia-gmetad: gmetad.service wrong => patch attached

the gmetad.service.in is wrong in several ways:
1/ gmetad doesn't recognize the -f option
2/ missing Type=forking
3/ missing EnvironmentFile=
4/ missing --pid-file option
5/ User=ganglia is wrong: if gmetad starts as user ganglia (provided that user exists on the system), it then fails to setuid to the user configured in gmetad.conf, which is nobody by default.
Patch available here:
http://olivier.lahaye1.free.fr/OSCAR/ganglia-gmetad_fix_systemd.patch
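
Not the attached patch (that is at the URL above), but a sketch of a unit shaped by those five points; the EnvironmentFile path, the $GMETAD_OPTIONS variable and the exact option spellings are assumptions:

# gmetad.service, illustrative sketch only
[Unit]
Description=Ganglia Meta Daemon
After=network.target

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/gmetad
PIDFile=/var/run/gmetad.pid
# no User= line: gmetad setuids itself to the user set in gmetad.conf (nobody by default)
ExecStart=/usr/sbin/gmetad --pid-file=/var/run/gmetad.pid $GMETAD_OPTIONS

[Install]
WantedBy=multi-user.target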

cygwin 1.7 + ganglia 3.5

gmond.c:160: error: parse error before '*' token
gmond.c:160: warning: type defaults to `int' in declaration of `hosts_mutex'
gmond.c:160: warning: data definition has no type or storage class
gmond.c: In function `Ganglia_host_get':
gmond.c:1029: warning: implicit declaration of function `apr_thread_mutex_create'
gmond.c:1029: error: `APR_THREAD_MUTEX_DEFAULT' undeclared (first use in this function)
gmond.c:1029: error: (Each undeclared identifier is reported only once
gmond.c:1029: error: for each function it appears in.)
gmond.c:1055: warning: implicit declaration of function `apr_thread_mutex_lock'
gmond.c:1057: warning: implicit declaration of function `apr_thread_mutex_unlock'
gmond.c: In function `tcp_listener':
gmond.c:3056: warning: implicit declaration of function `apr_thread_exit'
gmond.c: In function `main':
gmond.c:3174: error: `APR_THREAD_MUTEX_DEFAULT' undeclared (first use in this function)

Note: This issue was raised by Char Tao on the SourceForge wiki. I'm logging it here so that all issues are tracked in one place.

libmetrics/linux/metrics.c and network counter bit width

libmetrics/linux/metrics.c assumes the presence of 64-bit network counters in /proc/net/dev if configure detects strtoull() (thus defining HAVE_STRTOULL).

This causes issues on iX86 (and presumably other 32-bit) Linux systems, as they have 32-bit network counters. When the 32-bit counters overflow, a huge bias is added, resulting in erratic and erroneous network traffic spikes.

For my personal ganglia deployment, I changed the preprocessor conditional that selected the network statistic type to use i386 instead of HAVE_STRTOULL. This may not be the best solution for integration into the master sources, as it doesn't address the fact that there are other 32 bit Linux architectures. There's also the possibility that a 32 bit gmond could be run on a 64 bit system of the same architectural family (although I'm not sure this is worth consideration, as there may be issues (other than network counter width) that preclude this from working).
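
A sketch of an alternative that sidesteps the HAVE_STRTOULL proxy entirely: keep 64-bit accumulators, but correct for wraparound whenever a raw reading goes backwards (illustrative only, not the libmetrics code):

#include <stdint.h>

/* prev/cur are successive raw readings parsed from /proc/net/dev.
 * If the value went backwards, assume exactly one wrap of a 32-bit
 * counter (valid only while prev fits in 32 bits and at most one
 * wrap occurred between samples). */
static uint64_t counter_delta(uint64_t prev, uint64_t cur)
{
    if (cur >= prev)
        return cur - prev;
    return (UINT64_C(1) << 32) - prev + cur;
}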

Sanitizing metric names breaks python module spoofed metrics

The commit to sanitize metric names (1879103) seems to have broken spoofed metrics generated within a python module. I haven't been able to work out how it breaks things, but if I bypass the sanitize_metric_name() function (by returning immediately as soon as it's called) the spoofed metrics work again.

grid view showing 2x the number of hosts in a cluster

For example:

I have a cluster of 2 nodes

In the main grid page, the www cluster will show 4 nodes

In the main cluster page, the correct number of nodes is displayed for this cluster.

(screenshots: the main grid page and the main cluster page)

I am currently using ganglia 3.6, but this was also occurring with 3.4 and 3.5. It seemed to start happening when I started using hsflowd on my clients.

buffer overflow on file /proc/sys/kernel/osrelease

[root@dhcp192 ganglia-3.3.0]# sbin/gmond -V
gmond 3.3.0
[root@dhcp192 ganglia-3.3.0]# uname -a
Linux dhcp192.company.com 2.6.32-71.7.1.el6-0.11.smp.gcc4.1.x86_64 #1 SMP Fri Jan 7 14:43:49 EST 2011 x86_64 GNU/Linux
[root@dhcp192 ganglia-3.3.0]# cat /proc/sys/kernel/osrelease
2.6.32-71.7.1.el6-0.11.smp.gcc4.1.x86_64
[root@dhcp192 ganglia-3.3.0]# sbin/gmond -t > etc/ganglia/gmond.conf
[root@dhcp192 ganglia-3.3.0]# sbin/gmond -f -d 1 -c etc/ganglia/gmond.conf
slurpfile() read() buffer overflow on file /proc/sys/kernel/osrelease
^C[root@dhcp192 ganglia-3.3.0]#

No rule to make target `gmetad.service.in', needed by `gmetad.service' on Fedora

The error "No rule to make target `gmetad.service.in', needed by `gmetad.service'." appears when I run make for ganglia 3.6.0 on Fedora.
Making all in gmetad
make[2]: Entering directory `/home/q/ganglia/ganglia-3.6.0/gmetad'
make[2]: *** No rule to make target `gmetad.service.in', needed by `gmetad.service'. Stop.
make[2]: Leaving directory `/home/q/ganglia/ganglia-3.6.0/gmetad'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/q/ganglia/ganglia-3.6.0'
make: *** [all] Error 2

module path should lookup in ganglia dir first then fallback to system

I have added the jenkins.py module under /usr/lib/ganglia/python_modules/jenkins.py

The python module is configured to use the path /usr/lib/ganglia/python_modules/

When loading gmond :

/usr/sbin/gmond -m
[PYTHON] Can't find the metric_init function in the python module [jenkins].

Looking with strace, it turns out that gmond imports the system module 'jenkins':

$ strace -f /usr/sbin/gmond -m 2>&1|grep jenkins|fgrep -v 'ENOENT'
...
open("/usr/local/lib/python2.7/dist-packages/python_jenkins-0.2-py2.7.egg/jenkins/__init__.py", O_RDONLY) = 4
...

Looking in gmond/modules/python/mod_python.c, the pyth_metric_init function appends the module path to the list of system paths:

PyObject *sys_path = PySys_GetObject("path");
PyObject *addpath = PyString_FromString(path);
PyList_Append(sys_path, addpath);

I think it should prepend it instead. That would avoid potential name clashes.

This is with Ganglia v3.5.0.

RPM build: v3.6.0: unpackaged files: gmetad.service and gmond.service. (fix included)

Trying to build on Fedora-17 and had unpackaged files.

LC_ALL=C rpmbuild -tb ganglia-3.6.0.tar.gz
...
...
Processing files: ganglia-debuginfo-3.6.0-1.x86_64
Checking for unpackaged file(s): /usr/lib/rpm/check-files /home/ol222822/rpmbuild/BUILDROOT/ganglia-3.6.0-1.x86_64
error: Installed (but unpackaged) file(s) found:
/usr/lib/systemd/system/gmetad.service
/usr/lib/systemd/system/gmond.service

RPM build errors:
Installed (but unpackaged) file(s) found:
/usr/lib/systemd/system/gmetad.service
/usr/lib/systemd/system/gmond.service

missing gmetad.service.in and gmond.service.in files in v3.6.0 tarball (fix included)

gmond/gmond.service.in and gmetad/gmetad.service.in are missing in the ganglia-3.6.0.tar.gz that can be downloaded here: http://sourceforge.net/projects/ganglia/?source=dlp

Thus (on fedora17 at least), rpmbuild -tb ganglia-3.6.0.tar.gz fails with the following error:

make[2]: *** No rule to make target `gmetad.service.in', needed by `gmetad.service'. Stop.
make[2]: Leaving directory `/home/ol222822/rpmbuild/BUILD/ganglia-3.6.0/gmetad'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ol222822/rpmbuild/BUILD/ganglia-3.6.0'
make: *** [all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.ZW5oN2 (%build)
