repeated errors in the netstats module (tcp_retrans_percentage and tcpext_tcploss_percentage metrics) (gmond_python_modules, closed)

ganglia commented on September 24, 2024
repeated errors in the netstats module (tcp_retrans_percentage and tcpext_tcploss_percentage metrics)

from gmond_python_modules.

Comments (13)

rubinlinux commented on September 24, 2024

I am seeing this still. It is flooding the logs with:

[PYTHON] Can't call the metric handler function for [tcpext_listendrops] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [tcp_attemptfails] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [tcpext_tcploss_percentage] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [tcp_retrans_percentage] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [tcp_outsegs] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [tcp_insegs] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [udp_indatagrams] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [udp_outdatagrams] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [udp_inerrors] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [udp_rcvbuferrors] in the python module [netstats].

...over and over

ganglia 3.6.0-7 (debian)

martinwalsh commented on September 24, 2024

Any advice, or pointers to docs, related to troubleshooting issues like this one are greatly appreciated.

vvuksan commented on September 24, 2024

Unfortunately there is very little documentation available. You can check out

http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_gmond_python_modules

martinwalsh commented on September 24, 2024

Okay, so I think I've stumbled through this issue, and come up with a potential solution. It seems both the get_tcploss_percentage and get_retrans_percentage functions are prone to ZeroDivisionError, where KeyError is currently the only caught exception.

This eventually exposed the underlying cause: the dict constructor performs what amounts to a shallow copy when passed another dictionary instance. Therefore, nested dicts inside of curr_metrics (aka METRICS) and last_metrics (aka LAST_METRICS) are actually the same dict instance -- and as a consequence the pct assignment results in division by zero. Consider the following:

>>> d1 = {'outside': {'inside': "I'm inside"}}
>>> d2 = dict(d1)
>>> d2 is d1
False
>>> d2['outside'] is d1['outside']
True
>>> d2['outside']['inside'] = "inside dict is a reference to the same instance"
>>> d1
{'outside': {'inside': 'inside dict is a reference to the same instance'}}
>>> d2
{'outside': {'inside': 'inside dict is a reference to the same instance'}}

See patch: #93
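
As a minimal sketch of one way to avoid the aliasing shown above, copy.deepcopy duplicates the nested dicts too, so a stored previous sample no longer follows in-place updates to the current one. The dictionary layout and variable names below are hypothetical stand-ins, and this is not necessarily how the patch itself resolves the problem.

```python
import copy

# Hypothetical stand-in for the module's nested metrics structure.
curr_metrics = {'time': 100.0, 'data': {'tcp': {'outsegs': 500}}}

shallow = dict(curr_metrics)        # copies only the top-level dict
deep = copy.deepcopy(curr_metrics)  # copies the nested dicts as well

# Simulate the next collection cycle updating the current sample in place.
curr_metrics['data']['tcp']['outsegs'] = 750

# The shallow copy shares the nested dict, so it sees the update.
shallow_value = shallow['data']['tcp']['outsegs']
# The deep copy owns its nested dicts, so it keeps the old sample.
deep_value = deep['data']['tcp']['outsegs']
```

With the shallow copy, "current minus last" is always zero for nested values, which is how a zero denominator can arise in the percentage functions.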

vvuksan commented on September 24, 2024

Thanks for tracking this one down :)

vvuksan commented on September 24, 2024

Fix merged per #93

johsod commented on September 24, 2024

This one seems to be back in the 3.6.0 release. I get the same error messages in my environment, which runs CentOS 6.4 with gmond 3.6.0 and Python 2.6.6.

shawn174 commented on September 24, 2024

Perhaps this will help: ganglia/monitor-core#123

bphusted commented on September 24, 2024

I have the same problem on 3 of the nodes; all the other nodes are fine.
Oct 23 12:24:30 /usr/sbin/gmond[3764]: [PYTHON] Can't call the metric handler function for [tcpext_tcploss_percentage] in the python module [netstats].#12
Oct 23 12:24:30 /usr/sbin/gmond[3764]: [PYTHON] Can't call the metric handler function for [tcp_retrans_percentage] in the python module [netstats].#12

I used vvuksan's build of ganglia 3.6.0, CentOS 6.5, Python 2.6.6, and libconfuse (2.7-4.el6) from the EPEL repo.

For now I have disabled the section with these two metrics in netstats.pyconf.

Any suggestions on what to look for?

acutchin commented on September 24, 2024

This is a real PITA. Any updates on a solution?

refraction-ray commented on September 24, 2024

As pointed out by @shawn174, the solution is this patch in ganglia/monitor-core: ganglia/monitor-core#123. I will elaborate on the issue a bit here in case it helps future readers who hit the same problem.

The problem is that /usr/lib/ganglia/python_modules/netstats.py (the typical installation path on Ubuntu when installed via apt) doesn't handle ZeroDivisionError. If certain metrics are unchanged over the last period, the denominator is zero, which raises an exception in the function. The solution is simple: catch ZeroDivisionError in the get_tcploss_percentage and get_retrans_percentage functions, as shown in the above PR.
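
To illustrate the idea, here is a minimal sketch of a percentage calculation that tolerates a zero denominator. The helper name safe_percentage and its parameters are hypothetical; the real get_tcploss_percentage and get_retrans_percentage functions read their values from the module's metric dictionaries, and the exact change is in the PR referenced above.

```python
def safe_percentage(part_delta, total_delta):
    """Return part_delta as a percentage of total_delta.

    Falls back to 0.0 when total_delta is zero, i.e. when the
    underlying counter did not change between two samples.
    """
    try:
        pct = 100.0 * part_delta / total_delta
    except ZeroDivisionError:
        pct = 0.0
    return pct


# Hypothetical sample deltas: 5 retransmitted segments out of 200 sent.
busy_period = safe_percentage(5, 200)
# No traffic in the period: both counters are unchanged.
idle_period = safe_percentage(0, 0)
```

Reporting 0.0 for an idle period is a design choice: no segments were sent, so none were retransmitted, and gmond receives a value instead of an exception.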

The fix was merged as early as Oct 16, 2013. However, the ganglia 3.6.0 release is older still: it was released on Apr 30, 2013. It is surprising that apt still provides such an old version of ganglia (it is still 3.6.0 on Ubuntu 18.04, at least). The easiest fix is therefore to modify /usr/lib/ganglia/python_modules/netstats.py directly, or to install a newer version of ganglia instead of the default one provided by apt.

To sum up, the repeated errors in syslog are caused by an old bug in the Python script (not catching ZeroDivisionError) combined with a very old version of ganglia shipped by some distributions (Ubuntu 18.04 provided a release that was 2018 - 2013 = 5 years old!).

 avatar commented on September 24, 2024

We are on 3.7.2 and still experiencing this problem. Maybe it was reintroduced in one of the last versions released.

I will likely try editing netstats.py manually as a solution. I'm commenting here in case anyone else on 3.7.2 runs into this.

 avatar commented on September 24, 2024

I'm still monitoring, but so far I believe I was able to implement a fix for 3.7.2.

I manually changed /usr/lib64/ganglia/python_modules/netstats.py (your installation location may differ) to catch ZeroDivisionError in the delta function. I can't see how this would have a negative effect on monitoring, since the exception should only be raised in divide-by-zero situations. So far so good on the node I'm testing this on.

def get_delta(name):
    """Return change over time for the requested metric"""

    # get metrics
    [curr_metrics, last_metrics] = get_metrics()

    parts = name.split("_")
    group = parts[0]
    metric = "_".join(parts[1:])

    try:
        delta = (float(curr_metrics['data'][group][metric]) - float(last_metrics['data'][group][metric])) / (curr_metrics['time'] - last_metrics['time'])
        if delta < 0:
            print name + " is less 0"
            delta = 0
    except KeyError:
        delta = 0.0
    except ZeroDivisionError:
        delta = 0.0

    return delta

The only change is the addition of these lines:

    except ZeroDivisionError:
        delta = 0.0
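
For anyone who wants to check the behaviour outside gmond, the following is a self-contained sketch of the same guard. The function rate_of_change and the sample dictionaries are hypothetical stand-ins; the real get_delta obtains its samples from get_metrics().

```python
def rate_of_change(curr, last, group, metric):
    """Per-second change of a counter between two samples, never negative."""
    try:
        delta = (float(curr['data'][group][metric])
                 - float(last['data'][group][metric])) / (curr['time'] - last['time'])
    except (KeyError, ZeroDivisionError):
        # Missing metric, or both samples carry the same timestamp.
        return 0.0
    return max(delta, 0.0)


# Hypothetical samples taken 10 seconds apart.
last_sample = {'time': 100.0, 'data': {'tcp': {'outsegs': 500}}}
curr_sample = {'time': 110.0, 'data': {'tcp': {'outsegs': 750}}}

normal_rate = rate_of_change(curr_sample, last_sample, 'tcp', 'outsegs')
# Same sample on both sides: the time difference is zero.
same_sample_rate = rate_of_change(last_sample, last_sample, 'tcp', 'outsegs')
# A metric that is not present in the samples.
missing_rate = rate_of_change(curr_sample, last_sample, 'udp', 'inerrors')
```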
