Comments (13)
I am seeing this still. It is flooding the logs with:
[PYTHON] Can't call the metric handler function for [tcpext_listendrops] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [tcp_attemptfails] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [tcpext_tcploss_percentage] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [tcp_retrans_percentage] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [tcp_outsegs] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [tcp_insegs] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [udp_indatagrams] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [udp_outdatagrams] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [udp_inerrors] in the python module [netstats].
[PYTHON] Can't call the metric handler function for [udp_rcvbuferrors] in the python module [netstats].
...over and over
ganglia 3.6.0-7 (debian)
from gmond_python_modules.
Any advice, or pointers to docs, related to troubleshooting issues like this one are greatly appreciated.
from gmond_python_modules.
Unfortunately there is very little documentation available. You can check out
http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_gmond_python_modules
from gmond_python_modules.
Okay, so I think I've stumbled through this issue, and come up with a potential solution. It seems both the get_tcploss_percentage and get_retrans_percentage functions are prone to ZeroDivisionError, where KeyError is currently the only caught exception.
This eventually exposed the underlying cause: the dict constructor performs what amounts to a shallow copy when passed another dictionary instance. Therefore, nested dicts inside of curr_metrics (aka METRICS) and last_metrics (aka LAST_METRICS) are actually the same dict instance -- and as a consequence the pct assignment results in division by zero. Consider the following:
>>> d1 = {'outside': {'inside': "I'm inside"}}
>>> d2 = dict(d1)
>>> d2 is d1
False
>>> d2['outside'] is d1['outside']
True
>>> d2['outside']['inside'] = "inside dict is a reference to the same instance"
>>> d1
{'outside': {'inside': 'inside dict is a reference to the same instance'}}
>>> d2
{'outside': {'inside': 'inside dict is a reference to the same instance'}}
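One way to avoid the aliasing shown above is to take a deep copy of the snapshot instead of calling dict() on it. The sketch below is only an illustration: the metric names and the layout of curr_metrics are made-up stand-ins for the module's real structure, and whether the patch takes this exact approach is not stated here.

```python
import copy

# Hypothetical stand-in for the module's cached metrics structure.
curr_metrics = {"time": 100.0, "data": {"tcp": {"outsegs": 500}}}

shallow = dict(curr_metrics)        # top-level copy only; nested dicts are shared
deep = copy.deepcopy(curr_metrics)  # nested dicts are independent copies

# Simulate a refresh that updates the nested data in place.
curr_metrics["data"]["tcp"]["outsegs"] = 650

shallow_delta = (curr_metrics["data"]["tcp"]["outsegs"]
                 - shallow["data"]["tcp"]["outsegs"])
deep_delta = (curr_metrics["data"]["tcp"]["outsegs"]
              - deep["data"]["tcp"]["outsegs"])

print(shallow_delta)  # 0: the "last" snapshot changed along with the current one
print(deep_delta)     # 150
```

With the shallow copy, every delta between "current" and "last" collapses to zero, which is how a zero denominator can show up in the percentage calculations.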
See patch: #93
from gmond_python_modules.
Thanks for tracking this one down :)
from gmond_python_modules.
Fix merged per #93
from gmond_python_modules.
This one seems to be back again in the 3.6.0 release. I get the same error messages in my environment, running CentOS 6.4 with gmond 3.6.0 and Python 2.6.6.
from gmond_python_modules.
Perhaps this will help: ganglia/monitor-core#123
from gmond_python_modules.
I have the same problem on 3 of the nodes; all the other nodes are fine.
Oct 23 12:24:30 /usr/sbin/gmond[3764]: [PYTHON] Can't call the metric handler function for [tcpext_tcploss_percentage] in the python module [netstats].#12
Oct 23 12:24:30 /usr/sbin/gmond[3764]: [PYTHON] Can't call the metric handler function for [tcp_retrans_percentage] in the python module [netstats].#12
I used the vvuksan build of ganglia 3.6.0 on CentOS 6.5 with Python 2.6.6 and libconfuse (2.7-4.el6) from the EPEL repo.
For now I have disabled the section containing these two metrics in netstats.pyconf.
Any suggestions on what to look for?
from gmond_python_modules.
This is a real PITA. Any updates on a solution?
from gmond_python_modules.
As pointed out by @shawn174, the solution is this patch: ganglia/monitor-core#123 in ganglia/monitor-core. I will elaborate on this issue a bit here in case it is helpful for future viewers with the same issue.
The problem is that /usr/lib/ganglia/python_modules/netstats.py (the typical installation path on Ubuntu when installed via apt) doesn't handle ZeroDivisionError. Therefore, if some metrics are unchanged since the last period, the denominator is zero, which raises an exception in the function. The solution is rather simple: just catch ZeroDivisionError in the get_tcploss_percentage and get_retrans_percentage functions, as shown in the above PR.
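As a rough illustration of that idea (not the module's actual code), a percentage helper that treats both a missing key and a zero denominator as "no data" could look like the following. The function name safe_percentage and the keys retranssegs and outsegs are hypothetical and chosen only for this sketch.

```python
def safe_percentage(curr, last, numerator_key, denominator_key):
    """Return 100 * delta(numerator) / delta(denominator), or 0.0 if it cannot be computed."""
    try:
        num = float(curr[numerator_key]) - float(last[numerator_key])
        den = float(curr[denominator_key]) - float(last[denominator_key])
        pct = 100.0 * num / den
    except (KeyError, ZeroDivisionError):
        # Missing counter, or no traffic since the last sample.
        pct = 0.0
    # Counters can reset; never report a negative percentage.
    return max(pct, 0.0)


# Hypothetical counter snapshots.
last = {"retranssegs": 10, "outsegs": 1000}
idle = {"retranssegs": 10, "outsegs": 1000}   # nothing changed: denominator is 0
busy = {"retranssegs": 15, "outsegs": 1100}   # 5 retransmits out of 100 segments

print(safe_percentage(idle, last, "retranssegs", "outsegs"))  # 0.0
print(safe_percentage(busy, last, "retranssegs", "outsegs"))  # 5.0
```

Returning 0.0 on an idle interval keeps the handler from raising, which is what stops gmond from logging the "Can't call the metric handler function" message.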
This fix was merged as early as Oct 16, 2013. However, the ganglia 3.6.0 release is older than that: it was released on Apr 30, 2013. It is surprising that apt still provides such an old version of ganglia (at least it is still 3.6.0 on Ubuntu 18.04). Therefore, the easiest way is to modify /usr/lib/ganglia/python_modules/netstats.py directly, or one can install a newer version of ganglia instead of the default version provided by apt.
To sum up, the repeated errors in syslog are caused by an old bug in the Python script (not catching ZeroDivisionError) and a very old version of ganglia shipped by some distributions (Ubuntu 18.04 shipped a release that was 2018 - 2013 = 5 years old!).
from gmond_python_modules.
We are on 3.7.2 and still experiencing this problem. Maybe it was reintroduced in one of the recent releases.
I will likely try editing netstats.py manually as a workaround. Commenting here in case anyone else on 3.7.2 encounters this.
from gmond_python_modules.
Still monitoring, but so far I believe I was able to implement a fix for 3.7.2.
I manually changed /usr/lib64/ganglia/python_modules/netstats.py (your installation location may differ) to catch ZeroDivisionError in the delta function. I can't see how this would have a negative effect on monitoring, since the exception should only be raised in divide-by-zero situations. So far so good on the node I'm testing this on.
    def get_delta(name):
        """Return change over time for the requested metric"""
        # get metrics
        [curr_metrics, last_metrics] = get_metrics()
        parts = name.split("_")
        group = parts[0]
        metric = "_".join(parts[1:])
        try:
            delta = (float(curr_metrics['data'][group][metric]) - float(last_metrics['data'][group][metric])) / (curr_metrics['time'] - last_metrics['time'])
            if delta < 0:
                print name + " is less 0"
                delta = 0
        except KeyError:
            delta = 0.0
        except ZeroDivisionError:
            delta = 0.0
        return delta
The only change being the addition of the lines:
        except ZeroDivisionError:
            delta = 0.0
from gmond_python_modules.