zalando-zmon / zmon-worker
ZMON Python Worker
Home Page: https://zmon.io/
License: Other
Merge our Zalando internal version into this Open Source worker (goal: only have one version!).
Please add unit tests for the JSON Path functions introduced in 91ace78
It seems that the {{}}-pattern substitution for captures is not working for HipChat notifications if the message is given explicitly. E.g. in XXX, the first line produces
ALERT ENDED: Balance AWS: Business Partner Service Not Found Responses in Last Hour ({details}) on ad-app-tier-business-xxx-service-test-463[aws:xxx:eu-central-1]
instead of
ALERT ENDED: Balance AWS: Business Partner Service Not Found Responses in Last Hour ({details}) on ad-app-tier-business-xxx-service-481[aws:xxx:eu-central-1]
flake8 was failing on Travis. Check whether it passes now and enable it again.
...to mitigate partial data in "now" period.
Unit tests are failing right now: https://travis-ci.org/zalando/zmon-worker
Ideally I could pick from multiple templates, but for now it would be okay to just pass some flags like include_value, include_definition etc. to the default template(s).
Example plugin.history.enabled: false
Default is enabled.
We want to create a /health/ endpoint in our master cherrypy process that reflects the status of the system.
Background:
The master process, which spawns all the workers, contains a cherrypy HTTP server and an RPC server for internal communication with its child processes.
Each child worker process has a Main thread, which runs the ZMON checks, and a Reactor thread, which reacts to special circumstances and reports them to the master via an RPC call. Currently the Reactor thread's only function is to detect when the Main thread is stuck in a long check and to trigger an RPC call asking the master process to terminate this child worker.
We want to expand the Reactor thread to periodically report its health status to the master process. The master process will aggregate the health feedback it receives from all child workers so that it can be presented in an HTTP endpoint that reflects when the whole system is malfunctioning.
Proposed specs:
endpoint: /health
return:
Criteria for unhealthy system:
what else...?
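The aggregation step described above could be sketched roughly as follows. This is only an illustration in Python 3: the class name HealthAggregator, the max_age_seconds threshold and the shape of the returned dict are assumptions, not part of the proposal, and the CherryPy and RPC wiring is left out.

```python
import time


class HealthAggregator:
    """Collects periodic health reports from child workers (hypothetical helper)."""

    def __init__(self, max_age_seconds=60.0, clock=time.time):
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._last_report = {}  # worker name -> (timestamp, healthy flag)

    def report(self, worker, healthy=True):
        """Would be called via RPC by a worker's Reactor thread."""
        self._last_report[worker] = (self._clock(), healthy)

    def status(self):
        """Build the payload a /health endpoint could return."""
        now = self._clock()
        workers = {}
        for worker, (ts, healthy) in self._last_report.items():
            stale = (now - ts) > self.max_age_seconds
            workers[worker] = {'healthy': healthy and not stale, 'stale': stale}
        # Assumption: the system is unhealthy if no worker has reported yet,
        # or if any worker is stale or reports itself unhealthy.
        overall = bool(workers) and all(w['healthy'] for w in workers.values())
        return {'healthy': overall, 'workers': workers}
```

The master's HTTP handler would then only need to serialize status() and pick a status code based on the 'healthy' flag.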
To be checked: broken notification settings (e.g. invalid syntax in https://docs.zmon.io/en/latest/user/alert-definitions.html#notifications) should not prevent the alert from popping up (e.g. on the dashboard).
I think we should have a clean set of "core" plugins which are 100% supported and unit tested:
Some plugins such as SNMP and Nagios are currently not very useful and should move to a new "extra" plugin section. The "extra" plugins should be located in the same git repo, but they should only be loaded on-demand by setting an environment variable.
Benefits:
Support for analytics data queries.
The main worker test (using multiple processes) sometimes fails:
for string in expected_strings:
> assert string in data['zmon:checks:123:77']
E TypeError: 'NoneType' object has no attribute '__getitem__'
Please add the support for local execution of the check_dig nagios plugin.
Queries with curly braces end up raising exception:
IndexError: tuple index out of range
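One plausible cause (an assumption, since the failing code is not shown here) is that the query string is passed through str.format(), which treats literal braces as placeholders. The sketch below reproduces that error class in Python 3 and shows brace escaping as a possible workaround; escape_braces and the sample query are made up for illustration.

```python
def escape_braces(text):
    """Double literal braces so str.format() leaves them untouched."""
    return text.replace('{', '{{').replace('}', '}}')


query = 'SELECT {} FROM metrics'  # hypothetical query containing literal braces

try:
    query.format()  # '{}' is read as a positional placeholder with no argument
except IndexError as exc:
    error = exc

# After escaping, formatting returns the original text unchanged.
safe = escape_braces(query).format()
```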
In order to reuse CloudWatch Alarms on my ZMON dashboard, I would like to be able to write checks, that fetch the Alarm State of CloudWatch Alarms:
For the check defined like
def check():
    status_code = http('https://service.dns.name/file/very-large-file.zip').code()
    return {"status_code": status_code}
I would like to have status_code = 302 when the server returns a redirect. The reason is that I only need to check that very-large-file.zip is accessible, but I don't want to download this file in ZMON.
Currently this check returns status_code = 200, which means that ZMON follows redirects. It would be good to be able to disable this behaviour.
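The desired behaviour can be illustrated outside ZMON with the Python 3 standard library, whose http.client does not follow redirects. This is a standalone sketch, not the ZMON http wrapper; the function name status_without_redirects and the choice of a HEAD request by default are assumptions.

```python
import http.client
from urllib.parse import urlsplit


def status_without_redirects(url, method='HEAD', timeout=10):
    """Return the raw status code for `url` without following redirects."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == 'https'
                else http.client.HTTPConnection)
    conn = conn_cls(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or '/'
        if parts.query:
            path += '?' + parts.query
        conn.request(method, path)
        # Only the status line and headers are read; the body is not downloaded.
        return conn.getresponse().status
    finally:
        conn.close()
```

With the requests library, passing allow_redirects=False to the request call gives the same effect.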
We can optimize the CloudWatch wrapper (and reduce the probability of running into AWS rate limits) by allowing it to be used without the "list_metrics" call:
Introduce a new method (e.g. "query_one") that calls get_metric_statistics directly if all parameters are known.
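A minimal sketch of such a method, assuming `client` behaves like a boto3 CloudWatch client. The signature, the defaults and the choice to return only the most recent datapoint are assumptions made for illustration, not the wrapper's actual API.

```python
from datetime import datetime, timedelta, timezone


def query_one(client, namespace, metric_name, dimensions, statistic='Average',
              minutes=5, period=60):
    """Fetch one metric directly, skipping list_metrics (hypothetical helper)."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    response = client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=[{'Name': k, 'Value': v} for k, v in sorted(dimensions.items())],
        StartTime=start,
        EndTime=end,
        Period=period,
        Statistics=[statistic],
    )
    datapoints = sorted(response.get('Datapoints', []), key=lambda d: d['Timestamp'])
    # Return the most recent value, or None if CloudWatch had no data.
    return datapoints[-1][statistic] if datapoints else None
```

Because every dimension is given explicitly, wildcard matching (which needs list_metrics) is not available on this path.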
Not the highest priority, but we should migrate the ZMON Worker as we do everything in Python 3 now.
There are some APIs that return different timestamps using epoch (ZMON's is one of these, to give an example). It would be nice to have the time() helper function handling these. One further improvement could possibly be adding the datetime.strptime() functionality to make life easier.
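A rough sketch of what such normalization could look like in Python 3. The helper name to_epoch_seconds, the seconds-versus-milliseconds heuristic and the UTC assumption for parsed strings are all illustrative choices, not existing behaviour of time().

```python
from datetime import datetime, timezone


def to_epoch_seconds(value, fmt=None):
    """Normalize a timestamp to epoch seconds (hypothetical helper).

    - strings are parsed with datetime.strptime(value, fmt) and treated as UTC
    - numbers above 1e11 are assumed to be milliseconds (heuristic)
    """
    if isinstance(value, str):
        if fmt is None:
            raise ValueError('a strptime format is required for string timestamps')
        parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        return parsed.timestamp()
    value = float(value)
    return value / 1000.0 if value > 1e11 else value
```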
Cassandra wrapper execute raises exception with python 2.7.12 (cassandra-driver 2.7.2)
('Unable to connect to any servers', {'cassandra-node': TypeError('ref() does not take keyword arguments',)})
The check command ping() gives me "[Errno 2] No such file or directory" on our system, and also on demo.zmon.io. Currently I only have a ping check with the following content:
ping()
We need to additionally support https://www.pagerduty.com/ for alert notifications.
Could be useful for supplying special variables that are accessible to all check commands. One use case is authorization tokens that can be used in the http wrapper to make authorized requests.
Suggestion:
Store in a dict in config
New command, e.g. secrets() or vars() etc.
Example usage
vars('my_service_token')
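A minimal sketch of the suggestion, assuming the worker config is a plain dict with a 'vars' key. Both the key name and the make_vars factory are hypothetical.

```python
def make_vars(config):
    """Build a lookup function over the 'vars' dict of the worker config."""
    secrets = dict(config.get('vars', {}))  # copy, so checks cannot mutate the config

    def lookup(name, default=None):
        return secrets.get(name, default)

    return lookup
```

The returned function could then be exposed to check commands under the name vars, shadowing the Python builtin inside the check environment.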
TypeError: set([u'i-0bfb11b2', u'i-29fdbe90', u'i-0305b6bb']) is not JSON serializable
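The json module cannot encode Python sets. One possible fix, sketched here under the assumption that check results are encoded with json.dumps, is a default hook that turns sets into sorted lists; json_default is a made-up name.

```python
import json


def json_default(obj):
    """Fallback for json.dumps: serialize sets as sorted lists."""
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)
    raise TypeError('%r is not JSON serializable' % (obj,))


result = {'instances': {'i-0bfb11b2', 'i-29fdbe90', 'i-0305b6bb'}}
encoded = json.dumps(result, default=json_default)
```

Sorting keeps the output stable between runs, but it requires the set's elements to be mutually comparable.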
We still use the file-based eventlog Python module, which does not work properly in a Docker context (files are written within the Docker container's filesystem).
We need to merge the Metric Cache reporting from our old internal worker.
Ideally this should be handled gracefully, triggering the alert; right now it does not get executed or reported at all.
ERROR [worker-35] zmon_worker_monitor.zmon_worker.tasks.main/notify: Notification for check Webapp HTTP Status reached soft time limit
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/zmon_worker-cd156-py2.7.egg/zmon_worker_monitor/zmon_worker/tasks/main.py", line 1445, in notify
downtimes = self._evaluate_downtimes(alert_id, entity_id)
File "/usr/local/lib/python2.7/dist-packages/zmon_worker-cd156-py2.7.egg/zmon_worker_monitor/zmon_worker/tasks/main.py", line 1635, in _evaluate_downtimes
if now > d['start_time'] and now < d['end_time']:
KeyError: 'start_time'
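A defensive version of the comparison in the traceback could look like the sketch below. The function name is_in_downtime and the policy of silently skipping malformed entries are assumptions; the actual _evaluate_downtimes method does more than this.

```python
def is_in_downtime(downtimes, now):
    """Return True if `now` falls inside any well-formed downtime entry.

    Entries missing 'start_time' or 'end_time' are skipped instead of
    raising KeyError (assumed policy; logging them may be preferable).
    """
    for d in downtimes:
        start, end = d.get('start_time'), d.get('end_time')
        if start is None or end is None:
            continue
        if start < now < end:
            return True
    return False
```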
ZMON worker returns an error when trying to get time-series-based logs from Scalyr.
Here is a simple check definition:
def check():
    query = "$serverHost='my-awesome-service' $severity >= '5'"
    return scalyr().timeseries(query, minutes=5)
Results in
{"status":"error/client/badRequest","message":"request must have 'token' field"}
The size of check results is currently unbounded. This leads to problems where users generate (mostly accidentally) megabytes of result data for a single check. As the data is stored in JSON format in Redis (and additionally in KairosDB), we might run into memory issues (e.g. Redis memory fragmentation and total database size).
A simple and effective approach would be to introduce a reasonable (configurable) maximum for check results, let's say 64KiB.
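Such a limit could be enforced roughly as follows, measuring the size of the JSON encoding since that is what gets stored. The function name and the choice to raise rather than truncate are assumptions.

```python
import json

MAX_RESULT_BYTES = 64 * 1024  # 64 KiB, the limit suggested above


def enforce_result_limit(value, max_bytes=MAX_RESULT_BYTES):
    """Return `value` unchanged if its JSON encoding fits, else raise."""
    size = len(json.dumps(value).encode('utf-8'))
    if size > max_bytes:
        raise ValueError('check result is %d bytes, limit is %d' % (size, max_bytes))
    return value
```

Raising keeps the failure visible to the check author; the worker could catch the error and store a short message in place of the oversized result.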
currently only handled for KairosDB writes to align data points.
sourceType could be unique per application, which leads to queries returning no results for the default sourceType:application-log filtering.
Add convenience wrapper to query AppDynamics API.
The HTTP wrapper and other HTTP requests should set a proper User-Agent HTTP header
Right now resolve is added to the TCP wrapper, which is not exactly related to TCP.
Adding a DNS wrapper would make more sense here.
We should get rid of the "ignore" list in tox.ini and fix all flake8 warnings/errors.
We are still using the CherryPy configuration file format inside ZMON Worker, but actually app.py writes environment variable values to it.
Get rid of this legacy dependency and use environment variables (+ args) directly.
Logging is not very helpful right now:
2015-12-22 05:03:06,306 - INFO - zmon_worker_monitor.zmon_worker.tasks.notacelery_task - send_metrics - Send metrics, end storing metrics in redis count: 0, duration: 0.002s
Dec 22 05:03:06 ip-172-31-163-67 docker/b6eb55fa92b8[840]: 2015-12-22 05:03:06,382 - INFO - zmon_worker_monitor.zmon_worker.tasks.notacelery_task - send_metrics - Send metrics, end storing metrics in redis count: 0, duration: 0.002s
Dec 22 05:03:07 ip-172-31-136-190 docker/d65d8e292bd8[839]: 2015-12-22 05:03:07,125 - INFO - zmon_worker_monitor.redis_context_manager - __exit__ - IdleLoop: No task received... pid=20, count: 146
Dec 22 05:03:08 ip-172-31-136-190 docker/d65d8e292bd8[839]: 2015-12-22 05:03:08,031 - INFO - zmon_worker_monitor.redis_context_manager - __exit__ - IdleLoop: No task received... pid=14, count: 13276
Dec 22 05:03:08 ip-172-31-136-190 docker/d65d8e292bd8[839]: 2015-12-22 05:03:08,032 - INFO - zmon_worker_monitor.redis_context_manager - __exit__ - IdleLoop: No task received... pid=17, count: 12994
Dec 22 05:03:08 ip-172-31-136-190 docker/d65d8e292bd8[839]: 2015-12-22 05:03:08,031 - INFO - zmon_worker_monitor.redis_context_manager - __exit__ - IdleLoop: No task received... pid=16, count: 13041
This is useful when query results include aggregation results. Right now only the result hits are returned, which won't be sufficient in that case.
We need to upgrade the Python/SSL version of ZMON Worker as TLS v1.2 is currently not supported.
Proof : https://demo.zmon.io/#/alert-details/6 (my personal website https://srcco.de only supports the latest TLS v1.2 protocol)
Upgrading the Docker base image to Ubuntu 15.10 should do the trick.
The group_by option should be passed through to the REST API: http://kairosdb.github.io/docs/build/html/restapi/QueryMetrics.html#metric-properties
These names are historic and make no sense, let's rename the module and class to something meaningful.
/<account>/<check-id>/<region>
Mean != median, use the p50 value as median here: https://github.com/zalando/zmon-worker/blob/master/zmon_worker_monitor/zmon_worker/functions/http.py#L112
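The difference is easy to see on a small, made-up sample of request durations, where one slow request pulls the mean far away from the median:

```python
import statistics

durations = [0.1, 0.1, 0.2, 0.2, 5.0]  # made-up request durations in seconds

mean = statistics.mean(durations)      # pulled up by the single slow request
median = statistics.median(durations)  # the middle value, i.e. p50
```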
It would be helpful if I could trigger an HTTP call on alert.
example: send_http("https://www.example.de/trigger?param1=...")
In certain cases, returning the response object could be needed. One use case is a REST API with pagination headers (Link). In this case, both the response JSON and the headers are required to complete the check. The HEAD method is not always expected to return either Link headers or a payload with pagination links.
The suggestion is to either return the requests.Response object or return a simplified ZmonHTTPResponse with fixed properties (headers, json(), status_code, text, ok).
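The simplified variant could be as small as the sketch below. Only the five property names come from the suggestion; the constructor arguments, the lower-casing of header names and the definition of ok as "status below 400" are assumptions.

```python
import json


class ZmonHTTPResponse:
    """Simplified, read-only view of an HTTP response (sketch)."""

    def __init__(self, status_code, headers, text):
        self.status_code = status_code
        # Header names are case-insensitive, so normalize them to lower case.
        self.headers = {k.lower(): v for k, v in headers.items()}
        self.text = text

    @property
    def ok(self):
        return self.status_code < 400

    def json(self):
        return json.loads(self.text)
```

A check could then read response.headers['link'] to follow pagination while still using response.json() for the payload.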
The check
cloudwatch().query({'AvailabilityZone': 'NOT_SET', 'AutoScalingGroupName': 'tailor-*' }, 'NetworkIn', 'Average')
may fail if the number of metrics returned exceeds the boto3 CloudWatch metrics page size (currently 500): ZMON does the filtering and only uses the first page.
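A sketch of collecting every page before filtering, assuming `client` behaves like a boto3 CloudWatch client whose list_metrics response carries a 'NextToken' while more pages exist. The helper name list_all_metrics is made up.

```python
def list_all_metrics(client, **kwargs):
    """Collect metrics from every page of list_metrics, not just the first."""
    metrics = []
    while True:
        page = client.list_metrics(**kwargs)
        metrics.extend(page.get('Metrics', []))
        token = page.get('NextToken')
        if not token:
            return metrics
        # Ask for the next page on the following iteration.
        kwargs['NextToken'] = token
```

boto3 also offers client.get_paginator('list_metrics'), which hides the token handling. Either way, more pages mean more API calls, which matters for the rate limits mentioned elsewhere in this list.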