
zmon-worker's People

Contributors

a1exsh, aermakov-zalando, alexeyklyukin, anton-ryzhov, avaczi, bkecskemeti, csenol, cvirus, dneuhaeuser-zalando, drummerwolli, gargravarr, heroldus, hjacobs, jan-m, lerovitch, lfroment0, lorenzhawkes, losbossos, mohabusama, mroderick, mtesseract, olevchyk, pitr, porrl, prayerslayer, szuecs, tkrop, twz123, vetinari, whiskeysierra

zmon-worker's Issues

Capture is not working correctly for HipChat notifications.

It seems that the {{}}-pattern substitution for captures does not work for HipChat notifications if the message is given explicitly. E.g. in XXX the first line produces

ALERT ENDED: Balance AWS: Business Partner Service Not Found Responses in Last Hour ({details}) on ad-app-tier-business-xxx-service-test-463[aws:xxx:eu-central-1]

instead of

ALERT ENDED: Balance AWS: Business Partner Service Not Found Responses in Last Hour ({details}) on ad-app-tier-business-xxx-service-481[aws:xxx:eu-central-1]

Allow more control over email body

Ideally I could pick from multiple templates, but for now it would be okay to just pass some flags like include_value, include_definition etc to the default template(s).

Create /health http endpoint in master process

We want to create a /health/ endpoint in our master cherrypy process that reflects the status of the system.

Background:
The master process, which spawns all the workers, contains a cherrypy HTTP server and an RPC server for internal communication with its child processes.
Each child worker process has a Main thread, which runs the ZMON checks, and a Reactor thread, which reacts to special circumstances and reports them to the master via an RPC call. Currently the Reactor thread's only function is to detect when the Main thread is stuck in a long check and to trigger an RPC call asking the master process to terminate this child worker.
We want to expand the Reactor thread to periodically report its health status to the master process. The master process will aggregate the health feedback it receives from all child workers so that it can be presented in an HTTP endpoint that reflects when the whole system is malfunctioning.

Proposed specs:

endpoint: /health
return:

  • 200 OK: System healthy
  • 503 Service Unavailable: System unhealthy

Criteria for unhealthy system:

  • If n/2 + 1 worker processes are not responding, meaning they have stopped contacting the rpc server.
  • If n/2 + 1 worker processes were killed (because they got stuck) or died for unknown causes in some unit of time (30 min?)

what else...?
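
The two criteria above could be evaluated roughly as follows. This is only a sketch under stated assumptions, not the worker's implementation: the function name, the `last_seen` and `deaths` inputs and the 60-second heartbeat timeout are hypothetical, while the n/2 + 1 threshold and the 30-minute window come from the proposal.

```python
import time


def health_status(last_seen, deaths, n_workers, now=None,
                  heartbeat_timeout=60, death_window=30 * 60):
    """Return the HTTP status code for /health: 200 or 503.

    last_seen: dict worker_id -> timestamp of the last RPC contact (hypothetical)
    deaths:    list of timestamps at which a worker was killed or died (hypothetical)
    """
    now = time.time() if now is None else now
    majority = n_workers // 2 + 1

    # Workers that have stopped contacting the RPC server.
    silent = sum(1 for ts in last_seen.values() if now - ts > heartbeat_timeout)
    # Workers that never reported at all count as silent too.
    silent += max(0, n_workers - len(last_seen))

    # Workers killed (stuck) or dead for unknown causes within the window.
    recent_deaths = sum(1 for ts in deaths if now - ts <= death_window)

    if silent >= majority or recent_deaths >= majority:
        return 503
    return 200
```

The master's HTTP handler would then only need to set the response status to the returned value.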

Consider moving SNMP and Nagios plugins to "extra" plugins

I think we should have a clean set of "core" plugins which are 100% supported and unit tested:

  • HTTP
  • Time
  • ZMON
  • PostgreSQL
  • ..

Some plugins such as SNMP and Nagios are currently not very useful and should move to a new "extra" plugin section. The "extra" plugins should be located in the same git repo, but they should only be loaded on-demand by setting an environment variable.

Benefits:

  • Clean separation of 100% supported plugins and "legacy" stuff
  • We can exclude the "extra" plugins from code coverage as we will probably never write a full test suite for them
  • Startup and test time is faster as less code is loaded

Fix flaky unit test

The main worker test (using multiple processes) sometimes fails:

        for string in expected_strings:
>           assert string in data['zmon:checks:123:77']
E           TypeError: 'NoneType' object has no attribute '__getitem__'

Add an option to disable redirects in "http" check command.

For the check defined like

def check():
  status_code = http('https://service.dns.name/file/very-large-file.zip').code()
  return {"status_code": status_code}

I would like to get status_code = 302 when the server returns a redirect. The reason is that I only need to check that very-large-file.zip is accessible, but I don't want to download this file in ZMON.

Currently this check returns status_code = 200, which means that ZMON follows redirects. It would be good to have a way to disable this behaviour.
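
To illustrate the requested behaviour outside of ZMON, here is a standard-library sketch that returns the redirect status instead of following it. It does not use the ZMON http() wrapper; the `status_code` function and its `follow_redirects` flag are hypothetical names chosen for this example.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to follow the redirect;
        # the 3xx response is then raised as an HTTPError.
        return None


def status_code(url, follow_redirects=True, timeout=10):
    """Return the HTTP status code, optionally without following redirects."""
    handlers = [] if follow_redirects else [NoRedirect()]
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as e:
        # Raised for 4xx/5xx, and for 3xx when redirects are disabled.
        return e.code
```

With `follow_redirects=False` the large file behind the redirect is never requested, which is the point of the feature request.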

Allow querying CloudWatch without additional "list_metrics" call

We can optimize the CloudWatch wrapper (and reduce the probability of running into stupid AWS rate limits) by allowing it to be used without the "list_metrics" call:

Introduce new method (e.g. "query_one") to directly call get_metric_statistics if all parameters are known.
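
A possible shape for such a method, as a sketch only: it takes an already created boto3 CloudWatch client and calls get_metric_statistics directly. The signature, the default period and time range, and returning only the latest datapoint are assumptions, not the wrapper's actual behaviour.

```python
from datetime import datetime, timedelta, timezone


def query_one(client, dimensions, metric_name, statistic, namespace,
              period=60, minutes=5):
    """Fetch one fully specified metric without calling list_metrics.

    client: a boto3 CloudWatch client, e.g. boto3.client('cloudwatch')
    dimensions: dict of exact dimension names and values (no wildcards)
    """
    end = datetime.now(timezone.utc)
    response = client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=[{'Name': k, 'Value': v} for k, v in sorted(dimensions.items())],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=period,
        Statistics=[statistic],
    )
    # Datapoints are not guaranteed to be ordered, so sort by timestamp.
    datapoints = sorted(response.get('Datapoints', []), key=lambda d: d['Timestamp'])
    return datapoints[-1][statistic] if datapoints else None
```

Because every parameter is known up front, this costs one API call per metric instead of a list_metrics call followed by the statistics call.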

Support for epochs

Some APIs return timestamps as epoch values (ZMON's own API is one example). It would be nice to have the time() helper function handle these. A further improvement could be adding datetime.strptime() functionality to make life easier.
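
As a rough illustration of what such handling could look like, the sketch below converts epoch seconds, epoch milliseconds or a formatted string into a UTC datetime. The `parse_time` name, the milliseconds heuristic and the choice to treat strings as UTC are all assumptions; this is not the existing time() helper.

```python
from datetime import datetime, timezone


def parse_time(value, fmt=None):
    """Return a timezone-aware UTC datetime.

    value: epoch seconds or milliseconds (int/float), or a string parsed with fmt.
    """
    if isinstance(value, str):
        if fmt is None:
            raise ValueError('fmt is required for string timestamps')
        # Assumption: formatted strings are interpreted as UTC.
        return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    # Heuristic (assumption): values above 1e11 are treated as milliseconds.
    seconds = value / 1000.0 if value > 1e11 else value
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

For example, `parse_time(1450760586)` and `parse_time(1450760586000)` both yield 2015-12-22 05:03:06 UTC.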

Cassandra CQL exception with python 2.7.12

Cassandra wrapper execute raises exception with python 2.7.12 (cassandra-driver 2.7.2)

('Unable to connect to any servers', {'cassandra-node': TypeError('ref() does not take keyword arguments',)})

ping() does not work in Docker image

The check command ping() gives me "[Errno 2] No such file or directory",
both in our system and on demo.zmon.io.

Currently I only have a ping check with the following content:
ping()

Add support for custom config variables

Could be useful in supplying special variables that are accessible to all check commands. One use case is authorization tokens that can be used in http wrapper to initiate authorized requests.

Suggestion:

Store in a dict in config

New command, e.g. secrets() or vars()

Example usage

vars('my_service_token')
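
One way the suggestion could look, as a sketch: a callable built from a dict in the worker config and exposed to check commands. The `ConfigVars` class and the 'custom.vars' config key are hypothetical; note also that exposing it under the name vars would shadow the Python builtin of the same name inside checks.

```python
_MISSING = object()


class ConfigVars:
    """Read-only access to custom variables stored in the worker config."""

    def __init__(self, config):
        # 'custom.vars' is a hypothetical config key holding a dict.
        self._vars = dict(config.get('custom.vars', {}))

    def __call__(self, name, default=_MISSING):
        if name in self._vars:
            return self._vars[name]
        if default is _MISSING:
            raise KeyError('Unknown config variable: %s' % name)
        return default
```

An instance created as `ConfigVars(config)` could then be registered as the new command, so that a check calls it with the variable name as in the usage example above.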

Fix EventLog

We still use the file-based eventlog Python module which does not properly work in a Docker-context (files are written within the Docker container's filesystem..)

Resilience to broken downtime if we want to.

Ideally this should be handled gracefully, still triggering the alert. Right now the notification does not get executed or reported at all:

ERROR [worker-35] zmon_worker_monitor.zmon_worker.tasks.main/notify: Notification for check Webapp HTTP Status reached soft time limit
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/zmon_worker-cd156-py2.7.egg/zmon_worker_monitor/zmon_worker/tasks/main.py", line 1445, in notify
    downtimes = self._evaluate_downtimes(alert_id, entity_id)
  File "/usr/local/lib/python2.7/dist-packages/zmon_worker-cd156-py2.7.egg/zmon_worker_monitor/zmon_worker/tasks/main.py", line 1635, in _evaluate_downtimes
    if now > d['start_time'] and now < d['end_time']:
KeyError: 'start_time'
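
A defensive variant of the comparison in the traceback might look like the sketch below. It skips and logs downtime entries that lack start_time or end_time instead of raising, so the notification can still go out. The `active_downtimes` function is a hypothetical stand-in, not the actual _evaluate_downtimes code.

```python
import logging

logger = logging.getLogger(__name__)


def active_downtimes(downtimes, now):
    """Return the downtimes active at `now`, skipping malformed entries."""
    active = []
    for d in downtimes:
        try:
            start, end = d['start_time'], d['end_time']
        except (KeyError, TypeError):
            # A broken downtime must not prevent the alert notification.
            logger.warning('Ignoring malformed downtime: %r', d)
            continue
        if start < now < end:
            active.append(d)
    return active
```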

Consider limiting the size of a single check result (total bytes and number of keys)

The size of check results is currently unbounded. This leads to problems where users generate (mostly accidentally) megabytes (!) of result data for a single check. As the data is stored in JSON format in Redis (and additionally in KairosDB), we might run into memory issues (e.g. Redis memory fragmentation and total database size).

A simple and effective approach would be to introduce a reasonable (configurable) maximum for check results, let's say 64KiB.
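
Such a limit could be enforced before the result is written to Redis, along the lines of this sketch. The 64 KiB default is the figure proposed above; the key limit of 1000, the function names and the returned error strings are assumptions.

```python
import json

MAX_RESULT_BYTES = 64 * 1024  # proposed default
MAX_RESULT_KEYS = 1000        # assumption, not from the proposal


def count_keys(value):
    """Count dict keys recursively, descending into dicts and lists."""
    if isinstance(value, dict):
        return len(value) + sum(count_keys(v) for v in value.values())
    if isinstance(value, list):
        return sum(count_keys(v) for v in value)
    return 0


def check_result_size(value, max_bytes=MAX_RESULT_BYTES, max_keys=MAX_RESULT_KEYS):
    """Return None if the result is within limits, else an error message."""
    size = len(json.dumps(value).encode('utf-8'))
    if size > max_bytes:
        return 'Check result too large: %d bytes (max %d)' % (size, max_bytes)
    keys = count_keys(value)
    if keys > max_keys:
        return 'Check result has too many keys: %d (max %d)' % (keys, max_keys)
    return None
```

A worker could store the error message as the check result instead of the oversized value, which keeps the failure visible to the check's owner.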

Add DNS wrapper

Right now resolve is part of the TCP wrapper, although it is not really related to TCP.

Adding DNS wrapper would make more sense here.
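
A separate wrapper could start out as small as the sketch below, which only moves name resolution into its own class. The `DnsWrapper` name and its interface are hypothetical, and it relies on the system resolver via the standard library.

```python
import socket


class DnsWrapper:
    """Minimal DNS helper, independent of the TCP wrapper."""

    def __init__(self, host=None):
        self.host = host

    def resolve(self, host=None):
        """Return the IPv4 address for the host; raises socket.gaierror on failure."""
        return socket.gethostbyname(host or self.host)
```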

Get rid of CherryPy configuration file (web.conf)

We are still using the CherryPy configuration file format inside ZMON Worker, but actually app.py writes environment variable values to it.

Get rid of this legacy dependency and use environment variables (+ args) directly.

Improve logging

Logging is not very helpful right now:

2015-12-22 05:03:06,306 - INFO - zmon_worker_monitor.zmon_worker.tasks.notacelery_task - send_metrics - Send metrics, end storing metrics in redis count: 0, duration: 0.002s
 Dec 22 05:03:06 ip-172-31-163-67 docker/b6eb55fa92b8[840]: 2015-12-22 05:03:06,382 - INFO - zmon_worker_monitor.zmon_worker.tasks.notacelery_task - send_metrics - Send metrics, end storing metrics in redis count: 0, duration: 0.002s
 Dec 22 05:03:07 ip-172-31-136-190 docker/d65d8e292bd8[839]: 2015-12-22 05:03:07,125 - INFO - zmon_worker_monitor.redis_context_manager - __exit__ - IdleLoop: No task received... pid=20, count: 146
 Dec 22 05:03:08 ip-172-31-136-190 docker/d65d8e292bd8[839]: 2015-12-22 05:03:08,031 - INFO - zmon_worker_monitor.redis_context_manager - __exit__ - IdleLoop: No task received... pid=14, count: 13276
 Dec 22 05:03:08 ip-172-31-136-190 docker/d65d8e292bd8[839]: 2015-12-22 05:03:08,032 - INFO - zmon_worker_monitor.redis_context_manager - __exit__ - IdleLoop: No task received... pid=17, count: 12994
 Dec 22 05:03:08 ip-172-31-136-190 docker/d65d8e292bd8[839]: 2015-12-22 05:03:08,031 - INFO - zmon_worker_monitor.redis_context_manager - __exit__ - IdleLoop: No task received... pid=16, count: 13041
  • Remove date/time prefix (already provided by syslog)
  • Reduce number of non-relevant log lines (e.g. idle loop)
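
Both points can be expressed with the standard logging module, as in this sketch: a format string without the timestamp, and a filter that drops the idle-loop lines. The `DropIdleLoop` and `configure_logging` names are hypothetical, and matching on the "IdleLoop" text is an assumption based on the sample output above.

```python
import logging


class DropIdleLoop(logging.Filter):
    """Suppress the frequent 'IdleLoop: No task received...' records."""

    def filter(self, record):
        return 'IdleLoop' not in record.getMessage()


def configure_logging(stream=None, level=logging.INFO):
    """Attach a handler to the root logger; syslog already adds date and time."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        '%(levelname)s - %(name)s - %(funcName)s - %(message)s'))
    handler.addFilter(DropIdleLoop())
    root = logging.getLogger()
    root.setLevel(level)
    root.addHandler(handler)
    return handler
```

Lowering the idle-loop message to DEBUG at its source would achieve the same effect without a filter.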

HTTP wrapper response object

In certain cases, returning the response object could be needed. One use case is a REST API with pagination headers (Link). In this case, both the response JSON and the headers are required to complete the check. A HEAD request cannot be relied on to return either the Link headers or a payload with pagination links.

Suggestion is to either return requests.Response object or return a simplified ZmonHTTPResponse with fixed properties (headers, json(), status_code, text, ok).
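
The simplified variant could be as small as the following sketch, which carries exactly the properties listed above. Only the ZmonHTTPResponse name and the property list come from the suggestion; the constructor and the plain-dict headers (case-sensitive, unlike requests) are assumptions.

```python
import json


class ZmonHTTPResponse:
    """Simplified, read-only view of an HTTP response."""

    def __init__(self, status_code, headers, text):
        self.status_code = status_code
        self.headers = dict(headers)  # plain dict: lookups are case-sensitive
        self.text = text

    @property
    def ok(self):
        # Same convention as requests: anything below 400 counts as ok.
        return self.status_code < 400

    def json(self):
        return json.loads(self.text)
```

A check for a paginated API could then read both `response.json()` and `response.headers['Link']` from a single request.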

cloudwatch scraping may fail with wildcard dimensions

The check

cloudwatch().query({'AvailabilityZone': 'NOT_SET', 'AutoScalingGroupName': 'tailor-*' }, 'NetworkIn', 'Average')

may fail if the number of metrics returned exceeds the boto3 CloudWatch metrics page size (currently 500).

ZMON does the filtering itself and only uses the first page.
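
A fix would need to filter across all pages rather than only the first. The sketch below does that with the boto3 list_metrics paginator and a passed-in client. The function name is hypothetical, and treating 'NOT_SET' as "this dimension must be absent" is an assumption about the wrapper's semantics.

```python
from fnmatch import fnmatch


def _dimension_matches(dims, name, pattern):
    if pattern == 'NOT_SET':
        # Assumption: NOT_SET means the dimension must be absent.
        return name not in dims
    return name in dims and fnmatch(dims[name], pattern)


def list_matching_metrics(client, metric_name, dimensions):
    """Return all metrics matching the wildcard dimensions, across every page.

    client: a boto3 CloudWatch client, e.g. boto3.client('cloudwatch')
    """
    paginator = client.get_paginator('list_metrics')
    matches = []
    for page in paginator.paginate(MetricName=metric_name):
        for metric in page['Metrics']:
            dims = {d['Name']: d['Value'] for d in metric['Dimensions']}
            if all(_dimension_matches(dims, k, p) for k, p in dimensions.items()):
                matches.append(metric)
    return matches
```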
