nymms's People

Contributors

berndtbrkt, phobologic, synfinatic

nymms's Issues

Acknowledge/redirect alerts for issue

We need a way to acknowledge a problem, especially 'acknowledge with a timeout'. Also, it might be useful to be able to say: "I'm working on this, send all alerts for this monitor/host to me for X amount of time." That'll be harder to do because of the way the reactors work, but it's worth looking into.

reactor handlers need more docs

The docs for reactor handlers are a little sparse. It would be good to have:

  • list of builtin reactors
  • sample config
  • how to subclass and filter

Command context not passed along with task_context

This means that we don't get to see things like the 'Command.type', which might be useful. For example in the case of passive monitors, we might want to set Command.type to 'passive', and then have some reactors only deal with passive monitors (because maybe the format of the output is different) and some with non-passive.

config needs deep defaults

Currently the defaults only work for top-level config objects. Once you go deeper (say ['probe']['task_expiration']) the lookup fails, because the nested value is a plain dictionary with no defaults applied.

I need to revisit how I handle default config.
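One possible approach is a recursive merge of the user config over a defaults dictionary. This is only a minimal sketch: `deep_merge`, the `DEFAULTS` values, and the `queue_wait_time` key are hypothetical and not taken from the NYMMS code.

```python
import copy


def deep_merge(defaults, overrides):
    """Return a new dict with overrides layered over defaults, recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them instead of replacing wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults and user config, for illustration only.
DEFAULTS = {'probe': {'task_expiration': 30, 'queue_wait_time': 20}}
user_config = {'probe': {'queue_wait_time': 5}}
config = deep_merge(DEFAULTS, user_config)
```

With this, `config['probe']['task_expiration']` falls back to the default even though the user only set one key under `probe`.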

passive monitoring daemon

It'd be good to have a way to run passive checks (i.e. monitors that are not probe based, which are useful for monitoring stats or events in some systems) and submit them into the results queue. The submitting part should be easy.

Need a way to clear old states

Currently once something is recorded in the state database it's there forever. An example of when this is bad:

  • You launch an instance and configure it to be monitored in NYMMS
  • Eventually you replace that instance with a new instance and update your NYMMS config. Now you're monitoring the new instance, but not the old.

In that case the old state will remain in the state backend, even though it won't be used for anything. It might be worth having the scheduler or something similar clean up old states - the only issue with that is that we'd have to find some way to deal with passive monitors since the scheduler has no concept of passive monitors (the passive monitors just send in results themselves). That may be a bug in and of itself.
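An alternative that sidesteps the passive monitor problem is to expire states by age instead of by comparing them with the scheduler's config. The sketch below assumes each state carries the `id` and `last_update` fields seen in saved states; `find_stale_states` and `max_age` are hypothetical names, and actually deleting from the state backend is left out.

```python
def find_stale_states(states, now, max_age):
    """Return the ids of states not updated within max_age seconds."""
    return [s['id'] for s in states if now - s['last_update'] > max_age]


# Illustrative data only.
states = [
    {'id': 'old-instance:check_disk', 'last_update': 1000},
    {'id': 'new-instance:check_disk', 'last_update': 9000},
]
stale = find_stale_states(states, now=10000, max_age=3600)
```

Because it relies only on `last_update`, the same rule would cover passive monitors, provided `max_age` is longer than the slowest reporting interval.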

Update PPA

The current PPA is over a year old. Could you update it to the latest version of NYMMS?

NYMMS crashes when security tokens are invalid

This is really just a transient error. We can just retry.

nymms_reactor stopped. Last 20 lines of /var/log/upstart/nymms_reactor.log:
[2015/02/07 06:02:57 UTC] pid:2199 DEBUG nymms.state.sdb_state sdb_state(save_state):55 - passive:agent_heartbeat.GCEStorageUpgrade - saving state: {'state_type': 1, 'state': 0, 'last_state_change': 1416254146, 'id': u'passive:agent_heartbeat.GCEStorageUpgrade', 'last_update': 1423288977}
Traceback (most recent call last):
  File "/usr/bin/nymms_reactor", line 30, in <module>
    visibility_timeout=visibility_timeout)
  File "/usr/lib/python2.7/dist-packages/nymms/daemon.py", line 14, in main
    self.run(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nymms/reactor/Reactor.py", line 126, in run
    result.delete()
  File "/usr/lib/python2.7/dist-packages/nymms/data_types.py", line 49, in delete
    return self._origin.delete()
  File "/usr/lib/python2.7/dist-packages/boto/sqs/message.py", line 145, in delete
    return self.queue.delete_message(self)
  File "/usr/lib/python2.7/dist-packages/boto/sqs/queue.py", line 314, in delete_message
    return self.connection.delete_message(self, message)
  File "/usr/lib/python2.7/dist-packages/boto/sqs/connection.py", line 224, in delete_message
    return self.get_status('DeleteMessage', params, queue.id)
  File "/usr/lib/python2.7/dist-packages/boto/connection.py", line 1223, in get_status
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.SQSError: SQSError: 403 Forbidden
<?xml version="1.0"?><ErrorResponse xmlns="http://queue.amazonaws.com/doc/2012-11-05/"><Error><Type>Sender</Type><Code>InvalidClientTokenId</Code><Message>The security token included in the request is invalid.</Message><Detail/></Error><RequestId>7baf1b04-a820-5ba1-97a0-564b04f1352e</RequestId></ErrorResponse>

####################################################
Created by /etc/init/process_monitor.conf
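The "just retry" idea could look roughly like the helper below. It is a generic sketch, not the NYMMS fix: `retry` and its parameters are hypothetical, and which exceptions count as transient is an assumption the caller has to make.

```python
import time


def retry(func, attempts=3, delay=1.0, backoff=2.0,
          exceptions=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff
```

In the reactor this might wrap the failing call, for example `retry(result.delete, exceptions=(SQSError,))` with `SQSError` imported from `boto.exception`, so that a temporarily invalid token does not kill the daemon.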

AWS config tester

Ensuring all the SNS/SQS/SDB/IAM setup is correct is a manual task and prone to error. It might be nice to have a config checker which will bounce packets through the config and automatically detect errors.

scheduler high-availability

Currently the scheduler is the big single point of failure. I can't imagine it will need multiple hosts for performance reasons, but in the case that we lose the scheduler on a host we should have another sitting by in at least a warm standby mode to take over.

The easy way out is to use Zookeeper for this, but that means adding a dependency on Zookeeper.

Another option is to try and implement some sort of Bully Leader Election algorithm in something like SDB.

Need a way to handle monitor credentials more securely

Right now you either need to have a way for the probes to figure out credentials when necessary (external config files?) or you can send them along in the task context. I'd like to have a more standardized way to handle this that doesn't involve sending credentials along with the task context.

alert re-alerting

Right now the standard way things work with filters is that if no state change has occurred, no alert is sent. This is fine in most cases, but when something has been broken for a while, it'd be good if NYMMS would send another alert at a configurable interval.
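A re-alert rule along these lines could be expressed as a small predicate. This is a sketch under assumptions: `should_alert` and `realert_interval` are hypothetical names, and state 0 is assumed to mean OK, as in Nagios-style exit codes.

```python
OK = 0  # assumption: 0 means OK, as in Nagios-style exit codes


def should_alert(state, state_changed, last_alert, now, realert_interval=None):
    """Alert on a state change, or again once a failure outlasts the interval."""
    if state_changed:
        return True
    if state == OK or realert_interval is None:
        # Nothing changed and either all is well or re-alerting is disabled.
        return False
    return last_alert is None or now - last_alert >= realert_interval
```

Leaving `realert_interval` unset keeps the current behavior of alerting only on state changes.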

common 'daemon' class refactor

The Scheduler, Probe and Reactor classes have at least a few common concepts. It might be worth refactoring some of that into a common daemon class.

Result timestamp should be set on creation

Right now if you don't provide a timestamp it provides it when you run validate(). It makes more sense for it to do this at object creation time. This likely applies to other datatypes as well.
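As a rough illustration of the proposed behavior, the class below stamps the time in its constructor and leaves `validate()` as a pure check. It is a hypothetical stand-in, not the real NYMMS `Result` class, and its fields are simplified.

```python
import time


class Result(object):
    """Hypothetical stand-in for a NYMMS result; not the real class."""

    def __init__(self, id, state, timestamp=None):
        self.id = id
        self.state = state
        # Stamp at creation so the time reflects when the result was produced,
        # not when validate() happened to be called.
        self.timestamp = time.time() if timestamp is None else timestamp

    def validate(self):
        # Only checks; it no longer fills in missing values.
        if not isinstance(self.timestamp, (int, float)):
            raise ValueError("timestamp must be a number")
        return True
```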

nymms should gracefully fail without permissions

I wanted to check out the ami! It looks pretty cool - only I hadn't set up an IAM profile yet and I noticed in syslog this:


Oct 18 23:07:19 ip-10-20-242-215 kernel: [9675236.422718] init: nymms-scheduler main process ended, respawning
Oct 18 23:07:19 ip-10-20-242-215 kernel: [9675236.890671] init: nymms-reactor main process (12129) terminated with status 1
Oct 18 23:07:19 ip-10-20-242-215 kernel: [9675236.890734] init: nymms-reactor main process ended, respawning
Oct 18 23:07:20 ip-10-20-242-215 kernel: [9675237.036472] init: nymms-probe main process (12130) terminated with status 1

It seems like it'd be an easy fix to have nymms fail a bit more gracefully if it doesn't have permissions to do what it needs to do.

configuration propagation

The base groundwork is there for allowing configuration propagation, but there's still a lot of work to be done. It'd be great if the scheduler pushed its config and resources into an S3 bucket under a key name that matches their version hash. The scheduler should then include, in each task, the version of the config the task was generated from. When a probe receives a task from a version it isn't running, it could update itself from S3.

Same with the reactors - when they get a result from a config version they aren't running they should grab the new config from S3 and restart.
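The version check could be sketched as below. How NYMMS actually computes its version hash is not shown here, so `config_version` (SHA-1 over sorted JSON) and `needs_update` are hypothetical; the S3 upload and download steps are omitted.

```python
import hashlib
import json


def config_version(config):
    """Stable hash of a config dict; key order does not affect the result."""
    serialized = json.dumps(config, sort_keys=True).encode('utf-8')
    return hashlib.sha1(serialized).hexdigest()


def needs_update(running_version, task_version):
    """True if a task or result came from a config version we are not running."""
    return running_version != task_version
```

A probe or reactor would compare the version attached to each incoming task or result with its own and fetch the matching key only when `needs_update` returns true.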

Realms support

Need to add in the concept of 'realms' for monitors, allowing some probe instances to handle monitors created in their realms by the scheduler.

This would be useful to have probes behind firewalls.

suppressions should be able to work on more than just the id

Right now you can only specify a regex for the ID of an alert for suppressions. This works in some cases, but in a lot of other cases it'd be really great to be able to suppress based on other fields, for example monitoring_group (e.g. disable all alerts on all monitors on all zookeeper nodes for an hour while we do the upgrade).
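One way to generalize this is to let a suppression hold a regex per field and require all of them to match. The sketch treats alerts and suppressions as plain dicts, which is an assumption; `is_suppressed` is a hypothetical name, and the expiry time of a suppression is not modeled.

```python
import re


def is_suppressed(alert, suppression):
    """True if every field pattern in the suppression matches the alert."""
    if not suppression:
        # An empty suppression should not silence everything.
        return False
    for field, pattern in suppression.items():
        value = alert.get(field)
        if value is None or not re.search(pattern, str(value)):
            return False
    return True


# Illustrative alert and suppression.
alert = {'id': 'zk1:check_disk', 'monitoring_group': 'zookeeper'}
upgrade_window = {'monitoring_group': '^zookeeper$'}
suppressed = is_suppressed(alert, upgrade_window)
```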

Consider new ways to mix filters & handlers for reactor

Right now we expect the handler class to have a filter() method that returns True or False. It would be nice if filters & handlers were decoupled so that you could have a base library of filters that can be used with any handler.

Could use simple methods for filters - seems like the best candidate.

Should I allow multiple filters for a given handler? If so it would be an AND for all filters. Not going to even think about syntax to include an OR concept.
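A decoupled design with AND semantics might look like the sketch below, where filters are plain functions and a handler takes a list of them. All names are hypothetical; the `passive:` id prefix and state 0 meaning OK are assumptions made for the example.

```python
def not_ok(result):
    """Filter: pass only results that are not OK (assumes 0 means OK)."""
    return result['state'] != 0


def is_passive(result):
    """Filter: pass only passive results (assumes a 'passive:' id prefix)."""
    return result['id'].startswith('passive:')


class Handler(object):
    """Base handler: processes a result only if every filter accepts it."""

    def __init__(self, filters=()):
        self.filters = list(filters)

    def accepts(self, result):
        # all() gives AND semantics; with no filters everything is accepted.
        return all(f(result) for f in self.filters)

    def handle(self, result):
        if self.accepts(result):
            self.process(result)

    def process(self, result):
        raise NotImplementedError


class CollectingHandler(Handler):
    """Example handler that just records the ids it processed."""

    def __init__(self, filters=()):
        super(CollectingHandler, self).__init__(filters)
        self.seen = []

    def process(self, result):
        self.seen.append(result['id'])
```

Since filters are ordinary callables, an OR could later be built as another function that wraps two filters, without any new config syntax.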

nymms/config/yaml_config tests file access poorly

This bit of code:

                        if os.path.isfile(f):
                            logger.debug("Parsing include (%s:%d): %s",
                                         filename, lineno, f)
                            c.extend(recursive_preprocess(f))
                        else:
                            logger.warning("%s is not a regular file, "
                                           "skipping (%s:%d).", f, filename,
                                           lineno)

This logs the "not a regular file" warning even when the real problem is something else, for example when the file cannot be accessed because we lack permissions on the directory it is in. We need a better set of checks and error messages. try/open/except maybe?
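The try/open/except idea could be sketched as follows. `read_include` is a hypothetical helper, not part of nymms/config/yaml_config; it reports the actual reason for the failure instead of guessing from `os.path.isfile`.

```python
import errno
import logging

logger = logging.getLogger(__name__)


def read_include(path, filename, lineno):
    """Return the include's contents, or None after logging why it was skipped."""
    try:
        with open(path) as fd:
            return fd.read()
    except (IOError, OSError) as e:
        # Map the common errno values to clearer messages.
        if e.errno == errno.ENOENT:
            reason = "file does not exist"
        elif e.errno == errno.EACCES:
            reason = "permission denied"
        elif e.errno == errno.EISDIR:
            reason = "is a directory"
        else:
            reason = e.strerror or str(e)
        logger.warning("Cannot read include %s (%s:%d): %s",
                       path, filename, lineno, reason)
        return None
```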
