cloudtools / nymms

Not Your Mother's Monitoring System

License: BSD 2-Clause "Simplified" License

Python 99.77% Shell 0.23%

nymms's Issues

Acknowledge/redirect alerts for issue

We need a way to acknowledge a problem, especially 'acknowledge with a timeout'. Also, it might be useful to be able to say: "I'm working on this, send all alerts for this monitor/host to me for X amount of time." That'll be harder to do because of the way the reactors work, but it's worth looking into.

reactor handlers need more docs

The docs for reactor handlers are a little sparse. It would be good to have:

list of builtin reactors
sample config
how to subclass and filter

Command context not passed along with task_context

This means that we don't get to see things like the 'Command.type', which might be useful. For example in the case of passive monitors, we might want to set Command.type to 'passive', and then have some reactors only deal with passive monitors (because maybe the format of the output is different) and some with non-passive.

look into SQS queue timeouts

We have some timeout code in the probes; we should also consider doing something on the SQS queues themselves.

config needs deep defaults

Currently the defaults only work for top-level config objects. Once you go deeper (say ['probe']['task_expiration']) it fails, because the config is a simple dictionary.

I need to revisit how I handle default config.
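One possible approach is a recursive merge of the user config over a defaults dictionary. This is only a minimal sketch: `merge_defaults`, the `DEFAULTS` values, and the `queue_wait_time` key are made up for illustration and are not taken from the NYMMS code.

```python
import copy


def merge_defaults(defaults, overrides):
    """Return a new dict with overrides layered on top of defaults, recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them instead of replacing the subtree.
            merged[key] = merge_defaults(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults and user config, for illustration only.
DEFAULTS = {'probe': {'task_expiration': 30, 'queue_wait_time': 20}}
user_config = {'probe': {'task_expiration': 60}}

config = merge_defaults(DEFAULTS, user_config)
```

With this, `config['probe']['queue_wait_time']` still resolves to the default even though the user only set `task_expiration`.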

Need a way to view current state of all monitors

Eventually a webpage seems like it's inevitable. For now that may be overkill. Need to think about what is really necessary.

passive monitoring daemon

It'd be good to have a way to run passive checks (i.e., non-probe-based monitors, useful for monitoring stats or events in some systems) and submit them to the results queue (submitting should be easy).

Handlers should have a configurable timeout

The reactor should only give handlers so much time (configurable) to handle a result.
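A rough sketch of one way to do this, assuming handlers are plain callables that take a result. `run_handler` is a hypothetical helper, not part of the reactor. Note the limitation: a thread cannot be killed, so a handler that overruns keeps running in the background; the reactor just stops waiting for it.

```python
import concurrent.futures
import time


def run_handler(handler, result, timeout):
    """Call handler(result); return True if it finished within timeout seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(handler, result)
    try:
        future.result(timeout=timeout)
        return True
    except concurrent.futures.TimeoutError:
        return False
    finally:
        # Do not block here waiting for a stuck handler to finish.
        executor.shutdown(wait=False)


def slow_handler(result):
    # Hypothetical handler that takes longer than we are willing to wait.
    time.sleep(0.3)


def fast_handler(result):
    return None
```

If handlers must actually be stopped when they time out, running them in a subprocess would be the safer design, at the cost of more overhead per result.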

Need a way to clear old states

Currently once something is recorded in the state database it's there forever. An example of when this is bad:

You launch an instance and configure it to be monitored in NYMMS
Eventually you replace that instance with a new instance and update your NYMMS config. Now you're monitoring the new instance, but not the old.

In that case the old state will remain in the state backend, even though it won't be used for anything. It might be worth having the scheduler or something similar clean up old states - the only issue with that is that we'd have to find some way to deal with passive monitors since the scheduler has no concept of passive monitors (the passive monitors just send in results themselves). That may be a bug in and of itself.
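An age-based cleanup might sidestep the passive monitor problem, since it only needs the `last_update` field that states already carry (as seen in saved state records) and no knowledge of which monitors exist. A minimal sketch, where `find_stale_states` and the sample data are hypothetical:

```python
import time


def find_stale_states(states, max_age, now=None):
    """Return the ids of states not updated within max_age seconds."""
    now = time.time() if now is None else now
    return [s['id'] for s in states if now - s['last_update'] > max_age]


# Illustrative data only; the field names follow the saved-state dicts.
states = [
    {'id': 'old-host:ping', 'last_update': 1000},
    {'id': 'new-host:ping', 'last_update': 9900},
    {'id': 'passive:agent_heartbeat', 'last_update': 9950},
]

stale = find_stale_states(states, max_age=3600, now=10000)
```

Whatever runs this would then delete the returned ids from the state backend.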

Update PPA

The current PPA is over a year old, could you update it to the latest version of NYMMS?

add the ability to add custom headers to SES Handler

Should be pretty easy to make this a configurable thing, and people may find it useful.

NYMMS crashes when security tokens are invalid

This is really just a transient error. We can just retry.

nymms_reactor stopped. Last 20 lines of /var/log/upstart/nymms_reactor.log:
[2015/02/07 06:02:57 UTC] pid:2199 DEBUG nymms.state.sdb_state sdb_state(save_state):55 - passive:agent_heartbeat.GCEStorageUpgrade - saving state: {'state_type': 1, 'state': 0, 'last_state_change': 1416254146, 'id': u'passive:agent_heartbeat.GCEStorageUpgrade', 'last_update': 1423288977}
Traceback (most recent call last):
  File "/usr/bin/nymms_reactor", line 30, in <module>
    visibility_timeout=visibility_timeout)
  File "/usr/lib/python2.7/dist-packages/nymms/daemon.py", line 14, in main
    self.run(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nymms/reactor/Reactor.py", line 126, in run
    result.delete()
  File "/usr/lib/python2.7/dist-packages/nymms/data_types.py", line 49, in delete
    return self._origin.delete()
  File "/usr/lib/python2.7/dist-packages/boto/sqs/message.py", line 145, in delete
    return self.queue.delete_message(self)
  File "/usr/lib/python2.7/dist-packages/boto/sqs/queue.py", line 314, in delete_message
    return self.connection.delete_message(self, message)
  File "/usr/lib/python2.7/dist-packages/boto/sqs/connection.py", line 224, in delete_message
    return self.get_status('DeleteMessage', params, queue.id)
  File "/usr/lib/python2.7/dist-packages/boto/connection.py", line 1223, in get_status
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.SQSError: SQSError: 403 Forbidden
<?xml version="1.0"?><ErrorResponse xmlns="http://queue.amazonaws.com/doc/2012-11-05/"><Error><Type>Sender</Type><Code>InvalidClientTokenId</Code><Message>The security token included in the request is invalid.</Message><Detail/></Error><RequestId>7baf1b04-a820-5ba1-97a0-564b04f1352e</RequestId></ErrorResponse>

####################################################
Created by /etc/init/process_monitor.conf
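A retry wrapper with exponential backoff would probably be enough here. The sketch below is generic: `retry` and `TransientError` are hypothetical names, and in the reactor the exception to catch would presumably be `boto.exception.SQSError`, as seen in the traceback.

```python
import time


def retry(func, exceptions, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # Out of attempts: let the caller see the error.
            sleep(delay * 2 ** (attempt - 1))


class TransientError(Exception):
    """Stand-in for a transient error such as an invalid security token."""


calls = []


def flaky_delete():
    # Fails twice, then succeeds, to simulate a transient failure.
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("security token invalid")
    return True


delays = []
outcome = retry(flaky_delete, TransientError, attempts=5, delay=0.5,
                sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper testable without actually waiting.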

AWS config tester

Ensuring all the SNS/SQS/SDB/IAM setup is correct is a manual task and prone to error. It might be nice to have a config checker which will bounce packets through the config and automatically detect errors.

Need to refactor scheduler

The code is looking crufty compared to reactor/probe.

result should have a 'output_first_line' or something similar

To use in subject lines or pages, etc.
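Something along these lines would do, assuming the result output is available as a string. The function name is taken from the issue title; the code itself is only a sketch.

```python
def output_first_line(output):
    """Return the first non-empty line of a check's output, or '' if there is none."""
    lines = output.strip().splitlines()
    return lines[0] if lines else ''


subject = output_first_line("CRITICAL: disk full\n/dev/sda1 99% used\n")
```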

scheduler high-availability

Currently the scheduler is the big single point of failure. I can't imagine it will need multiple hosts for performance reasons, but in the case that we lose the scheduler on a host we should have another sitting by in at least a warm standby mode to take over.

The easy way out is to use Zookeeper for this, but that means adding a dependency on Zookeeper.

Another option is to try and implement some sort of Bully Leader Election algorithm in something like SDB.

Need a way to handle monitor credentials more securely

Right now you either need to have a way for the probes to figure out credentials when necessary (external config files?) or you can send them along in the task context. I'd like to have a more standardized way to handle this that doesn't involve sending credentials along with the task context.

Optional api endpoints

See https://github.com/cloudtools/nymms/pull/32/files#diff-f7e9ac1edc3b8d8ce0d80969bf081a5cR48

Basically it'd be cool to have a way for API end points to be optionally enabled when certain handlers are enabled.

@berndtj

NYMMS alerts when SOFT recoveries become HARD recoveries

Per this: http://nagios.sourceforge.net/docs/3_0/statetypes.html

When we get a SOFT failure state, then an OK state (which will also be SOFT), and finally reach a HARD OK state, no notification should be sent by default. I've seen it send alerts, though. We need to decide whether this should be configurable via filters (which should be easy enough to do) or a core part of the reactor logic.
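If it goes in a filter, the rule could be: only notify on a HARD result whose state differs from the last HARD state. The sketch below assumes something tracks the previous HARD state, and the numeric constants follow the Nagios convention; both are assumptions, not a description of the current reactor.

```python
# Assumed constants (Nagios-style); not confirmed against the NYMMS source.
SOFT, HARD = 0, 1
OK, CRITICAL = 0, 2


def should_notify(previous_hard_state, state, state_type):
    """Notify only on a HARD result whose state differs from the last HARD state."""
    return state_type == HARD and state != previous_hard_state


# The sequence from this issue, starting from a HARD OK state:
sequence = [
    should_notify(OK, CRITICAL, SOFT),  # soft failure
    should_notify(OK, OK, SOFT),        # soft recovery
    should_notify(OK, OK, HARD),        # hard OK, but nothing changed since the last HARD state
]
```

None of the three steps triggers a notification, while a HARD CRITICAL after a HARD OK still would.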

alert re-alerting

Right now the standard way things work with filters is that if no state change has occurred, no alert is sent. This is fine in most cases, but in the case where something has been broken for a while, it'd be good if NYMMS would throw another alert every so often (configurable).
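The check itself could be very small, assuming the time of the last alert is stored somewhere (it is not clear that it currently is). `should_realert` and its parameters are hypothetical:

```python
OK = 0  # Assumed value for the OK state.


def should_realert(state, last_alert_time, now, interval):
    """Re-alert if the monitor is still failing and interval seconds have passed."""
    return state != OK and (now - last_alert_time) >= interval


# Still broken, last alerted 2 hours ago, re-alert interval of 1 hour.
realert = should_realert(state=2, last_alert_time=0, now=7200, interval=3600)
```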

result should have full, rendered command string available

It'll make it easier to see what commands are being executed without having to jump through hoops.

common 'daemon' class refactor

The Scheduler, Probe and Reactor classes have at least a few common concepts. It might be worth refactoring some of that into a common daemon class.

Result timestamp should be set on creation

Right now if you don't provide a timestamp it provides it when you run validate(). It makes more sense for it to do this at object creation time. This likely applies to other datatypes as well.
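Roughly what that would look like, using a stripped-down stand-in class rather than the real Result data type (the constructor signature here is an assumption):

```python
import time


class Result(object):
    """Simplified stand-in for a result object; not the NYMMS Result class."""

    def __init__(self, id, timestamp=None):
        self.id = id
        # Default the timestamp at creation time, not later in validate().
        self.timestamp = time.time() if timestamp is None else timestamp


before = time.time()
auto = Result('host:check')
explicit = Result('host:check', timestamp=1423288977)
```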

nymms should gracefully fail without permissions

I wanted to check out the ami! It looks pretty cool - only I hadn't set up an IAM profile yet and I noticed in syslog this:


Oct 18 23:07:19 ip-10-20-242-215 kernel: [9675236.422718] init: nymms-scheduler main process ended, respawning
Oct 18 23:07:19 ip-10-20-242-215 kernel: [9675236.890671] init: nymms-reactor main process (12129) terminated with status 1
Oct 18 23:07:19 ip-10-20-242-215 kernel: [9675236.890734] init: nymms-reactor main process ended, respawning
Oct 18 23:07:20 ip-10-20-242-215 kernel: [9675237.036472] init: nymms-probe main process (12130) terminated with status 1

it seems like it'd be an easy fix to have nymms fail a bit more gracefully if it doesn't have permissions to do what it needs to do.

configuration propagation

The base groundwork is there for allowing for configuration propagation, but there's still a lot of work to be done. It'd be great if the scheduler pushed its config + resources into an S3 bucket with a key name that matches the version hash for them. It should then include the version of the config that tasks are generated from - when a probe receives a task from a version it isn't running it could update itself from S3.

Same with the reactors - when they get a result from a config version they aren't running they should grab the new config from S3 and restart.

Realms support

Need to add in the concept of 'realms' for monitors, allowing some probe instances to handle monitors created in their realms by the scheduler.

This would be useful to have probes behind firewalls.

suppressions should be able to work on more than just the id

Right now you can only specify a regex for the ID of an alert for suppressions. This works in some cases, but in a lot of other cases it'd be really great to be able to suppress based on, for example, monitoring_group. (ie: disable all alerts on all monitors on all zookeeper nodes for an hour while we do the upgrade).
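One way to generalize this would be for a suppression to map field names to regexes, all of which must match. This is a sketch only: `is_suppressed` and the dict-based result are hypothetical, and how `monitoring_group` would reach the reactor is left open.

```python
import re


def is_suppressed(result, suppression):
    """Return True if every field regex in the suppression matches the result."""
    return all(
        re.search(pattern, str(result.get(field, ''))) is not None
        for field, pattern in suppression.items()
    )


# Illustrative result and suppression; field values are made up.
result = {'id': 'zk01:check_disk', 'monitoring_group': 'zookeeper'}
by_group = {'monitoring_group': r'^zookeeper$'}
by_group_and_id = {'monitoring_group': r'^zookeeper$', 'id': r':check_http$'}

suppressed = is_suppressed(result, by_group)
```

The existing id-only behavior is then just the special case of a suppression with a single `id` entry.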

come up with AMI/script to setup a simple all in one deployment of NYMMS

This should give folks a way to bring up an all-in-one NYMMS node simply, with some basic configuration done.

Consider new ways to mix filters & handlers for reactor

Right now we expect the handler class to have a filter() method that returns True or False. It would be nice if filters & handlers were decoupled so that you could have a base library of filters that can be used with any handler.

Could use simple methods for filters - seems like the best candidate.

Should I allow multiple filters for a given handler? If so it would be an AND for all filters. Not going to even think about syntax to include an OR concept.
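A sketch of the decoupled version, with filters as plain functions and a handler that ANDs them together. The class, the filter names, and the `previous_state` field are illustrative, not the current handler API.

```python
def hard_only(result):
    """Pass only HARD results (state_type 1 is an assumed value)."""
    return result.get('state_type') == 1


def state_changed(result):
    """Pass only results whose state differs from the previous one."""
    return result.get('state') != result.get('previous_state')


class Handler(object):
    """Hypothetical handler that takes any number of reusable filter functions."""

    def __init__(self, filters=()):
        self.filters = list(filters)

    def should_handle(self, result):
        # AND semantics: every filter must pass. No filters means handle everything.
        return all(f(result) for f in self.filters)


pager = Handler(filters=[hard_only, state_changed])
accepted = pager.should_handle({'state_type': 1, 'state': 2, 'previous_state': 0})
```

An OR could still be expressed without new syntax by writing one filter function that calls others.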

nymms/config/yaml_config tests file access poorly

This bit of code:

                        if os.path.isfile(f):
                            logger.debug("Parsing include (%s:%d): %s",
                                         filename, lineno, f)
                            c.extend(recursive_preprocess(f))
                        else:
                            logger.warning("%s is not a regular file, "
                                           "skipping (%s:%d).", f, filename,
                                           lineno)

will log this warning even when the real problem is something else, for example when it cannot access the file because it lacks permissions on the directory the file is in. We need a better set of checks and error messages. try/open/except maybe?
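The try/open/except idea might look something like this. `read_include` is a hypothetical helper, not a drop-in patch for the preprocessor; it shows how the errno can be used to report why a file could not be read.

```python
import errno
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def read_include(path):
    """Return the file's text, or None after logging why it could not be read."""
    try:
        with open(path) as fd:
            return fd.read()
    except (IOError, OSError) as e:
        if e.errno == errno.ENOENT:
            reason = "does not exist"
        elif e.errno == errno.EACCES:
            reason = "permission denied"
        elif e.errno == errno.EISDIR:
            reason = "is a directory"
        else:
            reason = str(e)
        logger.warning("Skipping include %s: %s", path, reason)
        return None


# Usage: read a real file, then try again after it has been removed.
with tempfile.NamedTemporaryFile('w', suffix='.yaml', delete=False) as tmp:
    tmp.write('monitors: []\n')
content = read_include(tmp.name)
os.remove(tmp.name)
missing = read_include(tmp.name)
```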

Add autodoc for classes

http://sphinx-doc.org/ext/autodoc.html