cloudtools / nymms
Not Your Mother's Monitoring System
License: BSD 2-Clause "Simplified" License
We need a way to acknowledge a problem, especially 'acknowledge with a timeout'. Also, it might be useful to be able to say: "I'm working on this, send all alerts for this monitor/host to me for X amount of time." That'll be harder to do because of the way the reactors work, but it's worth looking into.
The docs for reactor handlers are a little sparse. It would be good to have:
This means that we don't get to see things like the 'Command.type', which might be useful. For example in the case of passive monitors, we might want to set Command.type to 'passive', and then have some reactors only deal with passive monitors (because maybe the format of the output is different) and some with non-passive.
We have some timeout code in the probes, we should consider doing something on the SQS queues themselves.
Currently the defaults only work for top-level config objects. Once you go deeper (say ['probe']['task_expiration']) it fails, because it's a simple dictionary.
I need to revisit how I handle default config.
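One possible direction for nested defaults is a recursive merge, so that a user config only has to name the keys it overrides. This is a minimal sketch, not the NYMMS implementation: merge_defaults, the DEFAULTS values, and the 'queue_wait_time' key are hypothetical and used only for illustration.

```python
import copy


def merge_defaults(defaults, overrides):
    """Return a new dict with `overrides` layered on top of `defaults`.

    Nested dictionaries are merged key by key instead of being replaced.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_defaults(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical default values, for illustration only.
DEFAULTS = {"probe": {"task_expiration": 30, "queue_wait_time": 20}}
user_config = {"probe": {"task_expiration": 60}}

config = merge_defaults(DEFAULTS, user_config)
```

With this approach config['probe']['task_expiration'] comes from the user config, while the untouched nested key still falls back to its default.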
Eventually a webpage seems like it's inevitable. For now that may be overkill. Need to think about what is really necessary.
It'd be good to have a way to run passive checks (ie: non-probe based monitors - useful for monitoring stats or events in some systems) and submit them into the results queue (submitting should be easy).
The reactor should only give handlers so much time (configurable) to handle a result.
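A rough sketch of one way to bound handler time, assuming Python 3 and a handler that is a plain callable taking a result. run_with_timeout is a hypothetical helper, not part of NYMMS. Note the limitation: a thread cannot be force-killed, so a stuck handler keeps running in the background and the caller simply stops waiting for it.

```python
import concurrent.futures


def run_with_timeout(handler, result, timeout):
    """Call handler(result), waiting at most `timeout` seconds.

    Returns (True, return_value) if the handler finished in time and
    (False, None) if the wait timed out. Exceptions raised by the handler
    propagate to the caller.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(handler, result)
    try:
        return True, future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not block here waiting for a handler that may never return.
        executor.shutdown(wait=False)
```

If handlers need to be stopped outright rather than abandoned, running them in a subprocess would be the stricter option, at the cost of having to serialize the result.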
Currently once something is recorded in the state database it's there forever. An example of when this is bad:
In that case the old state will remain in the state backend, even though it won't be used for anything. It might be worth having the scheduler or something similar clean up old states - the only issue with that is that we'd have to find some way to deal with passive monitors since the scheduler has no concept of passive monitors (the passive monitors just send in results themselves). That may be a bug in and of itself.
The current PPA is over a year old, could you update it to the latest version of NYMMS?
Should be pretty easy to make this a configurable thing, and people may find it useful.
This is really just a transient error. We can just retry.
nymms_reactor stopped. Last 20 lines of /var/log/upstart/nymms_reactor.log:
[2015/02/07 06:02:57 UTC] pid:2199 DEBUG nymms.state.sdb_state sdb_state(save_state):55 - passive:agent_heartbeat.GCEStorageUpgrade - saving state: {'state_type': 1, 'state': 0, 'last_state_change': 1416254146, 'id': u'passive:agent_heartbeat.GCEStorageUpgrade', 'last_update': 1423288977}
Traceback (most recent call last):
  File "/usr/bin/nymms_reactor", line 30, in <module>
    visibility_timeout=visibility_timeout)
  File "/usr/lib/python2.7/dist-packages/nymms/daemon.py", line 14, in main
    self.run(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nymms/reactor/Reactor.py", line 126, in run
    result.delete()
  File "/usr/lib/python2.7/dist-packages/nymms/data_types.py", line 49, in delete
    return self._origin.delete()
  File "/usr/lib/python2.7/dist-packages/boto/sqs/message.py", line 145, in delete
    return self.queue.delete_message(self)
  File "/usr/lib/python2.7/dist-packages/boto/sqs/queue.py", line 314, in delete_message
    return self.connection.delete_message(self, message)
  File "/usr/lib/python2.7/dist-packages/boto/sqs/connection.py", line 224, in delete_message
    return self.get_status('DeleteMessage', params, queue.id)
  File "/usr/lib/python2.7/dist-packages/boto/connection.py", line 1223, in get_status
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.SQSError: SQSError: 403 Forbidden
<?xml version="1.0"?><ErrorResponse xmlns="http://queue.amazonaws.com/doc/2012-11-05/"><Error><Type>Sender</Type><Code>InvalidClientTokenId</Code><Message>The security token included in the request is invalid.</Message><Detail/></Error><RequestId>7baf1b04-a820-5ba1-97a0-564b04f1352e</RequestId></ErrorResponse>
####################################################
Created by /etc/init/process_monitor.conf
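Assuming failures like the SQS delete error in the traceback above are treated as transient, a small retry wrapper with exponential backoff could keep the reactor from dying on the first failure. This is only a sketch: retry and its parameters are hypothetical names, and the choice of which exceptions count as retryable is left to the caller.

```python
import time


def retry(operation, attempts=3, delay=0.5, backoff=2.0,
          retry_on=(Exception,), sleep=time.sleep):
    """Call operation() until it succeeds or `attempts` calls have failed.

    Waits `delay` seconds after the first failure and multiplies the wait
    by `backoff` after each further failure. The last failure is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff
```

In the reactor this might be used as retry(result.delete, retry_on=(SQSError,)), with SQSError being the boto.exception.SQSError class that appears in the traceback.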
Ensuring all the SNS/SQS/SDB/IAM setup is correct is a manual task and prone to error. It might be nice to have a config checker which will bounce packets through the config and automatically detect errors.
The code is looking crufty compared to reactor/probe.
To use in subject lines or pages, etc.
Currently the scheduler is the big single point of failure. I can't imagine it will need multiple hosts for performance reasons, but in the case that we lose the scheduler on a host we should have another sitting by in at least a warm standby mode to take over.
The easy way out is to use Zookeeper for this, but that means adding a dependency on Zookeeper.
Another option is to try and implement some sort of Bully Leader Election algorithm in something like SDB.
Right now you either need to have a way for the probes to figure out credentials when necessary (external config files?) or you can send them along in the task context. I'd like to have a more standardized way to handle this that doesn't involve sending credentials along with the task context.
See https://github.com/cloudtools/nymms/pull/32/files#diff-f7e9ac1edc3b8d8ce0d80969bf081a5cR48
Basically it'd be cool to have a way for API end points to be optionally enabled when certain handlers are enabled.
Per this: http://nagios.sourceforge.net/docs/3_0/statetypes.html
When we get a soft failure state, then an OK state (which will be soft), and then finally go to a HARD/OK state, no notification should be sent by default. I've seen it send alerts though. Need to decide if this should be configurable via filters (should be easy enough to do) or if it should be a core part of the reactor logic.
Right now the standard way things work with filters is that if no state change has occurred, no alert is sent. This is fine in most cases, but in the case where something has been broken for a while, it'd be good if NYMMS would throw another alert every so often (configurable).
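The re-alert rule could be expressed as a small decision function. This is a sketch under assumptions: should_alert and renotify_interval are hypothetical names, timestamps are plain epoch seconds, and state 0 is assumed to mean OK (following the Nagios convention).

```python
OK = 0  # assumption: 0 means OK, following the Nagios convention


def should_alert(state, state_changed, last_alert, now, renotify_interval=None):
    """Decide whether a result should trigger an alert.

    A state change always alerts. Without a state change, a non-OK state
    alerts again once `renotify_interval` seconds have passed since
    `last_alert`. If `renotify_interval` is None, repeats are disabled.
    """
    if state_changed:
        return True
    if state == OK or renotify_interval is None:
        return False
    if last_alert is None:
        return True
    return now - last_alert >= renotify_interval
```

Leaving renotify_interval unset keeps today's behavior, so the repeat alert stays opt-in.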
It'll make it easier to see what commands are being executed without having to jump through hoops.
The Scheduler, Probe and Reactor classes have at least a few common concepts. It might be worth refactoring some of that into a common daemon class.
Right now if you don't provide a timestamp it provides it when you run validate(). It makes more sense for it to do this at object creation time. This likely applies to other datatypes as well.
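A minimal sketch of setting the timestamp at creation time. The Result class here is a hypothetical stand-in, not the actual NYMMS data type; the point is only that validate() checks the value and never assigns it.

```python
import time


class Result(object):
    """Hypothetical stand-in for a NYMMS data type."""

    def __init__(self, id, timestamp=None):
        self.id = id
        # Default the timestamp when the object is built, not in validate().
        self.timestamp = int(time.time()) if timestamp is None else timestamp

    def validate(self):
        """Check the fields without modifying them."""
        if not isinstance(self.timestamp, int):
            raise ValueError("timestamp must be an integer epoch time")
        return True
```

This way the timestamp reflects when the result was produced, even if validate() runs much later or not at all.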
I wanted to check out the ami! It looks pretty cool - only I hadn't set up an IAM profile yet and I noticed in syslog this:
Oct 18 23:07:19 ip-10-20-242-215 kernel: [9675236.422718] init: nymms-scheduler main process ended, respawning
Oct 18 23:07:19 ip-10-20-242-215 kernel: [9675236.890671] init: nymms-reactor main process (12129) terminated with status 1
Oct 18 23:07:19 ip-10-20-242-215 kernel: [9675236.890734] init: nymms-reactor main process ended, respawning
Oct 18 23:07:20 ip-10-20-242-215 kernel: [9675237.036472] init: nymms-probe main process (12130) terminated with status 1
It seems like it'd be an easy fix to have nymms fail a bit more gracefully if it doesn't have permissions to do what it needs to do.
The base groundwork is there for allowing for configuration propagation, but there's still a lot of work to be done. It'd be great if the scheduler pushed its config + resources into an S3 bucket with a key name that matches the version hash for them. It should then include the version of the config that tasks are generated from - when a probe receives a task from a version it isn't running it could update itself from S3.
Same with the reactors - when they get a result from a config version they aren't running they should grab the new config from S3 and restart.
Need to add in the concept of 'realms' for monitors, allowing some probe instances to handle monitors created in their realms by the scheduler.
This would be useful to have probes behind firewalls.
Right now you can only specify a regex for the ID of an alert for suppressions. This works in some cases, but in a lot of other cases it'd be really great to be able to suppress based on, for example, monitoring_group. (ie: disable all alerts on all monitors on all zookeeper nodes for an hour while we do the upgrade).
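One way to generalize suppressions is to match a regex per field rather than only on the ID. The sketch below assumes a result can be viewed as a dictionary of fields; is_suppressed and the field names in the example are hypothetical, and expiry of the suppression is left out.

```python
import re


def is_suppressed(result, suppression):
    """Return True if every field pattern in `suppression` matches `result`.

    `suppression` maps field names to regular expressions. An empty
    suppression matches nothing, so it cannot silence everything by accident.
    """
    if not suppression:
        return False
    for field, pattern in suppression.items():
        value = result.get(field)
        if value is None or not re.search(pattern, str(value)):
            return False
    return True


# Hypothetical example: silence every monitor on zookeeper nodes.
suppression = {"monitoring_group": r"^zookeeper$"}
```

The existing ID-only behavior would then be the special case of a suppression with a single 'id' pattern.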
This should give folks a way to bring up an all-in-one NYMMS node simply, with some basic configuration done.
Right now we expect the handler class to have a filter() method that returns True or False. It would be nice if filters & handlers were decoupled so that you could have a base library of filters that can be used with any handler.
Could use simple methods for filters - seems like the best candidate.
Should I allow multiple filters for a given handler? If so it would be an AND for all filters. Not going to even think about syntax to include an OR concept.
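A sketch of what decoupled filters might look like if they were plain functions combined with AND. Everything here is an assumption made for illustration: the (result, previous) signature, the dictionary-shaped results, the filter names, and state 0 meaning OK.

```python
def state_changed(result, previous):
    """Pass when there is no earlier state or the state differs from it."""
    return previous is None or result["state"] != previous["state"]


def not_ok(result, previous):
    """Pass for any non-OK state (assumes 0 means OK)."""
    return result["state"] != 0


def passes_filters(filters, result, previous):
    """AND semantics: every filter must pass. An empty list passes."""
    return all(f(result, previous) for f in filters)


class Handler(object):
    """Hypothetical handler that takes its filters as configuration."""

    def __init__(self, filters=()):
        self.filters = list(filters)
        self.handled = []

    def process(self, result, previous):
        if passes_filters(self.filters, result, previous):
            self.handled.append(result)
```

Because the filters are ordinary functions, the same small library can be reused across handlers, and a handler no longer needs its own filter() method.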
This bit of code:
if os.path.isfile(f):
    logger.debug("Parsing include (%s:%d): %s",
                 filename, lineno, f)
    c.extend(recursive_preprocess(f))
else:
    logger.warning("%s is not a regular file, "
                   "skipping (%s:%d).", f, filename,
                   lineno)
This logs the "not a regular file" warning even when, for example, the file cannot be accessed because of missing permissions on the directory it is in. Need a better set of checks and error messages. try/open/except maybe?
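The try/open/except idea could look roughly like the following. read_include is a hypothetical helper that only covers the "can this file be read" part, not the recursive preprocessing itself. It relies on the operating system's own error message, so a missing file, a permission problem, and a directory each produce a different warning.

```python
import logging

logger = logging.getLogger(__name__)


def read_include(path, filename, lineno):
    """Return the lines of the include at `path`, or None if it is unreadable.

    `filename` and `lineno` identify where the include was referenced and
    are only used in the warning message.
    """
    try:
        with open(path) as fd:
            return fd.readlines()
    except (IOError, OSError) as e:
        # e.strerror is the OS message, e.g. "Permission denied".
        logger.warning("Unable to read include %s (%s:%d): %s",
                       path, filename, lineno, e.strerror or e)
        return None
```

Attempting the open and reporting the actual error avoids the gap between checking a path and using it, which os.path.isfile() cannot close.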