
Chaos Engineering Toolkit & Orchestration for Developers

Home Page: https://chaostoolkit.org

License: Apache License 2.0

Topics: chaostoolkit, chaos-engineering, automation, resiliency, reliability, reliability-engineering, devops-tools, sre

chaostoolkit's Introduction


Chaos Toolkit - Chaos Engineering for All Engineers


Community · Installation · Tutorials · Reference · ChangeLog



The Chaos Toolkit, or as we love to call it “ctk”, is a simple CLI-driven tool that helps you write and run Chaos Engineering experiments. It supports any target platform you can think of, through existing extensions or ones you write yourself as the need arises.

Chaos Toolkit is versatile and works really well in settings where other Chaos Engineering tools may not fit: cloud environments, datacenters, CI/CD, etc.

Install or Upgrade

Provided you have Python 3.8+ installed, you can install it as follows:

$ pip install -U chaostoolkit

Getting Started

Once you have installed the Chaos Toolkit, you can use it through its simple command-line tool.

Running an experiment is as simple as:

$ chaos run experiment.json
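
An experiment is a declarative JSON (or YAML) file describing a steady-state hypothesis and a method. The sketch below gives a feel for the format; the probe, action, and URL are illustrative only, so refer to the reference documentation for the authoritative specification:

{
  "version": "1.0.0",
  "title": "Does my service survive a dependency restart?",
  "description": "An illustrative experiment sketch.",
  "steady-state-hypothesis": {
    "title": "Service responds",
    "probes": [
      {
        "type": "probe",
        "name": "service-must-respond",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://localhost:8080/health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "restart-a-dependency",
      "provider": {
        "type": "process",
        "path": "systemctl",
        "arguments": "restart my-dependency"
      }
    }
  ]
}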

Get involved!

Chaos Toolkit's mission is to provide an open API to chaos engineering in all its forms. As such, we encourage and welcome you to join our open community Slack team to discuss and share your experiments and needs. You can also use StackOverflow to ask any questions about using the Chaos Toolkit or about Chaos Engineering in general.

If you'd prefer not to use Slack then we welcome the raising of GitHub issues on this repo for any questions, requests, or discussions around the Chaos Toolkit.

Finally, you can always email [email protected] with any questions as well.

Contribute

Contributors to this project are welcome as this is an open-source effort that seeks discussions and continuous improvement.

From a code perspective, if you wish to contribute, you will need a Python 3.8+ environment. Please fork this project, write unit tests to cover the proposed changes, implement the changes, ensure they meet the formatting standards set out by ruff, add an entry to CHANGELOG.md, and then raise a PR to the repository for review.

The project is driven by PDM, so install it and you can run the following commands:

$ pdm install
$ pdm run test
$ pdm run format
$ pdm run lint

The Chaos Toolkit projects require all contributors to sign a Developer Certificate of Origin on each commit they would like to merge into the master branch of the repository. Please make sure you can abide by the rules of the DCO before submitting a PR.

chaostoolkit's People

Contributors

arunachalam-j, cdsre, charliemoon37, ciaranevans, claymccoy, dastergon, dhapola, dimzak, dmartin35, glan1k, joshuaroot, lawouach, nstjelja, pravarag, roeik-wix, russmiles, shoito, sudoq


chaostoolkit's Issues

Documentation about developing extensions

Hi,
I'm kind of new here, so sorry for the newbie question.
I need some more probes for Kubernetes, so I forked chaostoolkit-kubernetes and added the missing methods:
https://github.com/wix-playground/chaostoolkit-k8s-wix
I published the package to PyPI and installed it on my machine, but when I run chaos run experiment it gives me this error:

could not find Python module 'chaosk8s_wix.node.probes' in activity 'All nodes are healthy'

My question is:
Do you have any documentation on the extension API, with guidelines on how to write extensions properly?
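
For context, a Chaos Toolkit probe is typically just a plain Python function exported by an installed package, which the experiment then references through a Python provider. A minimal sketch, with illustrative names:

# chaosk8s_wix/node/probes.py -- must be importable by the chaos process
def all_nodes_healthy(configuration=None, secrets=None) -> bool:
    """Return True when every node reports a Ready condition (sketch only)."""
    # a real implementation would query the Kubernetes API here
    return True

and the corresponding activity in the experiment:

{
  "type": "probe",
  "name": "All nodes are healthy",
  "tolerance": true,
  "provider": {
    "type": "python",
    "module": "chaosk8s_wix.node.probes",
    "func": "all_nodes_healthy"
  }
}

The "could not find Python module" error above usually means the module is not importable from the environment chaos runs in; checking with $ python -c "import chaosk8s_wix.node.probes" in the same virtual environment is a good first step.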

Provide a self-running package

Currently, you need to install chaostoolkit (and its dependencies) by creating a virtual environment. This may be a little too involved for simple runs.

Find a way to create a standalone package (potentially requiring only Python 3).

Candidates to build such an artefact:

The package should likely contain all extensions to make it ready to use.

Change wording on a chaos experiment execution that finds a weakness from "failed" to "complete"

At the moment, if a weakness is discovered the command line output states:

[2018-04-27 17:36:38 INFO] Experiment ended with status: failed

This is confusing, as the experiment has not failed; in fact, it has been successful in finding a weakness. It is only when this experiment is run "as a test" that the tested conditions of the steady-state hypothesis have failed.

Suggested change:

[2018-04-27 17:36:38 INFO] Experiment ended with status: completed
[2018-04-27 17:36:38 INFO] Steady-state hypothesis discovered weaknesses

Or something similar in this case.

Fix 'tolerance' for process probes

I cannot figure out how to create a steady-state hypothesis with a process probe. The problem comes from the way the tolerance is compared to the output of the process.

Ideally, when you define a probe of type 'process' for the steady-state hypothesis and the tolerance is of type 'int', it should simply check the return code of the process against the tolerance.

Optionally, it might be useful to handle a tolerance of type str by comparing it to stdout.

I temporarily fixed the 'int' problem by applying this change:

diff --git a/chaoslib/provider/process.py b/chaoslib/provider/process.py
index a6d2b4e..50a9b44 100644
--- a/chaoslib/provider/process.py
+++ b/chaoslib/provider/process.py
@@ -49,11 +49,11 @@ def run_process_activity(activity: Activity, configuration: Configuration,
     except subprocess.TimeoutExpired:
         raise FailedActivity("process activity took too long to complete")
 
-    return (
-        proc.returncode,
-        proc.stdout.decode('utf-8'),
-        proc.stderr.decode('utf-8')
-    )
+    return {
+        "status": proc.returncode,
+        "stdout": proc.stdout.decode('utf-8'),
+        "stderr": proc.stderr.decode('utf-8')
+    }
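
For reference, this is a sketch of the kind of declaration the issue is about, assuming the int tolerance ends up being checked against the process return code (the probe name and command are illustrative):

{
  "type": "probe",
  "name": "service-reports-healthy",
  "tolerance": 0,
  "provider": {
    "type": "process",
    "path": "curl",
    "arguments": "-sf http://localhost:8080/health"
  }
}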

Ask for permission before running an experiment

When I run an experiment with chaos run experiment.json, it would be better if the tool showed me a prompt to verify whether I would like to proceed with the experiment or not. An extra flag such as --yes or --no-verification could also be implemented to skip that step.
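
Since the CLI is built on click, a minimal sketch of such a prompt could look like this (the option name and wiring are assumptions, not the actual implementation):

import click

@click.command()
@click.argument("path")
@click.option("--yes", is_flag=True, help="Skip the confirmation prompt.")
def run(path: str, yes: bool):
    # ask before doing anything potentially destructive,
    # unless the user explicitly opted out with --yes
    if not yes and not click.confirm(f"About to run '{path}'. Proceed?"):
        raise click.Abort()
    # ... load and execute the experiment as usual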

Profile chaostoolkit

The CLI has shown a slight slowdown recently and I wonder if it's just my environment or if there is a larger issue (likely the former but let's be thorough).

Support structured logging

It would be nice to support structured logging, not just raw strings.

Question is, what do we want to see in the structured payload?

Also, are we talking about the full logs, or only the ones displayed on the console?

use 'probe' outputs in 'actions'

Hi guys,

As far as I can understand the general workflow of the tool is:

  • check steady state hypothesis
  • run actions
  • check steady state hypothesis again

In this flow, probes represent the steady-state hypothesis and are not used for anything else.

For the use cases I see, it would be extremely useful to use the outputs of probes as parameters to actions, and to have some conditional logic in between.

Example use case:
Having hundreds of instances in an AWS Region/AZ, I don't want to mess with all of them, as some instances/ASGs are not built for HA. So, I want to filter ASGs that:

  • match certain tags (easy)
  • have more than X healthy instances registered to the load balancer and serving traffic (not so easy)
  • pass the resulting list of ASGs to an action, so that it operates on this subset of instances, choosing random ones to stop

I can imagine doing all the necessary checks in the action itself, but this would lead to code duplication, as some of the checks I'd like to do already exist as probes and I naturally want to reuse that code.

Regarding conditional logic (like, don't stop instances in an ASG if it has fewer than 2 healthy instances): do you think it is possible to integrate this into the experiment YAML/JSON, or is the best place for this the action itself, with conditions passed as parameters to the action?

Please let me know your thoughts on this subject.

Do we need a synchronization mechanism?

Right now, the only way to give room for a process to happen is to use a pause before/after. This is obviously fine for fairly simple scenarios, but sometimes this is not enough.

So, do we need a more evolved synchronization mechanism?

I can see the benefit but it feels like a slippery slope because that means the toolkit becomes a state machine and increases in complexity. At first sight, I'm scared of that.

But I need the input from the community to make a better judgement.

Add an inspect command to query information about an activity

Right now, the only way to know what parameters an activity takes is by looking at the extension code. It could be handy to have a command that would tell us:

$ chaos inspect chaostoolkit-kubernetes kill_pod

revealing its parameters and output.

Should that be core, though? I'm leaning towards yes.

Paving the way for greater and more powerful automation

Hey all,

Recently, various members of the community have put forth ideas that put good pressure on our current "run once from A-to-B" approach to executing an experiment.

When the Chaos Toolkit project started, we realised we didn't have all the answers and we would likely fail to find the right model from the get go. So we staged a basic, yet effective, approach that we hoped would help get the discussion started around the Chaos Engineering experiment model.

The result is the current specification as we know it. However, it has what some may see as limitations due to its simplicity. Here are a few of those that were raised:

  1. output from an activity is not propagated further down the experiment, meaning you cannot refer to a value generated during your run (here and here)
  2. sometimes, you want to run the steady state only before or only after the method and see what happens. Right now, it always runs before and after, which is not always suitable. As an example, assume your system is in a degraded state and you want to run an experiment that tries out a recovery process you've defined (in other words, use Chaos Engineering not to shake your system down but to put it back up). Your steady state would be something like "are we back on track?"; if that were executed before the method, it would fail before we even had a chance to run the method! So there seems to be value in turning off the steady-state run either before or after.
  3. applying an experiment to a pool of data cannot be done at once natively. Say you have a bunch of environments (or other resources...) you want to apply the experiment to at once: you have no choice but to write your own script to handle the matrix and call the toolkit manually for each combination
  4. a similar one to the previous: being able to run a set of activities in a loop (not the entire experiment)
  5. being able to abort the experiment run as soon as the method emits such a signal would avoid wasting time having the whole experiment go through for no reason
  6. following on from the previous one: is there value in a test mode, when we always imagined the Chaos Engineering flow to have value because it always goes to the end (black box/closer to user behavior)?
  7. do we need to improve the synchronization of activities? We are solely time-based now (wait for 5s after the activity...), but could we imagine event-based synchronization? That would mean having to listen for events.

Overall, those ideas make a lot of sense and could turn the toolkit into something more capable. But more capabilities always come at a price: greater complexity, leading to potentially increased fragility.

I created this thread not to find solutions to each of these issues (please comment on the corresponding tickets for specifics ;)) but so that we, as the Chaos Toolkit community, can have a clear discussion about where we want to take the next steps.

I think some of these could make it into a 1.0 if they feel appropriate and not rushed. Others will have to wait for a 2.0.

Please make sure to speak your mind :)

Cheers,

  • Sylvain

installer complains when it can't download a wheel archive

For some dependencies, pip complains when installing them if they don't have a wheel distribution.

Building wheels for collected packages: click-plugins
  Running setup.py bdist_wheel for click-plugins ... error
  Complete output from command /home/cristian/.venvs/chaostk/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-6fsfb6db/click-plugins/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /tmp/tmpeyfytuqvpip-wheel- --python-tag cp35:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help
  
  error: invalid command 'bdist_wheel'
  
  ----------------------------------------
  Failed building wheel for click-plugins
  Running setup.py clean for click-plugins
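
A likely workaround (an assumption based on the error, not confirmed in this thread): the bdist_wheel failure usually means the wheel package is missing from the virtual environment, so installing it first lets pip build the missing wheels:

$ pip install wheel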

Show help page when executing a plain chaos run command

Rather than having to run chaos run --help to show the available options, the plain chaos run command could also show the help page below the Error: Missing argument "path".

Example

Current output:

$ chaos run
Usage: chaos run [OPTIONS] PATH

Error: Missing argument "path".

Potential new output:

$ chaos run
Error: Missing argument "path".
Usage: chaos run [OPTIONS] PATH

  Run the experiment given at PATH.

Options:
  --journal-path TEXT  Path where to save the journal from the execution.
  --dry                Run the experiment without executing activities.
  --no-validation      Do not validate the experiment before running.
  --help               Show this message and exit.

Make `steady-state-hypothesis` block optional

When you initially begin to explore the weaknesses of a system, you typically start by simply probing and performing some attempts to "see what happens". This stage of an experiment's development shouldn't need a steady-state-hypothesis, but rather a method only, as you explore various scenarios to retrieve and then interpret the resulting data.

This is a very common approach in regular scientific work, and is explained a number of times in the excellent book "Ignorance: How it drives science", i.e. "Let's get the data, and then we can figure out the hypothesis", Chapter 2, Page 19.

Collaborator Promotion Process: Call for Thoughts

Currently we have a collection of open source projects and a wonderfully vibrant community around the Chaos Toolkit and the Incubator. As people from the community get involved, it becomes desirable that more responsibility for those projects be subsumed by the community rather than the org owners (currently myself and @Lawouach).

I'd like to propose that we collectively formulate a process whereby someone who everyone recognises as going beyond simple contribution to a project, and who is in fact becoming a trusted chaperone of the project, can be "promoted" to direct write access to master for that project.

The idea so far is to perhaps have something like the following:

  1. Propose the promotion, who and to what repo, and reasoning as an issue on this repo.
  2. Allow a sensible period for discussion for and against the proposal.
  3. Final say to be made by the existing responsible project contributors.
  4. Announcement and justification on the issue prior to it being closed and the new rights being allocated to the successfully promoted collaborator.

That's just my take so far though. This issue is a call out for ideas on what that process might be, so please add your own suggestions to this issue and then we can hopefully establish a V1 of our own community collaborator promotion process, which we'll add to the project documentation.

Add an `extensions` optional block to the experiment

Depending on where an experiment is used, stored, and indexed, there is good cause for needing to specify optional, additional information about the experiment. This issue suggests that a block called extensions, which in turn contains name-scoped extension blocks, be added to the Chaos Toolkit's experiment specification.

Over time some of those "candidate" extension properties could be moved into the full chaos toolkit experiment specification once they are sufficiently mature and generally applicable.
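
A sketch of what such a block might look like; the extension name and its fields below are purely illustrative:

{
  "title": "...",
  "extensions": [
    {
      "name": "my-experiment-catalog",
      "experiment_id": "1234",
      "tags": ["kubernetes", "staging"]
    }
  ]
}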

Add init feature

It should be possible to create an almost-working experiment from the discovered features.

boolean in 'tolerance' does not seem to work correctly

It is mentioned in the docs that 'tolerance' supports a boolean.
Quote:
A boolean tolerance:
"tolerance": true

If I put it like that in the experiment JSON, it fails:

$ chaos run experiment-asg.json
Traceback (most recent call last):
  File "/home/dima/.venvs/chaostk/bin/chaos", line 11, in <module>
    sys.exit(cli())
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/chaostoolkit/cli.py", line 124, in run
    experiment = load_experiment(click.format_filename(source), settings)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/loader.py", line 76, in load_experiment
    return parse_experiment_from_file(experiment_source)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/loader.py", line 31, in parse_experiment_from_file
    return json.load(f)
  File "/usr/lib/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 42 column 30 (char 1376)

Please advise whether this is a bug in the docs or elsewhere.

Meanwhile, I have a workaround: adding str() to the functions that return a boolean value.
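
For what it's worth, the JSONDecodeError shows the experiment file itself fails to parse at line 42, before any tolerance logic runs. JSON only accepts lowercase literals, which is easy to trip over when the file is generated from Python:

"tolerance": true    <-- valid JSON
"tolerance": True    <-- fails with "Expecting value"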

chaos init should offer to create rollback actions

I think that, similarly to #28, it would be interesting for "chaos init" to offer the possibility to generate the rollback actions linked to the method.

For example, if I used a method "scale_microservice" to scale my application to 5 replicas, it would be nice to have a rollback at the end of the experiment that reverts it to my previous number of replicas. So init would give me the opportunity to fill in the "rollbacks", be it automatically with the discovery done beforehand, or manually with input from the user.

That would probably extend the current:

[2018-01-30 16:50:27 INFO] Let's build a new experiment
Experiment's title: Experiment #1
Add an activity to your method

into something like:

[2018-01-30 16:50:27 INFO] Let's build a new experiment
Experiment's title: Experiment #1
1) Add an activity to your method
2) Add a steady-state hypothesis
3) Add an activity to your rollbacks

What do you guys think?

Consider adding a "test" mode to the toolkit, in addition to the default "experiment" mode

I'd like to consider whether there could be a "test" mode in the toolkit, where method execution failures result in different reporting, etc. It's an interesting idea. At the moment the default mode is "experiment", i.e. we want the toolkit to act as if we are focussed on discovering new information. A "test" mode would affect the runtime reporting wholesale to support a focus on validation and verification of already-known facts.

Make steady state hypothesis the before and after experiment check

The steady state hypothesis is used for two purposes:

  • To check that the system is within a specified set of tolerances before an experiment can be run
  • To be used to assess the system at the end of the experiment and to report any deviations from the steady state hypothesis as areas of potential weakness.

Add configure command

A configure command allows the toolkit installation to manage its extensions (plugins and drivers) according to specific versions, and also to manipulate settings (stored in ~/.chaostoolkit/settings.yaml).

The idea for the workflow is:

  • chaos configure -settings key:value key:value -plugins pluginName:1.0.0 pluginName:0.3.0 -drivers driverName driverName:0.4.0
  • "Would you like to create a new, local Python virtual environment for this configuration?"
  • Yes ... new, fresh virtual environment, install it all including the toolkit itself.
  • No ... continue...
  • If a collision in versions is detected in the current environment then...
  • "You have specified driverName version 0.4.0, you already have version 0.2.1 installed, do you want to keep your current version (Y), or override with the configured version (O)?"

At the moment, configure cannot specify a version of the chaos toolkit itself. However it could do this in the future, particularly in the case where a new virtual environment is being specified.

Also, the choices in the above workflow could be defaulted with:

  • -virtualenv to automatically create a local virtual environment
  • -override to automatically override any version collisions, preferring the versions specified in the configure command.
  • --no-override to prefer existing versions over those specified in the configure command, thereby ignoring any specified versions that collide with what is already present.

chaostoolkit should expose the I/O API for probes and actions

The first prototype showcased the general principles behind chaostoolkit by implementing the probes and actions as Python functions that were part of the package itself.

This won't scale well and will not help the community build its own set of probes and actions.

We should expose a clean API that describes what the chaostoolkit runtime expects, and allow anyone to implement probes and actions as they see fit.

Obviously, we will provide implementations for certain targets already, for ease of use, but likely in their own separate packages.

Implementations should not be forced to be in Python; processes or remote calls should be supported as well.

When applying a steady-state probe, it should be "steady-state" not "steady" as currently reported

When you run an experiment, the logging mentions that a "steady" probe is being applied. This should probably be renamed to a "steady-state" probe, and might just be a name change in the experimental method.

See the following output for where this occurs when running the service-down-not-visible-to-users sample experiment:

[I 170930 14:01:15 plan:134]  Applying steady probe 'microservices-all-healthy'

Pass command name to checker

As commands evolve from one version to another, it would be good to know what users try to use on a given version. This could mean better error messages.

Add “AbortActivity” exception

Add an “AbortActivity” exception to the toolkit to support interrupting and immediately exiting an experiment when the exception is raised from any activity. Add it to the docs as well, and ensure it is journaled correctly.
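
A minimal sketch of what this could look like, assuming the exception lives alongside the existing FailedActivity (the name, placement, and probe body are assumptions):

# hypothetical placement, e.g. chaoslib/exceptions.py
class AbortActivity(Exception):
    """Raised from any activity to interrupt the run and exit the
    experiment immediately."""


# illustrative probe/action raising the abort signal
def kill_pod(name: str, configuration=None, secrets=None):
    cluster_ready = False  # stand-in for a real readiness check
    if not cluster_ready:
        # bail out of the whole experiment, not just this activity
        raise AbortActivity("cluster state unknown, aborting experiment")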
