
Chaos Engineering Toolkit & Orchestration for Developers

Home Page: https://chaostoolkit.org

License: Apache License 2.0

Topics: chaostoolkit, chaos-engineering, automation, resiliency, reliability, reliability-engineering, devops-tools, sre

chaostoolkit's Introduction


Chaos Toolkit - Chaos Engineering for All Engineers


Community · Installation · Tutorials · Reference · ChangeLog



The Chaos Toolkit, or as we love to call it “ctk”, is a simple CLI-driven tool that helps you write and run Chaos Engineering experiments. It supports any target platform you can think of, through existing extensions or ones you write yourself as the need arises.

Chaos Toolkit is versatile and works really well in settings where other Chaos Engineering tools may not fit: cloud environments, datacenters, CI/CD, etc.

Install or Upgrade

Provided you have Python 3.8+ installed, you can install it as follows:

$ pip install -U chaostoolkit

Getting Started

Once you have installed the Chaos Toolkit, you can use it through its simple command-line tool.

Running an experiment is as simple as:

$ chaos run experiment.json
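
An experiment is a declarative JSON (or YAML) file describing a steady-state hypothesis and a method. The sketch below gives a feel for the format; the probe, action, and URL are illustrative only, so refer to the reference documentation for the authoritative specification:

{
  "version": "1.0.0",
  "title": "Does my service survive a dependency restart?",
  "description": "An illustrative experiment sketch.",
  "steady-state-hypothesis": {
    "title": "Service responds",
    "probes": [
      {
        "type": "probe",
        "name": "service-must-respond",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://localhost:8080/health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "restart-a-dependency",
      "provider": {
        "type": "process",
        "path": "systemctl",
        "arguments": "restart my-dependency"
      }
    }
  ]
}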

Get involved!

Chaos Toolkit's mission is to provide an open API to chaos engineering in all its forms. As such, we encourage and welcome you to join our open community Slack team to discuss and share your experiments and needs. You can also use StackOverflow to ask any questions about using the Chaos Toolkit or about Chaos Engineering in general.

If you'd prefer not to use Slack then we welcome the raising of GitHub issues on this repo for any questions, requests, or discussions around the Chaos Toolkit.

Finally, you can always email [email protected] with any questions as well.

Contribute

Contributors to this project are welcome as this is an open-source effort that seeks discussions and continuous improvement.

From a code perspective, if you wish to contribute, you will need a Python 3.8+ environment. Please fork this project, write unit tests to cover the proposed changes, implement the changes, ensure they meet the formatting standards set out by ruff, add an entry to CHANGELOG.md, and then raise a PR to the repository for review.

The project is driven by PDM, so install it and you can run the following commands:

$ pdm install
$ pdm run test
$ pdm run format
$ pdm run lint

The Chaos Toolkit projects require all contributors to sign a Developer Certificate of Origin on each commit they would like to merge into the master branch of the repository. Please make sure you can abide by the rules of the DCO before submitting a PR.

chaostoolkit's People

Contributors

arunachalam-j, cdsre, charliemoon37, ciaranevans, claymccoy, dastergon, dhapola, dimzak, dmartin35, glan1k, joshuaroot, lawouach, nstjelja, pravarag, roeik-wix, russmiles, shoito, sudoq


chaostoolkit's Issues

Documentation about developing extensions

Hi,
I'm kind of new here, so sorry for the newbie question.
I need some more probes for Kubernetes, so I forked chaostoolkit-kubernetes and added the missing methods:
https://github.com/wix-playground/chaostoolkit-k8s-wix
I published the package to PyPI and installed it on my machine, but when I run chaos run experiment it gives me this error:

could not find Python module 'chaosk8s_wix.node.probes' in activity 'All nodes are healthy'

My question is:
Do you have any documentation on the extension API, with guidelines on how to write extensions properly?
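
For context, a Chaos Toolkit probe is typically just a plain Python function exported by an installed package, which the experiment then references through a Python provider. A minimal sketch, with illustrative names:

# chaosk8s_wix/node/probes.py -- must be importable by the chaos process
def all_nodes_healthy(configuration=None, secrets=None) -> bool:
    """Return True when every node reports a Ready condition (sketch only)."""
    # a real implementation would query the Kubernetes API here
    return True

and the corresponding activity in the experiment:

{
  "type": "probe",
  "name": "All nodes are healthy",
  "tolerance": true,
  "provider": {
    "type": "python",
    "module": "chaosk8s_wix.node.probes",
    "func": "all_nodes_healthy"
  }
}

The "could not find Python module" error above usually means the module is not importable from the environment chaos runs in; checking with $ python -c "import chaosk8s_wix.node.probes" in the same virtual environment is a good first step.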

Provide a self-running package

Currently, you need to install chaostoolkit (and its dependencies) by creating a virtual environment. This may be a little too involved for simple runs.

Find a way to create a standalone package (potentially requiring only Python 3).

Candidates to build such an artefact:

The package should likely contain all extensions to make it ready to use.

Change wording on a chaos experiment execution that finds a weakness from "failed" to "complete"

At the moment, if a weakness is discovered the command line output states:

[2018-04-27 17:36:38 INFO] Experiment ended with status: failed

This is confusing, as the experiment has not failed; in fact, it has been successful in finding a weakness. It is only when this experiment is run "as a test" that the tested conditions of the steady-state hypothesis have failed.

Suggested change:

[2018-04-27 17:36:38 INFO] Experiment ended with status: completed
[2018-04-27 17:36:38 INFO] Steady-state hypothesis discovered weaknesses

Or something similar in this case.

Fix 'tolerance' for process probes

I cannot figure out how to create a steady-state hypothesis with a process probe. The problem comes from the way the tolerance is compared to the output of the process.

Ideally, when you define a probe of type 'process' for the steady-state hypothesis and the tolerance is of type 'int', it should simply check the return code of the process against the tolerance.

Optionally, it might be useful to handle a tolerance of type str by comparing it to stdout.

I temporarily fixed the 'int' problem by applying this change:

diff --git a/chaoslib/provider/process.py b/chaoslib/provider/process.py
index a6d2b4e..50a9b44 100644
--- a/chaoslib/provider/process.py
+++ b/chaoslib/provider/process.py
@@ -49,11 +49,11 @@ def run_process_activity(activity: Activity, configuration: Configuration,
     except subprocess.TimeoutExpired:
         raise FailedActivity("process activity took too long to complete")
 
-    return (
-        proc.returncode,
-        proc.stdout.decode('utf-8'),
-        proc.stderr.decode('utf-8')
-    )
+    return {
+        "status": proc.returncode,
+        "stdout": proc.stdout.decode('utf-8'),
+        "stderr": proc.stderr.decode('utf-8')
+    }
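
For reference, this is a sketch of the kind of declaration the issue is about, assuming the int tolerance ends up being checked against the process return code (the probe name and command are illustrative):

{
  "type": "probe",
  "name": "service-reports-healthy",
  "tolerance": 0,
  "provider": {
    "type": "process",
    "path": "curl",
    "arguments": "-sf http://localhost:8080/health"
  }
}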

Ask for permission before running an experiment

When I run an experiment with chaos run experiment.json, it would be better if the tool showed me a prompt to verify whether I would like to proceed with the experiment or not. An extra flag such as --yes or --no-verification could also be implemented to skip that step.
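
Since the CLI is built on click, a minimal sketch of such a prompt could look like this (the option name and wiring are assumptions, not the actual implementation):

import click

@click.command()
@click.argument("path")
@click.option("--yes", is_flag=True, help="Skip the confirmation prompt.")
def run(path: str, yes: bool):
    # ask before doing anything potentially destructive,
    # unless the user explicitly opted out with --yes
    if not yes and not click.confirm(f"About to run '{path}'. Proceed?"):
        raise click.Abort()
    # ... load and execute the experiment as usual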

Profile chaostoolkit

The CLI has shown a slight slowdown recently and I wonder if it's just my environment or if there is a larger issue (likely the former but let's be thorough).

Support structured logging

It would be nice to support structured logging, not just raw strings.

Question is, what do we want to see in the structured payload?

Also, are we talking about the full logs, or only the ones displayed on the console?

use 'probe' outputs in 'actions'

Hi guys,

As far as I can understand the general workflow of the tool is:

  • check steady state hypothesis
  • run actions
  • check steady state hypothesis again

In this flow, probes represent the steady-state hypothesis and are not used for anything else.

For the use cases I see, it would be extremely useful to use the outputs of probes as parameters to actions, and to have some conditional logic in between.

Example use case:
Having hundreds of instances in an AWS Region/AZ, I don't want to mess with all of them, as some instances/ASGs are not built for HA. So, I want to filter ASGs that:

  • match certain tags (easy)
  • have more than X healthy instances registered to the load balancer and serving traffic (not so easy)
  • pass the resulting list of ASGs to an action, so that it operates on this subset of instances, choosing random ones to stop

I can imagine doing all the necessary checks in the action itself, but this would lead to code duplication, as some of the checks I'd like to do already exist as probes and I naturally want to reuse that code.

Regarding conditional logic (like, don't stop instances in an ASG if it has fewer than 2 healthy instances): do you think it is possible to integrate this into the experiment YAML/JSON, or is the best place for this the action itself, with conditions passed as parameters to the action?

Please let me know your thoughts on this subject.

Do we need a synchronization mechanism?

Right now, the only way to give room for a process to happen is to use a pause before/after. This is obviously fine for fairly simple scenarios, but sometimes this is not enough.

So, do we need a more evolved synchronization mechanism?

I can see the benefit but it feels like a slippery slope because that means the toolkit becomes a state machine and increases in complexity. At first sight, I'm scared of that.

But I need the input from the community to make a better judgement.

Add an inspect command to query information about an activity

Right now, the only way to know what parameters an activity takes is by looking at the extension code. It could be handy to have a command that would tell us:

$ chaos inspect chaostoolkit-kubernetes kill_pod

revealing its parameters and output.

Should that be core, though? I'm leaning towards yes.

Paving the way for greater and more powerful automation

Hey all,

Recently, various members of the community have put forth ideas that put good pressure on our current "run once from A-to-B" approach to executing an experiment.

When the Chaos Toolkit project started, we realised we didn't have all the answers and we would likely fail to find the right model from the get go. So we staged a basic, yet effective, approach that we hoped would help get the discussion started around the Chaos Engineering experiment model.

The result is the current specification as we know it. However, it has what some may see as limitations due to its simplicity. Here are a few of those that were raised:

  1. output from an activity is not propagated further down the experiment, meaning you cannot refer to a value generated during your run (here and here)
  2. sometimes, you want to run the steady state only before or only after the method and see what happens. Right now, it always runs before and after, which is not always suitable. As an example, assume your system is in a degraded state and you want to run an experiment that tries out a recovery process you've defined (in other words, use Chaos Engineering not to shake your system down but to put it back up). Your steady state would be something like "are we back on track?"; if that were executed before the method, it would fail before we even had a chance to run the method! So there seems to be value in turning off the steady-state run either before or after.
  3. applying an experiment to a pool of data cannot be done at once natively. Say you have a bunch of environments (or other resources...) you want to apply the experiment to at once: you have no choice but to write your own script to handle the matrix and call the toolkit manually for each combination
  4. a similar one to the previous: being able to run a set of activities in a loop (not the entire experiment)
  5. being able to abort the experiment run as soon as the method emits such a signal would avoid wasting time having the whole experiment go through for no reason
  6. following on from the previous one: is there value in a test mode, when we always imagined the Chaos Engineering flow to have value because it always goes to the end (black box/closer to user behavior)?
  7. do we need to improve the synchronization of activities? We are solely time-based now (wait for 5s after the activity...), but could we imagine event-based synchronization? That would mean having to listen for events.

Overall, those ideas make a lot of sense and could turn the toolkit into something more capable. But more capabilities always come at a price: greater complexity, leading to potentially increased fragility.

I created this thread not to find solutions to each of these issues (please comment on the corresponding tickets for specifics ;)) but so that we, as the Chaos Toolkit community, can have a clear discussion about where we want to take the next steps.

I think some of these could make it into a 1.0 if they feel appropriate and not rushed. Others will have to wait for a 2.0.

Please make sure to speak your mind :)

Cheers,

  • Sylvain

installer complains when it can't download a wheel archive

For some dependencies, pip complains when installing them if they don't have a wheel distribution.

Building wheels for collected packages: click-plugins
  Running setup.py bdist_wheel for click-plugins ... error
  Complete output from command /home/cristian/.venvs/chaostk/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-6fsfb6db/click-plugins/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /tmp/tmpeyfytuqvpip-wheel- --python-tag cp35:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help
  
  error: invalid command 'bdist_wheel'
  
  ----------------------------------------
  Failed building wheel for click-plugins
  Running setup.py clean for click-plugins
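
A likely workaround (an assumption based on the error, not confirmed in this thread): the bdist_wheel failure usually means the wheel package is missing from the virtual environment, so installing it first lets pip build the missing wheels:

$ pip install wheel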

Show help page when executing a plain chaos run command

Rather than having to run chaos run --help to show the available options, the plain chaos run command could also show the help page below the Error: Missing argument "path".

Example

Current output:

$ chaos run
Usage: chaos run [OPTIONS] PATH

Error: Missing argument "path".

Potential new output:

$ chaos run
Error: Missing argument "path".
Usage: chaos run [OPTIONS] PATH

  Run the experiment given at PATH.

Options:
  --journal-path TEXT  Path where to save the journal from the execution.
  --dry                Run the experiment without executing activities.
  --no-validation      Do not validate the experiment before running.
  --help               Show this message and exit.

Make `steady-state-hypothesis` block optional

When you initially begin to explore the weaknesses of a system, you typically start by simply probing and performing some attempts to "see what happens". This stage of an experiment's development shouldn't need a steady-state-hypothesis, but rather a method only, as you explore various scenarios to retrieve and then interpret the resulting data.

This is a very common approach in regular scientific work, and is explained a number of times in the excellent book "Ignorance: How it drives science", i.e. "Let's get the data, and then we can figure out the hypothesis", Chapter 2, Page 19.

Collaborator Promotion Process: Call for Thoughts

Currently we have a collection of open source projects and a wonderfully vibrant community around the Chaos Toolkit and the Incubator. As people from the community get involved, it becomes desirable that more responsibility for those projects be subsumed by the community rather than the org owners (currently myself and @Lawouach).

I'd like to propose that we collectively formulate a process whereby someone who everyone recognises as going beyond simple contribution to a project, and who is in fact becoming a trusted chaperone of the project, can be "promoted" to direct write access to master for that project.

The idea so far is to perhaps have something like the following:

  1. Propose the promotion, who and to what repo, and reasoning as an issue on this repo.
  2. Allow a sensible period for discussion for and against the proposal.
  3. Final say to be made by the existing responsible project contributors.
  4. Announcement and justification on the issue prior to it being closed and the new rights being allocated to the successfully promoted collaborator.

That's just my take so far though. This issue is a call out for ideas on what that process might be, so please add your own suggestions to this issue and then we can hopefully establish a V1 of our own community collaborator promotion process, which we'll add to the project documentation.

Add an `extensions` optional block to the experiment

Depending on where an experiment is used, stored, and indexed, there is good cause for needing to specify optional, additional information about the experiment. This issue suggests that a block called extensions, which in turn contains name-scoped extension blocks, be added to the Chaos Toolkit's experiment specification.

Over time some of those "candidate" extension properties could be moved into the full chaos toolkit experiment specification once they are sufficiently mature and generally applicable.
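
A sketch of what such a block might look like; the extension name and its fields below are purely illustrative:

{
  "title": "...",
  "extensions": [
    {
      "name": "my-experiment-catalog",
      "experiment_id": "1234",
      "tags": ["kubernetes", "staging"]
    }
  ]
}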

Add init feature

It should be possible to create an almost-working experiment from the discovered features.

boolean in 'tolerance' does not seem to work correctly

It is mentioned in the docs that 'tolerance' supports a boolean.
Quote:
A boolean tolerance:
"tolerance": true

If I put it like that in the experiment JSON, it fails:

$ chaos run experiment-asg.json
Traceback (most recent call last):
  File "/home/dima/.venvs/chaostk/bin/chaos", line 11, in <module>
    sys.exit(cli())
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/chaostoolkit/cli.py", line 124, in run
    experiment = load_experiment(click.format_filename(source), settings)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/loader.py", line 76, in load_experiment
    return parse_experiment_from_file(experiment_source)
  File "/home/dima/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/loader.py", line 31, in parse_experiment_from_file
    return json.load(f)
  File "/usr/lib/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 42 column 30 (char 1376)

Please advise whether this is a bug in the docs or elsewhere.

Meanwhile, I have a workaround: adding str() to the functions that return a boolean value.
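
For what it's worth, the JSONDecodeError shows the experiment file itself fails to parse at line 42, before any tolerance logic runs. JSON only accepts lowercase literals, which is easy to trip over when the file is generated from Python:

"tolerance": true    <-- valid JSON
"tolerance": True    <-- fails with "Expecting value"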

chaos init should offer to create rollback actions

I think that, similarly to #28, it would be interesting for "chaos init" to offer the possibility to generate the rollback actions linked to the method.

For example, if I used a method "scale_microservice" to scale my application to 5 replicas, it would be nice to have a rollback at the end of the experiment that reverts it to my previous number of replicas. So init would give me the opportunity to fill in the "rollbacks", be it automatically with the discovery done beforehand, or manually with input from the user.

That would probably extend the current:

[2018-01-30 16:50:27 INFO] Let's build a new experiment
Experiment's title: Experiment #1
Add an activity to your method

into something like:

[2018-01-30 16:50:27 INFO] Let's build a new experiment
Experiment's title: Experiment #1
1) Add an activity to your method
2) Add a steady-state hypothesis
3) Add an activity to your rollbacks

What do you guys think?

Consider adding a "test" mode to the toolkit, in addition to the default "experiment" mode

I'd like to consider whether there could be a "test" mode in the toolkit, where method execution failures result in different reporting, etc. It's an interesting idea. At the moment the default mode is "experiment", i.e. we want the toolkit to act as if we are focussed on discovering new information. A "test" mode would affect the runtime reporting wholesale to support a focus on validation and verification of already-known facts.

Make steady state hypothesis the before and after experiment check

The steady state hypothesis is used for two purposes:

  • To check that the system is within a specified set of tolerances before an experiment can be run
  • To be used to assess the system at the end of the experiment and to report any deviations from the steady state hypothesis as areas of potential weakness.

Add configure command

A configure command allows the toolkit installation to manage its extensions (plugins and drivers) according to specific versions, and also to manipulate settings (stored in ~/.chaostoolkit/settings.yaml).

The idea for the workflow is:

  • chaos configure -settings key:value key:value -plugins pluginName:1.0.0 pluginName:0.3.0 -drivers driverName driverName:0.4.0
  • "Would you like to create a new, local Python virtual environment for this configuration?"
  • Yes ... new, fresh virtual environment, install it all including the toolkit itself.
  • No ... continue...
  • If a collision in versions is detected in the current environment then...
  • "You have specified driverName version 0.4.0, you already have version 0.2.1 installed, do you want to keep your current version (Y), or override with the configured version (O)?"

At the moment, configure cannot specify a version of the chaos toolkit itself. However it could do this in the future, particularly in the case where a new virtual environment is being specified.

Also, the choices in the above workflow could be defaulted with:

  • -virtualenv to automatically create a local virtual environment
  • -override to automatically override any version collisions, preferring the versions specified in the configure command.
  • --no-override to prefer existing versions over those specified in the configure command, thereby ignoring any specified versions that collide with what is already present.

chaostoolkit should expose the I/O API for probes and actions

The first prototype showcased the general principles behind chaostoolkit by implementing the probes and actions as Python functions that were part of the package itself.

This won't scale well and will not help the community build its own set of probes and actions.

We should expose a clean API that describes what the chaostoolkit runtime expects, and allow anyone to implement probes and actions as they see fit.

Obviously, we will provide implementations for certain targets already, for ease of use, but likely in their own separate packages.

Implementations should not be forced to be in Python; processes or remote calls should be supported as well.

When applying a steady-state probe, it should be "steady-state" not "steady" as currently reported

When you run an experiment, the logging mentions that a "steady" probe is being applied. This should probably be renamed to a "steady-state" probe, and might just be a name change in the experimental method.

See the following output for where this occurs when running the service-down-not-visible-to-users sample experiment:

[I 170930 14:01:15 plan:134]  Applying steady probe 'microservices-all-healthy'

Pass command name to checker

As commands evolve from one version to another, it would be good to know what users try to use on a given version. This could mean better error messages.

Add “AbortActivity” exception

Add an “AbortActivity” exception to the toolkit to support interrupting and immediately exiting an experiment when the exception is raised from any activity. Add it to the docs as well, and ensure it is journaled correctly.
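
A minimal sketch of what this could look like, assuming the exception lives alongside the existing FailedActivity (the name, placement, and probe body are assumptions):

# hypothetical placement, e.g. chaoslib/exceptions.py
class AbortActivity(Exception):
    """Raised from any activity to interrupt the run and exit the
    experiment immediately."""


# illustrative probe/action raising the abort signal
def kill_pod(name: str, configuration=None, secrets=None):
    cluster_ready = False  # stand-in for a real readiness check
    if not cluster_ready:
        # bail out of the whole experiment, not just this activity
        raise AbortActivity("cluster state unknown, aborting experiment")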
