chaostoolkit / chaostoolkit-lib


76.0 10.0 46.0 791 KB

The Chaos Toolkit core library

Home Page: https://chaostoolkit.org/

License: Apache License 2.0

Python 100.00%

chaos-engineering chaostoolkit chaostoolkit-core reliability-engineering

chaostoolkit-lib's Issues

Do not fail when loading secrets that do not require vault and hvac is not installed

As of chaoslib 0.12, hvac is no longer installed by default, yet loading secrets fails when it is missing, even if none of the secrets require Vault.
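One possible shape for the fix is to check for hvac only when a secret actually asks for Vault. The sketch below is an illustration, not chaoslib's code: the function names are hypothetical, and it assumes secrets are grouped per target, with each Vault-backed entry being a dictionary carrying "type": "vault".

```python
import importlib.util
from typing import Any, Dict


def needs_vault(secrets: Dict[str, Any]) -> bool:
    """Return True if any declared secret entry has type 'vault'."""
    for target in secrets.values():
        if not isinstance(target, dict):
            continue
        for value in target.values():
            if isinstance(value, dict) and value.get("type") == "vault":
                return True
    return False


def check_vault_dependency(secrets: Dict[str, Any]) -> None:
    """Complain about a missing hvac package only when Vault is used."""
    if not needs_vault(secrets):
        # No Vault-backed secret: hvac is irrelevant, so do not fail.
        return
    if importlib.util.find_spec("hvac") is None:
        raise RuntimeError(
            "Vault secrets are declared but the hvac package is not installed")
```

With this arrangement, an experiment that only reads secrets from the environment never touches hvac at all.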

Expand Tolerance options

According to the documentation, I can only get responses from an HTTP method as a string.
But a tolerance of the form [int, int] only accepts an int, so the steady state validation fails.

Example:

{
    "type": "probe",
    "name": "CCCCC",
    "tolerance": [1, 10000],
    "provider": {
      "type": "http",
      "url": "http://localhost:3000/metrics/query",
      "method": "POST",
      "arguments": {
        "query": "any service returning a string",
        "datasource": 161
      }
    }
}
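To illustrate the request, here is a minimal sketch of a range check that accepts a numeric string as well as a number. The function name within_range is hypothetical and this is not how chaoslib currently validates tolerances.

```python
from typing import Any, Sequence


def within_range(value: Any, tolerance: Sequence[float]) -> bool:
    """Check a probe value against a [low, high] tolerance.

    Strings are converted to numbers first, so an HTTP body such as
    "42" can be compared against [1, 10000].
    """
    low, high = tolerance
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Not a number at all: the tolerance cannot be met.
        return False
    return low <= number <= high
```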

Update yaml loader to use the safe loader

Currently we don't load yaml using the safe mechanism of the yaml package. This should be the case as we read from unsafe places.

When using a probe in `method`, tolerance is not validated

Hi,

I just noticed that probes are allowed inside method as well as in steady-state-hypothesis, but their tolerance is not validated there.

I find it useful to use probes inside method. My first use case is when I don't want to run a probe both before and after some actions. Instead, the probe can be used to validate some conditions in the middle of the experiment. For example, I stop a random instance in an ASG and, if it's not marked unhealthy (or perhaps is not replaced fast enough), I don't want to continue with the experiment.

Please advise whether this behavior is intentional or accidental.

Add and document some strategies around time-constrained experiments

Original content from docs: Adding Time Constraints to an Experiment

It is a common requirement to execute a chaos experiment for a certain period of
time, ending the experiment if it goes on indefinitely.

We've been very careful to rely on other tools for these types of concerns, and
so timing constraints are not a built-in feature of the Chaos Toolkit's
experiments.
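As an illustration of delegating the time limit to an outside tool, the sketch below wraps a command in a deadline using Python's standard subprocess module. The helper name run_with_deadline and the experiment file name are assumptions made for the example. The sketch also does not address whether rollbacks get a chance to run when the process is killed.

```python
import subprocess
from typing import List, Optional


def run_with_deadline(command: List[str], seconds: float) -> Optional[int]:
    """Run a command, killing it if it exceeds the given number of seconds.

    Returns the exit code, or None when the deadline was reached.
    """
    try:
        completed = subprocess.run(command, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical usage: give the experiment five minutes at most.
    code = run_with_deadline(["chaos", "run", "experiment.json"], 300)
    print("timed out" if code is None else "exit code {}".format(code))
```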

Pass the experiment to all controls

Currently, only the current context (steady-state, method, activity...) is passed down to a control function. We should also pass the experiment as it contains a larger context that may be useful as well.

Ideally this should go into 1.0.0rc2

Add settings support for the chaostoolkit

In order to support #33, it will be necessary to store settings for the toolkit.

Control level is overridden

The control level used to determine which Python function to call is overridden and should be preserved.

NameError: name 'ModuleNotFoundError' is not defined

We can't rely on ModuleNotFoundError, which was introduced in Python 3.6 and does not exist in 3.5.
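Since ModuleNotFoundError is a subclass of ImportError, catching ImportError behaves the same on Python 3.5 and later. A minimal sketch, with try_import as a hypothetical helper name:

```python
import importlib
from types import ModuleType
from typing import Optional


def try_import(name: str) -> Optional[ModuleType]:
    """Import a module by name, returning None when it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # On 3.6+ a missing module raises ModuleNotFoundError, which is a
        # subclass of ImportError, so this clause covers 3.5 as well.
        return None
```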

name is not declared in control/python.py

In the validation function, the name variable is undeclared.

Ensure steady state hypothesis is met after Rollback

Hi,

I've been testing chaostoolkit and stumbled upon below scenario:

During an otherwise successful experiment run, the rollback failed, changing the system and basically bringing the app down, yet the experiment was still reported as successful:

Rollback configuration in experiment:

app-must-be-healthy is a probe ref of steady-state-hypothesis


    "rollbacks": [
        {
            "type": "action",
            "name": "restart-app",
            "provider": {
                "type": "process",
                ....
                ....
          },
            "pauses": {
                "after": 5
            }
        },
        {
            "ref": "app-must-be-healthy"
        }
    ]

Experiment logs:

chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Steady state hypothesis is met!
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Let's rollback...
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Rollback: restart-app
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Action: restart-app
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Pausing after activity for 5s...
chaostoolkit_1  | [2019-04-04 13:48:18 INFO] Rollback: None
chaostoolkit_1  | [2019-04-04 13:48:18 INFO] Probe: app-must-be-healthy
chaostoolkit_1  | [2019-04-04 13:48:18 ERROR]   => failed: failed to connect to http://nginx:80/health: HTTPConnectionPool(host='nginx', port=80): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f91218d54e0>: Failed to establish a new connection: [Errno -2] Name does not resolve',))
chaostoolkit_1  | [2019-04-04 13:48:18 INFO] Experiment ended with status: completed

Would it make sense to re-evaluate steady-state-hypothesis and experiment result after rollback?

P.S. I hope I opened this issue correctly here and not in https://github.com/chaostoolkit :)

Migrate FailedActivity exception to ActivityFailed

Allow config/secrets to be passed directly to an action or probe, overriding global defaults

Can we pass results from one activity to the rest of the experiment?

The toolkit does its best to avoid global state and, so far, there has never really been a need to take the output of an activity and feed it into another activity. But there are cases when this is useful (when an operation returns an ID, for instance).

Let's see how we can add this.

Set author name to contact address.

Allow to setup headers when loading experiments over HTTP

Currently loading experiments over HTTP forces the Accept header to static values of:

"application/json, application/x-yaml"

In some cases, this should be amended by the operator.
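A minimal sketch of how operator-supplied headers could be merged over the static default. The names DEFAULT_HEADERS and build_headers are hypothetical. Header names are compared case-insensitively so that "accept" replaces "Accept" rather than being sent alongside it.

```python
from typing import Dict, Optional

# The static value currently forced by the loader, per the issue text.
DEFAULT_HEADERS = {"Accept": "application/json, application/x-yaml"}


def build_headers(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return the default headers with operator-provided ones applied on top."""
    headers = dict(DEFAULT_HEADERS)
    for name, value in (extra or {}).items():
        # Drop any default that differs only by letter case.
        for existing in list(headers):
            if existing.lower() == name.lower():
                del headers[existing]
        headers[name] = value
    return headers
```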

Provide a way to mark an action as "dangerous"

While Chaos Engineering may degrade the system, we should be careful not to cause too much harm. A mechanism to mark an action as "dangerous" could help in those cases.

From the CLI side, this could translate into asking the users before running an experiment?

Do not fail on discovery of modules which don't export __all__

Right now the discovery mechanism expects a module to have an __all__ attribute. Do not fail when it is missing.
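The tolerant lookup can be sketched with getattr and a default. The helper name exported_names is hypothetical, and treating a missing __all__ as "nothing to discover" is one possible policy among several.

```python
import types
from typing import List


def exported_names(module: types.ModuleType) -> List[str]:
    """Return the names listed in a module's __all__, or [] if it has none."""
    names = getattr(module, "__all__", None)
    if names is None:
        # No __all__ declared: report nothing instead of raising.
        return []
    return list(names)
```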

Fail more gracefully when process doesn't return utf-8

When a process returns non-utf-8 data, the activity fails quite poorly. Try to be smarter here.
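One way to be more forgiving is to fall back to a lossy decode instead of failing the activity. This is only a sketch and decode_output is a hypothetical name. Undecodable bytes become the Unicode replacement character.

```python
def decode_output(raw: bytes) -> str:
    """Decode process output as UTF-8, replacing bytes that are not valid."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Keep whatever is readable rather than failing the whole activity.
        return raw.decode("utf-8", errors="replace")
```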

Call to hvac read_secret fails on KV v2

I got tricked into thinking that you could call client.secrets.kv.read_secret(path) as per the documentation but it seems the documentation is quite out of sync with the code.

Correct activity-level control behaviour where missing controls are simply warned of in the logging

Bail cleanly when environment key was not found

It appears the toolkit doesn't tell you when a key couldn't be found in the environment.
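A minimal sketch of a lookup that reports the missing key explicitly. The exception class and function name are hypothetical and not part of chaoslib.

```python
import os
from typing import Mapping, Optional


class MissingEnvironmentKey(Exception):
    """Raised when an experiment refers to an unset environment variable."""


def read_env_key(key: str,
                 environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    environ = os.environ if environ is None else environ
    if key not in environ:
        raise MissingEnvironmentKey(
            "environment variable '{}' was not found".format(key))
    return environ[key]
```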

Chaos Toolkit model link to doc is broken

In the readme, the Chaos Toolkit model link (http://chaostoolkit.org/overview/concepts/) is broken (404).

By the way, your 404 page contains weird content: "Cloud bread lo-fi woke echo park cronut plaid banjo hammock fingerstache ennui gentrify fashion axe poke. ... " Is this intended?

HTTP provider must allow requests against HTTPS endpoint that are self-signed

When performing local tests, a user may rely on a self-signed certificate for their server. The HTTP probe must therefore accept a parameter to disable TLS verification.
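To show what such a switch does underneath, here is a sketch using only the standard ssl module. The helper name make_ssl_context and its verify parameter are assumptions for the example, not the provider's actual interface.

```python
import ssl


def make_ssl_context(verify: bool = True) -> ssl.SSLContext:
    """Build a TLS context, optionally accepting self-signed certificates."""
    context = ssl.create_default_context()
    if not verify:
        # check_hostname must be turned off before verify_mode can be
        # set to CERT_NONE, otherwise ssl raises a ValueError.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

The resulting context can be handed to urllib.request.urlopen through its context argument. Disabling verification is only appropriate for local testing.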

Make wording around steady-state-hypothesis more informative, or remove if hypothesis block is optional

Log a debug message of the file where an activity was loaded from

For debug purpose, it could be handy to log a message where a particular activity was loaded from.

Present warning on the command line output when a HTTP or process activity fails

For process calls, anything other than a 0 return code should result in a warning. For HTTP, a status code greater than 399 should trigger a warning message.
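The two rules above can be sketched as a single helper. The name activity_warning and the message wording are hypothetical.

```python
from typing import Optional


def activity_warning(provider_type: str, code: int) -> Optional[str]:
    """Return a warning message for a failed activity, or None if it is fine.

    For "process" providers, code is the return code.
    For "http" providers, code is the response status.
    """
    if provider_type == "process" and code != 0:
        return "process exited with return code {}".format(code)
    if provider_type == "http" and code > 399:
        return "HTTP call returned status {}".format(code)
    return None
```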

Support for hooks/events

It would be nice to be able to perform actions before and after certain points in an experiment.
Some ideas/examples:

Before running an experiment, announce to a Slack channel that we are about to run a chaos experiment.
After finishing the experiment, announce that it has finished.
Before and after each probe, log the results to some log server on a customised format.
If we had to run a rollback, send an email to the service owner so they can check that everything looks fine.
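As a rough sketch of what the ideas above could look like, the registry below lets callers attach functions to named points of an experiment. The class, the event names and the context dictionary are all hypothetical. The toolkit may well settle on a different design.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Hook = Callable[[Dict], None]


class HookRegistry:
    """Keep lists of callables keyed by event name and call them on demand."""

    def __init__(self) -> None:
        self._hooks: DefaultDict[str, List[Hook]] = defaultdict(list)

    def register(self, event: str, hook: Hook) -> None:
        """Attach a hook to an event such as "before_experiment"."""
        self._hooks[event].append(hook)

    def fire(self, event: str, context: Dict) -> None:
        """Call every hook registered for the event, in registration order."""
        for hook in self._hooks.get(event, []):
            hook(context)


if __name__ == "__main__":
    registry = HookRegistry()
    registry.register("before_experiment",
                      lambda ctx: print("about to run:", ctx["title"]))
    registry.fire("before_experiment", {"title": "demo experiment"})
```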

Change "Validating experiment's syntax" to "Validating the experiment's syntax"

Small change, just better grammatically.

Add authorization support to HTTP activities

Most HTTP APIs sit behind some form of authorization, so it should be straightforward to provide credentials to experiments when needed.

Improve error handling when using discovery and a conflict occurs with an existing extension

I got the following unfriendly output when there was a collision with an existing integration extension:

chaos discover chaostoolkit-kubernetes
[2018-01-30 15:35:15 INFO] Attempting to download and install package 'chaostoolkit-kubernetes'
[2018-01-30 15:35:19 INFO] Package downloaded and installed in current environment
Traceback (most recent call last):
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/package.py", line 85, in get_importname_from_package
    name = dist.get_metadata('top_level.txt').split("\n)", 1)[0]
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1493, in get_metadata
    value = self._get(self._fn(self.egg_info, name))
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1605, in _get
    with open(path, 'rb') as stream:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaostoolkit_kubernetes-0.8.0.dist-info/top_level.txt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/russellmiles/.venvs/chaostk/bin/chaos", line 11, in <module>
    sys.exit(cli())
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaosiq/cli.py", line 140, in discover
    download_and_install=not no_install)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/discover.py", line 30, in discover
    package = load_package(package_name)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/package.py", line 45, in load_package
    name = get_importname_from_package(package_name)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/package.py", line 89, in get_importname_from_package
    "Was the package installed properly?".format(p=package_name))
chaoslib.exceptions.DiscoveryFailed: failed to load package 'chaostoolkit-kubernetes' metadata. Was the package installed properly?

Allow validation of experiment without importing modules

Sometimes, we want to validate the experiment in a more shallow fashion and we can't load the Python providers in those cases. Add a flag to support that case.

Add discover feature

The current interface of chaostoolkit supports running experiments. However, as part of the goal for chaostoolkit, we have always wanted to make it simpler to get into chaos engineering.

The new discover command aims to collect information about a specific target and offer suggestions about potential chaos engineering experiments.

The goal of discover is to let you look up what an extension is capable of doing, along with a summary of the platform/application this extension targets (if available) and a list of chaos experiment suggestions.

As not all extensions are Python packages, discover should eventually be able to load a spec file which describes an extension made of process calls or HTTP calls. This may be done in a second iteration of the command.

Control is duplicated

Controls seem to be duplicated while they are applied

Catch expired vault secret id

When the vault secret id of the app role has expired, this blows up the whole process. Catch and fail gracefully.

Fix needed for activity-level controls to work as expected

Currently there is a bug when a controls block is applied at the activity level, i.e. not at the top level, and no top-level controls are applied either.

A fix such as the following needs to be applied to the chaos lib from line 201 onwards:

    for c in controls.copy():
        if "ref" in c:
            for top_level_control in top_level_controls:
                if c["ref"] == top_level_control["name"]:
                    controls.append(deepcopy(top_level_control))
                    break
        else:
            tc = None
            for tc in top_level_controls:
                if c.get("name") == tc.get("name"):
                    break
            else:
                if tc and tc.get("automatic", True):
                    controls.append(deepcopy(tc))

Add requirements.txt for test dependencies

Ability to perform tolerance check on 'stdout' property of process probe, rather than 'status'

My hypothesis probe defines a curl request, which outputs its total time to stdout stream.
The probe defines a range tolerance of [0, 1] intended to check if the total time is within one second. Here's that probe for reference:

{
...
    "steady-state-hypothesis": {
        "title": "cURL www.google.com",
        "probes": [
            {
                "type": "probe",
                "name": "http google",
                "tolerance": [0,1],
                "provider": {
                    "type" : "process",
                    "path" : "curl",
                    "arguments": "-o /dev/null -w \"%{time_total}\" -s https://www.google.com"
                }
            }
        ]
    },
...
}

What appears to happen is that the tolerance range of [0, 1] is checked against the status value of the process probe rather than against stdout. This means that if the output is 10.234, the hypothesis is still met.

For example, I would expect this to succeed, as stdout is between 0 and 1

[2019-03-19 13:04:23 DEBUG] [process:54] Running: /usr/bin/curl -o /dev/null -w "%{time_total}" -s https://www.google.com
[2019-03-19 13:04:23 DEBUG] [__init__:115] Data encoding detected as 'ascii' with a confidence of 1.0
[2019-03-19 13:04:23 DEBUG] [activity:179]   => succeeded with '{'stderr': '', 'stdout': '0.420', 'status': 0}'
[2019-03-19 13:04:23 DEBUG] [hypothesis:177] allowed tolerance is [0, 1]
[2019-03-19 13:04:23 INFO] [hypothesis:184] Steady state hypothesis is met!

and I would expect the following to fail as stdout is greater than 1, but it passes as status is 0.

[2019-03-19 13:04:23 DEBUG] [process:54] Running: /usr/bin/curl -o /dev/null -w "%{time_total}" -s https://www.google.com
[2019-03-19 13:04:27 DEBUG] [__init__:115] Data encoding detected as 'ascii' with a confidence of 1.0
[2019-03-19 13:04:27 DEBUG] [activity:179]   => succeeded with '{'stderr': '', 'stdout': '3.397', 'status': 0}'
[2019-03-19 13:04:27 DEBUG] [hypothesis:177] allowed tolerance is [0, 1]
[2019-03-19 13:04:27 INFO] [hypothesis:184] Steady state hypothesis is met!

Is there some way to instruct the process probe tolerance which value it needs to check, rather than just using 'status'? Looking at using the HTTP probe type there is no response time property either.
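To make the request concrete, the sketch below checks a range tolerance against a chosen field of the process result. The function name and the target parameter are hypothetical and only illustrate the desired behaviour. With target left at "status" it reproduces the behaviour described in this issue.

```python
from typing import Any, Dict, Sequence


def check_process_tolerance(result: Dict[str, Any],
                            tolerance: Sequence[float],
                            target: str = "status") -> bool:
    """Check a [low, high] tolerance against one field of a process result."""
    low, high = tolerance
    try:
        # stdout arrives as text, so convert it before comparing.
        value = float(result[target])
    except (KeyError, TypeError, ValueError):
        return False
    return low <= value <= high


if __name__ == "__main__":
    slow = {"stderr": "", "stdout": "3.397", "status": 0}
    print(check_process_tolerance(slow, [0, 1]))
    print(check_process_tolerance(slow, [0, 1], target="stdout"))
```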

Reading vault secrets is not working as you'd expect

Reading vault secrets is not working as you'd expect.

Currently, the whole Vault payload of a secret is read into the chaostoolkit secret section (including the vault secret metadata). This is not what you'd expect. Also, it's not intuitive that the key argument refers to the path.
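For a KV version 2 read, Vault nests the stored key/value pairs under data.data and puts the secret's metadata next to them under data.metadata. A minimal sketch of extracting only the values follows. The function name is hypothetical and the payload is assumed to be the already-parsed response dictionary.

```python
from typing import Any, Dict, Optional


def extract_kv2_secret(payload: Dict[str, Any],
                       key: Optional[str] = None) -> Any:
    """Return the stored values of a KV v2 payload, without Vault metadata.

    When key is given, return only that entry. Otherwise return a copy
    of all stored key/value pairs.
    """
    data = payload.get("data", {}).get("data", {})
    if key is None:
        return dict(data)
    return data[key]
```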

Support non Python based extension providers

This issue's goal is for the community to discuss interest and solutions to support extension providers implemented in languages other than Python.

As a reminder, currently, the toolkit supports three extension providers:

http: whereby you declare a URL to call and the toolkit does it for you
process: where you provide the path to a binary which is executed by the toolkit
python: where you define a Python function that is imported from a module extension

While Python is considered a good choice for the core and most extensions, we have always cared about more than a single community. @dastergon asked about this on the community Slack and suggested I get the ball rolling with a high-level view of what would need to be done.

Generally speaking, it seems the simplest/easiest integration for calling native code from Python is to export a native library that exports its symbols (much like a C library). When doing that, Python has facilities to call them for you with ctypes.

This is what people seem to generally do:

Alternatives to ctypes are CFFI and cython. The latter is quite interesting because you provide a C-like wrapper on your native extensions and the generated Python code makes it look fairly native. It is popular but requires more work.

There could be two paths:

The core of the toolkit makes it loud and clear it officially supports ctypes/CFFI and you declare it like this:

{
   "type": "probe",
   "name": "my-go-blah",
   "provider": {
        "type": "go",
        "lib_name": "my-go-lib.so",
        "func": "func_name_in_lib",
        "arguments": { ... }
   }
}

This mirrors what is done for the Python provider, except that here the toolkit would simply expect a native library.

An extension author wraps the native code entirely inside a Python extension using cython. In that case, the "python" provider is enough and would work as it already does.

I think both are valuable but I wonder what communities would prefer.
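For readers unfamiliar with ctypes, the sketch below shows the general mechanism the first path relies on: load a shared library, declare a function's signature, and call it. It uses the C library's strlen because no Go library is available here, and it assumes a POSIX system where ctypes.CDLL(None) exposes the symbols already loaded into the process. A provider would load something like "my-go-lib.so" by name instead.

```python
import ctypes


def native_strlen(text: bytes) -> int:
    """Call the C strlen function through ctypes and return its result."""
    # Assumption: POSIX. CDLL(None) gives access to the symbols of the
    # running process, which include the C library.
    libc = ctypes.CDLL(None)
    # Declaring argument and return types avoids ctypes guessing them.
    libc.strlen.argtypes = [ctypes.c_char_p]
    libc.strlen.restype = ctypes.c_size_t
    return libc.strlen(text)


if __name__ == "__main__":
    print(native_strlen(b"chaos"))
```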

chaos run ./experiments/consensus-recovery.json
while [ $? -eq 0 ]; do
    chaos run ./experiments/consensus-recovery.json
done

Eventually, the experiment stopped being successful but continued to run.