chaostoolkit / chaostoolkit-lib

The Chaos Toolkit core library

Home Page: https://chaostoolkit.org/

License: Apache License 2.0

Python 100.00%
chaos-engineering chaostoolkit chaostoolkit-core reliability-engineering

chaostoolkit-lib's People

Contributors

alexshemeshwix, arpiagar, charliemoon37, ciaranevans, claymccoy, devatoria, dmartin35, idanto, joshuaroot, lawouach, mattiascockburn, mirimi, ojongerius, roeik-wix, russmiles, snej-, tam-lin, twuyts, wixoleo, ykskb

chaostoolkit-lib's Issues

Expand Tolerance options

According to the documentation, the response from an HTTP provider can only be read as a string.
But a range tolerance of [int, int] only accepts integers, so the steady-state validation fails.

Example:

{
    "type": "probe",
    "name": "CCCCC",
    "tolerance": [1, 10000],
    "provider": {
      "type": "http",
      "url": "http://localhost:3000/metrics/query",
      "method": "POST",
      "arguments": {
        "query": "any service returning a string",
        "datasource": 161
      }
    }
}

when using probe in `method` tolerance is not validated

Hi,

I just noticed that it is allowed to use probes inside method as opposed to steady-state-hypothesis but at the same time tolerance is not validated when doing so.

I find it useful to use probes inside method, first example use case I have is when I don't want to run probe both before and after some actions. Instead the probe can be used to validate some conditions in the middle of experiment. For example, I stop a random instance in ASG and if it's not marked unhealthy (or perhaps is not replaced fast) I don't want to continue with the experiment.

Please advise whether this behavior is intentional or accidental.

Add and document some strategies around time-constrained experiments

Original content from docs:

# Adding Time Constraints to an Experiment

It is a common requirement to execute a chaos experiment for a certain period of
time, ending the experiment if it goes on indefinitely.

We've been very careful to rely on other tools for these types of concerns, and
so timing constraints are not a built-in feature of the Chaos Toolkit's
experiments.
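As an illustration of relying on an outside tool, here is a minimal sketch of a wrapper that gives a command a time budget. The name `run_with_deadline` and the idea of wrapping `chaos run` this way are assumptions for illustration, not something the docs prescribe. Note that on timeout the child is killed outright, so it gets no chance to run its rollbacks.

```python
import subprocess
from typing import Optional, Sequence


def run_with_deadline(cmd: Sequence[str], timeout_s: float) -> Optional[int]:
    """Run cmd and return its exit code, or None if the deadline was hit."""
    try:
        # subprocess.run kills the child and raises once the timeout expires.
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

A call such as `run_with_deadline(["chaos", "run", "experiment.json"], 600)` would then cap a run at ten minutes; the experiment path is a placeholder.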

Pass the experiment to all controls

Currently, only the current context (steady-state, method, activity...) is passed down to a control function. We should also pass the experiment, as it contains a larger context that may be useful as well.

Ideally this should go into 1.0.0rc2
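One way this could be done without breaking existing controls is sketched below. The dispatcher `call_control` and both example controls are hypothetical; the sketch only shows the idea of passing `experiment` to functions that declare it.

```python
import inspect
from typing import Any, Callable, Dict


def call_control(func: Callable[..., Any], context: Dict[str, Any],
                 experiment: Dict[str, Any]) -> Any:
    """Call a control, passing the experiment only if the function accepts it."""
    params = inspect.signature(func).parameters
    accepts_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if "experiment" in params or accepts_kwargs:
        return func(context=context, experiment=experiment)
    # Controls written against the context-only signature keep working.
    return func(context=context)


def old_control(context):
    """Hypothetical control that only knows about the current context."""
    return context["name"]


def new_control(context, experiment):
    """Hypothetical control that also uses the wider experiment."""
    return f"{experiment['title']}: {context['name']}"
```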

Ensure steady state hypothesis is met after Rollback

Hi,

I've been testing chaostoolkit and stumbled upon the scenario below:

During an otherwise successful experiment run, the rollback failed. It changed the system and basically brought the app down, yet the experiment was still reported as successful:

Rollback configuration in experiment:

app-must-be-healthy is a probe ref of steady-state-hypothesis


    "rollbacks": [
        {
            "type": "action",
            "name": "restart-app",
            "provider": {
                "type": "process",
                ....
                ....
            },
            "pauses": {
                "after": 5
            }
        },
        {
            "ref": "app-must-be-healthy"
        }
    ]


Experiment logs:

chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Steady state hypothesis is met!
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Let's rollback...
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Rollback: restart-app
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Action: restart-app
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Pausing after activity for 5s...
chaostoolkit_1  | [2019-04-04 13:48:18 INFO] Rollback: None
chaostoolkit_1  | [2019-04-04 13:48:18 INFO] Probe: app-must-be-healthy
chaostoolkit_1  | [2019-04-04 13:48:18 ERROR]   => failed: failed to connect to http://nginx:80/health: HTTPConnectionPool(host='nginx', port=80): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f91218d54e0>: Failed to establish a new connection: [Errno -2] Name does not resolve',))
chaostoolkit_1  | [2019-04-04 13:48:18 INFO] Experiment ended with status: completed

Would it make sense to re-evaluate steady-state-hypothesis and experiment result after rollback?
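To make the question concrete, here is a rough sketch of the proposed flow, not the toolkit's actual runner. `run_experiment`, the callables it takes, and the `steady_state_after_rollback` key are all made up for illustration.

```python
from typing import Callable, Dict, List


def run_experiment(hypothesis: Callable[[], bool],
                   method: List[Callable[[], None]],
                   rollbacks: List[Callable[[], None]]) -> Dict[str, object]:
    """Run a simplified experiment and re-check the hypothesis after rollbacks."""
    journal: Dict[str, object] = {"status": "completed", "deviated": False}
    if not hypothesis():
        journal["status"] = "failed"
        return journal
    for activity in method:
        activity()
    if not hypothesis():
        journal["deviated"] = True
    for rollback in rollbacks:
        rollback()
    # Extra step suggested in this issue: verify the system once more,
    # so a rollback that breaks the app no longer ends as "completed".
    if not hypothesis():
        journal["status"] = "failed"
        journal["steady_state_after_rollback"] = False
    return journal
```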

P.S. I hope I opened this issue correctly here and not in https://github.com/chaostoolkit :)

Can we pass results from one activity to the rest of the experiment?

The toolkit does its best not to hold global state and, so far, there has never really been a need to take the output of an activity and feed it into another activity. But there are cases where this is useful (when an operation returns an ID, for instance).

Let's see how we can add this.
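One possible shape, sketched under assumptions: each activity's return value is stored under the activity name, and later arguments can refer to it with a `${name}` placeholder. The placeholder syntax, `substitute` and `run_activities` are hypothetical. Because `string.Template` is used, names must be plain identifiers (underscores rather than hyphens).

```python
from string import Template
from typing import Any, Callable, Dict, List


def substitute(value: Any, outputs: Dict[str, Any]) -> Any:
    """Recursively replace ${name} placeholders with earlier outputs."""
    if isinstance(value, str):
        mapping = {k: str(v) for k, v in outputs.items()}
        # safe_substitute leaves unknown placeholders untouched.
        return Template(value).safe_substitute(mapping)
    if isinstance(value, dict):
        return {k: substitute(v, outputs) for k, v in value.items()}
    if isinstance(value, list):
        return [substitute(v, outputs) for v in value]
    return value


def run_activities(activities: List[Dict[str, Any]],
                   handlers: Dict[str, Callable[..., Any]]) -> Dict[str, Any]:
    """Run activities in order, recording each output under the activity name."""
    outputs: Dict[str, Any] = {}
    for activity in activities:
        args = substitute(activity.get("arguments", {}), outputs)
        outputs[activity["name"]] = handlers[activity["name"]](**args)
    return outputs
```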

Provide a way to mark an action as "dangerous"

While Chaos Engineering may degrade the system, we should be careful not to cause too much harm. So, having a mechanism to mark an action as "dangerous" could help in those use cases.

From the CLI side, this could translate into asking the users before running an experiment?
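A small sketch of what the CLI side might look like. The `dangerous` key on an activity and the function `confirm_dangerous` are hypothetical; nothing like them exists in the experiment format today.

```python
from typing import Any, Callable, Dict


def confirm_dangerous(experiment: Dict[str, Any],
                      ask: Callable[[str], str] = input) -> bool:
    """Return True if the experiment may run, prompting for risky actions."""
    risky = [
        a["name"] for a in experiment.get("method", []) if a.get("dangerous")
    ]
    if not risky:
        return True
    answer = ask(f"Dangerous actions: {', '.join(risky)}. Continue? [y/N] ")
    # Anything other than an explicit yes is treated as a refusal.
    return answer.strip().lower() in ("y", "yes")
```

Taking `ask` as a parameter keeps the prompt replaceable, for example in non-interactive runs.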

Support for hooks/events

It would be nice to be able to perform actions before and after certain points in an experiment.
Some ideas/examples:

  • Before running an experiment, announce to a Slack channel that we are about to run a chaos experiment.
  • After finishing the experiment, announce that it has finished.
  • Before and after each probe, log the results to some log server on a customised format.
  • If we had to run a rollback, send an email to the service owner so they can check that everything looks fine.
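The ideas above could be served by a simple event registry. This is only a sketch; the `Hooks` class and the event names are invented for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class Hooks:
    """Minimal registry mapping event names to handler functions."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[..., Any]]] = defaultdict(list)

    def on(self, event: str, handler: Callable[..., Any]) -> None:
        """Register a handler for an event such as 'before_experiment'."""
        self._handlers[event].append(handler)

    def emit(self, event: str, **payload: Any) -> None:
        """Call every handler registered for the event, in registration order."""
        for handler in self._handlers.get(event, []):
            handler(**payload)
```

The runner would call `emit("before_experiment", ...)`, `emit("after_rollback", ...)` and so on at the relevant points, and a Slack or email handler would be registered with `on`.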

Improve error handling when using discovery and a conflict occurs with an existing extension

I got the following unfriendly output when there was a collision with an existing integration extension:

chaos discover chaostoolkit-kubernetes
[2018-01-30 15:35:15 INFO] Attempting to download and install package 'chaostoolkit-kubernetes'
[2018-01-30 15:35:19 INFO] Package downloaded and installed in current environment
Traceback (most recent call last):
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/package.py", line 85, in get_importname_from_package
    name = dist.get_metadata('top_level.txt').split("\n)", 1)[0]
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1493, in get_metadata
    value = self._get(self._fn(self.egg_info, name))
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1605, in _get
    with open(path, 'rb') as stream:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaostoolkit_kubernetes-0.8.0.dist-info/top_level.txt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/russellmiles/.venvs/chaostk/bin/chaos", line 11, in <module>
    sys.exit(cli())
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaosiq/cli.py", line 140, in discover
    download_and_install=not no_install)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/discover.py", line 30, in discover
    package = load_package(package_name)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/package.py", line 45, in load_package
    name = get_importname_from_package(package_name)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/package.py", line 89, in get_importname_from_package
    "Was the package installed properly?".format(p=package_name))
chaoslib.exceptions.DiscoveryFailed: failed to load package 'chaostoolkit-kubernetes' metadata. Was the package installed properly?

Add discover feature

The current interface of chaostoolkit supports running experiments. However, as part of the goal for chaostoolkit, we have always wanted to make it simpler to get into chaos engineering.

The new discover command aims to collect information about a specific target and offer suggestions for potential chaos engineering experiments.

The goal of discover is to let users look up what an extension is capable of doing, along with a summary of the platform/application the extension targets (if available) and a list of suggested chaos experiments.

As not all extensions are Python packages, discover should eventually be able to load a spec file which describes an extension made of process calls or HTTP calls. This may be done in a second iteration of the command.

Catch expired vault secret id

When the vault secret id of the app role has expired, this blows up the whole process. Catch and fail gracefully.

Fix needed for activity-level controls to work as expected

Currently there is a bug when a controls block is applied at the activity level, i.e. not at the top level, and there are no top-level controls applied either.

A fix such as the following needs to be applied to the chaos lib from line 201 onwards:

    for c in controls.copy():
        if "ref" in c:
            for top_level_control in top_level_controls:
                if c["ref"] == top_level_control["name"]:
                    controls.append(deepcopy(top_level_control))
                    break
        else:
            tc = None
            for tc in top_level_controls:
                if c.get("name") == tc.get("name"):
                    break
            else:
                if tc and tc.get("automatic", True):
                    controls.append(deepcopy(tc))
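For anyone who wants to try the snippet in isolation, here is the same logic wrapped in a function. The wrapper name `merge_controls` and its signature are assumptions; the body is the snippet above, unchanged apart from working on a copy of the list.

```python
from copy import deepcopy
from typing import Any, Dict, List

Control = Dict[str, Any]


def merge_controls(controls: List[Control],
                   top_level_controls: List[Control]) -> List[Control]:
    """Apply the proposed lookup to a copy of the activity-level controls."""
    controls = list(controls)
    for c in controls.copy():
        if "ref" in c:
            # Resolve a reference to a control declared at the top level.
            for top_level_control in top_level_controls:
                if c["ref"] == top_level_control["name"]:
                    controls.append(deepcopy(top_level_control))
                    break
        else:
            # tc stays None when there are no top-level controls at all,
            # which is the case described in this issue.
            tc = None
            for tc in top_level_controls:
                if c.get("name") == tc.get("name"):
                    break
            else:
                if tc and tc.get("automatic", True):
                    controls.append(deepcopy(tc))
    return controls
```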

ability to perform tolerance check on 'stdout' property of process probe, rather than 'status'

My hypothesis probe defines a curl request, which outputs its total time to stdout stream.
The probe defines a range tolerance of [0, 1] intended to check if the total time is within one second. Here's that probe for reference:

{
...
    "steady-state-hypothesis": {
        "title": "cURL www.google.com",
        "probes": [
            {
                "type": "probe",
                "name": "http google",
                "tolerance": [0,1],
                "provider": {
                    "type" : "process",
                    "path" : "curl",
                    "arguments": "-o /dev/null -w \"%{time_total}\" -s https://www.google.com"
                }
            }
        ]
    },
...
}

What appears to happen is that the tolerance range of [0, 1] is checked against the status value of the process probe rather than against stdout. This means that if the output is 10.234, the hypothesis is still met.

For example, I would expect this to succeed, as stdout is between 0 and 1

[2019-03-19 13:04:23 DEBUG] [process:54] Running: /usr/bin/curl -o /dev/null -w "%{time_total}" -s https://www.google.com
[2019-03-19 13:04:23 DEBUG] [__init__:115] Data encoding detected as 'ascii' with a confidence of 1.0
[2019-03-19 13:04:23 DEBUG] [activity:179]   => succeeded with '{'stderr': '', 'stdout': '0.420', 'status': 0}'
[2019-03-19 13:04:23 DEBUG] [hypothesis:177] allowed tolerance is [0, 1]
[2019-03-19 13:04:23 INFO] [hypothesis:184] Steady state hypothesis is met!

and I would expect the following to fail as stdout is greater than 1, but it passes as status is 0.

[2019-03-19 13:04:23 DEBUG] [process:54] Running: /usr/bin/curl -o /dev/null -w "%{time_total}" -s https://www.google.com
[2019-03-19 13:04:27 DEBUG] [__init__:115] Data encoding detected as 'ascii' with a confidence of 1.0
[2019-03-19 13:04:27 DEBUG] [activity:179]   => succeeded with '{'stderr': '', 'stdout': '3.397', 'status': 0}'
[2019-03-19 13:04:27 DEBUG] [hypothesis:177] allowed tolerance is [0, 1]
[2019-03-19 13:04:27 INFO] [hypothesis:184] Steady state hypothesis is met!

Is there some way to instruct the process probe tolerance which value it needs to check, rather than just using 'status'? Looking at using the HTTP probe type there is no response time property either.
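A sketch of the behavior being asked for, assuming a hypothetical `target` setting that names the field of the process result to check. `within_tolerance` is not a toolkit function; with the default target it reproduces the behavior seen in the logs above.

```python
from typing import Any, Dict, Sequence


def within_tolerance(result: Dict[str, Any], tolerance: Sequence[float],
                     target: str = "status") -> bool:
    """Check that result[target] is a number inside the [low, high] range."""
    low, high = tolerance
    try:
        # stdout arrives as a string such as '0.420', so convert it first.
        value = float(result[target])
    except (KeyError, TypeError, ValueError):
        return False
    return low <= value <= high
```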

Reading vault secrets is not working as you'd expect

Reading vault secrets is not working as you'd expect.

Currently, the whole Vault payload of a secret is read into the chaostoolkit secret section (including the vault secret metadata). This is not what you'd expect. Also, it's not intuitive that the key argument refers to the path.
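A sketch of the expected behavior, assuming the usual Vault response shapes: KV v1 returns the secret under `data`, while KV v2 nests it under `data.data` next to `data.metadata`. The function `extract_secret` is hypothetical, and a v1 secret that itself has both `data` and `metadata` keys would be misread by this heuristic.

```python
from typing import Any, Dict, Optional


def extract_secret(payload: Dict[str, Any], key: Optional[str] = None) -> Any:
    """Return only the secret values from a Vault read response."""
    data = payload.get("data", {})
    # KV v2: drop the metadata wrapper and keep the inner secret.
    if isinstance(data.get("data"), dict) and "metadata" in data:
        data = data["data"]
    if key is None:
        return data
    return data[key]
```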

Support non Python based extension providers

This issue's goal is for the community to discuss interest and solutions to support extension providers implemented in languages other than Python.

As a reminder, currently, the toolkit supports three extension providers:

  • http: whereby you declare a URL to call and the toolkit does it for you
  • process: where you provide the path to a binary which is executed by the toolkit
  • python: where you define a Python function that is imported from a module extension

While Python is considered a good choice for the core and most extensions, we have always cared about more than a single community. @dastergon asked about this topic on the community Slack and suggested I should get the ball rolling with a high-level view of what would need to be done.

Generally speaking, it seems the simplest/easiest integration for calling native code from Python is to export a native library that exports its symbols (much like a C library). When doing that, Python has facilities to call them for you with ctypes.

This is what people seem to generally do:

Alternatives to ctypes are CFFI and cython. The latter is quite interesting because you provide a C-like wrapper on your native extensions and the generated Python code makes it look fairly native. It is popular but requires more work.

There could be two paths:

  1. The core of the toolkit makes it loud and clear it officially supports ctypes/CFFI and you declare it like this:
{
   "type": "probe",
   "name": "my-go-blah",
   "provider": {
        "type": "go",
        "lib_name": "my-go-lib.so",
        "func": "func_name_in_lib",
        "arguments": { ... }
   }
}

This mirrors what is done for Python, except that here the toolkit would simply expect a native library.

  2. An extension author wraps entirely the native code inside a Python extension using cython. In that case, the "python" provider is enough and would work as it already does.

I think both are valuable but I wonder what communities would prefer.
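To give a feel for the first path, here is a minimal ctypes sketch. `call_native` is a hypothetical helper; a real provider would also have to declare argument and return types, since ctypes assumes C ints by default. For Go, the library would typically be built with `go build -buildmode=c-shared`.

```python
import ctypes
from typing import Any, Optional


def call_native(lib_name: Optional[str], func: str, *args: Any) -> Any:
    """Load a shared library and call one of its exported functions."""
    # On POSIX, passing None loads the symbols already linked into the
    # current process, which includes the C standard library.
    lib = ctypes.CDLL(lib_name)
    return getattr(lib, func)(*args)
```

With the declaration above, the toolkit would end up doing something like `call_native("my-go-lib.so", "func_name_in_lib", ...)`.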

Add support for saving settings

At the moment there is a load_settings function but not one to then save settings back if settings have been changed in some way.

Log HTTP notifications

HTTP-based notifications aren't logged to chaostoolkit.log (unless an error occurs), so it's hard to know whether they worked.

control can't handle ref activity

When a before_activity control is executed, if the activity references another activity, the reference is not looked up beforehand, so the control has no real context.
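The lookup could happen just before the control is called. A sketch, with `resolve_activity` as a hypothetical helper that searches the steady-state probes and the method for the referenced name:

```python
from typing import Any, Dict


def resolve_activity(activity: Dict[str, Any],
                     experiment: Dict[str, Any]) -> Dict[str, Any]:
    """Return the activity itself, or the one it points to through 'ref'."""
    ref = activity.get("ref")
    if ref is None:
        return activity
    hypothesis = experiment.get("steady-state-hypothesis", {})
    candidates = hypothesis.get("probes", []) + experiment.get("method", [])
    for candidate in candidates:
        if candidate.get("name") == ref:
            return candidate
    raise LookupError(f"activity reference '{ref}' not found in experiment")
```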

chaos returns 0 exit code for a failed experiment

I was trying to script chaos to run continuously if the experiment was successful.

The script was simple:

chaos run ./experiments/consensus-recovery.json
while [ $? -eq 0 ]; do
    chaos run ./experiments/consensus-recovery.json
done

Eventually, the experiment stopped being successful but continued to run.
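What the loop above needs is an exit code derived from the outcome of the run. A sketch, assuming the run produces a journal dictionary with a `status` string and a `deviated` flag; `exit_code_for` is a hypothetical helper.

```python
from typing import Any, Dict


def exit_code_for(journal: Dict[str, Any]) -> int:
    """Map an experiment journal to a process exit code."""
    # Anything other than a clean, non-deviating run counts as a failure.
    if journal.get("status") != "completed" or journal.get("deviated", False):
        return 1
    return 0
```

The CLI would then finish with `sys.exit(exit_code_for(journal))`, which makes the `$?` check in the script meaningful.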
