
reckon's Introduction

Reckon - Benchmarking consensus systems for availability under failures.

As seen at HAOC21: Examining Raft's behaviour during partial network failures.

Allowing us to dissect failures:

[Figure: failure-layout]

Installation

First, openvswitch must be loaded as a kernel module on the host. On Ubuntu this is done with apt install openvswitch-switch; other distributions will need an equivalent invocation.

Recommended method with Docker

Run make docker to build a Docker container with all dependencies required to run a test with any of the clients. The command then starts the container and drops you into a Bash shell.

Manual approach

  • Install mininet
  • Build the relevant system and client via cd systems/<relevant-system> && make system && make client

Running a test

A typical test has the following steps:

  • Build the Docker image and enter the container with make (~10 mins)
  • Define the desired test and run with python -m reckon <other arguments>

Positional arguments are as follows: <system> <topology> <workload> <fault>

  • Supported systems:
    • etcd: the strongly consistent key value store used in Kubernetes.
  • Supported topologies:
    • simple: A star topology with end-to-end loss and latency configurable via --link-loss and --link-latency.
    • wan: A WAN style network with --number-nodes data-centers with a node and a client in each data-center.
  • Supported workloads:
    • uniform: Keys are uniformly distributed in [0,--max-key]; values are strings --payload-size bytes long.
  • Supported faults:
    • none: no fault occurs
    • leader: The leader is killed at 1/3 of the duration (T), and recovers at 2/3T.
    • partial-partition: A partial partition is injected at 1/3T and removed at 2/3T. This blocks communication between the leader of the cluster and one follower.
    • intermittent-partial/intermittent-full: An intermittent partial or full partition between one node (initially the leader) and a follower. The leader is always able to communicate with a majority of nodes. Time between faults is set by --mtbf.
    • kill-n: kill --kill-n nodes at the start of the test. This tests maximal fault situations where some of the cluster has already died when the test begins.
  • Other arguments
    • -d: enter a debug mininet cli where the topology is constructed and system started.
    • --duration: duration of the test.
    • --result-location: where to write the results of the test.
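The 1/3T and 2/3T timings used by the leader and partial-partition faults can be illustrated with a small calculation. This is only a sketch: the function name fault_window is hypothetical, and treating the duration as seconds is an assumption, since the unit is not stated here.

```python
def fault_window(duration):
    """Return (inject_at, recover_at) for a test lasting `duration`.

    Mirrors the description above: the fault starts at 1/3 of the
    duration T and is removed at 2/3 of T. The unit (assumed seconds)
    is whatever unit `duration` is given in.
    """
    return duration / 3, 2 * duration / 3


# Example with an assumed 60-unit test duration.
start, end = fault_window(60)
print(f"fault injected at {start:.0f}, removed at {end:.0f}")
```

With an assumed duration of 60, the fault is active for the middle third of the run, leaving equal healthy periods before and after for comparison.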

Test runs can be automated as in scripts/tester.py or scripts/lossy_etcd.py. A safer running environment is provided by scripts/run.sh <command>.
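As a rough illustration of how such automation might assemble an invocation, the sketch below builds the argument list for python -m reckon from the positional arguments and flags listed above. The helper name reckon_command and the example values (a duration of 60, a ./results directory) are assumptions; the accepted value formats are not specified here, and the real scripts may work differently.

```python
import sys


def reckon_command(system, topology, workload, fault, **options):
    """Build an argv list for `python -m reckon` (hypothetical helper).

    Positional order follows the README: <system> <topology> <workload> <fault>.
    Keyword options become long flags, e.g. result_location -> --result-location.
    """
    cmd = [sys.executable, "-m", "reckon", system, topology, workload, fault]
    for name, value in options.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


# Example values are assumptions, not documented defaults.
cmd = reckon_command(
    "etcd", "simple", "uniform", "leader",
    duration=60, result_location="./results",
)
print(" ".join(cmd[1:]))
```

The resulting list can be passed to subprocess.run inside the container; nothing is executed in this sketch.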

reckon's Issues

CI not failing when incorrect arguments applied

The current CI script does not fail when the test does.

This is probably because we are not failing the test on an incorrect output.

The relevant script is below:

#!/usr/bin/env bash

service openvswitch-switch start
ovs-vsctl set-manager ptcp:6640

python benchmark.py etcd_go \
  simple --topo_args n=1,nc=1 uniform --write-ratio 1 none \
  --benchmark_config rate=1000,duration=10,dest=./res,logs=./logs `realpath .`

service openvswitch-switch stop

The test output is below:

Run docker run --privileged --tmpfs /data -v /lib/modules:/lib/modules --network host --name rc --entrypoint /root/scripts/test_entrypoint.sh cjen1/rc:latest
 * Inserting openvswitch module
 * /etc/openvswitch/conf.db does not exist
 * Creating empty database /etc/openvswitch/conf.db
 * Starting ovsdb-server
 * Configuring Open vSwitch system IDs
 * Starting ovs-vswitchd
 * Enabling remote OVSDB managers
usage: benchmark.py [-h] [--topo_args TOPO_ARGS] [--write-ratio WRITE_RATIO]
                    [--payload-size PAYLOAD_SIZE] [--key-range KEY_RANGE]
                    [--fail_args FAIL_ARGS]
                    [--benchmark_config BENCHMARK_CONFIG] [-d]
                    system topology {uniform} failure absolute_path
benchmark.py: error: unrecognized arguments: --write_ratio /root
 * Exiting ovs-vswitchd (56)
 * Exiting ovsdb-server (46)

Despite this error, the overarching test succeeds.
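One plausible contributing factor is that a Bash script without set -e exits with the status of its last command, which here is service openvswitch-switch stop rather than the benchmark. The sketch below shows one way to keep the cleanup step while still reporting the test's own exit status. The helper name run_with_cleanup is hypothetical, and this is not the project's actual fix.

```python
import subprocess
import sys


def run_with_cleanup(test_cmd, cleanup_cmd=None):
    """Run test_cmd, always run cleanup_cmd, and return the test's exit status."""
    try:
        status = subprocess.run(test_cmd).returncode
    finally:
        # Cleanup runs even if the test command could not be started,
        # but its own exit status is deliberately ignored.
        if cleanup_cmd is not None:
            subprocess.run(cleanup_cmd)
    return status


# Stand-in for a failing benchmark: argparse exits with status 2 on bad arguments.
failing_test = [sys.executable, "-c", "import sys; sys.exit(2)"]
status = run_with_cleanup(failing_test)
print("test exit status:", status)
# A CI entrypoint would finish with sys.exit(status) so the job fails too.
```

In the CI setting, test_cmd would be the benchmark invocation and cleanup_cmd the command that stops openvswitch.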

Partial Network Partitions

It would be useful to be able to simulate partial partitions such as the one described in https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/

There are a couple of possible approaches to do this within Mininet.

  1. Implement a router which 'misbehaves' on command to produce the partition
  2. Use the iptables approach used in: https://www.usenix.org/system/files/osdi18-alquraan.pdf

Option 1 would best approximate the exact scenario which caused the partial partition, and thus would probably give the most faithful simulation.

Option 2, however, should be substantially easier to set up: the failure injector only needs to be handed the leader and a follower, and from there it is just a matter of correctly injecting the failure. There could be odd ramifications, though: since the failure is local to the node, the node could inspect it.
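A minimal sketch of what Option 2 might look like is given below. It only generates iptables commands that drop traffic between two addresses; it does not run them. The function name, the dictionary layout, and the 10.0.0.x addresses are assumptions for illustration. In Mininet each command would have to be executed inside the corresponding host's namespace (for example through the host's cmd method).

```python
def partial_partition_rules(leader_ip, follower_ip, remove=False):
    """Return iptables commands that block traffic between two nodes.

    Each node drops packets arriving from the other, so the pair cannot
    communicate while both can still reach the rest of the cluster.
    With remove=True the same rules are deleted (-D) instead of appended (-A).
    """
    action = "-D" if remove else "-A"
    return {
        "leader": ["iptables", action, "INPUT", "-s", follower_ip, "-j", "DROP"],
        "follower": ["iptables", action, "INPUT", "-s", leader_ip, "-j", "DROP"],
    }


# Hypothetical addresses for a leader and one follower.
for node, rule in partial_partition_rules("10.0.0.1", "10.0.0.2").items():
    print(node, ":", " ".join(rule))
```

Calling the function again with remove=True yields the commands that heal the partition, which keeps injection and recovery symmetric.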

Intermittent partitions currently silently failing

Intermittent partitions are currently silently failing in CI:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/root/reckon/failures/intermittent_partial.py", line 53, in thread_fn
    time.sleep(self.sleep_duration)
AttributeError: 'IntermittentPartialPartition' object has no attribute 'sleep_duration'
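The traceback shows two separate problems: an attribute read before it was ever set, and an exception in a background thread that does not stop the run. The sketch below illustrates one way to address both. The class IntermittentFault is hypothetical and is not reckon's IntermittentPartialPartition; it sets sleep_duration in the constructor and re-raises any thread error when joined.

```python
import threading
import time


class IntermittentFault:
    """Hypothetical fault injector that surfaces errors from its thread."""

    def __init__(self, sleep_duration, rounds=2):
        # Set every attribute the thread reads before the thread can start.
        self.sleep_duration = sleep_duration
        self.rounds = rounds
        self.events = []
        self.error = None
        self._thread = threading.Thread(target=self._guarded, daemon=True)

    def _guarded(self):
        try:
            self.thread_fn()
        except Exception as exc:
            # Remember the failure so join() can report it to the caller.
            self.error = exc

    def thread_fn(self):
        for _ in range(self.rounds):
            self.events.append("inject")
            time.sleep(self.sleep_duration)
            self.events.append("heal")
            time.sleep(self.sleep_duration)

    def start(self):
        self._thread.start()

    def join(self):
        self._thread.join()
        if self.error is not None:
            raise self.error


fault = IntermittentFault(sleep_duration=0.01)
fault.start()
fault.join()
print(fault.events)
```

Because join re-raises the stored exception in the main thread, a missing attribute would abort the test run instead of only printing a traceback.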
