Qumomf

Qumomf is a Tarantool vshard high availability tool which supports discovery and recovery.

Discovery

Qumomf actively crawls through your topologies and analyzes them. It reads basic vshard info such as replication status and configuration.

You must provide at least one router, which serves as the entry point for the discovery process.

Configuration

For a sample qumomf configuration and its description see example.

How to add a new cluster

Edit your configuration file and add a new cluster, e.g.:

clusters:
  my_cluster:
    routers:
      - name: 'my_cluster_router_1'
        addr: 'localhost:3301'

You can override the default connection settings for each cluster:

clusters:
  my_cluster:
    connection:
      user: 'tnt'
      password: 'tnt'
      connect_timeout: 10s
      request_timeout: 10s

    routers:
      - name: 'my_cluster_router_1'
        addr: 'localhost:3301'

For a sample vshard configuration, see qumomf example or Tarantool documentation.

Start qumomf, and it will discover all clusters defined in the configuration.

Topology recovery

Currently qumomf supports only automated master recovery. It is a configurable option and can be disabled globally or for a single cluster via configuration.

Master election supports two modes: idle and smart. The election mode can be configured for each cluster independently.

Both electors support these options:

  • reasonable_follower_lsn_lag - on crash recovery, followers that lag by more than the given LSN must not participate in the election.
  • reasonable_follower_idle - on crash recovery, followers that lag by more than the given duration must not participate in the election.

A value of 0 disables the corresponding check.
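
The two thresholds can be pictured as a simple filter over the followers. The sketch below is only an illustration in Python, not qumomf's implementation: the `Follower` type, its field names and the `eligible_followers` function are hypothetical, and the lag and idle values are assumed to be already measured.

```python
from dataclasses import dataclass


@dataclass
class Follower:
    """Hypothetical view of a follower at the moment the master failed."""
    uri: str
    lsn_lag: int         # how many LSNs the follower is behind the master
    idle_seconds: float  # time since it last heard from the master


def eligible_followers(followers, reasonable_lsn_lag=0, reasonable_idle=0.0):
    """Drop followers that exceed either threshold; 0 disables a check."""
    result = []
    for f in followers:
        if reasonable_lsn_lag > 0 and f.lsn_lag > reasonable_lsn_lag:
            continue  # too far behind in LSN
        if reasonable_idle > 0 and f.idle_seconds > reasonable_idle:
            continue  # has not heard from the master for too long
        result.append(f)
    return result
```

With both thresholds left at 0 the function returns every follower unchanged, which mirrors the "0 disables" rule above.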

Idle

A naive and simple elector that picks the alive replica which most recently communicated with the failed master (received data or a heartbeat signal). Followers with a negative priority are excluded from the master election.
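
As a rough illustration of the idle rule, the following Python sketch picks the alive, non-negative-priority replica with the smallest idle time. The `Replica` type and `elect_idle` function are hypothetical names used for this example only; qumomf's own code may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    """Hypothetical snapshot of a replica taken when the master failed."""
    uri: str
    alive: bool
    idle_seconds: float  # time since last data or heartbeat from the master
    priority: int = 0


def elect_idle(replicas) -> Optional[Replica]:
    """Return the alive replica that heard from the master most recently."""
    candidates = [r for r in replicas if r.alive and r.priority >= 0]
    if not candidates:
        return None  # nobody is eligible, so there is no successor
    return min(candidates, key=lambda r: r.idle_seconds)
```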

Smart

This elector tries to take into account as many metrics as it can:

  • vshard configuration consistency (prefer a replica that has the same configuration as the master),
  • the upstream status the replica had before the crash,
  • how far the replica is behind the master, comparing its LSN to the master's LSN,
  • the last time the replica received data or a heartbeat signal from the master,
  • user promotion rules based on instance priorities.

You can define your own promotion rules that influence master election during a failover. Each instance has a priority set via the configuration. A negative priority excludes a follower from the election process.
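
One way to picture how several metrics can be combined is a lexicographic ranking. The Python sketch below is only an illustration: the `Candidate` type, its fields, the reduction of upstream status to a boolean, the assumption that a higher priority is preferred, and above all the order in which the metrics are compared are assumptions made for this example, not qumomf's actual rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical per-replica metrics gathered before the crash."""
    uri: str
    same_vshard_config: bool  # configuration matches the master's
    upstream_ok: bool         # upstream status was healthy before the crash
    lsn_lag: int              # distance from the master's LSN
    idle_seconds: float       # time since last data or heartbeat
    priority: int = 0         # user-defined; negative excludes the replica


def elect_smart(candidates) -> Optional[Candidate]:
    """Pick the best candidate under an assumed, illustrative ordering."""
    eligible = [c for c in candidates if c.priority >= 0]
    if not eligible:
        return None

    def rank(c):
        # Smaller tuples win; the order of the fields is an assumption.
        return (not c.same_vshard_config, not c.upstream_ok,
                -c.priority, c.lsn_lag, c.idle_seconds)

    return min(eligible, key=rank)
```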

Recovery hooks

Hooks are invoked during the recovery process via a shell, bash in particular.

These hooks are available:

  • PreFailover: executed immediately before qumomf takes recovery action. Failure (non-zero exit code) of any of these processes aborts the recovery. Hint: this gives you the opportunity to abort recovery based on some internal state of your system.
  • PostSuccessfulFailover: executed at the end of successful recovery.
  • PostUnsuccessfulFailover: executed at the end of unsuccessful recovery.

Any command that starts with "&" is executed asynchronously, and the failure of such a process is ignored.

Qumomf executes lists of commands sequentially, in order of definition.

A naive implementation might look like:

hooks:
  shell: bash
  pre_failover:
    - "echo 'Will recover from {failureType} on {failureCluster}' >> /tmp/qumomf_recovery.log"
  post_successful_failover:
    - "echo 'Recovered from {failureType} on {failureCluster}. Set: {failureReplicaSetUUID}; Failed: {failedURI}; Successor: {successorURI}' >> /tmp/qumomf_recovery.log"
  post_unsuccessful_failover:
    - "echo 'Failed to recover from {failureType} on {failureCluster}. Set: {failureReplicaSetUUID}; Failed: {failedURI}' >> /tmp/qumomf_recovery.log"
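
The execution rules described above (sequential order, "&" for asynchronous commands, a non-zero exit code as failure) can be sketched in Python as follows. This is an illustration under assumptions, not qumomf's code: `run_hooks` is a hypothetical name, and stopping at the first failing synchronous command follows the PreFailover description; how the post-failover hooks treat a failing command is not stated here.

```python
import subprocess


def run_hooks(commands, shell="bash"):
    """Run hook commands in order of definition.

    Returns (ok, background): ok becomes False as soon as a synchronous
    command exits non-zero; background holds handles of "&" commands.
    """
    background = []
    for command in commands:
        if command.startswith("&"):
            # Asynchronous: start it and move on, ignoring its outcome.
            proc = subprocess.Popen([shell, "-c", command[1:].strip()])
            background.append(proc)
            continue
        # Synchronous: wait for completion and check the exit code.
        if subprocess.run([shell, "-c", command]).returncode != 0:
            return False, background
    return True, background
```

The returned handles let a caller wait for background commands if it wants to, without making their result part of the decision.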

Hooks arguments and environment

Qumomf provides all hooks with failure and recovery related information, such as the UUID/URI of the failed instance, the UUID/URI of the promoted instance, the type of failure, the name of the cluster, etc.

This information is passed independently in two ways, and you may choose to use one or both:

Environment variables:

  • QUM_FAILURE_TYPE
  • QUM_FAILED_UUID
  • QUM_FAILED_URI
  • QUM_FAILURE_CLUSTER
  • QUM_FAILURE_REPLICA_SET_UUID
  • QUM_COUNT_FOLLOWERS
  • QUM_COUNT_WORKING_FOLLOWERS
  • QUM_COUNT_REPLICATING_FOLLOWERS
  • QUM_COUNT_INCONSISTENT_VSHARD_CONF
  • QUM_IS_SUCCESSFUL

And, if a recovery was successful:

  • QUM_SUCCESSOR_UUID
  • QUM_SUCCESSOR_URI
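
A hook command can call any program that reads these variables. The Python sketch below is one possible hook script; the function name `summarize_failure` and the output format are made up for this example, and it relies only on variable names listed above. It checks for the presence of QUM_SUCCESSOR_URI rather than interpreting QUM_IS_SUCCESSFUL, whose exact value format is not documented here.

```python
import os


def summarize_failure(env=None):
    """Build a one-line summary from the QUM_* variables."""
    env = os.environ if env is None else env
    parts = [
        "type=" + env.get("QUM_FAILURE_TYPE", "unknown"),
        "cluster=" + env.get("QUM_FAILURE_CLUSTER", "unknown"),
        "set=" + env.get("QUM_FAILURE_REPLICA_SET_UUID", "unknown"),
        "failed=" + env.get("QUM_FAILED_URI", "unknown"),
    ]
    # The successor variables are only provided after a successful recovery.
    successor = env.get("QUM_SUCCESSOR_URI")
    if successor:
        parts.append("successor=" + successor)
    return " ".join(parts)


if __name__ == "__main__":
    print(summarize_failure())
```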

Command line text replacement:

Qumomf replaces the following tokens in your hook commands:

  • {failureType}
  • {failedUUID}
  • {failedURI}
  • {failureCluster}
  • {failureReplicaSetUUID}
  • {countFollowers}
  • {countWorkingFollowers}
  • {countReplicatingFollowers}
  • {countInconsistentVShardConf}
  • {isSuccessful}

And, if a recovery was successful:

  • {successorUUID}
  • {successorURI}
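
To make the effect of token replacement concrete, here is a small Python sketch that substitutes `{name}` tokens in a command string. It is an illustration only: `render_command` is a hypothetical helper, and leaving unknown tokens untouched is an assumption rather than documented qumomf behaviour.

```python
import re

# Matches tokens such as {failureType} or {successorURI}.
_TOKEN = re.compile(r"\{(\w+)\}")


def render_command(template, values):
    """Replace each known {token} in one pass; unknown tokens stay as-is."""
    return _TOKEN.sub(lambda m: values.get(m.group(1), m.group(0)), template)
```

Doing the substitution in a single pass means a value that itself contains braces is not expanded a second time.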

API

Qumomf exposes several debug endpoints:

  • /debug/metrics - runtime and app metrics in Prometheus format,
  • /debug/health - health check,
  • /debug/about - the app version and build date.
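
A monitoring script could poll the health endpoint, for example. The Python sketch below uses only the standard library; the base URL is an assumption (the listen address depends on your configuration), and treating an HTTP 200 response as "healthy" is also an assumption about the endpoint's behaviour.

```python
import urllib.error
import urllib.request


def endpoint_url(base_url, name):
    """Build the URL of a debug endpoint such as "health" or "about"."""
    return base_url.rstrip("/") + "/debug/" + name


def is_healthy(base_url, timeout=2.0):
    """Return True if /debug/health answers with HTTP 200 (assumed meaning)."""
    try:
        url = endpoint_url(base_url, "health")
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```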

See the API documentation for the endpoints that return information about cluster states, recoveries and problems.

Hacking

Feel free to open issues and pull requests with your ideas on how to improve qumomf.

To run unit and integration tests:

make env_up
make run_tests
make env_down
