hseeberger / constructr Goto Github PK

View Code? Open in Web Editor NEW

212.0 16.0 37.0 346 KB

Coordinated (etcd, ...) cluster construction for dynamic (cloud, containers) environments

License: Apache License 2.0

Scala 99.24% Shell 0.76%

akka akka-cluster etcd

constructr's Introduction

ConstructR

ConstructR is for bootstrapping (construction) an Akka cluster by using a coordination service.

Disambiguation: Despite the similar name, ConstructR is not related to Lightbend ConductR.

ConstructR utilizes a key-value coordination service like etcd to automate bootstrapping or joining a cluster. It stores each member node under the key /constructr/$clusterName/nodes/$address where $clusterName is for disambiguating multiple clusters and $address is a Base64 encoded Akka Address. These keys expire after a configurable time in order to avoid stale information. Therefore ConstructR refreshes each key periodically.

In a nutshell, ConstructR is a state machine which first tries to get the nodes from the coordination service. If none are available it tries to acquire a lock, e.g. via a CAS write for etcd, and uses itself or retries getting the nodes. Then it joins using these nodes as seed nodes. After that it adds its address to the nodes and starts the refresh loop:

                  ┌───────────────────┐              ┌───────────────────┐
              ┌──▶│   GettingNodes    │◀─────────────│BeforeGettingNodes │
              │   └───────────────────┘    delayed   └───────────────────┘
              │             │     │                            ▲
  join-failed │   non-empty │     └──────────────────────┐     │ failure
              │             ▼               empty        ▼     │
              │   ┌───────────────────┐              ┌───────────────────┐
              └───│      Joining      │◀─────────────│      Locking      │
                  └───────────────────┘    success   └───────────────────┘
                            │
              member-joined │
                            ▼
                  ┌───────────────────┐
                  │    AddingSelf     │
                  └───────────────────┘
                            │     ┌────────────────────────────┐
                            │     │                            │
                            ▼     ▼                            │
                  ┌───────────────────┐              ┌───────────────────┐
                  │ RefreshScheduled  │─────────────▶│    Refreshing     │
                  └───────────────────┘              └───────────────────┘

If something goes finally wrong when interacting with the coordination service, e.g. a permanent timeout after a configurable number of retries, ConstructR terminates its ActorSystem in the spirit of "fail fast".

// All releases including intermediate ones are published here,
// final ones are also published to Maven Central.
resolvers += Resolver.bintrayRepo("hseeberger", "maven")

libraryDependencies ++= Vector(
  "de.heikoseeberger" %% "constructr" % "0.19.0",
  "de.heikoseeberger" %% "constructr-coordination-etcd" % "0.19.0", // in case of using etcd for coordination
  ...
)

Simply add the ConstructrExtension to the extensions configuration setting:

akka.extensions = [de.heikoseeberger.constructr.ConstructrExtension]

This will start the Constructr actor as a system actor. Alternatively start it yourself as early as possible if you feel so inclined.

The following listing shows the available configuration settings with their defaults:

constructr {
  coordination {
    host = localhost
    port = 2379
  }

  coordination-timeout    = 3 seconds  // Maximum response time for coordination service (e.g. etcd)
  join-timeout            = 15 seconds // Might depend on cluster size and network properties
  abort-on-join-timeout   = false      // Abort the attempt to join if true; otherwise restart the process from scratch
  max-nr-of-seed-nodes    = 0          // Any nonpositive value means Int.MaxValue
  nr-of-retries           = 2          // Nr. of tries are nr. of retries + 1
  refresh-interval        = 30 seconds // TTL is refresh-interval * ttl-factor
  retry-delay             = 3 seconds  // Give coordination service (e.g. etcd) some delay before retrying
  ttl-factor              = 2.0        // Must be greater or equal 1 + ((coordination-timeout * (1 + nr-of-retries) + retry-delay * nr-of-retries)/ refresh-interval)!
  ignore-refresh-failures = false      // Ignore failures once machine is already in "Refreshing" state. It prevents from FSM being terminated due to exhausted number of retries.

}

Coordination

ConstructR comes with out-of-the-box support for etcd: simply depend on the "constructr-coordination-etcd" module. If you want to use some other coordination backend, e.g. Consul, simply implement the Coordination trait from the "constructr-coordination" module and make sure to provide the fully qualified class name via the constructr.coordination.class-name configuration setting.

Community Coordination Implementations

There are some implementations for other coordination backends than etcd:

Tecsisa/constructr-consul: This library enables to use Consul as cluster coordinator in a ConstructR based cluster.
everpeace/constructr-redis: This library enables to use Redis as cluster coordinator in a ConstructR based cluster.
typesafehub/constructr-zookeeper: This library enables to use ZooKeeper as cluster coordinator in a ConstructR based cluster.

Testing

etcd must be running, e.g.:

docker run \
  --detach \
  --name etcd \
  --publish 2379:2379 \
  quay.io/coreos/etcd:v2.3.8 \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://192.168.99.100:2379

Contribution policy

Contributions via GitHub pull requests are gladly accepted from their original author. Along with any pull requests, please state that the contribution is your original work and that you license the work to the project under the project's open source license. Whether or not you state this explicitly, by submitting any copyrighted material via pull request, email, or other means you agree to license the material under the project's open source license and warrant that you have the legal authority to do so.

Please make sure to follow these conventions:

For each contribution there must be a ticket (GitHub issue) with a short descriptive name, e.g. "Respect seed-nodes configuration setting"
Work should happen in a branch named "ISSUE-DESCRIPTION", e.g. "32-respect-seed-nodes"
Before a PR can be merged, all commits must be squashed into one with its message made up from the ticket name and the ticket id, e.g. "Respect seed-nodes configuration setting (closes #32)"

License

This code is open source software licensed under the Apache 2.0 License.

constructr's People

Contributors

Stargazers

Watchers

constructr's Issues

Removing the key from the KV when gracefully leaving the akka cluster

When an akka cluster node gracefully leaves the cluster, wouldn't it make sense to remove it from the KV store (instead of waiting until it times out)?

Improve error message when adding self to consul fails

putKeyWithSession returns Success(false) when the PUT call to consul returnes status code 200, and indicates (the body of the response) that the update has not taken place (https://consul.io/docs/agent/http/kv.html)).

In that case, the if result in ConsulCoordination.addSelf will cause the result to be a Failure(_) with the generic error message java.util.NoSuchElementException: Future.filter predicate is not satisfied, and no obvious clue in the stacktrace either.

Upgrade to Akka HTTP 2.0.2

Remove constructr-cassandra and fold constructr-akka into constructr-machine

I have learned that it's not a real live scenario to run Cassandra on top of an elastic topology. Therefore I want to simplify ConstructR by removing constructr-cassandra and folding constructr-akka into constructr-machine.

Use MemberJoined instead of MemberUp

This depends on merging akka/akka#18729 and a new Akka release (probably 2.4.1).

Allow 201 Created as status code for EtcdCoordination.refresh

When refreshing for some reason (GC pause or other delays) can't refresh timely enough, the entry in etcd store has disappeared and hence refresh results in a 201 Created instead of a 200 OK. This is perfectly fine, since the system is still up an running.

Make TTL-factor rules consistent

The README (and reference.conf) mention:

ttl-factor // Must be greater than 1 + (coordination-timeout * (1 + coordination-retries) / refresh-interval)!

However, ConstructrMachineSettings checks:

require(
      ttlFactor > 1 + coordinationTimeout / refreshInterval,
      s"ttl-factor must be greater than one plus coordination-timeout divided by refresh-interval, but was $ttlFactor!"
    )

In other words: the automated check does not take into account the allowed number of coordination retries.

Do we want to make those consistent?

Not handling MemberJoined and MemberUp cluster events

It is possible that akka cluster MemberJoined and MemberUp events are send while the ConstructrMachine is in state AddingSelf. In this state the events are unhandled and therefore the following messages are written as warnings to the log:

2016-09-01T13:38:46Z MacBook-Pro-6.local WARN  ConstructrMachine [sourceThread=conductr-akka.actor.default-dispatcher-2, akkaTimestamp=13:38:46.609UTC, akkaSource=akka.tcp://[email protected]:9024/user/reaper/constructr/constructr-machine, sourceActorSystem=conductr] - unhandled event MemberUp(Member(address = akka.tcp://[email protected]:9044, status = Up)) in state AddingSelf
2016-09-01T13:38:46Z MacBook-Pro-6.local WARN  ConstructrMachine [sourceThread=conductr-akka.actor.default-dispatcher-30, akkaSource=akka.tcp://[email protected]:9034/user/reaper/constructr/constructr-machine, sourceActorSystem=conductr, akkaTimestamp=13:38:46.141UTC] - unhandled event MemberJoined(Member(address = akka.tcp://[email protected]:9054, status = Joining)) in state AddingSelf

The current code first unsubscribes to the akka cluster events before moving into AddingSelf state. However, it might be a raise condition that other MemberUp or MemberJoined events are received while Akka still hasn't processed the unsubscription.

The 'Unexpected status code' message does not log the URI of the failed request

Respect seed-nodes configuration setting

If akka.cluster.seed-nodes is defined, akka-constructr should keep still and just log at info level.

Kill Cassandra in case of termination of ConstructR

Rename address to node

E.g. in AddressSerialization and selfAddress.

'docker-machine ip default' fails

When trying to run the tests, I usually get:

[info] * de.heikoseeberger.constructr.akka.MultiNodeConsulConstructrSpec
[JVM-5] Error saving host to store: remove /home/aengelen/.docker/machine/machines/default/config.json: no such file or directory
[JVM-5] *** RUN ABORTED ***
[JVM-5]   java.lang.ExceptionInInitializerError:
[JVM-5]   at de.heikoseeberger.constructr.akka.MultiNodeConsulConstructrSpec.<init>(MultiNodeConsulConstructrSpec.scala:67)
[JVM-5]   at de.heikoseeberger.constructr.akka.MultiNodeConsulConstructrSpecMultiJvmNode5.<init>(MultiNodeConsulConstructrSpec.scala:65)
[JVM-5]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[JVM-5]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[JVM-5]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[JVM-5]   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
[JVM-5]   at java.lang.Class.newInstance(Class.java:442)
[JVM-5]   at org.scalatest.tools.Runner$.genSuiteConfig(Runner.scala:2644)
[JVM-5]   at org.scalatest.tools.Runner$$anonfun$37.apply(Runner.scala:2461)
[JVM-5]   at org.scalatest.tools.Runner$$anonfun$37.apply(Runner.scala:2460)
[JVM-5]   ...
[JVM-5]   Cause: java.lang.RuntimeException: Nonzero exit value: 1
[JVM-5]   at scala.sys.package$.error(package.scala:27)
[JVM-5]   at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:132)
[JVM-5]   at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
[JVM-5]   at de.heikoseeberger.constructr.akka.ConsulConstructrMultiNodeConfig$.<init>(MultiNodeConsulConstructrSpec.scala:40)
[JVM-5]   at de.heikoseeberger.constructr.akka.ConsulConstructrMultiNodeConfig$.<clinit>(MultiNodeConsulConstructrSpec.scala)
[JVM-5]   at de.heikoseeberger.constructr.akka.MultiNodeConsulConstructrSpec.<init>(MultiNodeConsulConstructrSpec.scala:67)
[JVM-5]   at de.heikoseeberger.constructr.akka.MultiNodeConsulConstructrSpecMultiJvmNode5.<init>(MultiNodeConsulConstructrSpec.scala:65)
[JVM-5]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[JVM-5]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[JVM-5]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[JVM-5]   ...

... which suggests 'docker-machine ip default' fails. The strange thing is it sometimes works for the Etcd test.

Running 'docker-machine ip default' on the command-line works fine.

I'm running on Linux

When hard-coding the docker-machine ip the tests run without further problems.

Upgrade to akka-http 2.0.1

[cas] Use official cassandra image

Currently we use hseeberger/cassandra which is based on a PR against the official cassandra image. Yet the PR won't be accepted and hseeberger/cassandra dropped.

Using RUN we can rewrite docker-entrypoint.sh to use the seed-provider from ConstructR.

Allow multiple endpoints for etcd coordinator

To take advantage of etcd high availability cluster it would be nice to be able to define more than one endpoint. Should be not that hard to do.

@hseeberger If you agree, I would like to work on it and create a PR?

Outgoing Connection 'caches' the IP in Coordination

In Constructr.scala the Connection is initialized and the solved IPs are cached but it's using the same one in every flow materialization. As a result, if that Consul node got down, ConstructR won't try with another IP.

val connection = Http()(context.system).outgoingConnection(host, port)
Coordination("akka", context.system.name, context.system.settings.config)
            (connection, ActorMaterializer())

This behaviour might be observed during DNS resolution of coordination cluster nodes.

Possible related info:

akka/akka#19419
https://gitter.im/akka/akka?at=569552acee13050b38a32091

Using the consul service catalog

When using consul, ConstructR currently uses the consul KV store both for storing the lock that determines whether there is already a seed node initializing, and for storing the addresses of seed nodes.

Wouldn't it make sense to use the consul service catalog to keep track of the addresses of seed nodes?

Cleanup ttl handling

ttl handling in ConstructrMachine is confusing
ttl documentation is inconsistent (see #64)
ttl should always be a FiniteDuration

Investigate changes for etcd 3.0

A feature complete beta of etcd 3.0 has been released: https://github.com/coreos/etcd/releases/tag/v3.0.0-beta.0. There seem to be significant changes in the API that probably affect ConstructR, e.g. "No key-value hierarchy / directories" (we use hierarchies), "Leases" (instead of TTL) and "Lock: mutex on top of etcd ".

Make constructr-coordination depend on akka-actor

Else constructr-coordination depends on Akka 2.3.

Uniform treatment of outdated refresh

As described in #22, etcd doesn't really care about refreshing an outdated (TTLed) entry, it simply creates a new one. Consul instead, when trying to refresh an outdated session, returns NotFound.

Therefore we need to change the general behavior to transitioning back to AddingSelf in case of such an issue with refreshing. In etcd we can simply add a prevExist=true to the PUT to also get a NotFound.

A RefreshResult needs to be introduced with the existing Refreshed and a new SelfNotFound as subtypes.

Consul support

Excellent work!. Thanks for sharing this. Do you have any plans to support consul in addition to etcd in the near future?. Would you accept PRs in this regard?.

Upgrade to Akka 2.4.9

Fix configuration import in MultiNodeConsulConstructrSpec

Currently is importing the etcd configuration object instead of the consul one. Minor change.

Make locking 'idempotent' in consul coordination

After fixing #59, a review of locking stage in consul coordination is needed. See if it's neccessary to update the full TTL after a retry.

Graceful system shutdown when Constructr is terminated

When Constructr is terminated (see #92) and the node is member of the cluster, instead of just terminating the system, the node should leave the cluster so that is not marked as unreachable.

Add tests for EtcdCoordination

Provide custom supervisor strategy for cassandra.Conductr

Rethink failure handling

Currently ConstructR is pretty aggressive when it comes to failure: in many cases the system is terminated. Further on it's not easy to customize failure handling.

My current thoughts are:

Make things as simple as possible
By default the system should not terminate in the face of failure

[cas] Use Settings.clusterName for Coordination initialization

Currently the name of the actor system is used, see https://github.com/hseeberger/constructr/blob/master/constructr-cassandra/src/main/scala/de/heikoseeberger/constructr/cassandra/Constructr.scala#L69.

unhandled event MemberUp(Member(address = ...., status = Up)) in state AddingSelf

When starting multiple nodes at the same time I sometimes see this.

Mention contructs-coordination-consul in the README

https://github.com/Tecsisa/constructr-consul

Of high quality and uses Akka HTTP to talk to Consul.

Isolated mode

I have an application that is intended to run clustered, but e.g. for local development it might be convenient to be able to run a single, isolated instance without connecting etcd/consul.

What would be a convenient way to achieve that? Perhaps we could introduce a 'dummy' coordination backend and I could introduce an option to select that one in my application?

Check that ttl-factor > 1

While README.md and the comments in reference.conf say that "ttl-factor must be greater than one", that's never actually checked.

Limit number of nodes used as seed nodes

This should be configurable. Maybe use negative values for "no limit".

The party creating the Coordination instance should pass in a LoggingAdapter

It would be nice to have access to the Akka logger in every particular coordination implementation.

See Tecsisa/constructr-consul#9

Add git.useGitDescribe to build

git.useGitDescribe := true

Split off coordination implementations

The etcd based impl remains in this project, the Consul one is recreated by @juanjovazquez and others in their own repository.

[mac] Allow existing nodes in EtcdCoordination.addSelf

If a node is restarted quickly, addSelf produces a 200 status code which should be considered valid.

Upgrade to cassandra 3.0.2

Investigate using Circe instead of spray-json

Allow existing nodes in ConsulCoordination.addSelf

If a node is restarted quickly, addSelf doesn't behave correctly since a previous session can already have acquired the key so that a new session cannot obtain the lock. Therefore, the former session will expire after the ttl and the key will disappear.

ZK

Any plans to add support for zookeeper?

Nice project!

More control over what happens when coordination fails

The README still mentions: If something goes wrong, e.g. a timeout (after configurable retries are exhausted) when interacting with the coordination service, ConstructR by default terminates its ActorSystem. At least for constructr-akka this can be changed by providing a custom SupervisorStrategy to the manually started Constructr actor, but be sure you know what you are doing.

The latter is no longer the case, right? de.heikoseeberger.constructr.akka.Constructr is final, fixes its SupervisionStrategy to SupervisorStrategy.stoppingStrategy and terminates its ActorSystem when coordination terminates.

I'd like to have some more control over what to do when coordination fails.

constructr-cassandra not found

Adding
resolvers += Resolver.bintrayRepo("hseeberger", "maven")

libraryDependencies ++= Vector(
"de.heikoseeberger" %% "constructr-cassandra" % "0.13.2",
...
)

in my build.sbt result in an unresolved dependency.

Note: Unresolved dependencies path:
[warn] de.heikoseeberger:constructr-cassandra_2.11:0.13.2 (/Users/elio/progetti/iptm/build.sbt#L29-70)
[warn] +- eu.sia.innhub:iptm_2.11:1.0
sbt.ResolveException: unresolved dependency: de.heikoseeberger#constructr-cassandra_2.11;0.13.2: not found

The library seems not to be present on bintray (http://dl.bintray.com/hseeberger/maven/de/heikoseeberger/)

GettingNodes: transition to BeforeGettingNodes, i.e. effectively unlimited retries
Locking: transition to BeforeGettingNodes, i.e. effectively unlimited retries; also make lock idempotent by first reading
AddingSelf: use explicit retries, fail after exhausted
Refreshing: unlimited retries

In all cases a warning or even an error should be logged.