
atomix / copycat


A novel implementation of the Raft consensus algorithm

Home Page: http://atomix.io/copycat

License: Apache License 2.0

Shell 0.15% Java 99.85%
atomix raft copycat distributed-systems consensus raft-consensus-algorithm state-machine consensus-algorithm replication database

copycat's Introduction

Copycat


Copycat has moved!

Copycat 2.x is now atomix-raft and includes a variety of improvements to Copycat 1.x:

  • Multiple state machines per cluster
  • Multiple sessions per client
  • Index-free memory mapped log
  • Per-state-machine snapshots
  • Framework agnostic serialization
  • Partitioning
  • etc

This repository is no longer officially maintained.

copycat's People

Contributors

bgloeckle, georgekankava, himanshug, jhall11, jhalterman, jpwatson, kuujo, madjam


copycat's Issues

client fails if it can't register a session?

The Copycat internals doc says:

If the registration fails, the client attempts to connect to another random server and register a new session again. In the event that the client fails to register a session with any server, the client fails and must be restarted.

Why give up? What if there was just a temporary network outage? How bad is it to restart a client (is that a manual or automatic thing)?

Explaining how to build this repository

Hello,

Can you put in the readme how to build the raftclient and raftserver jars? I eventually figured out what Maven was and how to build with it, but others might find written instructions useful.

Steps I took to build:

  • Downloaded Maven
  • Cloned this repo to my computer
  • cd'd into the root directory of this repo
  • Ran 'mvn clean install'

This appeared to create the two client/server jars.

Are these the correct steps for the command line to build this project?

Expired sessions may not be expired in the internal state machine in certain cases

There's a race condition in the way sessions are currently expired. When register, keep alive, and unregister entries are applied to the state machine on each server, timestamps for all sessions are checked and sessions are marked as suspicious if their expiration time has passed. But the leader is responsible for expiring sessions that are marked suspicious. This ensures that sessions cannot be expired during leadership changes or other events that prevent a client from keeping its session alive.

But there is a case where a client may submit a keep alive request - thus renewing its session timeout and marking its session as no longer suspicious - and a leader may log an UnregisterEntry while the client's keep-alive is in the process of being committed. In that case, the keep-alive will be committed and applied before the leader's UnregisterEntry, and the following logic will result in close(Session) but not expire(Session) being called on the state machine:
https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/ServerStateMachine.java#L302

The UnregisterEntry should be enriched with an expired flag to indicate whether the leader expired the session or the client explicitly closed its session. This will ensure that the proper state machine methods are called upon session expiration.
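
To make the proposed fix concrete, here is a minimal Java sketch (the types and names are hypothetical, not copycat's actual internals) of an UnregisterEntry carrying an expired flag so the state machine can distinguish leader-driven expiration from an explicit client close:

import java.util.HashMap;
import java.util.Map;

public class UnregisterSketch {
  interface SessionListener {
    void expire(long sessionId); // session timed out (leader-driven expiration)
    void close(long sessionId);  // client explicitly closed the session
  }

  static class UnregisterEntry {
    final long sessionId;
    final boolean expired; // true if the leader expired the session
    UnregisterEntry(long sessionId, boolean expired) {
      this.sessionId = sessionId;
      this.expired = expired;
    }
  }

  private final Map<Long, Object> sessions = new HashMap<>();
  private final SessionListener listener;

  UnregisterSketch(SessionListener listener) {
    this.listener = listener;
  }

  void register(long sessionId) {
    sessions.put(sessionId, new Object());
  }

  void apply(UnregisterEntry entry) {
    if (sessions.remove(entry.sessionId) == null) {
      return; // session already removed
    }
    if (entry.expired) {
      listener.expire(entry.sessionId); // expire(Session) should be called on the user state machine
    } else {
      listener.close(entry.sessionId);  // close(Session) should be called on the user state machine
    }
  }
}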

ValueStateMachineExample not working

ValueStateMachineExample parses args[0] and args[1] as Integers. However, it also expects args[1] to be in the format 'host:port' because it calls split on it in the for loop.

CompletableFuture might complete in unexpected thread - Copycat gets unreliable

Hi!

Copycat uses the class CompletableFuture heavily and expects the handlers along those "pipelines" to be called on the same thread that the complete function was called on: for example, the server classes (e.g. ServerState, *State, ServerContext, ...) expect to be called only on the single thread that is created in a SingleThreadContext in the constructor of ServerContext. With this assumption, Copycat has much simpler code throughout these classes, because it does not need to take any care of multithreading. Jordan and I already discussed this in a thread in the Google Group: https://groups.google.com/d/msg/copycat/p9j8I0SRw3M/xR7fwplvCwAJ.

Exactly the same example that I talk about in the mail thread has now come up again in my implementation in diqube (diqube internally uses copycat to have a reliable way to distribute some internal data across the cluster).

In the logs of some of diqube's tests, the following lines started to show up sometimes, and in those cases the copycat server did not start up correctly:

23:42:20.629 [main] INFO  o.d.consensus.DiqubeCopycatServer [DiqubeCopycatServer.java:143] - Starting up consensus node with local data dir at '/home/diqube/.jenkins/workspace/diqube/diqube-server/data/consensus'.
23:42:20.632 [copycat-server-Address[/127.0.0.1:5101]] DEBUG i.a.c.server.storage.SegmentManager [SegmentManager.java:318] - Created segment: Segment[id=1, version=1, index=0, length=0]
23:42:20.634 [copycat-server-Address[/127.0.0.1:5101]] DEBUG i.a.c.s.s.ServerStateMachineExecutor [ServerStateMachineExecutor.java:199] - Registered value operation callback class org.diqube.cluster.ClusterLayoutStateMachine$GetAllTablesServed
23:42:20.634 [copycat-server-Address[/127.0.0.1:5101]] DEBUG i.a.c.s.s.ServerStateMachineExecutor [ServerStateMachineExecutor.java:199] - Registered value operation callback class org.diqube.cluster.ClusterLayoutStateMachine$GetAllNodes
23:42:20.634 [copycat-server-Address[/127.0.0.1:5101]] DEBUG i.a.c.s.s.ServerStateMachineExecutor [ServerStateMachineExecutor.java:190] - Registered void operation callback class org.diqube.cluster.ClusterLayoutStateMachine$RemoveNode
23:42:20.634 [copycat-server-Address[/127.0.0.1:5101]] DEBUG i.a.c.s.s.ServerStateMachineExecutor [ServerStateMachineExecutor.java:190] - Registered void operation callback class org.diqube.cluster.ClusterLayoutStateMachine$SetTablesOfNode
23:42:20.634 [copycat-server-Address[/127.0.0.1:5101]] DEBUG i.a.c.s.s.ServerStateMachineExecutor [ServerStateMachineExecutor.java:199] - Registered value operation callback class org.diqube.cluster.ClusterLayoutStateMachine$IsNodeKnown
23:42:20.635 [copycat-server-Address[/127.0.0.1:5101]] DEBUG i.a.c.s.s.ServerStateMachineExecutor [ServerStateMachineExecutor.java:199] - Registered value operation callback class org.diqube.cluster.ClusterLayoutStateMachine$FindNodesServingTable
23:42:20.635 [copycat-server-Address[/127.0.0.1:5101]] INFO  i.a.c.server.state.ServerContext [ServerContext.java:87] - Server started successfully!
23:42:20.636 [main] DEBUG i.a.copycat.server.state.ServerState [ServerState.java:461] - Address[/127.0.0.1:5101] - Single member cluster. Transitioning directly to leader.
[...more lines of diqube logging on main thread...]

The first line is logged by diqube code that tries to start up the copycat server. That server then starts up without having any other nodes in the cluster (single node setup in those tests). Note that in the last line, copycat's ServerState logs that it identified a single member cluster, but then there are no logs at all any more from the server (in the whole log of that test). That last log entry is made on a different thread, though: "main" instead of the copycat server thread. This should never happen, as copycat expects the ServerState#join() method (which logs that line) to be called on the copycat server thread; but in this case that does not happen.
This is slightly hard to debug, as most of the time it works just fine (and ServerState#join() is not executed on the main thread but on the copycat server thread). Therefore I can only guess that the ServerState#transition() method calls #checkThread(), which in turn throws an IllegalStateException. I cannot see that exception in the log, though; diqube continues executing the startup of the test on the "main" thread. That exception is swallowed, and I guess it is swallowed somewhere in CompletableFuture.

So, I dug into why that ServerState#join() method is sometimes called on the wrong thread. First of all: diqube's implementation of the catalyst server completes a call to the #listen() method very quickly, and the returned future of that method is completed quickly as well. Now, I'm pretty sure this is what happens:

  1. diqube calls the #open method on CopycatServer
  2. This then executes (on the main thread): openFuture = context.open().thenCompose(completionFunction); (see here)
  3. The call to context.open() is executed first, therefore ServerContext#open() is executed (see here)
  4. That switches to the server thread, executes the #listen() on the catalyst server. The future that is returned by #open() is completed as soon as #listen replies.
  5. The main thread continues up until just before the "return" statement of ServerContext#open() - this means it installs the whenComplete handler on the future.
  6. The Server (on the server thread) then completes the listen-future, and then ServerContext#open() completes its future, too (after creating the new ServerState, line 85)
  7. The whenComplete of the result future of ServerContext#open() is executed and logs "Server started successfully" (this is logged in the correct thread above!)
  8. The completion of the result future of ServerContext#open() finishes (because everything is done)
  9. ServerContext#open() returns the (already completed) future
  10. CopycatServer receives the result of context.open() and then executes thenCompose(completionFunction);
  11. As the future is completed already though, the completionFunction can only be called on the current thread (= "main") by the CompletableFuture, which it does -> the completionFunction is executed on thread "main".

This shows that copycat cannot rely on the handlers that are registered on a CompletableFuture (like thenCompose, whenComplete, ...) to be called in the same thread that called the complete method on the CompletableFuture. This might work "most of the time", but that is most probably not good enough.

I have an immediate solution for my concrete problem: I can simply delay the completion of the future that is returned by my server's #listen call.

But I think before any release candidates or final releases of copycat can be created, the usage of all CompletableFutures throughout the whole codebase (copycat and catalyst) has to be inspected and all such problems have to be fixed, so that copycat does not rely on any timings of when CompletableFutures may be completed. (EDIT: I understood the Google Group thread to mean that copycat usually assumes the handlers are called on the same thread. As this is obviously not true, I think all occurrences have to be inspected. If copycat does not usually assume this, then this issue might just be a single bug, not a general implementation pattern issue.)
One solution could be to do context.executor().execute(() -> ...) in each result handler again, to simply ensure that everything is executed on the correct thread. Or all those methods that rely on being executed on a specific thread (at least all the methods in *State, ServerContext, etc.) could do that themselves.
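
A small standalone Java program (the names here are illustrative, not copycat code) demonstrating the underlying behavior: a handler registered on an already-completed CompletableFuture runs on the registering thread, not on the thread that called complete(), whereas the *Async variants with an explicit Executor always re-dispatch onto the given thread.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CompletionThreadDemo {
  public static void main(String[] args) {
    ExecutorService serverThread =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "copycat-server"));

    CompletableFuture<Void> future = new CompletableFuture<>();
    // Complete the future on the "server" thread before any handler is attached.
    CompletableFuture.runAsync(() -> future.complete(null), serverThread).join();

    // Because the future is already complete, this handler runs on "main",
    // which breaks the single-threaded assumption described above.
    future.whenComplete((v, t) ->
        System.out.println("handler ran on: " + Thread.currentThread().getName()));

    // One possible mitigation: explicitly re-dispatch handlers onto the server thread.
    future.whenCompleteAsync((v, t) ->
        System.out.println("async handler ran on: " + Thread.currentThread().getName()),
        serverThread).join();

    serverThread.shutdown();
  }
}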

Support for a Leader to leave the cluster cleanly

When a node that is the current leader leaves the cluster, it will do so without properly updating the cluster configuration. This happens because the state transition from LEADER to LEAVE prevents committing the updated cluster configuration.

Lagging follower is forced to truncate its log repeatedly

I noticed this during a test run. This is a 3 node setup. One of the followers keeps resetting itself (truncating its log) as it always discovers that the globalIndex is greater than what it is aware of. A brief look at the logic in PassiveState::checkGlobalIndex tells me that this should not happen on startup.

Here are some log lines. (This pattern keeps repeating)

2016-02-18 23:18:49,572 | DEBUG | FollowerState | 10.254.1.207/10.254.1.207:9876 - Received AppendRequest[term=1, leader=184429489, logIndex=0, logTerm=0, entries=[0], commitIndex=3288, globalIndex=3288]
2016-02-18 23:18:49,572 | DEBUG | SegmentManager | Closing segment: 1
2016-02-18 23:18:49,573 | DEBUG | SegmentManager | Created segment: Segment[id=1, version=1, index=0, length=0]
2016-02-18 23:18:49,573 | DEBUG | FollowerState | 10.254.1.207/10.254.1.207:9876 - Sent AppendResponse[status=OK, term=1, succeeded=true, logIndex=0]
2016-02-18 23:18:49,573 | DEBUG | FollowerState | 10.254.1.207/10.254.1.207:9876 - Received AppendRequest[term=1, leader=184429489, logIndex=0, logTerm=0, entries=[0], commitIndex=3292, globalIndex=3292]
2016-02-18 23:18:49,573 | DEBUG | SegmentManager | Closing segment: 1
2016-02-18 23:18:49,573 | DEBUG | SegmentManager | Created segment: Segment[id=1, version=1, index=0, length=0]
2016-02-18 23:18:49,573 | DEBUG | FollowerState | 10.254.1.207/10.254.1.207:9876 - Sent AppendResponse[status=OK, term=1, succeeded=true, logIndex=0]
2016-02-18 23:18:49,574 | DEBUG | FollowerState | 10.254.1.207/10.254.1.207:9876 - Received ConfigureRequest[term=1, leader=184429489, index=3295, members=[ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.254.1.207:9876, clientAddress=/10.254.1.207:9876], ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.254.1.201:9876, clientAddress=/10.254.1.201:9876], ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.254.1.202:9876, clientAddress=/10.254.1.202:9876]]]
2016-02-18 23:18:49,574 | DEBUG | FollowerState | 10.254.1.207/10.254.1.207:9876 - Sent ConfigureResponse[status=OK]
2016-02-18 23:18:49,750 | DEBUG | FollowerState | 10.254.1.207/10.254.1.207:9876 - Received AppendRequest[term=1, leader=184429489, logIndex=0, logTerm=0, entries=[180], commitIndex=3294, globalIndex=3294]
2016-02-18 23:18:49,750 | DEBUG | SegmentManager | Closing segment: 1
2016-02-18 23:18:49,750 | DEBUG | SegmentManager | Created segment: Segment[id=1, version=1, index=0, length=0]

I'll continue looking into this more but thought I'd log an issue in case @kuujo you already know what is causing this.

Node cannot (re)join cluster if restarted (quickly)

Assume a cluster with 2 nodes. Node 1 is the leader and node 2 is a follower or is passive. Now, node 2 is shut down (= kill -9), but restarted quickly. This means that node 2 lost all of its main memory state and does not know about the leader of the cluster anymore. The CopycatServer will then be #open()ed again and it will try to re-join the cluster: it will send a JoinRequest to node 1, receive a JoinResponse and then directly transition to passive or follower again (in ServerState#join(Iterator, CompletableFuture)). The problem now is that CopycatServer#open() on node 2 decides to wait until it receives information on who the leader is, and only then does it complete its CompletableFuture. But it won't receive the leader's address. This is set (if I see this correctly) in *State#append (LeaderState, ActiveState, PassiveState), but that method won't be called, as node 1 (the leader) decides not to send a new ConfigurationEntry, because it thinks that the node already belongs to the cluster (see LeaderState#join(JoinRequest), the if where the cluster's members are checked).

Am I now using the CopycatServer wrong somehow?
AFAIK a node cannot simply "time out" and be assumed to have left the cluster without it sending a LeaveRequest (as that would effectively allow split-brains), therefore the above problem arises not only if a node is restarted quickly, but also if it is restarted after some time, e.g. after it has been killed (kill -9, a network partition where the Ops people decided to restart the servers, power loss on the machines, ...).
Are there any integration tests that test such behavior? This should be fairly simple to test: Start a cluster, kill one node that is not the leader, restart the node and check if it connected to the cluster successfully again.

I can see that this would not be that big a problem in a cluster where there's a lot of activity (= a lot of AppendRequests issued), as the node will reconnect on the first AppendRequest being sent. But there might be clusters where there's not that much activity, and the application (which uses copycat) wants to wait for the local copycat server to be initialized and then send some Commands to the cluster...

Perhaps the leader should append a NoOpEntry?

SerializationException

Hi,

When I started 3 CopycatServers locally, I got the following exceptions.
Any idea on this?
Thanks!

16:21:04.880 [copycat-server-Address[localhost/127.0.0.1:8521]] ERROR i.a.c.u.c.SingleThreadContext - An uncaught exception occurred
io.atomix.catalyst.serializer.SerializationException: cannot serialize unregistered type: class io.atomix.copycat.server.request.PollRequest
        at io.atomix.catalyst.serializer.Serializer.writeObject(Serializer.java:589) ~[rafty-server-1.0-SNAPSHOT.jar:na]
        at io.atomix.catalyst.transport.NettyConnection.writeRequest(NettyConnection.java:180) ~[rafty-server-1.0-SNAPSHOT.jar:na]
        at io.atomix.catalyst.transport.NettyConnection.lambda$send$11(NettyConnection.java:295) ~[rafty-server-1.0-SNAPSHOT.jar:na]
        at io.atomix.catalyst.transport.NettyConnection$$Lambda$57/1712713197.run(Unknown Source) ~[na:na]
        at io.atomix.catalyst.util.concurrent.Runnables.lambda$logFailure$17(Runnables.java:20) ~[rafty-server-1.0-SNAPSHOT.jar:na]
        at io.atomix.catalyst.util.concurrent.Runnables$$Lambda$5/1555690610.run(Unknown Source) [rafty-server-1.0-SNAPSHOT.jar:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_45]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]

Ensure tombstones are applied on all servers before being removed from the log

The changes in #31 removed the use of globalIndex to indicate when it was safe to compact server logs. Instead, it was replaced with the highest index for which events have been received by all clients. But it turns out that while this did correctly address the issue of log compaction while servers were down, it introduced a bug into compaction for tombstones.

Because tombstones represent the removal of state, they need to be applied on all servers in order to ensure state is properly removed from memory. With the changes in #31, it is now possible for tombstones to be removed from the log prior to being applied on all servers. For instance, imagine the following scenario:

  • In a three node cluster, a client submits a put 1 operation and then a remove 1 operation. put 1 is a normal command, and remove 1 is a tombstone
  • The leader commits both operations, but one follower stores only put 1 before crashing
  • The leader and remaining follower compact both operations from their logs
  • When the down follower recovers, its log contains put 1 but the leader and other follower have compacted both put 1 and remove 1 from their logs. Thus, the recovered follower will never receive the remove 1 operation and will therefore maintain an inconsistent state until/if some other operation overrides that state

Essentially, servers need to keep track of separate indexes for safely performing minor and major compaction. Minor compaction removes normal commands from the log and should be based on client event indexes, and major compaction removes tombstones from the log and should be based on the highest index stored on all servers.

The simple solution to this is to reinstate the globalIndex tracking solely for the purpose of managing compaction of tombstones. globalIndex basically keeps track of the highest index stored on all servers. This is significantly less of an issue than the general use of globalIndex was previously since tombstones are infrequently compacted from the log anyways.

There may be some alternative approaches to handling this problem that allow tombstones to be removed from the log even while a server is down. One option is to have servers infrequently commit an entry (via the leader) that effectively persists the server's commitIndex in the Raft log. Then, the highest committed commitIndex for all servers could be used to compact logs while a server is down. This has the benefit of persisting through leader changes, which isn't the case with tracking globalIndex in memory on the leader.

Support session event names

Currently, the session publish method only supports a single argument. But this can make it difficult to perform more complex messaging via the Session object and often requires that users send a more complex object through the session. Users should be able to send an event type String or other identifier with session events.

BufferUnderflowException when attempting to install from snapshot

Noticed the following exception

2016-02-19 22:26:47,757 | INFO  | 0.254.1.203:9876 | ServerStateMachine  | 10.254.1.203/10.254.1.203:9876 - Installing snapshot 3902239
2016-02-19 22:26:47,759 | ERROR | 203:9876-state-1 | SingleThreadContext  | An uncaught exception occurred
java.nio.BufferUnderflowException
    at io.atomix.catalyst.buffer.AbstractBuffer.checkRead(AbstractBuffer.java:371)[68:org.onosproject.onlab-thirdparty:1.5.0.SNAPSHOT]
    at io.atomix.catalyst.buffer.AbstractBuffer.checkRead(AbstractBuffer.java:351)[68:org.onosproject.onlab-thirdparty:1.5.0.SNAPSHOT]
    at io.atomix.catalyst.buffer.AbstractBuffer.readLong(AbstractBuffer.java:578)[68:org.onosproject.onlab-thirdparty:1.5.0.SNAPSHOT]
    at io.atomix.copycat.server.storage.snapshot.SnapshotReader.readLong(SnapshotReader.java:151)[68:org.onosproject.onlab-thirdparty:1.5.0.SNAPSHOT]
    at io.atomix.variables.state.LongState.install(LongState.java:46)[68:org.onosproject.onlab-thirdparty:1.5.0.SNAPSHOT]
    at io.atomix.manager.state.ResourceManagerState.install(ResourceManagerState.java:81)[68:org.onosproject.onlab-thirdparty:1.5.0.SNAPSHOT]
    at io.atomix.copycat.server.state.ServerStateMachine.lambda$installSnapshot$82(ServerStateMachine.java:124)[68:org.onosproject.onlab-thirdparty:1.5.0.SNAPSHOT]
    at io.atomix.copycat.server.state.ServerStateMachine$$Lambda$587/1905889655.run(Unknown Source)[68:org.onosproject.onlab-thirdparty:1.5.0.SNAPSHOT]
    at io.atomix.catalyst.util.concurrent.Runnables.lambda$logFailure$17(Runnables.java:20)[68:org.onosproject.onlab-thirdparty:1.5.0.SNAPSHOT]
    at io.atomix.catalyst.util.concurrent.Runnables$$Lambda$49/1217411637.run(Unknown Source)[68:org.onosproject.onlab-thirdparty:1.5.0.SNAPSHOT]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)[:1.8.0_31]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)[:1.8.0_31]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)[:1.8.0_31]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)[:1.8.0_31]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)[:1.8.0_31]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)[:1.8.0_31]
    at java.lang.Thread.run(Thread.java:745)[:1.8.0_31]

Not immediately clear what happened here.

Allow log compaction for all committed entries

Currently, logs are not allowed to be compacted until entries have been stored on all active servers in the cluster. This ensures that all state machines see every command, and thus all state machines store related events in memory until they're acked by client sessions. But this is obviously not ideal because, essentially, this means logs cannot be compacted while any active server is down. Ideally, servers should be allowed to compact entries that have been committed.

The issue with compacting entries that haven't been stored on all servers is that if a server compacts a command that triggers an event before that command is received by some follower, the state of events on all servers will become inconsistent. Sessions ensure they receive events in sequential order by checking each event message against the previously received event message for consistency. Consider the following scenario:

  • In a three node cluster with servers A, B, and C, server A is the leader
  • A client connects to server C and registers a session
  • Server C is far behind the leader, but because session operations are proxied to the leader, the client cannot detect that server C is slow
  • Another client submits a command that triggers a sequential event to be sent to the first session and the command is committed through servers A and B
  • Server A compacts the command out of its log before server C catches up

In this scenario, server C will never receive the command and thus will never publish the event. That results in an inconsistent event history between server C and servers A and B.

An inconsistent history of events between servers is a safety violation. Clients should always see every event sent to their session, but if a client were connected to server C it would miss the event.

But in addition to safety, liveness also becomes an issue with inconsistencies between servers. Servers and clients coordinate sequencing of events in a manner similar to Raft's AppendEntries RPCs wherein the server sends the previous event index for comparison. However, previous event indexes for servers A and B will differ from that of C. A client may ack an event index, thus clearing events from server memory, and then switch to a server that expects the client to have seen events that it never saw. Essentially, a client may ack events that it doesn't know about.

Fortunately, Copycat already does keep track of which clients are connected to which servers through connect requests/responses. My proposed fix to this is to extend that mechanism to allow servers to reject connections from clients if conditions indicate that sequential events sent by that server may not be complete. Essentially, what we want is to ensure that any server to which a client is connected has the complete history of events from the client's last received event to the last committed index in the log. In order to do this, servers simply need to track the lowest index for which they've received sequential commits. When a client connects to a server, it sends the last received event index for its session. If the server's lowest sequential commit index is less than the client's highest event index, it rejects the connection and forces the client to find another server.

Because of the nature of consensus, clients should always be able to find a healthy server to which to connect and from which to receive events. That is, there will always be at least one server that participated in the commitment of each command, and so there will always be some server that received each command. Clients simply need to find the server that has a complete event history starting from their last received event.

In order to ensure entries are not compacted after a connection between a client and server is made, servers will have to use the lowest index of all connected clients in the same way that globalIndex is currently used. That is, servers will prevent entries from being compacted so long as some client connected to the server is awaiting events.

That last point is actually similar to an alternative solution. Since client keep alive requests are logged and replicated, it could be possible to use that metadata in place of globalIndex in general to ensure logs are never compacted until all open sessions have received events. Essentially, the lowest event index acked by any client becomes the globalIndex. This is certainly a much simpler solution, but one of the benefits of forcing clients to switch servers is that it forces clients to switch to more up-to-date servers.

Client state version is not correctly sent to cluster

The RaftClient ensures that it does not see state go back in time by tracking the highest state machine version (index) of any response seen by the client. The version should be sent by the client any time it submits a Query to the cluster. But currently, the client is submitting the last request sequence number for which it received a response. This is incorrect.

Properly handle cleaning tombstones from the log

The current log cleaning implementation does not handle cleaning tombstones from the log very well. Tombstones are unique in that rather than representing state, they actually represent the absence of state. For this reason, tombstones must be persisted long enough to be applied to all nodes in the cluster, and they must not be removed from the log until all prior related entries have been removed.

What we really need is a system that understands the relationships between log entries. What I'm proposing is using something like Storm's message tracking algorithm to track relationships between entries in the log. Essentially, what this means is each entry in the log is assigned a random 64-bit identifier, and that ID is XORed with IDs of related entries. Once the XOR equals 0, all entries in the tree have been removed from the log.
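
For illustration, here is a minimal standalone Java sketch of the XOR-based tracking idea (modeled on Storm's message tracking; the class and method names are made up, not copycat code): each entry gets a random 64-bit id, related ids are XORed into an accumulator, and the accumulator only returns to zero once every entry in the tree has been removed from the log.

import java.util.concurrent.ThreadLocalRandom;

public class XorTracker {
  private long checksum;

  long track() {
    long id = ThreadLocalRandom.current().nextLong();
    checksum ^= id; // entry added to the tree
    return id;
  }

  void ack(long id) {
    checksum ^= id; // entry removed from the log
  }

  boolean allRemoved() {
    return checksum == 0;
  }

  public static void main(String[] args) {
    XorTracker tracker = new XorTracker();
    long put = tracker.track();
    long remove = tracker.track(); // tombstone related to the put
    tracker.ack(put);
    System.out.println(tracker.allRemoved()); // false: the tombstone is still in the log
    tracker.ack(remove);
    System.out.println(tracker.allRemoved()); // true: the whole tree has been removed
  }
}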

Add connection strategies

Add strategies to allow clients to control which servers they connect to. This should include allowing clients to connect only to followers or only to leaders for performance. Clients that frequently perform sequential or causal reads will benefit from connecting to followers, and clients that frequently write or perform linearizable reads will benefit from connecting only to leaders.

Command sequences are lost after compaction

Currently, commands are sequenced in the internal Raft state machine by reordering them as they're applied. However, in the event of a server restart, if commands have been cleaned from the log then they may appear out of sequence or with missing sequence numbers. For this reason, leaders must sequence commands as they're written to the log so that internal state machines can assume log order equals sequential order.

Log strong warnings when commit objects are garbage collected

Commit objects expose an interface for cleaning the Raft log. When either close() or clean() is called on a commit, the commit is released back to an internal commit pool. If a commit is not released back to a pool but is garbage collected, that is indicative of a commit that was neither closed nor cleaned. Catalog should log strong warnings to ensure developers properly clean the commit log.
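
One way this warning could be surfaced, sketched below in plain Java (this is a sketch of the idea, not Catalog's actual implementation; newer code would probably prefer a java.lang.ref.Cleaner over finalize()): a pooled commit remembers whether close() or clean() was ever called and logs a warning from finalize() if it is garbage collected without being released.

public class TrackedCommit {
  private volatile boolean released;

  public void clean() {
    released = true; // mark the underlying entry for compaction, then release to the pool
  }

  public void close() {
    released = true; // release to the pool without cleaning
  }

  @Override
  protected void finalize() throws Throwable {
    try {
      if (!released) {
        // The commit was garbage collected without close() or clean() ever being called.
        System.err.println("WARNING: commit was garbage collected without being closed or cleaned");
      }
    } finally {
      super.finalize();
    }
  }
}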

Support separate client/server transports

Allow users to configure separate transports for clients and servers. This would require that servers bind to two ports and configuration changes would have to reflect this change.

This feature is a precursor to supporting an optional REST API. See atomix/catalyst#6 for details on how REST support can be implemented in Catalyst.

Allow clients to unregister their sessions

Currently, when a client/session is closed, its session is not removed from state machines until it expires. Clients should be able to explicitly unregister their session when closed. The protocol already supports an UnregisterRequest, which commits an UnregisterEntry that removes the session quietly. Client sessions simply need to be updated to submit an UnregisterRequest when explicitly closed.

Duplicate commands don't have to be committed to the Raft log

The current implementation of linearizable semantics for clients via sessions follows the Raft dissertation precisely. All commands are logged with a sequence number. When a command is applied to the state machine, if a command with that sequence number has already been evaluated, the command's cached output is returned. But this process does not have to occur through the log. Leaders have enough information to determine whether a command has already been written to the Raft log based on the command's sequence number. If a leader receives a command request for a sequence number that it already wrote to the Raft log, it can simply return the cached output immediately, or, if the cached output is not yet available (the original command has not been committed), store the request object until the original command is completed.

Add strategy for initializing a client's session

Currently, clients attempt to connect to each known server in the cluster at startup. If a client fails to register a session through one of those servers, startup fails. Support has been requested for exponential back-off when opening a client's session. This could be done in a ConnectionStrategy (also rename the current ConnectionStrategy to CommunicationStrategy or something).
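
A small sketch of what the requested back-off could look like (this class is illustrative only; the real ConnectionStrategy/CommunicationStrategy API may end up quite different):

import java.time.Duration;

public class ExponentialBackoffStrategy {
  private final Duration initial;
  private final Duration max;

  ExponentialBackoffStrategy(Duration initial, Duration max) {
    this.initial = initial;
    this.max = max;
  }

  /** Delay to wait before the given (1-based) session registration attempt. */
  Duration delayBeforeAttempt(int attempt) {
    long millis = initial.toMillis() * (1L << Math.min(attempt - 1, 30));
    return Duration.ofMillis(Math.min(millis, max.toMillis()));
  }

  public static void main(String[] args) {
    ExponentialBackoffStrategy strategy =
        new ExponentialBackoffStrategy(Duration.ofMillis(100), Duration.ofSeconds(10));
    for (int attempt = 1; attempt <= 5; attempt++) {
      // Prints 100ms, 200ms, 400ms, 800ms, 1600ms (capped at 10s).
      System.out.println("attempt " + attempt + " -> wait " + strategy.delayBeforeAttempt(attempt));
    }
  }
}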

Cluster members are tracked by their Address' hashCode

ClusterState seems to track members of the cluster by their "ID" and the ID seems to be (LeaderState#join(...)) the Object#hashCode() of the address of that node.

This seems unsafe, as two nodes with different addresses could have the same hashCode and therefore the same ID. Or am I missing something?
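
A quick standalone demonstration of why hashCode()-based IDs are risky: distinct values can legitimately share a hash code. The strings below are a well-known java.lang.String collision; two Address values could collide in the same way.

public class HashCollisionDemo {
  public static void main(String[] args) {
    String a = "Aa";
    String b = "BB";
    System.out.println(a.equals(b));                  // false: different values
    System.out.println(a.hashCode() == b.hashCode()); // true: both hash to 2112
  }
}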

Support persistent state machines

Provide a base class for persistent state machines that allows such state machines to provide snapshots to the Raft protocol for replication to new members.

Add log persistence levels

The Log is backed by a Buffer which can easily be swapped to support writing to disk, memory, or memory mapped files. This option should be exposed in the Storage.Builder as a StorageLevel enum.
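
A hedged sketch of what the requested configuration surface might look like (the names are illustrative; the eventual Copycat API may differ):

public class StorageLevelSketch {
  enum StorageLevel { MEMORY, MAPPED, DISK }

  static class Storage {
    final StorageLevel level;
    private Storage(StorageLevel level) { this.level = level; }

    static Builder builder() { return new Builder(); }

    static class Builder {
      private StorageLevel level = StorageLevel.DISK; // default: durable on-disk log
      Builder withStorageLevel(StorageLevel level) { this.level = level; return this; }
      Storage build() { return new Storage(level); }
    }
  }

  public static void main(String[] args) {
    Storage storage = Storage.builder().withStorageLevel(StorageLevel.MEMORY).build();
    System.out.println(storage.level); // MEMORY: log kept purely in heap buffers
  }
}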

Possible log file corruption when a node is terminated

I happened to notice this occasionally during some recent runs. Doesn't happen every time.

Here's a scenario I was running:

  • 3 node cluster. s1, s2 and s3.
  • A separate client is logging commands on this cluster at the rate of a few hundred per second.
  • Terminate a follower node (kill java process)
  • Try bringing it back up and you see the below exception trace.

09:04:31.681 [copycat-server-Address[localhost/127.0.0.1:5003]] ERROR i.a.c.u.c.SingleThreadContext - An uncaught exception occurred
java.lang.IndexOutOfBoundsException: null
at io.atomix.catalyst.buffer.AbstractBytes.checkOffset(AbstractBytes.java:47) ~[catalyst-buffer-1.0.0-rc3.jar:na]
at io.atomix.catalyst.buffer.AbstractBytes.checkRead(AbstractBytes.java:54) ~[catalyst-buffer-1.0.0-rc3.jar:na]
at io.atomix.catalyst.buffer.FileBytes.readUnsignedShort(FileBytes.java:286) ~[catalyst-buffer-1.0.0-rc3.jar:na]
at io.atomix.catalyst.buffer.AbstractBuffer.readUnsignedShort(AbstractBuffer.java:528) ~[catalyst-buffer-1.0.0-rc3.jar:na]
at io.atomix.copycat.server.storage.Segment.<init>(Segment.java:81) ~[classes/:na]
at io.atomix.copycat.server.storage.SegmentManager.loadDiskSegment(SegmentManager.java:345) ~[classes/:na]
at io.atomix.copycat.server.storage.SegmentManager.loadSegment(SegmentManager.java:332) ~[classes/:na]
at io.atomix.copycat.server.storage.SegmentManager.loadSegments(SegmentManager.java:404) ~[classes/:na]
at io.atomix.copycat.server.storage.SegmentManager.open(SegmentManager.java:95) ~[classes/:na]
at io.atomix.copycat.server.storage.SegmentManager.<init>(SegmentManager.java:58) ~[classes/:na]
at io.atomix.copycat.server.storage.Log.<init>(Log.java:135) ~[classes/:na]
at io.atomix.copycat.server.storage.Storage.open(Storage.java:280) ~[classes/:na]
at io.atomix.copycat.server.state.ServerContext.lambda$0(ServerContext.java:72) ~[classes/:na]
at io.atomix.copycat.server.state.ServerContext$$Lambda$3/846238611.run(Unknown Source) ~[na:na]
at io.atomix.catalyst.util.concurrent.Runnables.lambda$logFailure$17(Runnables.java:20) ~[catalyst-common-1.0.0-rc3.jar:na]
at io.atomix.catalyst.util.concurrent.Runnables$$Lambda$4/1663166483.run(Unknown Source) [catalyst-common-1.0.0-rc3.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_25]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_25]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_25]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_25]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_25]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_25]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_25]

Add recovery strategies

Add strategies to allow clients to automatically recover from lost sessions. By default, clients should fail when sessions are lost to force the user to handle the loss of linearizable semantics, but users should be allowed to override that behavior to automatically open a new session and continue operation with the loss of linearizability between sessions.

Implementing counting state machines

If I want to implement a state machine that tracks the number of times it was updated then the current implementation presents some difficulties. We work under the assumption that entries that no longer contribute to state machine evolution need not be replicated. But for operation counting state machines all operations are meaningful. I guess if we annotate the command entry as a tombstone we can clean it up after it is replicated to all instances. But a cluster configuration change to add new members poses a challenge.

I'm opening this issue to get some thoughts on this subject. My initial thought was that for some state machine types snapshotting is the only way to go if we want to garbage collect log entries from time to time.

Reject actions that are disruptive to the leader during configuration changes

There is some potential for leaders to be disrupted during configuration changes. Since the leader updates its configuration when it logs the configuration change, servers that are removed from the leader's configuration may timeout and attempt an election. By incrementing the term and requesting a vote from the leader, a server that was removed from the leader's configuration can force the leader to step down.

§4.3.2

Handling this case requires a simple extension to the handling of VoteRequest to reject requests if received within an election timeout of receiving a heartbeat from the leader.
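
A minimal sketch (hypothetical types, not copycat code) of the §4.3.2 extension: a follower rejects a VoteRequest if it has heard from a leader within the last election timeout, which prevents a server removed from the configuration from forcing the leader to step down.

import java.time.Duration;

public class VoteGuard {
  private final Duration electionTimeout;
  private volatile long lastHeartbeatMillis; // 0 until the first heartbeat arrives

  VoteGuard(Duration electionTimeout) {
    this.electionTimeout = electionTimeout;
  }

  /** Record receipt of an AppendRequest (heartbeat) from the current leader. */
  void onHeartbeat() {
    lastHeartbeatMillis = System.currentTimeMillis();
  }

  /** Reject the vote if a heartbeat arrived within the last election timeout. */
  boolean shouldRejectVote() {
    long sinceHeartbeat = System.currentTimeMillis() - lastHeartbeatMillis;
    return sinceHeartbeat < electionTimeout.toMillis();
  }
}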

Handle compacted entries during AppendRequest consistency checks

I think the fix for handling null (cleaned) entries during AppendRequest consistency checks may be wrongheaded:
5001868

So, to redefine the problem: when checking the term of the previous entry while handling an AppendEntries RPC, if the entry has been cleaned it will be null. That causes the consistency check to fail. What we did in the fix for this was ensure that globalIndex is always the lowest matchIndex minus 1, but this only fixes the problem for tombstones. Normal entries can still be clean()ed and thus hidden from reading after the globalIndex. So it's likely this problem would still occur if the NoOpEntry in our scenario were not changed to a tombstone.

I think the case is actually that we could safely hide all committed and cleaned entries. If an entry returns null from the log, we know it must necessarily have been committed and therefore consistency checks are not necessarily even needed. Any leader that's elected will have the same entry at the same term simply as a product of the Raft election algorithm.

But I think just to feel good about it we should still keep the last committed entry in the log visible at all times. As I mentioned, the log compaction process already ensures that the last committed entry always remains on disk and will never be removed, so I think I was wrong and you were right @madjam that we should simply ensure the last committed entry in the log always returns a non-null entry. So, a property of the log would be that entries from commitIndex to lastIndex are always non-null, and that would make this feel a lot better.

ISE: inconsistent index

I haven't done much debugging on this yet, but I thought I'd log this first.

Here's what I was doing:

  • 3 node cluster
  • A separate client is submitting commands to the cluster at the rate of a few hundred per second
  • Randomly kill a node, bring it back up, and verify everything is still running ok.

Here's what I did to see this error:

  • Kill the leader
  • Noticed another node was elected as new leader
  • This exception is logged on the new leader

12:21:00.424 [copycat-server-Address[localhost/127.0.0.1:5003]] ERROR i.a.c.u.c.SingleThreadContext - An uncaught exception occurred
java.lang.IllegalStateException: inconsistent index: 1605633
    at io.atomix.catalyst.util.Assert.state(Assert.java:69) ~[classes/:na]
    at io.atomix.copycat.server.storage.Segment.get(Segment.java:319) ~[classes/:na]
    at io.atomix.copycat.server.storage.Log.get(Log.java:319) ~[classes/:na]
    at io.atomix.copycat.server.state.LeaderState$Replicator.entriesCommit(LeaderState.java:1040) ~[classes/:na]
    at io.atomix.copycat.server.state.LeaderState$Replicator.commit(LeaderState.java:980) ~[classes/:na]
    at io.atomix.copycat.server.state.LeaderState$Replicator.lambda$commit$126(LeaderState.java:876) ~[classes/:na]
    at io.atomix.copycat.server.state.LeaderState$Replicator$$Lambda$111/982607974.apply(Unknown Source) ~[na:na]
    at java.util.Map.computeIfAbsent(Map.java:957) ~[na:1.8.0_25]
    at io.atomix.copycat.server.state.LeaderState$Replicator.commit(LeaderState.java:874) ~[classes/:na]
    at io.atomix.copycat.server.state.LeaderState$Replicator.access$100(LeaderState.java:818) ~[classes/:na]
    at io.atomix.copycat.server.state.LeaderState.accept(LeaderState.java:652) ~[classes/:na]
    at io.atomix.copycat.server.state.ServerState.lambda$registerHandlers$20(ServerState.java:421) ~[classes/:na]
    at io.atomix.copycat.server.state.ServerState$$Lambda$34/1604755402.handle(Unknown Source) ~[na:na]
    at io.atomix.catalyst.transport.NettyConnection.handleRequest(NettyConnection.java:102) ~[classes/:na]
    at io.atomix.catalyst.transport.NettyConnection.lambda$0(NettyConnection.java:90) ~[classes/:na]
    at io.atomix.catalyst.transport.NettyConnection$$Lambda$50/1312252238.run(Unknown Source) ~[na:na]
    at io.atomix.catalyst.util.concurrent.Runnables.lambda$logFailure$7(Runnables.java:20) ~[classes/:na]
    at io.atomix.catalyst.util.concurrent.Runnables$$Lambda$4/1471868639.run(Unknown Source) [classes/:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_25]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_25]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_25]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_25]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_25]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_25]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_25]

The view of time in the internal state machine may not be consistent

The internal ServerStateMachine tracks time by updating a timestamp with max(entry.getTimestamp(), currentTimestamp). But in some cases, the entry's timestamp is used directly to evaluate things like session expiration:
https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/ServerStateMachine.java#L200

This can result in unexpected behavior in the state machine particularly after leadership changes. All state machine methods should store and use the same timestamp for time-dependent operations in order to ensure safety and consistency.

Allow synchronous session events

Currently, the CompletableFuture returned by session publish methods is completed once the event is queued to be sent to the client, but not once it's sent to or received by the client. It should be possible for state machines to block waiting for session events to be received by clients in order to ensure things like locks and leader elections remain consistent. We could potentially expose an event consistency level to control the level of coordination required for each event and prevent too many CompletableFutures from being held in memory.

Asynchronous client callbacks may not be executed sequentially

The current implementation provides sequential consistency for asynchronous clients by sequencing commands as they're applied to the Raft state machine. If a client submits a command through two different paths, client sequence numbers instruct the internal state machine on how to resequence writes such that if a client submits multiple concurrent writes, they'll be applied in the order in which they were sent.

However, this algorithm does not currently preserve sequential consistency in responses. That is, if a client submits multiple commands concurrently, even if the commands are applied to server state machines in the correct order, the client's CompletableFutures may still be completed out of order. This can happen in the event that a response is lost on the way to the client. In that case, the client will resubmit the command, but any responses received thereafter will be applied upon receipt and may therefore be completed out of order.

In order to resolve this issue, clients need to sequence command responses. This issue also applies to queries. From a client's perspective, it can see reads go back in time when multiple reads are executed concurrently. Clients simply have to queue responses received out of order. We may want to provide options for enabling/disabling client linearizability. Sequential consistency algorithms apply only to asynchronous usage of the client since synchronous usage enforces sequential consistency by its nature.
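
A minimal Java sketch of the client-side fix described above (the names are illustrative, not copycat code): responses are completed strictly in sequence-number order, and anything that arrives out of order is buffered until the gap is filled.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class ResponseSequencer {
  private final Map<Long, Runnable> pending = new HashMap<>();
  private long nextSequence = 1;

  /** Called when a response for the given command sequence number arrives. */
  synchronized void onResponse(long sequence, CompletableFuture<Object> future, Object result) {
    pending.put(sequence, () -> future.complete(result));
    // Release completions in order; stop at the first missing sequence number.
    while (pending.containsKey(nextSequence)) {
      pending.remove(nextSequence).run();
      nextSequence++;
    }
  }
}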

Only submit keep alive requests during periods of inactivity

Currently, clients submit keep alive requests at an interval that is a fixed fraction of the session timeout. This places an unnecessary burden on the cluster, particularly if many clients are connected and frequently submitting commands. Command entries have all the necessary information (namely session ID and timestamp) to double as keep alive entries. Clients should only submit keep alive requests if they haven't submitted a command for some fraction of the session timeout.

Chubby takes an interesting approach to reducing the frequency of keep alive requests. Rather than committing and immediately responding to session keep alives, Chubby waits until near the end of the required interval to respond to the client, and the client immediately sends another keep alive request. So, I suppose these are two different approaches that could be taken.

Optimize log iteration with per-replica pointers

Generally speaking, the Raft algorithm reads log entries sequentially. When replicating to a follower, the leader reads some number of sequential entries, sends a request, gets a response, and continues. The current log implementation is optimized for sequential reads for uncompacted segments. But for compacted segments, reading each entry is significantly more expensive due to binary search. This is somewhat mitigated by storing last read offsets and positions, but if the leader is reading from two different points in compacted areas of the log, this optimization will have much less significance.

Reading the log can be further optimized by storing the offset in the same way but on a per-replica basis. Essentially, we would just create something like a LogIterator of which one is created for each follower on the leader. This would allow sequential reads in almost all cases aside from the initial read.

KeepAlives can stall after a connection reset

I have observed this several times running with a 3 node cluster setup. The trigger is usually a request failure (usually due to a timeout) that causes the connection to be reset, and from that point on no more KeepAliveRequests are sent. This eventually leads to the session expiring.

My suspicion is that there is an uncaught exception that is causing this. By changing how new connections are established after a reset I was able to "fix" this problem. I'll clean up that change and submit a PR shortly.

Safely handle servers with in-memory logs recovering from the leader

When servers with in-memory logs crash and recover and rejoin a cluster with the same leader, in-memory leader state becomes inconsistent:
https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/LeaderState.java#L1204

This is because the leader does not allow followers to forcefully decrement their matchIndex. This will result essentially in infinite recursion since the leader can't decrease nextIndex to the follower's initial log index (0) and so AppendRequest is never consistent with the follower's log.

It should be safe to allow followers to specify their true last log index. There is a risk that doing so can mask bugs in the algorithm, but should not itself be a bug.

Prevent split brain on startup

The current implementation of configuration changes handles startup of the cluster rather poorly. Essentially, when each server is started, it attempts to find an existing leader to join the cluster. If it can't find a cluster to join, it assumes it's a member of a new cluster and transitions directly to follower.

This can allow split brain when some number of joining servers can't see an existing leader. For instance, if a single node is running in the cluster and two new servers are added, if those two servers can't see the existing server (a single leader), they'll form their own cluster and elect a leader. Neither side of the partition will know about the other, and both sides will accept writes.

I propose that the structure of server configurations be used to determine whether a server is expected to already be a member of a cluster or whether it is joining a cluster. There's no safety issue if a server thinks it should be joining when it's already a member of a cluster, but there is a safety issue if a server thinks it's starting a new cluster when one already exists.

If the server's Address is listed in the cluster configuration, it should transition directly to FOLLOWER and assume that it's already a member of the cluster. If the follower does not receive any heartbeats from a leader within an election timeout and transitions to candidate and requests votes from other servers, those servers should verify that the candidate is present in their configurations. If a server is not present in some other server's configuration, it should transition back to INACTIVE and fail since this may indicate that it should have joined the cluster but was instead started as a full member.

Raft suggests servers should form a cluster by joining one by one. Initially, a single server cluster starts, and then additional servers are added through configuration changes. From an API perspective, something like this could perhaps be done by exposing methods that differentiate between joining or leaving a cluster and starting the server as an existing member of the cluster. As long as safety mechanisms are in place, this would be fine.

Reject configuration changes if a configuration change is ongoing

The current implementation of configuration changes does not reject configuration requests if an existing reconfiguration is already taking place. This is a safety violation since multiple concurrent configuration changes can result in non-overlapping majorities. Leaders should ensure that configuration changes are committed and therefore applied on all servers before accepting new configuration changes.
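
A minimal sketch (hypothetical fields, not copycat code) of the guard described above: a leader refuses to begin a new configuration change until the previous configuration entry has been committed.

public class ConfigurationGuard {
  private long configuringIndex = 0; // index of an uncommitted configuration entry, 0 if none
  private long commitIndex = 0;

  /** Returns true if the new configuration entry may be logged, false if one is still in flight. */
  synchronized boolean tryBeginConfiguration(long newEntryIndex) {
    if (configuringIndex > commitIndex) {
      return false; // a previous configuration change has not yet been committed: reject
    }
    configuringIndex = newEntryIndex;
    return true;
  }

  /** Called as the commit index advances. */
  synchronized void onCommit(long index) {
    commitIndex = Math.max(commitIndex, index);
  }
}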

Expired session still left open after its recovery

When an expired session is recovered by opening a new session, we still leave the old session open. As a result, the old ClientSessionManager keeps sending periodic KeepAliveRequests (which fail with UNKNOWN_SESSION_ERROR, as expected). I noticed that this also prevents the client (whose session was recovered) from successfully submitting new operations, possibly because it is still using a stale ClientSessionSubmitter (I'm not sure of this part).

I should have a pull request with the bug fix. It worked in my limited testing. But I'm not fully versed with the ClientSession recovery logic yet. So there could be some missing pieces.

Intermittent unit test failure in ClusterTest

I observe intermittent unit test failures ( > 50%) while running local builds of Copycat. ClusterTest::testMany is the one that fails.

I'm at commit 310383b

Tests run: 84, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 476.549 sec <<< FAILURE!
testMany(io.atomix.copycat.test.ClusterTest)  Time elapsed: 40.868 sec  <<< FAILURE!
java.util.concurrent.TimeoutException: Test timed out while waiting for an expected result
    at net.jodah.concurrentunit.Waiter.await(Waiter.java:168)
    at net.jodah.concurrentunit.Waiter.await(Waiter.java:142)
    at net.jodah.concurrentunit.ConcurrentTestCase.await(ConcurrentTestCase.java:105)
    at io.atomix.copycat.test.ClusterTest.testMany(ClusterTest.java:135)

I haven't done a lot of debugging yet. Thought I'd log an issue in case someone runs into the same thing or has already figured out what's causing it.

Reorganize scheduler method parameters

Scheduler method parameters don't quite feel right when you use them. The API currently adheres to the ScheduledExecutorService convention with the callback coming before the Duration, but I think the signature should be the opposite, which feels a little better with lambdas.
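
For illustration, a sketch of the two orderings (the interface and names here are illustrative only):

import java.time.Duration;

interface Scheduler {
  // Current, ScheduledExecutorService-style ordering:
  //   Scheduled schedule(Runnable callback, Duration delay);

  // Proposed ordering, which reads better with a trailing lambda, e.g.
  //   scheduler.schedule(Duration.ofMillis(500), () -> sendHeartbeat());
  Scheduled schedule(Duration delay, Runnable callback);

  interface Scheduled {
    void cancel();
  }
}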

Infer session event associations

Rather than exposing a CompletableFuture for published session events, state machines should infer the command to which published session events relate based on the command that was being executed when they were sent. The state machine can transparently wait for session events to be received when necessary before sending a response. Note that we must ensure that events sent by scheduled callbacks are not improperly associated with the commands that triggered them.

State machines should be able to publish events during session registration

In order to ensure consistency, the internal ServerStateMachine only allows events to be published during certain operations. Effectively, for fault tolerance session events may only be published in response to operations that are replicated through the Raft log and are therefore applied on a majority of servers. But the current implementation is missing support for publishing session events during session registration operations. Session events published during session registration should be linearizable by default.

Client sessions should not expire on NO_LEADER_ERROR

The logic in the current implementation of client-side session expiration is inverted. Clients currently only expire their sessions on NO_LEADER_ERROR but don't on other errors. But clients should not expire their sessions on NO_LEADER_ERROR, since sessions cannot be expired if no leader exists (to ensure compacted keep-alives don't expire sessions). Sessions can persist through loss of quorums safely since session memory usage can only increase if commands are committed to the log. It is when clients can't communicate with the cluster for more than a session timeout (from the beginning of their last successful request) that they should expire their sessions. This is necessary to ensure a client perceives its session as being lost, for consistency with other clients. This will not be totally consistent: clock drift precludes us from guaranteeing a client will expire its own session before the cluster expires it. But perhaps we can provide an alternative failure event to indicate that the session was lost on the client side rather than expired by the cluster.

Bug in network partition detection logic?

Scenario: 3 nodes (s1, s2 and s3).
Kill s1 (the current leader).
Noticed that s2 repeatedly gets elected leader and immediately steps down, suspecting a network partition.

Here are the relevant logs for s2:

13:35:35.542 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.c.server.state.FollowerState - Address[localhost/127.0.0.1:5002] - Heartbeat timed out in PT1.4S
13:35:35.542 [copycat-server-Address[localhost/127.0.0.1:5002]] INFO i.a.c.server.state.FollowerState - Address[localhost/127.0.0.1:5002] - Polling members [Address[localhost/127.0.0.1:5001], Address[localhost/127.0.0.1:5003]]
13:35:35.542 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.c.server.state.FollowerState - Address[localhost/127.0.0.1:5002] - Polling Address[localhost/127.0.0.1:5001] for next term 114
13:35:35.543 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.c.server.state.FollowerState - Address[localhost/127.0.0.1:5002] - Polling Address[localhost/127.0.0.1:5003] for next term 114
13:35:35.543 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.c.server.state.FollowerState - Address[localhost/127.0.0.1:5002] - Received accepted poll from Address[localhost/127.0.0.1:5003]
13:35:35.543 [copycat-server-Address[localhost/127.0.0.1:5002]] INFO i.a.copycat.server.state.ServerState - Address[localhost/127.0.0.1:5002] - Transitioning to CANDIDATE
13:35:35.543 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.c.server.state.FollowerState - Address[localhost/127.0.0.1:5002] - Cancelling heartbeat timer
13:35:35.543 [copycat-server-Address[localhost/127.0.0.1:5002]] INFO i.a.c.server.state.CandidateState - Address[localhost/127.0.0.1:5002] - Starting election
13:35:35.543 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.copycat.server.state.ServerState - Address[localhost/127.0.0.1:5002] - Set term 114
13:35:35.543 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.copycat.server.state.ServerState - Address[localhost/127.0.0.1:5002] - Voted for Address[localhost/127.0.0.1:5002]
13:35:35.543 [copycat-server-Address[localhost/127.0.0.1:5002]] INFO i.a.c.server.state.CandidateState - Address[localhost/127.0.0.1:5002] - Requesting votes from [Address[localhost/127.0.0.1:5001], Address[localhost/127.0.0.1:5003]]
13:35:35.543 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.c.server.state.CandidateState - Address[localhost/127.0.0.1:5002] - Requesting vote from Address[localhost/127.0.0.1:5001] for term 114
13:35:35.543 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.c.server.state.CandidateState - Address[localhost/127.0.0.1:5002] - Requesting vote from Address[localhost/127.0.0.1:5003] for term 114
13:35:35.544 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.c.server.state.CandidateState - Address[localhost/127.0.0.1:5002] - Received successful vote from Address[localhost/127.0.0.1:5003]
13:35:35.544 [copycat-server-Address[localhost/127.0.0.1:5002]] INFO i.a.copycat.server.state.ServerState - Address[localhost/127.0.0.1:5002] - Transitioning to LEADER
13:35:35.544 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.c.server.state.CandidateState - Address[localhost/127.0.0.1:5002] - Cancelling election
13:35:35.544 [copycat-server-Address[localhost/127.0.0.1:5002]] INFO i.a.copycat.server.state.ServerState - Address[localhost/127.0.0.1:5002] - Found leader Address[localhost/127.0.0.1:5002]
13:35:35.560 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.copycat.server.state.LeaderState - Address[localhost/127.0.0.1:5002] - Sent AppendRequest[term=114, leader=2130712286, logIndex=742033, logTerm=1, entries=[325], commitIndex=787602, globalIndex=787570] to Address[localhost/127.0.0.1:5001]
13:35:35.575 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.copycat.server.state.LeaderState - Address[localhost/127.0.0.1:5002] - Sent AppendRequest[term=114, leader=2130712286, logIndex=742031, logTerm=1, entries=[327], commitIndex=787602, globalIndex=787570] to Address[localhost/127.0.0.1:5003]
13:35:35.575 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.copycat.server.state.LeaderState - Address[localhost/127.0.0.1:5002] - Starting heartbeat timer
13:35:35.575 [copycat-server-Address[localhost/127.0.0.1:5002]] WARN i.a.copycat.server.state.LeaderState - Address[localhost/127.0.0.1:5002] - Suspected network partition. Stepping down
13:35:35.575 [copycat-server-Address[localhost/127.0.0.1:5002]] INFO i.a.copycat.server.state.ServerState - Address[localhost/127.0.0.1:5002] - Transitioning to FOLLOWER
13:35:35.575 [copycat-server-Address[localhost/127.0.0.1:5002]] DEBUG i.a.copycat.server.state.LeaderState - Address[localhost/127.0.0.1:5002] - Cancelling heartbeat timer

My hunch is that the logic in place for a leader to detect if it is partitioned away is causing this.

When an AppendEntries request to a follower fails, we immediately check whether the leader is partitioned by looking at when the last successful commit happened. I feel this check should also take into account when the leader's term started.

So perhaps the check ought to be:

System.currentTimeMillis() - Math.max(commitTime(), termStartTime) > context.getElectionTimeout().toMillis() * 2
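To make the intent concrete, here is a minimal sketch of that guard with the term start time folded in (illustrative names, not the actual LeaderState code). The idea is that termStartTime is recorded when the server transitions to LEADER, so a newly elected leader always gets at least two election timeouts before it can suspect a partition:

    // Returns true only if neither a successful commit nor the start of this term
    // happened within the last two election timeouts.
    static boolean suspectPartition(long commitTimeMillis, long termStartTimeMillis, long electionTimeoutMillis) {
      return System.currentTimeMillis() - Math.max(commitTimeMillis, termStartTimeMillis) > electionTimeoutMillis * 2;
    }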

Sessions may be lost after compaction and replay

In certain scenarios, a state machine may lose a session after a restart due to log cleaning even if the session hasn't actually timed out. This can occur if enough keep-alive entries were cleaned after the initial session registration. State machines need a way to tell the difference between missing keep-alive entries and a true session expiration. In order to accomplish this, leaders should commit unregister entries once they see a session expire from their own perspective (which is the most consistent and up-to-date view), and state machines should not remove sessions from memory until an unregister entry is encountered.
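A rough sketch of the proposed rule, under hypothetical names (this is not the actual ServerStateMachine code):

    import java.util.HashMap;
    import java.util.Map;

    class SessionRegistry {
      enum State { OPEN, SUSPICIOUS }

      static final class SessionInfo {
        long lastKeepAliveMillis;
        long timeoutMillis;
        State state = State.OPEN;
      }

      private final Map<Long, SessionInfo> sessions = new HashMap<>();

      // Applied while replaying register/keep-alive/unregister entries: sessions whose
      // timestamps look expired are only marked suspicious, because the missing
      // keep-alives may simply have been compacted out of the log.
      void checkTimeouts(long nowMillis) {
        for (SessionInfo info : sessions.values()) {
          if (nowMillis - info.lastKeepAliveMillis > info.timeoutMillis) {
            info.state = State.SUSPICIOUS;
          }
        }
      }

      // Only a committed unregister entry actually removes the session, so a restart
      // and replay cannot silently drop a session that never really expired.
      void applyUnregister(long sessionId) {
        sessions.remove(sessionId);
      }
    }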
