jgroups-extras / jgroups-raft

Implementation of the RAFT consensus protocol in JGroups

Home Page: https://jgroups-extras.github.io/jgroups-raft/

License: Apache License 2.0

Languages: Java 99.15%, Shell 0.62%, Dockerfile 0.22%
Topics: consensus, distributed-systems, raft, raft-algorithm, raft-protocol

jgroups-raft's People

Contributors

belaban, chr1st0ph, dependabot[bot], fmarinelli, franz1981, hypnoce, jabolina, metanet, npepinpe, oscerd, osheroff, pruivo, valdar, vjuranek

jgroups-raft's Issues

Incorrect commit-index after restart

This is related to #24.

  • Static membership={A,B,C}
  • Start A, B (A is the leader)
  • Kill B, now we have no majority
  • Make a change: last-applied=1, commit-index=0 on A
  • Kill A
  • Restart A and start B
    -> commit-index should now (we have the majority) be 1, but it is still 0 in A and B

Add snapshot transfer

When a member P joins for the first time, or is way out of sync, the leader needs to transfer a snapshot to P.
The leader determines that P is out of sync when an AppendEntries message to P (e.g. with index=95) gets a negative result with index=0. If the leader's first index is 40, the leader will never be able to provide entries 1-39 to P.
If the leader already has a snapshot, e.g. due to log compaction (see #7), that snapshot is used. Otherwise, a new snapshot is generated.
Next, the snapshot is sent to P, which applies it to its state machine and sets last_applied to the index shipped with the snapshot. The log is then created starting at that index+1.
Subsequent AppendEntries messages bring P's log up to date with the leader's via the default mechanism.
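
For illustration, a minimal sketch of the state-machine side of such a transfer, assuming a StateMachine-style interface with writeContentTo()/readContentFrom() for snapshot (de)serialization; KVStateMachine and its map contents are hypothetical:

import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Sketch only: the leader would call writeContentTo() to produce the snapshot,
// P would call readContentFrom() to install it, then set last_applied to the
// index shipped with the snapshot.
public class KVStateMachine {
    protected final Map<String,String> state=new HashMap<>();

    // Serialize the full state; used by the leader to create the snapshot
    public synchronized void writeContentTo(DataOutput out) throws IOException {
        out.writeInt(state.size());
        for(Map.Entry<String,String> e: state.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    // Replace the local state with the snapshot received from the leader
    public synchronized void readContentFrom(DataInput in) throws IOException {
        state.clear();
        int size=in.readInt();
        for(int i=0; i < size; i++)
            state.put(in.readUTF(), in.readUTF());
    }
}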

RAFT doesn't work with ForkChannels

When creating a ForkChannel and adding RAFT over it, there's an NPE in RAFT.init(). This requires a change in JGroups, see [1] for details. The code below doesn't work:

[1] https://issues.jboss.org/browse/JGRP-1926

import java.util.Arrays;

import org.jgroups.JChannel;
import org.jgroups.fork.ForkChannel;
import org.jgroups.protocols.raft.CLIENT;
import org.jgroups.protocols.raft.ELECTION;
import org.jgroups.protocols.raft.RAFT;
import org.jgroups.protocols.raft.REDIRECT;

// Reproducer: creating a ForkChannel with RAFT on top leads to an NPE in RAFT.init()
public class bla {
    protected JChannel   ch, fork_ch;

    protected void start(String name) throws Exception {
        ch=new JChannel("/home/bela/udp.xml").name(name);
        fork_ch=new ForkChannel(ch, "singleton", "fc1",
                                new ELECTION(),
                                new RAFT().members(Arrays.asList("A", "B", "C")).raftId(name),
                                new REDIRECT(),
                                new CLIENT());
        fork_ch.connect("ignored");
        ch.connect("demo");
    }

    public static void main(String[] args) throws Exception {
        new bla().start(args[0]);
    }
}

Demos don't work

Exception in thread "main" java.lang.NullPointerException
at org.jgroups.protocols.raft.RAFT.createLogName(RAFT.java:955)
at org.jgroups.protocols.raft.RAFT.start(RAFT.java:430)
at org.jgroups.stack.ProtocolStack.startStack(ProtocolStack.java:965)
at org.jgroups.JChannel.startStack(JChannel.java:890)
at org.jgroups.JChannel._preConnect(JChannel.java:553)
at org.jgroups.JChannel.connect(JChannel.java:288)
at org.jgroups.JChannel.connect(JChannel.java:279)
at org.jgroups.raft.demos.ReplicatedStateMachineDemo.start(ReplicatedStateMachineDemo.java:30)
at org.jgroups.raft.demos.ReplicatedStateMachineDemo.main(ReplicatedStateMachineDemo.java:184)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

Replicate committed entries from leader to followers

Currently, AppendEntries are sent by the leader and committed at the leader when a majority of votes has been received.
However, entries are not committed at the followers.
Todos:

  • Maintain a commit_table
  • Start a task which periodically (every heartbeat_interval ms) sends AppendEntries messages to members that are lagging behind (see the sketch below)
  • Advance commit_index and apply log entries to the state machine at the followers on reception of an AppendEntries message
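
A minimal sketch of the periodic resend task, using hypothetical names (CommitTask, sendAppendEntriesTo()); this is not the actual jgroups-raft implementation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CommitTask {
    protected final Map<String,Long> commit_table=new ConcurrentHashMap<>(); // member -> last known commit index
    protected volatile long          commit_index;                           // the leader's commit index
    protected final ScheduledExecutorService timer=Executors.newSingleThreadScheduledExecutor();

    public void start(long heartbeat_interval_ms) {
        timer.scheduleAtFixedRate(this::resendToLaggingMembers,
                                  heartbeat_interval_ms, heartbeat_interval_ms, TimeUnit.MILLISECONDS);
    }

    protected void resendToLaggingMembers() {
        commit_table.forEach((member, member_commit) -> {
            if(member_commit < commit_index)
                sendAppendEntriesTo(member, member_commit + 1); // re-send entries the member is missing
        });
    }

    protected void sendAppendEntriesTo(String member, long from_index) {
        // hypothetical: would send an AppendEntries message starting at from_index
    }
}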

RAFT: RequestTable needs to be populated when becoming leader

When the leader's RequestTable contains 2 pending requests (3-4), with last_applied=4 and commit_index=2, and the leader crashes and is restarted (or a new leader is elected), the RequestTable needs to be re-populated with the pending requests.
If this is not done, the new leader will try to commit the pending entries 3 and 4, but when the AppendEntries responses arrive and the RequestTable is empty, entries 3 and 4 will never be committed.
Solution: when becoming leader, populate the RequestTable from the log by adding entries in the range [commit_index+1 .. last_applied].
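
A minimal sketch of the proposed fix, using a hypothetical RequestTable API rather than the actual jgroups-raft classes:

// When a member becomes leader, re-create request-table entries for all
// appended-but-uncommitted log entries, so they can be committed once a
// majority of AppendEntries responses has been received.
public class LeaderInit {
    public static void populateRequestTable(RequestTable req_table, long commit_index, long last_applied) {
        for(long index=commit_index + 1; index <= last_applied; index++)
            req_table.create(index); // pending entry; committed once a majority acks it
    }

    // Hypothetical request table keyed by log index
    public interface RequestTable {
        void create(long index);
    }
}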

Nexus issue while building from scratch

The Ivy retrieve step fails for every dependency because the ${nexus.snapshots.url} property is not resolved, leading to java.net.MalformedURLException: no protocol:

ivy:retrieve] nexus-snapshots: unable to get resource for org/apache/logging/log4j#log4j;2.6.2: res=${nexus.snapshots.url}/org/apache/logging/log4j/log4j/2.6.2/log4j-2.6.2.pom: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/apache/logging/log4j/log4j/2.6.2/log4j-2.6.2.pom
ivy:retrieve] :: org.mapdb#mapdb;1.0.8: several problems occurred while resolving dependency: org.mapdb#mapdb;1.0.8 {=[]}:
ivy:retrieve] several problems occurred while resolving dependency: org.sonatype.oss#oss-parent;7 {}:
ivy:retrieve] nexus-snapshots: unable to get resource for org/sonatype/oss#oss-parent;7: res=${nexus.snapshots.url}/org/sonatype/oss/oss-parent/7/oss-parent-7.jar: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/sonatype/oss/oss-parent/7/oss-parent-7.jar
ivy:retrieve] nexus-snapshots: unable to get resource for org/sonatype/oss#oss-parent;7: res=${nexus.snapshots.url}/org/sonatype/oss/oss-parent/7/oss-parent-7.pom: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/sonatype/oss/oss-parent/7/oss-parent-7.pom
ivy:retrieve] nexus-snapshots: unable to get resource for org/mapdb#mapdb;1.0.8: res=${nexus.snapshots.url}/org/mapdb/mapdb/1.0.8/mapdb-1.0.8.pom: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/mapdb/mapdb/1.0.8/mapdb-1.0.8.pom
ivy:retrieve] :: org.fusesource.leveldbjni#leveldbjni-all;1.8: several problems occurred while resolving dependency: org.fusesource.leveldbjni#leveldbjni-all;1.8 {=[]}:
ivy:retrieve] several problems occurred while resolving dependency: org.fusesource.leveldbjni#leveldbjni-project;1.8 {}:
ivy:retrieve] several problems occurred while resolving dependency: org.fusesource#fusesource-pom;1.9 {}:
ivy:retrieve] nexus-snapshots: unable to get resource for org/fusesource#fusesource-pom;1.9: res=${nexus.snapshots.url}/org/fusesource/fusesource-pom/1.9/fusesource-pom-1.9.jar: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/fusesource/fusesource-pom/1.9/fusesource-pom-1.9.jar
ivy:retrieve] nexus-snapshots: unable to get resource for org/fusesource#fusesource-pom;1.9: res=${nexus.snapshots.url}/org/fusesource/fusesource-pom/1.9/fusesource-pom-1.9.pom: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/fusesource/fusesource-pom/1.9/fusesource-pom-1.9.pom
ivy:retrieve] nexus-snapshots: unable to get resource for org/fusesource/leveldbjni#leveldbjni-project;1.8: res=${nexus.snapshots.url}/org/fusesource/leveldbjni/leveldbjni-project/1.8/leveldbjni-project-1.8.pom: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/fusesource/leveldbjni/leveldbjni-project/1.8/leveldbjni-project-1.8.pom
ivy:retrieve] nexus-snapshots: unable to get resource for org/fusesource/leveldbjni#leveldbjni-all;1.8: res=${nexus.snapshots.url}/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.pom: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.pom
ivy:retrieve] :: org.testng#testng;6.8.+: several problems occurred while resolving dependency: org.sonatype.oss#oss-parent;3 {}:
ivy:retrieve] nexus-snapshots: unable to get resource for org/sonatype/oss#oss-parent;3: res=${nexus.snapshots.url}/org/sonatype/oss/oss-parent/3/oss-parent-3.jar: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/sonatype/oss/oss-parent/3/oss-parent-3.jar
ivy:retrieve] nexus-snapshots: unable to get resource for org/sonatype/oss#oss-parent;3: res=${nexus.snapshots.url}/org/sonatype/oss/oss-parent/3/oss-parent-3.pom: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/sonatype/oss/oss-parent/3/oss-parent-3.pom
ivy:retrieve] :: com.beust#jcommander;1.+: several problems occurred while resolving dependency: org.sonatype.oss#oss-parent;3 {}:
ivy:retrieve] nexus-snapshots: unable to get resource for org/sonatype/oss#oss-parent;3: res=${nexus.snapshots.url}/org/sonatype/oss/oss-parent/3/oss-parent-3.jar: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/sonatype/oss/oss-parent/3/oss-parent-3.jar
ivy:retrieve] nexus-snapshots: unable to get resource for org/sonatype/oss#oss-parent;3: res=${nexus.snapshots.url}/org/sonatype/oss/oss-parent/3/oss-parent-3.pom: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/sonatype/oss/oss-parent/3/oss-parent-3.pom
ivy:retrieve] :: commons-io#commons-io;2.4: several problems occurred while resolving dependency: commons-io#commons-io;2.4 {=[]}:
ivy:retrieve] several problems occurred while resolving dependency: org.apache.commons#commons-parent;25 {}:
ivy:retrieve] several problems occurred while resolving dependency: org.apache#apache;9 {}:
ivy:retrieve] nexus-snapshots: unable to get resource for org/apache#apache;9: res=${nexus.snapshots.url}/org/apache/apache/9/apache-9.jar: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/apache/apache/9/apache-9.jar
ivy:retrieve] nexus-snapshots: unable to get resource for org/apache#apache;9: res=${nexus.snapshots.url}/org/apache/apache/9/apache-9.pom: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/apache/apache/9/apache-9.pom
ivy:retrieve] nexus-snapshots: unable to get resource for org/apache/commons#commons-parent;25: res=${nexus.snapshots.url}/org/apache/commons/commons-parent/25/commons-parent-25.pom: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/org/apache/commons/commons-parent/25/commons-parent-25.pom
ivy:retrieve] nexus-snapshots: unable to get resource for commons-io#commons-io;2.4: res=${nexus.snapshots.url}/commons-io/commons-io/2.4/commons-io-2.4.pom: java.net.MalformedURLException: no protocol: ${nexus.snapshots.url}/commons-io/commons-io/2.4/commons-io-2.4.pom
ivy:retrieve] ::::::::::::::::::::::::::::::::::::::::::::::
ivy:retrieve]

Counter init value is never used

The counter's initial value is never used: CounterService.getOrCreateCounter() calls get() to find out whether a value already exists for the given counter and, if so, uses it. However, get() always returns a value; if the counter doesn't exist, it returns 0, so the initial value is never applied.

Example from CounterServiceDemo, which initializes the counter to 1
(Counter counter=counter_service.getOrCreateCounter("counter", 1);):

-- view: [A(raft-id=A)|0] (1) [A(raft-id=A)]
[1] Increment [2] Decrement [3] Compare and set [4] Dump log
[8] Snapshot [9] Increment N times [x] Exit
first-applied=0, last-applied=0, commit-index=0, log size=0b

-- view: [A(raft-id=A)|1] (2) [A(raft-id=A), C(raft-id=C)]
-- changed role to Candidate
-- changed role to Leader
-- view: [A(raft-id=A)|2] (3) [A(raft-id=A), C(raft-id=C), B(raft-id=B)]
4

index (term): command
---------------------

[1] Increment [2] Decrement [3] Compare and set [4] Dump log
[8] Snapshot [9] Increment N times [x] Exit
first-applied=0, last-applied=0, commit-index=0, log size=0b

3
expected value: 1
update: 5
failed setting counter "counter" from 1 to 5, current value is 0
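
A minimal sketch of the intended semantics (CounterStore is hypothetical, not the actual CounterService): the initial value should only be used when the counter does not yet exist:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CounterStore {
    protected final Map<String,Long> counters=new ConcurrentHashMap<>();

    public long getOrCreate(String name, long initial_value) {
        // putIfAbsent-style semantics: 'initial_value' is used only on first creation,
        // instead of unconditionally falling back to 0 as described in this issue
        return counters.computeIfAbsent(name, n -> initial_value);
    }
}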

RaftHandle: new class to interact with RAFT

Currently, applications have to retrieve a ref to the RAFT and CLIENT protocols, which is cumbersome, as users should not have to deal with the protocol stack of JGroups.
Goal: provide a new class RaftHandle which sits on top of a channel and has methods such as

  • set() (implementing Settable)
  • register the state machine
  • get the last-applied index, commit-index etc.
  • set raft-id
  • Register RoleChange listeners

In a nutshell: let this new class handle protocol stack interactions and shield the users from the stack.
All blocks (e.g. CounterService and ReplicatedStateMachine) should also use this class.

This class could be created on regular channels or also on fork channels.

The name is still TBD.
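
A sketch of what the proposed API could look like, based on the wish list above; all names are illustrative and not a final design:

// Hypothetical RaftHandle API: sits on top of a (regular or fork) channel and
// shields applications from the protocol stack.
public interface RaftHandle {
    byte[] set(byte[] buf, int offset, int length) throws Exception; // Settable: append to the replicated log
    RaftHandle stateMachine(StateMachine sm);                        // register the state machine
    RaftHandle addRoleListener(RoleChangeListener l);                // register RoleChange listeners
    RaftHandle raftId(String id);                                    // set the raft-id
    long lastApplied();                                              // expose last-applied
    long commitIndex();                                              // expose commit-index

    interface StateMachine {
        byte[] apply(byte[] data, int offset, int length) throws Exception;
    }

    interface RoleChangeListener {
        void roleChanged(String new_role);
    }
}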

Consensus based CounterService

Implement CounterService and CounterServiceDemo in jgroups-raft. All counters and their values are essentially stored in a hash map, so an update to a counter simply updates the map.
When fewer than N/2+1 servers are running, the service won't be able to make progress (forsaking availability), but it handles network partitions very well.
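
A minimal sketch of such a counter state machine (class name and wire format are hypothetical): each committed log entry is a command that simply updates a hash map of counter names to values:

import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class CounterStateMachine {
    protected static final byte ADD_AND_GET=1; // only command shown in this sketch
    protected final Map<String,Long> counters=new HashMap<>();

    public synchronized byte[] apply(byte[] data, int offset, int length) throws IOException {
        DataInputStream in=new DataInputStream(new ByteArrayInputStream(data, offset, length));
        byte type=in.readByte();
        if(type != ADD_AND_GET)
            throw new IllegalArgumentException("unknown command " + type);
        String name=in.readUTF();
        long delta=in.readLong();
        long result=counters.merge(name, delta, Long::sum); // apply the increment to the map
        ByteArrayOutputStream bos=new ByteArrayOutputStream();
        new DataOutputStream(bos).writeLong(result);
        return bos.toByteArray(); // returned to the caller of set()
    }
}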

RAFT: uncommitted entry is not replicated

If we have A.last_applied=1 and A.commit_index=0, then the 1 entry that was added to A's log won't get replicated to a newly started B.
To reproduce:

  • RAFT.members is {A,B,C}
  • Use CounterServiceDemo to reproduce
  • Start A
  • Start B. Assuming that the leader is A:
  • Kill B
  • A is still the leader (this might change in the future)
  • Add an entry to A's state machine. This will fail, as we don't have a majority to commit. However, A will add the entry to its log and set last_applied to 1 (commit_index is 0)
  • Start B again
    -> A won't replicate the entry to B!
    -> Adding another entry to A won't help either:
14037 [ERROR] RAFT: A: resending of 0 failed; entry not found

-> RAFT.resend() with index=0 doesn't work (the log starts at 1)!

Cannot apply entries in single node cluster

Related to #68 - when developing locally, it's common to test with a single-node cluster. It seems the commit index is never incremented; perhaps I misconfigured something? Looking through the code, though, it seems to behave as written: the commit index is only incremented when an AppendEntries request is received, which the leader obviously doesn't send to itself. I think this could be solved by using the leader's own index when calculating the quorum.

Did I miss something or is this the intended behaviour?

EDIT: as a side note, using the leader's own match index when computing the quorum has a nice side effect: it allows the append to the local log to be asynchronous, i.e. we can start replicating before the entry has been appended to the local log (this is described in section 10.2.1 of the original thesis)
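
A minimal sketch of the suggested change (hypothetical code): include the leader's own last appended index when computing the highest index replicated on a majority, so a single-member cluster (majority=1) can advance its commit index:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class QuorumIndex {
    // match_indexes: follower -> highest log index known to be replicated there
    public static long computeCommitIndex(Map<String,Long> match_indexes, long leader_last_index, int majority) {
        List<Long> indexes=new ArrayList<>(match_indexes.values());
        indexes.add(leader_last_index);           // count the leader itself
        if(indexes.size() < majority)
            return 0;                             // not enough members acked anything yet
        indexes.sort(Collections.reverseOrder()); // highest first
        return indexes.get(majority - 1);         // index replicated on at least 'majority' members
    }
}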

Cannot create single member cluster

Hi,

while trying to make a single-node Raft quorum work, I ended up in a situation where the single node A kept trying to perform an election that never got a response.

Here is the configuration:

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
  <TCP
          bind_addr="127.0.0.1"
          bind_port="7800"
          port_range="2"
          enable_diagnostics="true"
  />


  <TCPPING initial_hosts="127.0.0.1[7800]"
           port_range="2"
           max_dynamic_hosts="3"
           async_discovery="true"
  />
  <MERGE3 max_interval="30000"
          min_interval="10000"/>
  <FD_SOCK/>
  <FD_ALL/>
  <VERIFY_SUSPECT timeout="1500"  />
  <BARRIER />
  <pbcast.NAKACK2 xmit_interval="500"
                  xmit_table_num_rows="100"
                  xmit_table_msgs_per_row="2000"
                  xmit_table_max_compaction_time="30000"
                  use_mcast_xmit="false"
                  discard_delivered_msgs="true"/>
  <UNICAST3 xmit_interval="500"
            xmit_table_num_rows="100"
            xmit_table_msgs_per_row="2000"
            xmit_table_max_compaction_time="60000"
            conn_expiry_timeout="0"/>
  <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                 max_bytes="4M"/>
  <raft.NO_DUPES/>
  <pbcast.GMS print_local_addr="true" join_timeout="2000"/>
  <!--<UFC max_credits="2M"-->
  <!--min_threshold="0.4"/>-->
  <MFC max_credits="2M"
       min_threshold="0.4"/>
  <FRAG2 frag_size="60K"  />
  <RSVP resend_interval="2000" timeout="10000"/>
  <pbcast.STATE_TRANSFER />
  <raft.ELECTION election_min_interval="100" election_max_interval="500"/>
  <raft.RAFT members="A" raft_id="A" resend_interval="1000" />
  <raft.REDIRECT/>
  <raft.CLIENT bind_addr="0.0.0.0" />
</config>

Listening to the role change events, I only get
changed role to Candidate

Is this expected?

Add log compaction

Log compaction dumps the contents of the state machine into a snapshot file and then truncates the log.
This is done as follows:

  • The member stops handling requests
  • The state machine is asked to dump its state into a snapshot file (same location as the log file(s))
  • The log is truncated:
    • If the log was 1-95 (first=1, last_applied=commit_index=95), then 1-94 will be removed, and first=95, last_applied=commit_index=95
  • The member accepts requests again

Log compaction can be triggered when the log size exceeds a certain configured threshold, or manually (e.g. via JMX). Also, when a leader needs to send a snapshot to a member, and the snapshot doesn't exist, one will be created.
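
A minimal sketch of the compaction sequence, using hypothetical Log/StateMachine interfaces rather than the actual jgroups-raft API:

import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class LogCompaction {
    public interface StateMachine { void writeContentTo(DataOutput out) throws IOException; }
    public interface Log {
        long commitIndex();
        void truncate(long index); // removes all entries below 'index'; first index becomes 'index'
    }

    public static synchronized void compact(StateMachine sm, Log log, String snapshot_file) throws IOException {
        // 1. stop handling requests ('synchronized' stands in for that here)
        long commit_index=log.commitIndex();
        // 2. dump the state machine into the snapshot file (same location as the log files)
        try(DataOutputStream out=new DataOutputStream(new FileOutputStream(snapshot_file))) {
            sm.writeContentTo(out);
        }
        // 3. truncate the log: e.g. 1-95 becomes first=95, last_applied=commit_index=95
        log.truncate(commit_index);
        // 4. requests are accepted again when the lock is released
    }
}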

Log synchronization to out-of-sync member doesn't catch up

When we have {A,B,C}, kill C and make a few more updates (e.g. using ReplicatedStateMachineDemo), then after restarting C, A tries to transfer a snapshot to C.
This is costly and (apparently) doesn't work.
Instead, A should simply send the missing log entries to C.
The same problem occurs when we remove C's log: snapshot installation doesn't work.

This is probably a regression from adding snapshot installation.

ELECTION: start / stop elections based on view changes

Raft [1] duplicates some of the JGroups functionality: the goal of this issue is to remove that duplication and use JGroups wherever possible, e.g. for heartbeats and election kickoffs.
Also, cluster membership changes as described in ch. 4 of [1] should be simple to implement with this change.
Examples:

  • When the view size drops below the majority, the leader steps down
  • When the view size reaches the majority, start the election process. When a leader has been chosen, stop the election timer
  • When the current leader is not in the next view, start a new election

Advantages:

  • No constant heartbeating; this is done by JGroups anyway
  • Easy addition of members, which would change the majority
  • No need to mark views as special entries in the log

Details are in [2].

[1] https://github.com/ongardie/dissertation
[2] https://github.com/belaban/jgroups-raft/blob/master/doc/design/Election.txt

Web site

Use GitHub Pages to create a simple web site; mainly used to host the manual and point to the discussion list.

All members change role to Candidate; how to solve it?

   protected boolean voteFor(final Address addr) {
        if(addr == null) {
            voted_for=null;
            return true;
        }
        if(voted_for == null) {
            voted_for=addr;
            return true;
        }
        return voted_for.equals(addr); // a vote for the same candidate in the same term is ok
    }

protected void handleVoteRequest(Address sender, int term, int last_log_term, int last_log_index) {
        if(local_addr != null && local_addr.equals(sender))
            return;
        if(log.isTraceEnabled())
            log.trace("%s: received VoteRequest from %s: term=%d, my term=%d, last_log_term=%d, last_log_index=%d",
                      local_addr, sender, term, raft.currentTerm(), last_log_term, last_log_index);
        boolean send_vote_rsp=false;
        synchronized(this) {
            if(voteFor(sender)) {
                if(sameOrNewer(last_log_term, last_log_index))
                    send_vote_rsp=true;
                else {
                    log.trace("%s: dropped VoteRequest from %s as my log is more up-to-date", local_addr, sender);
                }
            }
            else
                log.trace("%s: already voted for %s in term %d; skipping vote", local_addr, sender, term);
        }
        if(send_vote_rsp)
            sendVoteResponse(sender, term); // raft.current_term);
    }

The problem seems to be here:
return voted_for.equals(addr); // a vote for the same candidate in the same term is ok

Doesn't work when using TCP

Hi

It doesn't work when I try to use TCP in jgroups-raft with the following configuration:


<TCP bind_port="7800"
recv_buf_size="${tcp.recv_buf_size:130k}"
send_buf_size="${tcp.send_buf_size:130k}"
max_bundle_size="64K"
sock_conn_timeout="300"

     thread_pool.min_threads="0"
     thread_pool.max_threads="20"
     thread_pool.keep_alive_time="30000"/>

<TCPPING async_discovery="true"
         initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800],localhost[7801]}"
         port_range="2"/>
<MERGE3 max_interval="30000"
        min_interval="10000"/>
<FD_SOCK/>
<FD_ALL/>
<VERIFY_SUSPECT timeout="1500"  />
<BARRIER />
<pbcast.NAKACK2 xmit_interval="500"
                xmit_table_num_rows="100"
                xmit_table_msgs_per_row="2000"
                xmit_table_max_compaction_time="30000"
                max_msg_batch_size="500"
                use_mcast_xmit="false"
                discard_delivered_msgs="true"/>
<UNICAST3 xmit_interval="500"
          xmit_table_num_rows="100"
          xmit_table_msgs_per_row="2000"
          xmit_table_max_compaction_time="60000"
          conn_expiry_timeout="0"
          max_msg_batch_size="500"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
               max_bytes="4M"/>
<raft.NO_DUPES/>
<pbcast.GMS print_local_addr="true" join_timeout="2000"
            view_bundling="true"/>
<UFC max_credits="2M"
     min_threshold="0.4"/>
<MFC max_credits="2M"
     min_threshold="0.4"/>
<FRAG2 frag_size="60K"  />
<RSVP resend_interval="2000" timeout="10000"/>
<pbcast.STATE_TRANSFER />
<raft.ELECTION election_min_interval="100" election_max_interval="500"/>
<raft.RAFT members="1,2" raft_id="${raft_id:undefined}" resend_interval="1000"/>
<raft.REDIRECT/>
<raft.CLIENT bind_addr="0.0.0.0" />

Thanks,
Qi

In-memory log implementation for unit tests

Implementation of Log in memory only. Main benefit: speed for unit tests and no need to clean up files on disk after unit tests. This also avoids incorrect unit tests caused by inadvertently reusing left-over log files from previous runs or demos.
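
A minimal sketch of what such an in-memory log could look like (hypothetical API; a real implementation would have to implement jgroups-raft's Log interface):

import java.util.ArrayList;
import java.util.List;

public class InMemoryLog {
    protected final List<byte[]> entries=new ArrayList<>(); // index 0 unused; Raft logs start at 1
    protected long               commit_index;

    public InMemoryLog() { entries.add(null); }

    public synchronized long   append(byte[] entry)    { entries.add(entry); return lastAppended(); }
    public synchronized byte[] get(long index)         { return index > 0 && index < entries.size()? entries.get((int)index) : null; }
    public synchronized long   lastAppended()          { return entries.size() - 1; }
    public synchronized long   commitIndex()           { return commit_index; }
    public synchronized void   commitIndex(long index) { commit_index=index; }
    // No files are ever created, so nothing has to be cleaned up after a test run
}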

CounterServiceDemo: concurrent mass increment blocks clients

When running 2 (or more) instances of CounterServiceDemo and incrementing the same counter from 2 instances at the same time (e.g. press [9] and pick 1000 increments), then both instances block.
A preliminary investigation showed that the issue is with REDIRECT.

AddServer / RemoveServer commands

This is re #21: currently, probe.sh or JMX needs to be used to dynamically add or remove a server. However, we should also provide two commands which can be submitted by a client (e.g. via a script) to add / remove servers. The commands would try to contact a local server (fixed port) or be given an address:port to connect to.
This requires an additional protocol which listens for client commands and forwards them to the current leader, plus a simple client-server protocol.
Make sure the client-server protocol is generic enough to later be reused for generic client commands, e.g. get(), set() etc.

RAFT: commit_id of follower(s) not updated

In some scenarios, we can end up with followers not getting the last commit_id, e.g. (A = leader, B = follower)

  • A: last-applied=1000, commit-id=1000
  • B: last-applied=1000, commit-id=999

The resend task on A should resend the commit-id of 1000 to B, so B can update its commit-id from 999 to 1000 and apply the state change represented by update 1000 to its state machine.
Need to reproduce this first...

Reset heartbeat timer on vote

I may be missing something, so apologies if I'm reading your code wrong :-)
It appears that when handling a VoteRequest, the heartbeat timer is never reset upon voting for a candidate.
https://github.com/belaban/jgroups-raft/blob/master/src/org/jgroups/protocols/raft/ELECTION.java#L191-L211
Raft specifies that a follower resets its election timeout when it grants a vote to a candidate. This ensures that a follower doesn't vote for a candidate and then immediately time out, transition to candidate itself, increment its term, vote for itself, and ultimately force a newly elected leader to step down.
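
A minimal sketch of the requested behaviour (hypothetical code, not the actual ELECTION protocol): granting a vote restarts the follower's election timeout so it does not immediately become a candidate itself:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class ElectionTimer {
    protected final ScheduledExecutorService timer=Executors.newSingleThreadScheduledExecutor();
    protected ScheduledFuture<?>             timeout_task;
    protected final long                     election_timeout_ms;

    public ElectionTimer(long election_timeout_ms) { this.election_timeout_ms=election_timeout_ms; }

    public synchronized void resetElectionTimeout() {
        if(timeout_task != null)
            timeout_task.cancel(false);
        timeout_task=timer.schedule(this::becomeCandidate, election_timeout_ms, TimeUnit.MILLISECONDS);
    }

    // Would be called from handleVoteRequest() when a vote is actually granted
    public synchronized void voteGranted() {
        resetElectionTimeout(); // don't time out right after voting for someone else
    }

    protected void becomeCandidate() {
        // hypothetical: increment term, vote for self, send VoteRequests
    }
}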

Jgroups version incompatibility

The JGroups version in the current pom.xml is 3.6.0.Final; however, that version does not provide
org.jgroups.util.Util.waitUntilAllChannelsHaveSameView, which is used in this project. The method is only available from 4.0.0.Beta3 on, but that version is not compatible with other JGroups symbols used here.

Prevent duplicate members

If we have RAFT.members = {A,B,C}, we do currently prevent a member D from joining. However, we don't prevent two members with the same raft-id from joining, e.g. 2 members each with raft-id=C.
This currently only triggers a warning, but ideally we'd like to prevent the second duplicate C member from starting.
This could be done by each new joiner checking its first view and closing the channel when it finds a dupe. This is not nice, as it may not necessarily terminate the application which started the channel.
Alternatively, we could create a new protocol DUPE_PREVENTION which is a copy of AUTH and rejects all JOIN and MERGE requests which would add a duplicate member to the current view.
'Duplicate member' here means a member whose address (an ExtendedUUID) has a duplicate raft-id.

Leader keeps sending snapshot forever

56773 [DEBUG] RAFT: A: sending snapshot (17b) to B
57775 [DEBUG] RAFT: A: sending snapshot (17b) to B
58777 [DEBUG] RAFT: A: sending snapshot (17b) to B
59779 [DEBUG] RAFT: A: sending snapshot (17b) to B
...

To reproduce:

  1. Start A and B (out of {A,B,C}); they'll have the majority and (we assume) A becomes the leader
  2. Make a few updates, e.g. with CounterServiceDemo
  3. Do a snapshot on A
  4. Kill B and remove its log (/tmp/B.log)
  5. Restart B again
    -> Problem as shown above

Manual

  • Write the manual (copy the build and templates from JGroups)
  • Publish to GitHub Pages: branch gh-pages in the jgroups-raft repo

Leader needs to step down if view size falls below the majority

When we have cluster {A,B,C,D,E}, with A being the leader, and the cluster then splits into {A,B} (A being the leader in term 2) and {C,D,E} (C being the leader in term 3), then A could block client requests for a long time, because it cannot commit them (no majority), until the network split heals.

However, C in term 3 is able to commit changes and would therefore not block client requests.

According to section 6.1 ("Leaders") [1], a leader can realize that it cannot get a majority any longer. In JGroups, this can be done by having A check on each view change if the view's size is still greater than or equal to the majority, and if this is not the case, step down as leader.
Thus, when A gets view change A|2={A,B}, it should become a Follower. This would allow clients to possibly access the new coordinator C.

Followers could do the same: null leader when they get a view whose size is smaller than the majority. This would eliminate the risk of followers redirecting clients to stale leaders.

[1] https://github.com/ongardie/dissertation
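
A minimal sketch of the view-change check described above (hypothetical names): on every view change the leader compares the view size with the majority and steps down when it can no longer commit:

import java.util.List;

public class ViewHandler {
    protected final int majority;       // e.g. 3 for RAFT.members={A,B,C,D,E}
    protected boolean   leader;

    public ViewHandler(int majority) { this.majority=majority; }

    public void viewAccepted(List<String> members) {
        if(leader && members.size() < majority) {
            leader=false;            // step down: become Follower
            stopCommitTask();        // hypothetical: stop resending/committing
        }
        // Followers could also null their leader reference here, so clients
        // are not redirected to a stale leader.
    }

    protected void stopCommitTask() {}
}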

Killing and restarting member P lets P vote twice

We have members={A,B,C,D} (majority=3) and have members A (leader), B and C running (e.g. CounterServiceDemo).
To reproduce:

  • Kill C
  • Now A and B don't have a majority to commit
  • Make a change in A (e.g. increment the counter from 3 to 4)
  • Both A and B add the incr to their log but don't commit as A doesn't have the majority (it only has 2 votes for the change). The value of the counter is therefore still 3.
  • Kill B and restart it
  • The change to the counter is now committed and its value is (incorrectly) 4 !

-> Probably B's vote was counted twice
-> Solution: maintain not just the number of votes but also who voted and discard duplicate votes
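
A minimal sketch of the proposed fix (hypothetical code): track who acked an entry instead of only counting acks, so a restarted member cannot be counted twice:

import java.util.HashSet;
import java.util.Set;

public class Votes {
    protected final Set<String> voters=new HashSet<>(); // raft-ids of members that acked this entry
    protected final int         majority;

    public Votes(int majority) { this.majority=majority; }

    /** Returns true if this ack reaches the majority for the first time */
    public synchronized boolean addVote(String raft_id) {
        boolean added=voters.add(raft_id); // duplicate votes from the same member are ignored
        return added && voters.size() == majority;
    }
}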

RAFT: request correlation

Correlate AppendEntries requests and responses:

  • Clients should get return values when calling set()
  • The matchIndex and nextIndex have to be adjusted
  • If we get responses from a majority, we have to advance the commitIndex and apply all committed entries to the state machine, plus unblock clients blocked on set() with results
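
A minimal sketch of the correlation (hypothetical code): each set() registers a future keyed by the log index; when responses from a majority have arrived and the entry has been committed and applied, the future is completed with the result:

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class RequestCorrelation {
    protected final Map<Long,CompletableFuture<byte[]>> pending=new ConcurrentHashMap<>();

    // Called by set(): the returned future completes once the entry is committed and applied
    public CompletableFuture<byte[]> register(long index) {
        CompletableFuture<byte[]> f=new CompletableFuture<>();
        pending.put(index, f);
        return f;
    }

    // Called when the commit index advances and the entry was applied to the state machine
    public void committed(long index, byte[] state_machine_result) {
        CompletableFuture<byte[]> f=pending.remove(index);
        if(f != null)
            f.complete(state_machine_result); // unblocks the client blocked on set()
    }
}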

Add / remove server (dynamic membership)

Currently, RAFT.majority defines the majority needed for elections and log commits. This requires the operator to always start the same set of servers: it is easy to start A,B,C, append and commit some changes to the log, then stop them, start D,E,F and make some other changes, overwriting the previous ones and violating Leader Completeness.
We need to define a static membership, e.g. {A,B,C}, in the config file, compute the majority from it (2), and use AddServer or RemoveServer to change it dynamically (in running members), with XML editing to change it in the config.
Every member should be identified by a logical name, which also names the persistent log. When a member is started and its name is not in the above list, it should become read-only (not participate in elections) or throw an exception.
Adding or removing a server would involve calling AddServer, which needs to be acked by a majority of the existing view; the config would then need to be changed as well.

Provide ConsensusService building block

Provide a new building block, ConsensusService, which can be used by applications to reach consensus on a decision.
The jgroups-raft implementation is somewhat tied to a state machine; this block would allow it to be used in different scenarios.
