wesleyac / raft

A Raft implementation in python

License: MIT License

Python 100.00%

raft's People

danluu

raft's Issues

Tests fail if network delay is 0

This might be "working as intended" but, in 15 seconds of discussion, Wesley and I couldn't think of an obvious reason this shouldn't work.

up_nodes contains DownNodes, can't call node methods on nodes

If we try to get a list of up_nodes, we get DownNodes! The loop below then fails as soon as it calls a node method on one of them:

        leaders = collections.defaultdict(set)
        for node in self.power_broker['up_nodes'].values():
            if node.is_leader():
                leaders[node.term].add(node.node_id)

======================================================================
ERROR: runTest (hypothesis.stateful.WorldBroker.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/danluu/dev/raft/venv/lib/python3.6/site-packages/hypothesis/stateful.py", line 191, in runTest
    run_state_machine_as_test(state_machine_class)
  File "/Users/danluu/dev/raft/venv/lib/python3.6/site-packages/hypothesis/stateful.py", line 109, in run_state_machine_as_test
    breaker.run(state_machine_factory(), print_steps=True)
  File "/Users/danluu/dev/raft/venv/lib/python3.6/site-packages/hypothesis/stateful.py", line 247, in run
    state_machine.execute_step(value)
  File "/Users/danluu/dev/raft/src/world_broker.py", line 141, in execute_step
    if node.is_leader():
AttributeError: 'DownNode' object has no attribute 'is_leader'
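
One way to sidestep the AttributeError is to skip down nodes explicitly before calling any node method. This is only a minimal sketch: the `Node` and `DownNode` classes below are hypothetical stand-ins, not the classes in the repo, and it assumes a down node can be recognised by its type.

```python
import collections


class DownNode:
    """Hypothetical stand-in for a powered-down node (no is_leader method)."""

    def __init__(self, node_id):
        self.node_id = node_id


class Node:
    """Hypothetical stand-in for a running node."""

    def __init__(self, node_id, term, leader=False):
        self.node_id = node_id
        self.term = term
        self._leader = leader

    def is_leader(self):
        return self._leader


def leaders_by_term(nodes):
    """Map term -> set of leader ids, ignoring any DownNode in the input."""
    leaders = collections.defaultdict(set)
    for node in nodes:
        if isinstance(node, DownNode):
            continue  # a down node cannot be asked whether it is a leader
        if node.is_leader():
            leaders[node.term].add(node.node_id)
    return leaders
```

This only hides the symptom. If up_nodes is never supposed to contain a DownNode, the real fix is wherever nodes get moved between the two collections.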

Multiple leaders elected when ReceiveDrop occurs

Here's one example of a bad execution:

Hypothesis test steps:

Step #1: []
Step #2: [<events.ReceiveDrop at 0x1057d88d0>]
Step #3: []
Step #4: []

terms : set of leaders

defaultdict(<class 'set'>, {1: {2}, 3: {2}, 6: {4}, 10: {2, 4}})

Term 10 has both 2 and 4 as leaders (or the test incorrectly thinks that's the case).
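
The property being violated is that a term has at most one leader. A small sketch of that check over the "terms : set of leaders" mapping (the function name is made up for illustration):

```python
def terms_with_multiple_leaders(leaders):
    """Return {term: leader_ids} for every term that has more than one leader."""
    return {term: ids for term, ids in leaders.items() if len(ids) > 1}


# The mapping printed in the report above.
observed = {1: {2}, 3: {2}, 6: {4}, 10: {2, 4}}
violations = terms_with_multiple_leaders(observed)
```

For the mapping above, `violations` contains only term 10 with nodes 2 and 4.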

List of all change_type calls that don't change a node type to itself

0: Follower->Candidate
1: Candidate->Leader
1: Follower->Candidate
1: Leader->Follower
2: Follower->Candidate
3: Candidate->Leader
3: Leader->Follower
4: Follower->Candidate
4: Follower->Candidate
5: Candidate->Leader
5: Candidate->Leader
5: Leader->Follower
5: Leader->Follower
5: Follower->Candidate
6: Candidate->Leader
6: Leader->Follower
9: Follower->Candidate
9: Follower->Candidate
10: Candidate->Leader
10: Candidate->Leader

The node that became a leader in term 6 doesn't stop being a leader, but in term 10, a new node becomes a leader.

This seems possibly related to #17, where a node went from Candidate to Follower to Leader. The bug that incorrectly caused the node to go from Candidate to Follower was fixed, but an additional bug was that the node should not have been able to go from Follower to Leader.

Something... happens if we set catastrophy level to 20

Term #0, Node #2: Follower->Candidate
Node #2 increased term to 1
Node #2 voted for node #2
Node #3 increased term to 1
Term #0, Node #2: Follower->Candidate
Node #2 increased term to 1
Node #2 voted for node #2
Node #3 increased term to 1
Step #1: [<events.ReceiveDrop at 0x10a18a940>, <events.ReceiveDrop at 0x10a18a668>]
E
======================================================================
ERROR: runTest (hypothesis.stateful.WorldBroker.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/danluu/dev/raft/venv/lib/python3.6/site-packages/hypothesis/stateful.py", line 191, in runTest
    run_state_machine_as_test(state_machine_class)
  File "/Users/danluu/dev/raft/venv/lib/python3.6/site-packages/hypothesis/stateful.py", line 109, in run_state_machine_as_test
    breaker.run(state_machine_factory(), print_steps=True)
  File "/Users/danluu/dev/raft/venv/lib/python3.6/site-packages/hypothesis/stateful.py", line 251, in run
    state_machine.teardown()
  File "/Users/danluu/dev/raft/src/world_broker.py", line 212, in teardown
    self.print_log()
  File "/Users/danluu/dev/raft/src/world_broker.py", line 78, in print_log
    if entry['log_type'] == 'change_type':
KeyError: 'log_type'

----------------------------------------------------------------------
Ran 1 test in 0.764s

FAILED (errors=1)
Term #0, Node #2: Follower->Candidate
Node #2 increased term to 1
Node #2 voted for node #2
Node #3 increased term to 1

One problem is that we log something that doesn't match our print log function. It's possible there's some other bad thing going on here that's masked by our code blowing up because we can't print the log.
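
A defensive version of the per-entry printing could fall back to `repr()` for entries it doesn't recognise, so that the rest of the log still gets printed. This is a sketch, not the repo's print_log: only the `'log_type'` key and the `'change_type'` value come from the traceback, and the other field names (`term`, `node_id`, `old`, `new`) are assumptions.

```python
def format_entry(entry):
    """Render one log entry, falling back to repr() for unknown shapes."""
    log_type = entry.get('log_type')  # .get avoids KeyError when the key is absent
    if log_type == 'change_type':
        # 'term', 'node_id', 'old' and 'new' are assumed field names.
        return 'Term #{term}, Node #{node_id}: {old}->{new}'.format(**entry)
    return 'Unrecognized log entry: {!r}'.format(entry)
```

With something like this in place, whatever is hiding behind the KeyError should at least become visible in the output.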

Pylint

Your code has been rated at -0.75/10

Tell me how you really feel pylint

Get election timeout from hypothesis?

Right now, we calculate election timeouts using

self.rng.randint(self.conf['election_timeout_window'][0],
                 self.conf['election_timeout_window'][1])

We might get better test shrinking if we have hypothesis supply this randomness, similar to how we have hypothesis supply the randomness for message delays.
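
One possible shape for this is to inject the integer source instead of hard-coding `self.rng`. The sketch below is an assumption about how it could be wired, not existing code: `draw_int` is a hypothetical callable, and the window values are illustrative.

```python
import random


def calculate_election_timeout(conf, draw_int):
    """Pick a timeout in the configured window using an injected integer source.

    draw_int(lo, hi) must return an integer in [lo, hi]. Outside of tests it
    could be rng.randint; in a Hypothesis test it could wrap something like
    data.draw(st.integers(min_value=lo, max_value=hi)).
    """
    lo, hi = conf['election_timeout_window']
    return draw_int(lo, hi)


conf = {'election_timeout_window': (150, 300)}  # illustrative window
rng = random.Random(0)
timeout = calculate_election_timeout(conf, rng.randint)
```

The node code stays the same either way. Only the caller decides where the randomness comes from.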

Adverse event generation doesn't work

Need to leave for dinner, but it turns out that if you make this change:

-        self.catastrophy_level = 0
+        self.catastrophy_level = 1

Hypothesis errors out with:

============================================================================ FAILURES ============================================================================
________________________________________________________________________ TestSet.runTest _________________________________________________________________________

self = fixed_dictionaries({'affected_nodes': sets(elements=sampled_from(range(0, 5))),
 'delay': integers(min_value=1, max_va...unt': integers(min_value=-100, max_value=100),
 'start_time': integers(min_value=0, max_value=400)}).flatmap(ClockSkew)

    def accept(self):
        if not hasattr(self, cache_key):
            try:
>               setattr(self, cache_key, getattr(self, force_key))
E               AttributeError: 'OneOfStrategy' object has no attribute 'force_is_empty'

venv/lib/python3.6/site-packages/hypothesis/searchstrategy/strategies.py:102: AttributeError

During handling of the above exception, another exception occurred:

self = fixed_dictionaries({'affected_nodes': sets(elements=sampled_from(range(0, 5))),
 'delay': integers(min_value=1, max_va... integers(min_value=1, max_value=400),
 'start_time': integers(min_value=0, max_value=400)}).flatmap(DeliveryDuplicate)

    def accept(self):
        if not hasattr(self, cache_key):
            try:
>               setattr(self, cache_key, getattr(self, force_key))
E               AttributeError: 'OneOfStrategy' object has no attribute 'force_is_empty'

venv/lib/python3.6/site-packages/hypothesis/searchstrategy/strategies.py:102: AttributeError

During handling of the above exception, another exception occurred:

self = fixed_dictionaries({'affected_node_pair': (sampled_from(range(0, 5)), sampled_from(range(0, 5))),
 'delay': integers(m...gth': integers(min_value=1, max_value=400),
 'start_time': integers(min_value=0, max_value=400)}).flatmap(TransmitDrop)

    def accept(self):
        if not hasattr(self, cache_key):
            try:
>               setattr(self, cache_key, getattr(self, force_key))
E               AttributeError: 'FlatMapStrategy' object has no attribute 'force_is_empty'

venv/lib/python3.6/site-packages/hypothesis/searchstrategy/strategies.py:102: AttributeError
During handling of the above exception, another exception occurred:

self = fixed_dictionaries({'affected_node_pair': (sampled_from(range(0, 5)), sampled_from(range(0, 5))),
 'delay': integers(m...alue=150),
 'event_length': integers(min_value=1, max_value=400),
 'start_time': integers(min_value=0, max_value=400)})

    def accept(self):
        if not hasattr(self, cache_key):
            try:
>               setattr(self, cache_key, getattr(self, force_key))
E               AttributeError: 'LazyStrategy' object has no attribute 'force_is_empty'

venv/lib/python3.6/site-packages/hypothesis/searchstrategy/strategies.py:102: AttributeError

During handling of the above exception, another exception occurred:

self = <hypothesis.stateful.WorldBroker.TestCase testMethod=runTest>

    def runTest(self):
>       run_state_machine_as_test(state_machine_class)

venv/lib/python3.6/site-packages/hypothesis/stateful.py:191:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
venv/lib/python3.6/site-packages/hypothesis/stateful.py:104: in run_state_machine_as_test
    breaker = find_breaking_runner(state_machine_factory, settings)
venv/lib/python3.6/site-packages/hypothesis/stateful.py:90: in find_breaking_runner
    database_key=state_machine_factory.__name__.encode('utf-8')
venv/lib/python3.6/site-packages/hypothesis/core.py:800: in find
    runner.run()
venv/lib/python3.6/site-packages/hypothesis/internal/conjecture/engine.py:320: in run
    self._run()
venv/lib/python3.6/site-packages/hypothesis/internal/conjecture/engine.py:564: in _run
    self.reuse_existing_examples()
venv/lib/python3.6/site-packages/hypothesis/internal/conjecture/engine.py:543: in reuse_existing_examples
    self.test_function(data)
venv/lib/python3.6/site-packages/hypothesis/internal/conjecture/engine.py:125: in test_function
    self._test_function(data)
venv/lib/python3.6/site-packages/hypothesis/core.py:771: in template_condition
    success = condition(result)
venv/lib/python3.6/site-packages/hypothesis/stateful.py:71: in is_breaking_run
    runner.run(state_machine_factory())
venv/lib/python3.6/site-packages/hypothesis/stateful.py:243: in run
    value = self.data.draw(state_machine.steps())
venv/lib/python3.6/site-packages/hypothesis/internal/conjecture/data.py:112: in draw
    return strategy.do_draw(self)
venv/lib/python3.6/site-packages/hypothesis/searchstrategy/lazy.py:154: in do_draw
    return data.draw(self.wrapped_strategy)
venv/lib/python3.6/site-packages/hypothesis/searchstrategy/lazy.py:104: in wrapped_strategy
    *self.__args, **self.__kwargs
venv/lib/python3.6/site-packages/hypothesis/strategies.py:454: in lists
    if elements.is_empty:
...

This is in the branch test-snapshot at 56fc771

Maybe have a way to limit number of simultaneous adverse events?

If we run with cat = 100, we get a failure every time (that I've seen). Here's the set of adverse events from one run:

Step #1: {'adverse_events': [<events.PowerDown at 0x10b24d128>,
  <events.TransmitDrop at 0x10b24d0b8>,
  <events.ReceiveDrop at 0x10b35b0f0>,
  <events.ClockSkew at 0x10b31b358>,
  <events.PowerDown at 0x10b376860>],

{'event_type': 'Simulation Initialization'}
{'affected_node_pair': (2, 1), 'delay': 1, 'event_length': 621, 'start_time': 0, 'event_type': 'TransmitDrop', 'global_time': 0}
{'affected_node': 2, 'event_length': 1, 'skew_amount': 57, 'start_time': 0, 'event_type': 'ClockSkew', 'global_time': 0}
{'affected_nodes': {0}, 'delay': 1, 'event_length': 621, 'start_time': 0, 'event_type': 'ReceiveDrop', 'global_time': 0}
{'affected_node': 3, 'event_length': 275, 'start_time': 0, 'event_type': 'PowerDown', 'global_time': 0}
{'affected_node': 3, 'start_time': 275, 'event_type': 'StopPowerDown', 'global_time': 275}
{'affected_node': 2, 'start_time': 275, 'data': <message.RequestVoteResponse object at 
{'affected_node': 3, 'event_length': 259, 'start_time': 362, 'event_type': 'PowerDown', 'global_time': 362}
{'affected_node': 3, 'start_time': 621, 'event_type': 'StopPowerDown', 'global_time': 621}

This simulation runs for 700ms. For 621ms, node 0 cannot receive messages and there's a problem with nodes 2 and 1 communicating with each other.

With just those two events, only nodes 3 and 4 could possibly be elected leader. In addition to those two events, node 3 is powered down from 0 to 275 and, according to the log, again from 362 to 621 (the second PowerDown starts at 362 and has a length of 259). That leaves node 3 up only from 275 to 362 during the first 621ms. If node 3 doesn't win an election in that window, only node 4 could be elected leader, but there's no guarantee that node 4 will go up for election in the first 700ms, and in fact in this particular log node 4 never becomes a candidate, so the test fails. It's also worth checking whether we handle overlapping powerdown events correctly (do we?).

Also, it's not clear how we have events of duration 621 when we have max_ms_per_event=400.
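
Both questions (how many events are active at once, and which events exceed the configured length) can be checked from the logged `start_time` and `event_length` values. A rough sketch, with made-up function names and events reduced to `(start_time, event_length)` pairs:

```python
def max_concurrent(events):
    """Return the largest number of events active at the same instant.

    Each event is a (start_time, event_length) pair and is treated as the
    half-open interval [start, start + length).
    """
    points = []
    for start, length in events:
        points.append((start, 1))            # event begins
        points.append((start + length, -1))  # event ends
    active = best = 0
    # Sorting the tuples puts -1 before +1 at equal times, so an event that
    # ends at t does not count as overlapping one that starts at t.
    for _, delta in sorted(points):
        active += delta
        best = max(best, active)
    return best


def over_limit(events, max_ms_per_event):
    """Return the events whose length exceeds the configured maximum."""
    return [e for e in events if e[1] > max_ms_per_event]


# (start_time, event_length) pairs taken from the log above.
events = [(0, 621), (0, 1), (0, 621), (0, 275), (362, 259)]
```

A limit on simultaneous events could then be a filter on generated event lists that rejects any list where `max_concurrent` is above some threshold.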

doc-fix

Hello,
This is not really an issue. I just want to change one line of README.md to

pip install -r requirements.txt

This would be clearer to some people, thanks.

update_term resets all state

The update_term code contains

        self.term = 0
        self.log = [] # list[tuple(term, entry)]
        self.commit_index = 0
        self.last_applied = 0
        self.voted_for = None
        self.node_type = 'Follower'
        self.votes_received = set()
        self.election_timeout = self.calculate_election_timeout()

This seems like it can't be right. Maybe it was supposed to be attached to some kind of initialization function? Jinny and I are going to remove this. Please let us know if there's some reason this or something like it should be there.
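
For comparison, in Raft a node that sees a newer term adopts that term and steps down to follower, but it keeps its log. A sketch of what a narrower update_term might look like under that reading (the class is a hypothetical, stripped-down node, and the signature is an assumption):

```python
class SketchNode:
    """Hypothetical, stripped-down node with a few of the fields listed above."""

    def __init__(self):
        self.term = 0
        self.log = []  # list[tuple(term, entry)]
        self.voted_for = None
        self.node_type = 'Follower'
        self.votes_received = set()

    def update_term(self, new_term):
        """Adopt a newer term without discarding the replicated log."""
        if new_term <= self.term:
            return False  # stale or equal term: nothing to do
        self.term = new_term
        self.voted_for = None        # votes are per term
        self.votes_received = set()  # any in-progress election is void
        self.node_type = 'Follower'  # a node that sees a newer term steps down
        return True
```

The full reset quoted above looks more like what `__init__` would do for a brand-new node.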

Can't make time pass without generating adverse events?

Jinny and I looked at this and didn't know what the expected mechanism for making time pass is.

We tried adding this check in teardown:

        if self.catastrophy_level == 0:
            # self.execute_step(20)
            # TODO: this check should be stronger.
            # TODO: heal before checking for other catastrophy levels.
            assert(len(self.leaders_history) > 0)

This check fails because, with catastrophy level 0, we execute for 0 time and then the test ends, so we don't have a leader.

There are a few ways we could fix this, but we're not sure if any of them conflicts with the current intent of the code.

Hack to fix heapq push

We fixed this by adding __lt__ and __eq__ to Event. This works, but it is quite dangerous: it breaks object equality, so any future use of equality on events is potentially confusing.

    def __lt__(self,other):
        return self.event_map['start_time'] < other.event_map['start_time']
    def __eq__(self,other):
        # WARNING: this completely breaks object equality.
        return self.event_map['start_time'] == other.event_map['start_time']
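
An alternative that leaves Event equality alone is to push tuples of `(start_time, counter, event)` onto the heap, so heapq never has to compare two Event objects. This is the usual tie-breaker pattern for heapq. The `Event` and `EventQueue` classes below are hypothetical stand-ins for illustration:

```python
import heapq
import itertools


class Event:
    """Hypothetical minimal event; only event_map['start_time'] matters here."""

    def __init__(self, start_time):
        self.event_map = {'start_time': start_time}


class EventQueue:
    """Priority queue ordered by start_time, then by insertion order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker so Events are never compared

    def push(self, event):
        entry = (event.event_map['start_time'], next(self._counter), event)
        heapq.heappush(self._heap, entry)

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

Events with the same start_time come out in the order they were pushed, and `==` on events keeps its default identity semantics.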

Node goes from follower to leader state

If we look at a trace of node state transitions, we see that a node goes from candidate to follower to leader. It should probably not become a follower in between the candidate and leader states:

src/world_broker.py execute_step
timer_trip 2
change_type 2: Follower -> Candidate
change_type 2: Candidate -> Follower
change_type 4: Follower -> Follower
change_type 3: Follower -> Follower
change_type 2: Follower -> Leader
change_type 1: Follower -> Follower
change_type 2: Leader -> Leader
change_type 0: Follower -> Follower
change_type 2: Leader -> Leader
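
A trace like this can be checked mechanically against Raft's state diagram. The sketch below is not part of the repo's checker; it treats self-transitions (Follower -> Follower and so on) as no-ops and flags everything else that isn't in the allowed set.

```python
ALLOWED = {
    ('Follower', 'Candidate'),  # election timeout fires
    ('Candidate', 'Leader'),    # wins a majority of votes
    ('Candidate', 'Follower'),  # discovers a current leader or a newer term
    ('Leader', 'Follower'),     # discovers a newer term
}


def invalid_transitions(trace):
    """Return the (node_id, old, new) steps that the state diagram forbids.

    Self-transitions such as Follower -> Follower are treated as no-ops.
    """
    return [
        (node_id, old, new)
        for node_id, old, new in trace
        if old != new and (old, new) not in ALLOWED
    ]


# The change_type lines from the trace above.
trace = [
    (2, 'Follower', 'Candidate'),
    (2, 'Candidate', 'Follower'),
    (4, 'Follower', 'Follower'),
    (3, 'Follower', 'Follower'),
    (2, 'Follower', 'Leader'),
    (1, 'Follower', 'Follower'),
    (2, 'Leader', 'Leader'),
    (0, 'Follower', 'Follower'),
    (2, 'Leader', 'Leader'),
]
```

On this trace the only flagged step is node 2 going from Follower to Leader. Candidate -> Follower is legal in general, so whether that step was correct here depends on what message triggered it.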

Repo appears to have someone's virtualenv committed

I'm going to remove bin and include, which appear to have virtualenv stuff that will only work if you are using a mac and your name is bc :-). Let me know if I'm reading this incorrectly and that stuff should not be removed.

Checker can't distinguish between "valid" and "invalid" failures

In the diagram below, blue indicates that a node is down and brown/red indicates a candidate going up for election:

Any of nodes 0, 3, or 4 could theoretically become a leader. However, in the first region of the diagram, nodes 3 and 4 both become candidates at the same time and split the vote so that neither can become leader.

The next time around, 3 and 4 again both go up for election at the same time. Immediately afterwards, there's a 1ms outage in node 1, which prevents either 3 or 4 from becoming leader.

Given these events, it's "correct" that no leader is elected. It's suspicious that nodes 3 and 4 both have an election timeout of 211 twice in a row, so perhaps we have a bug there, but even if there's a bug there and we fix it, that doesn't prevent this case from happening.

Possible error when checking for corner cases with [-1]

In order to get the code to run without errors, we (Brennan, Jinny, and I) added some checks around pulling values out of self.log, with defaults for the case where the log is empty.

        last_logged_term = -1
        last_logged_entry = None
        if len(self.log) > 0:
            last_logged_term = self.log[-1][0]
            last_logged_entry = self.log[-1][1]

This was in code none of us wrote. We were focused on other stuff and didn't read it closely enough to make sure that it makes sense. It's possible/likely that this change makes the code run but introduces a bug.

Related question: does the code in the log work if it gets passed a None?