Once I began adding log repair (only very partially done) to schooner, I found a flaw in the leader design. Currently, the leader spawns a persistent task per peer (the leader_peer_handler fn), and those peer handlers manage sending heartbeats. This seemed like a good idea because it frees the main leader task from having to manage the heartbeat interval amongst all its other activities, including responding to clients in a timely fashion.
The problem is that the peer handlers send stale data in heartbeats - prev_log_idx in particular. This is deadly once log repair is in place: a stale prev_log_idx (say 5 when it should be 6) will cause a follower to truncate its log (removing entry 6 because the heartbeat said 5). Disaster.
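To make the failure mode concrete, here is a minimal sketch in Rust of a follower that trusts prev_log_idx blindly. All names are hypothetical and the logic is deliberately simplified; schooner's real follower code is more involved:

```rust
/// Hypothetical, simplified log entry; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    term: u64,
    idx: u64,
}

/// Simplified follower-side handling of a heartbeat's prev_log_idx:
/// truncate the log so its last entry is prev_log_idx. Against a
/// correct leader this only removes genuinely conflicting entries;
/// against a stale prev_log_idx it discards perfectly good ones.
fn apply_stale_heartbeat(log: &mut Vec<Entry>, prev_log_idx: u64) {
    log.retain(|e| e.idx <= prev_log_idx);
}
```

With a six-entry log, a heartbeat carrying a stale prev_log_idx of 5 silently drops entry 6 - exactly the disaster described above.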
So I'd like to open this for review and even invite someone else to implement it if they'd like to do that.
I thought of two alternatives to solving this (neither fully fleshed out, so one or both may be flawed as well):
#### Option 1: put an RWArc around a shared data structure
This shared data structure would be shared between the persistent peer handlers and the main leader_loop task. It could be a new StateMachine struct that we define. The peer handlers would get a handle to it and would base new heartbeats off whatever is currently in the StateMachine.
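A sketch of Option 1 in current Rust, where Arc<RwLock<...>> plays the role the old RWArc did. The LeaderState fields here are pure assumptions about what a heartbeat needs; the real StateMachine contents are an open question, as noted below:

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical shared state for Option 1; field names are assumptions.
struct LeaderState {
    term: u64,
    prev_log_idx: u64,
    prev_log_term: u64,
    commit_idx: u64,
}

/// A peer handler would read the freshest values just before building
/// each heartbeat, instead of caching them once at spawn time.
fn build_heartbeat(shared: &Arc<RwLock<LeaderState>>) -> (u64, u64, u64, u64) {
    let s = shared.read().unwrap();
    (s.term, s.prev_log_idx, s.prev_log_term, s.commit_idx)
}
```

The key property is that any write the leader_loop makes (e.g., bumping prev_log_idx after an append) is visible to the next heartbeat, closing the staleness window - at the cost of the lock-based sharing discussed in the cons.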
Pros
- It solves (I think) the stale heartbeat problem
Cons
- It introduces a mutex-based data structure, something I'd like to avoid if possible
- I'm not yet sure what the StateMachine contains and whether it has the right information for the heartbeat messages
- Having persistent peer handlers (persistent while that node is leader that is) may be flawed in itself.
- There is no way (that I know of) to mark tasks as daemon threads like you can in Java or Go (all goroutines are daemon). "Daemon" here means they shut down when the non-daemon or main thread shuts down.
- Because of that, if the leader_loop task dies, the peer handlers will not, and the only way they shut down (right now) is by receiving a stop message from the leader_loop task. Otherwise they will send heartbeats forever, even after a new leader has been elected. Bad.
#### Option 2: leader_loop manages heartbeats; peer_handlers are no longer long-lived/persistent
The leader_loop has to balance two opposing forces:
- heartbeats must be sent out on time
- clients are served serially and must be responded to within some finite period (probably longer than the heartbeat timeout), and the leader_loop has to wait for followers to respond before it can respond to the client
Here's a redesigned leader_loop model for consideration:
There is an outer loop that selects over three channels:
- A timeout channel that fires every heartbeat_interval milliseconds
- The event Receiver channel (from the network_listener)
- An AEResponse Receiver from (non-persistent) peer handlers
The event Receiver channel brings in:
- client commands
- AERequests from other peers that claim to be leader (may be a valid or bogus claim)
The AEResponse Receiver gets AEResponses from peers whose replies were not needed to make a majority and respond to the client, and so were not picked up in the inner loop I'm about to describe.
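Assuming std::sync::mpsc channels (which have no built-in select! over multiple receivers), one pass of the outer loop can be approximated with recv_timeout against an absolute heartbeat deadline. Everything here is a hypothetical sketch, not schooner's real types:

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical event types; the real enums in schooner will differ.
enum Event {
    ClientCmd(String),
    AERequest(u64), // claimed term from a putative leader
}
struct AEResponse {
    peer: usize,
    success: bool,
}

/// Which "arm" of the outer select fired.
enum Selected {
    Heartbeat,
    Event(Event),
    Response(AEResponse),
}

/// One pass of the outer loop: check for straggler AEResponses, then
/// wait on the event channel until the heartbeat deadline expires.
fn outer_select(
    events: &Receiver<Event>,
    responses: &Receiver<AEResponse>,
    heartbeat_deadline: Instant,
) -> Selected {
    // Drain any pending peer responses first (non-blocking).
    if let Ok(r) = responses.try_recv() {
        return Selected::Response(r);
    }
    let remaining = heartbeat_deadline.saturating_duration_since(Instant::now());
    match events.recv_timeout(remaining) {
        Ok(e) => Selected::Event(e),
        Err(RecvTimeoutError::Timeout) => Selected::Heartbeat,
        // Simplification: a disconnected channel is treated like a
        // quiet one; real code would shut the leader down here.
        Err(RecvTimeoutError::Disconnected) => Selected::Heartbeat,
    }
}
```

Polling the response channel before blocking is one ordering choice among several; a crossbeam-style select! would treat all three sources symmetrically.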
When heartbeat_interval timer is selected
One non-persistent peer handler task is spawned per peer. That handler makes a connection to the peer and is given an AERequest and a Sender to message back (the Receiver end is the "AEResponse Receiver" mentioned as number 3 in the channel list above).
When event Receiver channel is selected - and it is a client cmd
As with heartbeats, one non-persistent peer handler task is spawned per peer. That handler makes a connection to the peer and is given an AERequest and a Sender to message back (the Receiver end is the "AEResponse Receiver" mentioned as number 3 in the channel list above).
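The spawn-per-round idea might look like the following sketch, using plain threads as stand-ins for tasks. The AERequest/AEResponse shapes are assumptions, and the actual network call to the peer is stubbed out:

```rust
use std::sync::mpsc::Sender;
use std::thread;

/// Hypothetical request/response shapes; schooner's real types differ.
#[derive(Clone)]
struct AERequest {
    term: u64,
    prev_log_idx: u64,
}
struct AEResponse {
    peer: usize,
    term: u64,
    success: bool,
}

/// Spawn one short-lived handler per peer. Each handler owns a clone of
/// the request and a Sender back to the leader_loop; when it returns,
/// the task is gone - nothing persists past this round, so there is no
/// stale state and nothing to orphan if the leader dies.
fn spawn_peer_handlers(peers: &[usize], req: AERequest, tx: Sender<AEResponse>) {
    for &peer in peers {
        let req = req.clone();
        let tx = tx.clone();
        thread::spawn(move || {
            // Stand-in for "connect to peer, send req, await reply".
            let _ = tx.send(AEResponse { peer, term: req.term, success: true });
        });
    }
}
```

Because the Sender passed in is dropped when the function returns and each clone dies with its handler, the leader's Receiver sees a clean disconnect once every handler for the round has finished.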
The leader_loop then sets up an inner loop with another select, that selects over:
- AEResponse Receiver, waiting for peer responses in order to respond to the waiting client
- A (new) heartbeat_interval timer
- A client timeout timer (which probably has much longer duration than the heartbeat_interval timer)
The heartbeat_interval timer is set once per round. The first time it is set, it will be heartbeat_interval millis. If the AEResponse Receiver is selected, the leader decides whether a majority of the cluster has logged the cmd; if so it responds to the client, and if not it goes back into this inner select loop. This time, however, when it sets the heartbeat timer, the duration will need to be heartbeat_interval minus the time spent handling the AEResponse.
The client timeout timer, on the other hand, is set once across all rounds. If it goes off, the client is notified of the failure to log its command and leader_loop returns to the outer loop.
Ideally the time remaining on the heartbeat_interval timer should be taken into account when setting the heartbeat timer in the outer loop, but if heartbeat_interval is <= 1/3 of the election_timeout setting, it is probably OK to be a little sloppy on this one.
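One way to get the "heartbeat_interval minus handling time" arithmetic for free is to track an absolute deadline rather than re-arming a fixed-length timer each time. A sketch (hypothetical helper, not schooner code):

```rust
use std::time::{Duration, Instant};

/// Deadline-based heartbeat timer: because the deadline is absolute,
/// time spent handling AEResponses between waits is automatically
/// subtracted from the next wait - no manual bookkeeping per round.
struct HeartbeatTimer {
    interval: Duration,
    deadline: Instant,
}

impl HeartbeatTimer {
    fn new(interval: Duration) -> Self {
        HeartbeatTimer { interval, deadline: Instant::now() + interval }
    }

    /// How long to block in the next select; zero if already overdue.
    fn remaining(&self) -> Duration {
        self.deadline.saturating_duration_since(Instant::now())
    }

    /// Re-arm only after heartbeats have actually been sent.
    fn rearm(&mut self) {
        self.deadline = Instant::now() + self.interval;
    }
}
```

The remaining() value is exactly what the recv_timeout (or select timeout) in both loops should use, which also handles the "carry the remaining time back to the outer loop" concern without special-casing it.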
When the AEResponse Receiver is selected (outer loop)
These are responses that were not handled in the inner loop, either because they weren't needed to make a majority or because they are responses to heartbeat AERequests. When they come in they must be analyzed for success/failure status. On success, the peer's next_idx must be updated in the leader. On failure, additional AERequests must be sent to do log repair - which requires spawning another peer-handler task. The leader then returns to the outer leader loop. Again, the time taken here must be subtracted from heartbeat_interval when resetting the heartbeat timer.
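A sketch of that success/failure bookkeeping. The match_idx field and the back-off-by-one repair rule are assumptions layered on top of what the text specifies:

```rust
/// Hypothetical leader-side per-peer bookkeeping.
struct PeerState {
    next_idx: u64,
}
struct AEResponse {
    peer: usize,
    success: bool,
    match_idx: u64, // assumption: highest index the peer confirmed
}

/// What follow-up work (if any) a response demands of the leader.
#[derive(Debug, PartialEq)]
enum Followup {
    None,
    LogRepair { peer: usize, from_idx: u64 },
}

/// On success, advance the peer's next_idx past what it acknowledged.
/// On failure, back next_idx off by one (never below 1) and signal
/// that a repair AERequest - a fresh peer-handler task - is needed.
fn handle_response(peers: &mut [PeerState], r: &AEResponse) -> Followup {
    let p = &mut peers[r.peer];
    if r.success {
        p.next_idx = r.match_idx + 1;
        Followup::None
    } else {
        p.next_idx = p.next_idx.saturating_sub(1).max(1);
        Followup::LogRepair { peer: r.peer, from_idx: p.next_idx }
    }
}
```

Returning a Followup value instead of spawning inside the function keeps the decision ("does this response require repair?") separate from the task-spawning mechanics, which makes the outer loop easier to test.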
Summary of Option 2
A lot of details are not covered here, including how to handle AERequests from putative new leaders, but hopefully this outlines the model well enough to evaluate.
Pros
- solves the stale-heartbeat data problem
- no mutexes or shared state
- only one persistent task, so no rogue unkillable leader tasks can remain if the leader_loop task dies
Cons
#### Option 3: ????
Profit! (Your idea here)
Look at other implementations?