See the sketch above, illustrating an important problem with synchronous quorum calls.
The details:
When a client performs what she thinks is a synchronous quorum call by sending an RPC to all servers in a configuration, the implementation actually spawns a goroutine for each server. This is illustrated below for a `Read` quorum call.
```go
func (c *Configuration) Read(ctx context.Context, a *ReadRequest) (resp *State, err error) {
	...
	// Spawn one goroutine per server; replies arrive on replyChan.
	for _, n := range c.nodes {
		go callGRPCRead(ctx, n, a, replyChan)
	}
	...
}

func callGRPCRead(ctx context.Context, node *Node, arg *ReadRequest, replyChan chan<- readReply) {
	reply := new(State)
	err := grpc.Invoke(ctx, "/dev.Register/Read", arg, reply, node.conn)
	replyChan <- readReply{node.id, reply, err}
}
```
This `Read` quorum call can complete before all replies have been received, because only a subset of successful replies is needed to complete a quorum call. Exactly when enough replies have been received is determined by a quorum function. I won't explain the details here, except to say that with a typical majority quorum function, invoking a configuration with three servers, a majority of two replies is enough to complete a call. This has the potential to leave some goroutines running, or possibly even waiting to be started.
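To illustrate, here is a minimal sketch of what a majority quorum function could look like; the method name `readQF` and its signature are my assumptions for illustration, not the library's actual API:

```go
// Sketch of a majority quorum function (assumed signature): with three
// servers, two successful replies are enough to complete the call.
func (c *Configuration) readQF(replies []*State) (*State, bool) {
	if len(replies) <= len(c.nodes)/2 {
		return nil, false // no majority yet; keep waiting for replies
	}
	return replies[len(replies)-1], true // majority reached; call completes
}
```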
Thus, in rare cases it can happen that synchronous quorum call QC1 completes, but some of its goroutines haven't even started running yet. Moreover, since QC1 is complete, the client can perform the next quorum call QC2, and some of its goroutines may actually be scheduled before those belonging to QC1.
Note that it is okay for the goroutines of QC1 to linger after QC2 has started, as long as the RPC calls (the TCP send calls) belonging to QC1 are invoked before those belonging to QC2 for the same server.
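To make the interleaving concrete, consider two back-to-back quorum calls; the `Write` call and `WriteRequest` type here are hypothetical, mirroring the `Read` excerpt above:

```go
// QC1 completes as soon as 2 of 3 servers reply; the goroutine for the
// third (slow) server may not even have been scheduled yet.
c.Write(ctx, &WriteRequest{Value: 1}) // QC1

// QC2 starts right after QC1 returns. Its goroutine for the slow server
// may run before QC1's, so that server sees the writes as 2, then 1.
c.Write(ctx, &WriteRequest{Value: 2}) // QC2
```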
Furthermore, it is often undesirable to cancel these lingering goroutines, since that may cause some (slow) servers to become outdated with respect to the rest. There are, however, some applications/protocols where canceling these goroutines would be okay. Still, that should not be the default, since those that wish to cancel can do so through the context `ctx`.
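For protocols where cancellation is acceptable, the standard `context` machinery already suffices; a minimal sketch:

```go
// readOnce cancels any lingering per-server goroutines of this quorum
// call once it returns; only appropriate for protocols that tolerate it.
func readOnce(c *Configuration) (*State, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	return c.Read(ctx, &ReadRequest{})
}
```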
The problem it causes:
A server can receive messages out of order, even though the underlying networking takes place over TCP. This is unexpected behavior.
Is this relevant in practice?
Yes. This is a problem in the Raft implementation: a server could see `m_102` when expecting `m_101`, and the message is therefore discarded, because Raft expects to receive messages in order. Hence, there is no retransmission of `m_102`, as the message was correctly received by the TCP layer; thus this server will also discard all succeeding messages, since none of them is the expected `m_101`.
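The follower-side check behaves roughly like the following sketch; the types and fields are illustrative, not the actual Raft implementation's:

```go
// Illustrative only: a follower that requires strictly ordered messages.
type message struct{ index uint64 }

type server struct{ expectedIndex uint64 }

func (s *server) handle(m message) {
	if m.index != s.expectedIndex {
		// Out-of-order message: silently discarded. TCP has already
		// delivered it, so it is never retransmitted, and every later
		// message is discarded too while we keep waiting for
		// expectedIndex.
		return
	}
	s.expectedIndex++
	// ... process m ...
}
```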
This has been observed when running a cluster on a single machine, and possibly also in a LAN scenario when putting the leader under load, though the latter has not been confirmed. However, it happens only rarely, as it depends on the scheduling of goroutines, and so it is quite difficult to reproduce.
Is it a problem only relevant to Raft?
Raft requires strict message ordering per server and does not tolerate holes in the log. This is not the case for protocols that can deal with multiple concurrent requests, such as MultiPaxos with α>1, that is, algorithms that allow holes in the log (see the sketch below). It is not clear whether this is a problem in majority-based read-write storage systems, or other such systems, but it would probably require a combination of failures and this bug.
I expect that other protocols and systems may also not tolerate holes in their log, but I can't think of any right now.
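By contrast, a protocol that tolerates holes can simply buffer an out-of-order entry instead of discarding it; an illustrative sketch, not MultiPaxos itself:

```go
// Illustrative only: a sparse log that accepts entries in any order.
type sparseLog struct {
	entries map[uint64]string // log positions may have gaps (holes)
}

func newSparseLog() *sparseLog {
	return &sparseLog{entries: make(map[uint64]string)}
}

func (l *sparseLog) store(index uint64, cmd string) {
	l.entries[index] = cmd // out-of-order arrival is harmless
}
```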
Some possible solutions:
- Leave it as is, and document it.
- Implement a general fix for the problem that applies to all uses of quorum calls. This would be the preferred solution, but it may have ramifications:
- One possible solution is to limit concurrency across quorum calls (prevent pipelining).
- Another solution is to use a `WaitGroup` in an unorthodox way to prevent the problem with very high probability; see the sketch after this list.
- Another solution is to provide an option that protocols needing strict ordering can enable.
- Instead of an option, we could define a separate quorum call type with strict ordering.
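My reading of the `WaitGroup` idea, as a sketch: each goroutine signals once it has been scheduled, before invoking the RPC, and the quorum call waits for all of these signals before collecting replies. The remaining window between `wg.Done()` and the actual send is why this only works with very high probability:

```go
var wg sync.WaitGroup
wg.Add(len(c.nodes))
for _, n := range c.nodes {
	go func(n *Node) {
		wg.Done() // signal: this goroutine is now running
		callGRPCRead(ctx, n, a, replyChan)
	}(n)
}
wg.Wait() // the call cannot complete before every goroutine has started
// ... collect replies from replyChan and apply the quorum function ...
```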
Other concerns:
We should have a reliable test case that can demonstrate and reproduce this bug. However, this seems to require interacting with the Go scheduler.
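For reference, even a trivial program shows that goroutines are not necessarily scheduled in spawn order, though this is far from a reliable reproducer of the bug itself:

```go
package main

import (
	"fmt"
	"time"
)

// The printed order varies between runs: goroutine start order is
// entirely up to the scheduler.
func main() {
	for i := 0; i < 5; i++ {
		go fmt.Println("goroutine", i)
	}
	time.Sleep(time.Second) // crude wait, fine for a demo
}
```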