See the sketch above, illustrating an important problem with synchronous quorum calls.
The details:
When a client performs what she thinks is a synchronous quorum call by sending an RPC to all servers in a configuration, the implementation actually spawns a goroutine for each server. This is illustrated below for a `Read` quorum call.
```go
func (c *Configuration) Read(ctx context.Context, a *ReadRequest) (resp *State, err error) {
	...
	// Spawn one goroutine per server; replies arrive on replyChan.
	for _, n := range c.nodes {
		go callGRPCRead(ctx, n, a, replyChan)
	}
	...
}

func callGRPCRead(ctx context.Context, node *Node, arg *ReadRequest, replyChan chan<- readReply) {
	reply := new(State)
	err := grpc.Invoke(ctx, "/dev.Register/Read", arg, reply, node.conn)
	replyChan <- readReply{node.id, reply, err}
}
```
This `Read` quorum call can complete before all replies have been received, because only a subset of successful replies is needed to complete a quorum call. Exactly when enough replies have been received is determined by a quorum function. I won't explain the details here, except to say that with a typical majority quorum function, invoking a configuration with three servers, a majority of two replies is enough to complete a call. This has the potential to leave some goroutines running, or possibly even waiting to be started.
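To illustrate, here is a minimal sketch of what a majority quorum function could look like; the method name `readQF` and its signature are my assumptions for illustration, not the library's actual API:

```go
// Sketch of a majority quorum function (assumed signature): with three
// servers, two successful replies are enough to complete the call.
func (c *Configuration) readQF(replies []*State) (*State, bool) {
	if len(replies) <= len(c.nodes)/2 {
		return nil, false // no majority yet; keep waiting for replies
	}
	return replies[len(replies)-1], true // majority reached; call completes
}
```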
Thus, in rare cases it can happen that synchronous quorum call QC1 completes, but some of its goroutines haven't even started running yet. Moreover, since QC1 is complete, the client can perform the next quorum call QC2, and some of its goroutines may actually be scheduled before those belonging to QC1.
Note that it is okay for the goroutines of QC1 to linger after QC2 has started, as long as the RPC calls (the TCP send calls) belonging to QC1 are invoked before those belonging to QC2 for the same server.
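To make the interleaving concrete, consider two back-to-back quorum calls; the `Write` call and `WriteRequest` type here are hypothetical, mirroring the `Read` excerpt above:

```go
// QC1 completes as soon as 2 of 3 servers reply; the goroutine for the
// third (slow) server may not even have been scheduled yet.
c.Write(ctx, &WriteRequest{Value: 1}) // QC1

// QC2 starts right after QC1 returns. Its goroutine for the slow server
// may run before QC1's, so that server sees the writes as 2, then 1.
c.Write(ctx, &WriteRequest{Value: 2}) // QC2
```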
Furthermore, it is often undesirable to cancel these lingering goroutines, since that may cause some (slow) servers to become outdated with respect to the rest. There are, however, some applications/protocols where canceling these goroutines would be okay. Still, that should not be the default, since those that wish to cancel can do so through the context `ctx`.
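For protocols where cancellation is acceptable, the standard `context` machinery already suffices; a minimal sketch:

```go
// readOnce cancels any lingering per-server goroutines of this quorum
// call once it returns; only appropriate for protocols that tolerate it.
func readOnce(c *Configuration) (*State, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	return c.Read(ctx, &ReadRequest{})
}
```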
The problem it causes:
A server can receive messages out of order, even though the underlying networking takes place over TCP. This is unexpected behavior.
Is this relevant in practice?
Yes. This is a problem in the Raft implementation: a server could see `m_102` when expecting `m_101`, and the message is therefore discarded, because Raft expects to receive messages in order. Hence, there is no retransmission of `m_102`, as the message was correctly received by the TCP layer; thus this server will also discard all succeeding messages, since none of them is the expected `m_101`.
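The follower-side check behaves roughly like the following sketch; the types and fields are illustrative, not the actual Raft implementation's:

```go
// Illustrative only: a follower that requires strictly ordered messages.
type message struct{ index uint64 }

type server struct{ expectedIndex uint64 }

func (s *server) handle(m message) {
	if m.index != s.expectedIndex {
		// Out-of-order message: silently discarded. TCP has already
		// delivered it, so it is never retransmitted, and every later
		// message is discarded too while we keep waiting for
		// expectedIndex.
		return
	}
	s.expectedIndex++
	// ... process m ...
}
```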
This has been observed when running a cluster on a single machine, and possibly also in a LAN scenario when putting the leader under load, though the latter has not been confirmed. However, it happens only rarely, as it depends on the scheduling of goroutines, and so it is quite difficult to reproduce.
Is it a problem only relevant to Raft?
Raft requires strict message ordering per server and does not tolerate holes in the log. This is not the case for protocols that can deal with multiple concurrent requests, such as MultiPaxos with α>1, that is, algorithms that allow holes in the log (see the sketch below). It is not clear whether this is a problem in majority-based read-write storage systems, or other such systems, but it would probably require a combination of failures and this bug.
I expect that other protocols and systems may also not tolerate holes in their log, but I can't think of any right now.
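By contrast, a protocol that tolerates holes can simply buffer an out-of-order entry instead of discarding it; an illustrative sketch, not MultiPaxos itself:

```go
// Illustrative only: a sparse log that accepts entries in any order.
type sparseLog struct {
	entries map[uint64]string // log positions may have gaps (holes)
}

func newSparseLog() *sparseLog {
	return &sparseLog{entries: make(map[uint64]string)}
}

func (l *sparseLog) store(index uint64, cmd string) {
	l.entries[index] = cmd // out-of-order arrival is harmless
}
```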
Some possible solutions:
- Leave it as is, and document it.
- Implement a general fix for the problem that applies to all uses of quorum calls. This would be the preferred solution, but it may have ramifications:
- One possible solution is to limit concurrency across quorum calls (prevent pipelining).
- Another solution is to use a `WaitGroup` in an unorthodox way to prevent the problem with very high probability; see the sketch after this list.
- Another solution is to provide an option that protocols needing strict ordering can enable.
- Instead of an option, we could define a separate quorum call type with strict ordering.
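My reading of the `WaitGroup` idea, as a sketch: each goroutine signals once it has been scheduled, before invoking the RPC, and the quorum call waits for all of these signals before collecting replies. The remaining window between `wg.Done()` and the actual send is why this only works with very high probability:

```go
var wg sync.WaitGroup
wg.Add(len(c.nodes))
for _, n := range c.nodes {
	go func(n *Node) {
		wg.Done() // signal: this goroutine is now running
		callGRPCRead(ctx, n, a, replyChan)
	}(n)
}
wg.Wait() // the call cannot complete before every goroutine has started
// ... collect replies from replyChan and apply the quorum function ...
```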
Other concerns:
We should have a reliable test case that can demonstrate and reproduce this bug. However, this seems to require interacting with the Go scheduler.
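For reference, even a trivial program shows that goroutines are not necessarily scheduled in spawn order, though this is far from a reliable reproducer of the bug itself:

```go
package main

import (
	"fmt"
	"time"
)

// The printed order varies between runs: goroutine start order is
// entirely up to the scheduler.
func main() {
	for i := 0; i < 5; i++ {
		go fmt.Println("goroutine", i)
	}
	time.Sleep(time.Second) // crude wait, fine for a demo
}
```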