microsoft / ccf
Confidential Consortium Framework
Home Page: https://microsoft.github.io/CCF/
License: Apache License 2.0
Node signatures are currently triggered both by a count and a time mechanism, which occasionally causes signatures to be closer together than they need to be. This is somewhat inefficient, and results in slightly lower throughput than expected in a real system.
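A minimal sketch of a combined trigger, assuming hypothetical names and thresholds (this is not CCF's actual implementation): a signature fires when either the count or the time threshold is hit, but never sooner than a minimum interval after the previous one.

```python
import time

class SignatureScheduler:
    """Sketch: sign when the count OR time threshold fires, but enforce a
    minimum spacing between signatures. All names/values are illustrative."""

    def __init__(self, count_threshold=100, time_threshold_s=1.0, min_interval_s=0.5):
        self.count_threshold = count_threshold
        self.time_threshold_s = time_threshold_s
        self.min_interval_s = min_interval_s
        self.txs_since_sig = 0
        self.last_sig_time = time.monotonic()

    def on_transaction(self):
        self.txs_since_sig += 1

    def should_sign(self):
        elapsed = time.monotonic() - self.last_sig_time
        if elapsed < self.min_interval_s:
            return False  # avoid signatures that are too close together
        return (self.txs_since_sig >= self.count_threshold
                or elapsed >= self.time_threshold_s)

    def signed(self):
        self.txs_since_sig = 0
        self.last_sig_time = time.monotonic()
```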
Since the time in the enclave is updated at every tick
, we may also want to display the delta between the current host time (when the line is printed) and the enclave time when the log was created.
We should add a new end-to-end test that spins up multiple nodes and clients and tests the performance of forwarded commands with client signatures, using the bitcoin secp256k1
curve.
In a PBFT configuration, RPC endpoints would pass commands directly to the consensus. The consensus would then distribute the commands among the nodes, and decide when to issue a batch to the frontends.
Remove lua's use of the __default key; only support the script-per-method approach.
This is useful for users who need to go through proxies until ansible/ansible#42534 is resolved.
apt_repository can be replaced with manual commands to download the key, add it, and then add the apt sources entry.
Extend the existing driver to support testing from Python:
Different users may want to use different crypto curves for their key pair.
See discussion below
See details listed:
https://github.com/microsoft/CCF/blob/master/src/node/nodestate.h#L214
https://github.com/microsoft/CCF/blob/master/src/node/nodestate.h#L265
https://github.com/microsoft/CCF/blob/master/src/node/nodestate.h#L412
https://github.com/microsoft/CCF/blob/master/src/node/nodestate.h#L1073
Would you like to add more error handling for return values from functions like the following?
We should add the capability to allow certain RPCs to be executed only when the node is partOfNetwork
/partOfPublicNetwork
.
As it stands, members can issue RPCs when a node is reading the public ledger (recovery). Instead, we should still let members connect to the CCF service (i.e. TLS handshake succeeds) but their RPCs should return that the node is not in an appropriate state.
One possible solution would be to tweak the install()
function in frontend.h
to allow passing optional arguments corresponding to the white list of node states that a certain RPC can run against.
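One way to sketch this (in Python for brevity; the real change would be to the C++ install() in frontend.h, and all names below are illustrative assumptions):

```python
from enum import Enum, auto

class NodeState(Enum):
    pending = auto()
    readingPublicLedger = auto()
    partOfPublicNetwork = auto()
    partOfNetwork = auto()

class Frontend:
    """Sketch: install() records an optional whitelist of node states in
    which each RPC may run; calls in other states are rejected cleanly."""

    def __init__(self, get_node_state):
        self.get_node_state = get_node_state
        self.handlers = {}

    def install(self, method, handler, allowed_states=None):
        # allowed_states=None keeps the current behaviour: any state is fine
        self.handlers[method] = (handler, allowed_states)

    def call(self, method, *args):
        handler, allowed = self.handlers[method]
        state = self.get_node_state()
        if allowed is not None and state not in allowed:
            return {"error": f"Node is not in an appropriate state ({state.name})"}
        return {"result": handler(*args)}
```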
When running a test, after setting up a network, the nodes table should be populated and therefore readable by ledger.py.
When reading the ledger, we can see identical node entries appearing on sequential transactions, without any property being changed. This seems concerning, since the ledger should only be storing the delta between versions.
After having created a Microsoft VM with SGX support, and having cloned the CCF repository and successfully installed the requirements as specified in README.md
, we attempted to run the end-to-end test as specified here: https://microsoft.github.io/CCF/demo.html
After executing python ../tests/e2e_scenarios.py --scenario ../tests/simple_logging_scenario.json -g ../src/runtime_config/gov.lua --label test1
in the newly created build
directory of the cloned CCF repository, we see the following log output:
2019-06-11 23:57:18.477 | INFO | infra.remote:stop:326 - [127.102.23.245] closing
2019-06-11 23:57:18.547 | ERROR | infra.remote:log_errors:52 - /tmp/ccf_0/out: [fail]../src/node/rpc/nodefrontend.h:32 - - Quote could not be verified OE_INVALID_PARAMETER
2019-06-11 23:57:18.547 | ERROR | infra.remote:log_errors:52 - /tmp/ccf_0/out: [fail]../src/host/rpcconnections.h:179 - - Cannot close id 3: does not exist
2019-06-11 23:57:18.547 | ERROR | infra.remote:log_errors:57 - /tmp/ccf_0/err contents:
2019-06-11 23:57:18.547 | ERROR | infra.remote:log_errors:58 -
2019-06-11 23:57:18.547 | INFO | infra.remote:stop:326 - [127.220.236.151] closing
2019-06-11 23:57:18.595 | INFO | infra.ccf:stop_all_nodes:235 - All remotes stopped...
Traceback (most recent call last):
File "../tests/e2e_scenarios.py", line 90, in <module>
run(args)
File "../tests/e2e_scenarios.py", line 68, in run
network.wait_for_node_commit_sync()
File "/tmp/CCF/tests/infra/ccf.py", line 284, in wait_for_node_commit_sync
assert [commits[0]] * len(commits) == commits, "All nodes at the same commit"
AssertionError: All nodes at the same commit
This error is reproducible across multiple Azure VMs, and on a fresh installation of 18.04 on bare metal. Is this a bug in the CCF end-to-end test, or are we going wrong somewhere? The documentation seems quite sparse, and does not appear to be up to date with the latest command-line parameters and switches. We do have a video of the entire process if this would be useful in debugging.
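For what it's worth, the failing assertion checks a single sample of each node's commit. A polling helper like the following hypothetical one would distinguish a transient lag from a real divergence before asserting:

```python
import time

def wait_for_commit_sync(get_commits, timeout_s=5.0, poll_s=0.1):
    """Poll until every node reports the same commit index, or time out.
    get_commits is a callable returning one commit index per node - a
    hypothetical stand-in for per-node getCommit RPCs in the test infra."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        commits = get_commits()
        if commits and all(c == commits[0] for c in commits):
            return commits[0]
        time.sleep(poll_s)
    raise AssertionError(f"Nodes did not reach the same commit: {get_commits()}")
```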
It is almost certainly worth reading https://colin-scott.github.io/blog/2015/10/07/fuzzing-raft-for-fun-and-profit/ before starting on this.
It seems that in the current implementation of the Raft protocol, at least voted_for is not persisted.
Is it partially implemented or am I missing something?
If a CCF service lost f + 1 nodes, is it possible to recover the service by adding new nodes (on the same platform)?
In section IV-D "Adding a Node to a Service", TR says
The join protocol resumes when the new node gains status trusted through governance.
Does this imply that if f + 1 nodes crashed, one should shut down the whole service and follow the "Catastrophic Recovery" instructions to recover it?
There is no RPC to support adding a node.
This will be needed in order to update the code to a newer version, since code update on an existing node is not allowed.
There ought to be a consistent way to define RPC schemas to provide up-to-date:
Our lua documentation is out-of-date, still mentioning the msg_id
field which is no longer passed to handlers.
It also describes overriding the __default
handler function or individual functions matching the RPC method names. Our example apps only use the latter - perhaps the former can be removed.
When the API for these handlers changes, the lua apps will fail with obscure messages. We should look at improving this by adding validation of the arguments (number, named elements in a table, metaclass providing errors on missing lookup). It would also be useful to add explicit registration to find and report mismatches when the service starts, rather than when an out-of-date method is called.
Use https://github.com/dsprenkels/sss adapted to run in an enclave to implement Shamir's Secret Sharing.
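For illustration, a minimal pure-Python version of Shamir splitting and reconstruction over a prime field (dsprenkels/sss works in GF(256) per byte; this sketch only shows the scheme itself):

```python
import random

P = 2**127 - 1  # a Mersenne prime; any prime larger than the secret works

def split(secret, n, t):
    """Split secret into n shares, any t of which reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial
            acc = (acc * x + c) % P
        return acc
    return [(x, f(x)) for x in range(1, n + 1)]

def combine(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret
```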
The extraction lost the leading words of each step, but the plan reads roughly as:
- Add a ksk_m per member to the ccf.members table, and generate it in the test infra/docs.
- Generate k_z on start-up, split it into "some" shares, encrypt the shares and keep the encrypted shares in the node's memory for now.
- Add a ccf.shares table and populate it with (ledger secrets)k_z and [(shares_m)ksk_m] on start-up.
- Add a getEncryptedRecoveryShare RPC and allow members to easily decrypt their shares.
- Add a submitRecoveryShare RPC (after vote).
- Reassemble k_z when sufficient shares (i.e. the secret sharing threshold) have been submitted, decrypt the ledger secrets from the ccf.shares table and initiate private ledger recovery.
- Generate nk and pass it via the join protocol. The public key should be output by the starting node.
- Members encrypt their shares with nk_pub before passing them to their vote RPC. CCF should decrypt the encrypted share on proposal completion.

Misc:
ServiceStatus should use a state machine, in a similar way to NodeState.

Rather than writing the verbose
LOG_INFO << "a: " << a << ", b: " << b << std::endl;
we should be able to write
LOG_INFO_FMT("a: {}, b: {}", a, b);
to utilise fmtlib, with endls implicit.
We have a limited number of private executors, which we would like to free up as much as possible. Two of the four build jobs are purely virtual by necessity (SAN and coverage) and can be moved to public executors.
I tried to do that with Pipelines, but the networking restrictions they place on containers mean we cannot run the end-to-end tests there: #122
The best thing we can do is move them to the public CI instead and keep the private build strictly for SGX builds.
The extracted code we are using isn't particularly recent, and we are missing a number of improvements made recently.
This is necessary for PBFT support; a Raft-configured CCF service may leave the command and the result empty.
As of the current master
(a1073f7), the small_bank_client_test
silently fails:
38: 2019-06-05 10:44:22.240 | DEBUG | infra.jsonrpc:log_response:229 - #0 {'id': 0, 'result': {'commit': 26, 'term': 2}, 'error': None, 'jsonrpc': '2.0', 'commit': 26, 'term': 2, 'global_commit': 26}
38: 2019-06-05 10:44:23.298 | INFO | infra.jsonrpc:log_request:217 - [127.89.82.26:60798] #0 getCommit {} (node 0 (user))
38: 2019-06-05 10:44:23.308 | DEBUG | infra.jsonrpc:log_response:229 - #0 {'id': 0, 'result': {'commit': 26, 'term': 2}, 'error': None, 'jsonrpc': '2.0', 'commit': 26, 'term': 2, 'global_commit': 26}
38: 2019-06-05 10:44:24.366 | INFO | infra.jsonrpc:log_request:217 - [127.89.82.26:60798] #0 getCommit {} (node 0 (user))
38: 2019-06-05 10:44:24.376 | DEBUG | infra.jsonrpc:log_response:229 - #0 {'id': 0, 'result': {'commit': 26, 'term': 2}, 'error': None, 'jsonrpc': '2.0', 'commit': 26, 'term': 2, 'global_commit': 26}
38: 2019-06-05 10:44:24.434 | INFO | infra.jsonrpc:log_request:217 - [127.89.82.26:60798] #0 getMetrics {} (node 0 (user))
38: 2019-06-05 10:44:24.496 | DEBUG | infra.jsonrpc:log_response:229 - #0 {'id': 0, 'result': {'histogram': {'buckets': {'100..103': 1, '10240..10751': 795, '10752..11263': 2, '108..111': 1, '1344..1407': 1, '1408..1471': 2, '1664..1727': 1, '1792..1855': 1, '1856..1919': 1, '1984..2047': 1, '2048..2175': 1, '2176..2303': ...
38: 2019-06-05 10:44:24.497 | INFO | infra.rates:_get_metrics:84 - Filtering histogram results...
38: 2019-06-05 10:44:29.499 | ERROR | infra.remote_client:wait:102 - Failed to wait on client client_0
38: Traceback (most recent call last):
38:
38: File "/home/ci/agent/_work/4/s/samples/apps/smallbank/tests/small_bank_client.py", line 22, in <module>
38: client.run(args.build_dir, get_command, args)
38: │ │ │ │ │ └ Namespace(accounts=10, app_script=None, build_dir='.', client='./small_bank_client', client_nodes=None, config='/home/ci/agent/_...
38: │ │ │ │ └ <function get_command at 0x7f369d254e18>
38: │ │ │ └ '.'
38: │ │ └ Namespace(accounts=10, app_script=None, build_dir='.', client='./small_bank_client', client_nodes=None, config='/home/ci/agent/_...
38: │ └ <function run at 0x7f369d254d08>
38: └ <module 'client' from '/home/ci/agent/_work/4/s/tests/client.py'>
38:
38: File "/home/ci/agent/_work/4/s/tests/client.py", line 70, in run
38: infra.runner.run(*args, **kwargs)
38: │ │ │ │ └ {}
38: │ │ │ └ ('.', <function get_command at 0x7f369d254e18>, Namespace(accounts=10, app_script=None, build_dir='.', client='./small_bank_clie...
38: │ │ └ <function run at 0x7f369d254a60>
38: │ └ <module 'infra.runner' from '/home/ci/agent/_work/4/s/tests/infra/runner.py'>
38: └ <module 'infra' from '/home/ci/agent/_work/4/s/tests/infra/__init__.py'>
38:
38: File "/home/ci/agent/_work/4/s/tests/infra/runner.py", line 130, in run
38: remote_client.wait()
38: │ └ <bound method CCFRemoteClient.wait of <infra.remote_client.CCFRemoteClient object at 0x7f36a10cfc50>>
38: └ <infra.remote_client.CCFRemoteClient object at 0x7f36a10cfc50>
38:
38: > File "/home/ci/agent/_work/4/s/tests/infra/remote_client.py", line 100, in wait
38: self.remote.wait_for_stdout_line(line="Global commit", timeout=5)
38: │ │ └ <bound method LocalRemote.wait_for_stdout_line of <infra.remote.LocalRemote object at 0x7f369d24dfd0>>
38: │ └ <infra.remote.LocalRemote object at 0x7f369d24dfd0>
38: └ <infra.remote_client.CCFRemoteClient object at 0x7f36a10cfc50>
38:
38: File "/home/ci/agent/_work/4/s/tests/infra/remote.py", line 360, in wait_for_stdout_line
38: "{} not found in stdout after {} seconds".format(line, timeout)
38: │ └ 5
38: └ 'Global commit'
38:
38: ValueError: Global commit not found in stdout after 5 seconds
38: 2019-06-05 10:44:29.550 | INFO | infra.remote:stop:326 - [127.133.12.129] closing
38: 2019-06-05 10:44:29.556 | INFO | infra.runner:run:137 - Rates: ----------- tx rates -----------
38: ----- mean ----: 9492.643637953652
38: ----- harmonic mean ----: 7938.621457889336
38: ---- standard deviation ----: 1373.4581010981751
38: ----- median ----: 9500
38: ---- max ----: 10777
38: ---- min ----: 83
38: ----------- tx rates histogram -----------
38: {
38: "histogram": {
38: "80..83": 1,
38: "100..103": 1,
38: "108..111": 1,
38: "320..335": 1,
38: "672..703": 1,
38: "992..1023": 3,
38: "1344..1407": 1,
38: "1408..1471": 2,
38: "1664..1727": 1,
38: "1792..1855": 1,
38: "1856..1919": 1,
38: "1984..2047": 1,
38: "2048..2175": 1,
38: "2176..2303": 1,
38: "2688..2815": 4,
38: "2816..2943": 5,
38: "2944..3071": 2,
38: "3072..3199": 1,
38: "3200..3327": 1,
38: "3328..3455": 2,
38: "3456..3583": 2,
38: "3584..3711": 4,
38: "3712..3839": 1,
38: "3840..3967": 2,
38: "4096..4351": 5,
38: "4352..4607": 5,
38: "4608..4863": 4,
38: "4864..5119": 3,
38: "5120..5375": 9,
38: "5376..5631": 5,
38: "5632..5887": 4,
38: "5888..6143": 3,
38: "6144..6399": 6,
38: "6400..6655": 12,
38: "6656..6911": 7,
38: "6912..7167": 9,
38: "7168..7423": 9,
38: "7424..7679": 10,
38: "7680..7935": 20,
38: "7936..8191": 21,
38: "8192..8703": 87,
38: "8704..9215": 201,
38: "9216..9727": 863,
38: "9728..10239": 166,
38: "10240..10751": 795,
38: "10752..11263": 2
38: },
38: "low": 0,
38: "high": 10777,
38: "underflow": 339,
38: "overflow": 0
38: }
38: 2019-06-05 10:44:29.566 | INFO | infra.remote:stop:326 - [127.89.82.26] closing
38: 2019-06-05 10:44:29.583 | ERROR | infra.remote:log_errors:52 - /tmp/ci_0/out: [fail]../src/host/rpcconnections.h:179 - - Cannot close id 3: does not exist
38: 2019-06-05 10:44:29.583 | ERROR | infra.remote:log_errors:57 - /tmp/ci_0/err contents:
38: 2019-06-05 10:44:29.583 | ERROR | infra.remote:log_errors:58 -
38: 2019-06-05 10:44:29.583 | INFO | infra.remote:stop:326 - [127.148.108.186] closing
38: 2019-06-05 10:44:29.615 | INFO | infra.ccf:stop_all_nodes:235 - All remotes stopped...
2/3 Test #38: small_bank_sigs_client_test .............. Passed 34.35 sec
test 39
We need to fix this, and also make sure that the test fails when this happens.
It seems that the regression was introduced in this commit: e567277
Imported from an earlier code TODO, I cannot find context for this.
As the number of internal CCF tables used for administration increases, there is an ever-increasing number of reserved names. We ought to prefix these tables (for example "CCF.MEMBERS").
OpenEnclave now exposes a CMake interface; we should take advantage of it to reduce the size of our CMakeLists and become more robust to future OpenEnclave changes.
Our CI and local build on machines supporting SGX currently fail, e.g.:
./cchost --enclave-file=./libsmallbankenc.so.signed --raft-election-timeout-ms=100000 --raft-host=127.37.17.198 --raft-port=59176 --tls-host=127.37.17.198 --tls-pubhost=127.37.17.198 --tls-port=57458 --ledger-file=0.ledger --node-cert-file=0.pem --enclave-type=debug --log-level=info --quote-file=quote0.bin
[info]../src/host/main.cpp:245 - - Starting new node.
[info]../src/host/main.cpp:263 - - Created new node.
15:08:16:190698 tid(0x7fc841795740) (H)[ERROR]:OE_BUFFER_TOO_SMALL[../host/crypto/openssl/cert.c oe_cert_find_extension:991]
15:08:16:190715 tid(0x7fc841795740) (H)[ERROR]:OE_BUFFER_TOO_SMALL[../host/crypto/openssl/cert.c oe_cert_find_extension:991]
15:08:17:376146 tid(0x7fc841795740) (H)[ERROR]X509_verify_cert failed!
error: (12) CRL has expired
(oe_result_t=OE_VERIFY_CRL_EXPIRED)[../host/crypto/openssl/cert.c oe_cert_verify:721]
15:08:17:376170 tid(0x7fc841795740) (H)[ERROR]oe_cer_verify failed with error = CRL has expired
(oe_result_t=OE_VERIFY_CRL_EXPIRED)[../common/sgx/revocation.c oe_enforce_revocation:248]
15:08:17:376174 tid(0x7fc841795740) (H)[ERROR]:OE_INVALID_PARAMETER[../host/crypto/openssl/cert.c oe_cert_chain_free:595]
15:08:17:376176 tid(0x7fc841795740) (H)[ERROR]:OE_INVALID_PARAMETER[../host/crypto/openssl/cert.c oe_cert_chain_free:595]
15:08:17:376177 tid(0x7fc841795740) (H)[ERROR]:OE_INVALID_PARAMETER[../host/crypto/openssl/cert.c oe_cert_chain_free:595]
15:08:17:376179 tid(0x7fc841795740) (H)[ERROR]enforcing CRL (oe_result_t=(null))[OE_VERIFY_CRL_EXPIRED ../common/sgx/quote.c:5139712]
15:08:17:376183 tid(0x7fc841795740) (H)[ERROR]:OE_INVALID_PARAMETER[../host/crypto/openssl/key.c oe_public_key_free:314]
15:08:17:376185 tid(0x7fc841795740) (H)[ERROR]:OE_INVALID_PARAMETER[../host/crypto/openssl/cert.c oe_cert_chain_free:595]
15:08:17:376187 tid(0x7fc841795740) (H)[ERROR]:OE_VERIFY_CRL_EXPIRED[../host/sgx/report.c oe_verify_report:315]
[fail]../src/host/enclave.h:154 - - Quote could not be verified: OE_VERIFY_CRL_EXPIRED
[fatal]../src/host/main.cpp:289 - - Verification of local node quote failed
terminate called after throwing an instance of 'std::logic_error'
what(): Fatal: [fatal]../src/host/main.cpp:289 - - Verification of local node quote failed
See openenclave/openenclave#1842 for further details.
We are missing a test (and therefore probably some functionality) for the code version change proposal, vote and transition user scenario.
The first part of the test should start a simple logging network, and make sure that trying to add a node running a different application (eg. lua app) fails because of the code version check.
The second part of the test should add a variant of the logging app to the build, stage a vote to allow nodes running that code, and gradually replace the nodes in the network with nodes running that new version of the code.
This allows signature verification to be offloaded to the followers, and leaves all leader resources free for transaction processing.
There should be a more efficient late-join/recovery mechanism than replaying through history.
When an RPC is sent to a follower and forwarded, valid_caller
is called only on the follower and not on the leader which later executes the transaction. The caller may be invalid by the time the transaction is executed.
When processing a forwarded transaction the leader should check that the caller is still valid and allowed to execute transactions.
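A sketch of that leader-side re-check (the hook names valid_caller and execute are taken from the issue; everything else is illustrative):

```python
def execute_forwarded(tx, valid_caller, execute):
    """Sketch: the leader re-runs the caller check that the follower already
    performed, immediately before executing a forwarded transaction, since
    the caller may have been removed in the meantime."""
    if not valid_caller(tx["caller_id"]):
        return {"error": "caller no longer valid"}
    return {"result": execute(tx)}
```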
The KV should support consuming a batch of serialised commands in order.
If the hash matches post-execution, the call should return successfully, otherwise, the transactions should be rolled back.
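A sketch of that apply-side contract, assuming a dict-backed store and JSON-serialised commands (all names and the serialisation format are illustrative):

```python
import copy
import hashlib
import json

def apply_batch(store, commands, expected_hash):
    """Sketch: apply an ordered batch of serialised commands, hash the
    resulting ordered write-set, and roll back if the hash shipped with the
    batch does not match the post-execution hash."""
    snapshot = copy.deepcopy(store)
    write_set = []
    for cmd in commands:
        op = json.loads(cmd)              # e.g. {"key": "a", "value": 1}
        store[op["key"]] = op["value"]
        write_set.append((op["key"], op["value"]))
    h = hashlib.sha256(json.dumps(write_set).encode()).hexdigest()
    if h != expected_hash:
        store.clear()
        store.update(snapshot)            # roll back to the pre-batch state
        return False
    return True
```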
I built with -DCURVE_CHOICE=secp256k1_bitcoin
, and the votinghistory test fails with the following error:
29: Traceback (most recent call last):
29: File "/home/v-edasht/src/CCF/tests/votinghistory.py", line 134, in <module>
29: run(args)
29: File "/home/v-edasht/src/CCF/tests/votinghistory.py", line 124, in run
29: verify_sig(cert, sig, req)
29: File "/home/v-edasht/src/CCF/tests/votinghistory.py", line 22, in verify_sig
29: pub_key.verify(sig, req, hash_alg)
29: File "/home/v-edasht/env/lib/python3.7/site-packages/cryptography/hazmat/backends/openssl/ec.py", line 340, in verify
29: _ecdsa_sig_verify(self._backend, self, signature, data)
29: File "/home/v-edasht/env/lib/python3.7/site-packages/cryptography/hazmat/backends/openssl/ec.py", line 89, in _ecdsa_sig_verify
29: raise InvalidSignature
29: cryptography.exceptions.InvalidSignature
Currently, the CI runs with the default curve and smallbank off (ACC_1804_SGX_build job), and with secp256k1_bitcoin only for the small_bank tests (ACC_1804_SGX_perf_build job). We don't have coverage of all tests with all curves/other options.
It should be possible to rotate ledger secrets (which are at the moment the same as network secrets) without executing a full recovery process. This is necessary to enable key-shares, as described in the TR, as opposed to the current sealing key approach.
The KV should be able to emit a batch of ordered requests (serialised commands), along with a hash of the corresponding ordered write-sets, for the replicator/consensus to consume.
When a flag is set in an RPC message, the user should get back a transaction receipt, consisting of path in the Merkle tree of transactions.
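A sketch of producing and verifying such a receipt over a simple binary Merkle tree (illustrative only; not CCF's actual tree layout or hashing conventions):

```python
import hashlib

def h(b):
    return hashlib.sha256(b).digest()

def merkle_receipt(leaves, index):
    """Return (path, root): the sibling path for leaf `index`, where each
    path entry is (sibling_is_left, sibling_hash)."""
    path, level = [], list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])       # duplicate last node on odd levels
        sib = index ^ 1
        path.append((sib < index, level[sib]))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return path, level[0]

def verify_receipt(leaf, path, root):
    """Recompute the root from a leaf and its sibling path."""
    acc = leaf
    for is_left, sib in path:
        acc = h(sib + acc) if is_left else h(acc + sib)
    return acc == root
```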
We should take this opportunity to introduce JSON Schema.
It should be possible for clients to get node quotes, to enable client-side quote verification. Those are currently stored in the KV, they just need to be exposed through a user-facing RPC.
We want to make sure a CCF application never issues any unnecessary OCalls. One way to do that would be to add OCall logging in OpenEnclave, and make our end-to-end tests check for any OCalls.
Ideally OpenEnclave would allow disabling support for implicit OCalls altogether, at least in release enclaves, or in response to a user control.
Tracked in OpenEnclave as openenclave/openenclave#1777
openenclave and push it to Docker Hub.

As it stands, we need to explicitly pass -q (a.k.a. --expect-quote) to our Python test infrastructure to tell it to copy the quote from the remote node to the local/build directory, where the quote is used to create nodes.json.
However, we should expect most of our users to run these tests on an SGX machine, i.e. where a quote is produced. We should therefore switch the default behaviour to expect a quote, so that a flag must be explicitly passed to specify that no quote is expected (for example, on virtual builds).
See #143
Reported by @dantengsky in #86
Suppose a service composed of 3 nodes {n0 .. n2}, where all nodes are synced (same term and index) at the beginning.
The adversary controls a minority {n0} (enclaves are not compromised).
n0 is in the Follower state; the adversary may modify the untrusted-zone code so that AdminMessage::tick messages are sent to the enclave much more frequently, with large enough elapsed_ms values to trigger timeouts.
If I get it right, the victim enclave will keep sending RequestVote messages to its peers, and because these messages are constructed by the enclave, other peers will treat them as legitimate; the honest leader will also transition to the Follower state.
The adversary also drops inbound messages to the victim enclave, so that it cannot transition to the Leader state, and hence no AppendEntries messages will be sent.
If the malicious node keeps being the first RequestVote sender for each new term, the cluster will be effectively shut down.
The network is still partially synchronous and a majority is still alive, but liveness no longer holds.
The most straightforward fix for this is to execute the random election timeout inside the enclave, to make sure it isn't shorter than a lower bound.
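A sketch of that fix, with the added (assumed, not from the issue) safeguard of capping each host-supplied tick delta so inflated elapsed_ms values cannot compress the effective timeout (all bounds are illustrative):

```python
import random

class ElectionTimer:
    """Sketch: the enclave draws its own randomised election timeout, so the
    host can neither choose nor observe it, and clamps each host-supplied
    tick delta to a maximum plausible value."""

    def __init__(self, lo_ms=150, hi_ms=300, max_tick_ms=100):
        self.lo_ms, self.hi_ms = lo_ms, hi_ms
        self.max_tick_ms = max_tick_ms    # cap on host-claimed elapsed time
        self.reset()

    def reset(self):
        # Randomness drawn inside the enclave, within a hard lower bound.
        self.timeout_ms = random.uniform(self.lo_ms, self.hi_ms)
        self.elapsed_ms = 0.0

    def tick(self, host_elapsed_ms):
        """Returns True when an election should be triggered."""
        self.elapsed_ms += min(host_elapsed_ms, self.max_tick_ms)
        return self.elapsed_ms >= self.timeout_ms
```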