fdb-zk

fdb-zk is a FoundationDB layer that mimics the behavior of ZooKeeper. It is installed as a local service alongside an application and replaces both the connection to and the operation of a ZooKeeper cluster.

While the core operations are implemented, fdb-zk has not yet been vetted for production use.

Talk & Slides

Learn about how the layer works in greater detail:

Architecture

Similar to the FoundationDB Document Layer, fdb-zk is hosted locally and translates ZooKeeper requests into FoundationDB transactions.

Applications can continue to use their preferred ZooKeeper clients (a minimal connection sketch follows the diagram below):

┌──────────────────────┐     ┌──────────────────────┐
│ ┌──────────────────┐ │     │ ┌──────────────────┐ │
│ │   Application    │ │     │ │   Application    │ │
│ └──────────────────┘ │     │ └──────────────────┘ │
│           │          │     │           │          │
│           │          │     │           │          │
│       ZooKeeper      │     │       ZooKeeper      │
│        protocol      │     │        protocol      │
│           │          │     │           │          │
│           │          │     │           │          │
│           ▼          │     │           ▼          │
│ ┌──────────────────┐ │     │ ┌──────────────────┐ │
│ │  fdb-zk service  │ │     │ │  fdb-zk service  │ │
│ └──────────────────┘ │     │ └──────────────────┘ │
└──────────────────────┘     └──────────────────────┘
            │                            │
         FDB ops                      FDB ops
            │                            │
            ▼                            ▼
┌───────────────────────────────────────────────────┐
│                   FoundationDB                    │
└───────────────────────────────────────────────────┘
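Because the layer speaks the ZooKeeper wire protocol, a stock ZooKeeper client can connect to the local fdb-zk service unchanged. A minimal connection sketch, assuming the service listens on a local port (the `localhost:2181` address and the timeout below are placeholders, not fdb-zk defaults):

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConnectExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Address and session timeout are illustrative, not fdb-zk defaults.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        System.out.println("connected, session id 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}
```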

Features

fdb-zk implements the core ZooKeeper 3.4.6 API (exercised in the sketch after these lists):

  • create
  • exists
  • delete
  • getData
  • setData
  • getChildren
  • watches
  • session management

It partially implements:

  • multi transactions (reads work, but without read-your-writes semantics within the transaction)

It does not yet implement:

  • getACL/setACL
  • quotas
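As a rough illustration of the implemented surface, the sketch below exercises it through a plain ZooKeeper client; `zk` is assumed to be a connected client as in the connection sketch above, and the paths and payloads are made up:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class CoreOpsExample {
    static void run(ZooKeeper zk) throws Exception {
        // create / exists
        zk.create("/config", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        Stat stat = zk.exists("/config", false);

        // setData / getData (passing true leaves a data watch on /config)
        zk.setData("/config", "v2".getBytes(), stat.getVersion());
        byte[] data = zk.getData("/config", true, null);
        System.out.println(new String(data));

        // getChildren / delete (-1 skips the version check)
        System.out.println(zk.getChildren("/", false));
        zk.delete("/config", -1);
    }
}
```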

Initial Design Discussion

https://forums.foundationdb.org/t/fdb-zk-rough-cut-of-zookeeper-api-layer/1278/

Building with Bazel

  • Compiling: `bazel build //:fdb_zk`
  • Testing: `bazel test //:fdb_zk_test`
  • Dependencies: `bazel query @maven//:all --output=build`

License

fdb-zk is under the Apache 2.0 license. See the LICENSE file for details.

Contributors

claudiouzelac, ph14

Issues

implement ZK watch semantics via changefeeds

We're currently using FDB watches to mimic ZK watches, but the guarantees are different. An FDB watch fires asynchronously whenever the watched key's value changes, but it is also allowed not to fire in the case of a quick ABA update (where the value changes and then changes back).

ZK has pretty specific semantics (per https://zookeeper.apache.org/doc/r3.3.3/zookeeperProgrammers.html#ch_zkWatches):

Watches are ordered with respect to other events, other watches, and asynchronous replies. The ZooKeeper client libraries ensures that everything is dispatched in order.

A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.

The order of watch events from ZooKeeper corresponds to the order of the updates as seen by the ZooKeeper service.

We need to make sure the fdb-zk client dispatches all updates in order across both reads and watches. We can use FDB watches in conjunction with a changefeed of all updates to actively watched nodes. On read operations, we check this feed to see if any events need to be returned (in case the FDB watch hasn't fired yet), and if so, yield those results first, and then continue with the read.

When a node is written:

 Check if `active_watches : zknode  : watch_type : client_id` exists
 If so, for each client:
   Remove `active_watches : zknode : watch_type : *`
   Insert `watches_changefeed : client_id : versionstamp : watch_type` --> `path`
   Increment `watch_trigger : client_id : watch_type`

The read op & watch_trigger callback flow is then:

  Read all `watches_changefeed : client_id : *` and remove all entries
  Return watches to the client in ascending versionstamp order
  For read ops: continue on with the request
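A rough Java sketch of the write-side half of this scheme, using the FDB Java bindings. The subspace names mirror the keys above, but the layout and the `publishWatchEvent` helper are hypothetical, not the layer's actual code:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.concurrent.CompletableFuture;
import com.apple.foundationdb.KeyValue;
import com.apple.foundationdb.MutationType;
import com.apple.foundationdb.Transaction;
import com.apple.foundationdb.subspace.Subspace;
import com.apple.foundationdb.tuple.Tuple;
import com.apple.foundationdb.tuple.Versionstamp;

public class WatchChangefeed {
    // Hypothetical subspaces; the real layer's key layout may differ.
    private static final Subspace ACTIVE_WATCHES = new Subspace(Tuple.from("active_watches"));
    private static final Subspace CHANGEFEED = new Subspace(Tuple.from("watches_changefeed"));
    private static final Subspace WATCH_TRIGGER = new Subspace(Tuple.from("watch_trigger"));

    /** Called inside the write transaction that mutates {@code path}. */
    static CompletableFuture<Void> publishWatchEvent(Transaction tr, String path, String watchType) {
        Subspace watchers = ACTIVE_WATCHES.subspace(Tuple.from(path, watchType));
        return tr.getRange(watchers.range()).asList().thenAccept(kvs -> {
            for (KeyValue kv : kvs) {
                long clientId = watchers.unpack(kv.getKey()).getLong(0);

                // Insert `watches_changefeed : client_id : versionstamp : watch_type --> path`
                // with an incomplete versionstamp so events sort in commit order.
                byte[] feedKey = CHANGEFEED.packWithVersionstamp(
                        Tuple.from(clientId, Versionstamp.incomplete(), watchType));
                tr.mutate(MutationType.SET_VERSIONSTAMPED_KEY, feedKey, Tuple.from(path).pack());

                // Increment `watch_trigger : client_id : watch_type`; the fdb-zk service
                // holds an FDB watch on this key and wakes up when it changes.
                byte[] one = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(1).array();
                tr.mutate(MutationType.ADD, WATCH_TRIGGER.pack(Tuple.from(clientId, watchType)), one);
            }
            // Remove `active_watches : path : watch_type : *` -- ZK watches are one-shot.
            tr.clear(watchers.range());
        });
    }
}
```

Bumping the trigger key with an atomic ADD keeps concurrent writers from conflicting with each other on that key, while still waking the FDB watch the service holds on it.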

Initial discussion in https://forums.foundationdb.org/t/fdb-zk-rough-cut-of-zookeeper-api-layer/1278

handle ACL permissioning

This is overall simple to do but possibly gross.

On the storage side, we store a ZNode's ACLs in `(node_ss, acls) --> [ACLs]`.

On the client auth side, we hook into the ZK server up until the first RequestProcessor to avoid having to deal with client connections & auth. For now at least, this means we can skip worrying about how the connection acquires its auth ids, and just trust that it does based on how ZK would normally do things.

The piece that's not currently done is actually enforcing the ACLs. For each operation, we must check the connection's auth ids against the requested action and the node's (or parent node's) ACLs. All this logic is straightforward, but unfortunately ZK-proper has its ACL-related methods marked package-private and private (https://github.com/apache/zookeeper/blob/branch-3.4.14/zookeeper-server/src/main/java/org/apache/zookeeper/server/PrepRequestProcessor.java#L273-L306), so this might come down to a good ol' case of copypasta.
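If it does come to copying, the check itself is small. A sketch of what a copied-over check could look like (simplified: real ZK delegates id matching to each scheme's AuthenticationProvider, which is elided here, and the class and method names are ours):

```java
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;

public class AclCheck {
    // Rough, simplified stand-in for ZK's package-private PrepRequestProcessor.checkACL,
    // e.g. checkACL(ZooDefs.Perms.WRITE, nodeAcls, connectionAuthIds).
    static void checkACL(int perm, List<ACL> acls, List<Id> authIds) throws KeeperException {
        if (acls == null || acls.isEmpty()) {
            return; // no ACLs stored for the node: open access
        }
        for (ACL acl : acls) {
            if ((acl.getPerms() & perm) == 0) {
                continue; // this ACL doesn't grant the requested permission
            }
            Id id = acl.getId();
            if ("world".equals(id.getScheme()) && "anyone".equals(id.getId())) {
                return; // world:anyone grants it to everybody
            }
            for (Id authId : authIds) {
                // Real ZK matches via the scheme's AuthenticationProvider;
                // exact equality is a simplification here.
                if (authId.getScheme().equals(id.getScheme()) && authId.getId().equals(id.getId())) {
                    return;
                }
            }
        }
        throw new KeeperException.NoAuthException();
    }
}
```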

refactor directory usage

We might want to have the module pull down the directories at startup and then inject them.

Should there be a root directory for the whole app?
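For illustration, resolving the directories once at startup with the FDB directory layer might look roughly like this; the `fdb-zk` root, the child names, and the `Directories` holder are made up for the example:

```java
import java.util.Arrays;
import com.apple.foundationdb.Database;
import com.apple.foundationdb.directory.DirectoryLayer;
import com.apple.foundationdb.directory.DirectorySubspace;

public class Directories {
    final DirectorySubspace nodes;
    final DirectorySubspace sessions;

    // Resolve the layer's directories once at startup so they can be injected
    // into the components that need them. The "fdb-zk" root and the child
    // names are illustrative only.
    Directories(Database db) {
        DirectoryLayer dl = DirectoryLayer.getDefault();
        this.nodes = db.run(tr -> dl.createOrOpen(tr, Arrays.asList("fdb-zk", "nodes")).join());
        this.sessions = db.run(tr -> dl.createOrOpen(tr, Arrays.asList("fdb-zk", "sessions")).join());
    }
}
```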

implement quotas

There are administrative commands to set up byte-size / subtree node-count quotas. This can either be additional information at the directory level, or follow the exact same schema as ZooKeeper, which uses special child nodes.

clean up ephemeral nodes

I think there are two broad pieces here, neither being too crazy:

  1. we need to write the code to clean up ephemeral nodes
  2. we need to figure out how we want to choose a client to execute (1)

For the latter piece, there were a few suggestions in https://forums.foundationdb.org/t/fdb-zk-rough-cut-of-zookeeper-api-layer/1278 to build out a simple leader election, or a cooperative clock that elects a given client to do the work for a given period of time. I think it could also make sense to optionally offload some of this work onto a dedicated client, so that app clients aren't cycling through unrelated background work.

client session management

We can store client sessions in FDB, rather than in-memory as in ZK. Since we can rely on FDB as the source of truth, I'm pretty sure we can avoid having to think about which server is the session owner, SessionMovedExceptions, and the like (since every client will see the same view of the world).

Session IDs are longs assigned by the server. Conveniently, the commit-version portion of an FDB Versionstamp fits in a long, so we can simply write a versionstamped key and use its transaction version as the session id.

Clients initially pass in a timeout when connecting. Clients regularly send in heartbeats to update their timeout.

We can use two subspaces updated transactionally to store this data: session_ids and sessions_by_timeout. session_ids will map a given session id to its next timeout, and sessions_by_timeout will order all session ids by their next timeout, so that we can efficiently identify expired sessions.

On session creation:

Set: `(sessions_by_timeout, next_timeout_timestamp, incomplete_versionstamp)`
Set: `(session_ids, incomplete_versionstamp) --> next_timeout_timestamp`
Return versionstamp transaction id as session id to client
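A sketch of that flow with the FDB Java bindings. The subspace names are hypothetical, and the session id is taken from the first 8 bytes (the commit version) of the 10-byte versionstamp:

```java
import java.util.concurrent.CompletableFuture;
import com.apple.foundationdb.Database;
import com.apple.foundationdb.MutationType;
import com.apple.foundationdb.subspace.Subspace;
import com.apple.foundationdb.tuple.Tuple;
import com.apple.foundationdb.tuple.Versionstamp;

public class SessionCreate {
    // Hypothetical subspaces matching the schema above.
    static final Subspace SESSION_IDS = new Subspace(Tuple.from("session_ids"));
    static final Subspace SESSIONS_BY_TIMEOUT = new Subspace(Tuple.from("sessions_by_timeout"));

    /** Creates a session and returns its id, taken from the commit versionstamp. */
    static long createSession(Database db, long timeoutMillis) {
        long nextTimeout = System.currentTimeMillis() + timeoutMillis;

        CompletableFuture<byte[]> vsFuture = db.run(tr -> {
            // (sessions_by_timeout, next_timeout, <versionstamp>) --> ""
            tr.mutate(MutationType.SET_VERSIONSTAMPED_KEY,
                    SESSIONS_BY_TIMEOUT.packWithVersionstamp(
                            Tuple.from(nextTimeout, Versionstamp.incomplete())),
                    new byte[0]);
            // (session_ids, <versionstamp>) --> next_timeout
            tr.mutate(MutationType.SET_VERSIONSTAMPED_KEY,
                    SESSION_IDS.packWithVersionstamp(Tuple.from(Versionstamp.incomplete())),
                    Tuple.from(nextTimeout).pack());
            // The versionstamp is only known once the transaction commits.
            return tr.getVersionstamp();
        });

        // First 8 bytes of the 10-byte versionstamp are the commit version; use it as the session id.
        byte[] vs = vsFuture.join();
        return java.nio.ByteBuffer.wrap(vs, 0, 8).getLong();
    }
}
```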

On heartbeat:

Read: `(session_ids, session_id_versionstamp) --> timeout timestamp` to find the next timeout
  If timestamp doesn't exist or is expired:
    Remove: `(session_ids, session_id_versionstamp)`
    Remove: `(sessions_by_timeout, timeout, session_id_versionstamp)`
    Return SessionExpired to client

  If timestamp is within range:
    Remove: `(sessions_by_timeout, old_timeout, session_id_versionstamp)`
    Set: `(sessions_by_timeout, next_timeout_timestamp, session_id_versionstamp)`
    Set: `(session_ids, session_id_versionstamp) --> next_timeout_timestamp`
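A matching heartbeat sketch. To keep it short it takes the session's full `Versionstamp` rather than the 8-byte long id (mapping the long back to the exact key is left out):

```java
import com.apple.foundationdb.Database;
import com.apple.foundationdb.subspace.Subspace;
import com.apple.foundationdb.tuple.Tuple;
import com.apple.foundationdb.tuple.Versionstamp;

public class SessionHeartbeat {
    static final Subspace SESSION_IDS = new Subspace(Tuple.from("session_ids"));
    static final Subspace SESSIONS_BY_TIMEOUT = new Subspace(Tuple.from("sessions_by_timeout"));

    /** Returns true if the session was renewed, false if it had already expired. */
    static boolean heartbeat(Database db, Versionstamp session, long timeoutMillis) {
        return db.run(tr -> {
            byte[] idKey = SESSION_IDS.pack(Tuple.from(session));
            byte[] value = tr.get(idKey).join();
            long now = System.currentTimeMillis();

            if (value == null) {
                return false; // unknown session: already expired and cleaned up
            }
            long oldTimeout = Tuple.fromBytes(value).getLong(0);

            if (oldTimeout < now) {
                // Expired: remove both entries and report SessionExpired to the client.
                tr.clear(idKey);
                tr.clear(SESSIONS_BY_TIMEOUT.pack(Tuple.from(oldTimeout, session)));
                return false;
            }

            // Still live: slide both entries forward to the new timeout.
            long nextTimeout = now + timeoutMillis;
            tr.clear(SESSIONS_BY_TIMEOUT.pack(Tuple.from(oldTimeout, session)));
            tr.set(SESSIONS_BY_TIMEOUT.pack(Tuple.from(nextTimeout, session)), new byte[0]);
            tr.set(idKey, Tuple.from(nextTimeout).pack());
            return true;
        });
    }
}
```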

To find expired sessions (which is a prereq for #4):

Scan: `(sessions_by_timeout, stale_timeouts)` for any timestamps that are now considered stale
Clean up their ephemeral nodes
Range delete the scan range from before
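A sketch of that scan; the ephemeral-node cleanup itself belongs to the "clean up ephemeral nodes" issue above, so it is only a placeholder here:

```java
import java.util.List;
import com.apple.foundationdb.Database;
import com.apple.foundationdb.KeyValue;
import com.apple.foundationdb.Range;
import com.apple.foundationdb.subspace.Subspace;
import com.apple.foundationdb.tuple.Tuple;
import com.apple.foundationdb.tuple.Versionstamp;

public class SessionReaper {
    static final Subspace SESSIONS_BY_TIMEOUT = new Subspace(Tuple.from("sessions_by_timeout"));

    /** Finds sessions whose timeout has passed, cleans them up, and removes the index entries. */
    static void reapExpiredSessions(Database db) {
        db.run(tr -> {
            long now = System.currentTimeMillis();
            // Everything ordered before `now` in the index is stale.
            Range stale = new Range(SESSIONS_BY_TIMEOUT.range().begin,
                                    SESSIONS_BY_TIMEOUT.pack(Tuple.from(now)));
            List<KeyValue> expired = tr.getRange(stale).asList().join();
            for (KeyValue kv : expired) {
                Versionstamp session = SESSIONS_BY_TIMEOUT.unpack(kv.getKey()).getVersionstamp(1);
                // Hypothetical hook: clean up this session's ephemeral nodes (see the
                // "clean up ephemeral nodes" issue) and remove its `session_ids` entry.
                System.out.println("expiring session " + session);
            }
            tr.clear(stale); // range-delete the scanned entries
            return null;
        });
    }
}
```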
