spacejam / sled

the champagne of beta embedded databases

License: Apache License 2.0

Topics: incredibly-spicy, database, embedded-kv, concurrent, lock-free, formal-methods, high-performance, rust, kv, b-tree

sled's Introduction

sled - it's all downhill from here!!!

An embedded database.

let tree = sled::open("/tmp/welcome-to-sled")?;

// insert and get, similar to std's BTreeMap
let old_value = tree.insert("key", "value")?;

assert_eq!(
  tree.get(&"key")?,
  Some(sled::IVec::from("value")),
);

// range queries
for kv_result in tree.range("key_1".."key_9") {}

// deletion
let old_value = tree.remove(&"key")?;

// atomic compare and swap
tree.compare_and_swap(
  "key",
  Some("current_value"),
  Some("new_value"),
)?;

// block until all operations are stable on disk
// (flush_async also available to get a Future)
tree.flush()?;

If you would like to work with structured data without paying expensive deserialization costs, check out the structured example!

features

  • API similar to a threadsafe BTreeMap<[u8], [u8]>
  • serializable (ACID) transactions for atomically reading and writing to multiple keys in multiple keyspaces (sketched after this list)
  • fully atomic single-key operations, including compare and swap
  • zero-copy reads
  • write batches
  • subscribe to changes on key prefixes
  • multiple keyspaces
  • merge operators
  • forward and reverse iterators over ranges of items
  • a crash-safe monotonic ID generator capable of generating 75-125 million unique IDs per second
  • zstd compression (use the compression build feature, disabled by default)
  • cpu-scalable lock-free implementation
  • flash-optimized log-structured storage
  • uses modern b-tree techniques such as prefix encoding and suffix truncation for reducing the storage costs of long keys with shared prefixes. If keys are the same length and sequential then the system can avoid storing 99%+ of the key data in most cases, essentially acting like a learned index
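
A minimal sketch of a few of the features above (write batches, a single-keyspace transaction, and the ID generator), assuming a recent 0.34-series sled release; the path, keys, and values are placeholders:

use sled::Batch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "my_db" is a placeholder path
    let db = sled::open("my_db")?;

    // write batch: all operations land atomically or not at all
    let mut batch = Batch::default();
    batch.insert("key_a", "value_a");
    batch.insert("key_b", "value_b");
    batch.remove("stale_key");
    db.apply_batch(batch)?;

    // serializable transaction over one keyspace; the closure may be
    // retried on conflict, so keep it free of external side effects
    let txn: sled::transaction::TransactionResult<(), ()> = db.transaction(|tx| {
        if let Some(value) = tx.get("key_a")? {
            tx.insert("key_c", value)?;
        }
        Ok(())
    });
    txn.expect("transaction failed");

    // crash-safe monotonic ID generator
    let id = db.generate_id()?;
    println!("allocated id {}", id);

    Ok(())
}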

expectations, gotchas, advice

  • Maybe one of the first things that seems weird is the IVec type. This is an inlinable Arced slice that makes some things more efficient.
  • Durability: sled automatically fsyncs every 500ms by default, which can be configured with the flush_every_ms configuration option, or you may call flush / flush_async manually after operations (see the configuration sketch after this list).
  • Transactions are optimistic - do not interact with external state or perform IO from within a transaction closure unless it is idempotent.
  • Internal tree node optimizations: sled performs prefix encoding on long keys with similar prefixes that are grouped together in a range, as well as suffix truncation to further reduce the indexing costs of long keys. Nodes will skip potentially expensive length and offset pointers if keys or values are all the same length (tracked separately, don't worry about making keys the same length as values), so it may improve space usage slightly if you use fixed-length keys or values. This also makes it easier to use structured access as well.
  • sled does not support multiple open instances for the time being. Please keep sled open for the duration of your process's lifespan. It's totally safe and often quite convenient to use a global lazy_static sled instance, modulo the normal global variable trade-offs. Every operation is threadsafe, and most are implemented under the hood with lock-free algorithms that avoid blocking in hot paths.
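
Related to the durability and caching notes above, a minimal configuration sketch using sled::Config; the path and the numbers are placeholders, not recommendations:

fn main() -> sled::Result<()> {
    let db = sled::Config::new()
        .path("my_db")                       // placeholder path
        .cache_capacity(512 * 1024 * 1024)   // in-memory cache budget, in bytes
        .flush_every_ms(Some(1_000))         // background fsync interval; None disables it
        .open()?;

    db.insert("key", "value")?;
    db.flush()?; // force durability now rather than waiting for the background flush
    Ok(())
}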

performance

  • LSM tree-like write performance with traditional B+ tree-like read performance
  • over a billion operations in under a minute at 95% read 5% writes on 16 cores on a small dataset
  • measure your own workloads rather than relying on some marketing for contrived workloads

a note on lexicographic ordering and endianness

If you want to store numerical keys in a way that will play nicely with sled's iterators and ordered operations, please remember to store your numerical items in big-endian form. Little endian (the default of many things) will often appear to be doing the right thing until you start working with more than 256 items (more than 1 byte), causing lexicographic ordering of the serialized bytes to diverge from the lexicographic ordering of their deserialized numerical form.

  • Rust integral types have built-in to_be_bytes and from_be_bytes methods.
  • bincode can be configured to store integral types in big-endian form.
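
A small sketch of that advice using u64 keys, runnable against a temporary database:

fn main() -> sled::Result<()> {
    let db = sled::Config::new().temporary(true).open()?;

    // store numeric keys big-endian so byte order matches numeric order
    for n in [1u64, 2, 10, 256, 300] {
        db.insert(n.to_be_bytes(), "some value")?;
    }

    // iteration now yields keys in numeric order: 1, 2, 10, 256, 300
    for kv in db.iter() {
        let (key, _value) = kv?;
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&key);
        println!("{}", u64::from_be_bytes(buf));
    }
    Ok(())
}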

interaction with async

If your dataset resides entirely in cache (achievable at startup by setting the cache to a large enough value and performing a full iteration) then all reads and writes are non-blocking and async-friendly, without needing to use Futures or an async runtime.

To asynchronously suspend your async task on the durability of writes, we support the flush_async method, which returns a Future that your async tasks can await if they require high durability guarantees and are willing to pay the latency costs of fsync. Note that sled automatically tries to sync all data to disk several times per second in the background without blocking user threads.
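
A small illustrative sketch of that pattern; the helper function below is an assumption for illustration, not part of sled's API:

// illustrative helper, not part of sled's API: write, then await durability
async fn write_durably(db: &sled::Db) -> sled::Result<()> {
    db.insert("important", "state")?;
    db.flush_async().await?; // resolves once the write has been fsync'd
    Ok(())
}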

We support async subscription to events that happen on key prefixes, because the Subscriber struct implements Future<Output=Option<Event>>:

let sled = sled::open("my_db").unwrap();

let mut sub = sled.watch_prefix("");

sled.insert(b"a", b"a").unwrap();

extreme::run(async move {
    while let Some(event) = (&mut sub).await {
        println!("got event {:?}", event);
    }
});

minimum supported Rust version (MSRV)

We support Rust 1.62 and up.

architecture

lock-free tree on a lock-free pagecache on a lock-free log. the pagecache scatters partial page fragments across the log, rather than rewriting entire pages at a time as B+ trees for spinning disks historically have. on page reads, we concurrently scatter-gather reads across the log to materialize the page from its fragments. check out the architectural outlook for a more detailed overview of where we're at and where we see things going!

philosophy

  1. don't make the user think. the interface should be obvious.
  2. don't surprise users with performance traps.
  3. don't wake up operators. bring reliability techniques from academia into real-world practice.
  4. don't use so much electricity. our data structures should play to modern hardware's strengths.

known issues, warnings

  • if reliability is your primary constraint, use SQLite. sled is beta.
  • if storage price performance is your primary constraint, use RocksDB. sled uses too much space sometimes.
  • if you have a multi-process workload that rarely writes, use LMDB. sled is architected for use with long-running, highly-concurrent workloads such as stateful services or higher-level databases.
  • quite young, should be considered unstable for the time being.
  • the on-disk format is going to change in ways that require manual migrations before the 1.0.0 release!

priorities

  1. A full rewrite of sled's storage subsystem is happening on a modular basis as part of the komora project, in particular the marble storage engine. This will dramatically lower both the disk space usage (space amplification) and garbage collection overhead (write amplification) of sled.
  2. The memory layout of tree nodes is being completely rewritten to reduce fragmentation and eliminate serialization costs.
  3. The merge operator feature will change into a trigger feature that resembles traditional database triggers, allowing state to be modified as part of the same atomic writebatch that triggered it, preserving serializability with reactive semantics.

fund feature development

Like what we're doing? Help us out via GitHub Sponsors!

sled's People

Contributors

asonix, bltavares, byron, dependabot-preview[bot], dependabot-support, dependabot[bot], divergentdave, dotcypress, hdevalence, jamesmunns, jeehoonkang, jibbow, jneem, kerollmops, lambda, lucab, mbr, mich2000, nrc, pantonov, pmuens, rleungx, roignpar, rubdos, siler, sitano, spacejam, spacekookie, terkwood, th3whit3wolf


sled's Issues

add child split fragment

this was skipped to simplify the in-memory prototype, but we need to use a frag so that we can linearize key updates with splits

tune Lru

calls to the Lru bookkeeping structures are inefficient. should use simpler data structures.

modularize core components

Right now some of the core components such as tree, log, etc. are larger files which include a couple of different structs, enums and their corresponding methods and functions.

Those components should be modularized so that a clear file hierarchy is present which makes it easier to understand and reason about the code, its different components and their relationships.

@spacejam already started the process for the Log component in https://github.com/spacejam/rsdb/pull/27.

zstd dictionary support

the use of a dictionary dramatically improves zstd compression and decompression performance.

Maybe `Frag: Send`?

I am trying to port rsdb to crossbeam-epoch: https://github.com/crossbeam-rs/crossbeam-epoch/

In doing so, I found out that Stack probably should require T: Send, because an item can be sent to elsewhere. In turn, I guess PageCache's P should be Send because of inner: Radix<Stack<CacheEntry<P>>>, and then Frag should be Send. What do you think?

compression of items written to log slots

we can't compress on the log-buffer level without destroying the property of a LogID being its physical offset, so we need to compress before writing here. Desired properties:

  • works well with small amounts of human-language text
  • configurable

pagetable snapshotting

on recovery, we should be able to recover a snapshot of the pagetable, then read the log after the pagetable was created for full reconstruction. we only want to read and recover the structure of pages and the offsets of chunks of frags in the log for future retrieval, without pulling in any kv data.

initially this can be done by forking and snapshotting, similar to redis

multi-fragment flush strategy

to improve data locality and reduce written data, write adjacent chunks of frags in a single stack at once to the log

recover persisted flushes on start-up

naive recovery, no handling of LRU cache-related stuff yet. no page table snapshotting. read it all into memory for reconstruction.

  • implement iterator on Log trait
  • implement recovery method on PageCache
  • write relevant page metadata to flushes to support reconstruction
  • test that a log's stabilized updates are visible after shutting down and restarting

dirty page flushing

  • periodically flush the dirty buffer
  • make the LogID in CacheEntry non-optional

in the future we may want to buffer things before claiming a log slot for updates, but for now let's just keep it simple

create Epoch structure

should have similar basic structure as Arc:

  1. shared header, represented as AtomicUsize
  2. enroll on clone
  3. un-enroll + re-enroll + merge GC + un-enroll on drop
  4. associated lock-free stack of GC enums for different types of stuff
  5. associated thread-local GC state that is merged after initial un-enroll

test torn pages on log

  1. write garbage to log after valid entries
  2. ensure read on valid entries succeeds
  3. ensure read on garbage fails
  4. ensure that during recovery, we detect the torn page and allow new entries to reuse the garbage space
  5. ensure that the new entries, and old valid entries are valid before shutting down
  6. recover again, ensure all valid entries are readable

persistent state definition

define structures around persisted fragments and base nodes, and a way of easily generating a block of serialized state from a Frags struct.

array-stack

reconstructing pages in-memory involves chasing linked-list pointers around

implement a linked-array structure that minimizes cache thrashing during page reconstruction
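
A hedged, single-threaded sketch of the kind of linked-array structure described here (the real one would need to be lock-free); the names and chunk size are invented:

const NODE_CAP: usize = 16; // elements per chunk; the real value would be tuned

// hypothetical sketch: each chunk holds several elements in one allocation,
// so traversal chases one pointer per NODE_CAP elements instead of one per element
struct Node<T> {
    elems: Vec<T>,              // up to NODE_CAP elements, contiguous in memory
    next: Option<Box<Node<T>>>, // pointer to the next (older) chunk
}

pub struct ArrayStack<T> {
    head: Option<Box<Node<T>>>,
}

impl<T> ArrayStack<T> {
    pub fn new() -> Self {
        ArrayStack { head: None }
    }

    pub fn push(&mut self, item: T) {
        let head_has_room = self
            .head
            .as_ref()
            .map_or(false, |n| n.elems.len() < NODE_CAP);
        if head_has_room {
            self.head.as_mut().unwrap().elems.push(item);
        } else {
            // current chunk is full (or absent): start a new one
            let mut elems = Vec::with_capacity(NODE_CAP);
            elems.push(item);
            let next = self.head.take();
            self.head = Some(Box::new(Node { elems, next }));
        }
    }

    pub fn pop(&mut self) -> Option<T> {
        let node = self.head.as_mut()?;
        let item = node.elems.pop();
        if node.elems.is_empty() {
            // chunk exhausted: unlink it and fall back to the next one
            self.head = self.head.take().and_then(|n| n.next);
        }
        item
    }
}

Each chunk amortizes one pointer chase over NODE_CAP elements, which is the cache-friendliness this issue is after.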

read in arguments for benchmark tool

the benchmark tool should accept arguments for these parameters:

  1. number of threads
  2. number of total operations
  3. proportion of reads, writes, cas, del, and scan operations among the operations
  4. freshness bias (prefer recent/likely to be in cache, prefer old/not likely to be in cache, no preference)
  5. non-present-key chance (the chance that a request for cas/get/set/del may be sent for a key that does not exist)
  6. key size min, max, median
  7. value size min, max, median
  8. scan iterations min, max, median

collect statistics on each operation in the benchmark tool

when executing operations, we want to know about:

  1. throughput (requests / s)
  2. latency (time per request)

For latency, we care about the latency distribution, which means we need to collect percentiles. This is actually a pretty interesting problem, since almost all existing solutions REALLY suck and use probabilistic approaches that destroy accuracy. One approach that I like is:

  • create a Vec of AtomicUsize that is pretty big.
  • create a function that maps from a measured value to a "slot number"
  • each slot number corresponds to values that are ~1% larger than the ones in the previous one
  • when measuring something, like latency, convert the latency to the slot number using the function, then look up the AtomicUsize and fetch_add 1 to it.
  • when done with all measurements, you can scan over the counts to determine any percentile. it's good to know, at a minimum, 0th (fastest), 50th (median), 75th, 90th, 95th, 97.5th, 99th, 99.9th, 99.99th, 99.999th, and 100th (max).

here's an example of a function that converts from a number to a slot number, and back: https://github.com/spacejam/loghisto/blob/master/metrics.go#L316
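
A hedged Rust sketch of the same idea, mapping values into ~1%-wide buckets with a base-1.01 logarithm; the names and sizes are invented:

use std::sync::atomic::{AtomicUsize, Ordering};

/// Each slot i covers values around 1.01^i, so adjacent slots differ by ~1%.
pub struct Histogram {
    counts: Vec<AtomicUsize>,
}

impl Histogram {
    pub fn new(slots: usize) -> Self {
        Histogram {
            counts: (0..slots).map(|_| AtomicUsize::new(0)).collect(),
        }
    }

    fn slot(value: f64) -> usize {
        // ln(value) / ln(1.01) rounds a value to its ~1%-wide bucket
        (value.max(1.0).ln() / 1.01f64.ln()).round() as usize
    }

    fn value(slot: usize) -> f64 {
        // inverse of `slot`: the representative value for a bucket
        1.01f64.powi(slot as i32)
    }

    pub fn measure(&self, value: f64) {
        let s = Self::slot(value).min(self.counts.len() - 1);
        self.counts[s].fetch_add(1, Ordering::Relaxed);
    }

    pub fn percentile(&self, p: f64) -> f64 {
        let total: usize = self.counts.iter().map(|c| c.load(Ordering::Relaxed)).sum();
        let target = (p / 100.0 * total as f64).ceil() as usize;
        let mut seen = 0;
        for (i, c) in self.counts.iter().enumerate() {
            seen += c.load(Ordering::Relaxed);
            if seen >= target {
                return Self::value(i);
            }
        }
        0.0
    }
}

Scanning the counts from the bottom recovers any percentile without storing individual samples.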

For throughput, just atomically increment a counter for each of the ops, and have a thread spit out the values every second, resetting it to 0. The latency percentiles can be for the duration of the entire benchmark, for now.

improve lock-free radix pointer ratio

the current radix implementation is super naive, and all nodes contain 64 (1 << 6) atomic pointers to possibly nonexistent children. because this is a dense keyspace, the first few layers will be full, with the bottom leaves (of which there are 64x more than the above layer...) possibly being quite sparse. we want to keep the overall structure lock-free, since this structure is in the hot-path for every operation. look into ways of using less memory for sparse nodes, and growing/shrinking nodes with a CAS to keep sparsity in check.

implement mvp benchmark calls to db

take the arguments, and implement these operations (a sketch follows the list):

  1. spin up the specified number of threads
  2. execute the desired number of total operations
  3. roughly maintain the specified proportion of reads, writes, cas, del, and scan operations among the operations
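
A hedged sketch of such a loop written against today's public sled API rather than the internal interfaces this issue originally targeted; the thread count, total operation count, and the reduced 80/15/5 read/write/delete mix are placeholder parameters:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// hypothetical parameters; a real tool would read these from CLI arguments
const THREADS: usize = 4;
const TOTAL_OPS: u64 = 100_000;

fn main() -> sled::Result<()> {
    let db = Arc::new(sled::Config::new().temporary(true).open()?);
    let issued = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            let db = Arc::clone(&db);
            let issued = Arc::clone(&issued);
            thread::spawn(move || loop {
                let i = issued.fetch_add(1, Ordering::Relaxed);
                if i >= TOTAL_OPS {
                    break;
                }
                let key = (i % 1_000).to_be_bytes();
                // keep a rough 80/15/5 read/write/delete mix by cycling i mod 100
                match i % 100 {
                    0..=79 => {
                        db.get(key).unwrap();
                    }
                    80..=94 => {
                        db.insert(key, "value").unwrap();
                    }
                    _ => {
                        db.remove(key).unwrap();
                    }
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    Ok(())
}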

tune materializer calls

calls to the page materializer take a lot of time, try to cache results from it so we don't spend as much cpu calling it on the same hot pages over and over.

epoch-based GC

basic idea:
https://github.com/spacejam/tla-rust#lock-free-epoch-based-gc

it may be sufficient to use ping-pong epochs for this to simplify some logic. we keep a single word "header" that is CAS'd by threads to coordinate:

1 bit: left/right active | 1 bit: left state | 7 bits: left active thread count | 1 bit: right state | 7 bits: right active thread count

(a bit-packing sketch of this header follows the numbered list below)

  1. there is only one active epoch at a time, specified by the active bit
  2. each epoch is either in the open or sealed state
  3. when an operation begins on shared state (the tree), a thread first looks at the shared coordination header, and sees which buffer is active
  4. if the active buffer is also in the open state, the thread tries to increment the associated active thread count by 1 using a CAS operation
  5. if the active buffer is in the sealed state, we must wait for the thread that was the last one to finish after sealing it to finish the GC operation and transition it into the active state, so we spin until it is available
  6. the thread then operates on the shared state
  7. if the thread makes any shared state unreachable, it pushes it to a thread-local buffer that will be added to an epoch after un-enrolling. The state remains readable until the end of the epoch, when it is guaranteed that all threads have moved on to a later epoch, and cannot have viewed it anyway.
  8. at some point, a condition is reached that causes the current epoch to be sealed, whether a periodic time since the last seal or the amount of garbage in the epoch
  9. the thread that detects this condition tries to simultaneously set the seal bit on the epoch and flip the active bit using a single CAS operation, causing enrolling and GC'ing threads to use the other epoch
  10. when the operation on the shared state is complete, the thread un-enrolls in its epoch by decrementing the shared header by 1
  11. if its starting epoch is sealed, and the decrement would cause the epoch active thread count to drop to 0, then this thread is responsible for triggering (not necessarily performing) a GC operation on each item in the GC stack associated with the epoch. note that the thread still has not done anything with its thread-local GC state, it is just cleaning up the stuff from other threads.
  12. after the epoch's GC state has been handed off to another thread for cleaning up, or has been completely cleaned up in this thread, the designated clean-up thread then marks the epoch as open again. it can combine this with the CAS that decrements the active thread count to 0 to cut down on memory barriers.
  13. after un-enrolling in the epoch, the thread then enrolls again, and adds its local GC state to that epoch's GC stack
  14. when un-enrolling in the "write" epoch, the same logic applies for handling decrements on sealed epochs as before, and the thread must handle the actual clean-up when responsible
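
A hedged sketch of how the 17-bit coordination header described above might be packed into a single atomic word, along with an enroll attempt corresponding to steps 3-5; the bit positions, names, and helper are invented for illustration:

use std::sync::atomic::{AtomicU32, Ordering};

const COUNT_MASK: u32 = 0b111_1111; // 7-bit active thread count

// one possible layout, least significant bit first:
//   bits 0-6  right active thread count
//   bit  7    right state (0 = open, 1 = sealed)
//   bits 8-14 left active thread count
//   bit  15   left state (0 = open, 1 = sealed)
//   bit  16   which side new arrivals enroll in (0 = left, 1 = right)
#[derive(Clone, Copy)]
struct Header {
    right_count: u32,
    right_sealed: bool,
    left_count: u32,
    left_sealed: bool,
    right_active: bool,
}

impl Header {
    fn pack(self) -> u32 {
        (self.right_count & COUNT_MASK)
            | (self.right_sealed as u32) << 7
            | (self.left_count & COUNT_MASK) << 8
            | (self.left_sealed as u32) << 15
            | (self.right_active as u32) << 16
    }

    fn unpack(word: u32) -> Header {
        Header {
            right_count: word & COUNT_MASK,
            right_sealed: (word >> 7) & 1 == 1,
            left_count: (word >> 8) & COUNT_MASK,
            left_sealed: (word >> 15) & 1 == 1,
            right_active: (word >> 16) & 1 == 1,
        }
    }
}

// enrolling (steps 3-5): CAS that bumps the active side's thread count if it is open
fn try_enroll(header: &AtomicU32) -> bool {
    let current = header.load(Ordering::Acquire);
    let mut h = Header::unpack(current);
    let (count, sealed) = if h.right_active {
        (&mut h.right_count, h.right_sealed)
    } else {
        (&mut h.left_count, h.left_sealed)
    };
    if sealed {
        return false; // caller spins and retries once the epoch reopens
    }
    *count += 1;
    header
        .compare_exchange(current, h.pack(), Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}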

frag location metadata in pagetable

the pagetable should know how to retrieve frags from the log to satisfy non-resident reads

  • on recovery, load page locations in as a stack with a flush pointing to the disk location
  • on page-in, CAS the stack to the actual one

add logging to page operations

during all page operations:

  1. create consolidated page / block of updates
  2. reserve space for it in log
  3. CAS
  4. either abort or commit the log reservation

handle tree root hoists during recovery

currently the pagecache recovers itself, but does not communicate to the tree on top about updates it encounters to the tree root, since it doesn't know what a tree root is. We should support an interface that lets trees inject custom logic into the recovery (and eventually the snapshotting) process.
