fjall-rs / fjall Goto Github PK

View Code? Open in Web Editor NEW

132.0 3.0 6.0 2.32 MB

LSM-based embedded key-value storage engine written in safe Rust

License: Apache License 2.0

Rust 99.22% JavaScript 0.78%

database lsm lsm-tree lsmt rust rust-lang storage-engine mit-license embedded-database embedded-kv

fjall's People

Contributors

Stargazers

Watchers

Forkers

shabbirhasan1 jonefeewang rs0d crystalanalyst semihbkgr zzzint

fjall's Issues

Data partitioning/Column families

Column families/partitions are the missing piece to having a great storage engine. Each column family is its own partition inside the LSM, but all share the same journal, enabling atomic writes. So each column family is physically stored separately from the others. Compaction then looks at each partition instead, never mixing tables of different partitions. It's like a big meta-LSM tree, instead of using multiple trees, which don't have atomic semantics, unless putting another super structure on top. That super structure will be the new LSM tree which contains partitions basically. Compaction may even be set differently for each column family.

When creating no column family, everything gets put into the "default" column family.
Creating column families is pretty simple, you add a new memtable. Dropping a column family is simple. Delete its memtable and all the segments inside the partition. The journal needs to be handled accordingly, because flushing one column family doesn't necessarily mean the others are flushed too. And if the column family was deleted, the journal should not flush parts to the partition at all.

There needs to be new semantics for writing to a column family:

insert(col_family, key, value)
remove(col_family, key)
batch() ... needs col_family support

https://github.com/facebook/rocksdb/wiki/Column-Families

Example use cases

One partition could be an index in a relational database for example, so writing a row into "default" also atomically inserts/updates rows inside "index-1" (https://github.com/facebook/mysql-5.6/wiki/MyRocks-data-dictionary-format)
Dropping the index just drops the column family "index-1", which is a O(1) operation
another example would be locality groups in Bigtable and column families in RocksDB and Cassandra

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

This repository currently has no open or pending branches.

Detected dependencies

cargo

Cargo.toml

byteorder 1.5.0

crc32fast 1.4.0

lsm-tree 0.6.3

log 0.4.21

std-semaphore 0.1.0

tempfile 3.10.1

fs_extra 1.3.0

path-absolutize 3.1.1

criterion 0.5.1

nanoid 0.4.0

test-log 0.2.15

rand 0.8.5

github-actions

.github/workflows/release.yml

actions/checkout v4

katyo/publish-crates v2

.github/workflows/test.yml

actions/checkout v4

Swatinem/rust-cache v2

actions/checkout v4

Check this box to trigger a request for Renovate to run again on this repository

Transactions

We should have all the pieces to get transactions working:

Snapshots achieved by MVCC
Atomicity achieved by Batch

WiscKey-like key-value separation

fjall-rs/lsm-tree#1

Optional TTL for FIFO strategy

Look at each segment's metadata
Compare with TTL setting
If now > (segment.created_at + ttl_seconds), drop that segment (use DropSegments)
Then check the max L0 size setting, and use that to maybe drop segments as well

Allow more characters in partition name.

motivation

I am seeking a lsm-tree based database as my in-memory database backend. Firstly, I found crate lsm_tree, but it doesn't contain auto flushing and partition, the only thing i can do is to store different table in one folder, that's fine. Then I find this crate, it exactly is the wheel that I am looking for. After a few minutes migration, a big problem occurs.

the problem

I was using a reflection of a type as the name of the lsm_tree file to store different types in different trees. Let's say, the dictionary-like generic api for my storage is Dense<K, V, D>, D is for delta. When lib user want to store a type like <u64,u64,()>, the name after reflection of this type is something like Dense<u64, u64, ()>
As you can see, the name contains character <>, and blank space, which cannot used as the name of partition.

solution

remove the limit for partition name
or, use a AsRef<[u8]> as the partition name api.

Improve & validate write buffer size handling

Bloom Filters

Need a bloom (or XOR or ribbon or cuckoo) filter, that:

can be restored from disk/raw bytes

This is obvious. On recovery, load all bloom filters back into memory from disk. So the bloom filter needs to be able to give us its internal bytes and needs a constructor to recreate it from raw bytes.

can be resized while writing to it

RocksDB used to store a bloom filter per data block, so filter construction is simple. However, its read path is much worse because you need to travel through the SST file. The new full filter format just stores one big bloom filter per SST file. That requires much more memory when flushing and compacting because all keys or hashes need to be buffered until the SST is written because the number of items is unknown until everything is written. Then the bloom filter needs to be built from the buffer.

Using a scalable filter may solve this memory issue.

Also...

M O N K E Y may maximize efficiency for a given amount of memory: http://daslab.seas.harvard.edu/monkey/.

let folder = "";

// A tree is a single physical keyspace/index/...
// and supports a BTreeMap-like API
// but all data is persisted to disk.
let tree = Config::new(folder).open().unwrap();

// Note compared to the BTreeMap API, operations return a Result<T>
// So you can handle I/O errors if they occur
tree.insert("hello-key-999991", "hello-value-999991")
    .unwrap();

let item = tree.get("hello-key-99999").unwrap();
match item {
    Some(value) => {
        println!("value: {}", std::str::from_utf8(&value).unwrap());
    }
    None => println!("Not Found"),
}

// Flush to definitely make sure data is persisted
tree.flush().unwrap();

Expect Not Found, but printed value: hello-value-999991

delete_range/delete_prefix

Implement fjall-rs/lsm-tree#2
Implement in journal
Implement in partition
Implement in transaction