
RubbleDB: CPU-Efficient Replication with NVMe-oF

Abstract

Due to the need to perform expensive background compaction operations, the CPU is often a performance bottleneck of persistent key-value stores. In the case of replicated storage systems, which contain multiple identical copies of the data, we make the observation that CPU can be traded off for spare network bandwidth. Compactions can be executed only once, on one of the nodes, and the already-compacted data can be shipped to the other nodes’ disks, saving them significant CPU time. In order to further drive down total CPU consumption, the file replication protocol can leverage NVMe-oF, a networked storage protocol that can offload the network and storage datapaths entirely to the NIC, requiring zero involvement from the target node’s CPU. However, since NVMe-oF is a one-sided protocol, if used naively, it can easily cause data corruption or data loss at the target nodes.

We design RubbleDB, the first key-value store that takes advantage of NVMe-oF for efficient replication. RubbleDB introduces several novel design mechanisms that address the challenges of using NVMe-oF for replicated data, including pre-allocation of static files, a novel file metadata mapping mechanism, and a new method that enforces the order of applying version edits across replicas. These ideas can be applied to other settings beyond key-value stores, such as distributed file and backup systems. We implement RubbleDB on top of RocksDB and show it provides consistent CPU savings and increases throughput by up to 1.9× and reduces tail latency by up to 93.4% for write-heavy workloads, compared to replicated key-value stores, such as ZippyDB, which conduct compactions on all replica nodes.

Setup

0. Overview

We have prepared a script, rocksdb/rubble/setup.sh, that automatically sets everything up on a CloudLab experiment. To use this script, please create an experiment with r650 or r6525 nodes as workers. The number of worker nodes equals the replication factor (RF); for example, to run experiments with RF=3, you need three r650 or r6525 nodes. In addition, the experiment needs one client node, which can be of any type. For simplicity, the example below uses the same type, r650, for both the worker and client nodes and assumes RF=3.

1. Create a CloudLab experiment

Initiate a CloudLab experiment from here. The specific experiment configuration is:

  • Profile: small-lan
  • Number of Nodes: 4
  • Select OS image: UBUNTU 20.04
  • Optional physical node type: r650

Once the experiment is ready, you will see a status page like the one below, showing the ID and hostname of each node. For example, node0's hostname is clnode257.clemson.cloudlab.us.

[Figure: example experiment status page]

2. Set up the environment remotely

Run the setup.sh script from your laptop; it handles all the setup work, such as NVMe-oF configuration, dependency installation, and code compilation:

bash setup.sh [your_cloudlab_username] [is_mlnx] [shard_num] [hostnames]
  • your_cloudlab_username: your CloudLab user name
  • is_mlnx: 0 or 1 to indicate whether the NIC is Mellanox. Both r650 and r6525 use Mellanox NICs, so use 1 here.
  • shard_num: the number of data partitions in RubbleDB. To spread the workload evenly across all nodes, use a multiple of RF, e.g., 3, 6, or 9.
  • hostnames: the hostname of each node shown in the experiment status, e.g., clnode257.clemson.cloudlab.us.

As an example, the complete command for a three-shard experiment is:

bash setup.sh haoyuli 1 3 clnode257.clemson.cloudlab.us clnode258.clemson.cloudlab.us clnode271.clemson.cloudlab.us clnode256.clemson.cloudlab.us

This script takes about 30 minutes to finish. If everything goes smoothly, you will see the message Done! printed at the end.
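
Optionally, before running any workload, you can sanity-check the NVMe-oF setup. The commands below are a minimal sketch: they assume the setup script has already attached the remote namespaces and that the standard nvme-cli tool is available on the worker nodes.

# On a worker node, list NVMe devices; namespaces attached over NVMe-oF
# should show up alongside the local drive.
sudo nvme list
# Alternatively, check for the extra block devices directly.
lsblk | grep nvme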

Run

If you followed the above steps, node0 is the client node, from which we initiate the evaluation. ssh into node0:

ssh [your_cloudlab_username]@clnode257.clemson.cloudlab.us
sudo su
cd /root/YCSB/
bash eval.sh [phase] [workload] [qps] [log_suffix] [client_num] [mode] [shard_num] [replication_factor]
  • phase: load or run, corresponding to the load or run phase in YCSB
  • workload: a-g, corresponding to YCSB workload a-g
  • qps: requests per second
  • log_suffix: the log will be saved as ycsb-[log_suffix].log
  • client_num: the number of parallel YCSB client threads
  • mode: rubble or baseline (in baseline mode, every replica performs its own compactions)
  • shard_num: same as the previous definition
  • replication_factor: same as above

To run the load phase followed by workload A:

bash eval.sh load a 100000 rubble-offload-load 4 rubble 3 3
bash eval.sh run a 80000 rubble-offload-a 4 rubble 3 3
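
To compare against the baseline, which performs compactions on every replica, repeat the same workload with mode set to baseline. This is only a sketch: the log_suffix values are arbitrary labels, and each run writes its output to ycsb-[log_suffix].log in the working directory, as described above.

# Baseline run: every replica compacts its own data.
bash eval.sh load a 100000 baseline-load 4 baseline 3 3
bash eval.sh run a 80000 baseline-a 4 baseline 3 3

# Inspect the resulting logs.
ls ycsb-*.log
tail -n 20 ycsb-baseline-a.log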

Citation

@inproceedings{li2023rubbledb,
  title={$\{$RubbleDB$\}$:$\{$CPU-Efficient$\}$ Replication with $\{$NVMe-oF$\}$},
  author={Li, Haoyu and Jiang, Sheng and Chen, Chen and Raina, Ashwini and Zhu, Xingyu and Luo, Changxu and Cidon, Asaf},
  booktitle={2023 USENIX Annual Technical Conference (USENIX ATC 23)},
  pages={689--703},
  year={2023}
}

rubbledb's Issues

How to initialize and run?

Could you please guide me on how to initialize and run RubbleDB? The documentation in the codebase contains conflicting and confusing information. For instance, the setup for the parameters of the eval.sh script at this location contradicts the description at this location. Additionally, there are no instructions on how to initialize the NVMe-oF connection between nodes in the documentation. I tried to use the setup-rubble.sh script for initialization, but the file paths set in the script conflict with those mentioned at this location, preventing correct execution.

I also attempted the method described in the documentation at this location, but it still didn't work.

Could you please advise on the sequence in which I should read the markdown documents in the codebase? Alternatively, what order should I follow, and which executable files or scripts and parameters should I use to initialize the nodes and run a complete YCSB test?

My execution environment consists of four r6525 nodes in CloudLab. I am using node0 as the client and node1 through node3 as servers, as described in the documentation at this location. Is this setup appropriate?
