dicl / fast_fair Goto Github PK

View Code? Open in Web Editor NEW

114.0 15.0 36.0 108 KB

Endurable Transient Inconsistency in Byte-Addressable Persistent B+-Tree

License: GNU General Public License v3.0

Makefile 0.85% C++ 99.15%

fast_fair's Introduction

FAST_FAIR

Implementation of the paper, "Endurable Transient Inconsistency in Byte-Addressable Persistent B+-Tree".

The paper is to appear in FAST 2018.

Failure-Atomic ShifT(FAST) and Failure-Atomic In-place Rebalancing(FAIR) are simple and novel algorithms that make B+-Tree tolerant againt system failures without expensive COW or logging for Non-Volatile Memory(NVM). A B+-Tree with FAST and FAIR can achieve high performance comparing to the-state-of-the-art data structures for NVM. Because the B+-Tree supports the sorted order of keys like a legacy B+-Tree, it is also beneficial for range queries.

In addition, a read query can detect a transient inconsistency in a B+-Tree node during a write query is modifying it. It allows read threads to search keys without any lock. That is, the B+-Tree with FAST and FAIR increases throughputs of multi-threaded application with lock-free search.

We strongly recommend to refer to the paper for the details.

Directories
- single - a single thread version without lock
- concurrent - a multi-threaded version with std::mutex in C++11
How to run (single)

git clone https://github.com/DICL/FAST_FAIR.git
cd FAST_FAIR/single
make
./btree -n [the # of data] -w [write latency of NVM] -i [path] (e.g. ./btree -n 10000 -w 300 -i ~/input.txt)

How to run (concurrent)

git clone https://github.com/DICL/FAST_FAIR.git
cd FAST_FAIR/concurrent
make
There are two versions of concurrent test programs - One is only search and only insertion, the other is a mixed workload.
1. ./btree_concurrent -n [the # of data] -w [write latency of NVM] -i [input path] -t [the # of threads] (e.g. ./btree -n 10000 -w 300 -i ~/input.txt -t 16)
2. ./btree_concurrent_mixed -n [the # of data] -w [write latency of NVM] -i [input path] -t [the # of threads] (e.g. ./btree -n 10000 -w 300 -i ~/input.txt -t 16)

fast_fair's People

Contributors

Stargazers

Watchers

fast_fair's Issues

Implementation issue of lazy recovery mechanism

Current implementation for insertion operation cannot correctly handle crash-inconsistent data after crash occurs in the middle of node split. Below algorithm shows the current insertion implementation provided by this repository.

Traversing down tree with lock-free search until reaching leaf node.
i) If it is found that the next node's smallest key is smaller than the requested key during traversing, thread moves to right sibling node, and then continues to traverse tree.
Acquiring the lock of the found leaf node and inserting new key
i) Before inserting new key, check again the next leaf node's smallest key. If the smallest key in the next node is larger than the new key, the thread unlocks current locked leaf node, and then continues to do insertion from the next leaf node.

The key problem is that this insertion algorithm does not involve any recovery mechanism even if writer thread finds that the smallest key of next sibling node is larger than new key. If we use this algorithm as-is, the following problem can occur after system crash occurs in the middle of node split.

1. Initial tree has only one leaf node (1, 5, 9, 15) where the maximum number of entries is four.
2. Writer thread tries to split first to insert new key 20.
3. (1, 5, 9, 15) is split to (1, 5) --> (9, 15)
4. **System crash** occurs before allocating new root and inserting middle key into it.

**System resume**

5. Following writers insert new keys 20, 30, and 40. When inserting 40, node split is conducted, and then allocating new root node.
6. After previous insertions, the final tree structure is formed like below.
                    (20)
                  /      \
   (1, 5)-->(9, 15)-->(20, 30, 40)
7. Reader thread tries to search key 5, but it will return failure even if key 5 is valid in this tree structure.

The lazy recovery mechanism, originally proposed in FAST'18 paper, performs recovery whenever a thread finds the smallest key in the right side node is bigger than the requested key. It seems the current implementation choice is because of performance. However, it would be expected that performance become somewhat different if the implementation is changed to correctly address lazy recovery mechanism.

The search function of this btree can not find the value

According to lines the btree_search() function member:

  page *t;
  while((t = (page *)p->linear_search(key)) == p->hdr.sibling_ptr) {
    p = t;
    if(!p) {
      break;
    }
  }
  if(!t || (char *)t != (char *)key) {
    printf("NOT FOUND %lu, t = %x\n", key, t);
    return false;
  }

linear_search() searchs a key in the leafnode and returns the corresponding value. Then the tree_search function compares the key with the found value in the leafnode, if doesn't match, takes it as not found and output a message. It is NOT right if the key and its corresponding value are not equivalent.

It has a few extra bugs with some special workloads: fastfair can not support zero value; it may run into fault when two consecutive Records has the same value;

[BUG] range scan in multi-threads environment

Problem)

Current implementation of range scan can read duplicated keys repeatedly in multi-threads environment and can return value buffer, unsorted by keys. The node scan mechanism of FAST_FAIR proposed in FAST'18 paper changes node scan direction only when insertion and deletion operations are switched each other during scanning. It is understood that this mechanism can correctly address exact matching query, but not for range scan. Please refer to below example.

Example)

"v" is scanner's pointer

                              Insertion 5, 6
Switch counter = 2            Switch counter = 2                Switch counter = 2
Scan buffer (10)              Scan buffer (10, 6)               Scan buffer (10, 6, 10)
  v                             v                                    v
(10, 20, 30, 40, 50) -----> (5, 6, 10, 20, 30, 40, 50) -----> (5, 6, 10, 20, 30, 40, 50)

Solution)

In current implementation, switch counter is only changed with odd or even number depending on whether insertion and deletion operations are switched or not. Our proposed solution is to constantly increase switch counter whenever insertion and deletion operations perform. If applying this solution, range scanner thread can realize that concurrent insertion or deletion is working in the same node, and then restart scanning with rolling back the increased buffer's index number.

Error PMDK "Segmentation fault (core dumped)" when reopen the pool has already data!!!

Hi All,
For PMDK environment when I first run the pool initialization and data generation everything is ok. When you turn off the program and run again, read data from the pool that already has data, it will be dumped by the system with the error "Segmentation fault (core dumped)".
Thank you very much.

Node destruction

Can you please point, where do you make old node destruction, if split or merge occurs?
Want to use your source.

Missing flushes

Hi @deukyeon ,

Recently, we were working on a test framework to identify bugs in frameworks that use persistent memory. We used our tool to check FAST_FAIR benchmark and we were able to find some bugs in them. More descriptions about these bugs and the corresponding fixes are listed below. We fixed these bugs in our local repository, but I thought maybe you are interested in fixing them in your repository as well. Thank you for sharing your work with the community. It helped us a lot to develop and test our tool. Feel free to reach out and ask us any questions about the fixes.

Missing flush in btree constructor(btree.h:824 in constructor of btree() ):
clflush((char*)root, sizeof(page));
Missing flush in btree constructor (btree.h:825 in the constructor of btree() ):
clflush((char*)this, sizeof(btree), false, true);

The consistency issue of the inner node search

Hi all,

Thank you for sharing your work with the community. Recently, we find that the current implementation of lock-free inner node search may reach the incorrect child pointer due to the read-write conflict.

Consider the example below, concurrent insert and search execute in the following order. The search operation may ignore the key of 10 in (1) because its left and right pointers are the same due to an on-going shifting. Then, it might load the next key of 10 after the insert operation has been finished and return ptr in the end. However, p1 is what the search operation expects to return. The incorrect result is caused by the state change between two load operations.

    Search(3):				         (1) skip 10                (2) return ptr
    						    (p1=p1)                     (3<10)
                                                       v                          v
Insert(5,ptr): (p1,10,p2) --> (p1,10,p2,10,p2) --> (p1,10,p1,10,p2) --> (p1,5,ptr,10,p2)

A possible solution is to maintain a "low_key" field in each inner node. Do you have any better solution towards this problem.

Is WOART open source?

The implement of B+Tree help us get a better understanding of the paper "Endurable Transient Inconsistency in Byte-Addressable Persistent B+-Tree". And thank your lab for your contribution to NVM research. Does your lab have any plan to open source other implements of your papers, such as "WOART ". Thank you!

WORT & FPTree Open Source

Hello, I would like to ask if you can open source WORT and FPTree source code, if you can, thank you very much~~~~~~QAQ

[BUG] FAIR algorithm

We found FAIR algorithm proposed in FAST'18 has a problem in handling node splits to be concurrent and crash-recoverable. Original FAIR algorithm can make an incorrect tree structure due to internal node split and key deletion. Please refer to our pull request for more detail ( #4 ).