Giter Site home page Giter Site logo

surf's Introduction

Succinct Range Filter (SuRF)

Build Status Coverage Status

SuRF is a fast and compact filter that provides exact-match filtering, range filtering, and approximate range counts. This is the source code for our SIGMOD best paper. We also host a demo website. The RocksDB experiments with SuRF can be found here.

Install Dependencies

sudo apt-get install build-essential cmake libgtest.dev
cd /usr/src/gtest
sudo cmake CMakeLists.txt
sudo make
sudo cp *.a /usr/lib

Build

git submodule init
git submodule update
mkdir build
cd build
cmake ..
make -j

Simple Example

A simple example can be found here. To run the example:

g++ -mpopcnt -std=c++11 simple_example.cpp
./a.out

Note that the key list passed to the SuRF constructor must be SORTED.

Run Unit Tests

make test

Benchmark

Step 1: Download YCSB

cd bench/workload_gen
bash ycsb_download.sh

Step 2: Generate Workloads

cd bench/workload_gen
bash gen_workload.sh

You must provide your own email list to generate email-key workloads.

Step 3: Run Workloads

cd bench
bash run.sh

Note that run.sh only includes several representative runs. Refer to bench/workload.cpp, bench/workload_multi_thread.cpp and bench/workload_arf.cpp for more experiment configurations.

License

Copyright 2018, Carnegie Mellon University

Licensed under the Apache License.

surf's People

Contributors

huanchenz avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

surf's Issues

Failed to lookupRange via deserialized SuRF

I am trying to wrap SuRF lib with JNI. But I found a bug when testing serialize & deserialize function.
Then I modified the simple_example.cpp and the bug still happend.
Test code:

#include <iostream>
#include <vector>

#include "include/surf.hpp"

using namespace surf;

int main() {
    std::vector<std::string> keys = {
	"f",
	"far",
	"fast",
	"s",
	"top",
	"toy",
	"trie",
    };

    // basic surf
    SuRF* surf = new SuRF(keys);

    char* data = surf->serialize();
    surf = SuRF::deSerialize(data);
    std::cout << surf->lookupKey("f");
    std::cout << surf->lookupKey("far");
    std::cout << surf->lookupKey("fast");
    std::cout << surf->lookupKey("s");
    std::cout << surf->lookupKey("top");
    std::cout << surf->lookupRange("a", true, "a", true);
    std::cout << surf->lookupRange("fa", true, "fast", true);
    std::cout << surf->lookupRange("st", true, "top", true);
    std::cout << surf->lookupRange("fat", true, "trie", true);
    std::cout << surf->lookupRange("st", true, "z", true);
    return 0;
}

ScreenShot: (both happend in Windows and WSL)
image

Got a "double free or corruption" when destroy a deserialized SuRF in BitvectorRank

Test Code:

#include
#include
#include "include/surf.hpp"
using namespace surf;
int main() {
std::vectorstd::string keys = {
"f",
"far",
"fast",
"s",
"top",
"toy",
"trie",
};
// basic surf
SuRF* surf = new SuRF(keys);
surf = SuRF::deSerialize(surf->serialize());
std::cout << "destory surf" << std::endl;
surf->destroy();
return 0;
}

Got a core dumped
image

Maybe the issue happens when destroy BitVectorRank:

void destroy()
{
delete[] bits_;
delete[] rank_lut_;
}

Non-0 false negative rate!

I've observed multiple scenarios where SuRF has a non-0 false negative rate. I'll submit the dataset, queries, and setup for when this happens shortly.

How can the lookUpRange function be parallel?

Hi, I come here again...

Here, the most important feature range query :

bool SuRF::lookupRange(const std::string& left_key, const bool left_inclusive,

And in the bench, there is a multi-thread work load test

positives += (int)thread_arg->filter->lookupRange(txn_keys[i],

This lookUpRange function uses a private class variable iter_. It first calls iter_.clear(), which clear some variables, but do not new them. When multi-threads call this function, they will all modify this variable at the same time which could nullify the programming logic.
Perhaps it should new a iter every time when there is a call?

Thanks!

Array index out of bound?

Hi, I am trying to rewrite a java version of SuRF. However, I encountered several problems which all involved with index out of bound exception.

In bitvector.hpp
We initially define bits_=[numWords()];

However, here

test_bits = bits_[word_id];

and here,
bits_[word_id] |= (last_word << (kWordSize - bit_shift));

we all have a chance to visit bits_=[numWords()], which is invalid. C++ actually does not raise an error or warning for this.
Although this do not cause real problems like crashing the program (not for sure), I still think this might be a problem.
Or perhaps I misunderstand the intention of doing that, so could you please give me an explanation of this?
Thanks!

deSerialize(serialize()) doesn't clone properly.

Trying to use deSerialize(serialize()) as a cloning function, but it doesn't work properly.
In the following code, I use deSerialize(serialize()) to make a copy of an instance, and then destroy the copy immediately, which crashes the program with a "double free or corruption (out); Signal: SIGABRT (Aborted)" error.

#include <iostream>
#include <vector>

#include "include/surf.hpp"

using namespace surf;
#include <iostream>
#include <vector>
#include <algorithm>

#include "include/surf.hpp"
#include <string>

#include <fstream>

using namespace surf;
using namespace std;

vector<string> read_dataset(const string& file_path) {
    vector<string> dataset;
    ifstream file(file_path);
    cout << "INIT reading dataset file " << file_path << endl;
    int id = 0;
    while (!file.eof()) {
        string line;
        getline(file, line);
        if(!line.empty()) {
            dataset.push_back(line);
        }
        id++;
    }
    file.close();
    cout << "END reading dataset file " << file_path << endl;
    cout << "READ " << dataset.size() << " rows" << endl;
    return dataset;
}

static vector<string> split(string str, const string& delim) {
    vector<string> ret;
    while (str.find(delim) != std::string::npos) {
        if (str.find(delim) >= 1) {
            ret.push_back(str.substr(0, str.find(delim)));
        }
        str = str.substr(str.find(delim) + delim.size(), str.size() - (str.find(delim) + delim.size()));
    }
    if (!str.empty()) {
        ret.push_back(str);
    }
    return ret;
}

vector<pair<string, string> > read_workload(const string& file_path) {
    vector<pair<string, string> > workload;

    ifstream file(file_path);

    cout << "INIT reading workload file " << file_path << endl;
    int id = 0;
    while (!file.eof()) {
        string line;
        getline(file, line);
        assert(line.find(' ') != std::string::npos);
        if (!line.empty()) {
            vector<string> parts = split(line, " ");
            assert(parts.size() == 2);
            workload.emplace_back(parts[0], parts[1]);
        }
        id++;
    }
    file.close();
    cout << "END reading workload file " << file_path << endl;
    cout << "READ " << workload.size() << " rows" << endl;
    return workload;
}

int surf_main() {
    vector<string> dataset = read_dataset("dataset.txt");
    sort(dataset.begin(), dataset.end());
    vector<pair<string, string> > workload = read_workload("workload.txt");
    surf::SuRF *surf_real = new surf::SuRF(dataset, surf::kReal, 0, 8);
    surf::SuRF::deSerialize(surf_real->serialize())->destroy();
    assert(false);
    return 0;
}

Why lookupKey function doesn't work as expected?

I use 10000 key to build SuRF, and then lookup those key , but SuRF miss some key?

// test code
std::vector<std::string> keys;
for (int i = 0; i < 10000; i++) {
    keys.push_back(std::to_string(i));
}
// basic surf
SuRF* surf = new SuRF(keys);

// use default dense-to-sparse ratio; specify suffix type and length
SuRF* surf_hash = new SuRF(keys, surf::kHash, 8, 0);
SuRF* surf_real = new SuRF(keys, surf::kReal, 0, 8);

// customize dense-to-sparse ratio; specify suffix type and length
SuRF* surf_mixed = new SuRF(keys, true, 16,  surf::kMixed, 4, 4);

int n_error = 0;
int n_correct = 0;
for (int i = 0; i < 10000; i++) {
    if (surf->lookupKey(std::to_string(i)))
        n_correct++;
    else
        n_error++;
}
std::cout << "num:"<< n_error << ":" << n_correct << std::endl;

num:9:9991

Buffer overrun in bitvector

I saw a previous issue has been closed, citing assertions at the beginning of the functions. But those assertions are NOT sufficient.

Description of the problem:
potential buffer overrun in line 95 and 96 of bitvector.hpp

word_id++;

The assertion at beginning of the function (assert(pos <= num_bits_);) does NOT work, suppose there's a run of 0s longer than 128 near the end of the bit vector, when we enter the function the pos might as well within range, but the logic in the function will keep reading the next long until it runs over the buffer by exactly 8 bytes. The function will returns the wrong result because of this.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.