QALSH: Query-Aware Locality-Sensitive Hashing

Welcome to the QALSH GitHub!

QALSH is a package for the problem of Nearest Neighbor Search (NNS) over high-dimensional Euclidean spaces. Given a set of data points and a query, the problem of NNS aims to find the data point nearest to the query. It is a fundamental problem with wide applications in many data mining and machine learning tasks.

This package provides the external-memory (disk-based) implementations of QALSH and QALSH+ for c-Approximate Nearest Neighbor Search (c-ANNS) under the lp norm, where 0 < p ⩽ 2. The internal-memory version can be found here.
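For reference, under the lp norm the distance between a data point x and a query q in d dimensions is dist_p(x, q) = (sum_{i=1}^{d} |x_i - q_i|^p)^{1/p}.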

For more details of QALSH and QALSH+, please refer to our papers Query-Aware Locality-Sensitive Hashing for Approximate Nearest Neighbor Search and Query-Aware Locality-Sensitive Hashing Scheme for lp Norm, published in PVLDB 2015 and VLDBJ 2017, respectively.

Datasets and Queries

We study the performance of QALSH and QALSH+ on four real-life high-dimensional datasets, i.e., Sift, Gist, Trevi, and P53. We also include a toy dataset, Mnist, for illustration, and a large-scale dataset, Sift10M, for further validation. For each dataset, we provide 100 queries (randomly selected from its test set or extracted from the dataset itself) for evaluation. The statistics of the datasets and queries are summarized below.

Datasets | #Data Points (n) | #Queries | Dimensionality (d) | Range of Coordinates | Data Type
---------|------------------|----------|--------------------|----------------------|----------
Sift     | 1,000,000        | 100      | 128                | [0, 255]             | uint8
Gist     | 1,000,000        | 100      | 960                | [0, 15,000]          | uint16
Trevi    | 100,800          | 100      | 4,096              | [0, 255]             | uint8
P53      | 31,059           | 100      | 5,408              | [0, 10,000]          | uint16
Mnist    | 60,000           | 100      | 50                 | [0, 255]             | uint8
Sift10M  | 10,000,000       | 100      | 128                | [0, 255]             | uint8

Note that all the datasets and queries are in a binary format: each file can be viewed as an array of n·d coordinates, where each coordinate is stored using the specified data type, e.g., uint8 or uint16. We currently support four data types: uint8, uint16, int32, and float32. The data type of a dataset can be determined from the range of its coordinates. If you want to support more data types, you can update the interface in main.cc and re-compile the package.
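As an illustration, a file in this format can be read with a few lines of C++. The following is a minimal sketch (the function name and the widening to float are our assumptions, not part of the package):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hedged sketch: read an n x d binary file of uint8 coordinates and widen
// them to float for downstream processing. Returns an empty vector on failure.
std::vector<float> read_uint8_dataset(const char *path, size_t n, size_t d) {
    std::vector<uint8_t> raw(n * d);
    FILE *fp = fopen(path, "rb");
    if (fp == nullptr) return {};
    size_t got = fread(raw.data(), sizeof(uint8_t), n * d, fp);
    fclose(fp);
    if (got != n * d) return {};    // truncated file or wrong n, d
    return std::vector<float>(raw.begin(), raw.end());
}

For uint16, int32, or float32 data, only the element type of raw changes.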

Compilation

The package requires g++ with C++11 support. To download and compile the C++ source code, please run the following commands:

git clone git@github.com:HuangQiang/QALSH.git
cd QALSH/methods/
make -j

Usage

Suppose you have cloned the project and are in the folder QALSH/. We provide bash scripts to run the experiments on the six real-life datasets.

Step 1: Get the Datasets and Queries

Please download the datasets and copy them to the directory data/.

For example, when you get Sift.ds and Sift.q, please move them to the paths data/Sift/Sift.ds and data/Sift/Sift.q, respectively. Mnist is included in this package; you can follow the same pattern to move the other datasets into place.
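For example, for Sift (an illustrative sketch; adjust the source paths to wherever you downloaded the files):

mkdir -p data/Sift/
mv Sift.ds Sift.q data/Sift/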

Step 2: Run Experiments

When you run the package, please ensure that the paths to the dataset, query set, and truth set are correct. The package automatically creates the folder for the output path, so please keep the output path as short as possible. All of the experiments can be run with the following commands:

cd methods/
bash run_all.sh

A gentle reminder: since QALSH and QALSH+ need the ground-truth results for evaluation, please run with -alg 0 first to generate the truth set.
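For illustration, a truth-set run for Mnist might look as follows (a hedged sketch assembled from the options listed in Step 3; the relative paths and the output folder are our assumptions):

cd methods/
./qalsh -alg 0 -n 60000 -qn 100 -d 50 -B 4096 -p 2.0 -dt uint8 -pf ../data/Mnist/Mnist -of ../results/Mnist/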

Step 3: Parameter Settings

Finally, if you would like to use this package for c-ANNS on other datasets, you may want more information about the parameters and how to set them effectively. Based on our experience from conducting the experiments, we now share some tips for setting the parameters, i.e., B, lf, L, M, p, z, and c. The parameters are described as follows.

Usage: qalsh [OPTIONS]

This package supports 6 options to evaluate the performance of QALSH, QALSH^+,
and Linear_Scan for c-k-ANNS. The parameters are introduced as follows.

  -alg    integer    options of algorithms (0 - 5)
  -n      integer    cardinality of dataset
  -d      integer    dimensionality of dataset and query set
  -qn     integer    number of queries
  -B      integer    page size
  -lf     integer    leaf size of kd_tree
  -L      integer    number of projections for drusilla_select
  -M      integer    number of candidates  for drusilla_select
  -p      float      l_{p} norm, where 0 < p ⩽ 2
  -z      float      symmetric factor of p-stable distribution (-1 ⩽ z ⩽ 1)
  -c      float      approximation ratio for c-k-ANNS (c > 1)
  -dt     string     data type (i.e., uint8, uint16, int32, float32)
  -pf     string     the prefix of dataset, query set, and truth set
  -of     string     output folder to store output results

The settings of B

B is the page size in bytes. It is determined by the data dimensionality d. Based on our experience, we use the following rules to set B:

  • (1) 1 ⩽ d < 256: we set B = 4096;
  • (2) 256 ⩽ d < 512: we set B = 8192;
  • (3) 512 ⩽ d < 1024: we set B = 16384;
  • (4) 1024 ⩽ d < 4096: we set B = 32768;
  • (5) 4096 ⩽ d < 8192: we set B = 65536;

For the case d ⩾ 8192, a correspondingly larger B value can be set by extending the rules above.
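These rules are straightforward to encode. A minimal sketch, assuming B keeps doubling as d doubles beyond 8192 (the function name is ours):

// Map the dimensionality d to a page size B in bytes, following the rules above.
int page_size_for_dim(int d) {
    if (d < 256)  return 4096;
    if (d < 512)  return 8192;
    if (d < 1024) return 16384;
    if (d < 4096) return 32768;
    if (d < 8192) return 65536;
    int B = 65536;
    for (int t = 8192; d >= t; t *= 2) B *= 2;   // extrapolate for d >= 8192
    return B;
}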

The settings of lf, L, and M

lf is the maximum leaf size of the kd-tree. L and M are two parameters used by Drusilla_Select: L is the number of random projections, and M is the number of representative data points selected on each random projection.

Let K be the number of blocks, and let n0 be the actual leaf size after the kd-tree partitioning. Once lf is determined, K and n0 can be computed as follows (see the sketch after this list):

  • K = 2^h, where h is the height of the kd-tree, i.e., h = ceil(log_2(n / lf));
  • n0 = floor(n / K) or n0 = ceil(n / K) (both cases can occur when n is not divisible by K).
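A minimal sketch of this computation (the function name is ours):

#include <cmath>

// Number of blocks K = 2^h, with h = ceil(log_2(n / lf)).
int num_blocks(int n, int lf) {
    int h = (int)std::ceil(std::log2((double)n / lf));
    return h > 0 ? (1 << h) : 1;    // a single block when lf >= n
}

For example, n = 1,000,000 and lf = 20,000 give h = 6 and K = 64, so n0 = 15,625 (floor and ceil coincide here since K divides n).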

When setting these three parameters lf, L, and M, it is better to satisfy the following three conditions (a worked example follows the list):

  • lf < n: naturally, the maximum leaf size should be smaller than the cardinality of the dataset;
  • L · M < n0: when Drusilla_Select picks the representative data points of each block, the number of points it selects (L · M) should be less than the block size n0;
  • K · L · M ≈ n0: this condition captures the main trade-off between efficiency and accuracy.
    • On the one hand, if the total number of representative data points (K · L · M) is large, we can accurately identify the blocks close to the query, but the first-level close-block search takes more time.
    • On the other hand, if K · L · M is small, the time for the first-level close-block search is reduced. However, since the selected blocks may not be truly close to the query, the accuracy of the second-level c-ANNS may also drop.
    • In our experiments, we found that selecting representative data points with total cardinality close to n0 achieves a good trade-off between accuracy and efficiency.
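To make the trade-off concrete, here is a hedged worked example continuing the numbers above (lf = 20,000 and the choice L = 30, M = 8 are our assumptions, not recommended defaults):

#include <cstdio>

int main() {
    int n = 1000000, lf = 20000;    // hypothetical dataset and leaf size
    int K = 64, n0 = n / K;         // h = 6 and K = 2^h from the formula above
    int L = 30, M = 8;              // one hypothetical choice of L and M
    printf("L*M   = %5d (< n0 = %d)\n", L * M, n0);      // 240 < 15,625
    printf("K*L*M = %5d (~ n0 = %d)\n", K * L * M, n0);  // 15,360 ~ 15,625
    return 0;
}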

The settings of p, z, and c

p and z determine the distance metric and the corresponding p-stable distribution. There are four cases, as follows (a sampling sketch comes after the list).

  • l2 distance: set p = 2.0 and z = 0.0, and apply the standard Gaussian distribution.
  • l1 distance: set p = 1.0 and z = 0.0, and apply the standard Cauchy distribution.
  • l0.5 distance: set p = 0.5 and z = 1.0, and apply the standard Lévy distribution.
  • General lp distance: set 0 < p ⩽ 2 and -1 ⩽ z ⩽ 1.
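For intuition, the three special cases can be sampled with the C++ standard library plus one classical identity: if Z ~ N(0, 1), then 1/Z^2 follows the standard Lévy distribution. A minimal sketch (the general (p, z) case needs a full stable-distribution sampler, e.g., Chambers-Mallows-Stuck, which we omit):

#include <random>

// Draw one sample from the p-stable distribution for the special cases above.
double sample_special_case(double p, std::mt19937 &gen) {
    std::normal_distribution<double> gauss(0.0, 1.0);
    if (p == 2.0) return gauss(gen);             // l2: standard Gaussian
    if (p == 1.0) {
        std::cauchy_distribution<double> cauchy(0.0, 1.0);
        return cauchy(gen);                      // l1: standard Cauchy
    }
    double g = gauss(gen);                       // l0.5: standard Levy,
    return 1.0 / (g * g);                        // since 1/Z^2 ~ Levy(0, 1)
}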

c is the approximation ratio for c-ANNS. We set c = 2.0 by default. If the dataset is easy, it is also fine to set c = 3.0 or an even larger value.

References

Please use the following BibTeX entries to cite this work if you use QALSH in your publications.

@article{huang2017query,
    title={Query-aware locality-sensitive hashing scheme for $l_p$ norm},
    author={Huang, Qiang and Feng, Jianlin and Fang, Qiong and Ng, Wilfred and Wang, Wei},
    journal={The VLDB Journal},
    volume={26},
    number={5},
    pages={683--708},
    year={2017}
}

@article{huang2015query,
    title={Query-aware locality-sensitive hashing for approximate nearest neighbor search},
    author={Huang, Qiang and Feng, Jianlin and Zhang, Yikai and Fang, Qiong and Ng, Wilfred},
    journal={Proceedings of the VLDB Endowment},
    volume={9},
    number={1},
    pages={1--12},
    year={2015}
}

Please feel free to contact me ([email protected]) if you run into any issues. Thank you.


Issues

QALSH_PLUS error: `nb <= num_blocks_' failed

Dear author:
This is great work, and I am trying to apply your algorithm to my own dataset. My dataset consists of 3,605 objects with dimensionality 100. Due to the nature of the task, the dataset and the query set are the same, and I search for the top-10 nearest neighbors.

However, while QALSH runs successfully, QALSH_PLUS fails with an error:

qalsh: qalsh_plus.cc:561: int QALSH_PLUS::get_block_order(int, MinK_List*, std::vector<int>&): Assertion `nb <= num_blocks_' failed.
Aborted (core dumped)

According to your article, each leaf of the kd-tree is regarded as a block. I checked the parameter num_blocks_ and found that its value is 1, while the iteration variable nb becomes greater than 1.

So my question is: is there any way to adjust the program so that the data is split into more blocks (larger num_blocks_) and the script runs successfully? I have adjusted the input parameters such as lf, B, c, and p, but none of them seems to help.

I would greatly appreciate a reply. Thanks a lot!
