Giter Site home page Giter Site logo

huangqiang / bc-tree Goto Github PK

View Code? Open in Web Editor NEW
13.0 1.0 1.0 186 KB

BC-Tree and Ball-Tree for Point-to-Hyperplane NNS (ICDE 2023)

License: MIT License

Makefile 0.22% C++ 68.06% Python 18.54% Shell 12.68% C 0.50%
bc-tree ball-tree hyperplane-query point-to-hyperplane-nns hyperplane-hashing

bc-tree's Introduction

Lightweight-Yet-Efficient: Revitalizing Ball-Tree for Point-to-Hyperplane Nearest Neighbor Search

Welcome to the Ball-Tree and BC-Tree GitHub!

In this repository, we target at solving the problem of Point-to-Hyperplane Nearest Neighbor Search (P2HNNS). Given a set of data points in a ($d-1$)-dimensional Euclidean space and a hyperplane query (represented as a vector in $d$-dimensional Euclidean space), the problem of P2HNNS aims to find the point whose distance to the hyperplane query is minimized among all the data.

This problem plays a vital role in many research domains. For example, in the applications of pool-based active learning with SVMs, the goal is to request labels for the data points closest (with minimum margin) to the SVM's decision hyperplane to reduce human efforts for annotation. Moreover, motivated by the success of SVM for classification, the maximum margin clustering aims at finding the hyperplane maximizing the minimum margin to the data, which can separate the data from different classes. Such applications require finding the data points that are closest to the hyperplane.

This repository provides the implementations and experiments of our work entitled Lightweight-Yet-Efficient: Revitalizing Ball-Tree for Point-to-Hyperplane Nearest Neighbor Search that has been published in IEEE ICDE 2023. You could also find our arXiv version here. We implement Ball-Tree and BC-Tree for performing P2HNNS in high-dimensional spaces. To make a systematic comparison, we include two state-of-the-art hyperplane hashing schemes NH and FH (published in SIGMOD 2021) as baselines for evaluations, where their implementation codes can be found here.

Data Sets

Data Sets Details

We study the performance of Ball-Tree and BC-Tree on 16 real-world data sets, i.e., Music, GloVe, Sift, UKBench, Tiny, Msong, NUSW, Cifar-10, Sun, LabelMe, Gist, Enron, Trevi, P53, Deep100M, and Sift100M. Their statistics are summarized as follows.

Data Sets # Data Points # Dim # Queries Data Size Data Type
Music 1,000,000 100 100 386 MB Rating
GloVe 1,183,514 100 100 460 MB Text
Sift 985,462 128 100 485 MB Image
UKBench 1,097,907 128 100 541 MB Image
Tiny 1,000,000 384 100 1.5 GB Image
Msong 992,272 420 100 1.6 GB Audio
NUSW 268,643 500 100 514 MB Image
Cifar-10 50,000 512 100 98 MB Image
Sun 79,106 512 100 155 MB Image
LabelMe 181,093 512 100 355 MB Image
Gist 982,694 960 100 3.6 GB Image
Enron 94,987 1,369 100 497 MB Text
Trevi 100,900 4,096 100 1.6 GB Image
P53 31,153 5,408 100 643 MB Biology
Deep100M 100,000,000 96 100 36.1 GB Image
Sift100M 99,986,452 128 100 48.0 GB Image

Hyperplane Query Generation

Different from NH and FH that generate the hyperplane queries uniformly at random from a hypercube, we generate the hyperplane queries uniformly at random from a $d$-ball. We prefer this new way because it is more natural to simulate the hyperplane queries appeared in real-world applications. Please click here for more details about the $d$-ball and its generation algorithm. The data sets and hyperplane queries used in our experiments can be donwload here.

Except for the link to download the data sets and hyperplane queries, we have also enclosed the sourece codes to generate the hyperplane query and the source codes for data sets deduplication, so that users can conduct experiemnts on any data sets of interests. Please refer to the fold pre_process/ for more details.

Data Format

To reduce the I/O cost, all the datasets and queries are stored in a binary format. For a dataset (or query set) with n vectors in d dimensions, the binary file comparises with a float array of n*d length. Please specify the cardinality n and data dimension d in the bash scripts in advance.

Requirements

  • Ubuntu 18.04 (or higher version)
  • g++ 8.3.1 (or higher version).
  • Python 3.7 (or higher version)

Compilation

The source codes are implemented by C++, which requires g++-8 with c++17 (or higher version) for compilation. Thus, please check whether the g++-8 is installed before the compilation. If not, we provide a way to install g++-8 in Ubuntu 18.04 as follows.

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-8
sudo apt-get install gcc-8 (optional)

Users can use the following commands to compile the C++ source codes:

cd methods/
make -j

Experiments

We provide the bash scripts to run all experiments. Once you have downloaded the data sets and queries and completed the compilation, you can reproduce the experiments by simply running the following commands:

cd methods/
bash run.sh

Visualization

Finally, we privode python scripts for visualization. These scripts require python 3.7 (or higher versions) with numpy, scipy, and matplotlib installed. If not, you might need to use anaconda to create a new virtual environment and use pip to install those packages.

After you have conducted all experiments, you can plot all the figures in our paper with the following commands.

cd scripts/
python3 plot.py

or

cd scripts/
python plot.py

Reference

Thank you for being patient in reading the user manual. We will appreciate using the following BibTeX to cite this work when you use these source codes in your paper.

@inproceedings{huang2023lightweight,
  title={Lightweight-Yet-Efficient: Revitalizing Ball-Tree for Point-to-Hyperplane Nearest Neighbor Search},
  author={Huang, Qiang and Tung, Anthony K. H.},
  booktitle={2023 IEEE 39th International Conference on Data Engineering (ICDE)},
  pages={436--449},
  year={2023},
  organization={IEEE}
}

It is welcome to contact me ([email protected]) if you meet any issue. Thank you for your interest!

bc-tree's People

Contributors

huangqiang avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Forkers

benjaminxiang

bc-tree's Issues

about query

Excuse me, is there no query function in ball_tree? How do I get query's nearest neighbor vector index
this is my code
my code:XX is a double* (size N*D) X_test is double** N is 10 D is 512

int leaf = 40;
p2h::Ball_Tree<double>* tree = new p2h::Ball_Tree<double>(N, D, leaf, XX);
tree->display();
p2h::MinK_List* list = new  p2h::MinK_List(10);
std::cout << "   nns   " << tree->nns(10, 10, 1, X_test[0], list) << std::endl;;

How to get X_test[0]'s nearest neighbor index
tasnks

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.