curator-v2's Introduction

Curator: Efficient Tree-Based Vector Indexing for Filtered Search

At its core, Curator constructs a memory-efficient clustering tree that indexes all vectors and embeds multiple per-label indexes as sub-trees. These per-label indexes are not only extremely lightweight but also capture the unique vector distribution of each label, leading to high search performance and a low memory footprint. Furthermore, each per-label index can be constructed and updated independently with minimal cost, and multiple per-label indexes can be flexibly composed to handle queries with complex filter predicates.
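The idea of embedding per-label indexes as sub-trees of one shared tree can be illustrated with a toy sketch. This is not Curator's implementation: the names `Node`, `dist2`, and `filtered_search` are hypothetical, the two-seed split stands in for real clustering, and the search scans the label's sub-tree exhaustively rather than approximately. It only shows how recording the labels beneath each node lets a filtered query skip every branch that does not contain the label.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class Node:
    """One node of a toy clustering tree (hypothetical, for illustration)."""

    def __init__(self, ids, vectors, labels, leaf_size=2):
        self.ids = ids
        # Labels occurring anywhere beneath this node. The nodes that
        # contain a given label together form that label's sub-tree.
        self.labels = set().union(*(labels[i] for i in ids))
        self.children = []
        if len(ids) <= leaf_size:
            return
        dim = len(vectors[ids[0]])
        centroid = [sum(vectors[i][d] for i in ids) / len(ids) for d in range(dim)]
        # Two-seed split: a simple stand-in for a real clustering step.
        a = max(ids, key=lambda i: dist2(vectors[i], centroid))
        b = max(ids, key=lambda i: dist2(vectors[i], vectors[a]))
        left, right = [], []
        for i in ids:
            nearer_a = dist2(vectors[i], vectors[a]) <= dist2(vectors[i], vectors[b])
            (left if nearer_a else right).append(i)
        if right:  # right is empty only when all vectors coincide
            self.children = [Node(left, vectors, labels, leaf_size),
                             Node(right, vectors, labels, leaf_size)]


def filtered_search(root, vectors, labels, query, label, k):
    """Return ids of the k nearest vectors carrying `label`."""
    hits, stack = [], [root]
    while stack:
        node = stack.pop()
        if label not in node.labels:
            continue  # prune: this branch is outside the label's sub-tree
        if node.children:
            stack.extend(node.children)
        else:
            hits.extend((dist2(vectors[i], query), i)
                        for i in node.ids if label in labels[i])
    return [i for _, i in sorted(hits)[:k]]


vectors = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [9.0, 9.0], [9.0, 8.0]]
labels = [{"a"}, {"a", "b"}, {"b"}, {"a"}, {"b"}, {"a", "b"}]
root = Node(list(range(len(vectors))), vectors, labels)
print(filtered_search(root, vectors, labels, query=[5.0, 5.0], label="b", k=2))  # [2, 5]
```

Because each node only stores a label set, adding or removing a label for one vector touches just the nodes on its root-to-leaf path, which is the intuition behind updating per-label indexes independently.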

Repository Structure

  • 3rd_party/faiss: C++ implementation of Curator and baselines

    • MultiTenantIndexIVFHierarchical.cpp: Curator
    • MultiTenantIndexIVFFlat.cpp: IVF with metadata filtering
    • MultiTenantIndexIVFFlatSep.cpp: IVF with per-tenant indexing
    • MultiTenantIndexHNSW.cpp: HNSW with metadata filtering
  • indexes: Python API for indexes

    • ivf_hier_faiss.py: Curator
    • ivf_flat_mt_faiss.py: IVF with metadata filtering
    • ivf_flat_sepidx_faiss.py: IVF with per-tenant indexing
    • hnsw_mt_hnswlib.py: HNSW with metadata filtering
    • hnsw_sepidx_hnswlib.py: HNSW with per-tenant indexing
  • dataset: code for evaluation datasets

    • arxiv_dataset.py: arXiv dataset
    • yfcc100m_dataset.py: YFCC100M dataset
  • benchmark: code for running benchmarks

How to Use

Install Dependencies

We assume that you have installed Anaconda. To create and activate a conda environment with the required Python packages, run the following commands:

conda env create -f environment.yml -n ann_bench
conda activate ann_bench

Build from Source

cd 3rd_party/faiss

cmake -B build . \
  -DFAISS_ENABLE_GPU=OFF \
  -DFAISS_ENABLE_PYTHON=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DFAISS_OPT_LEVEL=avx2 \
  -DBUILD_TESTING=ON

make -C build -j32 faiss_avx2
make -C build -j32 swigfaiss_avx2
cd build/faiss/python
python setup.py install
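
After the install step, a quick check from Python can confirm that the freshly built bindings are the ones being imported. The sketch below is only a suggestion: the helper `faiss_installed` is hypothetical, and the class name `MultiTenantIndexIVFHierarchical` is an assumption taken from the C++ file name listed above, so it may be exposed under a different name.

```python
import importlib.util


def faiss_installed():
    """Return True if a module named `faiss` can be found on the import path."""
    return importlib.util.find_spec("faiss") is not None


if faiss_installed():
    import faiss

    print("faiss version:", faiss.__version__)
    # Assumption: the Curator index is exposed under the same name as its
    # C++ source file. A stock faiss install would not have this attribute.
    print("Curator index available:",
          hasattr(faiss, "MultiTenantIndexIVFHierarchical"))
else:
    print("faiss is not importable; check the build and install steps")
```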

Generate Datasets

mkdir -p data/yfcc100m
yfcc100m_base_url="https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/yfcc100M"
wget -P data/yfcc100m ${yfcc100m_base_url}/base.10M.u8bin
wget -P data/yfcc100m ${yfcc100m_base_url}/base.metadata.10M.spmat

mkdir -p data/arxiv
# Manually download the arXiv dataset from https://www.kaggle.com/datasets/Cornell-University/arxiv
# and save it as data/arxiv/arxiv-metadata-oai-snapshot.json

python -m dataset.yfcc100m_dataset
python -m dataset.arxiv_dataset
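
To sanity-check the YFCC100M download before running the dataset scripts, the base file can be inspected directly. The sketch below assumes the binary layout used by the billion-scale ANN benchmarks (two little-endian `uint32` values giving the number of vectors and the dimension, followed by the vectors as raw `uint8` bytes); the function `read_u8bin` is a hypothetical helper, not part of this repository.

```python
import os
import struct


def read_u8bin(path, max_vectors=None):
    """Read the header and the first `max_vectors` vectors of a .u8bin file.

    Assumed layout: uint32 count, uint32 dimension, then count * dimension
    bytes of uint8 vector data stored row by row.
    """
    with open(path, "rb") as f:
        n, dim = struct.unpack("<II", f.read(8))
        count = n if max_vectors is None else min(n, max_vectors)
        data = f.read(count * dim)
    vectors = [list(data[i * dim:(i + 1) * dim]) for i in range(count)]
    return n, dim, vectors


path = "data/yfcc100m/base.10M.u8bin"
if os.path.exists(path):
    n, dim, head = read_u8bin(path, max_vectors=3)
    print(f"{n} vectors of dimension {dim}; first vector starts with {head[0][:8]}")
```

Passing `max_vectors` keeps the check cheap, since only the header and a few rows are read instead of the whole file.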

Build Docker Image

# Download the cuda-keyring package for updating the CUDA linux GPG repository key
# https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key/
# Please replace $distro and $arch with your own distro and arch
wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-keyring_1.0-1_all.deb
sudo docker build --rm -t ann-bench .

Run Benchmarks

Refer to the scripts in the scripts folder for details. For example, to evaluate Curator on the YFCC100M dataset, run the following command:

python=$(which python)  # assuming conda env is activated

sudo ${python} \
run_parallel_exp.py run_curator_overall_exp \
  --dataset yfcc100m \
  --cpu-limit 0 \
  --mem_limit 20000000000 \
  --num_runs 1

curator-v2's People

Contributors

hatsu3
