Giter Site home page Giter Site logo

efanna_graph's Introduction

EFANNA_graph: an Extremely Fast Approximate Nearest Neighbor graph construction Algorithm framework

EFANNA is a flexible and efficient library for approximate nearest neighbor search (ANN search) on large scale data. It implements the algorithms of our paper EFANNA : Extremely Fast Approximate Nearest Neighbor Search Algorithm Based on kNN Graph.
This project (efanna_graph) contains only the approximate nearest neighbor graph construction part in our EFANNA paper. The reasons are as follows:

  • Some advanced graph based ANN search algorithms (e.g., HNSW, NSG) make search with Efanna almost meaningless.
  • But the approximate kNN graph construction part in Efanna is still interesting and important.
  • The idea of kNN graph construction in Efanna is very simple. It just combines KD-tree and nndescent. The KD-tree part can be repleced by other kNN graph initilization algorithms. To make the algorithm framework more flexible, we decompose the nndescent part and KD-tree initilization part.
  • We also add an example to use Faiss to build an initial graph and use nndescent to refine the graph.

How To Complie

Case 1: If the dataset is very large (more than 4M points), you may want to use faiss to build a graph for initilization. Then you need to downlad the faiss package and compile with it.

cd efanna_graph/extern_libraries/
git clone https://github.com/facebookresearch/faiss.git
cd faiss/

Please see the documentation of faiss for how to compile the faiss.

Case 2: If you don't need the faiss for initilization, you can compile efanna_graph without the faiss support.

cd efanna_graph/
rm include/efanna2e/index_pq.h src/index_pq.cpp
comment the last two lines in tests/CMakeLists.txt

Then go to the root directory of efanna_graph and make.

cd efanna_graph/
cmake .
make

How To Use

  • kNN graph building with nndescent:

      cd tests/   
      ./test_nndescent data_file save_graph K L iter S R
    

Meaning of the parameters:

**data_file** is the path of the origin data.
**save_graph** is the path of the kNN graph to be saved.
**K** is the 'K' of kNN graph.
**L** is the parameter controlling the graph quality, larger is more accurate but slower, no smaller than K.
**iter** is the parameter controlling the iteration times, iter usually < 30.
**S** is the parameter contollling the graph quality, larger is more accurate but slower.
**R** is the parameter controlling the graph quality, larger is more accurate but slower.
  • kNN graph initilization with KD-tree and refine by nndescent (efanna):

      cd tests/   
      ./test_kdtree_graph data_file nTrees mLevel K saving_graph
      ./test_nndescent_refine data_file init_graph save_graph K L iter S R
    

Meaning of the parameters:

**nTrees** is the number of trees used to build the graph (larger is more accurate but slower)
**mLevel** conquer-to-depeth (smaller is more accurate but slower) 
**K** is the 'K' of kNN graph.
**saveing_graph** is the path of the kNN graph to be saved.
**init_graph** is the graph built by test_kdtree_graph, same with **saveing\_graph** here.
  • kNN graph initilization with faiss and refine by nndescent:

      cd tests/   
      ./test_faiss_graph data_file IndexKey SearchKey K saving_graph
      ./test_nndescent_refine data_file init_graph save_graph K L iter S R
    

Meaning of the parameters(from left to right):

**IndexKey** is the parameter in faiss to build the index.
**SearchKey** is the parameter in faiss to search with the index.

Please see the documentation of faiss for how to set these two parameters.

Output format

Suppose the database has N points, and numbered from 0 to N-1. You want to build an approximate kNN graph. The graph can be regarded as a N * k Matrix. The saved kNN graph binary file saves the matrix by row. The first 4 bytes of each row saves the int value of k, next 4 bytes saves the value of M and next 4 bytes saves the float value of the norm of the point. Then it follows k*4 bytes, saving the indices of the k nearest neighbors of respective point. The N rows are saved continuously without seperating characters.

Input of EFANNA

Because there is no unified format for input data, users may need to write input function to read your own data. You may imitate the input function in our sample code (sample/efanna_efanna_index_buildgraph.cc) to load the data into our matrix. To use SIMD instruction optimization, you should pay attention to the data alignment problem of SSE / AVX instruction.

Benchmark data set

  • SIFT1M and GIST1M

      ./test_nndescent gist_base.fvecs gist.100NN.graph 100 120 10 15 100
    

    With the above command, one can build a 100NN graph on GIST1M with around 96% accuracy in 960 seconds on a i7-4790K cpu with 8 threads.

      ./test_nndescent sift_base.fvecs sift.50NN.graph 50 70 8 10 100
    

    With the above command, one can build a 50NN graph on SIFT1M with around 90% accuracy in 72 seconds on a i7-4790K cpu with 8 threads.

      ./test_nndescent sift_base.fvecs sift.50NN.graph 50 70 10 10 50
    

    With the above command, one can build a 50NN graph on SIFT1M with around 97% accuracy in 106 seconds on a i7-4790K cpu with 8 threads.

Acknowledgment

The implemnetation of nndescent is taken from Kgraph. They proposed the nndescent algorithm. Many thanks to them for inspiration.

efanna_graph's People

Contributors

dengcai78 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

efanna_graph's Issues

Recall stays 0.0001

I am trying to generate a graph using efanna_graph. I have a dataset of 4000 images. I have calculated and wrote all SIFT descriptors to a .fvecs file and used that to generate the graph. Unfortunately efanna_graph recall never went above 0.0001. I believe it is an issue with the way I write descriptors to .fvecs.

I have tried to write .fvecs multiple ways. The code I am using now is this: write_to_fvecs . As you can see after I have calculated descriptors for each image and concatenated these descriptors in a single array I write them to .fvecs using:
vectorArray.astype(np.int32).tofile('./my_sift_descriptors.fvecs')
As you can see I use np.int32 which seems wrong to me. The reason for using np.int32 is as follows.

First I tried writing to file like this:
vectorArray.astype().tofile('./vanbeeklederwaren_astype_int32.fvecs')
But when I start efanna_graph test_nndescent I get this message: "data dimension: 1124073472
Floating point exception (core dumped)".

Then I tried writing to file like this(which seems to me is the correct way to this):
vectorArray.astype(np.float32).tofile('./vanbeeklederwaren_astype_int32.fvecs')
But again when running efanna_graph I get this message: "data dimension: 1124073472
Floating point exception (core dumped)".

Then I used this snippet: read_fvecs which you can use to read fvecs files in python. I used this snippet to read the first 4 bytes of 4 different files. The first:
The fvecs file provided by TexMex showed the first 4 bytes to be of type of float32 and the value was 1.8e-43.

The file saved without specifying a type was also of type float32 but displayed 128.0 when printed.

The file saved as float32 also was of type float32 and also displayed 128.0 when printed.

The file saved as int32 also was of type but displayed 1.8e-43 when printed.

I assumed the last file should be correct, thus I continued and calculated all my descriptors, saved them to .fvecs and started efanna_graph. However the training did no go as expected and the recall never went above 0.0001. The parameters I used: 200 200 20 10 100.

I can't seem to find a solution. Can you please provide your snippet on how you compute SIFT descriptors and save these to .fvecs file?

Thank you.

Can't compile: ‘index_factory’ is not a member of ‘faiss’

Hello, i'm trying to compile with faiss.
But after command "make" there is error:

/home/kimal/Kema/gitty/efanna_graph/src/index_pq.cpp: In member function ‘virtual void efanna2e::IndexPQ::Build(size_t, const float*, const efanna2e::Parameters&)’:
/home/kimal/Kema/gitty/efanna_graph/src/index_pq.cpp:46:18: error: ‘index_factory’ is not a member of ‘faiss’
   46 |   index = faiss::index_factory(dimension_, pq_index_key.c_str());
      |   

Screenshot from 2019-12-21 20-42-40
I'm using Fedora.
I've already compiled faiss in "extern_libraries" directory.
Is there something I've missed?

Thank you for reading. I'll be glad to see any response.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.