erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python

Home Page: http://ann-benchmarks.com

License: MIT License

Languages: Python 88.07%, HTML 2.42%, Dockerfile 9.51%
Topics: nearest-neighbors, benchmark, docker

ann-benchmarks's Introduction

Benchmarking nearest neighbors

Fast nearest-neighbor search in high-dimensional spaces is an increasingly important problem, yet there have been notably few empirical attempts to compare approaches in an objective way, despite a clear need for such comparisons to drive optimization forward.

This project contains tools to benchmark various implementations of approximate nearest neighbor (ANN) search for selected metrics. We have pre-generated datasets (in HDF5 format) and prepared Docker containers for each algorithm, as well as a test suite to verify function integrity.

Evaluated

Data sets

We have a number of precomputed data sets in HDF5 format. All data sets have been pre-split into train/test and include ground truth data for the top-100 nearest neighbors.

Dataset       | Dimensions | Train size | Test size | Neighbors | Distance  | Download
DEEP1B        | 96         | 9,990,000  | 10,000    | 100       | Angular   | HDF5 (3.6GB)
Fashion-MNIST | 784        | 60,000     | 10,000    | 100       | Euclidean | HDF5 (217MB)
GIST          | 960        | 1,000,000  | 1,000     | 100       | Euclidean | HDF5 (3.6GB)
GloVe         | 25         | 1,183,514  | 10,000    | 100       | Angular   | HDF5 (121MB)
GloVe         | 50         | 1,183,514  | 10,000    | 100       | Angular   | HDF5 (235MB)
GloVe         | 100        | 1,183,514  | 10,000    | 100       | Angular   | HDF5 (463MB)
GloVe         | 200        | 1,183,514  | 10,000    | 100       | Angular   | HDF5 (918MB)
Kosarak       | 27,983     | 74,962     | 500       | 100       | Jaccard   | HDF5 (33MB)
MNIST         | 784        | 60,000     | 10,000    | 100       | Euclidean | HDF5 (217MB)
MovieLens-10M | 65,134     | 69,363     | 500       | 100       | Jaccard   | HDF5 (63MB)
NYTimes       | 256        | 290,000    | 10,000    | 100       | Angular   | HDF5 (301MB)
SIFT          | 128        | 1,000,000  | 10,000    | 100       | Euclidean | HDF5 (501MB)
Last.fm       | 65         | 292,385    | 50,000    | 100       | Angular   | HDF5 (135MB)
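
Each file can be read directly with h5py. The snippet below is a minimal loading sketch; it assumes the layout used by these pre-generated files (datasets named train, test, neighbors and distances, plus a distance attribute), so double-check against the actual file if in doubt:

    # Minimal sketch: inspect a downloaded dataset file with h5py.
    # Assumes datasets named "train", "test", "neighbors" and "distances";
    # verify against the actual file layout.
    import h5py

    with h5py.File("glove-100-angular.hdf5", "r") as f:
        train = f["train"][:]          # vectors to be indexed
        test = f["test"][:]            # query vectors
        neighbors = f["neighbors"][:]  # ids of the true top-100 neighbors
        distances = f["distances"][:]  # distances to the true neighbors
        print(train.shape, test.shape, neighbors.shape, f.attrs.get("distance"))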

Results

These results are as of April 2023, with all benchmarks run on an r6i.16xlarge machine on AWS with --parallelism 31 and hyperthreading disabled. All benchmarks are single-CPU.

Plots are included for the following datasets: glove-100-angular, sift-128-euclidean, fashion-mnist-784-euclidean, nytimes-256-angular, gist-960-euclidean, glove-25-angular.

TODO: update plots on http://ann-benchmarks.com.

Install

The only prerequisites are Python (tested with 3.10.6) and Docker.

  1. Clone the repo.
  2. Run pip install -r requirements.txt.
  3. Run python install.py to build all the libraries inside Docker containers (this can take a while, like 10-30 minutes).

Running

  1. Run python run.py (this can take an extremely long time, potentially days).
  2. Run python plot.py or python create_website.py to plot results.
  3. Run python data_export.py --out res.csv to export all results into a CSV file for additional post-processing (a rough post-processing sketch follows below).
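
As a rough post-processing sketch (the column names "dataset", "algorithm" and "qps" are assumptions for illustration; check the actual header of res.csv):

    # Rough post-processing sketch for the exported CSV.
    # NOTE: the column names below are assumptions; inspect res.csv for the real header.
    import pandas as pd

    df = pd.read_csv("res.csv")
    print(df.columns.tolist())  # see which metrics were exported

    best = (df.groupby(["dataset", "algorithm"])["qps"]
              .max()
              .sort_values(ascending=False))
    print(best.head(20))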

You can customize the algorithms and datasets as follows:

  • Check that ann_benchmarks/algorithms/{YOUR_IMPLEMENTATION}/config.yml contains the parameter settings that you want to test
  • To run experiments on a single dataset, e.g. GloVe, invoke python run.py --dataset glove-100-angular. See python run.py --help for more information on possible settings. Note that experiments can take a long time.
  • To process the results, either use python plot.py --dataset glove-100-angular or python create_website.py. An example call: python create_website.py --plottype recall/time --latex --scatter --outputdir website/.

Including your algorithm

Add your algorithm in the folder ann_benchmarks/algorithms/{YOUR_IMPLEMENTATION}/ by providing

  • A small Python wrapper in module.py
  • A Dockerfile named Dockerfile
  • A set of hyper-parameters in config.yml
  • A CI test run by adding your implementation to .github/workflows/benchmarks.yml

Check the available implementations for inspiration.
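
For orientation, here is a minimal module.py sketch. The base-class import path and the fit/set_query_arguments/query method names are assumptions modelled on the existing wrappers; mirror a current implementation (e.g. the annoy or hnswlib folder) for the exact interface:

    # Minimal wrapper sketch, loosely modelled on the existing implementations.
    # The base-class import path and method names are assumptions; copy them
    # from an existing wrapper in ann_benchmarks/algorithms/.
    import numpy as np

    from ..base.module import BaseANN  # adjust to the actual base-class location


    class MyBruteForce(BaseANN):
        def __init__(self, metric):
            self._metric = metric
            self.name = "my-bruteforce(%s)" % metric

        def fit(self, X):
            # "Index" construction: here we just keep the data around.
            self._data = np.asarray(X, dtype=np.float32)

        def set_query_arguments(self, dummy):
            # Query-time hyper-parameters from config.yml would be applied here.
            pass

        def query(self, v, n):
            # Exact search as a stand-in for a real ANN index.
            if self._metric == "angular":
                data = self._data / np.linalg.norm(self._data, axis=1, keepdims=True)
                v = np.asarray(v) / np.linalg.norm(v)
                dists = 1.0 - data.dot(v)
            else:  # euclidean
                dists = np.linalg.norm(self._data - v, axis=1)
            return np.argsort(dists)[:n]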

Principles

  • Everyone is welcome to submit pull requests with tweaks and changes to how each library is being used.
  • In particular: if you are the author of any of these libraries, and you think the benchmark can be improved, consider making the improvement and submitting a pull request.
  • This is meant to be an ongoing project that represents the current state.
  • Make everything easy to replicate, including installing and preparing the datasets.
  • Try many different values of parameters for each library and ignore the points that are not on the precision-performance frontier.
  • Use high-dimensional datasets with approximately 100-1000 dimensions. This is challenging but also realistic. We avoid more than 1000 dimensions because those problems should probably be tackled by dimensionality reduction as a separate step.
  • Single queries are used by default. ANN-Benchmarks enforces that only one CPU is saturated during experimentation, i.e., no multi-threading. A batch mode is available that provides all queries to the implementations at once. Add the flag --batch to run.py and plot.py to enable batch mode.
  • Avoid extremely costly index building (more than several hours).
  • Focus on datasets that fit in RAM. For billion-scale benchmarks, see the related big-ann-benchmarks project.
  • We mainly support CPU-based ANN algorithms. GPU support exists for FAISS, but it has to be compiled with GPU support locally and experiments must be run using the flags --local --batch.
  • Use a proper train/test split between index data and query points.
  • Set-similarity datasets are treated as sparse: the set for each user is passed to the algorithms as a sorted array of integers (see the sketch after this list).
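
To make the last point concrete, here is a small illustrative sketch of Jaccard distance on that sparse representation (not the project's exact code):

    # Illustrative only: Jaccard distance between two sets, each represented
    # as a sorted array of integer ids (the sparse set-similarity format).
    import numpy as np

    def jaccard_distance(a, b):
        a, b = np.asarray(a), np.asarray(b)
        intersection = np.intersect1d(a, b, assume_unique=True).size
        union = a.size + b.size - intersection
        return 1.0 - intersection / union if union else 0.0

    print(jaccard_distance([1, 3, 7, 9], [3, 7, 8]))  # 0.6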

Authors

Built by Erik Bernhardsson with significant contributions from Martin Aumüller and Alexander Faithfull.

Related Publications

The design principles behind the benchmarking framework are described in the following publications:

Related Projects

ann-benchmarks's People

Contributors

aaalgo, aagnone3, ale-f, alexklibisz, ankane, benwtrent, ceh, cococo2000, erikbern, etiennedi, generall, guilhemn, hhy3, jeadie, khylonwong, kshivendu, lmcinnes, masajiro, maumueller, mohamed-ali, raoufchebri, sammymax, sdu-l, searchivarius, shikharj, stephenleo, tahle, thomasahle, toregge, yurymalkov

ann-benchmarks's Issues

Website Title Isn't Very Descriptive

Just a suggestion, I think having a more descriptive <title> tag on the website's homepage would be helpful. Right now, it's simply ANN-Benchmarks. It might be better as "ANN-Benchmarks: Approximate Nearest-Neighbors Algorithm Benchmarks" or something like that.

Colors

@erikbern Your choice of colors makes the charts really hard to read... and I cannot fathom what it would be like for a color-blind person. It would be great if you could pick something with more contrast and possibly avoid shades of blue/green/orange. Thanks!

New run May 2018

Have been running it on a c5.4xlarge for about 36h. It's still on glove-100-angular and it's mostly done:

[screenshot]

Not sure what's up with NGT... I think it may have timed out, and we should increase the timeout a bit, especially since we're now running many searches in the same container and the CPU is limited to 1. I'll double it from 2h to 4h, I think.

Random datasets too easy

The random blobs created with sklearn seem to be far too easy to solve; see the screenshot below. I propose bringing back the old random datasets that use https://github.com/maumueller/random-inputs/. For me the easiest way would be to create them, put them on a website, and add a function to datasets.py that fetches and transforms them correctly.

[screenshot]

Run ONNG for nytimes

@maumueller
I resolved the problem (#93) with the results, which was caused by invalid zero vectors in the queries. Sorry for troubling you, but could you rebuild NGT and recreate the results for the nytimes dataset?
Thank you in advance.

Check README or datasets

The README says that the GloVe datasets have around 60k queries, while the files contain 10k of them.

Is there any plan to compare with Rayuela.jl?

https://github.com/una-dinosauria/Rayuela.jl
This is the code for the paper: Julieta Martinez, Shobhit Zakhmi, Holger H. Hoos, James J. Little. LSQ++: Lower running time and higher recall in multi-codebook quantization. ECCV 2018.

Rayuela.jl is a package that implements non-orthogonal multi-codebook quantization methods (MCQ). These methods are useful for fast search of high-dimensional (d=100s-1000s) dense vectors. Rayuela is written in Julia.

Which libraries support incremental item additions?

Thanks for putting this project together. Is there any way a table or small list could be included describing which libraries support incremental item additions? I imagine for many this is an important criterion in real-time or near-real-time applications where new content is continuously added and needs to be indexed for look-ups.

Noob question: OS you used and requirements

Hello @erikbern and team,

First of all, let me congratulate you for your fine job, it is amazing :D

I'm trying to run your benchmark for K = 1 (it seems like performance changes with K). I'm using Ubuntu 16.04 and Python 3.5.

While installing (sudo python3 install.py) I noticed it installs lots of dependencies from Bionic Beaver. Is Bionic Beaver (18.04) or any other OS a requirement?

Are numerical results available as well?

Right now I can only see plots in the repo, which are pretty useless for "external" comparison as such.

In particular, is HNSW still taking hours for preprocessing? :) There does not seem to be a way to check now.

The module 'ann_benchmarks.algorithms.faiss_gpu' could not be loaded. How can I solve it?

python run.py --docker-tag ann-benchmarks-faiss --dataset gist-960-euclidean --run-disabled --local --batch

ann_benchmarks.algorithms.faiss_gpu.FaissGPU([4096, 1]): warning: the module 'ann_benchmarks.algorithms.faiss_gpu' could not be loaded; skipping
ann_benchmarks.algorithms.nmslib.NmslibReuseIndex([u'euclidean', 'sw-graph', {'NN': 16}, False]): warning: the module 'ann_benchmarks.algorithms.nmslib' could not be loaded; skipping
ann_benchmarks.algorithms.panns.PANNS([u'euclidean', 10, 50]): warning: the module 'ann_benchmarks.algorithms.panns' could not be loaded; skipping
ann_benchmarks.algorithms.nmslib.NmslibReuseIndex([u'euclidean', 'vptree', {'tuneK': 10, 'desiredRecall': 0.4}, False]): warning: the module 'ann_benchmarks.algorithms.nmslib' could not be loaded; skipping
ann_benchmarks.algorithms.nearpy.NearPy([u'euclidean', 16, 20]): warning: the module 'ann_benchmarks.algorithms.nearpy' could not be loaded; skipping
ann_benchmarks.algorithms.mrpt.MRPT([u'euclidean', 25, 2]): warning: the module 'ann_benchmarks.algorithms.mrpt' could not be loaded; skipping
ann_benchmarks.algorithms.nmslib.NmslibReuseIndex([u'euclidean', 'vptree', {'tuneK': 10, 'desiredRecall': 0.9}, False]): warning: the module 'ann_benchmarks.algorithms.nmslib' could not be loaded; skipping
ann_benchmarks.algorithms.hnswlib.HnswLib([u'euclidean', {'efConstruction': 500, 'M': 48}]): warning: the module 'ann_benchmarks.algorithms.hnswlib' could not be loaded; skipping
ann_benchmarks.algorithms.pynndescent.PyNNDescent([u'euclidean', 10, 4, 20]): warning: the module 'ann_benchmarks.algorithms.pynndescent' could not be loaded; skipping
ann_benchmarks.algorithms.faiss_gpu.FaissGPU([1024, 10]): warning: the module 'ann_benchmarks.algorithms.faiss_gpu' could not be loaded; skipping
ann_benchmarks.algorithms.faiss.FaissIVF([u'euclidean', 8192]): warning: the module 'ann_benchmarks.algorithms.faiss' could not be loaded; skipping
ann_benchmarks.algorithms.hnswlib.HnswLib([u'euclidean', {'efConstruction': 500, 'M': 8}]): warning: the module 'ann_benchmarks.algorithms.hnswlib' could not be loaded; skipping
ann_benchmarks.algorithms.nmslib.NmslibReuseIndex([u'euclidean', 'vptree', {'tuneK': 10, 'desiredRecall': 0.2}, False]): warning: the module 'ann_benchmarks.algorithms.nmslib' could not be loaded; skipping
ann_benchmarks.algorithms.faiss.FaissIVF([u'euclidean', 512]): warning: the module 'ann_benchmarks.algorithms.faiss' could not be loaded; skipping
ann_benchmarks.algorithms.hnswlib.HnswLib([u'euclidean', {'efConstruction': 500, 'M': 24}]): warning: the module 'ann_benchmarks.algorithms.hnswlib' could not be loaded; skipping
ann_benchmarks.algorithms.faiss.FaissIVF([u'euclidean', 1024]): warning: the module 'ann_benchmarks.algorithms.faiss' could not be loaded; skipping
ann_benchmarks.algorithms.flann.FLANN([u'euclidean', 0.9]): warning: the module 'ann_benchmarks.algorithms.flann' could not be loaded; skipping

plot.py does not plot

I used python plot.py --dataset=glove-100-angular to plot the results after the simulations were done. The results are in results/glove-100-angular/10,
but it does not plot; instead it says Exception: Nothing to plot.
I traced into compute_metrics in utils.py and found that res is empty.
Can you help me out?

float32 vs float64

It appears that the benchmark switched from float32 to float64 quite a while ago, and this has somehow contributed to a substantial performance regression for KGraph, which is only optimized for float32 (although float64 also runs).

Below are my justifications for using float32 in the context of approximate K-NN search:

  • The search process is usually approximate, and the feature data are usually noisy by nature, so typically the precision of float64 is not necessary.
  • The necessity of an index to speed up search presumes that data is big. So a factor of 2x in the cost just to store the data in memory is significant.
  • For some algorithms/implementations, using float32 means more data can fit into cache and the code runs much faster.

I believe the performance behavior of some libraries will be very different for float32 and float64 (though I believe the relative ranking of different libraries should roughly stay the same). If only one can be chosen, I feel it should be float32, so that the benchmark represents the typical use case.

Is there a particular reason to keep float64, or is it better supported by some of the libraries in the benchmark?
(There's another reason why KGraph is slow which I'm still investigating.)
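
As a quick illustration of the memory argument above (not part of the original issue):

    # Quick illustration of the 2x memory cost of float64 vs float32.
    import numpy as np

    n, d = 1_000_000, 128  # roughly SIFT-sized
    for dtype in (np.float32, np.float64):
        mib = np.dtype(dtype).itemsize * n * d / 2**20
        print(f"{np.dtype(dtype).name}: {mib:.0f} MiB")
    # float32: 488 MiB, float64: 977 MiB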

For help: code raises an exception

When running install.py, this line of code raises an exception:
subprocess.check_call('docker build --rm -t ann-benchmarks -f install/Dockerfile .', shell=True)
The exception is:
subprocess.CalledProcessError: Command 'docker build --rm -t ann-benchmarks -f install/Dockerfile .' returned non-zero exit status 1.
I hope you can help me.

disable nearpy, panns, dolphinnpy

They seem pretty bad. I don't mind keeping the code, but it probably makes sense to disable them by default. They add a bunch of noise to the plots and also take a disproportionate amount of time during benchmarks.

index construction time plotting

I plotted index construction time using the data saved in glove.txt and see a spread of almost 2 orders of magnitude (hnsw (nmslib) index construction might have been parallelized, according to the code). I'm most interested in the performance of sw-graph vs hnsw. It's very hard to say which one is better, because one might be able to improve the search performance of sw-graph by building more aggressive indices. I collaborated with Princeton folks on developing a hierarchical version of the algorithm behind KGraph, and the results were negative: initially there was a noticeable speedup at the same accuracy, but after carefully controlling the index construction setup, the speedup was gone.

I also suggest that we establish a standard hardware configuration to run the benchmark, preferably some type of AWS instance which can be easily reproduced. I think something close to your current setup should be good.

[attached plots: build, throughput]

Zero vectors in nytimes-256-angular

I found that the nytimes-256-angular and nytimes-16-angular datasets have hundreds of zero-valued vectors.
The problem is that the cosine distance to such vectors is undefined, making the search an ill-defined task.
Probably such vectors should be removed; another option is using unnormalized inner-product distance instead of cosine.
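
A possible clean-up along the lines suggested above (a sketch, not the project's actual preprocessing):

    # Sketch: drop zero-valued vectors before building an angular dataset,
    # since the cosine distance to them is undefined.
    import numpy as np

    def drop_zero_vectors(X, eps=1e-12):
        X = np.asarray(X)
        return X[np.linalg.norm(X, axis=1) > eps]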

Remove FALCONN from benchmarks

Please remove FALCONN from the benchmarks. It's not properly tuned for the new settings (100 neighbors instead of 10, etc.), and the results are confusing.

We might later add it back (the new, faster version), but meanwhile I don't want the results to be here.

benchmark running for too long

I've been running the benchmark on a machine that shouldn't be much slower than a c4.2xlarge (Core i7-2600K @ 3.4GHz). It's been 5 days and I only see 195 lines in sift.txt (the plot looks good, though). There are 321 lines in the standard glove.txt and 82 lines in the standard sift.txt, which I think is incomplete. So I estimate it will take 15-20 days for me to finish running everything. Is it supposed to take this long?

I'm attaching the SIFT plot I produced with the data I have so far, which is consistent with the GloVe plot I see on the project page. (KGraph bug fixed.)

[attached SIFT plot]

Nearpy distance is Cosine by default - Euclidean not applied

Hi,

I was checking some libraries and realized that in the case of NearPy, the Euclidean distance is not applied in the benchmark. As can be seen in the NearPy docs, the default is cosine, and the Euclidean distance is never actually applied. I will PR changes.
Thanks for the bench, it's great!

What if I only use the MNIST dataset?

Hi,
since the full project can take a long time to run, I just want to try a small dataset, and maybe MNIST would be okay.
How should I modify the code to do that?
Thanks!

About SPTAG test results

I tested it with the GIST data set. The query matrix has size 1000 x 960 and I return 100 neighbors, using the official default parameters. The index build time was 20562.084 s, the query time was 35.082 s, recall was 0.92669, and the time to add 100 vectors was 22.281 s. This feels very long; am I using it correctly? Its index construction and query times are longer than NSG's. Can you give reference results for the SPTAG tests?
My machine:
Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz
Ubuntu 18.04
Python 3

Request: Provide nearest neighbors for training data

For datasets like SIFT 1M, it would be great if nearest neighbors were provided for the training data as well. From the perspective of computational cost, the test data is very small and users can compute its metrics themselves. But if you could provide the NNs for the 1M training points, that would help those who want to use the larger datasets.

Batch queries and single-threaded mode

There are currently two relics in the code which are no longer accessible, since the arguments for them have been removed. What shall we do with them?

  1. Running an algorithm with runner.py has a force_single option which avoids the thread-pooling mechanism altogether, independent of an algorithm's use_threads method. I find it nice for checking whether the ThreadPool decreases performance.

  2. Running an algorithm also has support for batch mode, and each algorithm already implements batch_query. However, only faiss_gpu.py currently implements an actual batch mode. The HNSW and FAISS implementations also have good batch-query implementations, and they could easily be included in the benchmark in a separate section.

Split up building and query parameters

Currently, we spend a lot of time basically building the same data structure over and over again. What we could do instead is split up build and query parameters. For example, Annoy is currently configured by

    annoy:
      constructor: Annoy
      base-args: ["@metric"]
      run-groups:
        annoy:
          args: [[100, 200, 400], [100, 200, 400, 1000, 2000, 4000, 10000,
                20000, 40000, 100000, 200000, 400000]] # number of trees x number of points to gather

It would save a lot of time, if one were to separate query and build parameters in the following way:

    annoy:
      constructor: Annoy
      base-args: ["@metric"]
      run-groups:
        annoy:
          build-args: [[100, 200, 400]] # number of trees 
          query-args: [[100, 200, 400, 1000, 2000, 4000, 10000,
                20000, 40000, 100000, 200000, 400000]] # number of points to gather during query

The idea would now be to build the data structure for each build-args group, and query it for all (combinations of) query-args. This should be rather straight-forward to do.
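
In rough Python, the proposed scheme amounts to something like the following (illustrative only, not the actual runner code):

    # Illustrative only: build each index once per build-args combination and
    # sweep every query-args combination against it, instead of rebuilding.
    import itertools

    def run_group(algo_cls, metric, train, queries, build_args, query_args, k=10):
        for build in itertools.product(*build_args):
            algo = algo_cls(metric, *build)       # e.g. Annoy(metric, n_trees)
            algo.fit(train)                       # built once per build combination
            for query_params in itertools.product(*query_args):
                algo.set_query_arguments(*query_params)  # e.g. number of points to gather
                for q in queries:
                    algo.query(q, k)              # timing / recall bookkeeping omitted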

Should be labeled as "enhancement".

Licence?

I would much appreciate the addition of a licence file.

10X performance difference between multi-thread and single thread tests

Good day!

I have measured the difference in performance when switching the use_threads parameter from True to False on hnsw (nmslib). As it turns out, the difference is really huge, ~10X on 1M SIFT at low recall (see the enclosed plots for a 6-core Intel CPU), and it is likely caused by the Python-side overhead of the tester scripts.

Multi-thread (default):
sift_mt
Single-thread (use_threads=False):
sift_st

This makes the current results questionable, especially when compared to FALCONN (which seems to be the only algorithm that takes advantage of not using multi-threaded tests). Multi-threaded tests also significantly change the shape of the recall-QPS curve (flat vs. rapidly decreasing QPS at low recall).

I suggest using the single-threaded version for all algorithms. Otherwise, in the regime of relatively high QPS (roughly >2k queries per second), we just measure the performance of the tester script.
