erikbern / ann-benchmarks
Benchmarks of approximate nearest neighbor libraries in Python
Home Page: http://ann-benchmarks.com
License: MIT License
Have been running it on a c5.4xlarge for about 36h. It's still on glove-100-angular and it's mostly done:
Not sure what's up with NGT... I think maybe it timed out and we should increase the timeout a bit, especially since we're now running many searches in the same container and CPU is limited to 1. I'll double it from 2h to 4h, I think.
This one is promising:
https://github.com/kakao/n2
We should resurrect it
Randomly saw https://github.com/agnusmaximus/Word2Bits on Hacker News today – maybe we can use that dataset
python run.py --docker-tag ann-benchmarks-faiss --dataset gist-960-euclidean --run-disabled --local --batch
ann_benchmarks.algorithms.faiss_gpu.FaissGPU([4096, 1]): warning: the module 'ann_benchmarks.algorithms.faiss_gpu' could not be loaded; skipping
ann_benchmarks.algorithms.nmslib.NmslibReuseIndex([u'euclidean', 'sw-graph', {'NN': 16}, False]): warning: the module 'ann_benchmarks.algorithms.nmslib' could not be loaded; skipping
ann_benchmarks.algorithms.panns.PANNS([u'euclidean', 10, 50]): warning: the module 'ann_benchmarks.algorithms.panns' could not be loaded; skipping
ann_benchmarks.algorithms.nmslib.NmslibReuseIndex([u'euclidean', 'vptree', {'tuneK': 10, 'desiredRecall': 0.4}, False]): warning: the module 'ann_benchmarks.algorithms.nmslib' could not be loaded; skipping
ann_benchmarks.algorithms.nearpy.NearPy([u'euclidean', 16, 20]): warning: the module 'ann_benchmarks.algorithms.nearpy' could not be loaded; skipping
ann_benchmarks.algorithms.mrpt.MRPT([u'euclidean', 25, 2]): warning: the module 'ann_benchmarks.algorithms.mrpt' could not be loaded; skipping
ann_benchmarks.algorithms.nmslib.NmslibReuseIndex([u'euclidean', 'vptree', {'tuneK': 10, 'desiredRecall': 0.9}, False]): warning: the module 'ann_benchmarks.algorithms.nmslib' could not be loaded; skipping
ann_benchmarks.algorithms.hnswlib.HnswLib([u'euclidean', {'efConstruction': 500, 'M': 48}]): warning: the module 'ann_benchmarks.algorithms.hnswlib' could not be loaded; skipping
ann_benchmarks.algorithms.pynndescent.PyNNDescent([u'euclidean', 10, 4, 20]): warning: the module 'ann_benchmarks.algorithms.pynndescent' could not be loaded; skipping
ann_benchmarks.algorithms.faiss_gpu.FaissGPU([1024, 10]): warning: the module 'ann_benchmarks.algorithms.faiss_gpu' could not be loaded; skipping
ann_benchmarks.algorithms.faiss.FaissIVF([u'euclidean', 8192]): warning: the module 'ann_benchmarks.algorithms.faiss' could not be loaded; skipping
ann_benchmarks.algorithms.hnswlib.HnswLib([u'euclidean', {'efConstruction': 500, 'M': 8}]): warning: the module 'ann_benchmarks.algorithms.hnswlib' could not be loaded; skipping
ann_benchmarks.algorithms.nmslib.NmslibReuseIndex([u'euclidean', 'vptree', {'tuneK': 10, 'desiredRecall': 0.2}, False]): warning: the module 'ann_benchmarks.algorithms.nmslib' could not be loaded; skipping
ann_benchmarks.algorithms.faiss.FaissIVF([u'euclidean', 512]): warning: the module 'ann_benchmarks.algorithms.faiss' could not be loaded; skipping
ann_benchmarks.algorithms.hnswlib.HnswLib([u'euclidean', {'efConstruction': 500, 'M': 24}]): warning: the module 'ann_benchmarks.algorithms.hnswlib' could not be loaded; skipping
ann_benchmarks.algorithms.faiss.FaissIVF([u'euclidean', 1024]): warning: the module 'ann_benchmarks.algorithms.faiss' could not be loaded; skipping
ann_benchmarks.algorithms.flann.FLANN([u'euclidean', 0.9]): warning: the module 'ann_benchmarks.algorithms.flann' could not be loaded; skipping
C++ code here: https://github.com/ZJULearning/nsg
I think wrapping it with Cython should be easy.
It appears that the benchmark switched from float32 to float64 quite a while ago, and this somehow contributed to a substantial performance regression in KGraph, which is only optimized for float32 (although float64 also runs).
Below are my justifications for using float32 in the context of approximate k-NN search:
I believe the performance behavior of some libraries will be very different for float32 and float64 (though I believe the relative ranking of different libraries should roughly stay the same). If only one can be chosen, I feel it should be float32, so that the benchmark represents the typical use case.
Is there a particular reason to keep float64, or is it better supported by some libraries in the benchmark?
(There's another reason why KGraph is slow which I'm still investigating.)
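As a point of reference, here is a minimal sketch of forcing a loaded dataset down to float32 before indexing (numpy-based; the array X and its source are illustrative):

import numpy

X = numpy.random.rand(10000, 128)  # stand-in for a loaded dataset; rand() yields float64
X = numpy.ascontiguousarray(X, dtype=numpy.float32)  # single precision, C-contiguous for C++ wrappers
assert X.dtype == numpy.float32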
Is there a specific reason why Falconn is not included?
saw this randomly: https://github.com/lmcinnes/pynndescent
I used python plot.py --dataset=glove-100-angular to plot the results after the simulations were done. The results are in results/glove-100-angular/10.
But it does not plot. Instead, it says Exception: Nothing to plot.
I traced it to compute_metrics in utils.py and found that res is empty.
Can you help me out?
This looks interesting: https://github.com/microsoft/SPTAG
I've been running the benchmark on a machine that shouldn't be much slower than a c4.2xlarge (Core i7-2600K @ 3.4 GHz). It's been 5 days and I only see 195 lines in sift.txt (the plot looks good, though). There are 321 lines in the standard glove.txt, and 82 lines in the standard sift.txt, which I think is incomplete. So I estimate it will take 15-20 days for me to finish running everything. Is it supposed to take this long?
I'm attaching the SIFT plot I produced with the data I have so far, which is consistent with the GloVe plot I see on the project page. (KGraph bug fixed.)
The README says that the GloVe datasets have around 60k queries, while the files contain only 10k.
When running install.py, this line of code raises an exception:
subprocess.check_call('docker build --rm -t ann-benchmarks -f install/Dockerfile .', shell=True)
The exception is:
subprocess.CalledProcessError: Command 'docker build --rm -t ann-benchmarks -f install/Dockerfile .' returned non-zero exit status 1.
I hope you can help me.
There are currently two relics in the code which are not accessible anymore, since the arguments for them have been removed. What shall we do with them?
First, running an algorithm with runner.py has a force_single option which avoids the thread-pooling mechanism altogether, independent of an algorithm's use_threads method. I find it nice for checking whether the ThreadPool decreases performance.
Second, running an algorithm also has support for batch mode, and each algorithm already implements batch_query. However, only faiss_gpu.py currently implements an actual batch mode. The HNSW and faiss implementations also have good batch query implementations, and they could easily be included in the benchmark in a separate section.
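As a rough illustration, here is a minimal sketch of what an actual batch mode could look like for a faiss-backed wrapper, following the batch_query pattern mentioned above (the class and its result handling are illustrative, not the benchmark's real wrapper):

import numpy

class ExampleFaissWrapper:
    def __init__(self, index):
        self.index = index  # a prebuilt faiss index

    def query(self, v, n):
        # Single-query path: one Python-level call per vector.
        _, ids = self.index.search(numpy.asarray([v], dtype=numpy.float32), n)
        return ids[0]

    def batch_query(self, X, n):
        # Batch path: one call into the library for the whole query matrix,
        # which is where the batch speedup comes from.
        _, self.res = self.index.search(numpy.asarray(X, dtype=numpy.float32), n)

    def get_batch_results(self):
        return self.res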
@erikbern do you have any thoughts or observations about kepler-mapper?
https://github.com/MLWave/kepler-mapper
https://mlwave.github.io/kepler-mapper/examples.html
SPTAG: https://github.com/Microsoft/SPTAG
Thank you
Is there a known metric that combines both build time and query time, for those who wish to do approximate k-nearest-neighbor search on a different dataset each time?
I found that nytimes-256-angular and nytimes-16-angular datasets have hundreds of zero-valued vectors.
The problem is that the cosine distance to such vectors is undefined, making the search an ill-defined task.
Probably, such vectors should be removed. Another option is using unnormalized inner product distance instead of cosine.
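A minimal sketch of the first option, dropping zero vectors before the dataset is built (numpy-based; the source of X is hypothetical):

import numpy

X = numpy.load('nytimes-256.npy')  # hypothetical raw vectors
norms = numpy.linalg.norm(X, axis=1)
X = X[norms > 0]  # remove vectors for which cosine distance is undefined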
The random blobs created from sklearn seem to be far too easy to solve, see screenshot below. I propose to bring back the old random datasets that use https://github.com/maumueller/random-inputs/. For me, the easiest way would be to create them, put them on a website, and add a function to datasets.py that fetches them and transforms them correctly.
Thanks for putting this project together. Is there any way a table or small list could be included to describe which libraries support incremental item additions? I imagine for many this is an important criterion in real-time or near-real-time applications where new content is continuously added and needs to be indexed for look-ups.
Please, remove FALCONN from the benchmarks. It's not properly tuned for the new settings (100 neighbors instead of 10 etc), and the results are confusing.
We might later add it back (the new, faster version), but meanwhile I don't want the results to be here.
Maybe?
Paper is here: https://arxiv.org/abs/1807.05614
Right now, I can only see plots in the repo, which are pretty useless for "external" comparison as such.
In particular, is HNSW still taking hours for preprocessing? :) There does not seem to be a way to check now.
Currently, we spend a lot of time basically building the same data structure over and over again. What we could do instead, is split up build and query parameters. For example, Annoy is currently configured by
annoy:
  constructor: Annoy
  base-args: ["@metric"]
  run-groups:
    annoy:
      args: [[100, 200, 400], [100, 200, 400, 1000, 2000, 4000, 10000,
              20000, 40000, 100000, 200000, 400000]]  # number of trees x number of points to gather
It would save a lot of time if one were to separate query and build parameters in the following way:
annoy:
  constructor: Annoy
  base-args: ["@metric"]
  run-groups:
    annoy:
      build-args: [[100, 200, 400]]  # number of trees
      query-args: [[100, 200, 400, 1000, 2000, 4000, 10000,
                    20000, 40000, 100000, 200000, 400000]]  # number of points to gather during query
The idea would now be to build the data structure once for each build-args group, and query it with all (combinations of) query-args. This should be rather straightforward to do.
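A minimal sketch of the proposed control flow (build_index and run_queries are illustrative placeholders, not functions in the benchmark):

build_args = [100, 200, 400]                    # number of trees
query_args = [100, 200, 400, 1000, 2000, 4000]  # points to gather (truncated)

def build_index(n_trees):
    ...  # placeholder: construct the index once for this build group

def run_queries(index, n_points):
    ...  # placeholder: run the full query set at this query setting

for b in build_args:
    index = build_index(b)     # built once per build-args group
    for q in query_args:
        run_queries(index, q)  # the same index is reused for every query-args value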
Should be labeled as "enhancement".
Would much appreciate the addition of a license file.
It'll be very interesting to see comparisons between suitable methods for higher-dimensional data (e.g., several thousand dimensions or even more).
Thank you.
distances = f.create_dataset('distances', (len(test), count), dtype='f')
Why is count=100? Does "count" mean "the number of near neighbours to search for"?
Here is a (not directly comparable) benchmark from the faiss library:
https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors
https://github.com/milvus-io/milvus
"Milvus -- the world's fastest vector search engine." would love to verify that claim :)
I tested it with the GIST dataset: the query set has size 1000×960, returning 100 neighbors, using the official default parameters. Building the index took 20562.084 s, the query time was 35.082 s, the recall was 0.92669, and adding 100 vectors took 22.281 s. This feels very long; am I using it correctly? Its index construction and query times are longer than NSG's. Can you give reference results for SPTAG?
my machine:
Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz
Ubuntu 18.04
python3
For datasets like SIFT 1M, it would be great if nearest neighbors were provided for the training data as well. From the perspective of computational cost, the test data is very small, and users can compute its metrics themselves. But if you could provide the NNs for the 1M training points, that would help those who want to use the larger datasets.
@maumueller
I resolved the problem (#93) of the results due to invalid zero vectors in the queries. Sorry for troubling you, but could you rebuild NGT and recreate results for the nytimes dataset?
Thank you in advance.
I plotted index construction time using data saved in glove.txt and see a spread of almost 2 orders of magnitude (hnsw(nmslib) index construction might have been parallelized according to code). I'm most interested in the performance of sw-graph vs hnsw. It's very hard to say which one is better because one might be able to improve the search performance of sw-graph by building more aggressive indices. I collaborated with Princeton folks on developing a hierarchical version of the algorithm behind KGraph and the results were negative. Initially there was noticeable speedup at the same accuracy, but after carefully controlling index construction setup, the speedup was gone.
I also suggest that we establish a standard hardware configuration for running the benchmark, preferably some type of AWS instance which can be easily reproduced. I think something close to your current setup should be good.
Good day!
I have measured the difference in performance when switching the use_threads parameter from True to False on hnsw (nmslib). As it turns out, the difference is really huge, ~10X on 1M SIFT at low recalls (see the enclosed plots for a 6-core Intel CPU), and it is likely caused by the Python-side overhead of the tester scripts.
Multi-thread (default):
Single-thread (use_threads=False):
This makes the current results questionable, especially when compared to FALCONN (which seems to be the only algorithm that takes advantage of not using multi-threaded tests). Multi-threaded tests also significantly change the shape of the recall-QPS curve (flat vs. rapidly decreasing QPS at low recall).
I suggest using the single-threaded version for all algorithms. Otherwise, at relatively high QPS (roughly above 2k queries per second), we are just measuring the performance of the tester script.
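To make the overhead concrete, here is a self-contained sketch comparing a plain loop against a ThreadPool map over a deliberately fast, stand-in query function; when individual queries are this cheap, the per-call pool overhead dominates the measurement:

import time
from multiprocessing.pool import ThreadPool

def query(v):
    return v  # stand-in for a sub-millisecond ANN query

X = list(range(100000))

t0 = time.time()
single = [query(v) for v in X]  # single-threaded path
t1 = time.time()
pool = ThreadPool()
pooled = pool.map(query, X)     # thread-pooled path
pool.close()
t2 = time.time()
print('single: %.3fs, pooled: %.3fs' % (t1 - t0, t2 - t1))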
They seem pretty bad. I don't mind keeping the code, but it probably makes sense to disable them by default. They add a bunch of noise to the plots and also take a disproportionate amount of time during benchmarks.
Just a suggestion: I think having a more descriptive <title> tag on the website's homepage would be helpful. Right now, it's simply "ANN-Benchmarks". It might be better as "ANN-Benchmarks: Approximate Nearest-Neighbors Algorithm Benchmarks" or something like that.
Hello @erikbern and team,
First of all, let me congratulate you on your fine job; it is amazing :D
I'm trying to run your benchmark for K = 1 (performance seems to change). I'm using Ubuntu 16.04 and Python 3.5.
While installing (sudo python3 install.py), I noticed it installs lots of dependencies from Bionic Beaver. Is Bionic Beaver (18.04) or any other OS version a requirement?
@erikbern Your choice of colors makes the charts really hard to read... and I cannot fathom what it would be like for a color-blind person. It would be great if you could pick something with more contrast and possibly avoid shades of blue/green/orange. Thanks!
Was it supposed to be c4.2xlarge (not 2.4)?
Hi,
I was checking some libraries and realized that, in the case of NearPy, the euclidean distance is not applied in the benchmark. As can be seen in NearPy's docs, the default is cosine, and the euclidean distance is never passed in. I will PR changes.
thanks for the bench, it's great
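For reference, a minimal sketch of passing the euclidean distance explicitly when constructing NearPy's Engine, based on NearPy's documented API (the hash configuration here is just illustrative):

from nearpy import Engine
from nearpy.distances import EuclideanDistance
from nearpy.hashes import RandomBinaryProjections

dim = 100
engine = Engine(
    dim,
    lshashes=[RandomBinaryProjections('rbp', 10)],
    distance=EuclideanDistance(),  # explicit, instead of the cosine default
)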
Is there any plan to compare with Rayuela.jl?
https://github.com/una-dinosauria/Rayuela.jl
This is the code for the paper
Julieta Martinez, Shobhit Zakhmi, Holger H. Hoos, James J. Little. LSQ++: Lower running time and higher recall in multi-codebook quantization. In ECCV 2018. [CVF].
Rayuela.jl is a package that implements non-orthogonal multi-codebook quantization methods (MCQ). These methods are useful for fast search of high-dimensional (d=100s-1000s) dense vectors. Rayuela is written in Julia.
There was a hacker news thread linking to the itu.dk page: https://news.ycombinator.com/item?id=21374400
Probably makes sense to remove that page and redirect to ann-benchmarks.com or what do you think @maumueller @ale-f ?
Reminds me it's probably time to run another benchmark soon. There have been some additions lately.
randomly stumbled upon this one: https://github.com/hdidx/hdidx
interface looks simple, should be easy to add
Hi,
Since running the whole project can take a long time, I just want to try a small dataset first; maybe MNIST would be okay.
How should I modify the code to do that?
Thanks!
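Following the invocation style shown earlier on this page, something like the following should restrict a run to a single small dataset (the dataset name mnist-784-euclidean is assumed from the project's naming convention):

python run.py --dataset mnist-784-euclidean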
I stumbled over https://github.com/dbaranchuk/ivf-hnsw
Has anyone tried to benchmark their implementation?
Is there a conceptual difference between ivf-hnsw and the hnsw implementations in hnswlib or nmslib?