rust-cv / hnsw Goto Github PK

HNSW ANN from the paper "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"

License: MIT License

Rust 97.06% Python 2.94%

hnsw's Introduction

hnsw's People

Contributors

Stargazers

Watchers

Forkers

icodein igxactly fujinochan jimexist jackbackes terminusdb-labs yokozoona isgasho alexmikhalev jackkrasmus-vorrath emmanuel-keller stencillogic

hnsw's Issues

Add ability to remove data from HNSW

This might be difficult without a rewrite, but we need to be able to remove data from the HNSW.

Create linear-search-optimized HNSW

Some optimizations can be performed to massively speed up HNSW by utilizing linear search on clusters and in levels of NSW that are small enough that linear search is faster. This will also result in lower memory consumption. In the current architecture, these optimizations require a linear-time operation to:

Order every level so that the indices are the same
- This allows performing incremental linear search on every level until we reach a level where the number of features is too large.
Cluster features below a certain level to their nearest neighbor in a higher level to permit linear search in that cluster
- This allows performing linear search at the bottom of the graph in clusters.

This will make it so that the "small worlds" in the NSW are clusters that can be searched linearly, which is incredibly fast on modern computers.

It is not possible to efficiently maintain this during operation because every insertion would require a linear time operation to invalidate all directed edges connected to a node that must be moved (although they can be moved in log(N) time). We could make an alternate version of HNSW that does allow the first optimization in log(N) by utilizing the heap to store neighbors pointing into a given node which wouldn't affect the performance of normal searching, but would slow insertion some (although the overall speed improvement may make insertion faster).

For the second optimization, it isn't clear exactly what the semantics of insertion would be if you were to maintain clusters at the bottom of the HNSW for speed, but it is clear how you would convert an existing HNSW into a clustered version. If we could do it, allowing insertion into the clustered version would be game-changing, but it would require knowing how to split up a cluster once it becomes too big into a NSW. One idea would be to store clusters and nodes at each level of the graph and then once a cluster (which is always a leaf of the graph) needs to be broken up, one random feature is chosen from the cluster to become a new cluster center and all features are separated into two clusters based on which of the two centers they are closest to. The performance characteristics of this approach are unknown and must be benchmarked.

We should attempt to create static versions of HNSW that are built from a dynamic version in linear time complexity and also a dynamic version of HNSW that contains both of the above optimizations in their dynamic form. The two resulting data structures should be benchmarked to see if there is a benefit to the static version, otherwise the dynamic version should be preferred since the memory consumption difference should be negligable and so performance is the main factor.

Allow M_MAX and M_MAX0 to be paramaterizable

For the time being, this will need to be done with generic-array. Ultimately, const generics will be used to do this, but it may be a few months before that is even possible on nightly.

Make it easier to use arrays with HNSW

In rust-ml/linfa#15, it was decided that a new interface needed to be developed to make it easier to add arrays to HNSW. Due to a lack of abstractions surrounding numbers (mostly due to const generics), we will need to support specific array lengths. Each array length supported should be trivially convertible into an aligned SIMD type for usage in HNSW.

improve readme

The readme currently features a lot of benchmark pictures but no code example.

Adding a separate benchmark file (linked in the readme) and scaling down to one picture plus one or two typical use case examples in the readme would make the library easier to use.

Use usize everywhere instead of u32

Right now there is some usize and some u32 usage to refer to internal nodes. Specifically, Neighbor from space uses usize. We should only use usize.

Create non-discrete HNSW

Right now this crate only has DiscreteHNSW, which is incredibly useful for some specific tasks, particularly in performance computer vision. However, HNSW has nothing to do with discrete spaces at all, and the DiscreteHNSW is just a bunch of optimizations specific to discrete spaces. A general HNSW should be created that can operate on any two feature vectors for which distance can be computed where the computed distance is an f32. A trait should be created for this distance type, and it should also be implemented on things that support the DiscreteDistance trait.

Benchmark different filters and hash functions to use for previously visited nodes

Currently a HashSet<u32> is being used to check if a node is in the set of seen nodes. The u32 in the set is the index in the zero layer which is unique for all elements. The u32 is very small and we need to be able to perform a hash on it very quickly. Additionally, the data structure for this filter should likely be changed for performance and memory consumption reasons.

Since HNSW performs an ANN search, we can use things like bloom filters which produce some false positives, so long as there are no false negatives (could introduce infinite cycles in the graph).

I would like to investigate and benchmark the following filters and hash algorithms (more will be added to this list as needed):

Filters:

Bloom filter
Cuckoo filter

Hash functions:

SipHash
Anything in this crate
aHash or other AES-based keyed-hashes

All of these should be benchmarked in every combination with each other specifically for the purposes of the HNSW nearest-neighbor search. Whatever gives the best performance with negligable impact on recall will be chosen.

Add serde feature to hnsw

We should allow serializing and deserializing via serde. This should be gated behind a feature.

ndarray example?

I wanted to implement distance as ndarray's dot but I'm finding the type errors inscrutable so far. Are there any examples anywhere of using this crate and ndarray together?

I implemented:

struct Embedding;
impl Metric<ArrayView1<'_, f32>> for Embedding {
    type Unit = u64;

    fn distance(&self, a: &ArrayView1<f32>, b: &ArrayView1<f32>) -> Self::Unit {
        a.dot(b) as u64
    }
}

But when I try to insert like so:

    let mut searcher: Searcher<_> = Searcher::default();
    let mut index = Hnsw::new(Embedding);
    for i in 0..n {
        let row: ArrayView1<f32> = v.row(i);
        index.insert(row, &mut searcher);
    }

I hit:

error[E0599]: the method `insert` exists for struct `Hnsw<Embedding, _, _, {_: usize}, {_: usize}>`, but its trait bounds were not satisfied
  --> src/hnsw.rs:64:15
   |
13 | struct Embedding;
   | ----------------- doesn't satisfy `Embedding: space::Metric<_>`
...
64 |         index.insert(row, &mut searcher);
   |               ^^^^^^ method cannot be called on `Hnsw<Embedding, _, _, {_: usize}, {_: usize}>` due to unsatisfied trait bounds
   |
   = note: the following trait bounds were not satisfied:
           `Embedding: space::Metric<_>`
note: the following trait must be implemented
  --> /home/alex/.cargo/registry/src/github.com-1ecc6299db9ec823/space-0.17.0/src/lib.rs:50:1
   |
50 | / pub trait Metric<P> {
51 | |     type Unit: Unsigned + Ord + Copy;
52 | |
53 | |     fn distance(&self, a: &P, b: &P) -> Self::Unit;
54 | | }
   | |_^

It's not jumping out to me what's missing given I see no bounds on P and the bounds of the Self::Unit are met.

parallel version of hnsw

Hello rust-cv hnsw team,

It seems the build and search process is not parallelized. Despite the fact that the reference c++ complementation in the original paper is also not parallelized, they do provide task level parallelism using python (which is more easy than C++). In Rust, rayon or cross-beam can be use to parallelize any the build and search of hnsw structure. Any plans in the future to also add parallelized version? As the number of database size grows even larger (like 10^6 or above), fast search using all processors is necessary.

Thanks,

Jianshu

search_layer and search_zero_layer have code duplication

I was unable to easily solve the code duplication between these two methods. It is possible to do so with generic associated types (GATs), but even when enabling the nightly feature, I was unable to get it to work, as the feature is fairly unstable currently. Once this feature or some other method of implementing this with zero-cost abstractions becomes possible, I will do so. Until then, these two methods will violate the DRY principle.

Integrate Graph Reordering into HNSW

Coleman et al. showed that reordering the nodes in every layer such that the neighbors of each node are laid out closer in memory improves query time performance by about 40%. The idea is that reordering provides a cache-efficient search mechanism that reduces the search overhead due to random accesses in HNSW.

They also showed that using the hierarchy is not strictly necessary in certain settings. They replaced the hierarchy with "a process where we randomly sample 50 nodes and use the closest option as the initialization." They observed no statistically significant difference between the hierarchical search procedure and this random sampling process in terms of recall or query time over 10k items.

I can work on integrating these features.

Persistence of HNSW

Thanks for the lib!

Do you have any plans or ideas on implementing save and load functionality in this library?

HNSW for biology

Dear HNSW author,

This is Jianshu, a bioinformatics phd student at Georgia Tech. I am writing to you to ask your interest to form a new collaboration. Specifically, applying HNSW into genome classification problems, that is to find the closest genome in a big genome database to see the close related ones in the database so that enivronmental microbiologist can tell what taxonomy the query genome is. This will have a very big impact on the field and will definitely have a lot of citations. I want to completely rely on rust for this project without using any other language considering the advantages compared to C++ and python. I am not an expert in rust but have been using it for 2 years. The biology needed and taxonomy related information will be my strong part. I also know all the classification software in this field and have benign using/modifying them for my master and half of my Ph.D I am confident that HNSW in rust will greatly improved the speed of genome classifiers. Hopefully we can come up with a paper in the end. My email is [email protected]

Let me know if you are interest and if something I mentioned above is not clear.

Many thanks,

Jianshu

Add Knn impl for a wrapper on HNSW

In space 0.13.0, the Knn trait was added. We want to implement this trait for HNSW, but it needs a wrapper that provides the ef value to the search method. This should be simple to do.

Turn HNSW into Map and Set variants

HNSW is more difficult to use than it should be. HNSW should be changed to have an HNSWMap and HNSWSet variant. The current behavior should be preserved in that features can be accessed by index and that the index should be returned on adding something to either type of HNSW. However, the HNSW should also additionally store an item in it. The set variant will simply set that item to (), inuring no overhead.

Use actual test datasets

We should probably attempt to get the datasets for everything here. These datasets are small, so we will also need some larger data sets. These could be generated just using OpenCV using ORB and AKAZE or using SIFT with binarization.

Create examples that generate recall graphs

We need examples that generate recall graphs for different parameters of HNSW so users can choose the appropriate parameters for their use case.

feature() doc typo

In the doc it says under feature(): The item must be retrieved from HNSW::search_layer.

This should probably just be removed from the doc, or it should be corrected to say "must not be" rather than "must be".

Switch to linear search below a threshold

Linear search is generally faster than HNSW on a given computer below a certain threshold value. We should create a configurable size at which the HNSW transitions from linear search internally to its own search. The default for that transition point should be around where benchmarks say it crosses over. Benchmarks already exist that show the user the tradeoff point.

Panic when searching in an HNSW

The library panics with the following message when trying to search in it. The exact issue is located in the layer_search method due to wrong usage of the copy_from_slice function.

[...] panicked at 'source slice length (80) does not match destination slice length (84)', /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hnsw-0.11.0/src/hnsw/hnsw_const.rs:308:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Dependency on packed_simd

Hnsw depends on packed_simd, which is no longer in development.

On my mac, this is hnsw is showing the same issues seen with pack_simd in other libraries:
rust-lang/cargo#9139

It would be great to move to https://github.com/rust-lang/stdsimd maybe

packed_simd v0.3.3
└── hnsw v0.3.1

Example in README doesn't work

HNSW was renamed to Hnsw in version 0.7 so the readme is out of date in that respect. Also I went to the docs looking for recommended values for M and M0 as these values now need to be set, but don't see any relevant documentation for them. I guess I should look to the original paper, although it would be better if it was mentioned within the crate docs

Get illustrations

The authors of the paper created some helpful illustrations. This issue tracks the progress of getting permission from the authors to use those illustrations in our README.md.

Main thread panicks in examples

I was trying out one of the examples in the examples folder when the execution failed due to:
thread 'main' panicked at 'source slice length (1) does not match destination slice length (2)'

The error refers to line 308 in src/hnsw/hnsw_const.rs. This happens since the minimum array length you're getting from the K-NN search is less than the default one set on the Opt object to hold nearest neighbors. Therefore copy_from_slice() panicks.

I managed to solve this by decreasing by 1 the default number (initially set to 2) of nearest neighbors in Opt structure since I think there will always be at least 1 node in the graph.

I believe there's a more sofisticated way to fix this, but I haven't yet found it 🙃

This is the full execution log

Generating 100 random bitstrings...
Done.
Generating 100 independent random query strings...
Done.
Computing the correct nearest neighbor distance for all 100 queries...
Done.
Generating HNSW...
Done.
Computing recall graph...
thread 'main' panicked at 'source slice length (1) does not match destination slice length (2)', /Users/cinderella/.cargo/registry/src/github.com-1ecc6299db9ec823/hnsw-0.11.0/src/hnsw/hnsw_const.rs:308:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

rust-cv / hnsw Goto Github PK

hnsw's Introduction

Rust Computer Vision

What is computer vision

Goals

hnsw's People

Contributors

Stargazers

Watchers

Forkers

hnsw's Issues

Recommend Projects

Recommend Topics

Recommend Org