Giter Site home page Giter Site logo

instant-labs / instant-distance Goto Github PK

View Code? Open in Web Editor NEW
282.0 8.0 25.0 201 KB

Fast approximate nearest neighbor searching in Rust, based on HNSW index

License: Apache License 2.0

Rust 87.61% Makefile 0.57% Python 11.82%
rust approximate-nearest-neighbor-search hnsw

instant-distance's People

Contributors

dependabot-preview[bot] avatar dependabot[bot] avatar djc avatar hartshorne avatar jinglybits avatar knpwrs avatar messense avatar nrempel avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

instant-distance's Issues

Apparent deadlock when `Heuristic` option enabled

Description

While debugging another issue, I found a deadlock when a extend_candidates is set to true in the Heuristic:

Reproduce

  1. Change the map test in tests/all.rs as below (set the Heuristic in the builder).
  2. cargo test map
  3. Note that the test won't complete.

Test code

diff --git a/instant-distance/tests/all.rs b/instant-distance/tests/all.rs
index b9fa973..8e916c6 100644
--- a/instant-distance/tests/all.rs
+++ b/instant-distance/tests/all.rs
@@ -4,7 +4,7 @@ use ordered_float::OrderedFloat;
 use rand::rngs::{StdRng, ThreadRng};
 use rand::{Rng, SeedableRng};
 
-use instant_distance::{Builder, Point as _, Search};
+use instant_distance::{Builder, Heuristic, Point as _, Search};
 
 #[test]
 #[allow(clippy::float_cmp, clippy::approx_constant)]
@@ -16,7 +16,13 @@ fn map() {
 
     let seed = ThreadRng::default().gen::<u64>();
     println!("map (seed = {seed})");
-    let map = Builder::default().seed(seed).build(points, values);
+    let map = Builder::default()
+        .seed(seed)
+        .select_heuristic(Some(Heuristic {
+            extend_candidates: true,
+            ..Default::default()
+        }))
+        .build(points, values);
     let mut search = Search::default();
 
     for (i, item) in map.search(&Point(2.0, 2.0), &mut search).enumerate() {

Building index ahead of time lacks information required for future mapping

By building an index ahead of time, the order of the candidates returned are lost. This makes it hard to map an embedding back onto it's originating word if we load the instant-distance instance from a dump.

Unless I'm missing something, does it make sense to add this information to the bincode dump?

could not install requirements.txt

there is an issue with instant-distance package:

ERROR: Could not find a version that satisfies the requirement instant-distance==0.3.1 (from versions: none)
ERROR: No matching distribution found for instant-distance==0.3.1

how to reproduce the error

my requirements.txt file:

numpy
pytest
nltk
aiohttp==3.8.3
progress==1.6
instant-distance==0.3.1
python version: 3.9.6
pip version: 23.1.2
MacOS 13.4 (22F66)

Generalize SIMD distance implementation to n-length vectors

Our Rust API is currently completely abstract over types that implement the Point trait:

pub trait Point: Clone + Sync {
    fn distance(&self, other: &Self) -> f32;
}

An important part of making instant-distance fast is making the distance() implementation fast, which comes down to using a SIMD implementation. I wrote such an implementation which is specialized for the case of [f32; 300] because that's what we typically use for InstantDomainSearch (since Meta's FastText vectors have 300 elements).

However, for the Python bindings, we have some other needs. There, we don't really have the opportunity to make use of compile-time generics; the vector length should be run-time property of the vector. Since we probably don't want to dereference through a pointer for every vector element access, this also means we might want to change the Hnsw implementations to hold a Vec<f32> instead of a Vec<[f32; 300]> (for example), without losing the performance benefits of avoiding bounds checking where possible. For the Python API, we should then also do a SIMD distance implementation that can adjust to the size of the vector at run-time, ideally without much performance loss compared to the current, fixed-length implementation.

make test-python error

Hello,
I encountered an error with make test-python . The error says
cp: cannot stat 'target/release/libinstant_distance.dylib': No such file or directory make: *** [Makefile:3: test-python] Error 1

I would greatly appreciate any help.

Thank you.

Best,
Yazhini

Incremental index construction

Hello, does this crate support incremental index construction? For example, I would like to store the constructed index on disk to quickly retrieve desired data, and incrementally update the index when the data is updated.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.