Giter Site home page Giter Site logo

meilisearch / arroy Goto Github PK

View Code? Open in Web Editor NEW
175.0 10.0 6.0 1.94 MB

Annoy-inspired Approximate Nearest Neighbors in Rust, based on LMDB and optimized for memory usage :boom:

Home Page: https://docs.rs/arroy

License: MIT License

Rust 100.00%
annoy approximate-nearest-neighbor-search lmdb rust diskann

arroy's Introduction

Dependency status License Bors enabled

⚑ A lightning-fast search engine that fits effortlessly into your apps, websites, and workflow πŸ”

Meilisearch helps you shape a delightful search experience in a snap, offering features that work out-of-the-box to speed up your workflow.

A bright colored application for finding movies screening near the user A dark colored application for finding movies screening near the user

πŸ”₯ Try it! πŸ”₯

✨ Features

  • Search-as-you-type: find search results in less than 50 milliseconds
  • Typo tolerance: get relevant matches even when queries contain typos and misspellings
  • Filtering and faceted search: enhance your users' search experience with custom filters and build a faceted search interface in a few lines of code
  • Sorting: sort results based on price, date, or pretty much anything else your users need
  • Synonym support: configure synonyms to include more relevant content in your search results
  • Geosearch: filter and sort documents based on geographic data
  • Extensive language support: search datasets in any language, with optimized support for Chinese, Japanese, Hebrew, and languages using the Latin alphabet
  • Security management: control which users can access what data with API keys that allow fine-grained permissions handling
  • Multi-Tenancy: personalize search results for any number of application tenants
  • Highly Customizable: customize Meilisearch to your specific needs or use our out-of-the-box and hassle-free presets
  • RESTful API: integrate Meilisearch in your technical stack with our plugins and SDKs
  • Easy to install, deploy, and maintain

πŸ“– Documentation

You can consult Meilisearch's documentation at https://www.meilisearch.com/docs.

πŸš€ Getting started

For basic instructions on how to set up Meilisearch, add documents to an index, and search for documents, take a look at our Quick Start guide.

⚑ Supercharge your Meilisearch experience

Say goodbye to server deployment and manual updates with Meilisearch Cloud. No credit card required.

🧰 SDKs & integration tools

Install one of our SDKs in your project for seamless integration between Meilisearch and your favorite language or framework!

Take a look at the complete Meilisearch integration list.

Logos belonging to different languages and frameworks supported by Meilisearch, including React, Ruby on Rails, Go, Rust, and PHP

βš™οΈ Advanced usage

Experienced users will want to keep our API Reference close at hand.

We also offer a wide range of dedicated guides to all Meilisearch features, such as filtering, sorting, geosearch, API keys, and tenant tokens.

Finally, for more in-depth information, refer to our articles explaining fundamental Meilisearch concepts such as documents and indexes.

πŸ“Š Telemetry

Meilisearch collects anonymized data from users to help us improve our product. You can deactivate this whenever you want.

To request deletion of collected data, please write to us at [email protected]. Don't forget to include your Instance UID in the message, as this helps us quickly find and delete your data.

If you want to know more about the kind of data we collect and what we use it for, check the telemetry section of our documentation.

πŸ“« Get in touch!

Meilisearch is a search engine created by Meili, a software development company based in France and with team members all over the world. Want to know more about us? Check out our blog!

πŸ—ž Subscribe to our newsletter if you don't want to miss any updates! We promise we won't clutter your mailbox: we only send one edition every two months.

πŸ’Œ Want to make a suggestion or give feedback? Here are some of the channels where you can reach us:

Thank you for your support!

πŸ‘©β€πŸ’» Contributing

Meilisearch is, and will always be, open-source! If you want to contribute to the project, please take a look at our contribution guidelines.

πŸ“¦ Versioning

Meilisearch releases and their associated binaries are available in this GitHub page.

The binaries are versioned following SemVer conventions. To know more, read our versioning policy.

Differently from the binaries, crates in this repository are not currently available on crates.io and do not follow SemVer conventions.

arroy's People

Contributors

curquiza avatar dureuill avatar irevoire avatar kerollmops avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

arroy's Issues

Do not use the vector dimensions as the number of items in a descendant node

When porting the build tree functions from Annoy, we kept the constant value for the number of descendants we could fit into a descendant node. The reason why they were doing that is because they needed constant-length nodes. However, our system no longer needs this, as LMDB entries can have any length.

Implement incremental updates

By keeping a list of unreachable_items that corresponds to the newly added or modified items we can make sure that the Writer::build method keeps track of them and insert them incrementally into all of the trees.

However, it must keep the split plane balanced and therefore recompute some of the normal along the way.

It must also make sure that the tree depth is growing with the number of items too. By appending new split plane if necessary and reducing the branch size. That could be a rebuild method that erase everything and start from the beginning to keep a good balance.

Make it possible to filter directly from arroy

At Meilisearch, we must ensure documents or vectors are hidden from some queries. The first version of the vector store feature was to iterate on the best results in the order found in the HNSW and conditionally ignore them. The complexity was O(n), which is not great when the user filters the results on a couple, e.g., 5, 10, 100.

We can do better than that now that we control the whole source code. There are two main solutions to implement:

  • When there is a fairly small number of selected items, compute the distance without going through the whole tree/graph. However, we need to define a correct threshold.
  • The algorithm must be more clever when many more items are selected. We will filter the items from the Descendant variant when "iterating" over the tree nodes in the nns_by_item/by_vector. We can barely not even touch the build phase. However, it would also be great to store bitmaps instead of lists of u32s in the Descendant nodes to be able to perform faster intersections.

We will also probably need to provide a nss_builder to reduce the number of conditional parameters to specify. For example, we can provide the RoaringBitmap to filter with, the number of trees to explore, and maybe two methods to query the database: with and without the distances (is it useful?).

TODO

  • Update the README's feature list

Look into binary indexes

One of our customers is interested in binary Indexes. It could be interesting to look into this. We can potential find a good fit with this.

BIN_FLAT
This index is exactly the same as FLAT except that this can only be used for binary embeddings.

For vector similarity search applications that require perfect accuracy and depend on relatively small (million-scale) datasets, the BIN_FLAT index is a good choice. BIN_FLAT does not compress vectors, and is the only index that can guarantee exact search results. Results from BIN_FLAT can also be used as a point of comparison for results produced by other indexes that have less than 100% recall.

BIN_FLAT is accurate because it takes an exhaustive approach to search, which means for each query the target input is compared to every vector in a dataset. This makes BIN_FLAT the slowest index on our list, and poorly suited for querying massive vector data. There are no parameters for the BIN_FLAT index in Milvus, and using it does not require data training or additional storage.

BIN_IVF_FLAT
This index is exactly the same as IVF_FLAT except that this can only be used for binary embeddings.

BIN_IVF_FLAT divides vector data into nlist cluster units, and then compares distances between the target input vector and the center of each cluster. Depending on the number of clusters the system is set to query (nprobe), similarity search results are returned based on comparisons between the target input and the vectors in the most similar cluster(s) only β€” drastically reducing query time.

By adjusting nprobe, an ideal balance between accuracy and speed can be found for a given scenario. Query time increases sharply as both the number of target input vectors (nq), and the number of clusters to search (nprobe), increase.

BIN_IVF_FLAT is the most basic BIN_IVF index, and the encoded data stored in each unit is consistent with the original data.

The `Side` function behave strangely on two dimensions

I wrote a visualizer of every iteration of two_means + the final repartition between the left children and right children on a split node.

At the very end of the split node function, we found two pretty good centroids:
image

But the final result we chose is this one:
image

The more dimensions we add, the less the issue shows.

Improve the deletion of `tmp_nodes`

Sometimes we need to delete new nodes previously inserted in the tmp_nodes.
From my understanding, it was always either the last one or the last last one, but it looks like it's not.

It would be good to check measure if we often do a lot of iterations, and if that's the case try to optimize it.

if let Some(el) = self.ids.iter_mut().rev().take(2).find(|i| **i == item) {

Measure and improve the constant numbers used when building the tree

We must take three parameters into account:

  1. Time to build the tree
  2. Relevancy of the searches
  3. Time to search in the tree

Fun fact: the lowest in the tree you are, the less impact a dummy plane has on the search cost.


arroy/src/writer.rs

Lines 248 to 259 in 7fc6031

if split_imbalance(children_left.len(), children_right.len()) < 0.95
|| remaining_attempts == 0
{
break normal;
}
remaining_attempts -= 1;
};
// If we didn't find a hyperplane, just randomize sides as a last option
// and set the split plane to zero as a dummy plane.
while split_imbalance(children_left.len(), children_right.len()) > 0.99 {

fn split_imbalance(left_indices_len: usize, right_indices_len: usize) -> f64 {
    let ls = left_indices_len as f64;
    let rs = right_indices_len as f64;
    let f = ls / (ls + rs + f64::EPSILON); // Avoid 0/0
    f.max(1.0 - f)
}

fn main() {
    dbg!(split_imbalance(29464, 18394));
    dbg!(split_imbalance(30000, 30000));
    dbg!(split_imbalance(30000, 1580));
}

Update the pre-processor

  • About preprocess:
    • We can probably compute the max-norm on the fly while adding the items
    • If the max-norm doesn't change (from a doc addition OR deletion), then we don't need to run the second part of

Support binary quantization

It would be great to support binary quantization in arroy. The main principle is to convert the dimensions values x <= 0 to 0 and x > 0 to 1. This way, we can represent the quantized vector with 32x less space and compute the distances in a much faster and CPU-friendly. We are currently limited to something like 15M (float 32bit, 768dims) on a 63GiB machine, but with binary quantization, we can go up to 480M vectors on the same machine.

Here is an example of implementing the Euclidean distance with binary data. Here is the formula: $\sqrt{(p1-q1)^2+(p2-q2)^2}$.
This means that computing the difference at the power of two is equivalent to a xor:
$(0-1)^2 = (-1)^2 = 1$
$(1-0)^2 = 1^2 = 1$
$(0-0)^2 = 0$
$(1-1)^2 = 0$

Ultimately, the Euclidean operation is the sum of the XORed dimensions of both vectors squared: $\sqrt{(p1 \bigoplus q1)+(p2 \bigoplus q2)}$. All the necessary operations can be SIMD-optimized or maybe using the u8::BitXor and u8::count_ones methods will be SIMD-optimized by itself πŸ€”

Appending items no longer works

It is no longer possible to append items in the database because we are updating the updated item IDs that live after the item entries. We could move the metadata information before the item entries. We must add a test for this method.

arroy/src/writer.rs

Lines 166 to 176 in 19e0a07

let mut updated = self
.database
.remap_data_type::<RoaringBitmapCodec>()
.get(wtxn, &Key::updated(self.index))?
.unwrap_or_default();
updated.insert(item);
self.database.remap_data_type::<RoaringBitmapCodec>().put(
wtxn,
&Key::updated(self.index),
&updated,
)?;

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.