zachnagengast / similarity-search-kit

🔎 SimilaritySearchKit is a Swift package providing on-device text embeddings and semantic search functionality for iOS and macOS applications.

Home Page: https://discord.gg/2vBQcF3nU5

License: Apache License 2.0

Swift 100.00%
question-answering semantic-search semantic-similarity text-embeddings vector-embeddings coreml information-retrieval nlp pretrained-models apple-neural-engine

similarity-search-kit's Introduction

SimilaritySearchKit

SimilaritySearchKit is a Swift package enabling on-device text embeddings and semantic search functionality for iOS and macOS applications in just a few lines. Emphasizing speed, extensibility, and privacy, it supports a variety of built-in state-of-the-art NLP models and similarity metrics, in addition to seamless integration for bring-your-own options.

Use Cases

Some potential use cases for SimilaritySearchKit include:

  • Privacy-focused document search engines: Create a search engine that processes sensitive documents locally, without exposing user data to external services. (See example project "ChatWithFilesExample" in the Examples directory.)

  • Offline question-answering systems: Implement a question-answering system that finds the most relevant answers to a user's query within a local dataset.

  • Document clustering and recommendation engines: Automatically group and organize documents based on their textual content on the edge.

By leveraging SimilaritySearchKit, developers can easily create powerful applications that keep data close to home without major tradeoffs in functionality or performance.

Installation

To install SimilaritySearchKit, add it as a dependency to your Swift project using the Swift Package Manager. I recommend the Xcode method:

File → Add Packages... → Search or Enter Package URL → https://github.com/ZachNagengast/similarity-search-kit.git

Xcode should give you the following options to choose which model you'd like to add (see available models below for help choosing):

Xcode Swift Package Manager Import

If you want to add it via Package.swift, add the following line to your dependencies array:

.package(url: "https://github.com/ZachNagengast/similarity-search-kit.git", from: "0.0.1")

Then, add the appropriate target dependency to the desired target:

.target(name: "YourTarget", dependencies: [
    "SimilaritySearchKit", 
    "SimilaritySearchKitDistilbert", 
    "SimilaritySearchKitMiniLMMultiQA", 
    "SimilaritySearchKitMiniLMAll"
])

If you only want to use a subset of the available models, you can omit the corresponding dependency. This will reduce the size of your final binary.
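As a rough illustration of the subset approach, a complete manifest that pulls in only the core library and one model might look like the sketch below. The package name "MyApp", the tools version, and the platform versions are placeholders chosen for this example and are not taken from the README; the product names are the ones listed above.

```swift
// swift-tools-version:5.7
// Sketch of a manifest that depends on the core library plus a single model.
// "MyApp" and the platform versions are placeholders (assumptions).
import PackageDescription

let package = Package(
    name: "MyApp",
    platforms: [.iOS(.v16), .macOS(.v13)],
    dependencies: [
        .package(url: "https://github.com/ZachNagengast/similarity-search-kit.git", from: "0.0.1")
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [
                // Core index, metrics, and the built-in NaturalLanguage embeddings.
                .product(name: "SimilaritySearchKit", package: "similarity-search-kit"),
                // Only one model product is linked; the others are omitted.
                .product(name: "SimilaritySearchKitMiniLMAll", package: "similarity-search-kit")
            ]
        )
    ]
)
```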

Usage

To use SimilaritySearchKit in your project, first import the framework:

import SimilaritySearchKit

Next, create an instance of SimilarityIndex with your desired distance metric and embedding model (see below for options):

let similarityIndex = await SimilarityIndex(
    model: NativeEmbeddings(),
    metric: CosineSimilarity()
)

Then, add your text that you want to make searchable to the index:

await similarityIndex.addItem(
    id: "id1", 
    text: "Metal was released in June 2014.", 
    metadata: ["source": "example.pdf"]
)

Finally, query the index for the most similar items to a given query:

let results = await similarityIndex.search("When was metal released?")
print(results)

Which outputs a SearchResult array: [SearchResult(id: "id1", score: 0.86216, metadata: ["source": "example.pdf"])]

Examples

The Examples directory contains multiple sample iOS and macOS applications that demonstrate how to use SimilaritySearchKit to its fullest extent.

Example | Description | Requirements
BasicExample | A basic multiplatform application that indexes and compares similarity of a small set of hardcoded strings. | iOS 16.0+, macOS 13.0+
PDFExample | A mac-catalyst application that enables semantic search on the contents of individual PDF files. | iOS 16.0+
ChatWithFilesExample | An advanced macOS application that indexes any/all text files on your computer. | macOS 13.0+

Available Models

Model | Use Case | Size | Source
NaturalLanguage | Text similarity, faster inference | Built-in | Apple
MiniLMAll | Text similarity, fastest inference | 46 MB | HuggingFace
Distilbert | Q&A search, highest accuracy | 86 MB (quantized) | HuggingFace
MiniLMMultiQA | Q&A search, fastest inference | 46 MB | HuggingFace

Models conform to the EmbeddingsProtocol and can be used interchangeably with the SimilarityIndex class.

A small but growing list of pre-converted models can be found in this repo on HuggingFace. If you have a model that you would like to see added to the list, please open an issue or submit a pull request.

Available Metrics

Metric | Description
DotProduct | Measures the similarity between two vectors as the sum of the products of their corresponding components
CosineSimilarity | Calculates similarity by measuring the cosine of the angle between two vectors
EuclideanDistance | Computes the straight-line distance between two points in Euclidean space

Metrics conform to the DistanceMetricProtocol and can be used interchangeably with the SimilarityIndex class.
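For reference, the three metrics can be written out in a few lines of plain Swift. This is a standalone sketch of the underlying math, not the library's implementation; the function names are made up for this example.

```swift
/// Dot product: sum of the element-wise products.
func dotProduct(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    return zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
}

/// Cosine similarity: dot product divided by the product of the magnitudes.
/// Returns 0 when either vector has zero magnitude.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let magnitudes = dotProduct(a, a).squareRoot() * dotProduct(b, b).squareRoot()
    return magnitudes == 0 ? 0 : dotProduct(a, b) / magnitudes
}

/// Euclidean distance: square root of the summed squared differences.
func euclideanDistance(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    let sumOfSquares = zip(a, b).reduce(Float(0)) { $0 + ($1.0 - $1.1) * ($1.0 - $1.1) }
    return sumOfSquares.squareRoot()
}
```

Note the direction of each score: for the dot product and cosine similarity a larger value means "more similar", while for Euclidean distance a smaller value does.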

Bring Your Own

All the main parts of the SimilarityIndex can be overridden with custom implementations that conform to the following protocols:

EmbeddingsProtocol

Accepts a string and returns an array of floats representing the embedding of the input text.

func encode(sentence: String) async -> [Float]?
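To make the shape of that requirement concrete, here is a toy embedder with the same `encode(sentence:)` signature. It is only a sketch: the type `HashedBagOfWordsEmbeddings`, its `embed` helper, and the hashing scheme are invented for this example, it is not a trained model, and the real protocol may require members beyond the one shown in this README. To plug a type like this into the index you would declare conformance to EmbeddingsProtocol and satisfy whatever else it asks for.

```swift
/// Toy embedding: hashes each lowercased word into one of `dimension` buckets
/// and L2-normalizes the counts. For illustration only.
struct HashedBagOfWordsEmbeddings {
    let dimension: Int

    init(dimension: Int = 64) {
        precondition(dimension > 0, "dimension must be positive")
        self.dimension = dimension
    }

    /// Deterministic FNV-1a hash; Swift's built-in hashValue is seeded per process.
    private func bucket(for word: String) -> Int {
        var hash: UInt64 = 14695981039346656037
        for byte in word.utf8 {
            hash ^= UInt64(byte)
            hash = hash &* 1099511628211
        }
        return Int(hash % UInt64(dimension))
    }

    /// Synchronous core: returns nil when the input contains no words.
    func embed(_ sentence: String) -> [Float]? {
        let words = sentence.lowercased()
            .split(whereSeparator: { !$0.isLetter && !$0.isNumber })
            .map { String($0) }
        guard !words.isEmpty else { return nil }

        var vector = [Float](repeating: 0, count: dimension)
        for word in words {
            vector[bucket(for: word)] += 1
        }
        let norm = vector.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
        return vector.map { $0 / norm }
    }

    /// Same shape as the requirement listed above.
    func encode(sentence: String) async -> [Float]? {
        embed(sentence)
    }
}
```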

DistanceMetricProtocol

Accepts a query embedding vector and a list of embedding vectors, and returns a list of tuples, each holding the distance metric score and the index of a nearest neighbor, limited to resultsCount entries.

func findNearest(for queryEmbedding: [Float], in neighborEmbeddings: [[Float]], resultsCount: Int) -> [(Float, Int)]
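A minimal sketch of a method with this signature is shown below, using cosine similarity as the score. The type name `CosineNearestNeighbors` is hypothetical, and the library's protocol may require more members than the one listed here, so treat this as an outline of the ranking step rather than a drop-in conformance.

```swift
struct CosineNearestNeighbors {
    /// Cosine of the angle between two vectors; 0 if either has zero magnitude.
    private func cosine(_ a: [Float], _ b: [Float]) -> Float {
        let dot = zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
        let normA = a.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
        let normB = b.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
        let denominator = normA * normB
        return denominator == 0 ? 0 : dot / denominator
    }

    /// Scores every neighbor against the query, sorts by descending score,
    /// and keeps the top `resultsCount` (score, index) pairs.
    func findNearest(
        for queryEmbedding: [Float],
        in neighborEmbeddings: [[Float]],
        resultsCount: Int
    ) -> [(Float, Int)] {
        let scored: [(Float, Int)] = neighborEmbeddings.enumerated().map { pair in
            (cosine(queryEmbedding, pair.element), pair.offset)
        }
        let ranked = scored.sorted { $0.0 > $1.0 }
        return Array(ranked.prefix(max(0, resultsCount)))
    }
}
```

This is a brute-force scan, so the cost grows linearly with the number of stored embeddings. For a distance-style metric such as Euclidean distance, the sort order would be ascending instead.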

TextSplitterProtocol

Splits a string into chunks of a given size, with a given overlap. This is useful for splitting long documents into smaller chunks for embedding. It returns the list of chunks and an optional list of tokens for each chunk.

func split(text: String, chunkSize: Int, overlapSize: Int) -> ([String], [[String]]?)
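The sliding-window idea behind chunk size and overlap can be sketched as follows. This example treats whitespace-separated words as tokens, which is an assumption made for simplicity (the library's splitters may count model tokens instead), and `WordWindowSplitter` is a name invented for this sketch.

```swift
struct WordWindowSplitter {
    /// Splits `text` into windows of up to `chunkSize` words, where each window
    /// starts `chunkSize - overlapSize` words after the previous one.
    func split(text: String, chunkSize: Int, overlapSize: Int) -> ([String], [[String]]?) {
        precondition(chunkSize > 0, "chunkSize must be positive")
        precondition(overlapSize >= 0 && overlapSize < chunkSize,
                     "overlapSize must be in 0..<chunkSize")

        let tokens = text.split(whereSeparator: { $0.isWhitespace }).map { String($0) }
        guard !tokens.isEmpty else { return ([], []) }

        let step = chunkSize - overlapSize
        var chunks: [String] = []
        var chunkTokens: [[String]] = []
        var start = 0

        while start < tokens.count {
            let end = min(start + chunkSize, tokens.count)
            let window = Array(tokens[start..<end])
            chunks.append(window.joined(separator: " "))
            chunkTokens.append(window)
            if end == tokens.count { break }  // the last window reached the end
            start += step
        }
        return (chunks, chunkTokens)
    }
}
```

With this scheme a larger overlap means a smaller step between windows, so the same text produces more chunks.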

TokenizerProtocol

Tokenizes and detokenizes text. Use this for custom models that use different tokenizers than are available in the current list.

func tokenize(text: String) -> [String]
func detokenize(tokens: [String]) -> String
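As a very small illustration of the two methods, a whitespace tokenizer could look like the sketch below. `WhitespaceTokenizer` is a hypothetical name, and a real model would normally need the subword vocabulary it was trained with, so this only shows the shape of the interface.

```swift
struct WhitespaceTokenizer {
    /// Lowercases the text and splits it on any run of whitespace.
    func tokenize(text: String) -> [String] {
        text.lowercased()
            .split(whereSeparator: { $0.isWhitespace })
            .map { String($0) }
    }

    /// Joins tokens with single spaces. The round trip is lossy:
    /// original casing and spacing are not restored.
    func detokenize(tokens: [String]) -> String {
        tokens.joined(separator: " ")
    }
}
```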

VectorStoreProtocol

Save and load index items. The default implementation uses JSON files, but this can be overridden to use any storage mechanism.

func saveIndex(items: [IndexItem], to url: URL, as name: String) throws -> URL
func loadIndex(from url: URL) throws -> [IndexItem]
func listIndexes(at url: URL) -> [URL]
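The sketch below shows one way a JSON-backed store with these three methods could be written using Foundation. The fields of the library's IndexItem are not shown in this README, so the example uses a hypothetical stand-in type, `SketchItem`, and the store name `JSONFileStore` is likewise invented here.

```swift
import Foundation

/// Hypothetical stand-in for IndexItem; the real type's fields may differ.
struct SketchItem: Codable, Equatable {
    let id: String
    let text: String
    let embedding: [Float]
    let metadata: [String: String]
}

struct JSONFileStore {
    /// Encodes the items as JSON and writes them to "<name>.json" inside `url`.
    func saveIndex(items: [SketchItem], to url: URL, as name: String) throws -> URL {
        try FileManager.default.createDirectory(at: url, withIntermediateDirectories: true)
        let fileURL = url.appendingPathComponent("\(name).json")
        let data = try JSONEncoder().encode(items)
        try data.write(to: fileURL, options: .atomic)
        return fileURL
    }

    /// Reads a file written by `saveIndex` and decodes its items.
    func loadIndex(from url: URL) throws -> [SketchItem] {
        let data = try Data(contentsOf: url)
        return try JSONDecoder().decode([SketchItem].self, from: data)
    }

    /// Lists the ".json" files directly inside `url`; returns [] on any error.
    func listIndexes(at url: URL) -> [URL] {
        let contents = (try? FileManager.default.contentsOfDirectory(
            at: url, includingPropertiesForKeys: nil)) ?? []
        return contents.filter { $0.pathExtension == "json" }
    }
}
```

Swapping the JSON encoder for another backend (a database, a binary format) only changes the bodies of these three methods.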

Acknowledgements

Many parts of this project were derived from existing code, either already in Swift or translated into Swift with the help of ChatGPT. These are some of the main projects that were referenced:

Motivation

This project was inspired by the incredible advancements in natural language services and applications that have come about with the emergence of ChatGPT. While these services have unlocked a whole new world of powerful text-based applications, they often rely on cloud services. Specifically, many "Chat with Data" services require users to upload their data to remote servers for processing and storage. Although this works for some, it might not be the best fit for those in low-connectivity environments, or those handling confidential or sensitive information. While Apple does bundle the NaturalLanguage library for similar tasks, the CoreML model conversion process opens up a much wider array of models and use cases. With this in mind, SimilaritySearchKit aims to provide a robust, on-device solution that enables developers to create state-of-the-art NLP applications within the Apple ecosystem.

Future Work

Here's a short list of some features that are planned for future releases:

  • In-memory indexing
  • Disk-backed indexing
    • For large datasets that don't fit in memory
  • All around performance improvements
  • Swift-DocC website
  • HNSW / Annoy indexing options
  • Querying filters
    • Only return results with specific metadata
  • Sparse/Dense hybrid search
    • Use sparse search to find candidate results, then rerank with dense search
    • More info here
  • More embedding models
  • Summarization models
    • Can be used to merge several query results into one, and clean up irrelevant text
  • Metal acceleration for distance calcs

I'm curious to see how people use this library and what other features would be useful, so please don't hesitate to reach out on Twitter @ZachNagengast or by email at znagengast (at) gmail (dot) com.

similarity-search-kit's People

Contributors

bernhardeisvogel, brn, buhe, kilo-loco, lexiestleszek, michaeljelly, weswalla, wiktorwojcik112, zachnagengast

similarity-search-kit's Issues

how to make more embeddings?

Hi! Thanks a lot for this project, super interesting.

How can I generate more embeddings? I assume that embeddings are based on relatively big text chunks, so my question is: how can I improve the quality of the search by making more embeddings of a smaller size?

Custom embedding

Nice work!
I tried to write a custom embedding, but I get "Cannot find type 'EmbeddingProtocol' in scope".

import Foundation
import SimilaritySearchKit

class OpenAI11: EmbeddingProtocol {
    
}
public class SimilaritySearchKit: VectorStore {
    let vs: SimilarityIndex
    let embeddings: Embeddings
    
    init(vs: SimilarityIndex, embeddings: Embeddings) {
        self.vs = vs
        self.embeddings = embeddings
    }
    
    override func similaritySearch(query: String, k: Int) async -> [MatchedModel] {
        []
    }
    
    override func addText(text: String, metadata: [String: String]) async {
        
    }
}

What did I miss?

BR

buhe

how to split chunks without cutting the sentences?

How can I split the chunks at the periods that end sentences? I don't want half a sentence in each chunk.

Sometimes it splits one sentence so that the first half becomes part of one chunk and the second half becomes part of another. Basically, I want to split the text into chunks of < N tokens and end each chunk at the sentence boundary closest to the end.
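One possible approach, sketched here without using the library, is to split the text into sentences first and then pack whole sentences into chunks up to a token budget. The function names are invented for this example, tokens are approximated by whitespace-separated words, and the sentence detection is deliberately naive (it breaks on every ".", "!" or "?", so abbreviations and decimals would be split incorrectly).

```swift
import Foundation

/// Naive sentence splitter: ends a sentence at every ".", "!" or "?".
func sentences(in text: String) -> [String] {
    var result: [String] = []
    var current = ""
    for character in text {
        current.append(character)
        if character == "." || character == "!" || character == "?" {
            let trimmed = current.trimmingCharacters(in: .whitespacesAndNewlines)
            if !trimmed.isEmpty { result.append(trimmed) }
            current = ""
        }
    }
    let tail = current.trimmingCharacters(in: .whitespacesAndNewlines)
    if !tail.isEmpty { result.append(tail) }
    return result
}

/// Greedily packs whole sentences into chunks of at most `maxTokens` words.
/// A single sentence longer than `maxTokens` becomes its own oversized chunk.
func sentenceChunks(text: String, maxTokens: Int) -> [String] {
    precondition(maxTokens > 0, "maxTokens must be positive")
    var chunks: [String] = []
    var current: [String] = []
    var currentCount = 0

    for sentence in sentences(in: text) {
        let count = sentence.split(whereSeparator: { $0.isWhitespace }).count
        // Close the current chunk if this sentence would push it over budget.
        if !current.isEmpty && currentCount + count > maxTokens {
            chunks.append(current.joined(separator: " "))
            current = []
            currentCount = 0
        }
        current.append(sentence)
        currentCount += count
    }
    if !current.isEmpty { chunks.append(current.joined(separator: " ")) }
    return chunks
}
```

On Apple platforms, the NaturalLanguage framework's sentence tokenizer would likely give more reliable sentence boundaries than the character scan used here.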

Why doesn't adding overlapSize increase the total number of embeddings/chunks?

I experimented with chunk sizes and decided to use overlapSize. But whether I add it or not, nothing changes in my search results or in the total number of embeddings.

I use this code inside my ContentView (the code that was advised in one of the previous issues on how to change chunkSize):
let splitter = RecursiveTokenSplitter(withTokenizer: BertTokenizer())
let (splitText, _) = splitter.split(text: documentText, chunkSize: 200, overlapSize: 100)
chunks = splitText

how to add new embedding models?

Hi!

I thought about adding a couple more embedding models and wanted to convert them from HF to CoreML via the Exporters tool (Zach, I assume you used that tool too), but I couldn't figure out how to make the resulting CoreML model use something other than the default size of 128 tokens for both input and output (the Exporters library seems to use this size as a default, and every model that you put through Exporters ends up with this size as input and output).

Any ideas how to make that happen?

Exception in ChatWithFilesExample

2023-11-08 15:02:34.693989+0200 ChatWithFilesExample[17397:4904410] [espresso] [Espresso::handle_ex_plan] exception=Espresso exception: "Invalid state": inner product kernel p.has_biases is true, but biases were not found! [Exception from Layer: 14: x_1_cast]
2023-11-08 15:02:34.694158+0200 ChatWithFilesExample[17397:4904410] [coreml] Error computing NN outputs -1
2023-11-08 15:02:34.694214+0200 ChatWithFilesExample[17397:4904410] [coreml] Failure in -executePlan:error:.
Failed to generate query embedding for 'hello'.

macOS 10.3.1 Xcode Version 14.3.1 (14E300c)

Not usable on Intel macs

Overview

I couldn't use this on an Intel Mac. For some reason, it could not load custom models (miniqa or distilbert); the only model that works is NativeEmbeddings.

Here are the logs:

2023-08-09 17:28:57.802134+0800 BasicExample[50044:803119] Metal API Validation Enabled
2023-08-09 17:28:59.012154+0800 BasicExample[50044:803119] [espresso] [Espresso::handle_ex_plan] exception=Espresso exception: "Invalid state": inner product kernel p.has_biases is true, but biases were not found! [Exception from Layer: 24: x_9_cast]
2023-08-09 17:28:59.012343+0800 BasicExample[50044:803119] [coreml] Error computing NN outputs -1
2023-08-09 17:28:59.012400+0800 BasicExample[50044:803119] [coreml] Failure in -executePlan:error:.
Failed to generate a test input vector

I'm totally clueless at this point. Can you give me some pointers, @ZachNagengast?

PS: I'm using a remote mac to test. Happy to give you access for debugging.

EDIT: I did some more testing, and it looks like the CoreML model can only run on the CPU or Neural Engine, not the GPU.

Disk indexing coming?

Is disk storage for the indexes planned, or is it fast enough to use that the indexes can be built on every app launch?

[Bug] NativeContextualEmbeddings causes crash

I have attempted to switch from using NativeEmbeddings to NativeContextualEmbeddings and it is causing my iOS app to crash. It seems that the embeddings object is failing to initialize the model.

Error:

SimilaritySearchKit/NativeContextualEmbeddings.swift:43: Fatal error: 'try!' expression unexpectedly raised an error: Foundation._GenericObjCError.nilError

Usage:

await SimilarityIndex(
    model: NativeContextualEmbeddings(),
    metric: CosineSimilarity()
)

Device:
iPhone 13 Pro Max
iOS Version 17.2.1
