
A SQLite extension for efficient vector search, based on Faiss!

License: MIT License

sqlite sqlite-extension vector-search faiss

sqlite-vss's Introduction

sqlite-vss

Warning

sqlite-vss is not in active development. Instead, my effort is now going towards sqlite-vec, which is a similar vector search SQLite extension, but should be much easier to install and use than sqlite-vss. See this blog post for more info.

sqlite-vss (SQLite Vector Similarity Search) is a SQLite extension that brings vector search capabilities to SQLite, based on Faiss. It can be used to build semantic search engines, recommendation systems, or question-answering tools.

See Introducing sqlite-vss: A SQLite Extension for Vector Search (February 2023) for more details and a live example!

If your company or organization finds this library useful, consider supporting my work!

Usage

.load ./vector0
.load ./vss0

select vss_version(); -- 'v0.0.1'

sqlite-vss has a similar API to the fts5 Full-Text Search Extension. Use the vss0 module to create virtual tables that can efficiently store and query your vectors.

-- 384 == number of dimensions for this example
create virtual table vss_articles using vss0(
  headline_embedding(384),
  description_embedding(384)
);

sqlite-vss is a bring-your-own-vectors database; it is compatible with any embedding or vector data you have. Consider using OpenAI's Embeddings API, HuggingFace's Inference API, sentence-transformers, or any of these open source models. In this example, we are using sentence-transformers/all-MiniLM-L6-v2 to generate embeddings from our text, which have 384 dimensions.

You can insert vectors into vss0 tables as JSON or raw bytes.

insert into vss_articles(rowid, headline_embedding)
  select rowid, headline_embedding from articles;
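For example, a vector can be passed in as a JSON string directly. A minimal sketch, using a throwaway 2-dimension table so the vectors fit on one line:

create virtual table vss_demo using vss0(a(2));

insert into vss_demo(rowid, a)
  values (1, '[1.0, 2.0]'), (2, '[2.0, 3.0]');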

To query for similar vectors ("k nearest neighbors"), use the vss_search function in the WHERE clause. Here we are searching for the 100 nearest neighbors to the embedding in row #123 in the articles table.

select rowid, distance
from vss_articles
where vss_search(
  headline_embedding,
  (select headline_embedding from articles where rowid = 123)
)
limit 100;
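Note: on SQLite versions below 3.41.0, the LIMIT on a vss_search() query can't be used as the k parameter; instead, pass the query vector and k through vss_search_params() (this constraint surfaces as a runtime error, quoted in the issues below):

select rowid, distance
from vss_articles
where vss_search(
  headline_embedding,
  vss_search_params(
    (select headline_embedding from articles where rowid = 123),
    100
  )
);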

You can INSERT and DELETE rows in these tables as necessary, but UPDATE operations aren't supported yet. This can be combined with triggers for automatically updated indexes (see the trigger sketch after the next example). Also note that "small" INSERT/DELETE operations that only touch a few rows can be slow, so batch where necessary.

begin;

delete from vss_articles
  where rowid between 100 and 200;

insert into vss_articles(rowid, headline_embedding, description_embedding)
  values (:rowid, :headline_embedding, :description_embedding);

commit;
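A sketch of the trigger pattern mentioned above, assuming an articles source table that stores a headline_embedding blob alongside each row:

create trigger articles_vss_insert after insert on articles begin
  insert into vss_articles(rowid, headline_embedding)
    values (new.rowid, new.headline_embedding);
end;

create trigger articles_vss_delete after delete on articles begin
  delete from vss_articles where rowid = old.rowid;
end;

Since per-row inserts into the index can be slow, this pattern fits best on tables with light write traffic.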

You can pass in custom Faiss factory strings for specific columns to control how the Faiss index is stored and queried. By default the factory string is "Flat,IDMap2", which can be slow to query as your database grows. Here, we add an inverted file index with 4096 centroids, a non-exhaustive option that makes large database queries much faster.

create virtual table vss_ivf_articles using vss0(
  headline_embedding(384) factory="IVF4096,Flat,IDMap2",
  description_embedding(384) factory="IVF4096,Flat,IDMap2"
);

This IVF index requires training! You can provide training data with an INSERT command in a single transaction, using the special operation="training" constraint.

insert into vss_ivf_articles(operation, headline_embedding, description_embedding)
  select
    'training',
    headline_embedding,
    description_embedding
  from articles;
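Once training completes, insert the vectors themselves as before:

insert into vss_ivf_articles(rowid, headline_embedding, description_embedding)
  select rowid, headline_embedding, description_embedding
  from articles;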

Beware! Indexes that require training can take a long time to build. With the News Category Dataset (384 dimensions over ~210k vectors) that this example is based on, the default index took 8 seconds to build, but with the custom "IVF4096,Flat,IDMap2" factory it took 45 minutes to train and 4.5 minutes to insert data! Training time can likely be reduced with a smaller training set, and the faster queries may be worth the cost.

Documentation

See docs.md for instructions on compiling sqlite-vss yourself, as well as a full SQL API reference.

Installing

The Releases page contains pre-built binaries for Linux x86_64 and MacOS x86_64 (MacOS Big Sur 11 or higher). More pre-compiled targets will be available in the future. Additionally, sqlite-vss is distributed on common package managers like pip for Python and npm for Node.js, see below for details.

Do note that on Linux machines, you'll have to install some packages to make these options work:

sudo apt-get update
sudo apt-get install -y libgomp1 libatlas-base-dev liblapack-dev

Note: The 0 in the filenames (vss0.dylib/vss0.so) denotes the major version of sqlite-vss. Currently sqlite-vss is pre-v1, so expect breaking changes in future versions.

Language        Install                                              More Info
Python          pip install sqlite-vss                               sqlite-vss with Python (PyPI)
Datasette       datasette install datasette-sqlite-vss               sqlite-vss with Datasette
Node.js         npm install sqlite-vss                               sqlite-vss with Node.js (npm)
Deno            deno.land/x/sqlite_vss                               sqlite-vss with Deno
Ruby            gem install sqlite-vss                               sqlite-vss with Ruby (RubyGems)
Elixir          hex.pm/packages/sqlite_vss                           sqlite-vss with Elixir (Hex.pm)
Go              go get -u github.com/asg017/sqlite-vss/bindings/go   sqlite-vss with Go
Rust            cargo add sqlite-vss                                 sqlite-vss with Rust (Crates.io)
GitHub Release  see the Releases page                                latest SemVer pre-release

With the sqlite3 CLI

To use sqlite-vss with the official SQLite command-line shell, download the vector0.dylib/vss0.dylib (for MacOS Big Sur 11 or higher) or vector0.so/vss0.so (Linux) files from a release and load them into your SQLite environment.

The vector0 extension is a required dependency, so make sure to load that before vss0.

.load ./vector0
.load ./vss0
select vss_version();
-- v0.0.1

Python

For Python developers, install the sqlite-vss package with:

pip install sqlite-vss
import sqlite3
import sqlite_vss

db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
sqlite_vss.load(db)

version, = db.execute('select vss_version()').fetchone()
print(version)

See bindings/python for more details.

Node.js

For Node.js developers, install the sqlite-vss npm package with:

npm install sqlite-vss
import Database from "better-sqlite3"; // also compatible with node-sqlite3
import * as sqlite_vss from "sqlite-vss";

const db = new Database(":memory:");
sqlite_vss.load(db);

const version = db.prepare("select vss_version()").pluck().get();
console.log(version);

See npm/sqlite-vss/README.md for more details.

Deno

For Deno developers, use the deno.land/x/sqlite_vss module:

// Requires all permissions (-A) and the --unstable flag

import { Database } from "https://deno.land/x/sqlite3/mod.ts";
import * as sqlite_vss from "https://deno.land/x/sqlite_vss/mod.ts";

const db = new Database(":memory:");

db.enableLoadExtension = true;
sqlite_vss.load(db);

const [version] = db.prepare("select vss_version()").value<[string]>()!;

console.log(version);

See deno/sqlite-vss/README.md for more details.

Datasette

And for Datasette, install the datasette-sqlite-vss plugin with:

datasette install datasette-sqlite-vss

See bindings/datasette for more details.

Disadvantages

  • The underlying Faiss indices are capped at 1GB. Follow #1 for updates.
  • Additional filtering on top of KNN searches isn't supported yet. Follow #2 for updates.
  • Only CPU Faiss indices are supported, not GPU yet. Follow #3 for updates.
  • mmap'ed indices aren't supported yet, so indices have to fit in RAM. Follow #4 for updates.
  • This extension is written in C++ and doesn't have fuzz testing yet. Follow #5 for updates.
  • UPDATE statements on vss0 virtual tables are not supported, though INSERT and DELETE statements are. Follow #7 for updates.

Supporting

I (Alex πŸ‘‹πŸΌ) spent a lot of time and energy on this project and many other open source projects. If your company or organization uses this library (or you're feeling generous), then please consider sponsoring my work, sharing this project with a friend, or hiring me for contract/consulting work!



sqlite-vss's Issues

Tracking: Memory Issues

sqlite-vss.cpp and sqlite-vector.cpp have many memory leaks that need to be cleaned up. There are a few reasons why:

  • SQLite's extension API handles memory and lifetime of certain items in an odd way
  • Faiss handles memory of indicies/query vectors in an odd way
  • sqlite-vector and sqlite-vss need to share a few functions/pointers that make it difficult to debug
  • I'm not a C++ developer so I yolo'ed a bunch

This issue will track different memory problems to solve and how to test against them.

  • Tests for memory leaks. I've done this before manually with Valgrind, which is painful, but if anyone has suggestions for automating memory-leak tests for C++ applications I'd love to hear them!
  • code quality:
    • use unique_ptr where possible
    • use the stack where possible
  • Ensure vss_search queries don't leak memory.
  • Ensure all other vss0 queries don't leak
  • Ensure INSERT and DELETE operations by themselves don't leak
  • Ensure COMMIT operations don't leak
  • Cleanup and document the lifetime of different objects in the sqlite-vector <-> sqlite-vss APIs

Support push-down filters on vss indices

tl;dr "The Missing WHERE Clause in Vector Search" describes this exact problem

Problem

Say your database looks like this:

create table articles(
  headline text, 
  headline_embedding blob,
  newsroom text,
  published_at date
);

create virtual table vss_articles using vss0(
  headline_embedding(384)
);

Most queries go like "show me the 10 closest articles that talk about 'global warming'", which is currently supported and easy:

select rowid
from vss_articles
where vss_search(headline_embedding, embedding_api('global warming'))
limit 10;

Now you have the rowids of the 10 closest articles that talk about "global warming", and you're good to go!

But now consider that you want the same 10 closest articles, but with extra filters, say "show me the 10 closest articles that talk about 'global warming', published between 2016 and 2019 by the NY Times".

... well, crap. You have a few options, but none will 100% get the correct answer. You could cast a wide net by getting the nearest 1,000 matches and then filtering those, but there's no guarantee the true top 10 survive the filter.

Solution: indexed non-vector columns in vss0

create virtual table vss_articles using vss0(
  headline_embeddings(384),
  published_at date,
  newsroom text
);

insert into vss_articles
  select headline_embedding, published_at, newsroom
  from articles;

select rowid
from vss_articles
where vss_search(headline_embeddings, embedding_api('global warming'))
  and published_at between '2016-01-01' and '2019-12-31'
  and newsroom = 'NY Times'
limit 20;

Implementation

We could support non-vector columns in vss0, which get added to the vss_xyz_data shadow table. Then, we can create indices on those columns on the shadow tables like so:

create index idx_vss0_vss_articles_data_published_at 
  on vss_articles_data(published_at);
create index idx_vss0_vss_articles_data_newsroom 
  on vss_articles_data(newsroom);

Then, in xFilter, we determine "candidate vectors" by querying those new data columns using the new indices. We'll get all the candidate rowids and pass them into SearchParameters->IDSelector, with maybe some perf additions (bitmap, bloom filter, etc.).

Bindings for Elixir

I'd like to bundle sqlite-vss as an Elixir package and distribute it on hex. Ideally Elixir developers should be able to run:

mix deps.get sqlite-vss

Existing Language Binding notes

The Python/Node.js bindings (and the future Ruby binding I'm working on) all do similar things:

  1. Bundle pre-compiled loadable extension files (.dylib/.so/.dll) inside the language package (the .whl file for Python, the bundled .tar.gz package for Node.js, and the .gem file for Ruby).
  2. Distribute different versions of the package for different platforms. For Python this means different wheels per platform, in Node.js different optionalDependencies, and in Ruby it's building different .gem files with gem bundle --platform x86_64-darwin etc.
  3. The language code itself is typically just a single loadable_path function that returns the path to the compiled extension, and maybe a load() function that wraps a sqlite3 db connection and just calls db.load_extension(loadable_path()), like the Python API.

Open questions

  1. How do we publish different platforms for the same package? Can we publish a single sqlite-vss package that has different "wheels" for all our platforms, or do we need to publish separate sqlite-vss-macos-x86_64/sqlite-vss-linux-x86_64 packages for every platform?
  2. Can we just stuff the loadable extensions in def project? This seems to work for me:
  def project do
    [
      app: :hello_world,
      version: "0.1.0",
      elixir: "~> 1.12",
      start_permanent: Mix.env() == :prod,
      deps: deps(),
      files: ["priv/*.dylib"]
    ]
  end

Specifically the files: ["priv/*.dylib"] part. How can we make it work for multiple platforms?

Runtime error: vss_search() only support vss_search_params() as a 2nd parameter for SQLite versions below 3.41.0

Hey! Thanks for doing this project, it is super cool!

Runtime error: vss_search() only support vss_search_params() as a 2nd parameter for SQLite versions below 3.41.0

I hit this issue doing a simple example like the documentation.

Any guidance on getting around it, other than downgrading SQLite (or is that what's recommended)? I saw that the issue on the SQLite forum seems to have been addressed. Maybe I should pull the sqlite-vss code and try to build locally? Thanks!

pVtabCursor->pVtab->zErrMsg = sqlite3_mprintf("vss_search() only support vss_search_params() as a 2nd parameter for SQLite versions below 3.41.0");

https://sqlite.org/forum/info/6b32f818ba1d97ef

Sharing faiss::Index across multiple connections

The way I've understood the lib, it's reading the same faiss::Index from the database's shadow table every single time a connection is created. This of course has dramatic repercussions for multi user environments (the web?), due to the size of the index.

Would it be a good idea to have some sort of shared faiss::Index between multiple connections, such that the index is only read once and shared for multiple connections, and stays in memory? This would require some sort of thread synchronisation objects of course, particularly on updates/writes - But I suspect it would bring down the memory footprint, and increase the general usability of the library by a lot.

Suggestions ...?

I can do it if you like the idea ...

Support memory-mapped on-disk Indices

The underlying Faiss indices are stored in SQLite shadow tables, which can't be mmap'ed with IO_FLAG_MMAP.

One solution: introduce a new option to store a vss0 column index on disk, allowing mmap'ed indices for larger-than-memory use.

create virtual table articles using vss0(
  headline_embedding(1024) factory="..." on_disk=True,
  description_embedding(1024) factory="..." on_disk=True
);

Then, your directory would look like:

$ tree .
.
β”œβ”€β”€ my_data.db
β”œβ”€β”€ my_data.db.vss0.articles.description_embedding.faissindex
└── my_data.db.vss0.articles.headline_embedding.faissindex

sqlite3_db_filename() would be useful here.

One problem: it's kind of nice to have all Faiss indices stored in one file (the SQLite database itself), and this config option would instead mean users have to move multiple files around rather than a single SQLite file. But since this is an "optimization" feature that's not enabled by default, I think it'll be ok.

Issues using rust crate on M2 Max

Not sure if anyone else on apple silicon is facing issues, but for the rust crate I had to manually add RUSTFLAGS="-L/opt/homebrew/opt/libomp/lib -L/opt/homebrew/opt/llvm/lib -lblas -llapack" in order for the linker to work properly.

Curious whether this is a result of my local setup, or an issue that others on Apple silicon are facing with the Rust crate.

Tracking: Rust Bindings

Do you have any issues using the new Rust bindings for sqlite-vss? Comment on this issue with any bugs or crashes you come across, or with suggestions on how to make it better.

TODOs for Rust bindings:

  • docs.rs work
  • site documentation
  • rust-bert example?
  • Add rust test to test.yaml workflow?

Different vector storage for write-heavy workflows

With the current "Faiss in a shadow table" storage strategy, there's one drawback: every time data in a vss0 table is inserted or updated and a COMMIT happens, the entire Faiss index has to be written to the DB. So if your index has 1 million vectors and you insert 1 more, then 1,000,001 vectors have to be written to the DB.

This is a limitation of Faiss's IO tools: I believe only whole-index writes are supported (unless you mmap, which isn't possible with in-db storage).

Proposal: A non-faiss index storage option

We should create our own bespoke index storage option that allows for incremental, streaming writes without re-writing the entire index. This would live outside of Faiss, but could still use some of Faiss's C++ API for querying.

For example, we could have:

create virtual table vss_demo using vss0(
  index_storage="incremental",
  chunk_embeddings(768)
);

insert into vss_demo(rowid, chunk_embeddings)
  select rowid, chunk_embeddings from demo;


-- inserting 1 row should be fast and not re-write the entire index
insert into vss_demo(rowid, chunk_embeddings)
  values (1234, :embedding);

The index_storage="incremental" option is the key: that signals that instead of the default faiss-shadow-table index storage strategy, the vss_demo table should instead store vectors in this new "incremental" storage format that's not Faiss-based.

We can still use Faiss for the actual KNN and range searches, through knn_L2sqr_by_idx() and range_search_L2sqr(). But we'll need to design our own SQLite-based vector storage solution.

Design notes

This is a WIP, will likely change.

create table vss_demo_config(
  chunk_size integer
);
create table vss_demo_streaming_chunks(
  -- blob of size `chunk_size * sizeof(int64)`, storing the rowid values of the corresponding vectors in `vectors`
  rowids blob,
  -- bitmap of size `chunk_size / 8` that tracks if the i-th vector is valid/not-deleted
  valid blob,
  -- blob of size `chunk_size * sizeof(float)`, containing the raw float vectors
  vectors blob
);

At creation time, an all-zero chunk of size chunk_size is inserted into vss_demo_streaming_chunks. At insert time, each vector and its rowid are stored in the rowids and vectors columns using SQLite's BLOB I/O API, and the i-th bit in the valid bitmap is updated. At query time, the vectors and rowids columns are read into memory, range_search_L2sqr() is called to find the K nearest vectors in that chunk, the chunk is de-allocated, and the next chunk is processed. At the end, we re-sort the results to get the true K nearest vectors.

There could also be a new operation="optimize" option that re-arranges these chunks for delete-heavy workflows, to save space.

insert into vss_demo(operation) values ('optimize');

Disadvantages

  • This index will likely be slower to query, since each query will have to read from the disk. However, it'll be reading from the already-opened SQLite database and not external files, and small BLOBs in SQLite are extremely fast, so this may not be noticeable in most applications.
  • This probably can only support flat indexes, i.e. it'll be an exhaustive search. It could possibly incorporate something like HNSW or IVF, but that'll be complex and further in the future.

Load failure on MacOS Mojave

Hi, this extension looks awesome. Just a heads up that loading it on my Mac running 10.14.6 gives the following error.

sqlite> .load ./vector0
Error: dlopen(./vector0.dylib, 10): no suitable image found.  Did find:
	./vector0.dylib: cannot load 'vector0.dylib' (load command 0x80000034 is unknown)

From googling the error, I suspect this is a MacOS version issue.

Handle larger Indices (current limit 1GB)

The vector indices that back the vss0 virtual table are limited to 1GB, because each is stored as a BLOB in a single row of a shadow table, and SQLite BLOBs are capped at ~1GB.

Instead, we should store large Faiss indices across several rows, so they can (in theory) grow without bound. This will likely be complicated and require the SQLite BLOB I/O API and a custom Faiss IOWriter.

Segfault when inserting embeddings into vss_table

Summary:
When inserting data into the virtual table "vss_chunks" using SQLite, a segmentation fault occurs in the C++ code of the "vssIndexUpdate" function.

Steps to reproduce:

  1. Create a virtual table "vss_chunks" using the vss0 module (see the Schema section below for the exact SQL statement).
  2. Insert data into the "chunks" table with the following Python code:
def save_resource(resource: Resource):
    conn = create_connection()
    cursor = conn.cursor()
    cursor.execute('''
        INSERT INTO resources (url, title)
        VALUES (?, ?)
    ''', (resource.url, resource.title))

    resource_id = cursor.lastrowid
    for chunk, embedding in zip(resource.chunks, resource.embeddings):
        embedding_bytes = embedding.tobytes()
        cursor.execute('''
            INSERT INTO chunks (chunk, embedding, resource_id)
            VALUES (?, ?, ?)
        ''', (chunk, embedding_bytes, resource_id))

    conn.commit()
    conn.close()

r = salmon.create_article('https://news.ycombinator.com/item?id=34218724')
save_resource(r)
  3. Insert data from the "chunks" table into the "vss_chunks" virtual table with the following Python code:
connection = create_connection()
cursor = connection.cursor()
cursor.execute('''
            INSERT INTO vss_chunks (rowid, chunk_embedding)
                SELECT rowid, embedding
                FROM chunks
        ''')
connection.commit()
connection.close()

Logs from IntelliJ when attempting the same query using its SQL console:

# Problematic frame:
# C  [vss0.so+0xb7eb9]  vssIndexUpdate(sqlite3_vtab*, int, sqlite3_value**, long long*)+0x459

Environment:

Operating system: Ubuntu 22.04.2 LTS
Python version: 3.10.6
SQLite version: 3.40.0
sqlite_vss version: 0.0.4 (installed with pip)
sentence_transformers: multi-qa-mpnet-base-cos-v1 model, which generates 768-dimensional embeddings

Schema:

def init_db():
    conn = create_connection()
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS resources (
            id INTEGER PRIMARY KEY,
            url TEXT,
            title TEXT
        )
    ''')

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS chunks (
            id INTEGER PRIMARY KEY,
            chunk TEXT,
            embedding BLOB,
            resource_id INTEGER,
            FOREIGN KEY (resource_id) REFERENCES resources (id)
        )
    ''')

    cursor.execute('''
        CREATE VIRTUAL TABLE vss_chunks USING vss0(
        chunk_embedding(768)
    );
    ''')
    conn.commit()
    conn.close()

How to install on windows?

I used pip install sqlite-vss and it gave me an error. I looked into installing it manually into site-packages, but I only found builds for Mac and Linux.

Bug in sqlite3_vector_init / sqlite3_vss_init

sqlite3_vss_init requires the vector plugin to have been initialised before the VSS plugin is initialised. In fact, the following check fails:

        // TODO: The && false parts needs to be removed, however if it is removed, the invocation returns
        // an error saying the function cannot be modified due to open statements. Not sure what this is?
        if (rc != SQLITE_OK && false) {

            *pzErrMsg = sqlite3_mprintf("%s: %s", "vector0", sqlite3_errmsg(db));
            return rc;
        }

... which is why I added the && false parts. I suspect there's something wrong in the order of initialising things, and that it might have some consequences that are "bad" in the long run. I'd love to fix this myself, but don't know exactly how ... :/

The vector0_api_from_db method seems to correctly finalise the statement, so there should not be any "open statements". However, if you remove the && false parts, you will see it complain about not being able to modify/delete functions due to "open statements".

Might there be some statement that is not correctly disposed in there somewhere ...? :/

Support `rowid in (...)` constraints in `vss_search()` KNN queries

In KNN-style searches, we should support rowid in (...) constraints in queries like so:

select rowid, distance
from vss_articles
where vss_search(description_embeddings, :query_vector)
  and rowid in (1, 2, 3, ..., 100)
limit 25

Currently we ignore the "equals" constraint on rowid, but if we were to capture that constraint (and enable sqlite3_vtab_in), we could read in all the rowids and use an IDSelector to pre-filter results.

This would be especially great when paired with subqueries:

with subset as (
  select rowid
  from articles
  where published_at between '2022-01-01' and '2023-01-01'
    and newsroom = 'NY Times'
)
select rowid, distance
from vss_articles
where vss_search(description_embeddings, :query_vector)
  and rowid in (select rowid from subset)
limit 25

This would enable "pre-filtering" according to this post. This would be an easy-to-implement but probably-slow solution to push-down filters described in #2.

Vector transformation functionality (to reduce size of index)

I've looked a little bit at Faiss, and I realise one of its really cool features is the ability to "pack" vectors, reducing the size of the index drastically. For instance, OpenAI returns 1,536 dimensions. For a lot of domains, packing these down to 96D would probably retain more than enough accuracy. Faiss supports this, and SQLite-VSS's storage supports it too, through its "index factory" constructs.

However, once you query later down the road, there is no mechanism to take the incoming query vector and reduce it to the size required to query the index. I suspect something needs to be applied inside vssIndexFilter, or one of the init functions, that can for instance take a 1,536-dimensional vector and reduce it to a 96D vector before querying the index.

I would love to see this feature in SQLite-VSS, and I'm ready to build it myself too, however I'd need some guidance on exactly how to do it, since I suspect you know more about both SQLite and Faiss than me ...
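For what it's worth, the storage half of this can already be declared through the factory string, since Faiss factory strings accept vector transforms such as PCA; per the issue, it's the query-side transformation that's missing. A sketch (the factory string below is illustrative, assuming 1,536-dimension input vectors packed down to 96D):

create virtual table vss_documents using vss0(
  document_embedding(1536) factory="PCA96,IVF4096,Flat,IDMap2"
);

Note that, like IVF, a PCA transform needs training data before vectors can be inserted.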

Potential memory leaks

First I want to thank you for an incredible library. It seems to consistently return results in the milliseconds timeframe, where my previous logic had to do a table scan and manually apply the dot product in C#, so this is 3 to 5 orders of magnitude faster. Najs work! :)

Then to my question, which is about memory leaks. I run this in a Kubernetes environment, and I've noticed it keeps eating memory, especially when you re-index and/or query. I know C++ quite well, even though I haven't touched it for more than a decade, and I can see the following code. My comments are prefixed with QUESTION for brevity:

std::vector<float> * vectorFromBlobValue(sqlite3_value*value, const char ** pzErrMsg) {
  int n = sqlite3_value_bytes(value);
  const void * b;
  char header;
  char type;

  if(n < (2)) {
    *pzErrMsg = "Vector blob size less than header length";
    return NULL;
  }
  b = sqlite3_value_blob(value);
  memcpy(&header, ((char *) b + 0), sizeof(char));
  memcpy(&type,   ((char *) b + 1), sizeof(char));

  if(header != VECTOR_BLOB_HEADER_BYTE) {
    *pzErrMsg = "Blob not well-formatted vector blob";
    return NULL;
  }
  if(type != VECTOR_BLOB_HEADER_TYPE) {
    *pzErrMsg = "Blob type not right";
    return NULL;
  }
  int numElements = (n - 2)/sizeof(float);
  float * v = (float *) ((char *)b + 2);


  // QUESTION; Creates new std vector as heap memory and returns to method below
  return new std::vector<float>(v, v+numElements);
}

static std::vector<float>* valueAsVector(sqlite3_value*value) {
  // Option 1: If the value is a "vectorf32v0" pointer, create vector from that
  VectorFloat* v = (VectorFloat*) sqlite3_value_pointer(value, VECTOR_FLOAT_POINTER_NAME);


  // QUESTION; Creates new std vector as heap memory and returns to method below
  if (v!=NULL) return new std::vector<float>(v->data, v->data + v->size);
  std::vector<float> * vec;

  // Option 2: value is a blob in vector format
  if(sqlite3_value_type(value) == SQLITE_BLOB) {
    const char * pzErrMsg = 0;
    if((vec = vectorFromBlobValue(value, &pzErrMsg)) != NULL) {
      return vec;
    }
    if((vec = vectorFromRawBlobValue(value, &pzErrMsg)) != NULL) {
      return vec;
    }
  }
  // Option 3: if value is a JSON array coercible to float vector, use that
  //if(sqlite3_value_subtype(value) == JSON_SUBTYPE) {
  if(sqlite3_value_type(value) == SQLITE_TEXT) {
    if((vec = vectorFromTextValue(value)) != NULL) {
      return vec;
    }else {
      return NULL;
    }
  }

  // else, value isn't a vector
  return NULL;
}

static void vector_value_at(sqlite3_context *context, int argc, sqlite3_value **argv) {

  // QUESTION: Invokes above method returning heaped memory
  std::vector<float>*v = valueAsVector(argv[0]);
  if(v == NULL) return;
  int at = sqlite3_value_int(argv[1]);
  try {
    float result = v->at(at);
    sqlite3_result_double(context, result);
  }
   catch (const std::out_of_range& oor) {
    char * errmsg = sqlite3_mprintf("%d out of range: %s", at, oor.what());
    if(errmsg != NULL){
      sqlite3_result_error(context, errmsg, -1);
      sqlite3_free(errmsg);
    }
    else sqlite3_result_error_nomem(context);
  }

  // QUESTION: v pointer is never deleted?
}

This is only one example of where I suspect the library might be leaking memory. There are other examples, but I am not acquainted enough with SQLite extensions to be sure, and there might be some hidden internals in SQLite freeing up the memory that I can't see. However, I see your library using massive amounts of memory over time, and the usage seems to keep increasing.

Suggestion

Use smart pointers (unique_ptr rather than the deprecated auto_ptr) to move heap-allocated memory around and avoid leaks?

Tracking: Elixir Bindings

Do you have any issues using the new Elixir bindings for sqlite-vss? Comment on this issue with any bugs or crashes you come across, or with suggestions on how to make it better.

TODOs for Elixir bindings:

  • elixir docs work
  • site documentation
  • dedicated ecto example under examples/?
  • Add elixir test to test.yaml workflow?

Compile failure on M1 Mac: Missing OpenMP

Trying to compile from source on Mac OS X Ventura (M1), getting this error:

❯ make loadable
cmake -B build; make -C build
CMake Error at /opt/homebrew/Cellar/cmake/3.26.4/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES)
Call Stack (most recent call first):
  /opt/homebrew/Cellar/cmake/3.26.4/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
  /opt/homebrew/Cellar/cmake/3.26.4/share/cmake/Modules/FindOpenMP.cmake:577 (find_package_handle_standard_args)
  vendor/faiss/faiss/CMakeLists.txt:253 (find_package)


-- Configuring incomplete, errors occurred!
make[1]: *** No targets specified and no makefile found.  Stop.
make: *** [dist/debug/vector0.dylib] Error 2

Any ideas? I've tried installing libomp using homebrew but that didn't seem to help.

doc: use `INTEGER PRIMARY KEY AUTOINCREMENT` instead of default `rowid`

Using the default rowid will cause errors if rows are deleted and a VACUUM is performed: a row's rowid can change, but the stored Faiss index won't be updated to match. This causes hidden errors, which is not great.

So we should document that you should use an explicit INTEGER PRIMARY KEY AUTOINCREMENT column as the primary key for data tables.

create table items(
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  contents text
);

create virtual table vss_items using vss0(
  contents_embedding(384)
);

insert into vss_items(rowid, contents_embedding)
  select id, embedding(contents) from items;

distribution: Rust bindings

Two options to statically include sqlite-vss in a Rust app:

  • Automatically download the correct pre-compiled static files at build-time, if the download-sqlite-vss feature is enabled
  • Make authors define a LIBSQLITE_VSS variable that points to a build

Inspiration:

Support GPU indices

Would need to distribute two versions of sqlite-vss: a normal one with CPU support only, and another with -DFAISS_ENABLE_GPU=ON.

Also might need to add an option to vss0 column declarations to enable GPU:

create virtual table photos using vss0(
  thumbnail_embedding(1024) factory="..." gpu=True
);

optionalDependency not installed by pnpm

Seems pnpm i sqlite-vss won't install sqlite-vss-darwin-x64 on my MacBook Pro 16-inch, 2019

% pnpm ls sqlite-vss-darwin-x64                       
% pnpm ls sqlite-vss           
Legend: production dependency, optional only, dev only

dependencies:
sqlite-vss 0.1.1-alpha.8


Can it be used in c++?

Hello, I am trying to use this library from C++ code. Could you please help me do so?

Include Mac M1 ARM pre-compiled builds

Currently, Github Actions only has Mac runners for x86_64, so we don't have pre-built loadable extensions for Mac M1 ARM platforms. This affects all distribution channels, including the extensions on Github Releases, npm, pip, and Deno.

It should be possible to cross-compile for M1 from an x86_64 runner, which I've done for extensions written in Rust and C. But C++ and CMake make this more challenging, and I haven't figured it out yet. Plus, cross-compiling comes with a heavy performance regression, in my experience.

The easiest solution will be when Github Actions supports Mac M1 runners, but that's not expected until Q4 2023. I could also buy the cheapest M1 Mac mini myself and register it as a self-hosted runner, but I don't have the cash for that (if you're willing to sponsor this, let me know!). I could also use a service like AWS's Mac M1 EC2 instances or MacStadium, but those are expensive and seem complicated to set up with Github Actions.

So, the most likely options are:

  1. Figure out cross-compilation for C++/cmake/faiss on a Mac Github Action runner
  2. Get a sponsor for a physical low-tier Mac M1 mini, and run as a self-hosted Github Action runner
  3. Wait until the end of the year when Github supports Mac M1 runners natively

Make `faiss_avx2` optional

Not all computers support AVX2, so we should expose a compile-time option to disable it, like -DSQLITE_VSS_DISABLE_AVX2. Currently, users have to manually edit sqlite-vss/CMakeLists.txt, which isn't great.

doc: The data table <-> vector index table pattern

create table xyz(a, b, c);

create virtual table vss_xyz using vss0(
  a_embedding(1024), 
  b_embedding(1024),
  c_embedding(1024)
);

-- optional: doc why this could be nice
create table xyz_embeddings(
  a_embedding vector(1024), -- doc why vector(1024) or other options
  b_embedding vector(1024),
  c_embedding vector(1024)
);
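At query time, this pattern pairs a KNN search on the vss0 table with a join back to the data table. A sketch, assuming the tables above:

with knn as (
  select rowid, distance
  from vss_xyz
  where vss_search(a_embedding, :query_vector)
  limit 10
)
select xyz.a, knn.distance
from knn
join xyz on xyz.rowid = knn.rowid;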

Supporting the project

Hello,

This is a great idea and clearly a labor of love.

We are using this extension in our own open source project here: https://github.com/oneirocom/magickml

We were using Docker and Postgres before, but people really struggled with that. A local SQLite instance is safe and secure for our creators, who just want to get going. sqlite-vss is a perfect fit for our use case.

Currently there's no support on M1 or Windows; how can we help with that? Do you have a way we can sponsor this project? OpenCollective perhaps? I have a Macbook and can compile!

Tracking: Go Bindings

Do you have any issues using the new Go bindings for sqlite-vss? Comment on this issue with any bugs or crashes you come across, or with suggestions on how to make it better.

TODOs for the Go bindings:

  • godoc work
  • site documentation
  • cybertron example
  • Add Go test to test.yaml workflow?

Cleaning up `sqlite-vector`

  • Deprecate:
    • vector_from_blob
    • vector_to_blob
    • VECTOR_BLOB_HEADER_BYTE and the like
  • fix/re-think:
    • xValueAsVector - maybe output a data pointer + size and not a std::vector, for better ABI?
    • xResultVector - take in a data pointer and size?
    • Maybe xResultVector for json/bytes?
    • sqlite3_create_function_v2 on "vector0" weirdness

`UPSERT`-like statements fail on reconstruction

Save this in bug.sql:

.bail on
.mode box
.header on

.load ./vector0
.load ./vss0

create virtual table demo using vss0(a(2));
insert into demo(rowid, a)
  values (1, '[1.0, 2.0]'), (2, '[2.0, 3.0]');

select rowid, vector_debug(a) from demo; -- succeeds

delete from demo where rowid = 1;
insert into demo(rowid, a) select 1, '[99.0, 99.0]';

delete from demo where rowid = 2;
insert into demo(rowid, a) select 2, '[299.0, 299.0]';

select rowid, vector_debug(a) from demo;

And run it:

sqlite3 :memory: '.read bug.sql'
β”Œβ”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ rowid β”‚       vector_debug(a)        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1     β”‚ size: 2 [1.000000, 2.000000] β”‚
β”‚ 2     β”‚ size: 2 [2.000000, 3.000000] β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Runtime error near line 21: Error reconstructing vector - Does the column factory string end with IDMap2? Full error: Error in virtual void faiss::IndexIDMap2Template<faiss::Index>::reconstruct(idx_t, typename IndexT::component_t *) const [IndexT = faiss::Index] at /Users/runner/work/sqlite-vss/sqlite-vss/vendor/faiss/faiss/IndexIDMap.cpp:236: key 1 not found

Weirdly, if you only "upsert" one row it seems to work just fine...

Using this library on iOS

iOS doesn't allow dynamic libraries; things need to be statically linked. Has anyone done the work to get this working in that environment?

Can't compile on Mac (x86 CPU)

Following your recipe to the last detail, nothing seems to work. It gives me ...

This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.

It was created by sqlite configure 3.40.1, which was
generated by GNU Autoconf 2.71.  Invocation command line was

  $ ./configure

## --------- ##
## Platform. ##
## --------- ##

hostname = Thomass-MacBook-Air.local
uname -m = x86_64
uname -r = 20.6.0
uname -s = Darwin
uname -v = Darwin Kernel Version 20.6.0: Thu Mar  9 20:39:26 PST 2023; root:xnu-7195.141.49.700.6~1/RELEASE_X86_64



## ----------- ##
## Core tests. ##
## ----------- ##

configure:3339: checking for gcc
configure:3371: result: /usr/local/opt/llvm/bin/clang
configure:3724: checking for C compiler version
configure:3733: /usr/local/opt/llvm/bin/clang --version >&5
Homebrew clang version 16.0.5
Target: x86_64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /usr/local/opt/llvm/bin
configure:3744: $? = 0
configure:3764: checking whether the C compiler works
configure:3786: /usr/local/opt/llvm/bin/clang  -I/usr/local/opt/llvm/include -L/usr/local/opt/llvm/lib conftest.c  >&5
ld: library not found for -lSystem
clang-16: error: linker command failed with exit code 1 (use -v to see invocation)
configure:3790: $? = 1
configure:3830: result: no
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "sqlite"
| #define PACKAGE_TARNAME "sqlite"
| #define PACKAGE_VERSION "3.40.1"
| #define PACKAGE_STRING "sqlite 3.40.1"
| #define PACKAGE_BUGREPORT "http://www.sqlite.org"
| #define PACKAGE_URL ""
| #define PACKAGE "sqlite"
| #define VERSION "3.40.1"
| /* end confdefs.h.  */
| 
| int
| main (void)
| {
| 
|   ;
|   return 0;
| }
configure:3835: error: in `/Users/thomashansen/Documents/projects/magic/sqlite-vss/vendor/sqlite':
configure:3837: error: C compiler cannot create executables
See `config.log' for more details

configure: exit 77

Tracking: Ruby Bindings

Do you have any issues using the new Ruby bindings for sqlite-vss? Comment on this issue with any bugs or crashes you come across, or with suggestions on how to make it better.

TODOs for Ruby bindings:

  • rubygems.org README
  • site documentation
  • Some ruby example
  • Add ruby test to test.yaml workflow?

doc: tracking

  • getting started
  • Python
  • Node.js
  • Deno
  • Datasette
  • Loadable
  • Comparisons
  • Building from Source
  • API Reference

`vss0` constructor: table-level options

create virtual table demo1 using vss0(
  factory="IVF4096,Flat,IDMap2",
  metric_type=L1,
  a(100),
  b(100),
  c(100)
);
create virtual table demo2 using vss0(
  factory="IVF4096,Flat,IDMap2",
  metric_type=L1,
  a(100) factory="Flat,IDMap2",
  b(100) metric_type=L2,
  c(100)
);

Storing non-vector column in `vss0` virtual tables

Spinoff of #2.

Instead of storing "metadata" in a separate regular SQLite table, you should be able to store non-vector columns (integers, text, dates) inside the vss0 virtual table, like so:

create virtual table vss_articles using vss0(
  headline_embeddings(384),
  description_embeddings(384),
  published_at date,
  newsroom text
);

That way you don't need weird JOINs to supplement your vector queries. It could also be used to simplify pre-filtering.

things to consider:

  • Constructor needs to be able to distinguish between non-vector and vector columns
  • Probably need to limit the allowed column type names, likely to just TEXT/INTEGER/DATETIME etc.
  • can extend the current *_data shadow table

Other notes:

  • store non-vector columns in a STRICT shadow table if on SQLite 3.37.0 or later. Or maybe as an option?
  • Automatic indexes on data shadow table
  • Capture additional constraints on non-vector columns and use in pre-filters on KNN queries
  • Maybe store in some bespoke OLAP-style columnar format?

load fails on ubuntu

Hi

When trying to load vss0.so, sqlite3 core-dumps (vector0 succeeds):

SQLite version 3.42.0 2023-02-23 14:43:15
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load ./vector0
sqlite> .load ./vss0
Illegal instruction (core dumped)

Ubuntu 20.04.5 LTS, freshly compiled sqlite3 from trunk.

Tried to build from source; succeeded for vector0 so far (after getting a recent cmake ...).
With vss0, I'm stuck here (faiss apparently built in vendor/ in an earlier attempt):

~/MAKE/sqlite-vss$ make loadable
cmake -B build; make -C build
-- Could NOT find MKL (missing: MKL_LIBRARIES) 
-- Configuring done
-- Generating done
-- Build files have been written to: /home/myself/MAKE/sqlite-vss/build
make[1]: Entering directory '/home/myself/MAKE/sqlite-vss/build'
make[2]: Entering directory '/home/myself/MAKE/sqlite-vss/build'
make[3]: Entering directory '/home/myself/MAKE/sqlite-vss/build'
make[3]: Leaving directory '/home/myself/MAKE/sqlite-vss/build'
[ 48%] Built target faiss_avx2
make[3]: Entering directory '/home/myself/MAKE/sqlite-vss/build'
make[3]: Leaving directory '/home/myself/MAKE/sqlite-vss/build'
make[3]: Entering directory '/home/myself/MAKE/sqlite-vss/build'
[ 49%] Building CXX object CMakeFiles/sqlite-vss.dir/src/extension.cpp.o
/home/myself/MAKE/sqlite-vss/src/extension.cpp:496:34: error: β€˜idx_t’ is not a member of β€˜faiss’
  496 |   std::vector<std::vector<faiss::idx_t>*> * insert_to_add_ids;
      |                                  ^~~~~
