Giter Site home page Giter Site logo

birchkwok / lynsedb Goto Github PK

View Code? Open in Web Editor NEW
51.0 9.0 10.0 20.28 MB

A pure Python-implemented, lightweight, server-optional, multi-end compatible, vector database deployable locally or remotely.

License: Apache License 2.0

Python 69.14% Jupyter Notebook 30.76% Dockerfile 0.10%
vector-database vector-search database llm-chain pure-python nlp rag vectordb approximate-nearest-neighbor-search embedding-similarity

lynsedb's Introduction

LynseDB logo

Discord PyPI version PyPI - Python Version Python testing Docker build

Server-optional, simple parameters, simple API.

Fast, memory-efficient, easily scales to millions of vectors.

Friendly caching technology stores recently queried vectors for accelerated access.

Based on a generic Python software stack, platform-independent, highly versatile.

LynseDB is a vector database implemented purely in Python, designed to be lightweight, server-optional, and easy to deploy locally or remotely. It offers straightforward and clear Python APIs, aiming to lower the entry barrier for using vector databases.

LynseDB focuses on achieving 100% recall, prioritizing recall accuracy over high-speed search performance. This approach ensures that users can reliably retrieve all relevant vector data, making LynseDB particularly suitable for applications that require responses within hundreds of milliseconds.

  • Now supports HTTP API and Python local code API and Docker deployment.
  • Now supports transaction management; if a commit fails, it will automatically roll back.

⚠️ WARNING

Not yet backward compatible

LynseDB is actively being updated, and API backward compatibility is not guaranteed. You should use version numbers as a strong constraint during deployment to avoid unnecessary feature conflicts and errors.

Data size constraints

Although our goal is to enable brute force search or inverted indexing on billion-scale vectors, we currently still recommend using it on a scale of millions of vectors or less for the best experience.

python's native api is not process-safe

The Python native API is recommended for use in single-process environments, whether single-threaded or multi-threaded; for ensuring process safety in multi-process environments, please use the HTTP API.

Prerequisite

  • python version >= 3.9
  • Owns one of the operating systems: Windows, macOS, or Ubuntu (or other Linux distributions). The recommendation is for the latest version of the system, but non-latest versions should also be installable, although they have not been tested.
  • Memory >= 4GB, Free Disk >= 4GB.

Install Client API package (Mandatory)

pip install LynseDB

If you wish to use Docker (Optional)

You must first install Docker on the host machine.

docker pull birchkwok/LynseDB:latest

Qucik Start

import lynse

print("LynseDB version is: ", lynse.__version__)
LynseDB version is:  0.0.1

Initialize Database

LynseDB now supports HTTP API and Python native code API.

The HTTP API mode requires starting an HTTP server beforehand. You have two options:

  • start directly.

    For direct startup, the default port is 7637. You can run the following command in the terminal to start the service:

lynse run --host localhost --port 7637
  • within Docker

    In Docker, You can run the following command in the terminal to start the service:

docker run -p 7637:7637 birchkwok/LynseDB:latest
  • Remote deploy

    If you want to deploy remotely, you can bind the image to port 80 of the remote host, or allow the host to open access to port 7637. such as:

docker run -p 80:7637 birchkwok/LynseDB:latest
  • test if api available

    You can directly request in the browser http://localhost:7637

    For port 80, you can use this url: http://localhost

    If the image is bound to port 80 of the host in remote deployment, you can directly access it http://your_host_ip

# If you are in a Jupyter environment, you can use this method to start the backend server
# Ignore this code if you are using docker
lynse.launch_in_jupyter()
Server running at http://127.0.0.1:7637
# Use the HTTP API mode, it is suitable for use in production environments.
client = lynse.VectorDBClient("http://127.0.0.1:7637")  # If no url is passed, the native api is used.
# Create a database named "test_db", if it already exists, delete it and rebuild it.
my_db = client.create_database("test_db", drop_if_exists=True)

create a collection

WARNING

When using the require_collection method to request a collection, if the drop_if_exists parameter is set to True, it will delete all content of the collection if it already exists.

A safer method is to use the get_collection method. It is recommended to use the require_collection method only when you need to reinitialize a collection or create a new one.

collection = my_db.require_collection("test_collection", dim=4, drop_if_exists=True, scaler_bits=None, description="demo collection")
2024-06-16 19:49:44 - LynseDB - INFO - Creating collection test_collection with: 
//    dim=4, collection='test_collection', 
//    chunk_size=100000, distance='cosine', 
//    dtypes='float32', use_cache=True, 
//    scaler_bits=None, n_threads=10, 
//    warm_up=False, drop_if_exists=True, 
//    description=demo collection, 

show database collections

my_db.show_collections_details()
dim chunk_size dtypes distance use_cache scaler_bits n_threads warm_up initialize_as_collection description buffer_size
test_collection 4 100000 float32 cosine True None 10 False True demo collection 20

update description

collection.update_description("Hello World")
my_db.show_collections_details()
dim chunk_size dtypes distance use_cache scaler_bits n_threads warm_up initialize_as_collection description buffer_size
test_collection 4 100000 float32 cosine True None 10 False True Hello World 20

Add vectors

When inserting vectors, the collection requires manually running the commit function or inserting within the insert_session function context manager, which will run the commit function in the background.

It is strongly recommended to use the insert_session context manager for insertion, as this provides more comprehensive data security features during the insertion process.

with collection.insert_session() as session:
    id = session.add_item(vector=[0.01, 0.34, 0.74, 0.31], id=1, field={'field': 'test_1', 'order': 0})   # id = 0
    id = session.add_item(vector=[0.36, 0.43, 0.56, 0.12], id=2, field={'field': 'test_1', 'order': 1})   # id = 1
    id = session.add_item(vector=[0.03, 0.04, 0.10, 0.51], id=3, field={'field': 'test_2', 'order': 2})   # id = 2
    id = session.add_item(vector=[0.11, 0.44, 0.23, 0.24], id=4, field={'field': 'test_2', 'order': 3})   # id = 3
    id = session.add_item(vector=[0.91, 0.43, 0.44, 0.67], id=5, field={'field': 'test_2', 'order': 4})   # id = 4
    id = session.add_item(vector=[0.92, 0.12, 0.56, 0.19], id=6, field={'field': 'test_3', 'order': 5})   # id = 5
    id = session.add_item(vector=[0.18, 0.34, 0.56, 0.71], id=7, field={'field': 'test_1', 'order': 6})   # id = 6
    id = session.add_item(vector=[0.01, 0.33, 0.14, 0.31], id=8, field={'field': 'test_2', 'order': 7})   # id = 7
    id = session.add_item(vector=[0.71, 0.75, 0.91, 0.82], id=9, field={'field': 'test_3', 'order': 8})   # id = 8
    id = session.add_item(vector=[0.75, 0.44, 0.38, 0.75], id=10, field={'field': 'test_1', 'order': 9})  # id = 9

# If you do not use the insert_session function, you need to manually call the commit function to submit the data
# collection.commit()

# or use the bulk_add_items function
# with collection.insert_session():
#     ids = collection.bulk_add_items([([0.01, 0.34, 0.74, 0.31], 0, {'field': 'test_1', 'order': 0}), 
#                                      ([0.36, 0.43, 0.56, 0.12], 1, {'field': 'test_1', 'order': 1}), 
#                                      ([0.03, 0.04, 0.10, 0.51], 2, {'field': 'test_2', 'order': 2}),
#                                      ([0.11, 0.44, 0.23, 0.24], 3, {'field': 'test_2', 'order': 3}), 
#                                      ([0.91, 0.43, 0.44, 0.67], 4, {'field': 'test_2', 'order': 4}), 
#                                      ([0.92, 0.12, 0.56, 0.19], 5, {'field': 'test_3', 'order': 5}),
#                                      ([0.18, 0.34, 0.56, 0.71], 6, {'field': 'test_1', 'order': 6}), 
#                                      ([0.01, 0.33, 0.14, 0.31], 7, {'field': 'test_2', 'order': 7}), 
#                                      ([0.71, 0.75, 0.91, 0.82], 8, {'field': 'test_3', 'order': 8}),
#                                      ([0.75, 0.44, 0.38, 0.75], 9, {'field': 'test_1', 'order': 9})])
# print(ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2024-06-16 19:49:44 - LynseDB - INFO - Saving data...
2024-06-16 19:49:44 - LynseDB - INFO - Writing chunk to storage...
2024-06-16 19:49:44 - LynseDB - INFO - Task status: {'status': 'Processing'}
2024-06-16 19:49:44 - LynseDB - INFO - Writing chunk to storage done.
2024-06-16 19:49:46 - LynseDB - INFO - Task status: {'result': {'collection_name': 'test_collection', 'database_name': 'test_db'}, 'status': 'Success'}

Find the nearest neighbors of a given vector

The default similarity measure for query is Inner Product (IP). You can specify cosine or L2 to obtain the similarity measure you need.

ids, scores, fields = collection.search(vector=[0.36, 0.43, 0.56, 0.12], k=3, distance="cosine", return_fields=True)
print("ids: ", ids)
print("scores: ", scores)
print("fields: ", fields)
ids:  [2 9 1]
scores:  [1.         0.92355633 0.86097705]
fields:  [{':id:': 2, 'field': 'test_1', 'order': 1}, {':id:': 9, 'field': 'test_3', 'order': 8}, {':id:': 1, 'field': 'test_1', 'order': 0}]

The query_report_ attribute is the report of the most recent query. When multiple queries are conducted simultaneously, this attribute will only save the report of the last completed query result.

print(collection.search_report_)
* - MOST RECENT SEARCH REPORT -
| - Collection Shape: (10, 4)
| - Search Time: 0.01578 s
| - Search Distance: cosine
| - Search K: 3
| - Top 3 Results ID: [2 9 1]
| - Top 3 Results Similarity: [1.         0.92355633 0.86097705]

List data

collection.head(10)
(array([[0.01      , 0.34      , 0.74000001, 0.31      ],
        [0.36000001, 0.43000001, 0.56      , 0.12      ],
        [0.03      , 0.04      , 0.1       , 0.50999999],
        [0.11      , 0.44      , 0.23      , 0.23999999],
        [0.91000003, 0.43000001, 0.44      , 0.67000002],
        [0.92000002, 0.12      , 0.56      , 0.19      ],
        [0.18000001, 0.34      , 0.56      , 0.70999998],
        [0.01      , 0.33000001, 0.14      , 0.31      ],
        [0.70999998, 0.75      , 0.91000003, 0.81999999],
        [0.75      , 0.44      , 0.38      , 0.75      ]]),
 array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
 [{':id:': 1, 'field': 'test_1', 'order': 0},
  {':id:': 2, 'field': 'test_1', 'order': 1},
  {':id:': 3, 'field': 'test_2', 'order': 2},
  {':id:': 4, 'field': 'test_2', 'order': 3},
  {':id:': 5, 'field': 'test_2', 'order': 4},
  {':id:': 6, 'field': 'test_3', 'order': 5},
  {':id:': 7, 'field': 'test_1', 'order': 6},
  {':id:': 8, 'field': 'test_2', 'order': 7},
  {':id:': 9, 'field': 'test_3', 'order': 8},
  {':id:': 10, 'field': 'test_1', 'order': 9}])
collection.tail(5)
(array([[0.92000002, 0.12      , 0.56      , 0.19      ],
        [0.18000001, 0.34      , 0.56      , 0.70999998],
        [0.01      , 0.33000001, 0.14      , 0.31      ],
        [0.70999998, 0.75      , 0.91000003, 0.81999999],
        [0.75      , 0.44      , 0.38      , 0.75      ]]),
 array([ 6,  7,  8,  9, 10]),
 [{':id:': 6, 'field': 'test_3', 'order': 5},
  {':id:': 7, 'field': 'test_1', 'order': 6},
  {':id:': 8, 'field': 'test_2', 'order': 7},
  {':id:': 9, 'field': 'test_3', 'order': 8},
  {':id:': 10, 'field': 'test_1', 'order': 9}])

Use Filter

Using the Filter class for result filtering can maximize Recall.

The Filter class now supports must, any, and must_not parameters, all of which only accept list-type argument values.

The filtering conditions in must must be met, those in must_not must not be met.

After filtering with must and must_not conditions, the conditions in any will be considered, and at least one of the conditions in any must be met.

If there is a conflict between the conditions in any and those in must or must_not, the conditions in any will be ignored.

import operator

from lynse.core_components.kv_cache.filter import Filter, FieldCondition, MatchField, MatchID, MatchRange

collection.search(
    vector=[0.36, 0.43, 0.56, 0.12],
    k=10,
    search_filter=Filter(
        must=[
            FieldCondition(key='field', matcher=MatchField('test_1')),  # Support for filtering fields
        ],
        any=[
            FieldCondition(key='order', matcher=MatchRange(start=0, end=8, inclusive=True)),
            FieldCondition(key=":match_id:", matcher=MatchID([1, 2, 3, 4, 5])),  # Support for filtering IDs
        ],
        must_not=[
            FieldCondition(key=":match_id:", matcher=MatchID([8])),
            FieldCondition(key='order', matcher=MatchField(8, comparator=operator.ge)),
        ]
    )
)

print(collection.search_report_)
* - MOST RECENT SEARCH REPORT -
| - Collection Shape: (10, 4)
| - Search Time: 0.00729 s
| - Search Distance: cosine
| - Search K: 10
| - Top 10 Results ID: [2 1 7]
| - Top 10 Results Similarity: [1.         0.86097705 0.7741583 ]

Query existing text in fields

Query via Filter

query_filter=Filter(
    must=[
        FieldCondition(key='field', matcher=MatchField('test_1')),  # Support for filtering fields
    ],
    any=[
        FieldCondition(key='order', matcher=MatchRange(start=0, end=8, inclusive=True)),
        FieldCondition(key=":match_id:", matcher=MatchID([1, 2, 3, 4, 5])),  # Support for filtering IDs
    ],
    must_not=[
        FieldCondition(key=":match_id:", matcher=MatchID([8])),
        FieldCondition(key='order', matcher=MatchField(8, comparator=operator.ge)),
    ]
)

collection.query(query_filter)
[{':id:': 1, 'field': 'test_1', 'order': 0},
 {':id:': 2, 'field': 'test_1', 'order': 1},
 {':id:': 7, 'field': 'test_1', 'order': 6}]

Precision Query

collection.query({':id:': 1, 'field': 'test_1', 'order': 0})
[{':id:': 1, 'field': 'test_1', 'order': 0}]

Fuzzy Query

collection.query({'field': 'test_1'})
[{':id:': 1, 'field': 'test_1', 'order': 0},
 {':id:': 2, 'field': 'test_1', 'order': 1},
 {':id:': 7, 'field': 'test_1', 'order': 6},
 {':id:': 10, 'field': 'test_1', 'order': 9}]

Drop a collection

WARNING: This operation cannot be undone

print("Collection list before dropping:", my_db.show_collections())
status = my_db.drop_collection("test_collection")
print("Collection list after dropped:", my_db.show_collections())
Collection list before dropping: ['test_collection']
Collection list after dropped: []

Drop the database

WARNING: This operation cannot be undone

my_db.drop_database()
my_db
Database `test_db` does not exist on the LynseDB remote server.

What's Next

lynsedb's People

Contributors

birchkwok avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

lynsedb's Issues

File not found error

I have crated a VectorDB and I am trying to open the db and collection to run some searches on it. When I do my_db.list_databases() I get this following error

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:\Users\\.LynseDB\databases\databases'

The file exists in the folder, I have checked it. I tried resaving it as databases.json, the error goes but the list returned is empty.

What is the proper way of opening a collection once it has been created?

Request for capability

Thanks a lot for making this- for people on systems that aren't allowed to install C++ compilers, having a pure-Python vector database is wonderful.

I'm not seeing a means of retrieving the metadata (e.g. "fields" in the nomenclature you use in your readme example) associated with a given record at this time; for my database, I stored the original string the vector represents, in the hope of retrieving whatever string most-closely matches my query. If it's possible to do this, can you please add that to your tutorials? If it's not a feature would you please consider adding it?

Thanks again for building this cool project.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.