maheskrishnan / vectorflow

This project forked from dgarnitz/vectorflow

VectorFlow is a high volume vector embedding pipeline that ingests raw data, transforms it into vectors and writes it to a vector DB of your choice.

Home Page: https://www.getvectorflow.com/

License: Apache License 2.0

Open source, high-throughput, fault-tolerant vector embedding pipeline

Simple API endpoint that ingests large volumes of raw data, processes, and stores or returns the vectors quickly and reliably

Introduction

VectorFlow is an open source, high-throughput, fault-tolerant vector embedding pipeline. With a simple API request, you can send raw data that will be embedded and either stored in a vector database of your choice or returned to you.

The current version is an MVP and should not be used in production yet. Right now the system only supports uploading a single TXT or PDF file at a time, up to 2GB.

Run it Locally

Docker-Compose

The best way to run VectorFlow is via docker compose.

1) Set Environment Variables

First create a folder in the root for all the environment variables:

mkdir env_scripts
cd env_scripts
touch env_vars.env

This creates a file called env_vars.env in the env_scripts folder. Add all of the environment variables listed below to that file:

INTERNAL_API_KEY=your-choice
POSTGRES_USERNAME=postgres
POSTGRES_PASSWORD=your-choice
POSTGRES_DB=your-choice
POSTGRES_HOST=postgres
RABBITMQ_USERNAME=guest
RABBITMQ_PASSWORD=guest
RABBITMQ_HOST=rabbitmq
RABBITMQ_QUEUE=your-choice

You can freely choose the values of INTERNAL_API_KEY, POSTGRES_PASSWORD, POSTGRES_DB, and RABBITMQ_QUEUE.
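Before starting the containers, it can help to confirm that the file defines every variable listed above. The following is a minimal sketch, not part of VectorFlow: the helper names `parse_env_text` and `missing_keys` are hypothetical, and it assumes the file holds plain KEY=value lines.

```python
from pathlib import Path

# Variable names copied from the list above; everything else here is hypothetical.
REQUIRED_KEYS = [
    "INTERNAL_API_KEY",
    "POSTGRES_USERNAME",
    "POSTGRES_PASSWORD",
    "POSTGRES_DB",
    "POSTGRES_HOST",
    "RABBITMQ_USERNAME",
    "RABBITMQ_PASSWORD",
    "RABBITMQ_HOST",
    "RABBITMQ_QUEUE",
]


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty, in list order."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path("env_scripts/env_vars.env")
    if env_path.exists():
        print("Missing keys:", missing_keys(parse_env_text(env_path.read_text())))
    else:
        print(f"{env_path} not found")
```

Run from the repository root, this prints an empty list once every variable has a value.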

2) Run Docker-Compose

If you are running locally, make sure you pull the RabbitMQ and Postgres images into your local Docker repository:

docker pull rabbitmq
docker pull postgres

Then run:

docker-compose build --no-cache
docker-compose up -d

Note that the db-init container runs a script that sets up the database schema and stops after the script completes.

Using VectorFlow

To use VectorFlow in a live system, make an HTTP request to your API's URL at port 8000 - for example, localhost:8000 from your development machine, or vectorflow_api:8000 from within another docker container.

Request & Response Payload

All requests require an Authorization HTTP header whose value matches the INTERNAL_API_KEY environment variable you defined above. You must also pass your vector database API key in the X-VectorDB-Key HTTP header and your embedding API key in the X-EmbeddingAPI-Key header.
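The three headers can be assembled in one place so that every request carries them. This is only a sketch: the function name `build_headers` and the empty-value check are assumptions for illustration, while the header names come from the paragraph above.

```python
def build_headers(internal_api_key: str, embedding_api_key: str, vector_db_key: str) -> dict[str, str]:
    """Return the HTTP headers described above (hypothetical helper, not a VectorFlow API)."""
    headers = {
        "Authorization": internal_api_key,        # must match INTERNAL_API_KEY
        "X-EmbeddingAPI-Key": embedding_api_key,  # key for the embedding provider
        "X-VectorDB-Key": vector_db_key,          # key for the vector database
    }
    # Assumption: fail early on empty values rather than sending a request that will be rejected.
    empty = [name for name, value in headers.items() if not value]
    if empty:
        raise ValueError(f"Missing values for headers: {', '.join(empty)}")
    return headers
```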

VectorFlow currently supports OpenAI ADA embeddings and the Pinecone, Qdrant, Weaviate and Milvus vector databases.

To check the status of a job, make a GET request to this endpoint: /jobs/<int:job_id>/status. The response will be in the form:

{
    'JobStatus': job_status.value
}
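A status check could be scripted with the Python standard library as follows. This is a sketch under assumptions: the function names are hypothetical, the base URL is the local one used elsewhere in this README, and the response body is assumed to be JSON with the JobStatus key shown above.

```python
import json
import urllib.request


def build_status_request(base_url: str, job_id: int, internal_api_key: str) -> urllib.request.Request:
    """Build a GET request for /jobs/<job_id>/status without sending it."""
    url = f"{base_url.rstrip('/')}/jobs/{int(job_id)}/status"
    return urllib.request.Request(url, headers={"Authorization": internal_api_key}, method="GET")


def parse_job_status(body) -> str:
    """Extract the JobStatus field from a JSON response body (str or bytes)."""
    return json.loads(body)["JobStatus"]


def fetch_job_status(base_url: str, job_id: int, internal_api_key: str, timeout: float = 10.0) -> str:
    """Send the request and return the reported status; needs a running API."""
    request = build_status_request(base_url, job_id, internal_api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_job_status(response.read())
```

With the stack running locally, `fetch_job_status("http://localhost:8000", job_id, "your-internal-api-key")` would return whatever status string the API reports for that job.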

To submit a job for embedding, make a POST request to this endpoint: /embed with the following payload and the 'Content-Type: multipart/form-data' header:

{
    'SourceData=path_to_txt_file'
    'LinesPerBatch=4096'
    'EmbeddingsMetadata={
        "embeddings_type": "open_ai",
        "chunk_size": 512,
        "chunk_overlap": 128
    }'
    'VectorDBMetadata={
        "vector_db_type": "pinecone",
        "index_name": "index_name",
        "environment": "env_name"
    }'
}

You will get the following payload back:

{
    'message': f"Successfully added {batch_count} batches to the queue",
    'JobID': job_id
}
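In the request payload above, the two metadata fields are JSON strings sent as ordinary form fields next to the uploaded file. The sketch below shows one way to prepare those text fields in Python. The helper name `build_embed_fields` is hypothetical, and treating LinesPerBatch as optional is an assumption based on the sample curl request, which omits it.

```python
import json
from typing import Optional


def build_embed_fields(
    embeddings_metadata: dict,
    vector_db_metadata: dict,
    lines_per_batch: Optional[int] = None,
) -> dict[str, str]:
    """Serialize the non-file form fields for a POST to /embed (hypothetical helper)."""
    fields = {
        # Each metadata object travels as a JSON-encoded string in its own form field.
        "EmbeddingsMetadata": json.dumps(embeddings_metadata),
        "VectorDBMetadata": json.dumps(vector_db_metadata),
    }
    if lines_per_batch is not None:
        fields["LinesPerBatch"] = str(lines_per_batch)
    return fields


if __name__ == "__main__":
    fields = build_embed_fields(
        {"embeddings_type": "open_ai", "chunk_size": 512, "chunk_overlap": 128},
        {"vector_db_type": "pinecone", "index_name": "index_name", "environment": "env_name"},
        lines_per_batch=4096,
    )
    for name, value in fields.items():
        print(f"{name}={value}")
```

The SourceData file itself still has to be attached as a multipart file part. With the third-party requests library, for example, these fields could be passed as `data=fields` and the file as `files={"SourceData": open(path, "rb")}`.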

Sample Curl Request

The following request embeds a TXT document with OpenAI's ADA model and uploads the results to a Pinecone index called test, so make sure that your Pinecone index has that name. If you run the curl command from the root directory, the path to the test file is ./src/api/tests/fixtures/test_text.txt; change this path if you want to embed another TXT document.

curl -X POST -H 'Content-Type: multipart/form-data' \
    -H "Authorization: INTERNAL_API_KEY" \
    -H "X-EmbeddingAPI-Key: your-key-here" \
    -H "X-VectorDB-Key: your-key-here" \
    -F 'EmbeddingsMetadata={"embeddings_type": "open_ai", "chunk_size": 256, "chunk_overlap": 128}' \
    -F 'SourceData=@./src/api/tests/fixtures/test_text.txt' \
    -F 'VectorDBMetadata={"vector_db_type": "pinecone", "index_name": "test", "environment": "us-east-1-aws"}' \
    http://localhost:8000/embed

To check the status of the job, run:

curl -X GET -H "Authorization: INTERNAL_API_KEY" http://localhost:8000/jobs/<job_id>/status

Vector Database Schema Standard

VectorFlow enforces a standardized schema for uploading data to a vector store:

id: int
source_data: string
embeddings: float array

The id can be used for deduplication and idempotency. Please note that for Weaviate, the id is called vectorflow_id. We plan to support dynamically detected and/or configurable schemas down the road.
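For readers who want to mirror this schema in their own code, a record could be modeled as below. This is an illustrative sketch only: the class name `VectorRecord`, its method, and the use of "weaviate" as the `vector_db_type` value are assumptions, not VectorFlow's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class VectorRecord:
    """One row of the standardized schema: id, source_data, embeddings."""

    id: int
    source_data: str
    embeddings: list[float]

    def to_dict(self, vector_db_type: str = "pinecone") -> dict:
        """Return the record as a dict, renaming id to vectorflow_id for Weaviate."""
        # Assumption: the Weaviate backend is selected with the string "weaviate".
        id_field = "vectorflow_id" if vector_db_type == "weaviate" else "id"
        return {
            id_field: self.id,
            "source_data": self.source_data,
            "embeddings": list(self.embeddings),
        }
```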

Contributing

We love feedback from the community. If you have an idea of how to make this project better, we encourage you to open an issue or join our Discord. Please tag dgarnitz and danmeier2.

Our roadmap is outlined in the section below and we would love help in building it out. We recommend you open an issue with a proposed approach in mind before submitting a PR.

Please tag dgarnitz on all PRs.

Roadmap

  • Connectors to other vector databases
  • Support for more file types such as csv, word, xls, etc
  • Support for multi-file, directory data ingestion from sources such as Salesforce, Google Drive, etc
  • Support open source embedding models
  • Retry mechanism
  • Langchain & Llama Index integrations
  • Support callbacks for writing object metadata to a separate store
  • Dynamically configurable vector DB schemas
  • Deduplication capabilities

Contributors

dgarnitz, danmeier2, david-vectorflow
