
template-vector-ingestion-redis's Introduction

Continuous Vector Ingestion

This template shows you how to continuously ingest documents into a vector store using Apache Kafka. For simplicity, this use case is illustrated by streaming data from small CSV files that represent updates to a book catalog. The descriptive text from the catalog entries is embedded and then ingested into a vector store for semantic search. In a production scenario, you might use Change Data Capture (CDC) to ensure that the vector store is in sync with the book catalog database. For more information on the production use cases that this template supports, see the accompanying blog article.
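As a rough illustration of the first step, the sketch below turns catalog rows from a CSV file into keyed JSON messages, which is the shape a Kafka producer typically expects. It uses only the Python standard library and stops short of actually producing to Kafka. The column names (`id`, `title`, `description`), the sample rows, and the `rows_to_messages` helper are assumptions for illustration; the template's real CSV schema may differ.

```python
import csv
import io
import json

# Hypothetical catalog update; the real CSV files may use different columns.
SAMPLE_CSV = """id,title,description
1,Example Book A,A placeholder description about space travel.
2,Example Book B,A placeholder description about a desert planet.
"""


def rows_to_messages(csv_text):
    """Convert each CSV row into a (key, value) pair of bytes.

    The key is the row's id, so later updates to the same catalog entry
    share a key. The value is the whole row encoded as JSON.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    messages = []
    for row in reader:
        key = row["id"].encode("utf-8")
        value = json.dumps(row).encode("utf-8")
        messages.append((key, value))
    return messages


messages = rows_to_messages(SAMPLE_CSV)
for key, value in messages:
    # A real pipeline would hand each pair to a Kafka producer here.
    print(key, value)
```

Keying each message by the catalog entry's ID is a design choice rather than a requirement: it keeps all updates for one book on the same Kafka partition, so they are consumed in order.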

This template uses the following open source libraries:

  • Quix Streams to produce data to, and consume data from, Apache Kafka.

  • Qdrant Client to create a database that stores embeddings and to run basic similarity searches.
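To make the "store embeddings" step more concrete, here is a minimal in-memory stand-in for a vector store, written with the standard library only. It is not the Qdrant Client API: the `TinyVectorStore` class, its method names, and the sample vectors are all hypothetical. It illustrates one property that matters for continuous ingestion: points are upserted by ID, so re-ingesting an updated catalog entry replaces the old vector instead of creating a duplicate.

```python
class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector store (not Qdrant's API)."""

    def __init__(self):
        self._points = {}  # point id -> (vector, payload)

    def upsert(self, point_id, vector, payload):
        # Insert a new point, or overwrite the existing one with the same id.
        self._points[point_id] = (list(vector), dict(payload))

    def get(self, point_id):
        return self._points[point_id]

    def __len__(self):
        return len(self._points)


store = TinyVectorStore()
store.upsert(1, [0.1, 0.9], {"title": "Example Book A"})
store.upsert(2, [0.8, 0.2], {"title": "Example Book B"})

# A later catalog update for id 1 replaces the earlier entry.
store.upsert(1, [0.2, 0.8], {"title": "Example Book A (revised)"})

print(len(store))                # 2
print(store.get(1)[1]["title"])  # Example Book A (revised)
```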

The following screenshot illustrates the architecture of the resulting pipeline in Quix Cloud: [Pipeline screenshot]

You can also try out a minimal version of this pipeline in a standalone Jupyter notebook.

  • To run it in Google Colab, click Open in Colab.

Trying it out

To try out the pipeline, first clone the vector ingestion template (for more information on how to clone a project template, see the article "How to create a project from a template in Quix").

Before you clone the pipeline, you’ll also need to sign up for a free trial account with Qdrant Cloud (you can sign up with your GitHub or Google account). When you clone the project template in Quix, you’ll be asked for your Qdrant Cloud credentials.

When running the project, you'll ingest content in two passes:

  • In the first pass, you'll add some initial entries to a "book-catalog" vector store via Kafka, then search the vector store (we've used the example query "book like star wars") to check that the data was ingested correctly.
  • In the second pass, you'll go through the whole process again (albeit faster) with new data, and see how the matches change for the same search query.

Run the first ingestion test

  1. Press play on the first job (the one whose name starts with “PT1…”). To do so, hover your mouse over the “stopped” button and click play.

    This will ingest the first part of the same “sci-fi books” sample dataset that we used in the notebook.

  2. On the “Streamlit Dashboard service”, click the blue “launch” icon to open the search UI.

  3. Search for “book like star wars” — the top result should be “Dune”.

    We can assume it matched because the words in the description are semantically similar to the query: "planet" is semantically close to "star" and "struggles" is semantically close to "wars".

Run the second ingestion test

  1. Press play on the second job (with the name that starts with “PT2…”)

    This will ingest the second part of the dataset with more relevant matches.

  2. In the Streamlit-based search UI, search for “book like star wars” again. The top result should now be “Old Man’s War”, and the second result should be “Dune”.

    We can assume that Dune has been knocked off the top spot because the new addition has a more semantically relevant description: the term "war" is almost a direct hit, and "interstellar" is probably semantically closer to the search term "star" than "planet" is.
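The change in ranking can be reproduced in miniature with cosine similarity, a measure commonly used to compare embeddings. In the sketch below, the three-dimensional vectors are invented purely for illustration. Real embeddings come from a model and have far more dimensions, so the scores the template produces will differ. The `search` helper and the "Unrelated title" entry are hypothetical and are not part of the template.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(store, query_vector, limit=2):
    """Return the titles of the `limit` most similar entries, best first."""
    scored = [(cosine(vec, query_vector), title) for title, vec in store.items()]
    scored.sort(reverse=True)
    return [title for _, title in scored[:limit]]


# Made-up vector standing in for the embedded query "book like star wars".
query = [1.0, 1.0, 0.0]

# First pass: the initial entries (vectors are invented, not real embeddings).
store = {
    "Dune": [0.9, 0.3, 0.4],
    "Unrelated title": [0.0, 0.1, 1.0],
}
first = search(store, query)
print(first)   # ['Dune', 'Unrelated title']

# Second pass: a new entry whose vector lies closer to the query.
store["Old Man's War"] = [0.8, 0.9, 0.1]
second = search(store, query)
print(second)  # ["Old Man's War", 'Dune']
```

Nothing about the existing "Dune" entry changes between the two passes. It drops to second place only because a closer vector was added to the store.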

Contributors

steverosam
