Giter Site home page Giter Site logo

qdrant-nlp's Introduction

QDrant-NLP

Keeping the Human in the LOOP. I am not a developer at QDrant, nor directly associated with them, but I think they've built something excellent, and thus far under-appreciated. This repo is here to act as a demo more than anything else.

https://github.com/qdrant/qdrant

I'll call it done once it's tidy and available on DockerHub. Just enough that you could maybe use it to run your own POC without any additional code.

You can achieve almost half of this tool just via their swagger UI, but obviously, that's designed for hitting simple APIs, not data-centric AI workflows, so it's missing a few useful components. This work is written up in more depth here https://medium.com/@george.pearse (vector databases part 2).

image

The toy logo is somewhere between a magnifying glass for how the tooling enables you to really focus in on a specific data subset, and a classic bayesian graph for if I get carried away enough to try to add active learning in.

Finding the documentation for hugging-face sentence-transformers via Google Search drove me mad, it lives here https://www.sbert.net/docs/hugging_face.html

Quick labelling with hugging-face, streamlit and QDrant. First I'll support NLP, then I'll think about adding image support (which is where this idea came from).

Features

  • Supports interactively creating and storing queries for the QDrant Vector Database for an NLP dataset.
  • For each query, show the positives, show the negatives, then display the results.
  • Maybe support Active Learning (eventually). Can have a two part system, one part using Active Learning to optimise the similarity search, the other to optimize downstream finetuning. Or one to update which datapoint a nearest neighbour approach is least certain about (because this can be almost instantly updated) and another to correct the model which generates the embeddings.
  • Enable the downloading of datasets direct from hugging-face (to embeddings)
  • Loading sign while generating embeddings.
  • SQLiteDB to store the query results, and the names of the queries + maybe run heuristics based stuff like you did on the MIMIC Dataset.

Improving Deployment Experience

  • One docker-compose file for streamlit, QDrant and FastAPI
  • Make the docker images available via DockerHub

image

See Kern.AI for a full blown solution which uses QDrant behind the scenes. This tool is meant to be simple enough to act as an intro to vector databases. You can write and see the requests, just as you would via the python API.

Similarly, koaning/bulk is excellent, but what if UMAP (insert alternative dimensionality reduction technique here) loses all of the nuance, and high-level visualizations fail to provide value for your dataset?

I also wanted to give FastAPI a tiny test run, so for each query (post request) you save, you can receive its results by hitting the FastAPI endpoint with the name of the query.

To apply these tools to a multi-modal dataset you would only need to concatenate the embeddings for each component and away you go with all the same technqiues.

NB: Other names

  • Consider calling this thing grouper if you take it more seriously and upgrade the components
  • Or, carve-n-serve (if people actually liked the fastapi component). Carving up the data into small chunks.

Might make sense to apply a similarity cut off instead of the nearest K.

To Run:

To get started, just run

docker-compose up

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.