
Embedding Studio is a framework that lets you transform your vector database into a feature-rich search engine.

Home Page: https://embeddingstud.io/

License: Apache License 2.0

Topics: embeddings, embeddings-similarity, fine-tuning, llm-inference, query-parser, search-algorithm, search-engine, semantic-similarity, unstructured-data, unstructured-search


Embedding Studio

Python 3.9 · CUDA 11.7.1 · Docker Compose

Website · Documentation · Challenges & Solutions · Use Cases

Embedding Studio is an innovative open-source framework designed to seamlessly convert a combined embedding model and vector database into a comprehensive search engine. With built-in functionalities for clickstream collection, continuous improvement of search experiences, and automatic adaptation of the embedding model, it offers an out-of-the-box solution for a full-cycle search engine.

Community Support
Embedding Studio grows with our team's enthusiasm. Your star on the repository helps us keep developing.
Join us in reaching our goal:

Progress

Features

  1. 🔄 Turn your vector database into a full-cycle search engine
  2. 🖱️ Collect user feedback such as clickstream data
  3. 🚀 (*) Improve the search experience on the fly, without frustrating wait times
  4. 📊 (*) Monitor your search quality
  5. 🎯 Improve your embedding model through an iterative metric fine-tuning procedure
  6. 🆕 (*) Use the new version of the embedding model for inference
  7. 🛠️ (*) Fine-tune your embedding model on your catalogue data in advance
  8. 🔍 (*) Use and improve the Zero-Shot Query Parser to combine your structured database with unstructured search

(*) - features in development

Embedding Studio is highly customizable, so you can bring your own:

  1. Data source
  2. Vector database
  3. Clickstream database
  4. Embedding model

When is Embedding Studio the best fit?

More about it here.

  • 📚💼 Businesses with extensive catalogs and rich unstructured data.
  • 🛍️🤝 Customer-centric platforms prioritizing personalized experiences.
  • 🔄📊 Dynamic content platforms with evolving content and user preferences.
  • 🔍🧠 Platforms handling nuanced and multifaceted search queries.
  • 🔄📊 Integration of mixed data types in search processes.
  • 🔄🚀 Platforms seeking ongoing optimization through user interactions.
  • 💵💡 Budget-conscious organizations seeking powerful yet affordable solutions.

Challenges that can be solved

Disclaimer: Embedding Studio is not yet another vector database; it's a framework that lets you transform your vector database into a search engine, with all the nuances that entails.

  • You have nothing but a catalogue, but you want a quick demo
  • Search quality is static, and you want it to improve over time
  • Improving the user experience takes too long, and your users get frustrated
  • Index updates are slow and resource-hungry
  • You mix structured and unstructured search and don't know how to combine them
  • You run structured search against unstructured queries and want to parse them properly
  • Fresh items get lost

More about challenges and solutions here

Overview

Our framework enables you to continuously fine-tune your model based on user experience, allowing you to form search results for user queries faster and more accurately.

$\color{red}{\textsf{RED:}}$ On the graph, typical search solutions without enhancements, such as Full-Text Search (FTS), Nearest Neighbor Search (NNS), and others, are marked in red. Without additional tools, their search quality remains unchanged over time.

$\color{orange}{\textsf{ORANGE:}}$ Marked in orange are solutions that accumulate feedback (clicks, reviews, votes, discussions, etc.) and then trigger a full model retraining. The primary issue with these solutions is that full retraining is time-consuming and expensive, so they cannot react quickly (for example, when a product suddenly sees increased demand and the search system has not yet adapted).

$\color{#6666ff}{\textsf{INDIGO:}}$ We propose a solution that allows collecting user feedback and rapidly retraining the model on the difference between the old and new versions. This enables a smoother and more relevant search quality curve for your system.

Embedding Studio Chart

Documentation

View our official documentation.

Getting Started

Hello, Unstructured World!

To try out Embedding Studio, you can launch the pre-configured demonstration project. We've prepared a dataset stored in a public S3 bucket, an emulator for user clicks, and a basic script for fine-tuning the model. By adapting it to your requirements, you can initiate fine-tuning for your model.

Ensure that you have the docker compose version command working on your system:

Docker Compose version v2.23.3

You can also try the docker-compose version command. In what follows we use the newer docker compose syntax, but docker-compose may also work on your system.

Firstly, bring up all the Embedding Studio services by executing the following command:

docker compose up -d

Once all services are up, you can start using Embedding Studio. Let's simulate a user search session. We'll run a pre-built script that will invoke the Embedding Studio API and emulate user behavior:

docker compose --profile demo_stage_clickstream up -d

After the script execution, you can initiate model fine-tuning. Execute the following command:

docker compose --profile demo_stage_finetuning up -d

This will queue a task processed by the fine-tuning worker. To fetch all tasks in the fine-tuning queue, send a GET request to the endpoint /api/v1/fine-tuning/task:

curl -X GET http://localhost:5000/api/v1/fine-tuning/task

The answer will be something like:

[
  {
    "fine_tuning_method": "Default Fine Tuning Method",
    "status": "processing",
    "created_at": "2023-12-21T14:30:25.823000",
    "updated_at": "2023-12-21T14:32:16.673000",
    "batch_id": "65844a671089823652b83d43",
    "id": "65844c019fa7cf0957d04758"
  }
]
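For scripted use of this endpoint, the task list can be fetched and filtered with the Python standard library alone. A minimal sketch, assuming only the endpoint and response shape shown above (the helper names are illustrative, not part of Embedding Studio):

```python
import json
import urllib.request


def fetch_tasks(base_url="http://localhost:5000"):
    """GET the full fine-tuning task queue as a list of dicts."""
    with urllib.request.urlopen(f"{base_url}/api/v1/fine-tuning/task") as resp:
        return json.load(resp)


def pick_latest(tasks):
    """Return the most recently created task, or None if the queue is empty."""
    # "created_at" is ISO-8601, so lexicographic comparison matches time order.
    return max(tasks, key=lambda t: t["created_at"], default=None)
```

pick_latest(fetch_tasks())["id"] then yields the task ID used in the next step.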

Once you have the task ID, you can directly monitor the fine-tuning progress by sending a GET request to the endpoint /api/v1/fine-tuning/task/{task_id}:

curl -X GET http://localhost:5000/api/v1/fine-tuning/task/65844c019fa7cf0957d04758

The result will be similar to what you received when querying all tasks. For a more convenient way to track progress, you can use Mlflow at http://localhost:5001.
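Instead of re-running curl by hand, the status endpoint can be polled from a small script. A hedged sketch with the fetch callable injected so it stays testable; note that status values other than the "processing" and "done" shown in this README are assumptions:

```python
import time


def wait_for_task(fetch_status, interval_s=10.0, timeout_s=3600.0):
    """Poll a fine-tuning task until it leaves the 'processing' state.

    fetch_status is any zero-argument callable returning the task dict,
    e.g. a thin wrapper around GET /api/v1/fine-tuning/task/{task_id}.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        task = fetch_status()
        if task["status"] != "processing":
            return task  # e.g. "done", or some failure state
        time.sleep(interval_s)
    raise TimeoutError("fine-tuning task did not finish in time")
```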

It's also worth checking the logs of the fine-tuning worker to ensure everything is functioning correctly:

docker logs embedding_studio-fine_tuning_worker-1

If everything completes successfully, you'll see logs similar to:

Epoch 2: 100%|██████████| 13/13 [01:17<00:00,  0.17it/s, v_num=8]
[2023-12-21 14:59:05,931] [PID 7] [Thread-6] [pytorch_lightning.utilities.rank_zero] [INFO] `Trainer.fit` stopped: `max_epochs=3` reached.
Epoch 2: 100%|██████████| 13/13 [01:17<00:00,  0.17it/s, v_num=8]
[2023-12-21 14:59:05,975] [PID 7] [Thread-6] [embedding_studio.workers.fine_tuning.finetune_embedding_one_param] [INFO] Save model (best only, current quality: 8.426392069685529e-05)
[2023-12-21 14:59:05,975] [PID 7] [Thread-6] [embedding_studio.workers.fine_tuning.experiments.experiments_tracker] [INFO] Save model for 2 / 9a9509bf1ed7407fb61f8d623035278e
[2023-12-21 14:59:06,009] [PID 7] [Thread-6] [embedding_studio.workers.fine_tuning.experiments.experiments_tracker] [WARNING] No finished experiments found with model uploaded, except initial
[2023-12-21 14:59:16,432] [PID 7] [Thread-6] [embedding_studio.workers.fine_tuning.experiments.experiments_tracker] [INFO] Upload is finished
[2023-12-21 14:59:16,433] [PID 7] [Thread-6] [embedding_studio.workers.fine_tuning.finetune_embedding_one_param] [INFO] Saving is finished
[2023-12-21 14:59:16,433] [PID 7] [Thread-6] [embedding_studio.workers.fine_tuning.experiments.experiments_tracker] [INFO] Finish current run 2 / 9a9509bf1ed7407fb61f8d623035278e
[2023-12-21 14:59:16,445] [PID 7] [Thread-6] [embedding_studio.workers.fine_tuning.experiments.experiments_tracker] [INFO] Current run is finished
[2023-12-21 14:59:16,656] [PID 7] [Thread-6] [embedding_studio.workers.fine_tuning.experiments.experiments_tracker] [INFO] Finish current iteration 2
[2023-12-21 14:59:16,673] [PID 7] [Thread-6] [embedding_studio.workers.fine_tuning.experiments.experiments_tracker] [INFO] Current iteration is finished
[2023-12-21 14:59:16,673] [PID 7] [Thread-6] [embedding_studio.workers.fine_tuning.worker] [INFO] Fine tuning of the embedding model was completed successfully!

Congratulations! You've successfully improved the model!

To download the best model, you can use the Embedding Studio API:

curl -X GET http://localhost:5000/api/v1/fine-tuning/task/65844c019fa7cf0957d04758

If everything is OK, you will see the following output:

{
  "fine_tuning_method": "Default Fine Tuning Method", 
  "status": "done", 
  "best_model_url": "http://localhost:5001/get-artifact?path=model%2Fdata%2Fmodel.pth&run_uuid=571304f0c330448aa8cbce831944cfdd", 
  ...
}

The best_model_url field contains a link to an HTTP-accessible model.pth file.

You can download the *.pth file with the following command (note the quotes: without them, the shell would interpret the & in the URL as a background operator):

wget -O model.pth "http://localhost:5001/get-artifact?path=model%2Fdata%2Fmodel.pth&run_uuid=571304f0c330448aa8cbce831944cfdd"
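The same download can also be scripted in Python. A stdlib-only sketch that derives the local filename from the percent-encoded path query parameter of best_model_url; that this parameter always names the artifact is an assumption based on the example URL above:

```python
import os.path
import posixpath
import urllib.parse
import urllib.request


def artifact_filename(best_model_url):
    """Derive a local filename from the URL's 'path' query parameter."""
    query = urllib.parse.urlparse(best_model_url).query
    path = urllib.parse.parse_qs(query)["path"][0]  # e.g. "model/data/model.pth"
    return posixpath.basename(path)


def download_best_model(task, dest_dir="."):
    """Download the fine-tuned model referenced by the task's best_model_url."""
    url = task["best_model_url"]
    dest = os.path.join(dest_dir, artifact_filename(url))
    urllib.request.urlretrieve(url, dest)
    return dest
```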

Contributing

We welcome contributions to Embedding Studio!

License

Embedding Studio is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.

Contributors

andrey-kostin, chillymagician, oyaso

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

Issues

Where is "Zero-Shot Query Parser"?

I've read your readme, and you mention some query parsing tools, but I can't find anything in your repo. Where can I read more about it? What is the release date for this feature? Thanks.

ClickHouse instead of MongoDB

Hey!
I'm looking into the code; what was behind the decision to use MongoDB as the clickstream storage backend?
We use ClickHouse as part of our technical stack, and it's more convenient for this purpose. Will you add ClickHouse support?
Best regards.

How to run inference with the fine-tuned embedding model?

Guys, I've tried the code and looked into it. I can't see anything related to inference. It looks interesting, but unless I can easily redeploy it in my own infrastructure, it's hard to adopt.
Are you going to implement something like this? Or can you suggest a workaround?

Encountering ClientError trying to use my own dataset

I'm trying to use my dataset for model training, but I'm encountering the following error:

ClientError                               Traceback (most recent call last)
/tmp/ipykernel_199154/1286885123.py in <module>
----> 1 response = s3_client.get_object(Bucket='embedding-studio-experiments', Key='remote-lanscapes/clickstream/f6816566-cac3-46ac-b5e4-0d5b76757c93/sessions.json')
~/anaconda3/lib/python3.9/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    528                 )
    529             # The "self" in this scope is referring to the BaseClient.
--> 530             return self._make_api_call(operation_name, kwargs)
    531 
    532         _api_call.__name__ = str(py_operation_name)
~/anaconda3/lib/python3.9/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    958             error_code = parsed_response.get("Error", {}).get("Code")
    959             error_class = self.exceptions.from_code(error_code)
--> 960             raise error_class(parsed_response, operation_name)
    961         else:
    962             return parsed_response
ClientError: An error occurred (InvalidAccessKeyId) when calling the GetObject operation: The AWS Access Key Id you provided does not exist in our records.

I attempted to set my AWS Access Key ID in the .env file under the variables MINIO_ACCESS_KEY and MINIO_SECRET_KEY. However, as I understand it, these variables are used for artifact storage, not for my datasets. Can you advise on how I can resolve this error?

Data loading is quite slow

Hey, trying to run the fine-tuning, I found that image downloading is quite slow. I checked it out, and it seems it's not even multithreaded or async. Are you going to speed it up somehow?
