retrieval-finetune-harness

Small Gradio app for fine-tuning document retrieval models

Changes from the OG Income

  • Changed the loss function from pairwise subtraction to mean subtraction with a margin
  • Uses the latest Transformers tokenizer for RoBERTa (not that it likely matters)

What IF...

Instead of using a separate fine-tuned model for each dataset, what if we kept a single base model and trained a separate LoRA for each task?

Pros:

  • A LoRA is much smaller than a fully fine-tuned model
  • Training a LoRA can be faster and less resource-intensive

Cons:

  • How do you even train a LoRA for a BERT model?
  • This is getting complicated
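
To put a rough number on the size argument, here is a small sketch that compares trainable parameter counts for one weight matrix. It assumes the standard LoRA formulation, where the update to a frozen matrix W (d_out x d_in) is the product of two low-rank matrices B (d_out x rank) and A (rank x d_in). The 768 x 768 shape and rank 8 are illustrative assumptions, not values taken from this project.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole matrix W (d_out x d_in) is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update B @ A,
    with B of shape (d_out x rank) and A of shape (rank x d_in)."""
    return rank * (d_out + d_in)


# Assumed, illustrative shapes: one 768 x 768 projection matrix, rank 8.
d_out, d_in, rank = 768, 768, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

The saving applies per adapted matrix, so the total depends on which layers get an adapter. That choice is still open here.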

OK, here's the plan:

  1. Take info from the Genshin Impact wiki (https://genshin-impact.fandom.com/wiki/Genshin_Impact_Wiki), embed it with Cohere's multilingual embedding model, and store it in a Weaviate database.
  2. Build a basic RAG chat with LangChain, using Cohere embeddings for retrieval and ChatGPT for synthesis.
  3. Take the same multilingual Wikipedia embeddings dataset from Cohere, but throw away the embeddings and build your own IVF_PQ database on Weaviate.
  4. Use a doc2query model (https://huggingface.co/doc2query) for the specific language (there are a lot of them) to generate 3 synthetic queries per passage. A random sample of the passages is probably enough.
  5. For each query, use a pretrained retriever (msmarco-distilbert-base-tas-b) to pull 25 passages from the corpus.
  6. Make triplets out of those passages, like (synthetic query, base passage, retrieved passage n), and score them with a pretrained cross-encoder. Train with MSE between the bi-encoder's score margin and the cross-encoder's score margin.
  7. Now you have your fine-tuned model and can run evals!
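
Steps 5 and 6 can be sketched roughly as below. This reads step 6 as a Margin-MSE style objective, which is an assumption: the student (bi-encoder) is trained so that its score margin between the base passage and a retrieved passage matches the teacher's (cross-encoder) margin. `build_triplets`, `margin_mse`, `overlap_score`, and `toy_teacher` are hypothetical names, and the toy scorers only stand in for real models.

```python
from typing import Callable, List, Sequence, Tuple

Triplet = Tuple[str, str, str]  # (synthetic query, base passage, retrieved passage)
Scorer = Callable[[str, str], float]


def build_triplets(query: str, base_passage: str, retrieved: Sequence[str]) -> List[Triplet]:
    """Pair the query and its source passage with each retrieved passage,
    skipping the source passage itself if the retriever returned it."""
    return [(query, base_passage, p) for p in retrieved if p != base_passage]


def margin_mse(triplets: Sequence[Triplet], student: Scorer, teacher: Scorer) -> float:
    """Mean squared error between student and teacher score margins."""
    if not triplets:
        raise ValueError("need at least one triplet")
    total = 0.0
    for query, positive, negative in triplets:
        student_margin = student(query, positive) - student(query, negative)
        teacher_margin = teacher(query, positive) - teacher(query, negative)
        total += (student_margin - teacher_margin) ** 2
    return total / len(triplets)


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a bi-encoder: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def toy_teacher(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: a rescaled overlap score."""
    return 2.0 * overlap_score(query, passage)


base = "train a retriever with triplets"
triplets = build_triplets(
    "how to train a retriever",
    base,
    [base, "cook rice with water"],  # pretend these came back from the retriever
)
loss = margin_mse(triplets, overlap_score, toy_teacher)
print(len(triplets), loss)
```

In a real run the two scorers would be the model being fine-tuned and the pretrained cross-encoder, and the loss would be computed on batched tensors so gradients can flow to the student.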

Evals!

Since the Wikipedia datasets come with queries, you can just take a sample of those for your evals!

  1. Sample ~3,000 or so queries (start the timer here!)
  2. Encode them:
    • Baseline uses Cohere multilingual encoding
    • Experiment uses your fine-tuned model
  3. Do a similarity search in your Weaviate database
  4. Return top 10 results
  5. Evaluate MAP or MRR at 10, 5, and 1
  6. See who is better, and cry when it is Cohere lol
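
The scoring in step 5 could look something like this sketch. It computes MAP and MRR (mean reciprocal rank, included here as an assumption about the second metric) at several cutoffs from ranked result IDs and a set of relevant IDs per query. The function names and the toy `results` and `qrels` dictionaries are hypothetical, and average precision is normalized by min(number of relevant docs, k), which is one common convention among several.

```python
from typing import Dict, List, Sequence, Set


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Average precision over the top-k results for one query."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant result in the top-k, or 0 if none."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(results: Dict[str, List[str]], qrels: Dict[str, Set[str]],
             ks: Sequence[int] = (10, 5, 1)) -> Dict[str, float]:
    """Average both metrics over all queries, at each cutoff in ks."""
    n = len(results)
    if n == 0:
        raise ValueError("no queries to evaluate")
    report: Dict[str, float] = {}
    for k in ks:
        report[f"MAP@{k}"] = sum(
            average_precision_at_k(r, qrels.get(q, set()), k) for q, r in results.items()) / n
        report[f"MRR@{k}"] = sum(
            reciprocal_rank_at_k(r, qrels.get(q, set()), k) for q, r in results.items()) / n
    return report


# Toy data: ranked doc IDs per query, and the relevant doc IDs per query.
results = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8", "d7"]}
qrels = {"q1": {"d1"}, "q2": {"d9"}}
report = evaluate(results, qrels)
print(report)
```

Running the same `evaluate` call on the baseline's results and the fine-tuned model's results gives the side-by-side comparison from step 6.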

To-do

  • Find a dataset that you can use as an example
  • Reject Sentence Transformers, embrace PyTorch
  • Modify BiEncoder Pipeline to accept a top-k parameter
  • Use the Hugging Face Trainer class
  • Modify evals to check top-k results, too
  • Modify rerank pipeline to accept biencoder pipeline as param instead of re-defining its methods
  • Chinchilla scaling checker
  • Make a basic Gradio interface
    • ✅ Change hyperparameters
    • View evals during training
    • View training data and results data
    • View final test dataset
    • Save hyperparams and results to a csv
  • Better system to save & load models
  • Auto-generate a copy-pasteable custom LangChain retriever class
  • Make it a Hugging Face Space
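
For the "save hyperparams and results to a csv" item, a minimal sketch using only the standard library might look like this. `append_run` is a hypothetical helper, and the hyperparameter names and values are made-up placeholders, not settings or results from this project.

```python
import csv
import io
from typing import Dict, TextIO


def append_run(handle: TextIO, hyperparams: Dict[str, object],
               results: Dict[str, float], write_header: bool) -> None:
    """Write one training run as a CSV row: hyperparameters first, then results."""
    row = {**hyperparams, **results}
    writer = csv.DictWriter(handle, fieldnames=list(row))
    if write_header:
        writer.writeheader()
    writer.writerow(row)


# An in-memory buffer stands in for a real file opened with newline="".
buffer = io.StringIO()
append_run(
    buffer,
    {"lr": 2e-5, "batch_size": 32, "margin": 0.5},  # placeholder hyperparameters
    {"MAP@10": 0.0},                                 # placeholder result
    write_header=True,
)
print(buffer.getvalue())
```

Every run needs the same keys in the same order for the rows to line up. A real version would probably fix the column list once instead of deriving it from each row.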

Contributors

gordon-bp
