Small Gradio app for fine-tuning document retrieval models
- Modified the loss function from pairwise subtraction to mean subtraction with a margin (rough sketch below)
- Using the latest Transformers tokenizer for RoBERTa (like it matters tho)
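Roughly what that mean-subtraction margin loss might look like. This is a guess at the shape of it, not the actual code; `margin=0.5` and the score tensor layouts are assumptions:

```python
import torch

def mean_margin_loss(pos_scores: torch.Tensor,
                     neg_scores: torch.Tensor,
                     margin: float = 0.5) -> torch.Tensor:
    """Instead of hinging on each (pos, neg) pair separately,
    hinge on the positive score minus the *mean* negative score.

    pos_scores: (batch,)    similarity of query vs. positive passage
    neg_scores: (batch, k)  similarity of query vs. k negative passages
    """
    # mean over the negatives, then a standard margin hinge
    gap = pos_scores - neg_scores.mean(dim=1)
    return torch.relu(margin - gap).mean()
```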
Idea: instead of a separate fine-tuned model for each dataset, keep a single base model and train a separate LoRA adapter per task? Pros:
- A LoRA adapter is way smaller than a full fine-tuned checkpoint
- Training a LoRA can be faster & less memory-hungry
Cons:
- How TF do you train a BERT LoRA? (one plausible answer sketched below)
- This is getting complicated, bro
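Turns out HF PEFT makes this less scary than it sounds. A minimal sketch; the rank, alpha, and dropout values here are placeholder guesses to tune, not recommendations:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModel

base = AutoModel.from_pretrained("bert-base-uncased")

config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,  # we want embeddings, not a classification head
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],      # BERT's attention projections
)
model = get_peft_model(base, config)
model.print_trainable_parameters()          # only a tiny fraction of params train
```

Targeting just the attention query/value projections is the classic LoRA recipe; the rest of the bi-encoder training loop doesn't have to change.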
- Take info from the Genshin Impact wiki https://genshin-impact.fandom.com/wiki/Genshin_Impact_Wiki, embed it with Cohere's multilingual model, and put it in a Weaviate DB (ingestion sketch after this list)
- Build a basic RAG chat with LangChain, using Cohere embeddings for retrieval and ChatGPT for synthesis
- Take those same multilingual Wikipedia embeddings from Cohere, but yeet them into your own IVF_PQ index on Weaviate
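Sketch of the scrape-embed-store step, assuming the Weaviate v3 Python client and the older Cohere SDK; the class name `WikiPassage` and the sample text are placeholders:

```python
import cohere
import weaviate

co = cohere.Client("YOUR_COHERE_KEY")                 # placeholder key
client = weaviate.Client("http://localhost:8080")     # local Weaviate instance

passages = ["Genshin Impact is an open-world action RPG..."]  # scraped wiki text

# Cohere's multilingual model returns one vector per input text
embs = co.embed(texts=passages, model="embed-multilingual-v2.0").embeddings

with client.batch as batch:
    for text, vector in zip(passages, embs):
        batch.add_data_object(
            data_object={"text": text},
            class_name="WikiPassage",  # hypothetical class name
            vector=vector,             # bring-your-own vectors
        )
```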
- Use a https://huggingface.co/doc2query model for the specific language (they have a lot of them) to generate 3 synthetic queries per passage. Should probably just use a random sample of the passages TBH
- For each query, use a pretrained retriever (msmarco-distilbert-base-tas-b) to pull 25 passages from the corpus.
- Make triplets out of those passages like (synthetic query, base passage, retrieved passage n) and run them through a pretrained cross-encoder. Train with MSE between the bi-encoder's score margin (pos minus neg) and the cross-encoder's score margin, i.e. MarginMSE, GPL-style. Sketch of the full pipeline after this list.
- Now you have your fine-tuned model and can run evals!
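Rough sketch of that synthetic-data pipeline (this is basically GPL). The model names are real checkpoints, but treat the rest as pseudocode-with-imports; the cross-encoder here is English-only, so swap in a multilingual one if the language needs it:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = ["passage 1 ...", "passage 2 ..."]  # random sample of passages

# 1) doc2query: 3 synthetic queries per passage
tok = AutoTokenizer.from_pretrained("doc2query/msmarco-14langs-mt5-base-v1")
gen = AutoModelForSeq2SeqLM.from_pretrained("doc2query/msmarco-14langs-mt5-base-v1")

def synth_queries(passage: str, n: int = 3) -> list[str]:
    ids = tok(passage, return_tensors="pt", truncation=True, max_length=384).input_ids
    out = gen.generate(ids, max_length=64, do_sample=True, top_p=0.95,
                       num_return_sequences=n)
    return tok.batch_decode(out, skip_special_tokens=True)

# 2) pull 25 candidates per query with the pretrained retriever
retriever = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

# 3) cross-encoder scores the triplets; the score margin becomes the label
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

training_rows = []
for passage in corpus:
    for q in synth_queries(passage):
        q_emb = retriever.encode(q, convert_to_tensor=True)
        hits = util.semantic_search(q_emb, corpus_emb, top_k=25)[0]
        pos = float(ce.predict([(q, passage)])[0])
        for h in hits:
            neg_passage = corpus[h["corpus_id"]]
            neg = float(ce.predict([(q, neg_passage)])[0])
            # MarginMSE target: the bi-encoder should reproduce (pos - neg)
            training_rows.append((q, passage, neg_passage, pos - neg))
```

Then train the bi-encoder so its own dot-product margin matches that last column (plain `nn.MSELoss` does the job).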
Since the Wikipedia datasets come with queries, you can just take a sample of those for your evals!
- Sample ~3,000 queries --- Start the timer! ---
- Encode them:
  - Baseline: Cohere multilingual embeddings
  - Experiment: your fine-tuned model
- Do a similarity search in your Weaviate database
- Return top 10 results
- Eval MAP or MRR at 10, 5, and 1 (metric sketch below)
- See who is better and cry when it is Cohere lol
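Sketch of the eval loop, assuming the Weaviate v3 query API; `encode`, the `passage_id` field, and the shape of `eval_queries` are placeholders for however the data actually ends up looking:

```python
import time

def mrr_at_k(ranked_ids: list, relevant_id, k: int = 10) -> float:
    """Reciprocal rank of the first relevant hit in the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid == relevant_id:
            return 1.0 / rank
    return 0.0

def run_eval(client, encode, eval_queries, k: int = 10) -> None:
    """encode: text -> vector (Cohere baseline OR the fine-tuned model).
    eval_queries: list of (query_text, relevant_passage_id) pairs."""
    start = time.perf_counter()
    scores = []
    for text, relevant_id in eval_queries:
        res = (client.query
                     .get("WikiPassage", ["passage_id"])
                     .with_near_vector({"vector": encode(text)})
                     .with_limit(k)
                     .do())
        ranked = [h["passage_id"] for h in res["data"]["Get"]["WikiPassage"]]
        scores.append(mrr_at_k(ranked, relevant_id, k))
    print(f"MRR@{k}: {sum(scores) / len(scores):.3f} "
          f"({time.perf_counter() - start:.1f}s for {len(eval_queries)} queries)")
```

With a single relevant passage per query, MAP@k and MRR@k coincide, so one function covers both.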
- Find a dataset that you can use as an example
- Use https://huggingface.co/datasets/climate_fever
- [HARDCORE] semantic Wikipedia search https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings (loading sketch below)
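Loading either one; the field names below match what the dataset cards show, but double-check them:

```python
from datasets import load_dataset

# climate_fever: ~1.5k claims, each with labeled Wikipedia evidence sentences
fever = load_dataset("climate_fever", split="test")
print(fever[0]["claim"], "->", fever[0]["evidences"][0]["evidence"])

# The hardcore one ships millions of passages with precomputed Cohere vectors;
# stream it instead of downloading the whole thing up front
wiki = load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                    split="train", streaming=True)
row = next(iter(wiki))  # has "title", "text", and the "emb" vector
```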
- reject sentence-transformers, embrace PyTorch
- Modify BiEncoder pipeline to accept a param for top-k
- Use the HF Trainer class
- Modify evals to check top-k results, too
- Modify the rerank pipeline to accept a bi-encoder pipeline as a param instead of re-defining its methods
- Chinchilla scaling checker
- Make a basic Gradio interface (skeleton sketch after this feature list):
- Change hyperparameters
- View evals during training
- View training data and results data
- View final test dataset
- Save hyperparams and results to a CSV
- Better system to save & load models
- Auto-generate a copy/paste-able custom LangChain retriever class
- Make it a Hugging Face Space
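Skeleton for that interface; the `train` hook and the `runs.csv` logging are stand-ins for the real fine-tuning pipeline:

```python
import gradio as gr
import pandas as pd

def train(lr: float, epochs: int, margin: float) -> pd.DataFrame:
    """Placeholder training hook -- wire in the real fine-tuning loop here."""
    results = pd.DataFrame([{"lr": lr, "epochs": epochs, "margin": margin,
                             "MRR@10": 0.0}])          # dummy eval numbers
    results.to_csv("runs.csv", mode="a", index=False)  # naive append to the run log
    return results

with gr.Blocks() as demo:
    with gr.Row():
        lr = gr.Number(value=2e-5, label="Learning rate")
        epochs = gr.Slider(1, 10, value=3, step=1, label="Epochs")
        margin = gr.Number(value=0.5, label="Loss margin")
    run = gr.Button("Train")
    evals = gr.Dataframe(label="Eval results")
    run.click(train, inputs=[lr, epochs, margin], outputs=evals)

demo.launch()
```

Newer Gradio versions ship a `gradio deploy` command that pushes an app like this straight to a Hugging Face Space.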