Small Gradio app for fine-tuning document retrieval models
- Modified the loss function from pairwise subtraction to mean subtraction with a margin (rough sketch below)
- Using the latest Transformers tokenizer for RoBERTa (like it matters tho)
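Roughly what that mean-subtraction margin loss might look like. This is a guess at the shape of it, not the actual code; `margin=0.5` and the score tensor layouts are assumptions:

```python
import torch

def mean_margin_loss(pos_scores: torch.Tensor,
                     neg_scores: torch.Tensor,
                     margin: float = 0.5) -> torch.Tensor:
    """Instead of hinging on each (pos, neg) pair separately,
    hinge on the positive score minus the *mean* negative score.

    pos_scores: (batch,)    similarity of query vs. positive passage
    neg_scores: (batch, k)  similarity of query vs. k negative passages
    """
    # mean over the negatives, then a standard margin hinge
    gap = pos_scores - neg_scores.mean(dim=1)
    return torch.relu(margin - gap).mean()
```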
Idea: instead of a separate fine-tuned model for each dataset, keep a single base model and train a separate LoRA adapter per task? Pros:
- A LoRA adapter is way smaller than a full fine-tuned checkpoint
- Training a LoRA can be faster & less memory-hungry
Cons:
- How TF do you train a BERT LoRA? (one plausible answer sketched below)
- This is getting complicated, bro
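Turns out HF PEFT makes this less scary than it sounds. A minimal sketch; the rank, alpha, and dropout values here are placeholder guesses to tune, not recommendations:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModel

base = AutoModel.from_pretrained("bert-base-uncased")

config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,  # we want embeddings, not a classification head
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],      # BERT's attention projections
)
model = get_peft_model(base, config)
model.print_trainable_parameters()          # only a tiny fraction of params train
```

Targeting just the attention query/value projections is the classic LoRA recipe; the rest of the bi-encoder training loop doesn't have to change.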
- Take info from the Genshin Impact wiki https://genshin-impact.fandom.com/wiki/Genshin_Impact_Wiki, embed it with Cohere's multilingual model, and put it in a Weaviate DB (ingestion sketch after this list)
- Build a basic RAG chat with LangChain, using Cohere embeddings for retrieval and ChatGPT for synthesis
- Take those same multilingual Wikipedia embeddings from Cohere, but yeet them into your own IVF_PQ index on Weaviate
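Sketch of the scrape-embed-store step, assuming the Weaviate v3 Python client and the older Cohere SDK; the class name `WikiPassage` and the sample text are placeholders:

```python
import cohere
import weaviate

co = cohere.Client("YOUR_COHERE_KEY")                 # placeholder key
client = weaviate.Client("http://localhost:8080")     # local Weaviate instance

passages = ["Genshin Impact is an open-world action RPG..."]  # scraped wiki text

# Cohere's multilingual model returns one vector per input text
embs = co.embed(texts=passages, model="embed-multilingual-v2.0").embeddings

with client.batch as batch:
    for text, vector in zip(passages, embs):
        batch.add_data_object(
            data_object={"text": text},
            class_name="WikiPassage",  # hypothetical class name
            vector=vector,             # bring-your-own vectors
        )
```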
- Use a https://huggingface.co/doc2query model for the specific language (they have a lot of them) to generate 3 synthetic queries per passage. Should probably just use a random sample of the passages TBH
- For each query, use a pretrained retriever (msmarco-distilbert-base-tas-b) to pull 25 passages from the corpus.
- Make triplets out of those passages like (synthetic query, base passage, retrieved passage n) and run them through a pretrained cross-encoder. Train with MSE between the bi-encoder's score margin (pos minus neg) and the cross-encoder's score margin, i.e. MarginMSE, GPL-style. Sketch of the full pipeline after this list.
- Now you have your fine-tuned model and can run evals!
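Rough sketch of that synthetic-data pipeline (this is basically GPL). The model names are real checkpoints, but treat the rest as pseudocode-with-imports; the cross-encoder here is English-only, so swap in a multilingual one if the language needs it:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = ["passage 1 ...", "passage 2 ..."]  # random sample of passages

# 1) doc2query: 3 synthetic queries per passage
tok = AutoTokenizer.from_pretrained("doc2query/msmarco-14langs-mt5-base-v1")
gen = AutoModelForSeq2SeqLM.from_pretrained("doc2query/msmarco-14langs-mt5-base-v1")

def synth_queries(passage: str, n: int = 3) -> list[str]:
    ids = tok(passage, return_tensors="pt", truncation=True, max_length=384).input_ids
    out = gen.generate(ids, max_length=64, do_sample=True, top_p=0.95,
                       num_return_sequences=n)
    return tok.batch_decode(out, skip_special_tokens=True)

# 2) pull 25 candidates per query with the pretrained retriever
retriever = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

# 3) cross-encoder scores the triplets; the score margin becomes the label
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

training_rows = []
for passage in corpus:
    for q in synth_queries(passage):
        q_emb = retriever.encode(q, convert_to_tensor=True)
        hits = util.semantic_search(q_emb, corpus_emb, top_k=25)[0]
        pos = float(ce.predict([(q, passage)])[0])
        for h in hits:
            neg_passage = corpus[h["corpus_id"]]
            neg = float(ce.predict([(q, neg_passage)])[0])
            # MarginMSE target: the bi-encoder should reproduce (pos - neg)
            training_rows.append((q, passage, neg_passage, pos - neg))
```

Then train the bi-encoder so its own dot-product margin matches that last column (plain `nn.MSELoss` does the job).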
Since the Wikipedia datasets come with queries, you can just take a sample of those for your evals!
- Sample ~3,000 queries --- Start the timer! ---
- Encode them:
  - Baseline: Cohere multilingual embeddings
  - Experiment: your fine-tuned model
- Do a similarity search in your Weaviate database
- Return top 10 results
- Eval MAP or MRR at 10, 5, and 1 (metric sketch below)
- See who is better and cry when it is Cohere lol
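Sketch of the eval loop, assuming the Weaviate v3 query API; `encode`, the `passage_id` field, and the shape of `eval_queries` are placeholders for however the data actually ends up looking:

```python
import time

def mrr_at_k(ranked_ids: list, relevant_id, k: int = 10) -> float:
    """Reciprocal rank of the first relevant hit in the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid == relevant_id:
            return 1.0 / rank
    return 0.0

def run_eval(client, encode, eval_queries, k: int = 10) -> None:
    """encode: text -> vector (Cohere baseline OR the fine-tuned model).
    eval_queries: list of (query_text, relevant_passage_id) pairs."""
    start = time.perf_counter()
    scores = []
    for text, relevant_id in eval_queries:
        res = (client.query
                     .get("WikiPassage", ["passage_id"])
                     .with_near_vector({"vector": encode(text)})
                     .with_limit(k)
                     .do())
        ranked = [h["passage_id"] for h in res["data"]["Get"]["WikiPassage"]]
        scores.append(mrr_at_k(ranked, relevant_id, k))
    print(f"MRR@{k}: {sum(scores) / len(scores):.3f} "
          f"({time.perf_counter() - start:.1f}s for {len(eval_queries)} queries)")
```

With a single relevant passage per query, MAP@k and MRR@k coincide, so one function covers both.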
- Find a dataset that you can use as an example
- Use https://huggingface.co/datasets/climate_fever
- [HARDCORE] semantic Wikipedia search https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings (loading sketch below)
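Loading either one; the field names below match what the dataset cards show, but double-check them:

```python
from datasets import load_dataset

# climate_fever: ~1.5k claims, each with labeled Wikipedia evidence sentences
fever = load_dataset("climate_fever", split="test")
print(fever[0]["claim"], "->", fever[0]["evidences"][0]["evidence"])

# The hardcore one ships millions of passages with precomputed Cohere vectors;
# stream it instead of downloading the whole thing up front
wiki = load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                    split="train", streaming=True)
row = next(iter(wiki))  # has "title", "text", and the "emb" vector
```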
- reject sentence-transformers, embrace PyTorch
- Modify BiEncoder pipeline to accept a param for top-k
- Use the HF Trainer class
- Modify evals to check top-k results, too
- Modify the rerank pipeline to accept a bi-encoder pipeline as a param instead of re-defining its methods
- Chinchilla scaling checker
- Make a basic Gradio interface (skeleton sketch after this feature list):
- Change hyperparameters
- View evals during training
- View training data and results data
- View final test dataset
- Save hyperparams and results to a CSV
- Better system to save & load models
- Auto-generate a copy/paste-able custom LangChain retriever class
- Make it a Hugging Face Space
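Skeleton for that interface; the `train` hook and the `runs.csv` logging are stand-ins for the real fine-tuning pipeline:

```python
import gradio as gr
import pandas as pd

def train(lr: float, epochs: int, margin: float) -> pd.DataFrame:
    """Placeholder training hook -- wire in the real fine-tuning loop here."""
    results = pd.DataFrame([{"lr": lr, "epochs": epochs, "margin": margin,
                             "MRR@10": 0.0}])          # dummy eval numbers
    results.to_csv("runs.csv", mode="a", index=False)  # naive append to the run log
    return results

with gr.Blocks() as demo:
    with gr.Row():
        lr = gr.Number(value=2e-5, label="Learning rate")
        epochs = gr.Slider(1, 10, value=3, step=1, label="Epochs")
        margin = gr.Number(value=0.5, label="Loss margin")
    run = gr.Button("Train")
    evals = gr.Dataframe(label="Eval results")
    run.click(train, inputs=[lr, epochs, margin], outputs=evals)

demo.launch()
```

Newer Gradio versions ship a `gradio deploy` command that pushes an app like this straight to a Hugging Face Space.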