Retrieval-augmented generation (RAG) demos with Llama-2-7b, Mistral-7b, Zephyr-7b, and Gemma
The demos use quantized models and run on CPU with acceptable inference time. They run fully offline, with no Internet access required, and can therefore be deployed in an air-gapped environment.
The demos also allow the user to:
- apply a propositionizer to document chunks
- rerank retrieved documents (see the sketch after this list)
- perform hypothetical document embedding (HyDE)
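To illustrate the reranking step, here is a minimal sketch of cross-encoder reranking using the `CrossEncoder` API from `sentence-transformers` with `BAAI/bge-reranker-base` (one of the rerankers listed below). The query and passages are invented for the example, and the demo's actual wiring may differ.

```python
from sentence_transformers import CrossEncoder

# Load the cross-encoder reranker (use "models/bge-reranker-base" if saved locally).
reranker = CrossEncoder("BAAI/bge-reranker-base")

query = "What is HyDE?"  # example query
passages = [
    "HyDE generates a hypothetical answer and embeds it for retrieval.",
    "Streamlit is a framework for building data apps in Python.",
]

# Score each (query, passage) pair, then keep the highest-scoring passages.
scores = reranker.predict([(query, p) for p in passages])
reranked = sorted(zip(scores, passages), reverse=True)
top_passages = [p for _, p in reranked[:1]]
print(top_passages)
```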
Set up your development environment using conda (install Miniconda or Anaconda first if you do not already have conda):
```bash
conda create --name rag python=3.11
conda activate rag
pip install -r requirements.txt
```
Download and save the models in `./models` and update `config.yaml`. The models used in this demo are listed below; a download sketch follows the list.
- Embeddings
- LLMs (quantized Llama-2-7b, Mistral-7b, Zephyr-7b, Gemma)
- Rerankers:
  - `facebook/tart-full-flan-t5-xl`: save in `models/tart-full-flan-t5-xl/`
  - `BAAI/bge-reranker-base`: save in `models/bge-reranker-base/`
- Propositionizer:
  - `chentong00/propositionizer-wiki-flan-t5-large`: save in `models/propositionizer-wiki-flan-t5-large/`
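As a sketch of the download step, the models above can be fetched with the `huggingface_hub` library. The `snapshot_download` call is standard; the local directory layout is assumed to match the paths listed above and what `config.yaml` expects.

```python
from huggingface_hub import snapshot_download

# Download the reranker into the local models directory.
# Repeat with the other repo IDs listed above.
snapshot_download(
    repo_id="BAAI/bge-reranker-base",
    local_dir="models/bge-reranker-base",
)
```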
Since each model type has its own prompt format, the formats are included in `./src/prompt_templates.py`. For example, the format used by OpenBuddy models is:
```python
_openbuddy_format = """{system}
User: {user}
Assistant:"""
```
Refer to the file for more details.
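To illustrate, a template like this is typically filled with Python's `str.format`; the system and user strings below are invented for the example.

```python
# Fill the OpenBuddy template with a system message and a user question
# (both strings are made-up examples).
_openbuddy_format = """{system}
User: {user}
Assistant:"""

prompt = _openbuddy_format.format(
    system="You answer questions using the retrieved context.",
    user="What does HyDE stand for?",
)
print(prompt)
```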
We use Streamlit as the interface for the demos. There are two demos:
- Conversational Retrieval: `streamlit run app_conv.py`
- Retrieval QA: `streamlit run app_qa.py`
To get started, upload a PDF and click on "Build VectorDB". Building the vector DB will take a while.
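For reference, here is a minimal sketch of what building a vector DB from an uploaded PDF might look like with LangChain and FAISS. The loader, splitter, chunk sizes, and embedding model path are all assumptions for illustration, not the demo's actual code.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and chunk the uploaded PDF (path and chunk sizes are example values).
docs = PyPDFLoader("example.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Embed the chunks with a locally saved embedding model (path is assumed)
# and index them in FAISS.
embeddings = HuggingFaceEmbeddings(model_name="models/bge-small-en")
vectordb = FAISS.from_documents(chunks, embeddings)
vectordb.save_local("vectordb")
```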