Testing langchain RAG
This branch focuses on using ollama and langchain for RAG (retrieval-augmented generation) over your own documents (see the list of supported file extensions at the end).
cd $HOME
git clone -b ollama-rag --single-branch [email protected]:W-Wuxian/NANIKA.git
Install ollama
After a successful installation run:
ollama pull nomic-embed-text
ollama pull phi3
ollama list
nomic-embed-text is mandatory, but phi3 can be replaced with any model name listed at [ollama.com/library](https://ollama.com/library).
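To check that both models are present before going further, the output of `ollama list` can be parsed. The following is a minimal sketch, not part of this repository: the helper name `missing_models` is hypothetical, the sample text is made up for illustration, and the column layout (a header row followed by `name:tag` in the first column) is an assumption about the `ollama list` output.

```python
def missing_models(list_output: str, required: list[str]) -> list[str]:
    """Return the required model names that are absent from `ollama list` output."""
    installed = set()
    for line in list_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields:
            # Names look like "phi3:latest"; keep the part before the tag.
            installed.add(fields[0].split(":")[0])
    return [name for name in required if name not in installed]


# Illustrative sample only (assumed layout, made-up ID, size and date).
sample = """NAME                       ID              SIZE      MODIFIED
nomic-embed-text:latest    0a109f422b47    274 MB    2 days ago
"""
print(missing_models(sample, ["nomic-embed-text", "phi3"]))  # ['phi3']
```

In practice the text could come from `subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout`.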
After installing ollama and pulling the models, set up the Python environment as follows:
conda env create -f langchain_rag_env.yml
conda activate langchain_rag_env
pip install "unstructured[all-docs]"
pip install chromadb langchain-text-splitters
conda install conda-forge::pytesseract
conda install conda-forge::tesseract
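Alternatively, with a Python virtual environment instead of conda (the activation command below assumes Linux or macOS):
python -m venv langchain_rag_venv
source langchain_rag_venv/bin/activate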
pip install --upgrade unstructured langchain "unstructured[all-docs]"
pip install --upgrade chromadb langchain-text-splitters
pip install --upgrade pytesseract
pip install --upgrade tesseract
Once ollama and the langchain environment are set up (see previous sections), you can use RAG through the nanika.py Python script.
The nanika.py script creates a database from your documents. Its options are listed as follows:
python nanika.py --help
options or long_options are:
-m or --model_name model name
-e or --embedding_name embedding name
-i or --inputdocs_path path given between " " to folders or files to be used at RAG step
-v or --vdb_path vector data base path
-c or --collection_name collection name
-r or --reuse reuse previous vdb and collection
-d or --display-doc whether or not to display given documents
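For readers who want to see how such a command line could be declared, here is a minimal sketch using `argparse`. It is not the actual nanika.py source: only the flag names come from the help text above, while the defaults, the `build_parser` and `as_bool` helpers, and the choice of `argparse` itself are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names mirror the help text; defaults and types are assumptions.
    parser = argparse.ArgumentParser(prog="nanika.py")
    parser.add_argument("-m", "--model_name", help="model name")
    parser.add_argument("-e", "--embedding_name", help="embedding name")
    parser.add_argument("-i", "--inputdocs_path", default="",
                        help="space-separated folders or files, given between quotes")
    parser.add_argument("-v", "--vdb_path", help="vector database path")
    parser.add_argument("-c", "--collection_name", help="collection name")
    parser.add_argument("-r", "--reuse", default="False",
                        help="reuse previous vdb and collection")
    parser.add_argument("-d", "--display-doc", default="False",
                        help="whether or not to display given documents")
    return parser


def as_bool(text: str) -> bool:
    """Interpret values such as 'True' passed after -r or -d."""
    return text.strip().lower() in {"true", "1", "yes"}


args = build_parser().parse_args(
    ["-m", "phi3", "-e", "nomic-embed-text",
     "-v", "./database1", "-c", "collection1", "-r", "True"]
)
print(args.vdb_path, args.collection_name, as_bool(args.reuse))
# ./database1 collection1 True
```

Keeping `-r` as a string and converting it afterwards matches the `-r True` usage shown in this README; a plain `store_true` flag would not accept a value.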
For example, to create a database from several folders and files, using phi3 as the LLM and nomic-embed-text as the embedding model, run:
python nanika.py -m phi3 -e nomic-embed-text -i "/path/to/my/folder1 /path/to/my/folder2 /path/to/my/file1"
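The quoted `-i` value packs several paths into one string. One way such a string could be turned into a list of files is sketched below. This is an illustration under assumptions, not the script's actual logic: the `collect_files` helper is hypothetical, folders are assumed to be walked recursively, and paths that do not exist are silently skipped.

```python
import shlex
from pathlib import Path


def collect_files(inputdocs_path: str) -> list[Path]:
    """Split a space-separated path string and list the files it covers."""
    files: list[Path] = []
    for entry in shlex.split(inputdocs_path):
        path = Path(entry).expanduser()
        if path.is_dir():
            # Walk the folder recursively, keeping regular files only.
            files.extend(sorted(p for p in path.rglob("*") if p.is_file()))
        elif path.is_file():
            files.append(path)
    return files


# Usage with the placeholder paths from the command above:
# collect_files("/path/to/my/folder1 /path/to/my/folder2 /path/to/my/file1")
```

`shlex.split` is used rather than `str.split` so that a path containing spaces can still be passed by quoting it inside the string.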
To maintain several databases, specify where each database is stored via -v and its collection name via -c, as follows:
python nanika.py -m phi3 -e nomic-embed-text -i /path/to/my/folder1/ -v ./database1 -c collection1
python nanika.py -m phi3 -e nomic-embed-text -i /path/to/my/folder2/ -v ./database2 -c collection2
The nanika.py script will also ask you to enter questions (the RAG step); to end this phase, enter q or quit.
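The question phase described above can be pictured as a simple loop that stops on `q` or `quit`. The sketch below is only an illustration: `question_loop` and `echo` are hypothetical names, and `echo` stands in for the retrieval and LLM call, which is not shown here.

```python
from typing import Callable, Iterable


def question_loop(questions: Iterable[str], answer: Callable[[str], str]) -> list[str]:
    """Answer questions in order until 'q' or 'quit' is entered."""
    replies = []
    for question in questions:
        text = question.strip()
        if text.lower() in {"q", "quit"}:
            break
        if text:  # ignore empty lines
            replies.append(answer(text))
    return replies


def echo(question: str) -> str:
    # Stand-in for the real retrieval + generation step.
    return f"(answer to: {question})"


print(question_loop(["What is in folder1?", "quit", "never asked"], echo))
# ['(answer to: What is in folder1?)']
```

In interactive use, `questions` could be a generator that yields the result of `input()` on each iteration.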
To reuse a database, pass the corresponding -v and -c values and run the nanika.py script with -r True, as follows:
python nanika.py -m phi3 -e nomic-embed-text -v ./database1 -c collection1 -r True
python nanika.py -m phi3 -e nomic-embed-text -v ./database2 -c collection2 -r True
file extension | Coverage |
---|---|
✔️ | |
txt | ✔️ |
py | ✔️ |
png | ✔️ |
jpg | ✔️ |
xlsx | ✔️ |
xls | ✔️ |
odt | ✔️ |
csv | ✔️ |
pptx | ✔️ |
md | ✔️ |
org | ✔️ |
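Before pointing `-i` at a large folder, it can help to check which files fall under the table above. The following sketch is an assumption-laden helper, not part of nanika.py: `SUPPORTED_EXTENSIONS` simply copies the extensions named in the table (and may not be exhaustive), and `is_supported` is a hypothetical name.

```python
from pathlib import Path

# Copied from the table above; may not be exhaustive.
SUPPORTED_EXTENSIONS = {
    "txt", "py", "png", "jpg", "xlsx", "xls",
    "odt", "csv", "pptx", "md", "org",
}


def is_supported(path: str) -> bool:
    """Return True when the file extension appears in the coverage table."""
    return Path(path).suffix.lower().lstrip(".") in SUPPORTED_EXTENSIONS


print(is_supported("notes/README.MD"))  # True
print(is_supported("archive.zip"))      # False
```

The comparison is case-insensitive, which is a design choice of this sketch rather than documented behaviour.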