A RAG pipeline that identifies useful web pages and reports on the internet, then scrapes and ingests them to collect better energy and climate data.
Acknowledgements:
- Pixegami for initial RAG workflow
- Greg Kamradt for chunking strategies
Create a virtual environment and install the required packages:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Copy the example environment file and add all the relevant API keys to the resulting `.env` file in the root directory:

```bash
cp .env.example .env
```
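The exact keys depend on which providers the pipeline is configured for, so check `.env.example` for the authoritative list. As a purely hypothetical illustration, an OpenAI-backed setup might contain:

```
# Hypothetical example -- the keys this repo actually needs are listed in .env.example
OPENAI_API_KEY=your-key-here
```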
To run the RAG pipeline, first start a ChromaDB server:

```bash
chroma run --path chroma/
```
Then run the following command:

```bash
python query_data.py "Give me a list of coal power plants in Vietnam"
```
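Under the hood, a query like this retrieves the stored chunks most similar to the question and feeds them to an LLM. A minimal self-contained sketch of that retrieve-then-prompt flow (word overlap stands in for real vector embeddings, and the chunk data is made up, so it runs without a Chroma server and is not the actual `query_data.py` logic):

```python
import re

def words(text):
    """Lowercase word set with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query, chunk):
    """Fraction of query words that appear in the chunk."""
    q = words(query)
    return len(q & words(chunk)) / len(q) if q else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most relevant to the query."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]

def build_prompt(query, context_chunks):
    """Assemble a context-grounded prompt for the LLM."""
    context = "\n---\n".join(context_chunks)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

# Toy chunk store standing in for the Chroma collection
chunks = [
    "Vinh Tan 4 is a coal power plant in Binh Thuan province, Vietnam.",
    "Solar capacity in Germany grew strongly in 2023.",
    "Duyen Hai 1 is a coal-fired power plant in Tra Vinh province, Vietnam.",
]
top = retrieve("coal power plants in Vietnam", chunks)
prompt = build_prompt("coal power plants in Vietnam", top)
```

In the real pipeline, `retrieve` is replaced by an embedding-similarity search against the ChromaDB collection.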
To add new URLs to the database, run the following command:

```bash
python populate_database.py --url "https://www.example.com"
```
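Ingesting a URL presumably means fetching the page, stripping it to plain text, and splitting it into chunks before embedding. A standard-library sketch of those steps (the chunk size and overlap are illustrative defaults, not the script's actual settings):

```python
import urllib.request
from html.parser import HTMLParser

def fetch(url):
    """Download raw HTML (the real pipeline may use a dedicated scraper)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Reduce an HTML page to its visible text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

def chunk_text(text, size=500, overlap=50):
    """Fixed-size character chunks with overlap, a common RAG default."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

A page would then be processed as `chunk_text(html_to_text(fetch(url)))` before the chunks are embedded and stored.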
To add local markdown files to the database, put your files in `data/` and run the following command:

```bash
python populate_database.py
```
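For local files, ingestion amounts to reading everything in `data/` and splitting it into chunks. A sketch assuming heading-based splitting (the chunking strategy actually used by `populate_database.py` may differ):

```python
from pathlib import Path

def load_markdown(data_dir="data"):
    """Read every .md file under data_dir into (filename, text) pairs."""
    return [(p.name, p.read_text(encoding="utf-8"))
            for p in sorted(Path(data_dir).glob("**/*.md"))]

def split_on_headings(text):
    """Split a markdown document at top- and second-level headings."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith(("# ", "## ")) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each resulting chunk would then be embedded and written to the ChromaDB collection, just as with scraped pages.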