Welcome to the Flexi-DataPipeline project! This repository hosts a robust, adaptable data pipeline that converts images and PDFs, including unstructured content, into a defined JSON schema. Using Azure OCR and Azure AI (LLM: ChatGPT), the pipeline processes diverse document types and stores the resulting structured data in MongoDB. The code is easy to adapt to different input formats and schemas, making it a versatile solution for a variety of data processing needs.
- Azure OCR Integration: Utilize Azure's Optical Character Recognition (OCR) capabilities to accurately extract text from images and PDFs.
- Azure AI (ChatGPT) Integration: Process extracted text using Azure AI (LLM: ChatGPT) to ensure accurate and context-aware data transformation.
- MongoDB Storage: Store the structured data in MongoDB, a NoSQL database known for its scalability and flexibility.
- Customizable JSON Schema: Easily adapt the output JSON schema to fit your specific requirements.
- Support for Various Input Formats: Handle a wide range of image and PDF formats, ensuring broad applicability.
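At a high level, the pipeline chains three stages: OCR extraction, LLM structuring, and MongoDB storage. The following minimal Python sketch shows that flow with placeholder function bodies standing in for the real Azure OCR, Azure AI, and MongoDB calls; every name here is illustrative, not the project's actual API.

```python
# Illustrative sketch of the three pipeline stages. The bodies are
# placeholders, not the project's real Azure OCR / Azure AI / MongoDB code.

def extract_text(path: str) -> str:
    # Stand-in for the Azure OCR call that reads an image or PDF.
    return f"raw text extracted from {path}"

def structure_text(text: str, schema_keys: list) -> dict:
    # Stand-in for the Azure AI (ChatGPT) call that maps raw text
    # onto the target JSON schema.
    return {key: text for key in schema_keys}

def store_document(doc: dict, collection: list) -> dict:
    # Stand-in for inserting the structured document into MongoDB.
    collection.append(doc)
    return doc

collection = []
record = store_document(
    structure_text(extract_text("input/invoice.pdf"), ["content"]),
    collection,
)
```

Each stand-in would be replaced by the corresponding client call in the real pipeline; the point is only the shape of the data flow.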
To get started with Flexi-DataPipeline, follow these steps:
- Clone the repository:
  git clone https://github.com/yoAeroA00/Flexi-DataPipeline.git
  cd Flexi-DataPipeline
- Install dependencies:
  pip install -r requirements.txt
- Set up Azure OCR and Azure AI (ChatGPT):
  - Ensure you have an Azure account and the necessary API keys.
  - Update the .env file with your Azure OCR and Azure AI credentials.
- Configure MongoDB:
  - Install MongoDB and start the MongoDB server.
  - Update the .env file with your MongoDB connection details.
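The two steps above both touch the .env file. Its layout might look something like the fragment below; the variable names here are assumptions for illustration, so check the repository's own .env template for the names the code actually reads.

```
# Hypothetical .env layout -- variable names are illustrative only;
# consult the project's own .env template for the real ones.
AZURE_OCR_ENDPOINT=https://<your-resource>.cognitiveservices.azure.com/
AZURE_OCR_KEY=<your-ocr-key>
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_KEY=<your-openai-key>
MONGODB_URI=mongodb://localhost:27017
MONGODB_DATABASE=<your-database-name>
```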
- Prepare your input files:
  - Place your images and PDFs in the input directory.
- Run the pipeline:
  python main.py
- Check the output:
  - The processed data will be stored in MongoDB according to the defined JSON schema.
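After a run, you may want to spot-check that the stored documents match the expected shape. The sketch below does this with a hand-rolled check against a stand-in schema; the project's real schema is whatever is embedded in the prompt files under app/core/config/prompts/, so the field names here are illustrative only.

```python
# Stand-in schema: the project's actual schema lives inside the prompt
# files, so treat these field names and types as illustrative only.
EXPECTED_FIELDS = {"source_file": str, "content": str}

def matches_schema(doc: dict) -> bool:
    """Return True if doc has exactly the expected fields with the expected types."""
    return set(doc) == set(EXPECTED_FIELDS) and all(
        isinstance(doc[name], typ) for name, typ in EXPECTED_FIELDS.items()
    )

sample = {"source_file": "input/invoice.pdf", "content": "Total: 42.00"}
print(matches_schema(sample))  # a well-formed document passes
```

For production use you would likely validate with a proper JSON Schema validator against the schema extracted from the prompt files, but the check above is enough for a quick sanity test.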
The pipeline can be customized through several files and directories:
- Azure OCR and Azure AI (ChatGPT): Configure the API keys and endpoints in the .env file.
- MongoDB Configuration: Update the MongoDB connection details in the .env file. Specify the MongoDB connection string and database settings as environment variables; the application reads them to connect to MongoDB.
- LLM Prompts and JSON Schemas: Customize or update the LLM prompts, including the JSON schema definitions, by modifying the files in the app/core/config/prompts/ directory. This directory contains JSON files that define the prompts used by the Azure AI (ChatGPT) integration; the schema is embedded within these prompt files.
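A prompt file in that directory might pair an instruction with its embedded schema along these lines; the key names and schema shown here are invented for illustration, not copied from the repository, so inspect the actual files before editing.

```
{
  "system_prompt": "Extract the fields below from the OCR text and reply with JSON only.",
  "json_schema": {
    "type": "object",
    "properties": {
      "source_file": { "type": "string" },
      "content": { "type": "string" }
    },
    "required": ["source_file", "content"]
  }
}
```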
Make sure to check these configuration files and directories to tailor the pipeline to your specific requirements.
We welcome contributions to enhance Flexi-DataPipeline. To contribute:
- Fork the repository.
- Create a new branch:
  git checkout -b feature-new-feature
- Make your changes and commit them:
  git commit -m "Add new feature"
- Push to the branch:
  git push origin feature-new-feature
- Create a pull request.
This project is licensed under the GNU Affero General Public License (AGPL). See the LICENSE file for details.
For any questions or support, please open an issue or contact the repository owner.