Welcome to the Flexi-DataPipeline project! This repository hosts a robust, adaptable data pipeline that converts images and PDFs, including unstructured content, into a defined JSON schema. Using Azure OCR and Azure AI (LLM: ChatGPT), the pipeline processes diverse document types and stores the resulting structured data in MongoDB. The code is easy to adapt to different input formats and schemas, making it a versatile solution for a variety of data processing needs.
- Azure OCR Integration: Utilize Azure's Optical Character Recognition (OCR) capabilities to accurately extract text from images and PDFs.
- Azure AI (ChatGPT) Integration: Process extracted text using Azure AI (LLM: ChatGPT) to ensure accurate and context-aware data transformation.
- MongoDB Storage: Store the structured data in MongoDB, a NoSQL database known for its scalability and flexibility.
- Customizable JSON Schema: Easily adapt the output JSON schema to fit your specific requirements.
- Support for Various Input Formats: Handle a wide range of image and PDF formats, ensuring broad applicability.
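At a high level, the pipeline chains three stages: OCR extraction, LLM structuring, and MongoDB storage. The following minimal Python sketch shows that flow with placeholder function bodies standing in for the real Azure OCR, Azure AI, and MongoDB calls; every name here is illustrative, not the project's actual API.

```python
# Illustrative sketch of the three pipeline stages. The bodies are
# placeholders, not the project's real Azure OCR / Azure AI / MongoDB code.

def extract_text(path: str) -> str:
    # Stand-in for the Azure OCR call that reads an image or PDF.
    return f"raw text extracted from {path}"

def structure_text(text: str, schema_keys: list) -> dict:
    # Stand-in for the Azure AI (ChatGPT) call that maps raw text
    # onto the target JSON schema.
    return {key: text for key in schema_keys}

def store_document(doc: dict, collection: list) -> dict:
    # Stand-in for inserting the structured document into MongoDB.
    collection.append(doc)
    return doc

collection = []
record = store_document(
    structure_text(extract_text("input/invoice.pdf"), ["content"]),
    collection,
)
```

Each stand-in would be replaced by the corresponding client call in the real pipeline; the point is only the shape of the data flow.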
To get started with Flexi-DataPipeline, follow these steps:
- Clone the repository:
  git clone https://github.com/yoAeroA00/Flexi-DataPipeline.git
  cd Flexi-DataPipeline
- Install dependencies:
  pip install -r requirements.txt
- Set up Azure OCR and Azure AI (ChatGPT):
  - Ensure you have an Azure account and the necessary API keys.
  - Update the .env file with your Azure OCR and Azure AI credentials.
- Configure MongoDB:
  - Install MongoDB and start the MongoDB server.
  - Update the .env file with your MongoDB connection details.
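The two steps above both touch the .env file. Its layout might look something like the fragment below; the variable names here are assumptions for illustration, so check the repository's own .env template for the names the code actually reads.

```
# Hypothetical .env layout -- variable names are illustrative only;
# consult the project's own .env template for the real ones.
AZURE_OCR_ENDPOINT=https://<your-resource>.cognitiveservices.azure.com/
AZURE_OCR_KEY=<your-ocr-key>
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_KEY=<your-openai-key>
MONGODB_URI=mongodb://localhost:27017
MONGODB_DATABASE=<your-database-name>
```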
- Prepare your input files:
  - Place your images and PDFs in the input directory.
- Run the pipeline:
  python main.py
- Check the output:
  - The processed data will be stored in MongoDB according to the defined JSON schema.
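After a run, you may want to spot-check that the stored documents match the expected shape. The sketch below does this with a hand-rolled check against a stand-in schema; the project's real schema is whatever is embedded in the prompt files under app/core/config/prompts/, so the field names here are illustrative only.

```python
# Stand-in schema: the project's actual schema lives inside the prompt
# files, so treat these field names and types as illustrative only.
EXPECTED_FIELDS = {"source_file": str, "content": str}

def matches_schema(doc: dict) -> bool:
    """Return True if doc has exactly the expected fields with the expected types."""
    return set(doc) == set(EXPECTED_FIELDS) and all(
        isinstance(doc[name], typ) for name, typ in EXPECTED_FIELDS.items()
    )

sample = {"source_file": "input/invoice.pdf", "content": "Total: 42.00"}
print(matches_schema(sample))  # a well-formed document passes
```

For production use you would likely validate with a proper JSON Schema validator against the schema extracted from the prompt files, but the check above is enough for a quick sanity test.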
The pipeline can be customized through several files and directories:
- Azure OCR and Azure AI (ChatGPT): Configure the API keys and endpoints in the .env file.
- MongoDB Configuration: Update the MongoDB connection details in the .env file. Specify the MongoDB connection string and database settings as environment variables; the application reads them to connect to MongoDB.
- LLM Prompts and JSON Schemas: Customize or update the LLM prompts, including the JSON schema definitions, by modifying the files in the app/core/config/prompts/ directory. This directory contains JSON files that define the prompts used by the Azure AI (ChatGPT) integration; the schema is embedded within these prompt files.
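A prompt file in that directory might pair an instruction with its embedded schema along these lines; the key names and schema shown here are invented for illustration, not copied from the repository, so inspect the actual files before editing.

```
{
  "system_prompt": "Extract the fields below from the OCR text and reply with JSON only.",
  "json_schema": {
    "type": "object",
    "properties": {
      "source_file": { "type": "string" },
      "content": { "type": "string" }
    },
    "required": ["source_file", "content"]
  }
}
```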
Make sure to check these configuration files and directories to tailor the pipeline to your specific requirements.
We welcome contributions to enhance Flexi-DataPipeline. To contribute:
- Fork the repository.
- Create a new branch:
  git checkout -b feature-new-feature
- Make your changes and commit them:
  git commit -m "Add new feature"
- Push to the branch:
  git push origin feature-new-feature
- Create a pull request.
This project is licensed under the GNU Affero General Public License (AGPL). See the LICENSE file for details.
For any questions or support, please open an issue or contact the repository owner.