This is a machine learning application designed for predicting insurance premiums. The project leverages a variety of tools and frameworks to streamline data management, experiment tracking, and model deployment.
- DVC (Data Version Control): Used for managing and versioning the data pipeline.
- Git: Version control system for tracking code changes.
- MLflow: Used for tracking model training and evaluation runs.
- GitHub Actions: Used for continuous integration and deployment (CI/CD).
- Dagshub: Hosts the MLflow experiment tracking server and the DVC-managed data pipeline.
The application ingests insurance premium data from data/insurance.csv and saves it into artifacts/DataIngestionArtifacts.
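A minimal sketch of what the ingestion step does. The real logic lives in src/components/data_ingestion.py; the function name and default paths below are illustrative, not the project's actual API:

```python
import os
import pandas as pd

def ingest_data(source_path: str = "data/insurance.csv",
                artifact_dir: str = "artifacts/DataIngestionArtifacts") -> str:
    """Read the raw CSV and copy it into the ingestion artifact directory."""
    os.makedirs(artifact_dir, exist_ok=True)
    df = pd.read_csv(source_path)
    out_path = os.path.join(artifact_dir, "insurance.csv")
    df.to_csv(out_path, index=False)
    return out_path
```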
The data then undergoes transformation to prepare it for model training. The transformed data and preprocessing artifacts are saved into artifacts/DataTransformationArtifacts, and the fitted preprocessors are also stored in models/.
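A sketch of a typical transformation step, assuming scikit-learn preprocessing and joblib persistence (the actual column names and transformer choices in src/components/data_transformation.py may differ):

```python
import os
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(numeric_cols, categorical_cols):
    """Scale numeric features and one-hot encode categorical ones."""
    return ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])

def transform_and_save(df: pd.DataFrame, target_col: str,
                       artifact_dir: str = "artifacts/DataTransformationArtifacts"):
    """Fit the preprocessor, persist it, and return transformed features/target."""
    os.makedirs(artifact_dir, exist_ok=True)
    X = df.drop(columns=[target_col])
    numeric = X.select_dtypes("number").columns.tolist()
    categorical = [c for c in X.columns if c not in numeric]
    pre = build_preprocessor(numeric, categorical)
    X_t = pre.fit_transform(X)
    joblib.dump(pre, os.path.join(artifact_dir, "preprocessor.joblib"))
    return X_t, df[target_col].to_numpy()
```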
Multiple machine learning models are trained:
Linear Regression, Ridge Regression, Lasso Regression, Polynomial Regression, Random Forest,
Gradient Boosting, XGBoost, LightGBM, CatBoost.
The top 4 performing models, ranked by training metrics, are selected. Both the models and their metrics are saved into artifacts/ModelTrainerArtifacts. MLflow is used to track model parameters and metrics throughout this process.
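The train-and-rank step can be sketched as below. For brevity this uses only a scikit-learn subset of the candidate models, and the MLflow logging the pipeline performs is shown only as a comment; the real implementation lives in src/components/model_trainer.py:

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score

def train_and_rank(X, y, top_n=4):
    """Fit candidate regressors and return the top_n by training R^2."""
    candidates = {
        "linear": LinearRegression(),
        "ridge": Ridge(),
        "lasso": Lasso(),
        "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
        "gradient_boosting": GradientBoostingRegressor(random_state=42),
    }
    scored = []
    for name, model in candidates.items():
        model.fit(X, y)
        score = r2_score(y, model.predict(X))
        # The real pipeline also logs each run here, e.g. with
        # mlflow.log_param(...) / mlflow.log_metric(...).
        scored.append((name, model, score))
    scored.sort(key=lambda t: t[2], reverse=True)
    return scored[:top_n]
```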
The model that performs best on the test data is selected and saved into artifacts/ModelEvaluationArtifacts and models/. Model evaluation metrics are also tracked with MLflow.
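Conceptually, the evaluation step reduces to scoring each trained candidate on held-out data and keeping the winner. A minimal sketch (function name is illustrative; the project's version is in src/components/model_evaluation.py):

```python
from sklearn.metrics import r2_score

def select_best_on_test(fitted_models, X_test, y_test):
    """Return (name, model, test_r2) for the model that scores best on held-out data."""
    return max(
        ((name, m, r2_score(y_test, m.predict(X_test))) for name, m in fitted_models),
        key=lambda t: t[2],
    )
```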
A Streamlit application is developed to allow users to input data and receive predictions from the trained model.
The model is deployed on AWS EC2 using Docker and GitHub Actions.
```
📁.github/
└── 📁workflows/
    └── main.yaml
📁docs/
├── 📁docs/
│   ├── index.md
│   └── getting-started.md
├── mkdocs.yml
└── README.md
📁src/
├── __init__.py
├── 📁components/
│   ├── __init__.py
│   ├── data_ingestion.py
│   ├── data_transformation.py
│   ├── model_trainer.py
│   └── model_evaluation.py
├── 📁constants/
│   └── __init__.py
├── 📁entity/
│   ├── __init__.py
│   ├── config_entity.py
│   └── artifact_entity.py
├── 📁pipeline/
│   ├── __init__.py
│   ├── training_pipeline.py
│   └── prediction_pipeline.py
├── 📁utils/
│   ├── __init__.py
│   └── utils.py
├── 📁logger/
│   └── __init__.py
└── 📁exception/
    └── __init__.py
📁data/
└── insurance.csv
📁experiment/
└── experiments.ipynb
requirements.txt
requirements_app.txt
setup.py
app.py
main.py
README.md
implement.md
.gitignore
template.py
prediction.py
init_setup.ps1
dvc.yaml
Dockerfile
demo.py
config.json
.dockerignore
.dvcignore
```
- Linear Regression
- Ridge Regression
- Lasso Regression
- Polynomial Regression
- Random Forest
- Gradient Boosting
- XGBoost
- LightGBM
- CatBoost
- Python 3.10
- mkdocs
- dvc
- numpy
- pandas
- colorama
- mlflow==2.2.2
- dagshub
- scikit-learn
- xgboost
- lightgbm
- catboost
- streamlit
To reproduce the model and run the application:

- Clone the repository:

  ```
  git clone <repository_url>
  cd <repository_name>
  ```

- Set up the virtual environment and install the requirements:

  ```
  ./init_setup.ps1
  ```

- Execute the whole pipeline:

  ```
  python main.py
  ```

- Run the Streamlit app:

  ```
  streamlit run app.py
  ```

- Enter the input values and get a prediction.
- Ravi Kumar