About • Installation • Configuration • How To Use • App Demo • How To Test
The goal of this project was to learn how to implement end-to-end MLOps in practice. As a result, a machine learning pipeline was built that fine-tunes a pre-trained model, along with a web application that detects house sparrows in a photo and interacts with the model via an API. This repository contains the source code of the pipeline (including the web app and the API) and some of its run results, which are needed to reproduce the pipeline and demonstrate how it works. The following diagram shows how MLOps is implemented in the project:
An extended version of the diagram can be viewed in `docs/project-mlops-diagram-extended.svg`.

More information about the dataset and model used can be found in `docs/dataset-card.md` and `docs/model-card.md`, respectively.
The source code was developed on Windows. All the Python modules in the repository, except those in the `deployment` folder, were also run and tested on Linux (Ubuntu) (in Google Colab and using GitHub Actions, respectively).
First, clone this repo and go to its root directory. Then create a virtual environment with Python 3.9 (other versions are untested) and activate it. After that, install either all the project's dependencies by running:

```
$ python -m pip install -r requirements/dev-requirements.txt
```

or only those needed for a specific MLOps task (see the table below).
**Note**: The repo contains two types of ML pipelines: DVC pipelines (including `pipelines/dvc.yaml` and `pipelines/new_data_pipeline/dvc.yaml`) and a Metaflow workflow (`pipelines/mlworkflow.py`), which is not available on Windows. If you plan to use the latter, uncomment the appropriate tool in the `pipe-requirements.txt` file to install it.
| MLOps Component/Task | Requirements | Used Files & Folders | Output Files & Folders |
|---|---|---|---|
| EDA | eda-requirements.txt | notebooks/EDA.ipynb, data/raw | notebooks/EDA.ipynb (outputs), data/prepared (optional) |
| Data Checking | data-check-requirements.txt | data_checks, great_expectations, data, configs/params.yaml | data_checks/data_check_results, great_expectations/uncommitted, pipe.log |
| Model Training | train-requirements.txt | src, data, configs | hyper_opt, configs/best_params.yaml, mlruns, models, outputs, reports/model_report.md, pipe.log |
| Pipeline/Workflow | pipe-requirements.txt | pipelines, data_checks, great_expectations, src, data, configs | .dvc, pipelines (/dvc.lock & /dvc_dag.md) or .metaflow, data_checks/data_check_results, great_expectations, hyper_opt, configs/best_params.yaml, mlruns, models, outputs, reports/model_report.md, pipe.log |
| Model Deployment / API & App | deployment-requirements.txt | deployment (except /demo), src/train/train_inference_fns.py, src/utils.py, mlruns, configs/params.yaml, .streamlit | monitoring/current_deployed_model.yaml, monitoring/data |
| Model Monitoring | monitoring-requirements.txt | monitoring, data, configs/params.yaml | monitoring/deployed_model_check_results, reports/deployed_model_performance_report.html, mon.log |
| Continuous Integration (CI) | ci-requirements.txt | .github, tests (except /webapi), pytest.ini, data_checks, src | - |
| Web App Demo | requirements.txt (in deployment/demo) | deployment/demo | - |
If you used `dev-requirements.txt`, run `pre-commit install` to install git hook scripts from the `.pre-commit-config.yaml` file (including for the DVC project in this repo). If you want to use your own Great Expectations/DVC projects, ensure that they are initialized in the root directory of the repo, or do so by running the `great_expectations init` / `dvc init` commands. For details, refer to the documentation of the respective tools.
The dataset to reproduce the ML pipeline of this project can be found here. To run the pipeline on your own data, it must be organized as described in `docs/dataset-card.md`. If necessary, configure the items of the Great Expectations project according to the new data. See samples of the data used in the `tests/data_samples` folder.
The project is configured to run on a local machine, although this can be changed if necessary. The main MLOps settings for this project are held in the `configs/params.yaml` file.
**Note**: To use remote storage or advanced features, some of the installed Python packages (DVC and MLflow, for example) require additional dependencies. See their documentation.
Below are the CLI commands for the MLOps components that are executed manually in this implementation. Their order matters, because later commands depend on the results of earlier ones. Other components are either already included in the pipelines/workflow (such as data verification/validation, hyperparameter optimization, and model stage transition to production) or are triggered when code is pushed to GitHub (such as tests).
- Run either the pipeline or the workflow (both contain similar stages/steps) to train (fine-tune) an object detection model:
```
# (Optional) Generate a Python script for the 'new_data_expectation_check' step
$ great_expectations checkpoint script new_image_info_and_bbox_ckpt.yml
# Generate a Python script for the 'raw_data_expectation_check' step
$ great_expectations checkpoint script image_info_and_bbox_ckpt.yml
# Run the model training workflow
$ python pipelines/mlworkflow.py run
```

or add the `--production` flag if the trained model will be used in production, regardless of its performance.

**Warning**: The workflow is created with Metaflow, which is not available on Windows.
```
# (Optional) Reproduce the new data check pipeline
$ dvc repro pipelines/new_data_pipeline/dvc.yaml
# Reproduce the model training pipeline without including new data checks
$ dvc repro pipelines/dvc.yaml
```

or use the `--all-pipelines` flag to reproduce all the pipelines for all the `dvc.yaml` files present in the repo. DAGs of the pipelines can be viewed in the `pipelines/dvc_dag.md` file.
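For orientation, a DVC pipeline stage generally has the shape shown below. This fragment is illustrative only: the stage name, command, and paths are assumptions, while the real stage definitions live in `pipelines/dvc.yaml`:

```yaml
# Hypothetical stage; see pipelines/dvc.yaml for the actual definitions.
stages:
  train:
    cmd: python src/train/train.py   # assumed entry point
    deps:                            # inputs DVC watches for changes
      - data/prepared
      - configs/params.yaml
    outs:                            # artifacts DVC tracks and caches
      - models
```

`dvc repro` re-runs a stage only when one of its `deps` has changed, which is what makes the pipelines above reproducible and cheap to repeat.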
- (Optional) View the history of model training runs:

```
$ mlflow ui --backend-store-uri sqlite:///mlruns/mlruns.db
```

**Note**: Change the value of `--backend-store-uri` to match the tracking server URI set for MLflow.
- Use a deployed model via the API in the web app to get its performance data:

```
# Run the API on a uvicorn server
$ python deployment/api.py
# Run the web app on a Streamlit server
$ streamlit run deployment/app.py
```
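The web app is not the only possible client: once the API is up, any HTTP client can send it a photo. The sketch below builds such a request with the standard library; the endpoint path, port, and payload format are assumptions, so check `deployment/api.py` for the actual route and schema:

```python
# Hypothetical sketch of calling the detection API directly instead of
# through the Streamlit app. The URL and content type are assumptions.
import urllib.request

API_URL = "http://127.0.0.1:8000/detection"  # assumed host/port/route


def build_detection_request(image_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a photo to score."""
    return urllib.request.Request(
        API_URL,
        data=image_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


# Sending it only makes sense while `python deployment/api.py` is running:
# with urllib.request.urlopen(build_detection_request(photo_bytes)) as resp:
#     detections = resp.read()
req = build_detection_request(b"...image bytes...")
print(req.get_method(), req.full_url)
```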
- Monitor the performance of the deployed model:

```
$ python monitoring/monitor_deployed_model.py
```
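The actual monitoring logic lives in `monitoring/monitor_deployed_model.py`; as a rough illustration of the idea, a check of this kind compares the deployed model's current score against a reference value with some allowed tolerance. The threshold numbers below are made up for the example:

```python
# Illustrative performance-degradation check; both thresholds are assumptions.
BASELINE_SCORE = 0.80  # assumed score of the model at deployment time
TOLERANCE = 0.05       # assumed acceptable drop before raising an alert


def model_degraded(current_score: float) -> bool:
    """Return True if the deployed model's score fell below the allowed band."""
    return current_score < BASELINE_SCORE - TOLERANCE


print(model_degraded(0.78))  # within tolerance
print(model_degraded(0.70))  # degraded: would warrant investigation/retraining
```

A real check would read the current score from the collected performance data (see `monitoring/data` in the table above) rather than hard-code it.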
After executing these commands, the project directory will have a structure similar to the one presented in the `docs/project-directory-structure.md` file.
Notebooks in this repo, except `EDA.ipynb`, contain trial runs of the data checks and initial experiments for building the model training pipeline.
You can try out the web application from this project on .
If the demo is unavailable or not displayed, it can be seen as a static image in `docs/app-image.pdf`.
If pytest and its required plugins are not installed, run:

```
$ python -m pip install -r requirements/test-requirements.txt
```

Test configurations are held in the `pytest.ini` file.
```
# Run all the tests in the repo except the API ones
$ pytest --ignore=tests/webapi/ tests/
```

```
# Run the uvicorn server and the API
$ python deployment/api.py
# Run the API tests
$ pytest tests/webapi/
```
**Note**: Some of the tests take a long time; they are marked as "slow".

```
# Skip slow tests
$ pytest -m "not slow" tests/
```
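The "slow" label relies on pytest's marker mechanism: a marked test is deselected by `-m "not slow"`. Roughly, a test in this repo carries the marker like the hypothetical example below (the test name and body are assumptions; the marker itself would be registered in `pytest.ini`):

```python
# Hypothetical example of a test tagged "slow", so that
# `pytest -m "not slow"` deselects it while a plain `pytest` run includes it.
import pytest


@pytest.mark.slow
def test_training_one_epoch_runs():
    # A long-running check, e.g. one fine-tuning epoch on a tiny data sample.
    assert True
```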
**Warning**: Sometimes the `integration` tests fail due to the stochastic nature of machine learning algorithms. Try running the tests again.