
🎲 Your Best Bet

MLOps demo with Python models in dbt on the European Soccer Database

About The Project

Welcome to the high-octane world of production ML pipelines! This project demonstrates a wide range of MLOps concepts packed into a single dbt project, with tools tailored to help data teams speed up the journey of ML models to production.

Imagine a scenario of daily (or weekly) sports betting where you're on a quest to outsmart the bookies. This project houses the code for a data warehouse powered by the European Soccer Database. Utilizing team and player statistics, performance metrics, FIFA stats, and bookie odds, we'll hunt down opportunities where our model paints a more accurate picture than at least one bookie. When our odds stack up better against theirs, it's our chance to strike gold! 💰
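The core idea can be sketched as a simple expected-value check. This is illustrative only; the function names and logic below are assumptions, not the project's actual code:

```python
# Illustrative value-bet check (not the project's actual code).
# We back an outcome when the model's probability times the bookie's
# decimal odds exceeds 1, i.e. the bet has positive expected value.

def implied_probability(decimal_odds: float) -> float:
    """Probability implied by a bookie's decimal odds (ignoring the margin)."""
    return 1.0 / decimal_odds

def is_value_bet(model_prob: float, bookie_odds: float) -> bool:
    """True when expected value per unit stake (p * odds - 1) is positive."""
    return model_prob * bookie_odds > 1.0

# Example: the model gives the home win 55% while a bookie offers odds of 2.10,
# so expected value = 0.55 * 2.10 - 1 = 0.155 per unit staked.
```

In other words, whenever our estimated probability exceeds the bookie's implied probability, there is (on paper, at least) gold to strike.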

Within our pipeline, you can:

  • Version Your Dataset: run preprocessing to (re)generate your ML dataset
  • Experiment & Store: run and save experiments
  • Model Management: save and compare models
  • Reproducibility: ensure inference pipelines run without train/serving skew (run simulations)
  • Feature Store: house all input features with the available KPIs at that time
  • Prediction Audit: maintain a log of all predictions

(back to top)

Getting Started

Prerequisites

This thrilling adventure requires:

  • Python
  • Access to a Databricks cluster (e.g., Azure free account)
  • A firm grasp on dbt for seamless execution of these examples

Installation (Azure)

Buckle up for the setup ride:

  1. Create a virtual environment and install the Python dependencies:
    virtualenv venv
    source venv/bin/activate
    pip install -r requirements.txt
  2. Download the data from Kaggle (a Kaggle account is required). Drop the resulting database.sqlite file in the data folder.
  3. Convert the data to parquet and CSV files:
    python scripts/convert_data.py
  4. Databricks
    1. Create a SQL warehouse and note the connection details for your dbt profile in the next step.
    2. Create a personal access token; keep it safe and use it to connect dbt to your SQL warehouse.
    3. Upload the parquet files to the warehouse, into the default schema of the hive_metastore catalog.
    4. Create a compute cluster.
    5. Look up the cluster ID (you can find it in the Spark UI) and set it as an environment variable: COMPUTE_CLUSTER_ID=...
  5. dbt
    1. Initialise the project and install its dependencies:
    cd dbt_your_best_bet
    dbt deps
    2. Set up your dbt profile; it should look something like this:
    databricks:
     outputs:
         dev:
             catalog: hive_metastore
             host: xxx.cloud.databricks.com
             http_path: /sql/1.0/warehouses/$SQL_WAREHOUSE_ID
             schema: football
             threads: 4 # max number of parallel processes
             token: $PERSONAL_ACCESS_TOKEN
             type: databricks
     target: dev
  6. riskrover Python package, managed with Poetry
    1. Build and install the package in your local environment:
    cd riskrover
    poetry build
    pip install dist/riskrover-x.y.z.tar.gz
    2. Install the resulting riskrover whl file on your Databricks compute cluster.

You should now be able to run the entire pipeline without any trained models (i.e. the preprocessing):

dbt build --selector gold

(back to top)

Usage

Explore and command the powers of our pipeline.

For these examples to work, you need to move to the root directory of the dbt project, i.e. dbt_your_best_bet.

Minimal working example (MWE) for a simulation

The default variables are stored in dbt_project.yml. We find ourselves on 2016-01-01 in our simulation, with the option to run until 2016-05-25.
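For reference, such defaults would sit in a vars block of the project file. A hypothetical sketch (the key names are assumptions; check the project file for the real ones):

```yaml
# Hypothetical vars block; the actual keys and values may differ.
vars:
  run_date: "2016-01-01"   # the simulation's "today"; can advance up to 2016-05-25
```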

cd dbt_your_best_bet

# Preprocessing
dbt build --selector gold

# Experimentation (by default -> training set to 2015-07-31, and trains a simple logistic regression with cross validation)
dbt build --selector ml_experiment

# Inference on test set (2015-08-01 -> 2015-12-31)
dbt build --selector ml_predict_run

# moving forward in time, for example with a weekly run
dbt build --vars '{"run_date": "2016-01-08"}'
dbt build --vars '{"run_date": "2016-01-15"}'
dbt build --vars '{"run_date": "2016-01-22"}'
...
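Stepping through the simulation week by week can be scripted. Here is a hypothetical helper (not part of the project) that only builds the command strings, leaving execution, e.g. via subprocess, to the caller:

```python
# Hypothetical helper: generate the weekly `dbt build` commands shown above.
from datetime import date, timedelta

def weekly_run_commands(start: date, end: date, step_days: int = 7) -> list[str]:
    """Return one dbt command per run_date from start to end (inclusive)."""
    cmds = []
    current = start
    while current <= end:
        cmds.append(
            f"dbt build --vars '{{\"run_date\": \"{current.isoformat()}\"}}'"
        )
        current += timedelta(days=step_days)
    return cmds
```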

Checking the data catalog

cd dbt_your_best_bet

dbt docs generate
dbt docs serve

None of the models are documented yet (stay tuned!), but we can already check the lineage graph.

(back to top)

Roadmap

Mostly maintenance, no plans on new features unless requested.

  • Documentation
  • Tests
  • Extra sql analysis models

(back to top)

Contributing

All contributions are welcome!

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License.

Contact

(back to top)
