
🎲 Your Best Bet

MLOps demo with Python models in dbt on the European Soccer Database

About The Project

Welcome to the high-octane world of production ML pipelines! This project demonstrates a wide range of MLOps concepts packed into a single dbt project, with tools tailored to help data teams speed up the journey of ML models to production.

Imagine a scenario of daily (or weekly) sports betting where you're on a quest to outsmart the bookies. This project houses the code for a data warehouse powered by the European Soccer Database. Utilizing team and player statistics, performance metrics, FIFA stats, and bookie odds, we'll hunt down opportunities where our model paints a more accurate picture than at least one bookie. When our odds stack up better against theirs, it's our chance to strike gold! 💰
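The core idea can be sketched as a simple expected-value check. This is illustrative only; the function names and logic below are assumptions, not the project's actual code:

```python
# Illustrative value-bet check (not the project's actual code).
# We back an outcome when the model's probability times the bookie's
# decimal odds exceeds 1, i.e. the bet has positive expected value.

def implied_probability(decimal_odds: float) -> float:
    """Probability implied by a bookie's decimal odds (ignoring the margin)."""
    return 1.0 / decimal_odds

def is_value_bet(model_prob: float, bookie_odds: float) -> bool:
    """True when expected value per unit stake (p * odds - 1) is positive."""
    return model_prob * bookie_odds > 1.0

# Example: the model gives the home win 55% while a bookie offers odds of 2.10,
# so expected value = 0.55 * 2.10 - 1 = 0.155 per unit staked.
```

In other words, whenever our estimated probability exceeds the bookie's implied probability, there is (on paper, at least) gold to strike.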

Within our pipeline, you can:

  • Version Your Dataset: run preprocessing to (re)generate your ML dataset
  • Experiment & Store: run and save experiments
  • Model Management: save and compare models
  • Reproducibility: ensure inference pipelines run without train/serving skew (run simulations)
  • Feature Store: house all input features with the available KPIs at that time
  • Prediction Audit: maintain a log of all predictions

(back to top)

Getting Started

Prerequisites

This thrilling adventure requires:

  • Python
  • Access to a Databricks cluster (e.g., Azure free account)
  • A firm grasp on dbt for seamless execution of these examples

Installation (Azure)

Buckle up for the setup ride:

  1. Create a virtual environment and install the Python dependencies:
    virtualenv venv
    source venv/bin/activate
    pip install -r requirements.txt
  2. Download the data from Kaggle (a Kaggle account is required). Drop the resulting database.sqlite file in the data folder.
  3. Convert the data to parquet and CSV files:
    python scripts/convert_data.py
  4. Databricks
    1. Create a SQL warehouse and note the connection details for your dbt profile in the next step.
    2. Create a personal access token; keep it safe and use it to connect dbt to your SQL warehouse.
    3. Upload the parquet files to the warehouse, into the default schema of the hive_metastore catalog.
    4. Create a compute cluster.
    5. Look up the cluster ID (you can find it in the Spark UI) and set it as an environment variable: COMPUTE_CLUSTER_ID=...
  5. dbt
    1. Initialise the project and install its dependencies:
    cd dbt_your_best_bet
    dbt deps
    2. Set up your dbt profile; it should look something like this:
    databricks:
     outputs:
         dev:
             catalog: hive_metastore
             host: xxx.cloud.databricks.com
             http_path: /sql/1.0/warehouses/$SQL_WAREHOUSE_ID
             schema: football
             threads: 4 # max number of parallel processes
             token: $PERSONAL_ACCESS_TOKEN
             type: databricks
     target: dev
  6. riskrover Python package, managed with Poetry
    1. Build and install the package in your local environment:
    cd riskrover
    poetry build
    pip install dist/riskrover-x.y.z.tar.gz
    2. Install the resulting riskrover whl file on your Databricks compute cluster.

You should now be able to run the entire pipeline without any trained models (i.e. the preprocessing):

dbt build --selector gold

(back to top)

Usage

Explore and command the powers of our pipeline.

For these examples to work, you need to move to the root directory of the dbt project, i.e. dbt_your_best_bet.

Minimal working example (MWE) for a simulation

The default variables are stored in dbt_project.yml. We find ourselves on 2016-01-01 in our simulation, with the option to run until 2016-05-25.
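For reference, such defaults would sit in a vars block of the project file. A hypothetical sketch (the key names are assumptions; check the project file for the real ones):

```yaml
# Hypothetical vars block; the actual keys and values may differ.
vars:
  run_date: "2016-01-01"   # the simulation's "today"; can advance up to 2016-05-25
```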

cd dbt_your_best_bet

# Preprocessing
dbt build --selector gold

# Experimentation (by default -> training set to 2015-07-31, and trains a simple logistic regression with cross validation)
dbt build --selector ml_experiment

# Inference on test set (2015-08-01 -> 2015-12-31)
dbt build --selector ml_predict_run

# moving forward in time, for example with a weekly run
dbt build --vars '{"run_date": "2016-01-08"}'
dbt build --vars '{"run_date": "2016-01-15"}'
dbt build --vars '{"run_date": "2016-01-22"}'
...
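Stepping through the simulation week by week can be scripted. Here is a hypothetical helper (not part of the project) that only builds the command strings, leaving execution, e.g. via subprocess, to the caller:

```python
# Hypothetical helper: generate the weekly `dbt build` commands shown above.
from datetime import date, timedelta

def weekly_run_commands(start: date, end: date, step_days: int = 7) -> list[str]:
    """Return one dbt command per run_date from start to end (inclusive)."""
    cmds = []
    current = start
    while current <= end:
        cmds.append(
            f"dbt build --vars '{{\"run_date\": \"{current.isoformat()}\"}}'"
        )
        current += timedelta(days=step_days)
    return cmds
```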

Checking the data catalog

cd dbt_your_best_bet

dbt docs generate
dbt docs serve

None of the models are documented yet (stay tuned!), but we can already check the lineage graph.

(back to top)

Roadmap

Mostly maintenance, no plans on new features unless requested.

  • Documentation
  • Tests
  • Extra sql analysis models

(back to top)

Contributing

All contributions are welcome!

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License.

Contact

(back to top)
