Required cluster runtime version: 12.2 LTS ML. Each participant should provision their own single-node cluster using this runtime, then clone this repository into a Databricks Repo.
Notebook run order:
- etl: Load raw data and create a Delta table of features.
- eda: Compare/contrast Spark SQL and the DataFrame API.
- models/xgboost: Train an XGBoost model and log to MLflow.
- models/random_forest_hyperopt: Train a Random Forest model with hyperparameter tuning and MLflow logging.
- compare_models: Choose the best model and register it in the Model Registry.
- score: Load the production model from the Model Registry and perform inference.
- No notebook: Follow along with instructor: Deploy the production model as a REST API.
- No notebook: Follow along with instructor: Create and run a multi-task job via the Databricks Jobs UI.
- model_registry_webhook: Watch instructor: Triggering activities based on Model Registry events.
- No notebook: Follow along with instructor: Auto ML, training and comparing models automatically.
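At its core, the compare_models step picks the run with the best validation metric and registers that model. A minimal pure-Python sketch of the selection logic (the run names and the `val_rmse` metric key are invented for illustration; the actual notebook would fetch candidate runs from MLflow, e.g. via `MlflowClient.search_runs`, and then call the registry):

```python
# Hypothetical candidate runs: (run_name, logged_metrics) pairs.
# In the real notebook these would come from the MLflow tracking server.
candidate_runs = [
    ("xgboost_run", {"val_rmse": 3.1}),
    ("random_forest_run", {"val_rmse": 2.7}),
]

def pick_best(runs, metric="val_rmse"):
    """Return the name of the run whose logged metric is smallest."""
    return min(runs, key=lambda run: run[1][metric])[0]

best = pick_best(candidate_runs)
print(best)  # random_forest_run
```

The chosen run's model artifact would then be registered in the Model Registry and promoted, which is what the score notebook later loads as the production model.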
Extras if time permits:
- extras/custom_mlflow_model: Creating and logging your own custom MLflow model.
- extras/feature_store: Integrating the Databricks Feature Store into the model training and inference workflows.
Notebook run order:
- passenger_demographic_features
- passenger_ticket_features
- fit_model
- model_inference
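The feature-store extra computes demographic and ticket features in separate notebooks, and fit_model joins them on the passenger key before training. A rough sketch of that join, with plain dicts standing in for the Delta feature tables (the column names `age` and `fare` are invented for illustration):

```python
# Toy stand-ins for the two feature tables, keyed by passenger id.
demographic = {1: {"age": 29}, 2: {"age": 41}}
ticket = {1: {"fare": 72.5}, 2: {"fare": 13.0}}

def join_features(*tables):
    """Merge feature tables keyed by the same id into one row per key."""
    joined = {}
    for table in tables:
        for key, cols in table.items():
            joined.setdefault(key, {}).update(cols)
    return joined

features = join_features(demographic, ticket)
print(features[1])  # {'age': 29, 'fare': 72.5}
```

In the notebooks the Feature Store client performs this lookup for you at both training and inference time, which is what keeps the two pipelines consistent.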