
airflow-metaflow-demo

Overview

This demo provides an example of using Metaflow with Apache Airflow.

When it comes to enterprise workflow orchestration, different teams have different needs and fulfill them with different tools. In this demo, consider a Data Engineering team with pre-existing, managed infrastructure for Apache Airflow workflows. Some of their workflows create datasets used by a Machine Learning team. The ML team uses Metaflow to package and run model training, prediction and evaluation. The ML team can build their workflows in Python as they see fit and then export each flow as an Airflow DAG for production orchestration on Airflow.
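For readers new to Metaflow, here is a minimal sketch of what such a flow looks like (HelloFlow is illustrative, not one of the demo's flows):

from metaflow import FlowSpec, step

class HelloFlow(FlowSpec):

    @step
    def start(self):
        # Artifacts assigned to self are versioned and passed between steps
        self.msg = "hello from Metaflow"
        self.next(self.end)

    @step
    def end(self):
        print(self.msg)

if __name__ == "__main__":
    HelloFlow()

Running python hello_flow.py run executes the flow locally; the same file can later be exported to an Airflow DAG with Metaflow's airflow create command, as shown in step 7 below.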

In an enterprise environment, Airflow and Metaflow would likely be provided by centralized infrastructure teams or managed/hosted services. To demonstrate the Airflow/Metaflow integration, this demo instead builds everything with locally running services.

Project services

This demo includes the following services running in local Docker containers.

  • Airflow: This includes containers for the webserver, scheduler and triggerer.
  • Metaflow: This includes containers for the metadata service, UI and UI backend.
  • Minio: A local object store that provides versioned object storage for Metaflow.
  • Postgres: A database to hold state for Airflow and Metaflow.

Project directory contents

Your Astro project contains the following files and folders:

  • dags: This folder contains the Python files for your Airflow DAGs.
  • Dockerfile: This file contains a versioned Astro Runtime Docker image that provides a differentiated Airflow experience. If you want to execute other commands or overrides at runtime, specify them here.
  • include: This folder contains the sample Metaflow flows and the Minio data repository.
  • requirements.txt: Contains Python packages to be installed in Airflow at startup.
  • .env: This environment file provides variables for the Airflow, Metaflow and Minio services listed above (a sketch of typical contents follows this list).
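For reference, the variables in .env look roughly like the sketch below. The Metaflow names are standard Metaflow configuration variables, but the exact values (bucket name, ports) are illustrative; consult the repo's .env for the real ones:

# Point Metaflow at the local Minio datastore (illustrative values)
METAFLOW_DEFAULT_DATASTORE=s3
METAFLOW_DATASTORE_SYSROOT_S3=s3://metaflow
METAFLOW_S3_ENDPOINT_URL=http://localhost:9000

# Credentials for the local Minio object store
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin

# Rescan the dags folder every 10 seconds (demo only; see the note in step 7)
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=10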

Deploy your project locally

Install dependencies

You need Docker installed with its Kubernetes service enabled.

The Metaflow steps run as KubernetesPodOperator tasks in Airflow, so a Kubernetes service is required. This demo was built using Docker Desktop, which optionally includes Kubernetes. To enable Kubernetes:

  • Open the Docker Desktop Dashboard
  • Click on the Settings (wheel) in the top right
  • Select Kubernetes
  • Select Enable Kubernetes
  • Click on Apply & Restart

1. Install Astronomer's Astro CLI

The Astro CLI is an Apache 2.0 licensed, open-source tool for building Airflow instances and is the fastest and easiest way to get up and running with Airflow in minutes.

For macOS

brew install astro

For Linux

curl -sSL install.astronomer.io | sudo bash -s

2. Set up the project directory

git clone https://github.com/astronomer/airflow-metaflow-demo
cd airflow-metaflow-demo
cp -R ~/.kube/config include/.kube/

This demo uses containers from Outerbounds for the Metaflow service and UI backend. The Metaflow UI container is based on the nginx image, which has platform dependencies, so the Docker Compose environment used by the Astro CLI will build metaflow-ui:local locally (based on https://github.com/Netflix/metaflow-ui). Clone the UI source so the build can find it:

git clone https://github.com/Netflix/metaflow-ui include/metaflow-ui

3. Start Airflow, Metaflow and Minio on your local machine

Before running this command, make sure the following ports on your local machine are available: 3000, 5432, 8080, 8081, 8082, 8083, 9000, 9001. For example, on Linux or macOS you can check port 3000 with sudo lsof -i:3000.
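If you prefer to check all of the ports at once, here is a small Python sketch (an illustrative helper, not part of the repo):

import socket

# Ports used by the demo's containers
PORTS = [3000, 5432, 8080, 8081, 8082, 8083, 9000, 9001]

for port in PORTS:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 when something is already listening on the port
        in_use = s.connect_ex(("127.0.0.1", port)) == 0
        print(f"port {port}: {'IN USE' if in_use else 'free'}")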

astro dev start

This command will spin up 8 Docker containers on your machine:

  • Postgres: Airflow's Metadata Database
  • Webserver: The Airflow component responsible for rendering the Airflow UI
  • Scheduler: The Airflow component responsible for monitoring and triggering tasks
  • Triggerer: The Airflow component responsible for triggering deferred tasks
  • Metaflow-ui: The Metaflow frontend UI
  • Metaflow-ui-backend: The Metaflow UI backend service
  • Metaflow-metadata-service: The metadata service for Metaflow
  • Minio: An S3-compatible object storage service

Access the UIs for your local project: the Airflow UI at http://localhost:8080, the Metaflow UI at http://localhost:3000 and the Minio console at http://localhost:9001.

4. Verify all 8 Docker containers were created

docker ps | grep -c Up

This counts the running containers and should print 8.

5. Set up the Airflow DAGs

This demo environment has already been configured to export Metaflow Flows to Airflow DAGs.

Since Metaflow is already installed in the Airflow scheduler, we will run the remaining commands from the scheduler container. Connect to it with the Astro CLI (the -s flag selects the scheduler container):

astro dev bash -s

Trigger the Data Engineering and Feature Engineering DAG

airflow dags unpause data_engineering_dag
airflow dags trigger data_engineering_dag

Connect to the Airflow UI to track the status of the Data Engineering DAG.

6. Run the Metaflow workflows TrainTripDurationFlow and PredictTripDurationFlow

This mimics ML teams using the familiar Metaflow interface for development before pushing a workflow to production.

cd /usr/local/airflow/include/
python train_taxi_flow.py run
python predict_taxi_flow.py run

Connect to the Metaflow UI to track the status of the Train and Predict Flows.
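Results can also be inspected programmatically with the Metaflow client API from the same container (a minimal sketch; it assumes the runs above have completed):

from metaflow import Flow

# Fetch the most recent run of the training flow
run = Flow("TrainTripDurationFlow").latest_run
print(run.id, run.successful)

# List the steps the run executed
for step in run:
    print(step.id)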

7. Export the training and predict Flows as Airflow DAGs for production

cd /usr/local/airflow/dags
python ../include/train_taxi_flow.py --with kubernetes:image='pod_image:local' --with environment:vars='{"AWS_ACCESS_KEY_ID": "minioadmin", "AWS_SECRET_ACCESS_KEY": "minioadmin"}' airflow create train_taxi_dag.py
python ../include/predict_taxi_flow.py --with kubernetes:image='pod_image:local' --with environment:vars='{"AWS_ACCESS_KEY_ID": "minioadmin", "AWS_SECRET_ACCESS_KEY": "minioadmin"}' airflow create predict_taxi_dag.py
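The --with flags attach Metaflow's kubernetes and environment decorators to every step at export time. The same effect could be achieved by decorating steps directly in the flow code; here is a sketch of the pattern (MyFlow is illustrative, not how the demo's flows are actually written):

from metaflow import FlowSpec, step, kubernetes, environment

class MyFlow(FlowSpec):

    # Run this step in a Kubernetes pod using the locally built image
    @kubernetes(image="pod_image:local")
    # Inject the Minio credentials into the pod's environment
    @environment(vars={"AWS_ACCESS_KEY_ID": "minioadmin",
                       "AWS_SECRET_ACCESS_KEY": "minioadmin"})
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyFlow()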

Note: By default the dags directory is only scanned for new files every 5 minutes. For this demo the list interval was set to 10 seconds via AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL in the .env file. This is not advised in production.

8. Set up triggering for the TrainTripDurationFlow and PredictTripDurationFlow DAGs

The data engineering DAG can be configured to trigger the downstream train and predict DAGs with the TriggerDagRunOperator. Now that the TrainTripDurationFlow and PredictTripDurationFlow DAGs are in place, edit the dags/data_engineering_dag.py file and uncomment the lines at the bottom:

    # Trigger the exported Metaflow DAGs after the feature file is created
    _trigger_train = TriggerDagRunOperator(task_id='trigger_metaflow_train',
                                           trigger_dag_id='TrainTripDurationFlow',
                                           reset_dag_run=True,
                                           wait_for_completion=True,
                                           deferrable=True)

    _trigger_pred = TriggerDagRunOperator(task_id='trigger_metaflow_predict',
                                          trigger_dag_id='PredictTripDurationFlow',
                                          reset_dag_run=True,
                                          wait_for_completion=True,
                                          deferrable=True)

    # Chain the tasks: feature file -> train -> predict
    _feature_file >> _trigger_train >> _trigger_pred
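With wait_for_completion=True each trigger task waits for the downstream DAG to finish, and deferrable=True hands that wait to the triggerer container listed above instead of occupying a worker slot while the Metaflow DAGs run.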

9. Trigger the DAG again to run data engineering, feature engineering, model training and prediction

astro dev bash -s

Trigger the Data Engineering and Feature Engineering DAG

airflow dags trigger data_engineering_dag

Connect to the Airflow UI to track the status of the DAG runs. After the Data Engineering DAG completes, it will trigger the TrainTripDurationFlow DAG and then the PredictTripDurationFlow DAG.

10. Add more Metaflow features

You can use many of the typical Metaflow features with this Airflow integration. See the Metaflow documentation on Airflow support for the current lists of supported and unsupported Metaflow and Airflow features.

Contributors

emattia, mpgreg
