
Build End-to-End Machine Learning (ML) Workflows with Amazon SageMaker and Apache Airflow

This repository contains the assets for the Amazon SageMaker and Apache Airflow integration sample described in the accompanying AWS Machine Learning blog post.

Overview

This repository shows a sample of how to build, manage, and orchestrate ML workflows using Amazon SageMaker and Apache Airflow. We will build a recommender system that predicts a customer's rating for a given video based on the customer's historical ratings of similar videos, as well as the behavior of other, similar customers. We'll use historical star ratings from over 2 million Amazon customers on over 160,000 digital videos. More details on this dataset can be found on its AWS Public Datasets page.

Repository Structure

The repository contains:

  • AWS CloudFormation templates to launch the AWS services required to create the components
  • An Airflow DAG Python script that integrates and orchestrates all the ML tasks in an ML workflow for building a recommender system
  • A companion Jupyter notebook to understand the individual ML tasks in detail, such as data exploration, data preparation, model training/tuning, and inference
.
├── README.md                                         About the repository
├── cfn                                               AWS CloudFormation Templates
│   └── airflow-ec2.yaml                              CloudFormation for installing Airflow instance backed by RDS
├── notebooks                                         Jupyter Notebooks
│   └── amazon-video-recommender_using_fm_algo.ipynb
└── src                                               Source code for Airflow DAG definition
    ├── config.py                                     Config file to configure SageMaker jobs and other ML tasks
    ├── dag_ml_pipeline_amazon_video_reviews.py       Airflow DAG definition for ML workflow
    └── pipeline                                      Python module used in Airflow DAG for data preparation
        ├── __init__.py
        ├── prepare.py                                Data preparation script
        └── preprocess.py                             Data pre-processing script

High Level Solution

Here is a high-level depiction of the ML workflow we will implement to build the recommender system:

[Figure: airflow_dag_workflow — high-level Airflow workflow for the recommender system]

The workflow performs the following tasks:

  1. Data Pre-processing: Extract and pre-process data from Amazon S3 to prepare the training data.
  2. Prepare Training Data: To build the recommender system, we will use SageMaker's built-in Factorization Machines algorithm. The algorithm expects training data only in RecordIO Protobuf format with Float32 tensors, so in this task the pre-processed data is transformed into RecordIO Protobuf format (a minimal sketch of this conversion follows the list).
  3. Training the Model: Train SageMaker's built-in Factorization Machines model with the training data and generate model artifacts. The training job is launched by the Airflow SageMaker operator SageMakerTrainingOperator.
  4. Tune the Model Hyper-parameters: A conditional/optional task to tune the hyper-parameters of the Factorization Machines model to find the best model. The hyper-parameter tuning job is launched by the Airflow SageMaker operator SageMakerTuningOperator.
  5. Batch Inference: Use the trained model to get inferences on the test dataset stored in Amazon S3 with the Airflow SageMaker operator SageMakerTransformOperator.
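
The following is a minimal sketch of step 2; it is not the repository's prepare.py/preprocess.py. It converts a toy sparse feature matrix with Float32 labels into RecordIO Protobuf using the SageMaker Python SDK and uploads the result to S3. The bucket and prefix names are placeholders.

import io

import boto3
import numpy as np
import scipy.sparse as sp
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

# Toy one-hot-encoded (user, item) features and Float32 star-rating labels
X_train = sp.csr_matrix(np.eye(4, dtype=np.float32))
y_train = np.array([5.0, 3.0, 4.0, 1.0], dtype=np.float32)

# Serialize the sparse matrix and labels to RecordIO Protobuf in memory
buf = io.BytesIO()
write_spmatrix_to_sparse_tensor(buf, X_train, y_train)
buf.seek(0)

# Upload the protobuf payload to S3 so the SageMaker training job can read it
bucket, prefix = "my-airflow-sagemaker-bucket", "fm/train"  # placeholders
boto3.resource("s3").Object(bucket, prefix + "/train.protobuf").upload_fileobj(buf)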

CloudFormation Template Resources

We will set up a simple Airflow architecture with the scheduler, worker, and web server running on the same instance. Typically, you would not use this setup for production workloads. We will use AWS CloudFormation to launch the AWS services required to create the components in the blog post. The stack includes the following:

  • An Amazon EC2 instance to set up the Airflow components
  • An Amazon Relational Database Service (RDS) PostgreSQL or Aurora Serverless instance to host the Airflow metadata database
  • An Amazon S3 bucket to store the SageMaker model artifacts, outputs, and the Airflow DAG with the ML workflow. The template will prompt for the S3 bucket name.
  • AWS IAM roles and EC2 security groups to allow the Airflow components to interact with the metadata database, the S3 bucket, and Amazon SageMaker

If you want to troubleshoot or add custom operators, you can connect directly to the instance through the Session Manager console. You can launch either of two stable Airflow versions (1.10.12 or 2.0.2) using the links below:

  • Airflow 1.10.12 with RDS: [Launch Stack]
  • Airflow 1.10.12 with Aurora Serverless: [Launch Stack]
  • Airflow 2.0.2 with RDS: [Launch Stack]
  • Airflow 2.0.2 with Aurora Serverless: [Launch Stack]

It might take up to 10 minutes for the CloudFormation stack to create the resources. After the resource creation is complete, you should be able to log in to the Airflow web UI with the credentials specified in the parameters of the CloudFormation stack. The Airflow web server runs on port 8080 by default. To open the Airflow web UI, open any browser and go to http://ec2-public-dns-name:8080. The public DNS name of the EC2 instance can be found on the Outputs tab of the stack in the AWS CloudFormation console.
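
If you prefer scripting over the console, here is a minimal sketch of reading the stack outputs with boto3; the stack name "airflow-sagemaker" is a placeholder, and the exact output key depends on the template.

import boto3

# Print all outputs of the stack, including the EC2 instance's public DNS name
cfn = boto3.client("cloudformation")
stack = cfn.describe_stacks(StackName="airflow-sagemaker")["Stacks"][0]
for output in stack.get("Outputs", []):
    print(output["OutputKey"], "=", output["OutputValue"])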

Airflow DAG for ML Workflow

The Airflow DAG integrates all the ML tasks into an ML workflow. An Airflow DAG is a Python script in which you express individual tasks as Airflow operators, set task dependencies, and associate the tasks with a DAG object that runs either on demand or on a schedule. The Airflow DAG script is divided into the following sections (a condensed sketch follows the list):

  1. Set up the DAG with parameters such as schedule_interval to run the workflow at a scheduled time
  2. Set up training, tuning, and inference configurations for each operator using the SageMaker Python SDK for Airflow operators
  3. Create individual tasks as Airflow operators, defining trigger rules and associating them with the DAG object. Refer to the previous section for what the individual tasks do.
  4. Specify task dependencies
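
As a rough illustration (not the repository's dag_ml_pipeline_amazon_video_reviews.py), here is a condensed sketch of these sections, assuming Airflow 1.10.x and SageMaker Python SDK v1; the import paths and estimator arguments differ under Airflow 2 and SDK v2, and the role ARN, S3 URIs, and DAG id are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.sagemaker_training_operator import SageMakerTrainingOperator
from sagemaker import FactorizationMachines
from sagemaker.workflow.airflow import training_config

# 1. Set up the DAG with scheduling parameters
dag = DAG(
    dag_id="sagemaker_fm_recommender",  # placeholder
    default_args={"owner": "airflow", "start_date": datetime(2021, 1, 1)},
    schedule_interval="@once",
)

# 2. Build the training configuration with the SageMaker Python SDK
fm_estimator = FactorizationMachines(
    role="arn:aws:iam::123456789012:role/SageMakerRole",        # placeholder
    train_instance_count=1,
    train_instance_type="ml.c5.xlarge",
    num_factors=64,
    predictor_type="regressor",
    output_path="s3://my-airflow-sagemaker-bucket/fm/output",   # placeholder
)
train_cfg = training_config(
    estimator=fm_estimator,
    inputs="s3://my-airflow-sagemaker-bucket/fm/train",         # placeholder
)

# 3. Create the task as an Airflow operator and associate it with the DAG
model_training = SageMakerTrainingOperator(
    task_id="model_training",
    config=train_cfg,
    wait_for_completion=True,
    dag=dag,
)

# 4. Task dependencies would be chained here, for example:
# preprocess >> prepare_train_data >> model_training >> batch_transform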

[Figure: airflow_dag — the Airflow DAG for the ML workflow]

You can find the Airflow DAG code in src/dag_ml_pipeline_amazon_video_reviews.py in this repository.

Cleaning Up the Stack Resources

The final step is to clean up. To avoid unnecessary charges:

  1. Destroy all of the resources created by the CloudFormation stack for the Airflow setup by deleting the stack after you're done experimenting with it. You can follow the steps in the AWS CloudFormation documentation to delete a stack.
  2. Manually delete the S3 bucket that was created, because AWS CloudFormation cannot delete a non-empty S3 bucket. A boto3 sketch of both steps follows.
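
A minimal boto3 sketch of the cleanup; it empties the bucket first so that stack deletion is not blocked by a non-empty bucket. The bucket and stack names are placeholders.

import boto3

# Empty the S3 bucket first; CloudFormation cannot delete a non-empty bucket
bucket = boto3.resource("s3").Bucket("my-airflow-sagemaker-bucket")  # placeholder
bucket.objects.all().delete()

# Then delete the CloudFormation stack, which removes the (now empty) bucket
# and the other resources it created
boto3.client("cloudformation").delete_stack(StackName="airflow-sagemaker")  # placeholder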

License Summary

This sample code is made available under a modified MIT license. See the LICENSE file.


Issues

Content of container

I see here that you are using a container to build a custom model.

Would it be possible to have more details on how this container is structured and understand the input/output of the estimator for different channels (e.g. train, tune, etc.)?

Sagemaker error when deployed on EC2

Hi guys,

I installed sagemaker==v1.40.0.

After I launch the CloudFormation stack, I get this error in the EC2 log:
ClientError: An error occurred (ValidationException) when calling the CreateModel operation: 2 validation errors detected: Value '{{ ti.xcom_pull(task_ids='model_training')['Training']['TrainingJobName'] }}' at 'modelName' failed to satisfy constraint: Member must have length less than or equal to 63; Value '{{ ti.xcom_pull(task_ids='model_training')['Training']['TrainingJobName'] }}' at 'modelName' failed to satisfy constraint: Member must satisfy regular expression pattern: ^a-zA-Z0-9*

Note: I checked with sagemaker==v1.39.2 ==> it worked

sagemaker version - get_image_uri() not resolved

I am trying to replicate your way of defining the DAG for a SageMaker training job. But for this line:

from sagemaker.amazon.amazon_estimator import get_image_uri

I get this error:

Cannot find reference 'get_image_uri' in 'amazon_estimator.py'.

Is it defined in any specific version? Or would you have any suggestions for resolving this error, please?

Thanks

DAG Import Errors

[screenshot: DAG import error shown in the Airflow UI]

Hello, I just set up an Airflow environment using the Airflow 2.0.2 RDS CloudFormation template.
When I access and log in to the Airflow UI page, I get the above error.

Do you know what the problem is?

Thanks,
gwangjin
