
bankdemo-build_vypf

Purpose

The purpose of this repo is to provide sample code for a demo that illustrates MLOps (preprocessing, training, deploying, etc.) with the data source coming from RedShift. Other RedShift features such as RedShift Spectrum and RedShift ML are also demonstrated.

Disclaimer: The focus of this demo is to showcase the above, so for simplicity full IAM permissions are assigned. In a production setting, best practices such as least-privilege IAM permissions should be used instead.

Services used

The following AWS services are used in this demo:

  • SageMaker
  • Code* (CodeCommit, CodeBuild, CodePipeline, used as part of the SageMaker Pipelines MLOps setup)
  • RedShift
  • S3
  • Glue
  • Athena
  • Secrets Manager

Prerequisites prior to running the notebooks

Before deploying the CloudFormation script, ensure the region you are deploying to meets the following requirements:

  • A new VPC can be created, i.e. the VPC limit has not been reached.
  • SageMaker Studio has not been created in the region.

If SageMaker Studio already exists in your preferred region, one workaround is to use a different region that supports SageMaker Studio. A quick region check is sketched below.
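
The following is a minimal sketch (not part of the original notebooks) that checks both requirements with boto3; the region name is a placeholder and the default quota of 5 VPCs per region is assumed:

    import boto3

    region = "us-east-1"  # placeholder target region

    ec2 = boto3.client("ec2", region_name=region)
    sm = boto3.client("sagemaker", region_name=region)

    # Requirement 1: room to create a new VPC (default quota is 5 per region).
    vpc_count = len(ec2.describe_vpcs()["Vpcs"])
    print(f"VPCs in {region}: {vpc_count} (default limit is 5)")

    # Requirement 2: no existing SageMaker Studio domain in the region.
    domains = sm.list_domains()["Domains"]
    if domains:
        print("Studio domain(s) already exist:", [d["DomainName"] for d in domains])
    else:
        print("No SageMaker Studio domain found in this region.")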

Deploy the CloudFormation script bankdm-cloudformation.yaml (a boto3 deployment sketch follows the list). The script will do the following:

  • Create a new VPC.
  • Create SageMaker Studio and attach it to the VPC (not the default option). The alternative would be to allow connections from the Internet to RedShift, which is not recommended.
  • Add IAM permissions to the SageMaker execution role.
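
If you prefer not to use the console, here is a hedged deployment sketch with boto3; the stack name bankdm is an assumption, and IAM capabilities are acknowledged because the template creates IAM resources:

    import boto3

    cfn = boto3.client("cloudformation")

    with open("bankdm-cloudformation.yaml") as f:
        template_body = f.read()

    cfn.create_stack(
        StackName="bankdm",                     # hypothetical stack name
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_NAMED_IAM"],  # the template creates IAM roles/policies
    )
    cfn.get_waiter("stack_create_complete").wait(StackName="bankdm")
    print("Stack created")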

The following steps are to be done manually:

  • Enable SageMaker JumpStart in SageMaker Studio.
  • Git clone this repo.
  • Create a SageMaker project for building, training and deployment.
  • Overwrite the files in the modelbuild repo with the files from this repo.

Detailed instructions are located in instructions.md.

High-level architecture diagram (after prerequisite steps)

The following diagram shows the high-level architecture after completing the prerequisite steps. Note that this is not the final architecture.

[Architecture diagram]

To avoid any unexpected issues, a new VPC is created with one public subnet and one private subnet. The private subnet contains RedShift and the VPC endpoints for SageMaker Studio and EFS storage. Later in the notebooks, SageMaker Studio will access RedShift. With this architecture, all traffic stays within the VPC and RedShift does not need to be exposed to the Internet.

CodeCommit, CodeBuild, CodePipeline, and SageMaker Pipelines are used for MLOps; this is described later.

High-level description of the demo

Note: Please complete the prerequisite steps above first.

Notebook 01

  • Create the necessary IAM roles and policies.
  • Create the RedShift cluster, a secret in Secrets Manager, and a Lambda function (a sketch of the cluster and secret creation follows this list).
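
The following is an illustrative sketch of the cluster and secret creation; the secret name, cluster identifier, node type, subnet group, and account ID are assumptions rather than the notebook's exact values:

    import json
    import boto3

    secrets = boto3.client("secretsmanager")
    redshift = boto3.client("redshift")

    password = "Choose-A-Strong-Password-1"  # placeholder only

    # Store the cluster credentials in Secrets Manager.
    secrets.create_secret(
        Name="bankdm-redshift-secret",  # hypothetical secret name
        SecretString=json.dumps({"username": "awsuser", "password": password}),
    )

    # Launch the RedShift cluster inside the demo VPC's private subnet group.
    redshift.create_cluster(
        ClusterIdentifier="bankdm",          # hypothetical cluster name
        NodeType="ra3.xlplus",
        NumberOfNodes=2,
        MasterUsername="awsuser",
        MasterUserPassword=password,
        DBName="dev",
        IamRoles=["arn:aws:iam::123456789012:role/BankDM-RedShift"],  # role used in this demo
        ClusterSubnetGroupName="bankdm-subnet-group",                 # assumed subnet group
        PubliclyAccessible=False,
    )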

Notebook 02 (optional)

  • Explore the data (a minimal exploration sketch follows).
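
A minimal pandas sketch for exploring the data; the file name is a placeholder rather than the notebook's actual path:

    import pandas as pd

    # Load a local copy of the demo CSV (placeholder file name).
    df = pd.read_csv("bank.csv")

    print(df.shape)                       # rows and columns
    print(df.dtypes)                      # column types
    print(df.isna().sum())                # missing values per column
    print(df.describe(include="all").T)   # summary statistics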

Notebook 03

  • Copy the CSV file to S3. Create a table in the Glue Data Catalog (Glue table) that references the CSV file.
  • Use Athena to query the Glue table.
  • Create a RedShift schema and an external table referencing the Glue table (a Redshift Data API sketch follows this list).
  • Create a RedShift table and insert the CSV data into RedShift using Athena.
  • Manually save and commit the notebook in order to trigger the MLOps workflow.
  • It takes ~12 minutes to run the pipeline and ~5 minutes to deploy the SageMaker endpoint.
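
A hedged sketch of the RedShift side of this notebook using the Redshift Data API; the cluster, database, secret, schema, and table names, and the ARNs, are all assumptions rather than the notebook's exact values:

    import boto3

    rsd = boto3.client("redshift-data")

    SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:bankdm-redshift-secret"  # placeholder

    def run_sql(sql):
        # Execute one statement against the demo cluster using the stored secret.
        resp = rsd.execute_statement(
            ClusterIdentifier="bankdm",
            Database="dev",
            SecretArn=SECRET_ARN,
            Sql=sql,
        )
        return resp["Id"]

    # External schema pointing at the Glue Data Catalog database (Redshift Spectrum).
    run_sql("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'bankdm'
        IAM_ROLE 'arn:aws:iam::123456789012:role/BankDM-RedShift'
    """)

    # Local RedShift table populated from the external (Glue) table.
    run_sql("CREATE TABLE bank AS SELECT * FROM spectrum.bank")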

MLOps Workflow

The following diagram shows the MLOps workflow after the code is manually committed:

[MLOps pipeline diagram]

The diagrams below describe the workflow in more detail:

[MLOps pipeline diagrams (detail)]

High-level architecture diagram (after the MLOps workflow has executed successfully)

[Architecture diagram]

Notebook 04

  • Once the SageMaker staging endpoint has been created, run predictions against the endpoint (an invocation sketch follows).
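
A hedged invocation sketch; the endpoint name and payload are assumptions: the actual endpoint name comes from the SageMaker project, and the payload must match the feature order the model was trained on:

    import boto3

    runtime = boto3.client("sagemaker-runtime")

    # Hypothetical CSV row of features; replace with a real record in the expected order.
    payload = "56,housemaid,married,basic.4y,no,no,no"

    response = runtime.invoke_endpoint(
        EndpointName="bankdm-staging",  # hypothetical staging endpoint name
        ContentType="text/csv",
        Body=payload,
    )
    print(response["Body"].read().decode("utf-8"))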

Notebook 05

  • You can also use RedShift ML to create a model directly in RedShift using SQL statements. This leverages SageMaker Autopilot to create another model (different from the staging SageMaker endpoint).
  • Predictions can also be made directly in RedShift using SQL statements against the RedShift ML model. For this demo, the SQL statements are provided in the notebook, but you can also run them in the RedShift query editor (a sketch follows this list).
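
A hedged RedShift ML sketch run through the Redshift Data API; the model, function, table, and column names, the S3 bucket, and the ARNs are placeholders, not the notebook's exact values:

    import boto3

    rsd = boto3.client("redshift-data")

    SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:bankdm-redshift-secret"  # placeholder

    # CREATE MODEL hands training off to SageMaker Autopilot and runs asynchronously.
    create_model_sql = """
        CREATE MODEL bank_model
        FROM (SELECT age, job, marital, education, y FROM bank)
        TARGET y
        FUNCTION predict_bank
        IAM_ROLE 'arn:aws:iam::123456789012:role/BankDM-RedShift'
        SETTINGS (S3_BUCKET 'bankdm-demo-bucket')
    """
    rsd.execute_statement(
        ClusterIdentifier="bankdm",
        Database="dev",
        SecretArn=SECRET_ARN,
        Sql=create_model_sql,
    )

    # Run this only after SHOW MODEL bank_model reports the model as READY.
    predict_sql = """
        SELECT y, predict_bank(age, job, marital, education) AS prediction
        FROM bank
        LIMIT 10
    """
    rsd.execute_statement(
        ClusterIdentifier="bankdm",
        Database="dev",
        SecretArn=SECRET_ARN,
        Sql=predict_sql,
    )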

Roles

There are four roles used in this demo:

  • SageMaker Execution role: For SageMaker Studio to create/access resources
  • RedShift role (BankDM-RedShift): For RedShift cluster to access resources and for unloading data from RedShift to S3
  • Lambda execution role (BankDM-Lambda): For Lambda function to access resources
  • AmazonSageMakerServiceCatalogProductsUseRole (Default role): For SageMaker Pipelines to create/access resources

Notes

  • The notebooks do not store any variables; in other words, no variables are passed between notebooks.
  • If the secret already exists and you are creating the RedShift cluster again in notebook 01, the secret will not be updated with the new password. Please update the password manually in Secrets Manager (a boto3 sketch follows these notes). This is to prevent accidental updates of the secret when you rerun the notebook while the RedShift cluster is still running.
  • The security group used for RedShift and SageMaker Studio is the default one. If you are using another security group, please change security_group_id in notebook 01.
  • If you change any names such as secret/role name, you may have to edit the SageMaker Pipelines code under 'pipelines/bankdm'.
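
A sketch for updating the secret with boto3 instead of the console; the secret name is an assumption carried over from the earlier sketches:

    import json
    import boto3

    secrets = boto3.client("secretsmanager")

    # Overwrite the stored credentials with the cluster's current password.
    secrets.put_secret_value(
        SecretId="bankdm-redshift-secret",  # hypothetical secret name from notebook 01
        SecretString=json.dumps({"username": "awsuser", "password": "New-Password-1"}),
    )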

Clean up

Notebook 06 does not delete the VPC, SageMaker Studio, SageMaker Pipelines, CodePipeline, S3, EFS, etc. You can delete the SageMaker project with the AWS CLI command aws sagemaker delete-project --project-name X. This will remove the MLOps components such as CodePipeline.
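
For reference, the boto3 equivalent of that CLI command (replace the project name with the one you chose when creating the SageMaker project):

    import boto3

    sm = boto3.client("sagemaker")
    sm.delete_project(ProjectName="X")  # replace X with your SageMaker project name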

Before deleting the CloudFormation stack, the following components need to be deleted manually:

  • In SageMaker Studio, shut down SageMaker Studio by going to File -> Shutdown -> Shut down all
  • EFS (the file system created by SageMaker Studio; a deletion sketch follows this list)
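
A hedged sketch for deleting the Studio-created EFS file system with boto3; the file system ID is a placeholder (look it up in the EFS console), and mount targets must be removed before the file system itself:

    import time
    import boto3

    efs = boto3.client("efs")
    fs_id = "fs-0123456789abcdef0"  # placeholder: the EFS file system created by SageMaker Studio

    # Mount targets have to be deleted before the file system can be removed.
    for mt in efs.describe_mount_targets(FileSystemId=fs_id)["MountTargets"]:
        efs.delete_mount_target(MountTargetId=mt["MountTargetId"])

    time.sleep(60)  # mount-target deletion is asynchronous; wait before deleting the file system
    efs.delete_file_system(FileSystemId=fs_id)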

If CloudFormation has issues deleting the VPC, you can delete it manually.

Possible enhancements

  • Error handling
  • Feature store
  • CloudFormation to create other resources

References

Some of the code was taken from the following sources and adapted:
