The bankdemo-build_vypf from trellixvulnteam

Purpose

The purpose of this repo is to provide sample codes to create a demo to illustrate MLOps (preprocessing, training, deploying etc) with the data source coming from RedShift. Other RedShift features like RedShift Spectrum, RedShift ML are also demostrated.

Disclaimer: The focus of this demo is to showcase the above and for simplicity, full IAM permissions are assigned. In production setting, best practices like least IAM permission should be used instead.

Services used

Below list the AWS services used in this demo:

SageMaker
Code* (Part of SageMaker Pipelines)
RedShift
S3
Glue
Athena
Secrets Manager

Prerequisite prior to running the notebooks

Before deploying the CloudFormation script, ensure the region that are you deploying to meets the following requirements:

Able to create new VPC. i.e. the VPC limit is not reached
SageMaker Studio has not been created in the region.

One workaround is to use a different region that supports SageMaker Studio.

Deploy the CloudFormation script bankdm-cloudformation.yaml. The script will do the following:

Create a new VPC.
SageMaker Studio to be created and attached to the VPC (not the default option). Another way would be to allow connection from the Internet to RedShift which is not recommended.
Add IAM roles to SageMaker Execution role.

The following steps are to be done manually:

Enable SageMaker jumpstart in SageMaker Studio
Git clone this repo.
Create a SageMaker project for building, training and deployment.
Overwrite the files from this repo to the modelbuild repo.

Detailed instructions are located in instructions.md

High-level architecture diagram (after prerequistie steps)

The following diagram shows the high-level architecture diagram after completing the prerequisite steps. Do note that this is not the final architecture.

To avoid any unexpected issues, a new VPC is created with one public subnet and one private subnet. The private subnet contains RedShift and VPC endpoints for SageMaker Studio and EFS storage. Later in the notebooks, the SageMaker Studio will access RedShift. With this architecture, all traffic will be within the VPC and RedShift does not need to be exposed to the Internet.

CodeCommit, CodeBuild, CodePipeline, and SageMaker Pipelines are used for MLOps and it is described later.

High-level description of the demo

Note: Please complete the prerequisite steps above first.

Notebook 01

Create the necessary IAM roles and policies.
Create RedShift cluster, secret in Secret Manager and Lambda function.

Notebook 02 (optional)

Explore the data.

Notebook 03

Copy the CSV file to S3. Create table in Glue Data Catalog (Glue table) and reference the CSV file.
Use Athena to query the Glue table.
Create RedShift schema and external table referencing the Glue table.
Create RedShift table. Insert CSV data to RedShift using Athena.
Manually save and commit the notebook in order to trigger the MLOps workflow.
It takes ~12 minutes to run the pipeline and ~5 minutes to deploy the SageMaker endpoint.

MLOps Workflow

The following diagram shows the MLOps workflow after manually committing the code:

The diagrams below describe the workflow in more detail:

High-level architecture diagram (after MLOps workflow executed successfully)

Notebook 04

Once the SageMaker staging endpoint has been created, run predictions to the endpoint.

Notebook 05

You can also use RedShift ML to create a model directly in RedShift using SQL statements. This leverages on SageMaker AutoPilot to create another model (different from the staging SageMaker endpoint).
Predictions can also be done directly in RedShift using SQL statements to the RedShift ML model. For this demo, SQL statements are provided in the notebook but you can also run the same in the RedShift query editer.

Roles

There are four roles used in this demo:

SageMaker Execution role: For SageMaker Studio to create/access resources
RedShift role (BankDM-RedShift): For RedShift cluster to access resources and for unloading data from RedShift to S3
Lambda execution role (BankDM-Lambda): For Lambda function to access resources
AmazonSageMakerServiceCatalogProductsUseRole (Default role): For SageMaker Pipelines to create/access resources

Notes

The notebooks do not store any variables. In other words, there is no transferring of variables between notebooks.
If the secret already exists and you are creating the RedShift cluster again in notebook 01, the secret will not be updated to the new password. Please update the password manually in Secrets Manager. This is to prevent accidential update of the secret when you rerun the notebook while the RedShift cluster is still running.
The Security Group used for the RedShift and SageMaker Studio is the default one. If you are using another security group, please change the security_group_id in notebook 01.
If you change any names such as secret/role name, you may have to edit the SageMaker Pipelines code under 'pipelines/bankdm'.

Clean up

Notebook06 does not delete VPC, SageMaker Studio, SageMaker Pipelines, CodePipelines, S3, EFS etc. You can delete the SageMaker project with the AWS CLI command aws sagemaker delete-project --project-name X. This will remove the MLOps components like CodePipeline.

Before deleting the CloudFormation, the following components needs to be deleted manually:

In SageMaker Studio, shutdown SageMaker Studio by going to File -> Shutdown -> Shut down all
EFS

If the CloudFormation has issues deleting the VPC, you can do so manually.

Possible enhancement

Error handling
Feature store
Cloudformation to create other resources

References

Some codes were taken from the following sources and edited from there:

trellixvulnteam / bankdemo-build_vypf Goto Github PK

bankdemo-build_vypf's Introduction