In this challenge, we are going to build a sentiment analysis API using SageMaker and LambdaSharp. To do this, we will be:
- Cleaning and transforming the Amazon Reviews dataset
- Training a machine learning model using Amazon's BlazingText classification algorithm
- Evaluating the accuracy of our model (Boss Level)
- Deploying the infrastructure required to host the model
- Creating an API to call the SageMaker endpoint
You or your team will need the following:
- Docker installed on your machine
- LambdaSharpTool V0.5.0
- AWS Account
Installation details
Install the LambdaSharp tool, configure it, and initialize a deployment tier:

```
dotnet tool install -g LambdaSharp.Tool
lash config
lash init --tier Sandbox
```
NOTE: I'm using the tier name Sandbox, but you can choose any name you want.
Clone the challenge repository:

```
git clone git@github.com:lambdasharp/MachineLearning-SageMaker-Challenge.git
cd MachineLearning-SageMaker-Challenge/
```
This repo contains 4 important parts:
amazon-reviews-sentiment-analysis.ipynb
This is a Jupyter notebook; use it to transform the original dataset into a format that can be used by the BlazingText algorithm.
TrainingJob
This is a LambdaSharp module that creates a SageMaker training job. The module creates a lambda function that is used as a custom resource in the MachineLearningInfrastructure stack. The training job produces a model, which will later be used to classify text sentiment.
MachineLearningInfrastructure
Deploy this stack after the Training Job has been published. This stack will
- Trigger a training job
- Create a model
- Create an endpoint configuration
- Create a SageMaker endpoint
The endpoint can be used to make inferences about any text.
SentimentAnalysis
This is the stack that will deploy the API using API gateway and lambda.
The TrainingJob module is a LambdaSharp module that provides a custom resource.
```
cd TrainingJob
lash publish
lash deploy --tier Sandbox
```
NOTE: This module creates an S3 bucket that will be used to publish the datasets (test.txt and train.txt) in the following section. Take note of the S3 bucket generated by the stack!
In every machine learning project, one of the most important steps is to properly prepare the data that will be used to generate an ML model. In this step, you will use Spark to take the original Amazon Reviews dataset and transform it into something that can be understood by the BlazingText algorithm.
To begin, start the jupyter/all-spark-notebook container using docker-compose and the provided docker-compose.yml file:

```
cd MachineLearning-SageMaker-Challenge
docker-compose up
```
Once the container starts up, it will print out the information needed to log in to Jupyter:

```
jupyter_1 | Copy/paste this URL into your browser when you connect for the first time,
jupyter_1 | to login with a token:
jupyter_1 | http://(a76443ebe45b or 127.0.0.1):8888/?token=cf19970f58a43f16e1084df7075631c85dba7e8d20e027fc
```
COPY THE TOKEN and then visit localhost:8765
From the menu select amazon-reviews-sentiment-analysis.ipynb and follow the instructions in the notebook.
REMEMBER:
- Transform the data for the train dataset
- Transform the data for the test dataset
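For supervised training, BlazingText expects one example per line: a `__label__<class>` prefix followed by space-separated tokens. As a rough sketch of that transformation (the `to_blazingtext` helper and its simple tokenization rules are illustrative, not the notebook's exact code):

```python
import re

def to_blazingtext(sentiment: str, review: str) -> str:
    """Format one review as a BlazingText supervised-mode line:
    '__label__<sentiment> token token ...'"""
    # Basic tokenization: lowercase the text and pad punctuation
    # with spaces so each symbol becomes its own token.
    text = review.lower()
    text = re.sub(r"([.,!?;:()])", r" \1 ", text)
    tokens = text.split()
    return f"__label__{sentiment} " + " ".join(tokens)

print(to_blazingtext("positive", "I had a great day!"))
# __label__positive i had a great day !
```

Each transformed line is then written to train.txt or test.txt before uploading to the S3 bucket.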
About the train and test datasets
The Amazon Reviews dataset was obtained from the course.fast.ai/datasets website.
It includes 34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, collected by the Stanford Network Analysis Project (SNAP). The full dataset contains 600,000 training samples and 130,000 testing samples in each class.
Why do we have train and test datasets?
- The train data set is used to create a model that will be able to make inferences about the sentiment of new data.
- The test data set is used to verify the accuracy of the model.
In the next step, the training job will give us an accuracy score.
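The accuracy score is simply the fraction of test examples whose predicted label matches the true label. A minimal sketch (the label values below are hypothetical):

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("prediction and label lists must be the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical predictions for four test reviews:
predicted = ["__label__positive", "__label__negative",
             "__label__positive", "__label__positive"]
actual    = ["__label__positive", "__label__negative",
             "__label__negative", "__label__positive"]
print(accuracy(predicted, actual))  # 0.75
```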
BOSS LEVEL
Data preparation is one of the most important steps; the quality of the data determines the accuracy of the ML model. The boss level is all about cleaning the data and carefully choosing the classifications.
After the training job has finished, you will see how accurate the resulting model is. After completing the challenge, try to get the validation accuracy over 0.9!
The MachineLearningInfrastructure directory has a LambdaSharp module that defines the following resources:
- TrainingJob (custom resource defined in Level 00)
```
cd ..
cd MachineLearningInfrastructure
lash deploy --tier Sandbox
```
This step will ask you for two parameters, TrainKey and TestKey. These are the keys of the train and test files in the S3 bucket, e.g. train.txt and test.txt.
Force a model update
To force a model update using new data, make sure that the new files have different names. Upload them to S3 and then update the stack parameter values.
Open SageMaker and go to Training jobs, select the blazingtext job, and keep an eye on the MONITOR section for the validation:accuracy and train:accuracy values.
NOTE: The first team that gets the validation:accuracy over 0.9 wins a prize!
```
cd ..
cd SentimentAnalysis
lash deploy --tier Sandbox
```
The MachineLearningInfrastructure module makes a few values public, including the model EndpointName. These public values are imported by the SentimentAnalysis module and passed to the lambda function.
Go to the SentimentAnalysis directory and open the project. Use the SageMakerRuntime SDK's InvokeEndpointAsync method. Update the code to invoke the SageMaker endpoint and return the results.
An example of the payload expected by the SageMaker endpoint:

```json
{
    "instances": [
        "hello world!",
        "This show sucks!",
        "I had a great day",
        "Why are you pushing me?",
        ":)",
        "(:",
        "=)",
        ":-)",
        ":(",
        "):",
        "=(",
        ":-("
    ]
}
```
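The challenge code invokes the endpoint from C# via the SageMakerRuntime SDK's InvokeEndpointAsync; the equivalent call in Python with boto3 looks roughly like this (the endpoint name below is a placeholder, and the `classify` call requires AWS credentials and a live endpoint):

```python
import json

def build_request(texts):
    """Serialize the input texts into the JSON payload format shown above."""
    return json.dumps({"instances": texts})

def classify(endpoint_name, texts):
    """Invoke the BlazingText endpoint and return the parsed predictions."""
    import boto3  # AWS SDK for Python; needs credentials configured
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_request(texts),
    )
    return json.loads(response["Body"].read())

# Example (placeholder endpoint name):
# classify("sentiment-analysis-endpoint", ["I had a great day"])
```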
The result from the endpoint looks like this:

```json
[
    {
        "prob": [
            0.9781392216682434
        ],
        "label": [
            "__label__positive"
        ]
    },
    {
        "prob": [
            0.8243443443446654
        ],
        "label": [
            "__label__positive"
        ]
    },
    ...
]
```
Each object in the array corresponds, in order, to an element of the instances array in the request.
That response is vague; create a response that clearly identifies each text and its sentiment.
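One way to build a friendlier response is to zip each input text with its corresponding prediction. A sketch (the output field names are a suggestion, not part of the challenge spec, and the prediction values are made up for illustration):

```python
def summarize(texts, predictions):
    """Pair each input text with its predicted sentiment and confidence."""
    return [
        {
            "text": text,
            "sentiment": pred["label"][0].replace("__label__", ""),
            "probability": pred["prob"][0],
        }
        for text, pred in zip(texts, predictions)
    ]

texts = ["hello world!", "This show sucks!"]
predictions = [  # shape matches the endpoint response format
    {"prob": [0.978], "label": ["__label__positive"]},
    {"prob": [0.912], "label": ["__label__negative"]},
]
print(summarize(texts, predictions))
```

The same pairing logic carries over to the C# lambda before serializing the API response.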