In this challenge, we are going to build a sentiment analysis API using SageMaker and LambdaSharp. To do this, we will be:
- Cleaning and transforming the Amazon Reviews dataset
- Training a machine learning model using Amazon's BlazingText classification algorithm
- Evaluating the accuracy of our model (Boss Level)
- Deploying the infrastructure required to host the model
- Creating an API to call the SageMaker endpoint
You or your team will need the following:
- Docker installed on your machine
- LambdaSharpTool V0.5.0
- AWS Account
Installation details
Install the LambdaSharp tool, configure it, and initialize a deployment tier:

```
dotnet tool install -g LambdaSharp.Tool
lash config
lash init --tier Sandbox
```
NOTE: I'm using the tier name Sandbox, but you can choose any name you want.
Clone the challenge repository:

```
git clone git@github.com:lambdasharp/MachineLearning-SageMaker-Challenge.git
cd MachineLearning-SageMaker-Challenge/
```
This repo contains 4 important parts:
amazon-reviews-sentiment-analysis.ipynb
This is a Jupyter notebook; use it to transform the original dataset into a format that can be used by the BlazingText algorithm.
TrainingJob
This is a LambdaSharp module that creates a SageMaker training job. The module creates a lambda function that is used as a custom resource in the MachineLearningInfrastructure stack. The training job produces a model, which will later be used to classify text sentiment.
MachineLearningInfrastructure
Deploy this stack after the Training Job has been published. This stack will
- Trigger a training job
- Create a model
- Create an endpoint configuration
- Create a SageMaker endpoint
The endpoint can be used to make inferences about any text.
SentimentAnalysis
This is the stack that will deploy the API using API gateway and lambda.
The TrainingJob module is a LambdaSharp module that provides a custom resource.
```
cd TrainingJob
lash publish
lash deploy --tier Sandbox
```
NOTE: This module creates an S3 bucket that will be used to publish the datasets (test.txt and train.txt) in the following section. Take note of the S3 bucket generated by the stack!
In every machine learning project, one of the most important steps is to properly prepare the data that will be used to generate an ML model. In this step, you will use Spark to take the original Amazon Reviews dataset and transform it into something that can be understood by the BlazingText algorithm.
To begin, start the jupyter/all-spark-notebook container using docker-compose and the provided docker-compose.yml file:

```
cd MachineLearning-SageMaker-Challenge
docker-compose up
```
Once the container starts up, it will print out the information needed to log in to Jupyter:

```
jupyter_1 | Copy/paste this URL into your browser when you connect for the first time,
jupyter_1 | to login with a token:
jupyter_1 | http://(a76443ebe45b or 127.0.0.1):8888/?token=cf19970f58a43f16e1084df7075631c85dba7e8d20e027fc
```
COPY THE TOKEN and then visit localhost:8765
From the menu select amazon-reviews-sentiment-analysis.ipynb and follow the instructions in the notebook.
REMEMBER:
- Transform the data for the train dataset
- Transform the data for the test dataset
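For supervised training, BlazingText expects one example per line: a `__label__<class>` prefix followed by space-separated tokens. As a rough sketch of that transformation (the `to_blazingtext` helper and its simple tokenization rules are illustrative, not the notebook's exact code):

```python
import re

def to_blazingtext(sentiment: str, review: str) -> str:
    """Format one review as a BlazingText supervised-mode line:
    '__label__<sentiment> token token ...'"""
    # Basic tokenization: lowercase the text and pad punctuation
    # with spaces so each symbol becomes its own token.
    text = review.lower()
    text = re.sub(r"([.,!?;:()])", r" \1 ", text)
    tokens = text.split()
    return f"__label__{sentiment} " + " ".join(tokens)

print(to_blazingtext("positive", "I had a great day!"))
# __label__positive i had a great day !
```

Each transformed line is then written to train.txt or test.txt before uploading to the S3 bucket.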
About the train and test datasets
The Amazon Reviews dataset was obtained from the course.fast.ai/datasets website.
It includes 34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, collected by the Stanford Network Analysis Project (SNAP). The full dataset contains 600,000 training samples and 130,000 testing samples in each class.
Why do we have train and test datasets?
- The train data set is used to create a model that will be able to make inferences about the sentiment of new data.
- The test data set is used to verify the accuracy of the model.
In the next step, the training job will give us an accuracy score.
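The accuracy score is simply the fraction of test examples whose predicted label matches the true label. A minimal sketch (the label values below are hypothetical):

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("prediction and label lists must be the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical predictions for four test reviews:
predicted = ["__label__positive", "__label__negative",
             "__label__positive", "__label__positive"]
actual    = ["__label__positive", "__label__negative",
             "__label__negative", "__label__positive"]
print(accuracy(predicted, actual))  # 0.75
```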
BOSS LEVEL
Data preparation is one of the most important steps; the quality of the data determines the accuracy of the ML model. The boss level is all about cleaning the data and carefully choosing the classifications.
After the training job has finished, you will see how accurate the resulting model is. After completing the challenge, try to get the validation accuracy over 0.9!
The MachineLearningInfrastructure directory has a LambdaSharp module that defines the following resources:
- TrainingJob (custom resource defined in Level 00)
```
cd ..
cd MachineLearningInfrastructure
lash deploy --tier Sandbox
```
This step will ask you for two parameters, TrainKey and TestKey. These are the keys of the train and test files in the S3 bucket, e.g. train.txt and test.txt.
Force a model update
To force a model update using new data, make sure that the new files have different names. Upload them to S3 and then update the stack parameter values.
Open SageMaker and go to Training jobs, select the blazingtext job, and keep an eye on the MONITOR section for the validation:accuracy and train:accuracy values.
NOTE: The first team that gets the validation:accuracy over 0.9 wins a prize!
```
cd ..
cd SentimentAnalysis
lash deploy --tier Sandbox
```
The MachineLearningInfrastructure module makes a few values public, including the model EndpointName. These public values are imported by the SentimentAnalysis module and passed to the lambda function.
Go to the SentimentAnalysis directory and open the project. Use the SageMakerRuntime SDK's InvokeEndpointAsync method. Update the code to invoke the SageMaker endpoint and return the results.
An example of the payload expected by the SageMaker endpoint:

```json
{
    "instances": [
        "hello world!",
        "This show sucks!",
        "I had a great day",
        "Why are you pushing me?",
        ":)",
        "(:",
        "=)",
        ":-)",
        ":(",
        "):",
        "=(",
        ":-("
    ]
}
```
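The challenge code invokes the endpoint from C# via the SageMakerRuntime SDK's InvokeEndpointAsync; the equivalent call in Python with boto3 looks roughly like this (the endpoint name below is a placeholder, and the `classify` call requires AWS credentials and a live endpoint):

```python
import json

def build_request(texts):
    """Serialize the input texts into the JSON payload format shown above."""
    return json.dumps({"instances": texts})

def classify(endpoint_name, texts):
    """Invoke the BlazingText endpoint and return the parsed predictions."""
    import boto3  # AWS SDK for Python; needs credentials configured
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_request(texts),
    )
    return json.loads(response["Body"].read())

# Example (placeholder endpoint name):
# classify("sentiment-analysis-endpoint", ["I had a great day"])
```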
The result from the endpoint looks like this:

```json
[
    {
        "prob": [
            0.9781392216682434
        ],
        "label": [
            "__label__positive"
        ]
    },
    {
        "prob": [
            0.8243443443446654
        ],
        "label": [
            "__label__positive"
        ]
    },
    ...
]
```
Each object in the array corresponds, in order, to an element of the instances array in the request.
That response is vague; create a response that clearly identifies each text and its sentiment.
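One way to build a friendlier response is to zip each input text with its corresponding prediction. A sketch (the output field names are a suggestion, not part of the challenge spec, and the prediction values are made up for illustration):

```python
def summarize(texts, predictions):
    """Pair each input text with its predicted sentiment and confidence."""
    return [
        {
            "text": text,
            "sentiment": pred["label"][0].replace("__label__", ""),
            "probability": pred["prob"][0],
        }
        for text, pred in zip(texts, predictions)
    ]

texts = ["hello world!", "This show sucks!"]
predictions = [  # shape matches the endpoint response format
    {"prob": [0.978], "label": ["__label__positive"]},
    {"prob": [0.912], "label": ["__label__negative"]},
]
print(summarize(texts, predictions))
```

The same pairing logic carries over to the C# lambda before serializing the API response.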