
MachineLearning-SageMaker-Challenge

Objectives

In this challenge, we are going to build a sentiment analysis API using SageMaker and LambdaSharp. To do this we will be:

  1. Cleaning and transforming the Amazon reviews dataset
  2. Training a machine learning model using Amazon's BlazingText classification algorithm
  3. Evaluating the accuracy of our model (Boss Level)
  4. Deploying the infrastructure required to host the model
  5. Creating an API to call the SageMaker endpoint

Requirements

You or your team will need the following:

Installation details

Install LambdaSharp

dotnet tool install -g LambdaSharp.Tool
lash config
lash init --tier Sandbox

NOTE:

I'm using the tier name Sandbox, but you can choose any name you want.

About this repo

Clone

git clone git@github.com:lambdasharp/MachineLearning-SageMaker-Challenge.git
cd MachineLearning-SageMaker-Challenge/

Details

This repo contains 4 important parts:

amazon-reviews-sentiment-analysis.ipynb

This is a Jupyter notebook; use it to transform the original dataset into a format that can be used by the BlazingText algorithm.

TrainingJob

This is a LambdaSharp module that creates a SageMaker training job. The module creates a Lambda function that is used as a custom resource in the MachineLearningInfrastructure stack. The training job produces a model, which is later used to classify text sentiment.

MachineLearningInfrastructure

Deploy this stack after the TrainingJob module has been published. This stack will:

  • Trigger a training job
  • Create a model
  • Create an endpoint configuration
  • Create a SageMaker endpoint

The endpoint can be used to make inferences about any text.

SentimentAnalysis

This is the stack that deploys the API using API Gateway and Lambda.

Level 00 - Publish "TrainingJob" Module

The TrainingJob module is a LambdaSharp module that provides a custom resource.

cd TrainingJob
lash publish
lash deploy --tier Sandbox 

NOTE:

This module creates an S3 bucket that will be used to publish the datasets (test.txt and train.txt) in the following section. Take note of the S3 bucket generated by the stack!

Level 01 - Data Preparation/Transformation - BOSS LEVEL

In every machine learning project, one of the most important steps is to properly prepare the data that will be used to generate an ML model. In this step, you will use Spark to take the original Amazon Reviews dataset and transform it into something that can be understood by the BlazingText algorithm.

To begin, start the Jupyter/all-spark-notebook using docker-compose and the provided docker-compose.yml file:

cd MachineLearning-SageMaker-Challenge
docker-compose up

Once the container starts up, it will print out the information needed to log in to Jupyter:

jupyter_1  |     Copy/paste this URL into your browser when you connect for the first time,
jupyter_1  |     to login with a token:
jupyter_1  |         http://(a76443ebe45b or 127.0.0.1):8888/?token=cf19970f58a43f16e1084df7075631c85dba7e8d20e027fc

COPY THE TOKEN and then visit localhost:8765

From the menu select amazon-reviews-sentiment-analysis.ipynb and follow the instructions in the notebook.

REMEMBER:

  • Transform the data for the train dataset
  • Transform the data for the test dataset

Upload the dataset files to the S3 bucket from Level 00 as soon as possible; these files are BIG!
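The transformation itself happens in the notebook with Spark, but the target format is simple: BlazingText's supervised mode expects one example per line, a `__label__` prefix followed by space-separated (ideally lowercased) tokens. A minimal Python sketch of that mapping, independent of Spark (the tokenization here is an assumption; the notebook may tokenize differently):

```python
import re

def to_blazingtext(sentiment: str, review: str) -> str:
    """Format one review as a BlazingText supervised-mode line:
    '__label__<sentiment> token token token ...'"""
    # Lowercase, then keep simple word tokens and punctuation so the
    # text is space-separated the way BlazingText expects.
    tokens = re.findall(r"[a-z0-9']+|[.,!?;]", review.lower())
    return f"__label__{sentiment} " + " ".join(tokens)

line = to_blazingtext("positive", "I had a great day!")
# -> "__label__positive i had a great day !"
```

Every line of train.txt and test.txt should follow this shape before being uploaded to S3.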

BlazingText documentation

About the train and test data sets

The Amazon Reviews datasets were obtained from the course.fast.ai/datasets website.

They include 34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, collected by the Stanford Network Analysis Project (SNAP). The full dataset contains 600,000 training samples and 130,000 testing samples in each class.

Why do we have train and test datasets?
  • The train dataset is used to create a model that can make inferences about the sentiment of new data.
  • The test dataset is used to verify the accuracy of the model.

In the next step, the training job will give us an accuracy score.
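Concretely, validation accuracy is just the fraction of held-out test examples whose predicted label matches the true label. A quick sketch of the metric (not SageMaker's own implementation):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Two of the three predictions match, so the score is 2/3.
score = accuracy(
    ["__label__positive", "__label__negative", "__label__positive"],
    ["__label__positive", "__label__negative", "__label__negative"],
)
```

A validation accuracy of 0.9 therefore means 9 out of 10 test reviews were classified correctly.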

BOSS LEVEL

Data preparation is one of the most important steps; the quality of the data determines the accuracy of the ML model. The boss level is all about cleaning the data and carefully choosing the classifications.

After the training job has finished, you will see how accurate the resulting model is. After completing the challenge, try to get the validation accuracy over 0.9!

Level 02 - Deploy the Machine Learning Infrastructure

The MachineLearningInfrastructure directory has a LambdaSharp module that defines the following resources:

  • TrainingJob (custom resource defined in Level 00)

cd ..
cd MachineLearningInfrastructure
lash deploy --tier Sandbox

This step will ask you for two parameters, TrainKey and TestKey. These are the keys of the train and test files in the S3 bucket, e.g. train.txt and test.txt.

Force a model update

To force a model update using new data, make sure that the new files have different names. Upload them to S3 and then update the stack parameter values.

Open SageMaker and go to Training jobs, select the blazingtext job, and keep an eye on the MONITOR section for the validation:accuracy and train:accuracy values.

NOTE:

The first team to get the validation:accuracy over 0.9 wins a prize!

Level 03 - Deploy API

cd ..
cd SentimentAnalysis
lash deploy --tier Sandbox

The MachineLearningInfrastructure module makes a few values public, including the model's EndpointName. These public values are imported by the SentimentAnalysis module and passed to the Lambda function.

Go to the SentimentAnalysis directory and open the project. Update the code to invoke the SageMaker endpoint using the SageMakerRuntime SDK's InvokeEndpointAsync method and return the results.

The payload expected by the SageMaker endpoint looks like this:

{
    "instances": [
        "hello world!",
        "This show sucks!",
        "I had a great day",
        "Why are you pushing me?",
        ":)",
        "(:",
        "=)",
        ":-)",
        ":(",
        "):",
        "=(",
        ":-("
    ]
}

The result from the endpoint looks like this:

[
    {
        "prob": [
            0.9781392216682434
        ],
        "label": [
            "__label__positive"
        ]
    },  
    {
        "prob": [
            0.8243443443446654
        ],
        "label": [
            "__label__positive"
        ]
    } ...
]

Each object in the array corresponds, in order, to an element of instances.
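The Lambda function in this repo is C#, but the call shape is the same in any AWS SDK. A hedged sketch in Python using boto3 (the endpoint name argument is a placeholder; use the EndpointName exported by MachineLearningInfrastructure):

```python
import json

def build_body(texts):
    """Build the JSON payload in the shape the BlazingText endpoint expects."""
    return json.dumps({"instances": texts})

def classify(endpoint_name, texts):
    """Invoke the SageMaker endpoint and return the parsed predictions."""
    import boto3  # only needed when actually calling AWS
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,  # value exported by MachineLearningInfrastructure
        ContentType="application/json",
        Body=build_body(texts),
    )
    return json.loads(response["Body"].read())

# Example (requires AWS credentials and a deployed endpoint):
# classify("your-blazingtext-endpoint", ["I had a great day", ":("])
```

In the C# Lambda, the equivalent is serializing the same `{"instances": [...]}` payload and passing it to InvokeEndpointAsync.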

Level 04 - Improve the API response

The default response is vague. Create a response that clearly identifies each input text and its sentiment.
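One way to do this is to zip each input text with its prediction. A sketch, assuming the endpoint response shape shown above (the field names in the output are illustrative, not required):

```python
def friendly_response(texts, predictions):
    """Pair each input text with its predicted sentiment and confidence.

    `predictions` is the endpoint response: a list of objects with
    'label' and 'prob' arrays, in the same order as the inputs.
    """
    return [
        {
            "text": text,
            "sentiment": pred["label"][0].replace("__label__", ""),
            "confidence": round(pred["prob"][0], 4),
        }
        for text, pred in zip(texts, predictions)
    ]

result = friendly_response(
    ["I had a great day"],
    [{"prob": [0.9781392216682434], "label": ["__label__positive"]}],
)
# -> [{"text": "I had a great day", "sentiment": "positive", "confidence": 0.9781}]
```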
