PyTorch is a deep learning framework that is growing in popularity, especially for rapid prototyping of new models. You can get started easily with PyTorch using Amazon SageMaker, a fully managed service to build, train, and deploy machine learning models at scale. In this workshop, we build a few natural language processing (NLP) models to analyze and generate text.
NLP solutions based on PyTorch have several principal components: text preprocessing; text representation using techniques such as GloVe (Global Vectors for Word Representation); and state-of-the-art, general-purpose, pre-trained models such as Bidirectional Encoder Representations from Transformers (BERT).
In addition to these principal components, we need to plan for a full end-to-end model training and deployment lifecycle.
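To make the text-preprocessing component concrete, here is a minimal, library-free sketch of the usual pipeline: tokenize raw text, build a vocabulary, and encode sentences as the integer ids that an embedding table (GloVe or otherwise) consumes. The helper names (`tokenize`, `build_vocab`, `encode`) are illustrative, not part of any lab's code.

```python
import re

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def build_vocab(corpus, unk="<unk>"):
    # Map each distinct token to an integer id; id 0 is reserved for unknowns.
    vocab = {unk: 0}
    for sentence in corpus:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    # Turn raw text into the integer ids a model or embedding lookup consumes.
    return [vocab.get(t, 0) for t in tokenize(text)]

corpus = ["The movie was great", "The plot was thin"]
vocab = build_vocab(corpus)
print(encode("The movie was thin", vocab))  # -> [1, 2, 3, 6]
```

Real preprocessing for the labs adds steps such as subword tokenization (for BERT) and padding/truncation, but the shape of the pipeline is the same.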
This set of labs aims to enable you to develop end-to-end NLP solutions with PyTorch, using Amazon SageMaker for the full machine learning lifecycle. Each lab is independent, but during the presentation we will go through them sequentially. We wish to thank the authors and code committers of the projects listed in the reference section; these labs would not be possible without their excellent material.
Pull requests for improvements and additions are most welcome.
We are using Event Engine to set up the infrastructure needed for our labs today, so please follow the instructions to get set up with Event Engine first and then come back to this repo.
$ cd src
$ ./boot.sh
$ source venv/bin/activate
If asked to select a kernel, please select conda_pytorch_36 every time.
In lab 1 we exercise the various components of SageMaker that have native PyTorch support. We start with a notebook that uses a PyTorch kernel. Then we use sagemaker.pytorch.estimator.PyTorch to train a multi-layer LSTM RNN model. This model predicts the probability of the next word in a sequence based on the words already observed. Later in the notebook, we use this trained model to generate legible text from a random seed. At each stage we use SageMaker to train, create a model, and finally deploy the model as a real-time endpoint.
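The core idea the lab's LSTM learns, estimating the probability of the next word given the words seen so far, and then sampling from that distribution to generate text, can be illustrated with a tiny pure-Python bigram model. This is a conceptual stand-in, not the lab's actual model or its SageMaker training code.

```python
import random
from collections import Counter, defaultdict

def train_bigrams(tokens):
    # Count how often each word follows each other word.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, word):
    # Normalize follower counts into a probability distribution.
    followers = counts[word]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

def generate(counts, seed, length=10, rng=None):
    # Sample a continuation word by word, starting from the seed.
    rng = rng or random.Random(0)
    out = [seed]
    for _ in range(length):
        followers = counts[out[-1]]
        if not followers:
            break
        words = list(followers)
        weights = [followers[w] for w in words]
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out)

tokens = "the cat sat on the mat and the cat slept".split()
counts = train_bigrams(tokens)
print(next_word_probs(counts, "the"))  # "cat" roughly twice as likely as "mat"
```

An LSTM replaces the one-word history with a learned hidden state over the whole sequence, which is why it can generate far more coherent text than a bigram model, but the predict-then-sample loop is the same.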
In lab 2 we use SageMaker to fine-tune Hugging Face's pre-trained BERT with PyTorch. We use the FastBert framework to fine-tune BERT for multi-label classification of comments. This lab is based on a Kaggle competition aimed at automatically labelling comments with one or more of six possible types: toxic, severe_toxic, obscene, threat, insult, identity_hate.
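Because a comment can carry several of these labels at once (or none), the classification head uses an independent sigmoid per label rather than a softmax over classes. A minimal sketch of decoding such multi-label outputs; the logits and the 0.5 threshold are illustrative assumptions, not values from the lab:

```python
import math

# Label set from the Kaggle Toxic Comment Classification Challenge.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_multilabel(logits, threshold=0.5):
    # Each label gets its own independent probability; unlike softmax
    # multi-class, several labels (or none) can fire for one comment.
    probs = [sigmoid(z) for z in logits]
    return [label for label, p in zip(LABELS, probs) if p >= threshold]

# Hypothetical logits for one comment, as a classification head might emit.
print(decode_multilabel([2.1, -3.0, 1.2, -4.0, 0.8, -2.5]))
# -> ['toxic', 'obscene', 'insult']
```

Training such a head uses a binary cross-entropy loss per label, which is what distinguishes this setup from the single-label multi-class case.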
We start off by building a container, then use SageMaker to train this container on a corpus of comment data. Finally, we deploy a real-time endpoint and run it on a set of comments.
TBD
In lab 4 we work with the newly released Captum.ai. We load a CNN model that has been pre-trained on the IMDB dataset for sentiment analysis, use it to predict the sentiment of each statement, and then apply the integrated gradients method to interpret those predictions.
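Integrated gradients attributes a model's output to its inputs by averaging the gradient along a straight path from a baseline input to the actual input, then scaling by the input-baseline difference. Captum computes this efficiently on PyTorch tensors; below is a library-free numeric sketch of the same formula, using finite-difference gradients and a toy linear "model" (all names here are illustrative, not Captum's API):

```python
def grad(f, x, eps=1e-6):
    # Central finite-difference gradient of f at point x.
    g = []
    for i in range(len(x)):
        hi, lo = x[:], x[:]
        hi[i] += eps
        lo[i] -= eps
        g.append((f(hi) - f(lo)) / (2 * eps))
    return g

def integrated_gradients(f, x, baseline, steps=100):
    # Average the gradient along the straight path from baseline to x,
    # then scale each coordinate by (x_i - baseline_i).
    attrib = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(f, point)
        for i in range(len(x)):
            attrib[i] += g[i] / steps
    return [(xi - b) * a for xi, b, a in zip(x, baseline, attrib)]

# Toy linear scorer: attributions should recover w_i * x_i exactly.
w = [0.5, -1.0, 2.0]
f = lambda x: sum(wi * xi for wi, xi in zip(w, x))
print(integrated_gradients(f, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```

A useful sanity check is the completeness property: the attributions sum to f(x) - f(baseline). In the lab, the "inputs" are word embeddings and the baseline is typically a padding/zero embedding, so each word in a review gets a score for how much it pushed the sentiment prediction.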
Note: these steps were tested in the Safari browser on macOS.
Open a browser and navigate to https://dashboard.eventengine.run/login
Enter the 12-character "hash" provided to you by the workshop organizer.
Click on "Accept Terms & Login"
Click on "AWS Console"
Please log off from any other AWS accounts you are currently logged into.
Click on "Open AWS Console"
You should see a screen like this.
We now need to select the correct IAM role for the workshop. Type "IAM" into the search bar and click on IAM.
Click on "Roles"
Scroll down past "Create Role" and click on "TeamRole".
Copy "Role ARN" by selecting the copy icon on the right.
You may want to temporarily paste this role ARN into a notepad.
Once you have copied the TeamRole ARN, click on "Services" in the upper left corner.
Enter "SageMaker" in the search bar and click on it.
You should see a screen like this.
Click on the orange button "Create Notebook Instance".
On the next webpage,
- Give your notebook a name (no underscores, please).
- Under Notebook instance type, select "ml.p2.xlarge".
- Under "Permission and encryption" select "Enter a custom IAM role ARN".
- Paste your TeamRole ARN into the field labeled "Custom IAM role ARN". Note: your TeamRole ARN will have a different AWS account number than what you see here.
- Scroll down to the bottom of the page and click on "Create Notebook instance".
You should see your notebook being created. In a couple of minutes, its status will change from "Pending" to "In Service", at which point, please click on "Open Jupyter".
In the Jupyter console, please open a 'Terminal'.
A new tab will open displaying a command prompt terminal.
In the terminal tab, please issue these two commands:
$ cd SageMaker
$ git clone https://github.com/C24IO/Reinvent2019Pytorch.git
You should see output similar to this:
You may now navigate to the lab notebooks and get started. Best of luck :-) !
We have added, and will continue to add, a few extras to this set of labs. bonus is a directory that contains some helper scripts. We have also included an alternative way to solve the toxic comment classification challenge using the fastai deep learning library. Also, the folder impatient contains the solved labs with outputs.
- SageMaker Examples
- Introducing FastBert - A simple Deep Learning library for BERT Models
- Kaushal Trivedi's FastBert
- 🤗 Hugging Face - Solving NLP, one commit at a time!
- GloVe: Global Vectors for Word Representation
- Kaggle Competition - Toxic Comment Classification Challenge
- Jeffrey Hsu, Susannah Klanecek: A Deep Dive into NLP with PyTorch | PyData London 2019
- A Deep Dive into NLP with PyTorch
- A Deep Dive into NLP with PyTorch - Slides
- Captum - Interpreting text models: IMDB sentiment analysis
- Ben Trevett's - PyTorch Sentiment Analysis
- Large Movie Review Dataset
- Nikita Sharma's Python notebook using data from Toxic Comment Classification Challenge
- Nikita Sharma's blog - Deep Learning on multi-label text classification with FastAi
- SageMaker PyTorch Container
- SageMaker PyTorch Estimators and Models
- Amazon SageMaker Workshop about Using Custom Algorithms
- An ImageNet for language