microsoft / AKSDeploymentTutorial

Tutorial on how to deploy Deep Learning models on a GPU-enabled Kubernetes cluster

License: MIT License

Languages: Jupyter Notebook 99.57%, Python 0.43%
Topics: deep-learning, tensorflow, kubernetes, gpu, python, flask

Introduction

This repo is no longer actively maintained; please see the newer version, available using Azure Machine Learning, here.

Authors: Mathew Salvaris and Fidan Boylu Uz

Deploy Deep Learning CNN on Kubernetes Cluster with GPUs

Overview

This repository contains a number of Jupyter notebook tutorials with step-by-step instructions on how to deploy a pretrained deep learning model on a GPU-enabled Kubernetes cluster. The tutorials cover how to deploy models from the following deep learning frameworks:

[Image: supported deep learning frameworks]

For each framework, we go through the following steps:

  • Model development, where we load the pretrained model and test it by scoring images
  • Developing the interface our Flask app will use to load and call the model
  • Building the Docker Image with our Flask REST API and model
  • Testing our Docker image before deployment
  • Creating our Kubernetes cluster and deploying our application to it
  • Testing the deployed model
  • Testing the throughput of our model
  • Cleaning up resources
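The "Testing the deployed model" step boils down to sending an encoded image to the service and reading back the predicted class. A minimal stdlib-only sketch of such a client; note that the URL and the JSON field name ("image") are assumptions for illustration, not the tutorial's actual schema:

```python
import base64
import json
from urllib import request

def build_payload(image_path):
    """Base64-encode an image file into a JSON request body."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return json.dumps({"image": encoded}).encode("utf-8")

def score(url, image_path):
    """POST the encoded image to the scoring endpoint and return the parsed JSON response."""
    req = request.Request(
        url,
        data=build_payload(image_path),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage against a running service (hypothetical endpoint):
# print(score("http://<service-ip>/score", "lynx.jpg"))
```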

Design

[Image: application architecture diagram]

The application we will develop is a simple image classification service, where we will submit an image and get back what class the image belongs to. The application flow for the deep learning model is as follows:

  1. The client sends an HTTP POST request with the encoded image data.
  2. The Flask app extracts the image from the request.
  3. The image is then appropriately preprocessed and sent to the model for scoring.
  4. The scoring result is then piped into a JSON object and returned to the client.
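The four steps above can be sketched as a minimal Flask handler. This is a sketch only: the route name, the JSON field names, and the placeholder preprocess/scoring functions are assumptions for illustration, not the tutorial's actual code.

```python
import base64

from flask import Flask, jsonify, request

app = Flask(__name__)

def preprocess(image_bytes):
    # Placeholder for resizing/normalizing the image for the model.
    return image_bytes

def score_model(tensor):
    # Placeholder for the actual model call; returns a dummy class label.
    return {"label": "lynx", "probability": 0.97}

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()                      # step 2: extract from request
    image_bytes = base64.b64decode(payload["image"])  # decode the encoded image
    tensor = preprocess(image_bytes)                  # step 3: preprocess
    result = score_model(tensor)                      # step 3: score
    return jsonify(result)                            # step 4: pipe into JSON

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

In the real tutorial the placeholders are replaced by the framework-specific model loading and scoring code developed in the earlier notebooks.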

If you already have a Docker image that you would like to deploy you can skip the first four notebooks.

NOTE: The tutorial walks step by step through deploying a deep learning model on Azure; it does not include enterprise best practices such as securing the endpoints, setting up remote logging, etc.

Deploying with GPUs: For a detailed comparison of the deployments of various deep learning models, see the blog post here, which provides evidence that, at least in the scenarios tested, GPUs provide better throughput and stability at a lower cost.

Prerequisites

  • Linux (Ubuntu). The tutorial was developed on an Azure Linux DSVM.
  • Docker installed. NOTE: Even with Docker installed, you may need to configure it so that you don't require sudo to execute docker commands; see "Manage Docker as a non-root user".
  • Docker Hub account.
  • Port 9999 open: the Jupyter notebook server will use port 9999, so please ensure that it is open. For instructions on how to do that on Azure, see here.

Setup

  1. Clone the repo:
     git clone <repo web URL>
  2. Log in to Docker with your username and password:
     docker login
  3. Go to the framework folder whose notebooks you would like to run.
  4. Create a conda environment:
     conda env create -f environment.yml
  5. Activate the environment:
     source activate <environment name>
  6. Start Jupyter:
     jupyter notebook
  7. Open the first notebook and make sure the kernel corresponding to the environment above is selected.

Steps

After following the setup instructions above, run the Jupyter notebooks in order. The same basic steps are followed for each deep learning framework.

Cleaning up

To remove the conda environment created above, see here. The last Jupyter notebook within each folder also gives details on deleting the Azure resources associated with this repo.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

People

Contributors

danielleodean, fboylu, marabout2015, microsoftopensource, msalvaris, msftgits


Issues

Notebook 03_testlocally

In my case, it took around 4-5 minutes to spin up the application. Perhaps the time varies by the type of VM used.

kubectl get node debugging for notebook-4

If the user runs into an issue with !kubectl get node, you could add a note telling them to look into the config file in the .kube folder on their machine to ensure it looks accurate and does not have duplicate entries.

Docker EE vs CE

Could we add a note about why we are specifically recommending Docker EE over CE? If there are no technical limitations to using either, could we add a note accordingly instead?

docker permission issues

It might be helpful to check that the Docker directory is not in the /root location; otherwise the user runs into permission issues.

Issue with image_name

For some reason I had an issue with this cat command in 04_DeployOnAKS; it seemed to add extra quotation marks that mangled the image name and caused a parsing error later:
%dotenv
image_name = os.getenv('docker_login') + os.getenv('image_repo')
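A possible workaround (an assumption on my part, not the notebook's official fix) is to strip stray surrounding quotes from the values read out of the .env file before concatenating them:

```python
import os

def clean_env(name):
    """Read an environment variable and strip stray surrounding quote marks."""
    value = os.getenv(name, "")
    return value.strip("'\"")

# Simulate values polluted with extra quotation marks by the cat command
# (variable names and values here are illustrative):
os.environ["docker_login"] = '"myuser"'
os.environ["image_repo"] = '"/tfresnet-gpu"'

image_name = clean_env("docker_login") + clean_env("image_repo")
print(image_name)  # myuser/tfresnet-gpu
```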

No module named 'tensorflow'

After following the prereqs, when I launch the first notebook for TensorFlow, I get the following error: ImportError: No module named 'tensorflow'.
I DID select the correct kernel (Microsoft ML Server Python 3.5).
This is on a fresh Ubuntu DSVM.
A cursory internet search suggests that this may be avoided by adding the --ignore-installed flag to the pip command for the TensorFlow installation, but I am not sure if this is the correct solution or a workaround that may cause additional problems later ( https://stackoverflow.com/questions/42244198/importerror-no-module-named-tensorflow ). Also, I don't know how to do this in conda.

Add Tutorial Set Up Notes

Add notes that tell people how to set themselves up for running the tutorial notebooks. Examples include:

  • Set up a virtual environment with conda env create --file environment.yml

Should there be notes for Docker setup?

Add instructions on operationalizing in commercial environment

While this is an open topic, this issue is a placeholder for informing users about:

  1. How to implement a metrics mechanism to understand API usage (current thought is App Insights).
  2. How to add security to the endpoint.
  3. How to scale to meet demand.

Current thought is that these will be covered in some other repo (maybe related to this one, maybe not). At a minimum, it needs to be made clear that the end result is not viable for commercial use as is (for most enterprise use cases). The final destination for this content is to be decided in September.

(Topic valid for all Real Time Endpoints)

Notebook 07_TearDown typo

When I run the command:
!az aks delete -n $aks_name -g $resource_group -y

Once the cell finishes executing it states 'inished' instead of 'Finished'. Not sure if this is an issue with az commands.

NVIDIA GPU resource change

AKS/Kubernetes moved NVIDIA GPU resources from being an 'alpha' resource to a stable release and changed the name of the resource on the cluster. Instead of requesting 'alpha.kubernetes.io/nvidia-gpu', pods must now request 'nvidia.com/gpu'. I believe this issue affects anybody with a Kubernetes v1.10 or above cluster.
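A minimal sketch of the updated resource request in a pod spec; the pod name, container name, and image are illustrative assumptions, and only the nvidia.com/gpu line reflects the rename described above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dl-scoring            # illustrative name
spec:
  containers:
  - name: scoring
    image: myuser/tfresnet-gpu  # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1     # formerly alpha.kubernetes.io/nvidia-gpu
```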
