microsoft / AKSDeploymentTutorial

Tutorial on how to deploy Deep Learning models on a GPU-enabled Kubernetes cluster

License: MIT License

Languages: Jupyter Notebook 99.57%, Python 0.43%
Topics: deep-learning, tensorflow, kubernetes, gpu, python, flask

Introduction

This repo is no longer actively maintained; please see the newer version, available using Azure Machine Learning, here.

Authors: Mathew Salvaris and Fidan Boylu Uz

Deploy Deep Learning CNN on Kubernetes Cluster with GPUs

Overview

This repository contains a number of Jupyter notebook tutorials with step-by-step instructions on how to deploy a pretrained deep learning model on a GPU-enabled Kubernetes cluster. The tutorials cover how to deploy models from the following deep learning frameworks:

[Image: supported deep learning frameworks]

For each framework, we go through the following steps:

  • Model development, where we load the pretrained model and test it by scoring images
  • Developing the interface our Flask app will use to load and call the model
  • Building the Docker Image with our Flask REST API and model
  • Testing our Docker image before deployment
  • Creating our Kubernetes cluster and deploying our application to it
  • Testing the deployed model
  • Testing the throughput of our model
  • Cleaning up resources
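The "Testing the deployed model" step boils down to sending an encoded image to the service and reading back the predicted class. A minimal stdlib-only sketch of such a client; note that the URL and the JSON field name ("image") are assumptions for illustration, not the tutorial's actual schema:

```python
import base64
import json
from urllib import request

def build_payload(image_path):
    """Base64-encode an image file into a JSON request body."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return json.dumps({"image": encoded}).encode("utf-8")

def score(url, image_path):
    """POST the encoded image to the scoring endpoint and return the parsed JSON response."""
    req = request.Request(
        url,
        data=build_payload(image_path),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage against a running service (hypothetical endpoint):
# print(score("http://<service-ip>/score", "lynx.jpg"))
```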

Design

[Image: application architecture diagram]

The application we will develop is a simple image classification service, where we will submit an image and get back what class the image belongs to. The application flow for the deep learning model is as follows:

  1. The client sends an HTTP POST request with the encoded image data.
  2. The Flask app extracts the image from the request.
  3. The image is then appropriately preprocessed and sent to the model for scoring.
  4. The scoring result is then piped into a JSON object and returned to the client.
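The four steps above can be sketched as a minimal Flask handler. This is a sketch only: the route name, the JSON field names, and the placeholder preprocess/scoring functions are assumptions for illustration, not the tutorial's actual code.

```python
import base64

from flask import Flask, jsonify, request

app = Flask(__name__)

def preprocess(image_bytes):
    # Placeholder for resizing/normalizing the image for the model.
    return image_bytes

def score_model(tensor):
    # Placeholder for the actual model call; returns a dummy class label.
    return {"label": "lynx", "probability": 0.97}

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()                      # step 2: extract from request
    image_bytes = base64.b64decode(payload["image"])  # decode the encoded image
    tensor = preprocess(image_bytes)                  # step 3: preprocess
    result = score_model(tensor)                      # step 3: score
    return jsonify(result)                            # step 4: pipe into JSON

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

In the real tutorial the placeholders are replaced by the framework-specific model loading and scoring code developed in the earlier notebooks.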

If you already have a Docker image that you would like to deploy you can skip the first four notebooks.

NOTE: The tutorial walks step by step through deploying a deep learning model on Azure; it does not include enterprise best practices such as securing the endpoints, setting up remote logging, etc.

Deploying with GPUs: For a detailed comparison of the deployments of various deep learning models, see the blog post here, which provides evidence that, at least in the scenarios tested, GPUs provide better throughput and stability at a lower cost.

Prerequisites

  • Linux (Ubuntu). The tutorial was developed on an Azure Linux DSVM.
  • Docker installed. NOTE: Even with Docker installed, you may need to configure it so that you don't require sudo to execute docker commands; see "Manage Docker as a non-root user".
  • Docker Hub account.
  • Port 9999 open: the Jupyter notebook server will use port 9999, so please ensure that it is open. For instructions on how to do that on Azure, see here.

Setup

  1. Clone the repo:
     git clone <repo web URL>
  2. Log in to Docker with your username and password:
     docker login
  3. Go to the framework folder whose notebooks you would like to run.
  4. Create a conda environment:
     conda env create -f environment.yml
  5. Activate the environment:
     source activate <environment name>
  6. Start Jupyter:
     jupyter notebook
  7. Open the first notebook and make sure the kernel corresponding to the environment above is selected.

Steps

After following the setup instructions above, run the Jupyter notebooks in order. The same basic steps are followed for each deep learning framework.

Cleaning up

To remove the conda environment created above, see here. The last Jupyter notebook within each folder also gives details on deleting the Azure resources associated with this repo.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

People

Contributors

danielleodean, fboylu, marabout2015, microsoftopensource, msalvaris, msftgits


Issues

Notebook 03_testlocally

In my case, it took around 4-5 minutes to spin up the application. Perhaps the time varies by the type of VM used.

kubectl get node debugging for notebook-4

If the user runs into an issue with !kubectl get node, you could add a note telling them to look into the config file in the .kube folder on their machine to ensure it looks accurate and does not have duplicate entries.

Docker EE vs CE

Could we add a note about why we are specifically recommending Docker EE over CE? If there are no technical limitations to using either, could we add a note accordingly instead?

docker permission issues

It might be helpful to check that the Docker directory is not in the /root location; otherwise the user runs into permission issues.

Issue with image_name

For some reason I had an issue with this cat command in 04_DeployOnAKS; it seemed to add extra quotation marks that mangled the image name and caused a parsing error later:
%dotenv
image_name = os.getenv('docker_login') + os.getenv('image_repo')
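A possible workaround (an assumption on my part, not the notebook's official fix) is to strip stray surrounding quotes from the values read out of the .env file before concatenating them:

```python
import os

def clean_env(name):
    """Read an environment variable and strip stray surrounding quote marks."""
    value = os.getenv(name, "")
    return value.strip("'\"")

# Simulate values polluted with extra quotation marks by the cat command
# (variable names and values here are illustrative):
os.environ["docker_login"] = '"myuser"'
os.environ["image_repo"] = '"/tfresnet-gpu"'

image_name = clean_env("docker_login") + clean_env("image_repo")
print(image_name)  # myuser/tfresnet-gpu
```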

No module named 'tensorflow'

After following the prereqs, when I launch the first notebook for TensorFlow, I get the following error: ImportError: No module named 'tensorflow'.
I DID select the correct kernel (Microsoft ML Server Python 3.5).
This is on a fresh Ubuntu DSVM.
A cursory internet search suggests that this may be avoided by adding the --ignore-installed flag to the pip command for the TensorFlow installation, but I am not sure if this is the correct solution or a workaround that may cause additional problems later ( https://stackoverflow.com/questions/42244198/importerror-no-module-named-tensorflow ). Also, I don't know how to do this in conda.

Add Tutorial Set Up Notes

Add notes that tell people how to set themselves up for running the tutorial notebooks. Examples include:

  • Set up a virtual environment with conda env create --file environment.yml

Should there be notes for Docker setup?

Add instructions on operationalizing in commercial environment

While this is an open topic, this issue is a placeholder for informing users about:

  1. How to implement a metrics mechanism to understand API usage (current thought is App Insights).
  2. How to add security to the endpoint.
  3. How to scale to meet demand.

Current thought is that these will be covered in some other repo (maybe related to this one, maybe not). At a minimum, it needs to be made clear that the end result is not viable for commercial use as is (for most enterprise use cases). The final destination for this content is to be decided in September.

(Topic valid for all Real Time Endpoints)

Notebook 07_TearDown typo

When I run the command:
!az aks delete -n $aks_name -g $resource_group -y

Once the cell finishes executing it states 'inished' instead of 'Finished'. Not sure if this is an issue with az commands.

NVIDIA GPU resource change

AKS/Kubernetes moved NVIDIA GPU resources from being an 'alpha' resource to a stable release and changed the name of the resource on the cluster. Instead of requesting 'alpha.kubernetes.io/nvidia-gpu', pods must now request 'nvidia.com/gpu'. I believe this issue affects anybody with a Kubernetes v1.10 or above cluster.
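A minimal sketch of the updated resource request in a pod spec; the pod name, container name, and image are illustrative assumptions, and only the nvidia.com/gpu line reflects the rename described above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dl-scoring            # illustrative name
spec:
  containers:
  - name: scoring
    image: myuser/tfresnet-gpu  # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1     # formerly alpha.kubernetes.io/nvidia-gpu
```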
