
distributeddeeplearning's Introduction

Introduction

This repo contains a cookiecutter template for running distributed training of deep learning models using Azure Machine Learning. You can create clusters with 0 nodes, which incur no cost while idle, and scale up to hundreds of nodes. It is also possible to use low-priority nodes to reduce costs even further.
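For illustration, here is a minimal sketch of what such a scale-to-zero, low-priority cluster definition looks like with the azureml Python SDK. The project's control scripts handle this for you; the VM size, cluster name and node counts below are assumptions, not the template's actual values.

from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()  # assumes an existing workspace config file

# Scale-to-zero cluster: costs nothing while idle, scales out when jobs arrive.
config = AmlCompute.provisioning_configuration(
    vm_size="Standard_NC24rs_v3",  # hypothetical GPU SKU; choose your own
    min_nodes=0,
    max_nodes=8,
    vm_priority="lowpriority",     # low-priority nodes reduce cost further
)
cluster = ComputeTarget.create(ws, "gpu-cluster", config)
cluster.wait_for_completion(show_output=True)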

The project contains the following:

Tensorflow Benchmark

This is a demo template that allows you to easily run tf_cnn_benchmarks on Azure ML. It is a great way to test performance as well as to compare against other platforms.

Tensorflow Imagenet

This is another demo template that shows you how to train a ResNet50 model using Imagenet on Azure. We include scripts for processing the Imagenet data, transforming it into TFRecords, and leveraging AzCopy to quickly upload the data to the cloud.

Tensorflow Template

This is a blank template you can use for your own distributed training projects. It allows you to leverage all the tooling built around the previous two demos to speed up the time it takes to run your model in a distributed fashion on Azure.

PyTorch Benchmark

This is a demo template that allows you to easily run a simple PyTorch benchmarking script on Azure ML. It is a great way to test performance as well as to compare against other platforms.

PyTorch Imagenet

This is another demo template that shows you how to train a ResNet50 model using Imagenet on Azure. We include scripts for processing the Imagenet data and leveraging AzCopy to quickly upload the data to the cloud.

PyTorch Template

This is a blank template you can use for your own distributed training projects. It allows you to leverage all the tooling built around the previous two demos to speed up the time it takes to run your model in a distributed fashion on Azure.

Prerequisites

Before you get started you need a PC running Ubuntu with the following installed:
Docker
Nvidia runtime for Docker [required for local execution]
Python >= 3.6
Cookiecutter
Git

Note: You will need to be able to run Docker without sudo. To enable this, run:

sudo usermod -aG docker $USER
newgrp docker 

Setup

Using the template

Once you have Cookiecutter installed you can either directly invoke project creation as follows:

cookiecutter gh:Microsoft/distributeddeeplearning

or clone the repository locally and then invoke:

git clone https://github.com/Microsoft/distributeddeeplearning.git
cookiecutter distributeddeeplearning

Cookiecutter will then ask you about a number of fields which it will use to construct your project. To accept a default, don't type anything and just press Enter. Many of the fields can be left at their default values; the ones that are absolutely necessary are highlighted.

project_title: The title of your project

project_name: The folder in which your project will be created. Make sure it is a valid Linux folder name.

resource_group: The name of the resource group in Azure under which all the resources will be created. It is fine if it already exists

workspace: The AML workspace that the project will use. If it doesn't already exist, it will be created.

sub_id: The subscription ID for your project. You can look this up on the portal or retrieve it with a command in the Cloud Shell. It isn't mandatory, though; the application will give you the option to select it later.

vm_size: The VM type to use for distributed training

minimum_number_nodes: The minimum number of nodes in the cluster. Set to 0 if you want it to scale down after use to reduce costs
maximum_number_nodes: The maximum number of nodes in the cluster

cluster_name: The name of the cluster to use. It will be created if it doesn't exist.

container_registry: The name of the Docker Hub or other registry account to which you may want to push your control-plane Docker container. If you don't have one, or don't want to push the container, simply leave the default.

type: The type of project you want:

  • all: All of them
  • template: Just create a template for distributed training
  • benchmark: Create a project that will run the TensorFlow and PyTorch benchmarks
  • imagenet: Create an example project that will run against the imagenet data. (You will need to download the imagenet data)

region: Which region to create Azure resources in

experiment_name: The name of the experiment

data: The absolute path on your computer where you will store the imagenet data. The location needs to have around 400GB of space

image_name: The name to give the control plane docker image

datastore_name: Name of the datastore that will be created as part of the project

container_name: The name of the container in your storage account that will hold the data

Once the project is created you will still be able to change many of the above options, as they will be present in the .env file that will be created.
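The .env file is a plain KEY=VALUE file, so the settings can also be read programmatically. A hypothetical sketch using python-dotenv follows; the project's own control/src/config.py may do this differently, and the variable names here are assumptions.

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv(".env")                                      # load KEY=VALUE pairs into the environment
cluster_name = os.getenv("CLUSTER_NAME")                 # variable name is an assumption
max_nodes = int(os.getenv("MAXIMUM_NUMBER_NODES", "8"))  # so is this one
print(cluster_name, max_nodes)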

Building environment

Distributed training is complex and often has a number of moving parts. To reduce the overhead of installing packages and managing environments we use a Docker container to encapsulate our environment. Once you have created the project, simply navigate to the root folder created by cookiecutter and run:

make build

This will build your Docker container. Inside the container will be an appropriately set up conda environment, a number of utilities such as AzCopy, and everything else you will need to run your distributed training job. Once your container is built, run:

make run

This will put you in an environment inside your container in a tmux session (for a tutorial on tmux see here). The tmux control key has been mapped to ctrl+a rather than the standard ctrl+b so as not to interfere with an outer tmux session if you are already a tmux user. You can alter this in the tmux.conf file in the Docker folder. The Docker container maps the location you launched it from to /workspace inside the container, so you can edit files in the project folder outside of the container and the changes will be reflected inside it.

Imagenet data

If you selected all, tensorflow_imagenet or pytorch_imagenet for the type question during cookiecutter invocation, you will need to have ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar present in the directory you specified as your data directory. Go to the download page (you may need to register an account), find the page for ILSVRC2012, and download the two files mentioned above.

Template selection

Based on the option you selected for type during the cookiecutter invocation you will get all or one of the options below. Cookiecutter will create your project folder, which will contain the template folders. Once inside your project folder, make sure you have run the make build and make run commands as described in the building environment section above. After the run command you will be greeted by a prompt; this is now your control plane. First you will need to set everything up. To do this run:

inv setup 

It will ask you to log in, so follow the prompts in the terminal. If you selected all for the template type it will also prepare the imagenet data. Now you will be ready to run the templates.
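Control-plane commands such as inv setup are invoke tasks defined in tasks.py and the per-template modules. Purely for orientation, here is a minimal sketch of how such a task is wired up with the invoke library; the task bodies are placeholders, not the project's real implementation.

from invoke import task, Collection

@task
def login(c):
    """Log in to Azure CLI"""
    c.run("az login")

@task(pre=[login])
def setup(c):
    """Setup the environment and process the imagenet data"""
    c.run("az account show")  # placeholder for the real setup steps

# invoke discovers tasks through this namespace, exposing `inv login` and `inv setup`
ns = Collection(login, setup)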

Tensorflow Benchmark

This is a demo template that allows you to easily run tf_cnn_benchmarks on Azure ML. It is a great way to test performance as well as to compare against other platforms. To use this you must select either benchmark or all when invoking cookiecutter. Once setup is complete, simply run:

inv tf-benchmark.submit.local.synthetic

to run things locally on a single GPU. Note that the first time you run this you will have to build the environment. To run things on a cluster, simply run:

inv tf-benchmark.submit.remote.synthetic

Note that this will create the cluster if it wasn't created earlier and create the appropriate environment.

Tensorflow Imagenet

This is the second demo template; it will train a ResNet50 model on imagenet. It offers the options of using synthetic data, image data, or TFRecords. To use this you must select either tensorflow_imagenet or all when cookiecutter asks what type of project you want to create. To run things locally using synthetic data, simply run:

inv tf-imagenet.submit.local.synthetic

To run things on a remote cluster with real data in TFRecords format, simply run:

inv tf-imagenet.submit.remote.tfrecords

This only covers a small number of commands; to see the full list, simply run inv --list.

Tensorflow Experiment

This is the option that you should use if you want to run your own training script. It is up to you to add the appropriate training scripts and modify the tensorflow_experiment.py file to run the appropriate commands. To see how to invoke things, look at the other examples.
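As a starting point, a Horovod-style skeleton is one common way to fill in TensorFlow_experiment/src/train_model.py for multi-GPU, multi-node runs. The sketch below is illustrative only (TF 2.x Keras style, with a tiny random batch instead of a real data pipeline); it is not the template's actual code.

import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin one GPU per process (TF 2.x style; adjust for your TF version).
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.applications.ResNet50(weights=None)

# Scale the learning rate by the number of workers and wrap the optimizer.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(learning_rate=0.001 * hvd.size()))
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

# Tiny synthetic batch so the sketch runs end to end; replace with your dataset.
x = np.random.rand(8, 224, 224, 3).astype("float32")
y = np.random.randint(0, 1000, size=(8,))
model.fit(x, y, batch_size=4, epochs=1,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)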

Pytorch Benchmark

This is a demo template that allows you to easily run a simple PyTorch benchmarking script on Azure ML. To use this you must select either benchmark or all when invoking cookiecutter. Once setup is complete, simply run:

inv pytorch-benchmark.submit.local.synthetic

to run things locally on a single GPU. Note that the first time you run this you will have to build the environment. To run things on a cluster, simply run:

inv pytorch-benchmark.submit.remote.synthetic

Note that this will create the cluster if it wasn't created earlier and create the appropriate environment.

PyTorch Imagenet

This is the second demo template; it will train a ResNet50 model on imagenet. It offers the options of using synthetic data or image data. To use this you must select either pytorch_imagenet or all when cookiecutter asks what type of project you want to create. To run things locally using synthetic data, simply run:

inv pytorch-imagenet.submit.local.synthetic

To run things on a remote cluster with real imagenet image data, simply run:

inv pytorch-imagenet.submit.remote.images

Pytorch Experiment

This is the option that you should use if you want to run your own training script. It is up to you to add the appropriate training scripts and modify the pytorch_experiment.py file to run the appropriate commands. To see how to invoke things, look at the other examples.
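For orientation, the PyTorch imagenet example in this repo uses Horovod (see imagenet_pytorch_horovod.py), so a Horovod skeleton is a natural starting point for PyTorch_experiment/src/train_model.py. The sketch below is illustrative only, with a toy model and random data rather than a real training script.

import torch
import torch.nn as nn
import torch.optim as optim
import horovod.torch as hvd

hvd.init()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    torch.cuda.set_device(hvd.local_rank())  # pin one GPU per process

model = nn.Linear(10, 2).to(device)          # stand-in for your real model
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer for allreduce and sync the initial state across workers.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(10):                        # placeholder training loop on random data
    x = torch.randn(32, 10, device=device)
    y = torch.randint(0, 2, (32,), device=device)
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    if hvd.rank() == 0 and step % 5 == 0:
        print(f"step {step} loss {loss.item():.4f}")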

Architecture

Below is a diagram that shows how the project is set up.

The Docker container you created using make build is the control plane, and from there we can invoke jobs to execute either locally or in the cloud. Local execution is meant for debugging and runs on a single GPU. The mapping of data locations is handled by the control scripts: during local execution the appropriate location is mapped into the container, and during remote execution the datastore created during setup is mounted onto each of the VMs in the cluster.
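To make the remote path concrete, here is a hypothetical sketch of submitting a distributed job with the azureml SDK's (now legacy) TensorFlow estimator and MPI. The project's control/src/aml_compute.py wraps this kind of call for you; the paths, cluster name and experiment name below are assumptions, not the template's actual values.

from azureml.core import Workspace, Experiment
from azureml.core.runconfig import MpiConfiguration
from azureml.train.dnn import TensorFlow

ws = Workspace.from_config()

estimator = TensorFlow(
    source_directory="TensorFlow_imagenet/src",  # assumed path
    entry_script="resnet_main.py",
    compute_target="gpu-cluster",                # assumed cluster name
    node_count=4,
    distributed_training=MpiConfiguration(),     # one MPI process per node by default
    use_gpu=True,
)

run = Experiment(ws, "distributed-resnet").submit(estimator)
run.wait_for_completion(show_output=True)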

Project structure

The original project structure is as shown below.

├── cookiecutter.json  <-- Cookiecutter json that holds all the variables for the projects  
├── hooks  
│  ├── post_gen_project.py  
│  └── pre_gen_project.py  
├── images  
│  └── demo.svg  
├── LICENSE  
├── README.md <-- This readme  
└── {{cookiecutter.project_name}}  
   ├── _dotenv_template <-- Template that is read and translated into .env file  
   ├── control <-- Holds all files for the control plane  
   │  ├── Docker <-- Contains the files used to build the control plane docker container
   │  │  ├── azure_requirements.txt <-- Azure python requirements
   │  │  ├── bash.completion <-- Completion script for invoke
   │  │  ├── dockerfile
   │  │  ├── environment.yml <-- Conda environment specification for control plane
   │  │  ├── jupyter_notebook_config.py 
   │  │  └── tmux.conf <-- Tmux configuration
   │  └── src
   │     ├── aml_compute.py <-- Module that holds methods for creating cluster and submitting experiments using Azure ML
   │     ├── config.py <-- Module for loading and working with .env config
   │     └── logging.conf <-- Logging configuration for control plane
   ├── Makefile <-- Makefile to build and run control plane
   ├── scripts
   │  ├── convert_imagenet_to_tf_records.py <-- Script for transforming imagenet data to tf records
   │  ├── image.py <-- Invoke module for working with images
   │  ├── imagenet_nounid_to_class.json <-- Imagenet nounid lookup
   │  ├── prepare_imagenet.py <-- Script for preparing imagenet data
   │  ├── storage.py <-- Invoke module for using Azure storage
   │  └── tfrecords.py <-- Invoke module for working with tf records
   ├── tasks.py <-- Main invoke module
   ├── PyTorch_benchmark<-- Template for running PyTorch benchmarks
   │  ├── environment_cpu.yml
   │  ├── environment_gpu.yml<-- Conda specification file used by Azure ML to create environment to run project in
   │  ├── pytorch_benchmark.py<-- Invoke module for running benchmarks
   │  └── src
   │     └── pytorch_synthetic_benchmark.py
   ├── PyTorch_imagenet
   │  ├── environment_cpu.yml
   │  ├── environment_gpu.yml<-- Conda specification file used by Azure ML to create environment to run project in
   │  ├── pytorch_imagenet.py<-- Invoke module for running benchmarks
   │  └── src
   │     ├── imagenet_pytorch_horovod.py
   │     ├── logging.conf
   │     └── timer.py
   ├── PyTorch_experiment<-- PyTorch distributed training template [Put your code here]
   │  ├── environment_cpu.yml
   │  ├── environment_gpu.yml<-- Conda specification file used by Azure ML to create environment to run project in
   │  ├── pytorch_experiment.py<-- Invoke module for running benchmarks
   │  └── src
   │     └── train_model.py
   ├── TensorFlow_benchmark <-- Template for running Tensorflow benchmarks
   │  ├── environment_cpu.yml 
   │  ├── environment_gpu.yml <-- Conda specification file used by Azure ML to create environment to run project in
   │  ├── src <-- Folder where tensorflow benchmarks code will be cloned into
   │  └── tensorflow_benchmark.py <-- Invoke module for running benchmarks  
   ├── TensorFlow_experiment <-- Tensorflow distributed training template [Put your code here]
   │  ├── environment_cpu.yml
   │  ├── environment_gpu.yml <-- Conda specification file used by Azure ML to create environment to run project in
   │  ├── src
   │  │  ├── logging.conf
   │  │  └── train_model.py <-- Template file
   │  └── tensorflow_experiment.py <-- Invoke module for running template
   └── TensorFlow_imagenet
      ├── environment_cpu.yml
      ├── environment_gpu.yml <-- Conda specification file used by Azure ML to create environment to run project in
      ├── src <-- Code for training ResNet50 model on imagenet
      │  ├── data
      │  │  ├── __init__.py
      │  │  ├── images.py
      │  │  ├── synthetic.py
      │  │  └── tfrecords.py
      │  ├── defaults.py
      │  ├── imagenet_preprocessing.py
      │  ├── logging.conf
      │  ├── resnet_main.py <-- Main entry script
      │  ├── resnet_model.py
      │  ├── resnet_run_loop.py
      │  ├── timer.py
      │  └── utils.py
      └── tensorflow_imagenet.py <-- Invoke module for running imagenet experiment

Depending on the options chosen only certain branches will be moved over to your project.

Options

These are the options available when using the template. They can differ depending on the type of project you choose to create. To see this list yourself, simply run:

inv --list
  delete                                     Delete the resource group and all associated resources
  experiments                                Prints list of experiments
  interactive (i)                            Open IPython terminal and load in modules to work with AzureML
  login                                      Log in to Azure CLI
  runs                                       Prints information on last N runs in specified experiment
  select-subscription                        Select Azure subscription to use
  setup                                      Setup the environment and process the imagenet data
  tensorboard                                Runs tensorboard in a separate tmux session
  pytorch-benchmark.submit.local.synthetic    Submit PyTorch training job using synthetic data for local execution
  pytorch-benchmark.submit.remote.synthetic   Submit PyTorch training job using synthetic data to remote cluster
  pytorch-imagenet.submit.local.images        Submit PyTorch training job using real imagenet data for local execution
  pytorch-imagenet.submit.local.synthetic     Submit PyTorch training job using synthetic imagenet data for local execution
  pytorch-imagenet.submit.remote.images       Submit PyTorch training job using real imagenet data to remote cluster
  pytorch-imagenet.submit.remote.synthetic    Submit PyTorch training job using synthetic imagenet data to remote cluster
  storage.create-resource-group
  storage.store-key                          Retrieves premium storage account key from Azure and stores it in .env file
  storage.image.create-container             Creates container based on the parameters found in the .env file
  storage.image.download-data                Download training and validation data from blob container specified in .env file
  storage.image.download-training            Download training data from blob container specified in .env file
  storage.image.download-validation          Download validation data from blob container specified in .env file
  storage.image.prepare-imagenet             Prepare imagenet data found in download_dir and push results to target_dir
  storage.image.upload-data                  Upload training and validation data to container specified in .env file
  storage.image.upload-training-data         Upload training data to container specified in .env file
  storage.image.upload-validation-data       Upload validation data to container specified in .env file
  storage.create-container                   Creates container based on the parameters found in the .env file
  storage.create-premium-storage             Creates premium storage account. By default the values are loaded from the local .env file
  storage.tfrecords.upload-validation-data   Upload tfrecords validation data to container specified in .env file
  tf-benchmark.submit.local.synthetic        Submits TensorFlow benchmark job using synthetic data for local execution
  tf-benchmark.submit.remote.synthetic       Submits TensorFlow benchmark job using synthetic data on remote cluster
  tf-experiment.submit.local.images          This command isn't implemented please modify to use.
  tf-experiment.submit.local.synthetic       This command isn't implemented please modify to use.
  tf-experiment.submit.remote.images         This command isn't implemented please modify to use.
  tf-experiment.submit.remote.synthetic      This command isn't implemented please modify to use.
  tf-imagenet.submit.local.images            Submit TensorFlow training job using real imagenet data for local execution
  tf-imagenet.submit.local.synthetic         Submit TensorFlow training job using synthetic imagenet data for local execution
  tf-imagenet.submit.local.tfrecords         Submit TensorFlow training job using real imagenet data as tfrecords for local execution
  tf-imagenet.submit.remote.images           Submit TensorFlow training job using real imagenet data to remote cluster
  tf-imagenet.submit.remote.synthetic        Submit TensorFlow training job using synthetic imagenet data to remote cluster
  tf-imagenet.submit.remote.tfrecords        Submit TensorFlow training job using real imagenet data as tfrecords to remote cluster

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.


distributeddeeplearning's Issues

[FEATURE] templatize node count for remote execution in tensorflow_experiment.py

Description

Templatize the node count for remote execution in tensorflow_experiment.py so that it equals the maximum number of nodes set during the cookiecutter setup.

By default node_count=4, but I set my max nodes to 3, so it failed.

Expected behavior with the suggested feature

Will not run into:

"Requested 4 nodes but BatchAI cluster only has 3 maximum nodes."

Other Comments

[BUG]readme cookiecutter install command is outdated

Description

Running this:
cookiecutter gh:Microsoft/AMLDistCC
causes this error:
"The repository https://github.com/Microsoft/AMLDistCC.git could not be found, have you made a typo?"

Update the readme to cookiecutter gh:Microsoft/DistributedDeepLearning

In which platform does it happen?

Ubuntu 18

How do we replicate the issue?

Expected behavior (i.e. solution)

Other Comments

[QUESTION] ambiguity of setup parameters

Description

  • Not sure if I need the cookiecutter param 'data' if I do not select 'imagenet' for the type, since the description says it is the location to store imagenet
  • The 'image_name' prompt feels ambiguous. Users don't know ahead of time that they are using a Docker image as a control plane. Maybe specify in the cookiecutter prompt that it is a Docker image, or rename it to docker_image_name
  • Cookiecutter prompts for the datastore name as 'account name', which is ambiguous (account of what?) and not reflected in the readme
  • Same issue for prompting 'account key'; the naming is ambiguous. Also, shouldn't this be generated instead of prompting the user for it? (Also not reflected in the readme)
  • Perhaps rename 'container_name' -> 'storage_container_name' (since people may mix up the term container)

Other Comments

[QUESTION] Native docker experience vs tmux

Description

I personally like the tmux experience, but should we impose it as the default? People not familiar with tmux will surely find navigating it confusing.

An alternate experience, which may seem more neutral, is one where make run doesn't start tmux by default. Users would have to use another command like make run_with_tmux if they wanted tmux as their session manager. Not really sure; probably worth getting more opinions on this.

Other Comments

[FEATURE] templatize environment name in environment_gpu/cpu.ymls

Description

Users may want to test locally within the conda env before running it via AML (at least I do), so I build the environment locally first. However, if I build it as is, my environment name will be 'project_environment', when it should probably be one of the parameters passed into cookiecutter during setup.

Expected behavior with the suggested feature

Running conda env create -f environment_gpu.yml should create an environment named something more appropriate than project_environment.

Other Comments
