Azure VM Run

Run your scripts on GPU-powered VMs & compute resources on Azure, from scratch, in 5 minutes.

az vm and az ml compute Quickstart

Q: How can I script tasks to run on GPU-enabled computing on Azure?

A: You Could:

  • Use az ml run submit-script and learn about the sequence of 8 (yes, eight) resources & files you will need to create before it works
  • Create a VM, clone a repo and/or copy data to it over ssh, and copy the results back afterwards

OR You Could:

  • Use these two scripts to do either of those two tasks for you in 5 minutes and help you learn at your leisure how it all worked.

Required:

  1. An Azure Subscription with access to create resources. NB This will work on a Free Tier subscription, but provisioning will be slower and you won't have any GPU.
  2. PowerShell
  3. The az CLI, installed and logged in (az login)

Option 1. Virtual Machine with GPU and ML tooling

Required: ssh and some basic familiarity with it

Start-OnVM.ps1 somecommand     -copy . fetch . -location uksouth
# or as an abbreviation for pythoners:
Start-OnVM.ps1 -python main.py -copy . fetch . -location uksouth
  • The script will first create or confirm the Azure resources required:
    • a resource group (default name VMRun) in Azure location uksouth
    • a VM (default name DSVM)
      • with default size: NC6_PROMO
      • with default image: microsoft-dsvm:ubuntu-1804:1804
      • accepts the license for the image
      • sets the conda environment to python 3.7 with tensorflow
  • Then
    • copies your current working directory (without subdirectories) to the VM
    • runs the given command on the VM in a tmux session
    • tails the command until you press Ctrl+C
    • copies the VM's home folder back to your local current working directory

NB When you first connect to a new VM, ssh will ask you to confirm that you are OK to connect to the new host.

Yes but what about … ?

See Start-OnVM.ps1 -help for more options and details including conda environments, pip, cloning repos, recursive copy and fetch etc.

To make good use of a VM to offload training, you will want to be familiar with ssh, tmux, your choice of unix shell, and/or X-windows. The GUI bells & whistles are portrayed at https://azure.microsoft.com/en-gb/services/virtual-machines/data-science-virtual-machines/

Tear Down

Keeping an N-series VM running may cost you $10-30 per day. Delete just the VM, or the whole resource group, with one of:

az vm delete --resource-group VMRun --name DSVM
az group delete --name VMRun

Don't get stung! Check in your Azure portal that all resources have been deleted.
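
As a rough check on the daily figure above, the arithmetic is just the hourly price times 24. A minimal sketch, assuming the approximate hourly prices listed for NC6 Promo and NC6 in the Addenda of this README (prices change, so treat the numbers as illustrative; the function name is made up for this example):

```python
def daily_cost(hourly_usd: float, hours: float = 24.0) -> float:
    """Cost in USD of leaving a VM running for `hours` at `hourly_usd` per hour."""
    return round(hourly_usd * hours, 2)


# Approximate hourly prices copied from the Addenda table; illustrative only.
for size, hourly in [("NC6 Promo", 0.60), ("NC6", 1.00)]:
    print(f"{size}: ${daily_cost(hourly):.2f} per day")
```

Storage is charged separately, so the real bill for a VM left running will be somewhat higher than this.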

Option 2. Azure's managed offering for ML

NB To copy and paste into a non-PowerShell shell, replace the backtick line-continuation marks with backslashes before pasting.

Start-OnComputeTarget.ps1 ml1 ml1 ml1 ml1 -location uksouth `
        -environmentFor PyTorch `
        -submit `
        -NoConfirm
  • will provision: 1. a Resource Group with 2. a Workspace and 3. a computetarget and configure 4. an experiment, all named ml1
  • will choose 5. an Environment for PyTorch, then generate 6. an example PyTorch script and 7. an example dataset (namely mnist) to train a model and finally generate 8. a runconfig file readable by az ml run submit-script
  • will submit the script and stay attached, streaming the logs to your console. You can see progress and output at https://ml.azure.com or check status with az ml run list
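
For pasting the multi-line command above into bash or zsh, the backtick-to-backslash swap can be done mechanically. A minimal sketch, assuming a backtick counts as a continuation only when it is the last non-blank character on a line (the function name is hypothetical and not part of this repo):

```python
import re


def powershell_to_posix(command: str) -> str:
    """Replace PowerShell backtick line continuations with backslashes."""
    # Match a backtick at the end of a line, plus any trailing spaces or tabs,
    # and replace it with a single backslash.
    return re.sub(r"`[ \t]*$", r"\\", command, flags=re.MULTILINE)


ps_command = """Start-OnComputeTarget.ps1 ml1 ml1 ml1 ml1 -location uksouth `
        -environmentFor PyTorch `
        -submit `
        -NoConfirm"""
print(powershell_to_posix(ps_command))
```

Trailing whitespace after the backtick is dropped on purpose, because a backslash followed by a space is not a line continuation in POSIX shells.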

Yes but what about … ?

The script can take you from the pre-canned example to defining your own datasets, using other ML frameworks (TensorFlow etc), specifying a bigger computetarget size, etc. Call the script with -? to see more options and more detail.

./Start-OnComputeTarget.ps1 -?

Show me the GUI?

The GUI way to do this is at https://ml.azure.com, and it can take you through similar initial steps as this script. You can also use the GUI as a dashboard, to see that what the script does appears as expected in your azure account, and to see experiment results.

Tear Down

Keeping a workspace running may cost $10-$30 or more per day. Delete the whole resource group or just the workspace with one of:

az ml workspace delete -w ml1 -g ml1
az group delete --name ml1

Don't get stung! Check in your Azure portal that all resources have been deleted.

In More Detail

Azure offers two viable approaches to cloud ML:

  1. A plain virtual machine with NVidia GPU acceleration. Well, plainish: it's a complete graphical workstation with X-windows, R-studio etc., not just a command line with Anaconda and Python.
  2. A managed service with a “devops” style dashboard that can, for example, gather metrics from your training runs and provide a shared workspace for data, results, and experiment history.

Option 1. Using Microsoft's Data Science Virtual Machine image for ML training or work

Microsoft have published several “Data Science Virtual Machine” images. The recommended image runs on Ubuntu and has NVidia CUDA support and GPU management.

  • The images are preloaded with python, R studio, and a load of python ML frameworks.
  • Pre-installed Conda environments are python 3.7 or 3.8 with tensorflow or pytorch
  • The script defaults to the cheapest VM size with a GPU
  • Advantages of a VM over a computetarget:
    • Interactive as well as batch. You can SSH to the VM or connect from X-windows, so it's a desktop experience
    • Often (in my experience) faster startup than a managed computetarget, and no waiting in a queue for resources
    • Usually cheaper, by a lot. You only need the VM for runs; you take responsibility for keeping copies of your data, results and history.

Option 2. Using Azure's managed infrastructure for ML training

The managed infrastructure product is more focused on GUI than on scripting. To script it successfully, you need to know the exact sequence of resources you must create, and later tear down, to use managed cloud-based ML.

Resources created for managed ML

[Azure Subscription]
  └── ResourceGroup (at a location) : keeps AZ resources together
      └── WorkSpace : keep your ML work and resources together
          ├── Computetarget (with a vmSize which may include GPU)
          ├── Environment (simplest option is an AzureML curated one)
          ├── Dataset(s) (optional)
          └── Experiment : Keep related runs together
              └── runconfig file
                  (which references the computetarget, the optional dataset, 
                   the experiment and a script)

The workspace is the primary Machine Learning container. It offers shared access to resources, can be accessed from https://ml.azure.com and can connect to your local desktop.

  • Keeping an empty workspace alive costs about $1 per day.
  • To create and destroy a workspace each time you start work typically takes a couple of minutes, and that is the first part of what this script automates.

This Script will Take You Through These Steps

  1. Create a Resource Group. This is Azure's way to 'keep together' related resources. It is tied to an azure location and is free.

  2. In the Resource Group, create a Workspace. This will allocate some storage, and an unused workspace will cost you around $1 per day. It may take a couple of minutes to create, and slightly less time to delete.

  3. Within the Workspace, create a computetarget. This can be made to auto-scale down to 0 nodes (i.e. no cost) when idle.

  4. Choose an Experiment name. This defaults to current folder name.

  5. Choose an Environment. Use PowerShell tab-completion to see some options. An environment is typically a reference to a docker image with python and ML libraries installed e.g. TensorFlow, PyTorch, Scikit and others.

  6. Optionally, choose or register a Dataset. The script offers to create an example mnist dataset.

  7. Choose a python Script to run. Defaults to ./train.py. The script offers to create an example one.

  8. Attach a local folder on your desktop to the Workspace.

  9. Create a runconfig referencing your environment, script, dataset, computetarget.

  10. Submit the runconfig
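
The ordering constraint behind these steps can be sketched as a small dependency graph: each resource can only be created once the things it refers to exist. This is an illustration of the idea, not what the script itself does. The labels mirror the resource tree above and are not az CLI arguments, and `graphlib` needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each resource maps to the resources that must exist before it.
# Labels are taken from the resource tree in this README; they are only names
# for this sketch.
DEPENDS_ON = {
    "resource group": [],
    "workspace": ["resource group"],
    "computetarget": ["workspace"],
    "environment": ["workspace"],
    "dataset": ["workspace"],
    "experiment": ["workspace"],
    "script": [],
    "runconfig": ["computetarget", "environment", "dataset", "experiment", "script"],
}


def creation_order(depends_on):
    """Return the resources in an order where every prerequisite comes first."""
    return list(TopologicalSorter(depends_on).static_order())


order = creation_order(DEPENDS_ON)
print(" -> ".join(order))
```

Whatever order the independent items come out in, the runconfig is always last, because it references everything else.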


Not covered by this script:

  • Attach an Azure blob container as a Datastore for large datasets and uploads
  • Creating your own new Environment definition

Examples

Start-OnComputeTarget.ps1 ml1 ml1 ml1 ml1 `
  -datasetName mnist `
  -environmentFor TensorFlow `
  -script ./scripts/train.py `
  -attachFolder Yes

Will do these steps:

  • Ensure or create:
    • a resourceGroup, a workspace, a computetarget and an experiment, all called ml1
  • Ensure a dataset named mnist exists in your workspace
  • Pick the alphabetically last environment with name matching TensorFlow
  • Ensure the script ./scripts/train.py exists
  • Attach your current folder to the workspace
  • Generate a runconfig file called ml1-ml1.runconfig
  • Show you the command line to submit the run. If you add the -submit flag it will also start the run.
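
The "alphabetically last environment matching a name" rule in the steps above can be sketched as follows. This is a guess at the behaviour, not the script's actual code: it assumes a case-insensitive substring match, and the environment names are invented examples, not a real listing from a workspace.

```python
def pick_environment(names, pattern):
    """Return the alphabetically last name containing `pattern`, or None."""
    # Assumption: matching is a case-insensitive substring test.
    matches = [n for n in names if pattern.lower() in n.lower()]
    return max(matches, key=str.lower) if matches else None


# Hypothetical environment names, for illustration only.
available = [
    "AzureML-PyTorch-1.6-GPU",
    "AzureML-TensorFlow-2.1-GPU",
    "AzureML-TensorFlow-2.2-GPU",
]
print(pick_environment(available, "TensorFlow"))
```

Alphabetical order is not version order: a name containing 2.10 sorts before one containing 2.3, so it is worth checking which environment was actually picked.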
Start-OnComputeTarget.ps1 ml1 ml1 ml1 -location uksouth

Creates:

  • a resourceGroup named ml1 in Azure location uksouth
  • a workspace named ml1 in that resourceGroup
  • a computetarget ml1 of default size (nc6) in the workspace

and then stops, telling you what else you must specify to proceed.

Addenda

Azure GPU options

A computetarget is a VM or at least it is specified with a VM size, so the machine and GPU options are the same for both options. You'll want to be sure to use a VMSize that includes a GPU.

TL;DR: Use an N-series VM for the NVidia Tesla GPUs.

  • The oldest hardware - NC series - now has 50% off promo options
  • On a new account you may have to first request access to the larger VMs. You can do this via the GUI at https://ml.azure.com or via the URL you get in an error message telling you to request a quota change.

Choose from

  • $0.60 per hour for NC6Promo with Tesla K80

  • $1.20-$2.50 per hour for NC12 or NC24 Promo - 2 or 4 x Tesla K80 (2014 design, 2496 cores @ 560MHz-875MHz, 24GB GDDR5)

  • $7-15 per hour for NC12s v3 - NC24s v3 - 2-4 x Tesla V100 (640 Tensor Cores, 5120 CUDA cores, 32-64GB HBM2 memory)

  • Don't pay for dual or quad GPU machines unless your code can use them

  • You will have storage charges as well as VM charges. Bigger VMs have more expensive storage

  • Tesla K80 : A 2014 Server Kepler design (One K80 = two GK210s each with 12GB GDDR5)
  • Tesla M60 : A 2015 Workstation GPU Maxwell design
  • Tesla P100: A 2016 Datacentre Pascal design
  • Tesla V100: A 2017 Datacentre Volta design
| VM Size     | Has GPU       | NVidia GPU Rating        | GPU MHz/Cores                   | GPU RAM    | VCPUs       | VM Ram | $ per hour   |
|-------------|---------------|--------------------------|---------------------------------|------------|-------------|--------|--------------|
| NC6         | 1x Tesla K80  | 8 TFlops                 | 560-875MHz 2496 cores           | 24GB GDDR5 | 6 cpucores  | 56GB   | $1 per hr    |
| NC12        | 2x Tesla K80  | 2x 8 TFlops              | 560-875MHz 2496 cores           | 24GB GDDR5 | 12 cpucores | 112GB  | $2 per hr    |
| NC24        | 4x Tesla K80  | 4x 8 TFlops              | 560-875MHz 2496 cores           | 24GB GDDR5 | 24 cpucores | 224GB  | $5 per hr    |
| NC24r       | 4x Tesla K80  | 4x 8 TFlops              | 560-875MHz 2496 cores           | 24GB GDDR5 | 24 cpucores | 224GB  | $5 per hr    |
| NC6 Promo   | 1x Tesla K80  | 8 TFlops                 | 560-875MHz 2496 cores           | 24GB GDDR5 | 6 cpucores  | 56GB   | $0.60 per hr |
| NC12 Promo  | 2x Tesla K80  | 2x 8 TFlops              | 560-875MHz 2496 cores           | 24GB GDDR5 | 12 cpucores | 112GB  | $1 per hr    |
| NC24 Promo  | 4x Tesla K80  | 4x 8 TFlops              | 560-875MHz 2496 cores           | 24GB GDDR5 | 24 cpucores | 224GB  | $2 per hr    |
| NC24r Promo | 4x Tesla K80  | 4x 8 TFlops              | 560-875MHz 2496 cores           | 24GB GDDR5 | 24 cpucores | 224GB  | $2 per hr    |
| NC6s v3     | 1x Tesla V100 | 15TFlops / 112 TFlops    | 640 TensorCores 5120 CUDA cores | 16GB HBM2  | 6 cpucores  | 112GB  | $3 per hr    |
| NC12s v3    | 2x Tesla V100 | 2x 15TFlops / 112 TFlops | 640 TensorCores 5120 CUDA cores | 32GB HBM2  | 12 cpucores | 224GB  | $7 per hr    |
| NC24s v3    | 4x Tesla V100 | 4x 15TFlops / 112 TFlops | 640 TensorCores 5120 CUDA cores | 64GB HBM2  | 24 cpucores | 448GB  | $14 per hr   |
| NC24rs v3   | 4x Tesla V100 | 4x 15TFlops / 112 TFlops | 640 TensorCores 5120 CUDA cores | 64GB HBM2  | 24 cpucores | 448GB  | $15 per hr   |
| NV6         | 1x Tesla M60  | 9 TFlops                 | 4096 CUDA cores                 | 16GB GDDR5 | 6 cpucores  | 56GB   | $1 per hr    |
| NV12        | 2x Tesla M60  | 2x 9 TFlops              | 4096 CUDA cores                 | 16GB GDDR5 | 12 cpucores | 112GB  | $3 per hr    |
| NV24        | 4x Tesla M60  | 4x 9 TFlops              | 4096 CUDA cores                 | 16GB GDDR5 | 24 cpucores | 224GB  | $6 per hr    |
| NV6 Promo   | 1x Tesla M60  | 9 TFlops                 | 4096 CUDA cores                 | 16GB GDDR5 | 6 cpucores  | 56GB   | $0.60 per hr |
| NV12 Promo  | 2x Tesla M60  | 2x 9 TFlops              | 4096 CUDA cores                 | 16GB GDDR5 | 12 cpucores | 112GB  | $1 per hr    |
| NV24 Promo  | 4x Tesla M60  | 4x 9 TFlops              | 4096 CUDA cores                 | 16GB GDDR5 | 24 cpucores | 224GB  | $3 per hr    |
| NV12s v3    | 1x Tesla M60  | 9 TFlops                 | 4096 CUDA cores                 | 16GB GDDR5 | 12 cpucores | 112GB  | $1 per hr    |
| NV24s v3    | 2x Tesla M60  | 2x 9 TFlops              | 4096 CUDA cores                 | 16GB GDDR5 | 24 cpucores | 224GB  | $2 per hr    |
| NV48s v3    | 4x Tesla M60  | 4x 9 TFlops              | 4096 CUDA cores                 | 16GB GDDR5 | 48 cpucores | 448GB  | $5 per hr    |
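
The advice not to pay for dual or quad GPU machines unless your code can use them can be put in numbers by dividing the hourly price by the GPUs a job will actually use. A minimal sketch, with a few rows copied from the table above (approximate prices) and a made-up function name:

```python
# (size, gpu_count, usd_per_hour) copied from a few rows of the table above;
# the prices are approximate and will change.
SIZES = [
    ("NC6 Promo", 1, 0.60),
    ("NC12 Promo", 2, 1.00),
    ("NC24 Promo", 4, 2.00),
]


def cost_per_used_gpu(gpu_count, usd_per_hour, usable_gpus=1):
    """Hourly price divided by the number of GPUs the code will actually use."""
    if usable_gpus < 1:
        raise ValueError("usable_gpus must be at least 1")
    used = min(gpu_count, usable_gpus)
    return usd_per_hour / used


for usable in (1, 4):
    print(f"Code that can use {usable} GPU(s):")
    for size, gpus, price in SIZES:
        per_gpu = cost_per_used_gpu(gpus, price, usable)
        print(f"  {size}: ${per_gpu:.2f} per used GPU-hour")
```

For single-GPU code the larger sizes cost more per used GPU, because the extra GPUs sit idle.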

MS Docs on:

MacOs X-Windows connections
