aws-automation

Distributed Machine Learning on AWS Cloud: Computing with CPUs and GPUs

Introduction

Instead of buying, owning, and maintaining physical data centers and servers, you can access technology services (such as computing power, storage, and databases) from a cloud provider like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud.

This repository helps you run single-machine and distributed (multi-machine) computation on AWS, with both CPU and GPU execution. The table of contents below covers all the applications we provide.

Notes of Caution

  • Because cloud computing is charged by usage, please make sure to terminate or stop your virtual instances after you are done with them.
  • Because a cloud computing resource is often shared among many users, please include your name in your key file name, such as jianwu-key, so we know the creator of each virtual instance.
  • See different versions of this repository in Tags.

Table of Contents

The following is an overall set of instructions for all our implementations. For detailed instructions on CPU and GPU executions, please go to the cpu-example and gpu-example folders.

Web based

  1. Launch instances on the EC2 console:

    1. Choose an Amazon Machine Image (AMI)
    An AMI is a template that contains the software configuration (operating system, application server, and applications) required to launch your instance. For CPU applications, we use Ubuntu Server 20.04 LTS (HVM), SSD Volume Type; for GPU applications, we use Deep Learning Base AMI (Ubuntu 16.04) Version 40.0.

    2. Choose an Instance Type
    Based on your purpose, AWS provides various instance types, listed at https://aws.amazon.com/ec2/instance-types/. For CPU applications, we recommend the c5.2xlarge instance; for GPU applications, we recommend the p3.2xlarge instance.

    3. Configure the Number of Instances
    We use 1 instance for single-machine computation and 2 instances for distributed computation.

    4. Configure Security Group

    5. Review, Create your SSH key pair, and Launch

    6. View your Instance and wait for Initializing

    7. SSH into your instance

  2. Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo service docker start
sudo usermod -a -G docker ubuntu
sudo chmod 666 /var/run/docker.sock
  3. Download Docker images, or build images from a Dockerfile.
  • CPU example:
docker pull starlyxxx/dask-decision-tree-example
  • GPU example:
docker pull starlyxxx/horovod-pytorch-cuda10.1-cudnn7
  • or, build from Dockerfile:
docker build -t <your-image-name> .
  4. Download ML applications and data from AWS S3.
  • For privacy, we store the application code and data on AWS S3. Install the AWS CLI and set your AWS credentials.
curl 'https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip' -o 'awscliv2.zip'
unzip awscliv2.zip
sudo ./aws/install
aws configure set aws_access_key_id your-access-key
aws configure set aws_secret_access_key your-secret-key
  • Download the ML applications and data from AWS S3.

    • CPU example:

      Download:

    aws s3 cp s3://kddworkshop/ML_based_Cloud_Retrieval_Use_Case.zip ./

    or

    wget https://kddworkshop.s3.us-west-2.amazonaws.com/ML_based_Cloud_Retrieval_Use_Case.zip

    Extract the files:

    unzip ML_based_Cloud_Retrieval_Use_Case.zip
    • GPU example:

      Download:

    aws s3 cp s3://kddworkshop/MultiGpus-Domain-Adaptation-main.zip ./
    aws s3 cp s3://kddworkshop/office31.tar.gz ./

    or

    wget https://kddworkshop.s3.us-west-2.amazonaws.com/MultiGpus-Domain-Adaptation-main.zip
    wget https://kddworkshop.s3.us-west-2.amazonaws.com/office31.tar.gz

    Extract the files:

    unzip MultiGpus-Domain-Adaptation-main.zip
    tar -xzvf office31.tar.gz
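The `aws configure set` commands in the step above persist credentials in an INI file (`~/.aws/credentials` for the default profile). As a minimal sketch of the format they produce, here is how Python's standard `configparser` can read such a file back (the key values are placeholders, not real credentials):

```python
import configparser

# Example of what `aws configure set` writes to ~/.aws/credentials
# for the default profile (placeholder values, not real keys).
EXAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = AKIAEXAMPLEKEY
aws_secret_access_key = example-secret-key
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE_CREDENTIALS)

access_key = parser["default"]["aws_access_key_id"]
secret_key = parser["default"]["aws_secret_access_key"]
print(access_key)  # -> AKIAEXAMPLEKEY
```

The `aws s3 cp` and boto-based tools both resolve credentials from this file, which is why `aws configure set` only needs to run once per VM.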
  5. Run docker containers for CPU applications.
  • Single CPU:
docker run -it -v /home/ubuntu/ML_based_Cloud_Retrieval_Use_Case:/root/ML_based_Cloud_Retrieval_Use_Case starlyxxx/dask-decision-tree-example:latest /bin/bash
  • Multi-CPUs:
docker run -it --network host -v /home/ubuntu/ML_based_Cloud_Retrieval_Use_Case:/root/ML_based_Cloud_Retrieval_Use_Case starlyxxx/dask-decision-tree-example:latest /bin/bash
  6. Run docker containers for GPU applications.
  • Single GPU:
nvidia-docker run -it -v /home/ubuntu/MultiGpus-Domain-Adaptation-main:/root/MultiGpus-Domain-Adaptation-main -v /home/ubuntu/office31:/root/office31 starlyxxx/horovod-pytorch-cuda10.1-cudnn7:latest /bin/bash
  • Multi-GPUs:

    • Add primary worker’s public key to all secondary workers’ <~/.ssh/authorized_keys>
    sudo mkdir -p /mnt/share/ssh && sudo cp ~/.ssh/* /mnt/share/ssh
    • Primary worker VM:
    nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh -v /home/ubuntu/MultiGpus-Domain-Adaptation-main:/root/MultiGpus-Domain-Adaptation-main -v /home/ubuntu/office31:/root/office31 starlyxxx/horovod-pytorch-cuda10.1-cudnn7:latest /bin/bash
    • Secondary workers VM:
    nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh -v /home/ubuntu/MultiGpus-Domain-Adaptation-main:/root/MultiGpus-Domain-Adaptation-main -v /home/ubuntu/office31:/root/office31 starlyxxx/horovod-pytorch-cuda10.1-cudnn7:latest bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
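The primary and secondary `nvidia-docker run` commands above differ only in what the container executes: the primary gets an interactive shell, while each secondary starts `sshd` on port 12345 and idles so the primary can reach it over the shared keys. A hedged sketch of composing the two variants (`build_worker_cmd` is our illustrative helper, not part of this repository):

```python
def build_worker_cmd(role, image, mounts, ssh_port=12345):
    """Compose the nvidia-docker command used above for one worker VM.

    role:   "primary" (interactive shell) or "secondary" (runs sshd and idles).
    mounts: list of (host_path, container_path) volume pairs.
    """
    parts = ["nvidia-docker run -it --network=host"]
    for host_path, container_path in mounts:
        parts.append(f"-v {host_path}:{container_path}")
    parts.append(image)
    if role == "primary":
        parts.append("/bin/bash")
    else:
        # Secondary workers expose SSH on a non-default port, then sleep
        # so the container stays alive for horovodrun to connect.
        parts.append(f'bash -c "/usr/sbin/sshd -p {ssh_port}; sleep infinity"')
    return " ".join(parts)

cmd = build_worker_cmd(
    "secondary",
    "starlyxxx/horovod-pytorch-cuda10.1-cudnn7:latest",
    [("/mnt/share/ssh", "/root/.ssh")],
)
print(cmd)
```

Note that `--network=host` lets the containers use the VMs' own network interfaces, which is what makes the port-12345 SSH connection between workers possible.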
  7. Run ML CPU application:

    • Single CPU:
    cd ML_based_Cloud_Retrieval_Use_Case/Code && /usr/bin/python3.6 ml_based_cloud_retrieval_with_data_preprocessing.py
    • Multi-CPUs:

      • Run dask cluster on both VMs in background:

        • VM 1:
        dask-scheduler & dask-worker <your-dask-scheduler-address> &
        • VM 2:
        dask-worker <your-dask-scheduler-address> &
    • One of VMs:

    cd ML_based_Cloud_Retrieval_Use_Case/Code && /usr/bin/python3.6 dask_ml_based_cloud_retrieval_with_data_preprocessing.py <your-dask-scheduler-address>
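The `<your-dask-scheduler-address>` placeholder above is the scheduler's TCP endpoint; by default, `dask-scheduler` listens on port 8786. A small sketch of building that address from the scheduler VM's IP (the helper name is ours, not the repo's):

```python
def scheduler_address(host, port=8786):
    """Build the tcp:// address that dask-worker and the Dask client expect.

    8786 is dask-scheduler's default listening port.
    """
    return f"tcp://{host}:{port}"

# e.g. the private IP of VM 1, where dask-scheduler runs
addr = scheduler_address("172.31.0.10")
print(addr)  # -> tcp://172.31.0.10:8786
```

The same address string is passed to `dask-worker` on both VMs and to the `dask_ml_based_cloud_retrieval_with_data_preprocessing.py` script, so all workers join the same cluster.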
  8. Run ML GPU application

    • Single GPU:
    cd MultiGpus-Domain-Adaptation-main
    horovodrun --verbose -np 1 -H localhost:1 /usr/bin/python3.6 main.py --config DeepCoral/DeepCoral.yaml --data_dir ../office31 --src_domain webcam --tgt_domain amazon
    • Multi-GPUs:

      • Primary worker VM:
      cd MultiGpus-Domain-Adaptation-main
      horovodrun --verbose -np 2 -H <machine1-address>:1,<machine2-address>:1 -p 12345 /usr/bin/python3.6 main.py --config DeepCoral/DeepCoral.yaml --data_dir ../office31 --src_domain webcam --tgt_domain amazon
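In the `horovodrun` commands above, `-np` is the total number of processes and `-H` lists each host with its slot count, so the two must stay consistent. A hedged sketch of deriving both from a host list (the helper is ours, for illustration):

```python
def horovod_host_args(hosts, slots_per_host=1):
    """Build the -np and -H values for horovodrun.

    Each host contributes `slots_per_host` GPU slots, so -np is the
    total slot count across all hosts.
    """
    n_procs = len(hosts) * slots_per_host
    host_arg = ",".join(f"{host}:{slots_per_host}" for host in hosts)
    return n_procs, host_arg

np_procs, host_arg = horovod_host_args(["machine1", "machine2"])
print(f"horovodrun -np {np_procs} -H {host_arg} -p 12345 ...")
# -> horovodrun -np 2 -H machine1:1,machine2:1 -p 12345 ...
```

The `-p 12345` flag must match the port the secondary workers' `sshd` listens on, since horovodrun uses SSH to start remote processes.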
  9. Terminate all VMs on EC2 when finishing experiments.

Command line automation via Boto

Follow the steps below to automate single-machine computation. For distributed-machine computation, see the README in each example's sub-folder.

Prerequisites:

pip3 install boto fabric2 scanf IPython invoke
pip3 install Werkzeug --upgrade

Run single machine computation:

  1. Configuration

Use your customized configurations: replace the default values in <./config/config.ini>.

  2. Start IPython
python3 run_interface.py
  3. Launch VMs on EC2 and wait for initializing
LaunchInstances()
  4. Install required packages on VMs
InstallDeps()
  5. Automatically run single-VM ML computing
RunSingleVMComputing()
  6. Terminate all VMs on EC2 when finishing experiments.
TerminateAll()
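Under the hood, `LaunchInstances()` has to tell EC2 which AMI, instance type, count, and key pair to use, values that come from `config.ini`. A hypothetical sketch of assembling those parameters (the dictionary keys follow the EC2 RunInstances API naming; the config keys and `make_launch_params` helper are our assumptions, not necessarily the repo's):

```python
import configparser

# Hypothetical config.ini contents; the repo's actual keys may differ.
EXAMPLE_CONFIG = """
[EC2]
ami_id = ami-0123456789abcdef0
instance_type = c5.2xlarge
instance_count = 1
key_name = jianwu-key
"""

def make_launch_params(cfg):
    """Map config values onto EC2 RunInstances parameters (boto3 naming)."""
    ec2 = cfg["EC2"]
    count = int(ec2["instance_count"])
    return {
        "ImageId": ec2["ami_id"],
        "InstanceType": ec2["instance_type"],
        "MinCount": count,
        "MaxCount": count,
        "KeyName": ec2["key_name"],
    }

cfg = configparser.ConfigParser()
cfg.read_string(EXAMPLE_CONFIG)
params = make_launch_params(cfg)
print(params["InstanceType"])  # -> c5.2xlarge
```

Setting `instance_count = 2` here is what would turn the launch into the two-VM setup used for distributed computation.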

For a closer look, please refer to our slides or presentation.

Contributors

sammgithub, starlyxxx, jianwuwang, lixy4567, webzerg