- Subscribe to the EKS-optimized AMI with GPU Support from the AWS Marketplace.
- Manage your service limits so you can launch at least 4 EKS-optimized GPU-enabled Amazon EC2 P3 instances.
- Create an AWS service role for an EC2 instance and attach the AWS managed policy for Administrator access to this IAM Role.
- We need a build environment with the AWS CLI and Docker installed. Launch an m5.xlarge Amazon EC2 instance from an AWS Deep Learning AMI (Ubuntu), using an EC2 instance profile containing the IAM Role created in the previous step. All steps described under the Step-by-step section below must be executed on this build environment instance.
While all the concepts described here are quite general, we will make these concepts concrete by focusing on distributed TensorFlow training for TensorPack Mask/Faster-RCNN model.
The high-level outline of steps is as follows:
- Create a GPU-enabled Amazon EKS cluster
- Create a Persistent Volume and Persistent Volume Claim for an Amazon EFS or Amazon FSx file system
- Stage COCO 2017 data for training on the Amazon EFS or FSx file system
- Use Helm charts to manage training jobs in the EKS cluster
This option creates an Amazon EKS cluster with one worker node group. This is the recommended option for walking through this tutorial.
- In the `eks-cluster` directory, execute `./install-kubectl-linux.sh` to install `kubectl` on Linux clients. For non-Linux operating systems, install and configure kubectl for EKS. Install aws-iam-authenticator and make sure the command `aws-iam-authenticator help` works.
- Install Terraform. Terraform configuration files in this repository are consistent with Terraform v0.13.0 syntax.
- In the `eks-cluster/terraform/aws-eks-cluster-and-nodegroup` folder, execute:

  ```
  terraform init
  ```

  The next command requires an Amazon EC2 key pair. If you have not already created an EC2 key pair, create one before executing the command below:

  ```
  terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]' -var="k8s_version=1.19" -var="key_pair=xxx"
  ```
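If you need to create the EC2 key pair first, a minimal sketch using the AWS CLI is shown below. The key pair name `my-eks-key-pair` is a placeholder; use any name, as long as it matches the value you pass in `-var="key_pair=..."`.

```shell
# Placeholder key pair name; must match -var="key_pair=..." in terraform apply
KEY_NAME=my-eks-key-pair

# Create the key pair in the same region used for terraform apply, and save
# the private key material locally with restricted permissions.
aws ec2 create-key-pair --key-name "$KEY_NAME" --region us-west-2 \
  --query 'KeyMaterial' --output text > "$KEY_NAME.pem"
chmod 400 "$KEY_NAME.pem"
```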
This option separates the creation of the EKS cluster from the worker node group. You can create the EKS cluster and later add one or more worker node groups to the cluster.
- In the `eks-cluster/terraform/aws-eks-cluster` folder, execute:

  ```
  terraform init
  terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]' -var="k8s_version=1.19"
  ```

  Customize Terraform variables as appropriate. The Kubernetes version can be specified using `-var="k8s_version=x.xx"`. Save the output of the apply command for the next step below.
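If you no longer have the output of the apply command at hand, it can be re-read from the Terraform state; the exact output names depend on the Terraform configuration in this repository.

```shell
# Run in the eks-cluster/terraform/aws-eks-cluster folder after terraform apply.
# Prints the saved outputs of the cluster stack (cluster security group,
# subnet ids, etc.) needed as inputs for the node group step.
terraform output

# The same outputs in machine-readable form:
terraform output -json
```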
- In the `eks-cluster/terraform/aws-eks-nodegroup` folder, using the output of the previous `terraform apply` as inputs into this step, execute:

  ```
  terraform init
  ```

  The next command requires an Amazon EC2 key pair. If you have not already created an EC2 key pair, create one before executing the command below:

  ```
  terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var="efs_id=fs-xxx" -var="subnet_id=subnet-xxx" -var="key_pair=xxx" -var="cluster_sg=sg-xxx" -var="nodegroup_name=xxx"
  ```

  To create more than one node group in an EKS cluster, copy the `eks-cluster/terraform/aws-eks-nodegroup` folder to a new folder under `eks-cluster/terraform/` and specify a unique value for the `nodegroup_name` variable.
- In the `eks-cluster` directory, execute `./install-kubectl-linux.sh` to install `kubectl` on Linux clients. For other operating systems, install and configure kubectl for EKS.
- Install aws-iam-authenticator and make sure the command `aws-iam-authenticator help` works. In the `eks-cluster` directory, customize `set-cluster.sh` and execute `./update-kubeconfig.sh` to update the kube configuration. Ensure that you have at least version 1.16.73 of the AWS CLI installed. Your system's Python version must be Python 3, or Python 2.7.9 or greater.
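A quick sanity check of the client tooling before proceeding:

```shell
# Verify the tool versions required above
aws --version                 # expect aws-cli/1.16.73 or later
python --version              # expect Python 3, or Python 2.7.9 or greater
kubectl version --client      # confirm kubectl is installed
aws-iam-authenticator help    # confirm the authenticator is on the PATH
```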
- In the `eks-cluster` directory, customize NodeInstanceRole in `aws-auth-cm.yaml` and execute `./apply-aws-auth-cm.sh` to allow worker nodes to join the EKS cluster. Note: if this is not your first EKS node group, you must add the new node instance role Amazon Resource Name (ARN) to `aws-auth-cm.yaml`, while preserving the existing role ARNs in `aws-auth-cm.yaml`.
- In the `eks-cluster` directory, execute `./apply-nvidia-plugin.sh` to create the NVIDIA device plugin daemon set.
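Once the NVIDIA device plugin pods are running, each GPU worker node should advertise an allocatable `nvidia.com/gpu` resource. A quick way to check:

```shell
# List nodes with their allocatable GPU count; GPU nodes should show a
# non-empty value once the device plugin daemon set is running.
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```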
We have two shared file-system options for staging data for distributed training: Amazon EFS and Amazon FSx for Lustre. Below, you only need to create a Persistent Volume and Persistent Volume Claim for EFS, or FSx, not both.
- Execute `kubectl create namespace kubeflow` to create the kubeflow namespace.
- In the `eks-cluster` directory, customize `pv-kubeflow-efs-gp-bursting.yaml` for the EFS file-system id and AWS region, and execute `kubectl apply -n kubeflow -f pv-kubeflow-efs-gp-bursting.yaml`.
- Check that the persistent-volume was successfully created by executing `kubectl get pv -n kubeflow`.
- Execute `kubectl apply -n kubeflow -f pvc-kubeflow-efs-gp-bursting.yaml` to create an EKS persistent-volume-claim.
- Check that the persistent-volume was successfully bound to the persistent-volume-claim by executing `kubectl get pv -n kubeflow`.
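To confirm the binding from both sides, both the persistent volume and the claim should report a `Bound` status:

```shell
# The STATUS column should show "Bound" for the volume and the claim
kubectl get pv -n kubeflow
kubectl get pvc -n kubeflow
```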
- Install the K8s Container Storage Interface (CSI) driver for the Amazon FSx for Lustre file system:

  ```
  kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
  ```
- Execute `kubectl create namespace kubeflow` to create the kubeflow namespace.
- In the `eks-cluster` directory, customize `pv-kubeflow-fsx.yaml` for the FSx file-system id and AWS region, and execute `kubectl apply -n kubeflow -f pv-kubeflow-fsx.yaml`.
- Check that the persistent-volume was successfully created by executing `kubectl get pv -n kubeflow`.
- Execute `kubectl apply -n kubeflow -f pvc-kubeflow-fsx.yaml` to create an EKS persistent-volume-claim.
- Check that the persistent-volume was successfully bound to the persistent-volume-claim by executing `kubectl get pv -n kubeflow`.
Below, we build and push the Docker images for the TensorPack Mask-RCNN model. Note the ECR URI output from executing the scripts: you will need it in the steps below.

For the training image, execute `./container/build_tools/build_and_push.sh <aws-region>`.

For the testing image, execute `./container-viz/build_tools/build_and_push.sh <aws-region>`.

Below, we build and push the Docker images for the AWS Mask-RCNN model. Note the ECR URI output from executing the scripts: you will need it in the steps below.

For the training image, execute `./container-optimized/build_tools/build_and_push.sh <aws-region>`.

For the testing image, execute `./container-optimized-viz/build_tools/build_and_push.sh <aws-region>`.
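If you lose track of the ECR URIs printed by the build scripts, they can be recovered from ECR directly; the repository names are whatever the `build_and_push.sh` scripts created, so match them against this listing.

```shell
# List all repository URIs in the account/region used by build_and_push.sh
aws ecr describe-repositories --region us-west-2 \
  --query 'repositories[].repositoryUri' --output table
```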
To download the COCO 2017 dataset to your build environment instance and upload it to your Amazon S3 bucket, customize the `eks-cluster/prepare-s3-bucket.sh` script to specify your S3 bucket in the `S3_BUCKET` variable, and execute `eks-cluster/prepare-s3-bucket.sh`.
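Once the script completes, you can sanity-check the upload; `my-coco-bucket` below is a placeholder for the bucket you set in `S3_BUCKET`.

```shell
# Summarize the uploaded objects; the final lines report total object
# count and total size for the bucket.
aws s3 ls s3://my-coco-bucket --recursive --summarize | tail -2
```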
Next, we stage the data on the EFS or FSx file-system. Use either EFS or FSx below, not both.

To stage data on EFS or FSx, set `image` in `eks-cluster/stage-data.yaml` to the ECR URI you noted above, customize the `S3_BUCKET` variable, and execute `kubectl apply -f stage-data.yaml -n kubeflow` to stage data on the selected persistent volume claim for EFS (default), or FSx. Customize the persistent volume claim in `eks-cluster/stage-data.yaml` to use FSx.
Execute `kubectl get pods -n kubeflow` to check the status of the `stage-data` Pod. Once the status of the `stage-data` Pod is marked `Completed`, execute the following commands to verify that the data has been staged correctly:

```
kubectl apply -f attach-pvc.yaml -n kubeflow
kubectl exec attach-pvc -it -n kubeflow -- /bin/bash
```

You will be attached to the EFS or FSx file-system persistent volume. Type `exit` once you have verified the data.
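Inside the attached shell, a quick check might look like the sketch below. The `/efs` mount path is an assumption; use the volume mount path defined in `attach-pvc.yaml`.

```shell
# Inside the attach-pvc pod shell. /efs is a placeholder mount path --
# substitute the mount path from attach-pvc.yaml.
ls -l /efs            # the staged COCO 2017 data should be visible here
du -sh /efs/*         # rough sanity check on the staged data size
exit                  # detach from the pod once verified
```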
Helm is a package manager for Kubernetes. It uses a package format called charts. A Helm chart is a collection of files that define Kubernetes resources. Install Helm version 3.x or later according to the instructions here.
- In the `charts` folder, deploy the Kubeflow MPIJob CustomResourceDefinition using the mpijob chart:

  ```
  helm install --debug mpijob ./mpijob/ # (Helm version 3.x)
  ```
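To confirm the CustomResourceDefinition was registered, you can query it directly; `mpijobs.kubeflow.org` is the standard Kubeflow MPIJob CRD name, assumed here to be what the mpijob chart installs.

```shell
# Should list the MPIJob CRD once the mpijob chart is installed
kubectl get crd mpijobs.kubeflow.org
```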
- You have three options for training the Mask-RCNN model:

  a) To train the TensorPack Mask-RCNN model, customize `values.yaml` in the `charts/maskrcnn` directory. At a minimum, set `image` to the TensorPack Mask-RCNN training image ECR URI you built and pushed in a previous step. Set `shared_fs` and `data_fs` to `efs`, or `fsx`, as applicable. Set `shared_pvc` to the name of the k8s persistent volume claim you created in the relevant k8s namespace. To test the trained model using a Jupyter Lab notebook, customize `values.yaml` in the `charts/maskrcnn/charts/jupyter` directory. At a minimum, set `image` to the TensorPack Mask-RCNN testing image ECR URI you built and pushed in a previous step.

  b) To train the AWS Mask-RCNN optimized model, customize `values.yaml` in the `charts/maskrcnn-optimized` directory. At a minimum, set `image` to the AWS Mask-RCNN training image ECR URI you built and pushed in a previous step. Set `shared_fs` and `data_fs` to `efs`, or `fsx`, as applicable. Set `shared_pvc` to the name of the k8s persistent volume claim you created in the relevant k8s namespace. To test the trained model using a Jupyter Lab notebook, customize `values.yaml` in the `charts/maskrcnn-optimized/charts/jupyter` directory. At a minimum, set `image` to the AWS Mask-RCNN testing image ECR URI you built and pushed in a previous step.

  c) To create a brand new Helm chart for defining a new MPIJob, copy the `maskrcnn` folder to a new folder under `charts`. Update the chart name in `Chart.yaml`. Update the `namespace` global variable in `values.yaml` to specify a new K8s namespace.
- In the `charts` folder, install the selected Helm chart, for example:

  ```
  helm install --debug maskrcnn ./maskrcnn/ # (Helm version 3.x)
  ```
- Execute `kubectl get pods -n kubeflow` to see the status of the pods.
- Execute `kubectl logs -f maskrcnn-launcher-xxxxx -n kubeflow` to see the live training log from the launcher (change xxxxx to your specific pod name).
- Model checkpoints and logs will be placed on the `shared_fs` file-system set in `values.yaml`, i.e. `efs` or `fsx`.
Execute `kubectl get services -n kubeflow` to get the TensorBoard service DNS address. Access the TensorBoard service in a browser on port 80 to visualize the TensorBoard summaries.
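A sketch of extracting the load balancer hostname directly; the service name `maskrcnn-tensorboard` is an assumption, so use the actual name shown by `kubectl get services`.

```shell
# List the services, then extract the external hostname of the TensorBoard
# service (assumed name: maskrcnn-tensorboard) from its load balancer status.
kubectl get services -n kubeflow
kubectl get service maskrcnn-tensorboard -n kubeflow \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```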
After model training is complete, and the `kubectl get pods -n kubeflow` command output shows that the `jupyter` pod is `Running`, execute `kubectl logs -f jupyter-xxxxx -n kubeflow` to display the Jupyter pod log (change xxxxx to your specific pod name). If you have just enough GPUs for training, the Jupyter pod will remain `Pending` until training is complete, because it needs 1 GPU to run.

At the beginning of the Jupyter pod log, note the security token required to access the Jupyter service in a browser.

Execute `kubectl get services -n kubeflow` to get the Jupyter service DNS address. To test the trained model using a Jupyter Lab notebook, access the Jupyter service in a browser on port 443 using the security token provided in the pod log. Your URL to access the Jupyter service combines the service DNS address with the security token.

Accessing the Jupyter service in a browser will display a browser warning, because the service endpoint uses a self-signed certificate. Ignore the warning and proceed to access the service. Open the notebook under the `notebook` folder, and run it to test the trained model.
When training is complete, you may delete an installed chart by executing `helm delete <chart-name>`, for example `helm delete maskrcnn`. This will destroy all pods used in training and testing, including the TensorBoard and Jupyter service pods. However, the logs and trained model will be preserved on the shared file system used for training.
When you are done with distributed training, you can destroy the EKS cluster and worker node group.

If you used the quick start option above to create the EKS cluster and worker node group, then in the `eks-cluster/terraform/aws-eks-cluster-and-nodegroup` folder, execute `terraform destroy` with the same arguments you used with `terraform apply` above.
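For the quick start option, that destroy command might look like the sketch below, assuming the same variable values shown in the earlier `terraform apply` (the `key_pair` value is a placeholder).

```shell
# Destroy the quick-start cluster and node group; pass the same -var values
# you used with terraform apply.
cd eks-cluster/terraform/aws-eks-cluster-and-nodegroup
terraform destroy -var="profile=default" -var="region=us-west-2" \
  -var="cluster_name=my-eks-cluster" \
  -var='azs=["us-west-2a","us-west-2b","us-west-2c"]' \
  -var="k8s_version=1.19" -var="key_pair=xxx"
```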
If you used the advanced option, then in the `eks-cluster/terraform/aws-eks-nodegroup` folder, execute `terraform destroy` with the same arguments you used with `terraform apply` above to destroy the worker node group, and then similarly execute `terraform destroy` in `eks-cluster/terraform/aws-eks-cluster` to destroy the EKS cluster.

This step will not destroy the shared EFS or FSx file-system used in training.