- Subscribe to the EKS-optimized AMI with GPU Support from the AWS Marketplace.
- Manage your service limits so you can launch at least 4 EKS-optimized GPU-enabled Amazon EC2 P3 instances.
- Create an AWS service role for an EC2 instance and attach the AWS managed policy for Administrator access to this IAM Role.
- We need a build environment with the AWS CLI and Docker installed. Launch an m5.xlarge Amazon EC2 instance from an AWS Deep Learning AMI (Ubuntu), using an EC2 instance profile containing the IAM Role created in the previous step. All steps described under the Step-by-step section below must be executed on this build environment instance.
While all the concepts described here are quite general, we will make these concepts concrete by focusing on distributed TensorFlow training for TensorPack Mask/Faster-RCNN model.
The high-level outline of steps is as follows:
- Create a GPU-enabled Amazon EKS cluster
- Create a Persistent Volume and Persistent Volume Claim for an Amazon EFS or Amazon FSx file system
- Stage COCO 2017 data for training on the Amazon EFS or FSx file system
- Use Helm charts to manage training jobs in the EKS cluster
This option creates an Amazon EKS cluster with one worker node group. This is the recommended option for walking through this tutorial.
- In the `eks-cluster` directory, execute `./install-kubectl-linux.sh` to install `kubectl` on Linux clients. For non-Linux operating systems, install and configure kubectl for EKS. Install aws-iam-authenticator and make sure the command `aws-iam-authenticator help` works.
- Install Terraform. Terraform configuration files in this repository are consistent with Terraform v0.13.0 syntax.
- In the `eks-cluster/terraform/aws-eks-cluster-and-nodegroup` folder, execute:

  ```
  terraform init
  ```

  The next command requires an Amazon EC2 key pair. If you have not already created an EC2 key pair, create one before executing the command below:

  ```
  terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]' -var="k8s_version=1.19" -var="key_pair=xxx"
  ```
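If you need to create the EC2 key pair first, a minimal sketch using the AWS CLI is shown below. The key pair name `my-eks-key-pair` is a placeholder; use any name, as long as it matches the value you pass in `-var="key_pair=..."`.

```shell
# Placeholder key pair name; must match -var="key_pair=..." in terraform apply
KEY_NAME=my-eks-key-pair

# Create the key pair in the same region used for terraform apply, and save
# the private key material locally with restricted permissions.
aws ec2 create-key-pair --key-name "$KEY_NAME" --region us-west-2 \
  --query 'KeyMaterial' --output text > "$KEY_NAME.pem"
chmod 400 "$KEY_NAME.pem"
```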
This option separates the creation of the EKS cluster from the worker node group. You can create the EKS cluster and later add one or more worker node groups to the cluster.
- In the `eks-cluster/terraform/aws-eks-cluster` folder, execute:

  ```
  terraform init
  terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]' -var="k8s_version=1.19"
  ```

  Customize Terraform variables as appropriate. The Kubernetes version can be specified using `-var="k8s_version=x.xx"`. Save the output of the apply command for the next step below.
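If you no longer have the output of the apply command at hand, it can be re-read from the Terraform state; the exact output names depend on the Terraform configuration in this repository.

```shell
# Run in the eks-cluster/terraform/aws-eks-cluster folder after terraform apply.
# Prints the saved outputs of the cluster stack (cluster security group,
# subnet ids, etc.) needed as inputs for the node group step.
terraform output

# The same outputs in machine-readable form:
terraform output -json
```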
- In the `eks-cluster/terraform/aws-eks-nodegroup` folder, using the output of the previous `terraform apply` as inputs into this step, execute:

  ```
  terraform init
  ```

  The next command requires an Amazon EC2 key pair. If you have not already created an EC2 key pair, create one before executing the command below:

  ```
  terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var="efs_id=fs-xxx" -var="subnet_id=subnet-xxx" -var="key_pair=xxx" -var="cluster_sg=sg-xxx" -var="nodegroup_name=xxx"
  ```

  To create more than one node group in an EKS cluster, copy the `eks-cluster/terraform/aws-eks-nodegroup` folder to a new folder under `eks-cluster/terraform/` and specify a unique value for the `nodegroup_name` variable.
- In the `eks-cluster` directory, execute `./install-kubectl-linux.sh` to install `kubectl` on Linux clients. For other operating systems, install and configure kubectl for EKS.
- Install aws-iam-authenticator and make sure the command `aws-iam-authenticator help` works. In the `eks-cluster` directory, customize `set-cluster.sh` and execute `./update-kubeconfig.sh` to update the kube configuration. Ensure that you have at least version 1.16.73 of the AWS CLI installed. Your system's Python version must be Python 3, or Python 2.7.9 or greater.
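A quick sanity check of the client tooling before proceeding:

```shell
# Verify the tool versions required above
aws --version                 # expect aws-cli/1.16.73 or later
python --version              # expect Python 3, or Python 2.7.9 or greater
kubectl version --client      # confirm kubectl is installed
aws-iam-authenticator help    # confirm the authenticator is on the PATH
```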
- In the `eks-cluster` directory, customize NodeInstanceRole in `aws-auth-cm.yaml` and execute `./apply-aws-auth-cm.sh` to allow worker nodes to join the EKS cluster. Note: if this is not your first EKS node group, you must add the new node instance role Amazon Resource Name (ARN) to `aws-auth-cm.yaml`, while preserving the existing role ARNs in `aws-auth-cm.yaml`.
- In the `eks-cluster` directory, execute `./apply-nvidia-plugin.sh` to create the NVIDIA device plugin daemon set.
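Once the NVIDIA device plugin pods are running, each GPU worker node should advertise an allocatable `nvidia.com/gpu` resource. A quick way to check:

```shell
# List nodes with their allocatable GPU count; GPU nodes should show a
# non-empty value once the device plugin daemon set is running.
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```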
We have two shared file-system options for staging data for distributed training: Amazon EFS and Amazon FSx for Lustre. Below, you only need to create a Persistent Volume and Persistent Volume Claim for EFS, or FSx, not both.
- Execute `kubectl create namespace kubeflow` to create the kubeflow namespace.
- In the `eks-cluster` directory, customize `pv-kubeflow-efs-gp-bursting.yaml` for the EFS file-system id and AWS region, and execute `kubectl apply -n kubeflow -f pv-kubeflow-efs-gp-bursting.yaml`.
- Check that the persistent-volume was successfully created by executing `kubectl get pv -n kubeflow`.
- Execute `kubectl apply -n kubeflow -f pvc-kubeflow-efs-gp-bursting.yaml` to create an EKS persistent-volume-claim.
- Check that the persistent-volume was successfully bound to the persistent-volume-claim by executing `kubectl get pv -n kubeflow`.
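To confirm the binding from both sides, both the persistent volume and the claim should report a `Bound` status:

```shell
# The STATUS column should show "Bound" for the volume and the claim
kubectl get pv -n kubeflow
kubectl get pvc -n kubeflow
```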
- Install the K8s Container Storage Interface (CSI) driver for the Amazon FSx for Lustre file system:

  ```
  kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
  ```
- Execute `kubectl create namespace kubeflow` to create the kubeflow namespace.
- In the `eks-cluster` directory, customize `pv-kubeflow-fsx.yaml` for the FSx file-system id and AWS region, and execute `kubectl apply -n kubeflow -f pv-kubeflow-fsx.yaml`.
- Check that the persistent-volume was successfully created by executing `kubectl get pv -n kubeflow`.
- Execute `kubectl apply -n kubeflow -f pvc-kubeflow-fsx.yaml` to create an EKS persistent-volume-claim.
- Check that the persistent-volume was successfully bound to the persistent-volume-claim by executing `kubectl get pv -n kubeflow`.
Below, we build and push the Docker images for the TensorPack Mask-RCNN model. Note the ECR URI output from executing the scripts: you will need it in the steps below.

For the training image, execute `./container/build_tools/build_and_push.sh <aws-region>`.

For the testing image, execute `./container-viz/build_tools/build_and_push.sh <aws-region>`.

Below, we build and push the Docker images for the AWS Mask-RCNN model. Note the ECR URI output from executing the scripts: you will need it in the steps below.

For the training image, execute `./container-optimized/build_tools/build_and_push.sh <aws-region>`.

For the testing image, execute `./container-optimized-viz/build_tools/build_and_push.sh <aws-region>`.
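If you lose track of the ECR URIs printed by the build scripts, they can be recovered from ECR directly; the repository names are whatever the `build_and_push.sh` scripts created, so match them against this listing.

```shell
# List all repository URIs in the account/region used by build_and_push.sh
aws ecr describe-repositories --region us-west-2 \
  --query 'repositories[].repositoryUri' --output table
```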
To download the COCO 2017 dataset to your build environment instance and upload it to your Amazon S3 bucket, customize the `eks-cluster/prepare-s3-bucket.sh` script to specify your S3 bucket in the `S3_BUCKET` variable, and execute `eks-cluster/prepare-s3-bucket.sh`.
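Once the script completes, you can sanity-check the upload; `my-coco-bucket` below is a placeholder for the bucket you set in `S3_BUCKET`.

```shell
# Summarize the uploaded objects; the final lines report total object
# count and total size for the bucket.
aws s3 ls s3://my-coco-bucket --recursive --summarize | tail -2
```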
Next, we stage the data on the EFS or FSx file-system. Use either EFS or FSx below, not both.

To stage data on EFS or FSx, set `image` in `eks-cluster/stage-data.yaml` to the ECR URI you noted above, customize the `S3_BUCKET` variable, and execute `kubectl apply -f stage-data.yaml -n kubeflow` to stage data on the selected persistent volume claim for EFS (default), or FSx. Customize the persistent volume claim in `eks-cluster/stage-data.yaml` to use FSx.
Execute `kubectl get pods -n kubeflow` to check the status of the `stage-data` Pod. Once the status of the `stage-data` Pod is marked `Completed`, execute the following commands to verify that the data has been staged correctly:

```
kubectl apply -f attach-pvc.yaml -n kubeflow
kubectl exec attach-pvc -it -n kubeflow -- /bin/bash
```

You will be attached to the EFS or FSx file-system persistent volume. Type `exit` once you have verified the data.
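Inside the attached shell, a quick check might look like the sketch below. The `/efs` mount path is an assumption; use the volume mount path defined in `attach-pvc.yaml`.

```shell
# Inside the attach-pvc pod shell. /efs is a placeholder mount path --
# substitute the mount path from attach-pvc.yaml.
ls -l /efs            # the staged COCO 2017 data should be visible here
du -sh /efs/*         # rough sanity check on the staged data size
exit                  # detach from the pod once verified
```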
Helm is a package manager for Kubernetes. It uses a package format called charts. A Helm chart is a collection of files that define Kubernetes resources. Install Helm version 3.x or later according to the instructions here.
- In the `charts` folder, deploy the Kubeflow MPIJob CustomResourceDefinition using the mpijob chart:

  ```
  helm install --debug mpijob ./mpijob/ # (Helm version 3.x)
  ```
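To confirm the CustomResourceDefinition was registered, you can query it directly; `mpijobs.kubeflow.org` is the standard Kubeflow MPIJob CRD name, assumed here to be what the mpijob chart installs.

```shell
# Should list the MPIJob CRD once the mpijob chart is installed
kubectl get crd mpijobs.kubeflow.org
```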
- You have three options for training the Mask-RCNN model:

  a) To train the TensorPack Mask-RCNN model, customize `values.yaml` in the `charts/maskrcnn` directory. At a minimum, set `image` to the TensorPack Mask-RCNN training image ECR URI you built and pushed in a previous step. Set `shared_fs` and `data_fs` to `efs`, or `fsx`, as applicable. Set `shared_pvc` to the name of the k8s persistent volume claim you created in the relevant k8s namespace. To test the trained model using a Jupyter Lab notebook, customize `values.yaml` in the `charts/maskrcnn/charts/jupyter` directory. At a minimum, set `image` to the TensorPack Mask-RCNN testing image ECR URI you built and pushed in a previous step.

  b) To train the AWS Mask-RCNN optimized model, customize `values.yaml` in the `charts/maskrcnn-optimized` directory. At a minimum, set `image` to the AWS Mask-RCNN training image ECR URI you built and pushed in a previous step. Set `shared_fs` and `data_fs` to `efs`, or `fsx`, as applicable. Set `shared_pvc` to the name of the k8s persistent volume claim you created in the relevant k8s namespace. To test the trained model using a Jupyter Lab notebook, customize `values.yaml` in the `charts/maskrcnn-optimized/charts/jupyter` directory. At a minimum, set `image` to the AWS Mask-RCNN testing image ECR URI you built and pushed in a previous step.

  c) To create a brand new Helm chart for defining a new MPIJob, copy the `maskrcnn` folder to a new folder under `charts`. Update the chart name in `Chart.yaml`. Update the `namespace` global variable in `values.yaml` to specify a new K8s namespace.
- In the `charts` folder, install the selected Helm chart, for example:

  ```
  helm install --debug maskrcnn ./maskrcnn/ # (Helm version 3.x)
  ```
- Execute `kubectl get pods -n kubeflow` to see the status of the pods.
- Execute `kubectl logs -f maskrcnn-launcher-xxxxx -n kubeflow` to see the live training log from the launcher (change xxxxx to your specific pod name).
- Model checkpoints and logs will be placed on the `shared_fs` file-system set in `values.yaml`, i.e. `efs` or `fsx`.
Execute `kubectl get services -n kubeflow` to get the TensorBoard service DNS address. Access the TensorBoard service in a browser on port 80 to visualize the TensorBoard summaries.
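A sketch of extracting the load balancer hostname directly; the service name `maskrcnn-tensorboard` is an assumption, so use the actual name shown by `kubectl get services`.

```shell
# List the services, then extract the external hostname of the TensorBoard
# service (assumed name: maskrcnn-tensorboard) from its load balancer status.
kubectl get services -n kubeflow
kubectl get service maskrcnn-tensorboard -n kubeflow \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```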
After model training is complete, and the `kubectl get pods -n kubeflow` command output shows that the `jupyter` pod is `Running`, execute `kubectl logs -f jupyter-xxxxx -n kubeflow` to display the Jupyter pod log (change xxxxx to your specific pod name). If you have just enough GPUs for training, the Jupyter pod will remain `Pending` until training is complete, because it needs 1 GPU to run.

At the beginning of the Jupyter pod log, note the security token required to access the Jupyter service in a browser.

Execute `kubectl get services -n kubeflow` to get the Jupyter service DNS address. To test the trained model using a Jupyter Lab notebook, access the Jupyter service in a browser on port 443 using the security token provided in the pod log. Your URL to access the Jupyter service combines the service DNS address with the security token.

Accessing the Jupyter service in a browser will display a browser warning, because the service endpoint uses a self-signed certificate. Ignore the warning and proceed to access the service. Open the notebook under the `notebook` folder, and run it to test the trained model.
When training is complete, you may delete an installed chart by executing `helm delete <chart-name>`, for example `helm delete maskrcnn`. This will destroy all pods used in training and testing, including the TensorBoard and Jupyter service pods. However, the logs and trained model will be preserved on the shared file system used for training.
When you are done with distributed training, you can destroy the EKS cluster and worker node group.

If you used the quick start option above to create the EKS cluster and worker node group, then in the `eks-cluster/terraform/aws-eks-cluster-and-nodegroup` folder, execute `terraform destroy` with the same arguments you used with `terraform apply` above.
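For the quick start option, that destroy command might look like the sketch below, assuming the same variable values shown in the earlier `terraform apply` (the `key_pair` value is a placeholder).

```shell
# Destroy the quick-start cluster and node group; pass the same -var values
# you used with terraform apply.
cd eks-cluster/terraform/aws-eks-cluster-and-nodegroup
terraform destroy -var="profile=default" -var="region=us-west-2" \
  -var="cluster_name=my-eks-cluster" \
  -var='azs=["us-west-2a","us-west-2b","us-west-2c"]' \
  -var="k8s_version=1.19" -var="key_pair=xxx"
```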
If you used the advanced option, then in the `eks-cluster/terraform/aws-eks-nodegroup` folder, execute `terraform destroy` with the same arguments you used with `terraform apply` above to destroy the worker node group, and then similarly execute `terraform destroy` in `eks-cluster/terraform/aws-eks-cluster` to destroy the EKS cluster.

This step will not destroy the shared EFS or FSx file-system used in training.