tom5610 / aws-efa-nccl-baseami-pipeline

This project forked from aws-samples/aws-efa-nccl-baseami-pipeline

EFA/NCCL base AMI build Packer and CodeBuild/Pipeline files. Also base Docker build files to enable EFA/NCCL in containers

License: MIT No Attribution

Shell 24.22% Python 67.51% JavaScript 8.27%

AWS EFA and NCCL Base AMI/Docker Build Pipeline

The EFA/NCCL base AMI helps you get started quickly with distributed training workloads on AWS using our EFA-enabled instances (p3dn, g4dn, and p4d). Included are sample buildspecs that you can integrate with CodeBuild/CodePipeline for automatic builds. The scripts can be used as examples for both Amazon Linux 2 (AL2) and Ubuntu 18.04. The Docker build file is an example implementation of the requirements for setting up EFA/NCCL in a container context for ECS/Batch/EKS. The following stack is installed:

  • NVIDIA Driver 470.xx
  • CUDA 11.4
  • NVIDIA Fabric Manager 470.xx (version locked to the nvidia driver)
  • cuDNN 8
  • NCCL 2.10.3
  • EFA latest driver
  • AWS-OFI-NCCL
  • FSx kernel and client driver and utilities
  • Intel OneDNN
  • NVIDIA runtime Docker

Packer Instructions

In the nvidia-efa-ami_base directory you will find Packer scripts for Amazon Linux 2 and Ubuntu 18.04. Generally, you only need to modify the "variables": {} JSON block and run the Packer build:

"variables": {
    "region": "us-east-1",
    "flag": "<flag>",
    "subnet_id": "<subnetid>",
    "security_groupids": "<security_group_id>,<security_group_id>",
    "build_ami": "<buildami>",
    "efa_pkg": "aws-efa-installer-latest.tar.gz",
    "intel_mkl_version": "intel-mkl-2020.0-088",
    "nvidia_version": "cuda-drivers-fabricmanager-495",
    "cuda_version": "cuda-toolkit-11-5 nvidia-gds-11-5",
    "cudnn_version": "libcudnn8",
    "nccl_version": "v2.11.4-1"
  },

After filling in the variables, validate the Packer script and then run the build:

packer validate nvidia-efa-ml-al2.yml
packer build nvidia-efa-ml-al2.yml
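
Before running the commands above, it can help to confirm that no placeholder values remain in the template. The following is a minimal Python sketch, not part of this repo: it assumes the template is a JSON document with a top-level "variables" object as shown earlier, and the helper names `unfilled_variables` and `packer_command` are hypothetical. The second helper builds a Packer command line that overrides variables with the standard `-var key=value` option instead of editing the file.

```python
import json
from typing import Dict, List


def unfilled_variables(template: Dict) -> List[str]:
    """Return names of variables whose value still starts with '<' (a stub)."""
    variables = template.get("variables", {})
    return [
        name
        for name, value in variables.items()
        if isinstance(value, str) and value.strip().startswith("<")
    ]


def packer_command(action: str, template_path: str, overrides: Dict[str, str]) -> List[str]:
    """Build an argv list such as: packer build -var key=value <template>."""
    argv = ["packer", action]
    for name, value in overrides.items():
        argv += ["-var", f"{name}={value}"]
    argv.append(template_path)
    return argv


if __name__ == "__main__":
    # Illustrative template fragment; a real run would json.load() the file.
    sample = json.loads(
        '{"variables": {"region": "us-east-1", "subnet_id": "<subnetid>"}}'
    )
    print("Still unfilled:", unfilled_variables(sample))
    print(packer_command("validate", "nvidia-efa-ml-al2.yml", {"region": "us-east-1"}))
```

The argv list can be passed to `subprocess.run` once the unfilled list is empty.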

Accelerator Metrics/Error Handling in CloudWatch

This repo also includes a custom accelerator metrics and error handling component that pushes key metrics into CloudWatch. This is particularly useful when you have an abstracted view of the underlying accelerator and are unable to monitor metrics directly. For NVIDIA GPUs the following metrics are captured:

  • Accelerator kernel utilization
  • Memory utilization
  • Memory free
  • Memory used
  • SM clocks
  • Memory clocks
  • Total uncorrectable ECC errors
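
One way to collect the metrics listed above is to query `nvidia-smi` and parse its CSV output. The sketch below is an assumption about how such a collector could look, not the code shipped in this repo (which may use a different mechanism). The metric names in `QUERY_FIELDS` are hypothetical labels; the query field names are standard `nvidia-smi --query-gpu` properties.

```python
import subprocess
from typing import Dict, List, Optional

# nvidia-smi query fields mapped to hypothetical metric names for this sketch.
QUERY_FIELDS = {
    "utilization.gpu": "AcceleratorKernelUtilization",
    "utilization.memory": "MemoryUtilization",
    "memory.free": "MemoryFree",
    "memory.used": "MemoryUsed",
    "clocks.sm": "SMClock",
    "clocks.mem": "MemoryClock",
    "ecc.errors.uncorrected.aggregate.total": "TotalUncorrectableECCErrors",
}


def parse_gpu_line(line: str) -> Dict[str, Optional[float]]:
    """Parse one CSV line (noheader, nounits) into a metric-name -> value dict."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"expected {len(QUERY_FIELDS)} fields, got {len(values)}")
    result: Dict[str, Optional[float]] = {}
    for name, raw in zip(QUERY_FIELDS.values(), values):
        try:
            result[name] = float(raw)
        except ValueError:
            result[name] = None  # e.g. "[N/A]" when a GPU does not report ECC
    return result


def query_gpus() -> List[Dict[str, Optional[float]]]:
    """Run nvidia-smi once and return one metrics dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

`query_gpus()` requires an instance with the NVIDIA driver installed; `parse_gpu_line` can be exercised on its own with a captured output line.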

The metric code is included in all AMIs built from this repo, but you can also use it directly in your own AMIs. If you are interested, you can extend this code to use your own metrics monitor, as long as it follows this JSON schema:

{
  "Id": 1,
  "AcceleratorName": "NVIDIA A100-SXM4-40GB",
  "AcceleratorDriver": "470.42.01",
  "Metrics": {
    "<Metric_Name>": <Metric_Value>,
    "<Metric_Name>": <Metric_Value>
  }
}
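
To illustrate how a payload in this schema might be turned into CloudWatch datapoints, here is a minimal Python sketch. It is an assumption, not this repo's implementation: the function name `to_metric_data`, the dimension names, and the sample metric names are hypothetical. The output follows the shape of the `MetricData` argument accepted by boto3's CloudWatch `put_metric_data` call, which is not invoked here so that the example stays self-contained.

```python
from typing import Any, Dict, List


def to_metric_data(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Convert one schema payload into a list of CloudWatch MetricData entries."""
    # Dimension names are illustrative; pick whatever suits your dashboards.
    dimensions = [
        {"Name": "AcceleratorId", "Value": str(payload["Id"])},
        {"Name": "AcceleratorName", "Value": payload["AcceleratorName"]},
        {"Name": "AcceleratorDriver", "Value": payload["AcceleratorDriver"]},
    ]
    return [
        {"MetricName": name, "Dimensions": dimensions, "Value": float(value)}
        for name, value in payload["Metrics"].items()
    ]


if __name__ == "__main__":
    # Sample payload; metric names and values are made up for illustration.
    payload = {
        "Id": 1,
        "AcceleratorName": "NVIDIA A100-SXM4-40GB",
        "AcceleratorDriver": "470.42.01",
        "Metrics": {"MemoryUsed": 1536, "MemoryFree": 39000},
    }
    for entry in to_metric_data(payload):
        print(entry["MetricName"], entry["Value"])
    # With boto3 installed, the list could be sent with something like:
    # boto3.client("cloudwatch").put_metric_data(
    #     Namespace="<your-namespace>", MetricData=to_metric_data(payload))
```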

Furthermore, we have added error handling specifically for NVIDIA GPUs in CloudWatch Logs. A log stream is created that lifts NVRM: ... related messages from the instance's syslog and pushes them to CloudWatch.
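
The filtering step can be sketched in a few lines of Python. This is only an illustration of selecting NVRM messages from syslog lines, under the assumption that each relevant line contains the literal prefix "NVRM: "; the function name `extract_nvrm_messages` and the sample log lines are hypothetical, and the repo's actual log shipping may work differently.

```python
import re
from typing import Iterable, List

# Captures everything after the "NVRM: " prefix on a syslog line.
NVRM_PATTERN = re.compile(r"NVRM: (?P<message>.*)$")


def extract_nvrm_messages(lines: Iterable[str]) -> List[str]:
    """Return the message part of every line that carries an NVRM entry."""
    messages = []
    for line in lines:
        match = NVRM_PATTERN.search(line)
        if match:
            messages.append(match.group("message").strip())
    return messages


if __name__ == "__main__":
    # Made-up syslog lines for illustration only.
    sample = [
        "Jan  1 00:00:01 host kernel: NVRM: Xid (PCI:0000:10:1c): 79, GPU has fallen off the bus.",
        "Jan  1 00:00:02 host sshd[123]: Accepted publickey for ec2-user",
    ]
    for message in extract_nvrm_messages(sample):
        print(message)
```

Each returned message could then be forwarded as a log event to the CloudWatch Logs stream.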

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Contributors

amazon-auto, amrragab8080
