lifubang / pai

This project forked from microsoft/pai

Resource scheduling and cluster management for AI

License: MIT License

Batchfile 0.30% Shell 10.02% Java 52.79% JavaScript 19.59% Python 15.17% HTML 1.87% CSS 0.25%

pai's Introduction

Open Platform for AI (PAI)

Introduction

Platform for AI (PAI) is a platform for cluster management and resource scheduling. It incorporates a mature design with a proven track record in Microsoft's large-scale production environment.

PAI supports AI jobs (e.g., deep learning jobs) running in a GPU cluster. The platform provides a PAI runtime environment, with which existing deep learning frameworks, e.g., CNTK and TensorFlow, can be onboarded to PAI without any code changes. The runtime environment is also extensible: new workloads can use it to run on PAI with just a few extra lines of script and/or Python code.

PAI supports GPU scheduling, a key requirement of deep learning jobs. For better performance, PAI supports fine-grained, topology-aware job placement, so a job can request GPUs at a specific location (e.g., under the same PCI-E switch).
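To make the idea of topology-aware placement concrete, here is a minimal sketch in Python. It is not PAI's scheduler: the `Topology` map, the `place_job` function, and the best-fit policy are all assumptions made for illustration. The sketch prefers to keep a job's GPUs under one PCI-E switch and spans switches only when no single switch has enough free GPUs.

```python
from typing import Dict, List, Optional

# Hypothetical topology map: PCI-E switch id -> ids of the free GPUs under it.
# The names and layout are illustrative, not PAI's internal data structures.
Topology = Dict[str, List[int]]


def place_job(free_gpus: Topology, gpus_needed: int) -> Optional[List[int]]:
    """Pick GPUs for a job, preferring GPUs under a single PCI-E switch.

    Returns the chosen GPU ids, or None if the request cannot be satisfied.
    """
    if gpus_needed <= 0:
        raise ValueError("gpus_needed must be positive")

    # Best fit: among switches that can hold the whole job, pick the one
    # with the fewest free GPUs, leaving larger groups for larger jobs.
    candidates = [
        (len(gpus), switch)
        for switch, gpus in free_gpus.items()
        if len(gpus) >= gpus_needed
    ]
    if candidates:
        _, switch = min(candidates)
        return sorted(free_gpus[switch])[:gpus_needed]

    # Fallback: span switches, drawing from the switches with the most
    # free GPUs first so the job touches as few switches as possible.
    chosen: List[int] = []
    for switch in sorted(free_gpus, key=lambda s: (-len(free_gpus[s]), s)):
        for gpu in sorted(free_gpus[switch]):
            chosen.append(gpu)
            if len(chosen) == gpus_needed:
                return chosen
    return None
```

With two free GPUs under one switch and four under another, a two-GPU job lands entirely on the smaller group, while a five-GPU job has to span both switches.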

PAI embraces a microservices architecture: every component runs in a container. The system leverages Kubernetes to deploy and manage static components in the system. The more dynamic deep learning jobs are scheduled and managed by Hadoop YARN with our GPU enhancement. The training data and training results are stored in Hadoop HDFS.

An Open AI Platform for R&D and Education

One key purpose of PAI is to support the highly diversified requirements of academia and industry. PAI is completely open: it is released under the MIT license. PAI is architected in a modular way: different modules can be plugged in as appropriate. This makes PAI particularly attractive for evaluating research ideas, including but not limited to the following components:

  • Scheduling mechanism for deep learning workload
  • Deep neural network application that requires evaluation under realistic platform environment
  • New deep learning framework
  • AutoML
  • Compiler technique for AI
  • High performance networking for AI
  • Profiling tool, including network, platform, and AI job profiling
  • AI Benchmark suite
  • New hardware for AI, including FPGA, ASIC, Neural Processor
  • AI Storage support
  • AI platform management

PAI operates in an open model. It was initially designed and developed by Microsoft Research (MSR) and the Microsoft Search Technology Center (STC) platform team. We are glad to have Peking University, Xi'an Jiaotong University, Zhejiang University, and University of Science and Technology of China join us to develop the platform jointly. Contributions from academia and industry are all highly welcome.

System Deployment

Prerequisite

The system runs in a cluster of machines, each equipped with one or more GPUs. Each machine in the cluster runs Ubuntu 16.04 LTS and has a statically assigned IP address. To deploy services, the system relies on a Docker registry service (e.g., Docker Hub) to store the Docker images for the services to be deployed. The system also requires a dev machine that runs in the same environment and has full access to the cluster. Finally, the system needs an NTP service for clock synchronization.
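The per-machine prerequisites above can be checked before deployment. The following is a minimal sketch, assuming a hypothetical inventory in which each machine is described by a dict with `ip`, `os`, and `gpus` keys. This format is invented for illustration and is not PAI's cluster configuration schema. Only the three checks stated in the prerequisite text are covered.

```python
import ipaddress
from typing import Dict, List

REQUIRED_OS = "Ubuntu 16.04"  # OS release named in the prerequisite text


def check_machine(machine: Dict[str, object]) -> List[str]:
    """Return the prerequisite problems found for one machine.

    `machine` is a hypothetical inventory entry with the keys
    "ip", "os", and "gpus"; an empty list means all checks passed.
    """
    problems: List[str] = []

    # A statically assigned IP must at least be a well-formed address.
    try:
        ipaddress.ip_address(str(machine.get("ip", "")))
    except ValueError:
        problems.append("missing or invalid static IP address")

    # Every machine is expected to run Ubuntu 16.04 LTS.
    if not str(machine.get("os", "")).startswith(REQUIRED_OS):
        problems.append("OS is not " + REQUIRED_OS + " LTS")

    # Every machine is expected to have at least one GPU.
    if int(machine.get("gpus", 0)) < 1:
        problems.append("no GPU reported")

    return problems
```

The Docker registry, dev machine access, and NTP requirements depend on the surrounding environment and are left out of this sketch.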

Deployment process

To deploy and use the system, the process consists of the following steps.

  1. Build the binary for Hadoop AI and place it in the specified path*
  2. Deploy Kubernetes and system services
  3. Access web portal for job submission and cluster management

* If step 1 is skipped, a standard Hadoop 2.7.2 will be installed instead.

Kubernetes deployment

The platform leverages Kubernetes (k8s) to deploy and manage system services. To deploy k8s in the cluster, please refer to k8s deployment readme for details.

Service deployment

After Kubernetes is deployed, the system will leverage built-in k8s features (e.g., configmap) to deploy system services. Please refer to service deployment readme for details.

Job management

After system services have been deployed, users can access the web portal, a Web UI, for cluster management and job management. Please refer to this tutorial for details about job submission.

Cluster management

The web portal also provides a Web UI for cluster management.

System Architecture

The system architecture is illustrated above. Users submit jobs or monitor cluster status through the Web Portal, which calls APIs provided by the REST server. Third-party tools can also call the REST server directly for job management. Upon receiving API calls, the REST server coordinates with FrameworkLauncher (Launcher for short) to perform job management. The Launcher Server handles requests from the REST server and submits jobs to Hadoop YARN. A job, scheduled by YARN with GPU enhancement, can leverage GPUs in the cluster for deep learning computation. Other types of CPU-based AI workloads and traditional big data jobs can also run on the platform, coexisting with the GPU-based jobs. The platform leverages HDFS to store data, and all jobs are assumed to support HDFS. All the static services (blue-lined box) are managed by Kubernetes, while jobs (purple-lined box) are managed by Hadoop YARN.
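The request path described above (client to REST server, REST server to Launcher, Launcher to YARN) can be sketched as a toy in-memory model. Every class, method, field, and state name below is hypothetical and chosen only to mirror the roles in the text. None of it reflects the real REST API, FrameworkLauncher, or YARN interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    name: str
    gpus: int  # 0 models a CPU-only workload


@dataclass
class Yarn:
    """Stand-in for Hadoop YARN: tracks free GPUs and running jobs."""
    free_gpus: int
    running: List[str] = field(default_factory=list)

    def schedule(self, job: Job) -> bool:
        # Admit the job only if enough GPUs are free.
        if job.gpus > self.free_gpus:
            return False
        self.free_gpus -= job.gpus
        self.running.append(job.name)
        return True


@dataclass
class Launcher:
    """Stand-in for the Launcher: submits jobs to YARN and records state."""
    yarn: Yarn
    status: Dict[str, str] = field(default_factory=dict)

    def submit(self, job: Job) -> str:
        scheduled = self.yarn.schedule(job)
        self.status[job.name] = "RUNNING" if scheduled else "WAITING"
        return self.status[job.name]


@dataclass
class RestServer:
    """Stand-in for the REST server used by the Web Portal and other tools."""
    launcher: Launcher

    def put_job(self, payload: dict) -> dict:
        job = Job(name=payload["name"], gpus=int(payload.get("gpus", 0)))
        return {"name": job.name, "state": self.launcher.submit(job)}

    def get_job(self, name: str) -> dict:
        state = self.launcher.status.get(name, "NOT_FOUND")
        return {"name": name, "state": state}
```

In this model the Web Portal and third-party tools are simply callers of `put_job` and `get_job`. CPU-only jobs pass through the same path with `gpus` set to 0, which matches the statement that they coexist with GPU-based jobs.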

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

pai's People

Contributors

ydye, abuccts, hwuu, yanjiegao, wangcan0329, yqwang-ms, fanyangcs, yitongfeng, qinchen123, yitongfeng-git, xiaoxinla, asakuri, kant, microsoftopensource, xwzheng1020, msftgits, qiyc, lifubang
