Oobleck
Resilient Distributed Training Framework

Oobleck is a large-model training framework that supports fast fault recovery using pipeline templates.

It is the first training framework that realizes:

  • Dynamic reconfiguration: Oobleck can reconfigure the distributed training configuration after failures without restarting.
  • Pipeline template instantiation: Oobleck pre-generates a set of pipeline templates, then combines pipelines instantiated from them to form a distributed execution plan. After a failure, it reuses the same set of templates and instantiates different pipelines.
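
The reuse of a fixed template set can be illustrated with a small sketch. This is not Oobleck's planner: the `PipelineTemplate` class, the `plan` function, and the node counts are hypothetical, and a real planner would also weigh throughput when choosing a combination. The sketch only shows that the same templates can cover a cluster before and after a node is lost.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PipelineTemplate:
    """Hypothetical stand-in: a pipeline layout that spans `num_nodes` nodes."""
    num_nodes: int


def plan(templates: list[PipelineTemplate], nodes: int) -> Optional[list[PipelineTemplate]]:
    """Pick templates (with repetition) whose node counts sum to `nodes`.

    Each entry of the returned list stands for one instantiated pipeline.
    Returns None if no combination covers exactly `nodes` nodes.
    """
    if nodes == 0:
        return []
    # Try larger templates first, backtracking when the remainder cannot be covered.
    for template in sorted(templates, key=lambda t: t.num_nodes, reverse=True):
        if template.num_nodes <= nodes:
            rest = plan(templates, nodes - template.num_nodes)
            if rest is not None:
                return [template] + rest
    return None


# The template set is generated once and reused for every plan.
templates = [PipelineTemplate(2), PipelineTemplate(3)]
before = plan(templates, 7)  # plan for a 7-node cluster
after = plan(templates, 6)   # plan after one node fails
```

In this toy setup the 7-node plan uses one 3-node and two 2-node pipelines, and the 6-node plan uses two 3-node pipelines, without generating any new templates.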

Getting Started

Install

Use pip to install Oobleck:

pip install oobleck

Oobleck relies on cornstarch for pipeline templates and Colossal-AI as its training backend. Optionally, install apex, xformers, and flash-attn to boost throughput (follow the instructions in each project's README).
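
To see which of the optional packages are present in the current environment, a short check like the one below may help. It is a sketch and not part of Oobleck; the import names in `OPTIONAL_MODULES` are assumptions and can differ from the pip package names.

```python
from importlib.util import find_spec

# Import names are assumptions; they can differ from the pip package names
# (for example, flash-attn is imported as flash_attn).
OPTIONAL_MODULES = ["apex", "xformers", "flash_attn"]


def available_modules(names: list[str]) -> dict[str, bool]:
    """Map each top-level module name to whether it can be found for import."""
    return {name: find_spec(name) is not None for name in names}


for name, found in available_modules(OPTIONAL_MODULES).items():
    print(f"{name}: {'installed' if found else 'not installed'}")
```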

Run

Please refer to this README.

Cluster Management

Oobleck provides a command line interface (CLI) that manages the cluster. Use oobleck to access the master agent:

$ oobleck --ip <master_ip> --port <master_port> <command> <command_options>

where the master port is printed to stdout when the master service starts:

| INFO     | __main__:serve:430 - Running master service on port 45145
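
When the master's output is captured to a file or a pipe, the port can be pulled out of that log line with a script. The sketch below assumes the message text stays as shown above; `find_master_port` is a hypothetical helper, not part of the Oobleck CLI.

```python
import re
from typing import Optional

# Matches the message shown above; the wording is assumed to be stable.
PORT_PATTERN = re.compile(r"Running master service on port (\d+)")


def find_master_port(log_text: str) -> Optional[int]:
    """Return the first master port found in the log text, or None."""
    for line in log_text.splitlines():
        match = PORT_PATTERN.search(line)
        if match:
            return int(match.group(1))
    return None


sample = "| INFO     | __main__:serve:430 - Running master service on port 45145"
port = find_master_port(sample)
print(port)
```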

Currently, you can list the agents and request graceful termination of an agent:

$ oobleck --ip <master_ip> --port <master_port> get_agent_list
=== Agents ===
[0] IP: node1:10000 Status: up (device indices: 0,1)
[1] IP: node1:10000 Status: up (device indices: 2,3)
[2] IP: node2:10000 Status: up (device indices: 0,1)
[3] IP: node2:10000 Status: up (device indices: 2,3)
==============

$ oobleck --ip <master_ip> --port <master_port> kill_agent --agent_index 2
| INFO     | __main__:KillAgent:340 - Terminating agent 2 on node2:10000
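
For scripting around the CLI, the `get_agent_list` output can be parsed into structured records. The sketch below assumes the line format shown in the sample output above; the `Agent` class and `parse_agent_list` function are hypothetical helpers and not part of Oobleck.

```python
import re
from dataclasses import dataclass

# Line format taken from the sample output above; treat it as an assumption.
AGENT_LINE = re.compile(
    r"\[(?P<index>\d+)\] IP: (?P<address>\S+) Status: (?P<status>\w+) "
    r"\(device indices: (?P<devices>[\d,]+)\)"
)


@dataclass
class Agent:
    index: int
    address: str
    status: str
    devices: list[int]


def parse_agent_list(output: str) -> list[Agent]:
    """Turn the text printed by get_agent_list into Agent records."""
    agents = []
    for line in output.splitlines():
        match = AGENT_LINE.search(line)
        if match:  # header and footer lines do not match and are skipped
            agents.append(Agent(
                index=int(match["index"]),
                address=match["address"],
                status=match["status"],
                devices=[int(d) for d in match["devices"].split(",")],
            ))
    return agents


sample = """=== Agents ===
[0] IP: node1:10000 Status: up (device indices: 0,1)
[1] IP: node1:10000 Status: up (device indices: 2,3)
[2] IP: node2:10000 Status: up (device indices: 0,1)
[3] IP: node2:10000 Status: up (device indices: 2,3)
=============="""
agents = parse_agent_list(sample)
```

A script could use the parsed records to look up an agent's index by address before passing it to `kill_agent --agent_index`.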

Citation

@inproceedings{oobleck-sosp23,
    title     = {Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates},
    author    = {Jang, Insu and Yang, Zhenning and Zhang, Zhen and Jin, Xin and Chowdhury, Mosharaf},
    booktitle = {ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP '23)},
    year      = {2023},
}
