

Distributed Deep Learning Framework on Ray

The raytf framework provides a simple interface for distributed training on Ray, targeting TensorFlow, PyTorch, and MXNet. Only TensorFlow is supported at present; the others will be added later.

Quick Start

Only tested with Python 3.6.

  1. Install the latest Ray version: pip install ray
  2. Install the latest raytf: pip install raytf
  3. Clone this project: git clone https://github.com/zuston/raytf.git
  4. Enter the example folder and run the Python script, as in the following commands:
cd raytf
cd example
python mnist.py

How to Use

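from raytf.raytf_driver import Driver

# When running on a single local machine, start Ray first:
# import ray
# ray.init()

# Each role maps to its resource request and the number of instances to launch.
tf_cluster = Driver.build(
    resources={
        'ps': {'cores': 2, 'memory': 2, 'gpu': 2, 'instances': 2},
        'worker': {'cores': 2, 'memory': 2, 'gpu': 2, 'instances': 6},
        'chief': {'cores': 2, 'memory': 2, 'gpu': 2, 'instances': 1}
    },
    event_log='/tmp/opal/4',           # enables the sidecar TensorBoard
    resources_allocation_timeout=10    # seconds to wait for resources
)
# `process` is the user-defined training function (not shown here).
tf_cluster.start(model_process=process, args=None)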
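
Before submitting, it can help to total up what a resources spec asks for and compare it with the cluster's capacity. The following is a minimal sketch, not part of raytf: `total_request` is a hypothetical helper, and it assumes that 'cores', 'memory', and 'gpu' are per-instance values, which this README does not state explicitly.

```python
from typing import Dict

# Same shape as the `resources` argument passed to Driver.build above.
RESOURCES = {
    'ps': {'cores': 2, 'memory': 2, 'gpu': 2, 'instances': 2},
    'worker': {'cores': 2, 'memory': 2, 'gpu': 2, 'instances': 6},
    'chief': {'cores': 2, 'memory': 2, 'gpu': 2, 'instances': 1},
}


def total_request(resources: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum each resource over all roles.

    Assumption (not confirmed by raytf): values are per instance,
    so each one is multiplied by the role's instance count.
    """
    totals = {'cores': 0, 'memory': 0, 'gpu': 0, 'instances': 0}
    for spec in resources.values():
        count = spec['instances']
        totals['instances'] += count
        for key in ('cores', 'memory', 'gpu'):
            totals[key] += spec[key] * count
    return totals


if __name__ == '__main__':
    print(total_request(RESOURCES))
```

Under that assumption, the example spec asks for 9 instances and 18 units each of cores, memory, and GPU.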

This training code attaches to an existing on-prem Ray cluster. For debugging, call ray.init() to start a local Ray cluster instead.

When you specify event_log in Driver.build, a sidecar TensorBoard is started on one worker.

Gang scheduling is supported. raytf also lets you configure how long to wait for resources through the resources_allocation_timeout option shown in the code above; its value is in seconds.
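
To illustrate the idea of gang scheduling with a timeout, here is a generic sketch in plain Python. It is not raytf's implementation: `wait_for_gang`, `get_available`, and `poll_s` are hypothetical names chosen for illustration. The point is only that the job proceeds when every requirement can be met at the same time, and gives up once the timeout expires.

```python
import time
from typing import Callable, Dict


def wait_for_gang(required: Dict[str, int],
                  get_available: Callable[[], Dict[str, int]],
                  timeout_s: float,
                  poll_s: float = 0.1) -> bool:
    """Return True once all requirements are satisfiable simultaneously.

    Returns False if that does not happen within timeout_s seconds.
    Nothing is started on a partial match (all-or-nothing).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        available = get_available()
        if all(available.get(name, 0) >= amount
               for name, amount in required.items()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


if __name__ == '__main__':
    need = {'cores': 18, 'gpu': 18}
    # Enough of everything: succeeds on the first poll.
    print(wait_for_gang(need, lambda: {'cores': 32, 'gpu': 18}, timeout_s=1))
    # Not enough GPUs: gives up after the timeout.
    print(wait_for_gang(need, lambda: {'cores': 32, 'gpu': 8}, timeout_s=0.3))
```

In this sketch the timeout plays the role that resources_allocation_timeout plays in the builder call: it bounds how long the job waits before giving up.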

How to build and deploy

Requirement: python -m pip install twine

  1. python setup.py bdist_wheel --universal
  2. python -m pip install xxxxxx.whl
  3. twine upload dist/*

Tips

  1. This project requires Ray 1.5 or later, which solves the problem of importing Python modules on an on-prem Ray cluster; see the RFC (ray-project/ray#14019).
  2. This project has only been tested with TensorFlow Estimator training.
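
The Ray 1.5+ requirement can be checked up front. The sketch below is an illustration only: `version_at_least` is a hypothetical helper, not part of raytf or Ray. It compares the leading numeric components of a version string, so that "1.13.0" correctly counts as newer than 1.5.

```python
import re
from typing import Tuple


def version_at_least(version: str, minimum: Tuple[int, ...]) -> bool:
    """Compare the leading numeric components of `version` to `minimum`."""
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(match.group()))
    # Pad short versions so that "2" is treated like (2, 0).
    parts += [0] * (len(minimum) - len(parts))
    return tuple(parts) >= tuple(minimum)


if __name__ == '__main__':
    # With Ray installed, pass ray.__version__ instead of a literal string.
    print(version_at_least('1.13.0', (1, 5)))
    print(version_at_least('1.4.1', (1, 5)))
```

Comparing integers rather than raw strings matters here, because a plain string comparison would rank "1.13.0" below "1.5".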

