This project forked from tmulc18/distributed-tensorflow-guide

Distributed TensorFlow basics and examples of training algorithms

License: MIT License

Python 59.23% Shell 2.15% Jupyter Notebook 38.62%

Distributed TensorFlow Guide

This guide is a collection of distributed training examples (that can act as boilerplate code) and a tutorial on the basics of distributed TensorFlow. Many of the examples focus on implementing well-known distributed training schemes, such as those available in dist-keras, which were discussed in the author's blog post.

Almost all the examples can be run on a single machine with a CPU, and all the examples only use data-parallelism (i.e. between-graph replication).

The motivation for this guide stems from the current state of distributed deep learning. Deep learning papers typically demonstrate successful new architectures on some benchmark, but rarely show how these models can be trained with 1000x the data, which is usually the requirement in industry. Furthermore, most successful distributed cases use state-of-the-art hardware to brute-force massive effective minibatches in a synchronous fashion across high-bandwidth networks; there has been little research showing the potential of asynchronous training (which is why there are a lot of those examples in this guide). Finally, the lack of documentation for distributed TF was the real reason this project was started. TF is a great tool that prides itself on its scalability, but unfortunately there are few examples that show how to make your model scale with data size.

The aim of this guide is to aid all interested in distributed deep learning, from beginners to researchers.

Basics Tutorial

See the Basics-Tutorial folder for notebooks demonstrating core concepts used in distributed TensorFlow. The rest of the examples assume understanding of the basics tutorial.

  • Servers.ipynb -- basics of TensorFlow servers
  • Parameter-Server.ipynb -- everything about parameter servers
  • Local-then-Global-Variables.ipynb -- creates a graph locally, then makes global copies of the variables. Useful for graphs that do local updates before pushing global updates (e.g. DOWNPOUR, ADAG, etc.)
  • Multiple-Workers -- contains three notebooks: one parameter server notebook and two worker notebooks. The exercise shows how global variables are communicated via the parameter server and how local updates can be made by explicitly placing ops on local devices
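
The local-then-global pattern mentioned above can be illustrated without TensorFlow. The following is a minimal plain-Python sketch, not the notebook's code: the `ParameterServer` class, the `worker_round` function, and the scalar regression problem are all hypothetical stand-ins chosen for illustration. A worker pulls the global value, takes several local steps with no communication, and then pushes the accumulated change in a single global update.

```python
class ParameterServer(object):
    """Toy stand-in for a parameter server task: holds one global scalar."""

    def __init__(self, value):
        self.value = value

    def pull(self):
        return self.value

    def push(self, delta):
        # Apply a worker's accumulated update to the global copy.
        self.value += delta


def grad(w, x, y):
    # Gradient of 0.5 * (w * x - y) ** 2 with respect to w.
    return (w * x - y) * x


def worker_round(ps, data, lr=0.1, local_steps=4):
    """Take local SGD steps on a local copy, then push the net change once."""
    w_local = ps.pull()              # local copy of the global variable
    start = w_local
    for x, y in data[:local_steps]:
        w_local -= lr * grad(w_local, x, y)   # local update, no communication
    ps.push(w_local - start)         # single global update


ps = ParameterServer(0.0)
data = [(1.0, 3.0)] * 4              # samples drawn from y = 3 * x
for _ in range(20):
    worker_round(ps, data)
final_w = ps.pull()                  # approaches 3.0
```

In the TensorFlow notebooks the same idea is expressed by placing variables and ops on specific devices; the sketch only shows the communication pattern.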

Training Algorithm Examples

The complete list of examples is below. The first example, Non-Distributed-Setup, shows the basic learning problem we want to solve in a distributed fashion; this example should be familiar to all since it doesn't use any distributed code. The second example, Distributed-Setup, shows the same problem being solved with distributed code (i.e. with one parameter server and one worker). The remaining examples are a mix of synchronous and asynchronous training schemes.

1 This is the same as the DOWNPOUR example except that it uses SGD on the workers instead of Adagrad.
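
To make the parameter-server-plus-workers setup concrete, here is a small framework-free sketch of asynchronous data-parallel SGD. It is a toy simulation under stated assumptions, not code from any of the examples: threads stand in for worker processes, the `ParameterServer` class and `worker` function are hypothetical, and two workers are used (rather than the one in Distributed-Setup) so that stale reads can actually occur.

```python
import threading


class ParameterServer(object):
    """Toy parameter server: one shared scalar guarded by a lock."""

    def __init__(self, w=0.0):
        self.w = w
        self._lock = threading.Lock()

    def pull(self):
        with self._lock:
            return self.w

    def apply_gradient(self, g, lr):
        with self._lock:
            self.w -= lr * g


def worker(ps, shard, lr, epochs):
    """Each worker trains on its own shard of the data (data parallelism)."""
    for _ in range(epochs):
        for x, y in shard:
            w = ps.pull()            # may be stale by the time we push
            g = (w * x - y) * x      # gradient of 0.5 * (w * x - y) ** 2
            ps.apply_gradient(g, lr)


data = [(float(x), 2.0 * x) for x in range(1, 9)]   # samples from y = 2 * x
shards = [data[0::2], data[1::2]]                   # one shard per worker
ps = ParameterServer()
threads = [threading.Thread(target=worker, args=(ps, s, 0.01, 50))
           for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_w = ps.pull()                                  # approaches 2.0
```

Workers never wait for each other here, which is the defining property of the asynchronous schemes; the synchronous schemes instead combine gradients from all workers before each update.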

Running Training Algorithm Examples

Each training example (except the non-distributed example) lives in its own folder. To run one, move to the example directory and run the bash script.

cd <example_name>/
bash run.sh

In order to completely stop the example, you'll need to kill the Python processes associated with it. If you want to stop training early, there will be Python processes for each of the workers in addition to the parameter server processes. Unfortunately, the parameter server processes continue to run even after the workers are finished, so these will always need to be killed manually. To kill all Python processes on the machine (including unrelated ones), run pkill.

sudo pkill python
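
If killing every Python process is too blunt, one alternative is to launch the processes from a small script that keeps their handles and stops only those. The sketch below is an assumption-laden illustration, not part of the repository: `PS_CMD` and `WORKER_CMD` are hypothetical stand-ins (a loop that never exits and a short sleep) for whatever commands an example's run.sh actually starts.

```python
import subprocess
import sys

# Hypothetical stand-ins for the real commands: a parameter server that
# blocks forever and workers that exit on their own when training is done.
PS_CMD = [sys.executable, "-c", "import time\nwhile True: time.sleep(1)"]
WORKER_CMD = [sys.executable, "-c", "import time; time.sleep(0.1)"]


def run_cluster(num_workers=2):
    """Start one ps and several workers, then stop only those processes."""
    ps = subprocess.Popen(PS_CMD)
    workers = [subprocess.Popen(WORKER_CMD) for _ in range(num_workers)]
    try:
        worker_codes = [w.wait() for w in workers]   # block until workers finish
    finally:
        # Stop any worker still running (e.g. after Ctrl-C), then the ps.
        for w in workers:
            if w.poll() is None:
                w.terminate()
                w.wait()
        ps.terminate()
        ps.wait()
    return ps.returncode, worker_codes


ps_code, worker_codes = run_cluster()
```

Because the script holds a handle to each process it started, cleanup does not touch other Python programs running on the same machine.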

Requirements

  • Python 2.7
  • TensorFlow >= 1.2

Links

Glossary

  • Server -- encapsulates a Session target and belongs to a cluster
  • Coordinator -- coordinates threads
  • Session Manager -- restores a session, initializes variables, and coordinates threads
  • Supervisor -- good for threads. Combines a Coordinator, Saver, and Session Manager, so it does more than a Session Manager alone
  • Session Creator -- Factory for creating a session?
  • Monitored Session -- a Session that handles initialization, hooks, and recovery
  • Monitored Training Session -- only distributed solution for sync optimization
  • Sync Replicas -- wrapper of optimizer for synchronous optimization
  • Scaffold -- holds lots of meta training settings and is passed to the Session Creator
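
The idea behind synchronous optimization, as referenced in the Sync Replicas entry, can be sketched in plain Python. This is a conceptual toy, not the TensorFlow implementation: the `sync_step` function and the scalar regression data are hypothetical. Every replica computes a gradient at the same parameter value, the gradients are averaged, and exactly one update is applied per step.

```python
def grad(w, x, y):
    # Gradient of 0.5 * (w * x - y) ** 2 with respect to w.
    return (w * x - y) * x


def sync_step(w, batches, lr):
    """One synchronous step over all replicas (one batch per replica)."""
    replica_grads = []
    for batch in batches:
        # Each replica averages the gradient over its own batch,
        # always evaluated at the same shared value of w.
        g = sum(grad(w, x, y) for x, y in batch) / len(batch)
        replica_grads.append(g)
    avg = sum(replica_grads) / len(replica_grads)
    return w - lr * avg              # single update from the averaged gradient


# Two replicas, each with its own batch of samples from y = 2 * x.
batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(100):
    w = sync_step(w, batches, lr=0.05)
# w approaches 2.0
```

Unlike asynchronous training, no gradient here is ever computed from a stale parameter value, at the cost of every step waiting for the slowest replica.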

Hooks

Algorithm References
