Distributed Version (tensorflow) - CLOSED

jermainewang commented:
Distributed Version

Comments (54)

jeffreyadean commented:

Our current internal distributed extensions are somewhat entangled with Google internal infrastructure, which is why we released the single-machine version first. The code is not yet in GitHub, because it has dependencies on other parts of the Google code base at the moment, most of which have been trimmed, but there are some remaining ones.

We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment.

mrry commented:

Since 0.8 is now released, I think it's time to close this issue. Please create new issues for anything that arises with the distributed version, and thanks for all of your input!

mrry commented:

I just pushed an initial version of the distributed runtime, based on gRPC. Currently, tensorflow/core/distributed_runtime contains most of the C++ implementation needed to get started, and we're still working on the higher-level libraries that will use multiple processes. However, you can get started today by taking a look at the readme.

vrv commented:

Thanks for the question! To reiterate what I said here: we are working on making a distributed implementation available, but it's not in the initial release. Please stay tuned, and take a look at the CIFAR multi-GPU tutorial for a flavor of how we handle multiple 'devices': http://tensorflow.org/tutorials/deep_cnn/index.md
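
For a flavor of what that looks like in code, here is a minimal single-machine sketch of explicit device placement in the spirit of the multi-GPU tutorial (illustrative only; the names, shapes, and two-GPU split are made up, and the real tutorial's tower setup is more involved):

import tensorflow as tf

# Parameters live on the CPU so that every GPU tower can read them.
with tf.device("/cpu:0"):
    weights = tf.Variable(tf.random_normal([784, 10]), name="weights")

tower_losses = []
for gpu_id in range(2):
    with tf.device("/gpu:%d" % gpu_id):
        # Each GPU computes the loss for its own shard of the batch.
        inputs = tf.random_normal([128, 784])
        labels = tf.random_normal([128, 10])
        logits = tf.matmul(inputs, weights)
        tower_losses.append(tf.nn.l2_loss(logits - labels))

# Combine the per-tower losses into a single training objective.
total_loss = tf.add_n(tower_losses)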

mrry commented:

Heh, looks like merging another code base can inadvertently close issues :). Reopening this, because there's still work to do.

ctn commented:

Sorry, I completely missed the mention; I have been heads-down on various matters. This looked just like the other hundreds of commit notifications in my mailbox.

Yes, we'll be talking about a distributed implementation of TensorFlow on Spark, PySpark in particular, with some pretty interesting results on scaling, GPU vs. CPU, etc. I'll see you there if you'll be in NYC for Spark Summit, or else on the live stream.

The primary motivation is from the Spark perspective: e.g., easily adding another useful workload to an existing Spark deployment.

For distributed TensorFlow in the abstract, Google will release a distributed implementation "soon".

HTH.

Update on the above (Distributed Tensorflow on Spark):

mrry commented:

I just wanted to draw everyone's attention to 6d83874, which modifies the interface to some of the distributed runtime methods. In particular, tf.GrpcServer becomes tf.train.Server, and its constructor is more ergonomic, so you no longer need to construct a (now-renamed-to) tf.train.ServerDef proto to instantiate a server. The docs in the repository are now updated, but haven't yet made it onto the website.
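
For reference, a minimal sketch of the new-style constructor (assuming the 0.8-era API described above; the host names are placeholders):

import tensorflow as tf

# A ClusterSpec can be passed straight to tf.train.Server, so no ServerDef
# proto needs to be assembled by hand anymore.
cluster = tf.train.ClusterSpec({"ps": ["ps0:2222"],
                                "worker": ["worker0:2222", "worker1:2222"]})
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()  # block forever, serving graph execution requests over gRPC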

Let me know if there are any questions!

jesuisnicolasdavid commented:

From what I understand, TensorFlow Serving only takes a trained model. So for the question "how to train a model on multiple machines", the answer isn't TensorFlow Serving yet.

LiorZ commented:

+1

rusenask commented:

I think that's the point: Google hasn't open-sourced the "scalable" version :)

saraswat commented:

I would appreciate any insight on the availability of the distributed version. Is the distributed code being worked on in GitHub? That is one place where some of us who are interested could contribute.

zh4ngx commented:

👍

edwardyoon commented:

Hello,

After reading these plans and ideas, I'm somewhat surprised. According to http://static.googleusercontent.com/media/research.google.com/en//people/jeff/BayLearn2015.pdf, both data and model parallelism are needed to train large and powerful models quickly. Also, transferring data between GPUs takes time, as described in http://tensorflow.org/tutorials/deep_cnn/index.md. So how is it possible to efficiently support both model parallelism and heterogeneous multi-device (single-node) execution on a distributed cluster? Could you please roughly explain how it differs from DistBelief?

Thanks!

edwardyoon commented:

P.S. GPU acceleration could also be limited by the model partitioning strategy.

edwardyoon commented:

Awesome.

After reading the whitepaper, I just realized that a large neural network model can be partitioned into sub-graphs by layer (horizontal partitioning) and executed serially.
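
Roughly, that kind of layer-wise split can be written down with explicit device annotations; a toy sketch (the job/task names and shapes here are hypothetical, not from the whitepaper):

import tensorflow as tf

x = tf.random_normal([32, 1024])

# The first layer is pinned to one task...
with tf.device("/job:worker/task:0"):
    w1 = tf.Variable(tf.random_normal([1024, 512]))
    h1 = tf.nn.relu(tf.matmul(x, w1))

# ...and the second to another; TensorFlow inserts the send/recv of h1,
# so the sub-graphs execute one after the other along the data dependency.
with tf.device("/job:worker/task:1"):
    w2 = tf.Variable(tf.random_normal([512, 10]))
    logits = tf.matmul(h1, w2)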

One thing that is still not clear to me is the performance of a fully connected network on a multi-node GPU cluster.

kdunn926 commented:

In theory, something like Dask could be layered on top for handling this - at least for the Python front-end.

saraswat commented:

Any update on timeline?

edwardyoon commented:

Dask looks like an interesting project, but the drawback of the blocking algorithm is that it's not memory-optimal. Since a large amount of memory is required for fully connected layers, I thought that Pregel-like model parallelism on CPUs with vertical partitioning would be more attractive for fully connected layers (blocked matrix multiplication on a GPU also seems slow and memory-demanding to me). Of course, I may be wrong, but that's why I launched the Apache Horn project recently. Since layers can be pipelined, I hope our projects can collaborate in a complementary way.

bhack commented:

I don't know if you can give us a preview of the framework choice, but...

https://github.com/cloudera/spark-dataflow

https://spark-summit.org/east-2016/events/distributed-tensor-flow-on-spark-scaling-googles-deep-learning-library/

saudet commented:

@bhack I wonder what their Java/Scala interface looks like...

martinwicke commented:

We've started work on this using gRPC. We hope to have something to show soon.

RobotiAi commented:

Really in desperate need of the distributed version. I dream of seeing it released early and contributing my efforts to this great project.

bhack commented:

@saudet /cc @ctn

shendiaomo commented:

Any update on the timeline? Can't wait any longer... @martinwicke

andykitchen commented:

Bump. I'm really glad this is your top priority. Thanks for all your amazing work so far.

krzysztof-magosa commented:

+1

grillermo commented:

+1

mschonwe commented:

+1

jesuisnicolasdavid commented:

+1

shendiaomo commented:

TensorFlow Serving was released today; it seems gRPC networking has proved to be a mature solution. Great news!

shendiaomo commented:

Yeah, of course TensorFlow Serving itself is not the answer. My point is that the core of distributed training is how to pass tensors (activations/weights/gradients) efficiently between machines, and that is half solved with a proven, decent networking framework like gRPC.

SolalPirelli commented:

Is there a (rough) estimate of how soon this'll happen? I know estimates like these are difficult to do, but it'd be really helpful to know whether "soon" is closer to a few weeks or a few months. 😄

YanTangZhai commented:

+1

bhack commented:

In the meantime, if anyone wants to experiment, there is an initial PR for Spark by @amplab at amplab/SparkNet#91.

ffmpbgrnn commented:

Awesome! I can't wait to try it!

shendiaomo commented:

Congrats! Great & impressive work!

jesuisnicolasdavid commented:

Well done @mrry !

hsaputra commented:

Thanks for re-opening it, @mrry .

Since the initial effort has already been merged, would it be easier to track the remaining open tasks as individual issues?

javadba commented:

It appears there were two parallel approaches here: (a) SparkNet and (b) gRPC using C++. Is that correct? And are both currently active?

JinXinDeep commented:

In my opinion, Spark + TensorFlow is mainly for data parallelism and is useful for small-scale networks. gRPC and TensorFlow's distributed support would be more helpful for large networks.

javadba commented:

@JinXinDeep Would you please clarify? Spark allows access to large clusters, potentially with each machine having a GPU or many CPUs. So why do you say small-scale networks?

JinXinDeep commented:

@javadba Sorry for the confusion: by "small networks" I mean that the deep neural networks cannot have too many model parameters; otherwise the communication time between machines will exceed the computation time, which means the speedup is not high.
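
A rough back-of-envelope way to see that tradeoff (all numbers below are made-up placeholders, not measurements):

# Synchronous data parallelism only pays off while the per-step compute
# time stays well above the time to exchange the parameters/gradients.
params = 50e6            # model parameters (hypothetical)
bytes_per_param = 4      # float32
net_bandwidth = 1.25e9   # bytes/sec, roughly a 10 Gbit/s link (hypothetical)
compute_time = 0.30      # seconds per step on one worker (hypothetical)

comm_time = params * bytes_per_param / net_bandwidth   # ~0.16 s here
print("comm/compute ratio: %.2f" % (comm_time / compute_time))
# Grow params ~10x and comm_time dominates, so the speedup collapses.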

javadba commented:

@JinXinDeep Thanks for the clarification. Are model parameter sizes really that large, as in many MB?

rafaljozefowicz commented:

E.g. in this paper arxiv.org/abs/1602.02410 we trained LSTMs that have 200M+ parameters and 3B+ total parameters in some cases.
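
For scale: 200M float32 parameters is roughly 200e6 × 4 bytes ≈ 800 MB for the weights alone, so naively shipping a full copy of the weights (or their gradients) between machines every step means on the order of a gigabyte of traffic per worker per iteration.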

bhack commented:

/cc @robertnishihara that probably could give you a better overview of SparkNet target.

/cc @maxpumperla if interested in the thread

maxpumperla commented:

Thanks @bhack, I was indeed missing out on the discussion.

robertnishihara commented:

Thanks @bhack! Indeed SparkNet is designed for data parallel training or data parallel use cases like featurizing data with an already trained net.

kanwar2preet commented:

Pardon my ignorance, but has anyone succeeded in running TensorFlow on multiple machines? I have been unable to find out how to do it.

I was successful in starting a local server:

I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:199] Initialize HostPortsGrpcChannelCache for job worker -> {localhost:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:203] Started server with target: grpc://localhost:2222

I tried to give the IP of another machine instead of localhost, but it still starts on localhost only.

Please help.

mrry commented:

@kanwar2preet: Can you open another issue with details of (i) the TensorFlow program that you ran, and (ii) the configuration you used for the servers? Thanks!

jramapuram commented:

The docs for distribution seemed out of date when I tried this: tf.make_cluster_def is now private (i.e. _make_cluster_def)? I put together a simple example for testing purposes. I had a few segfaults along the way when I mismatched things, e.g. if you only have one worker specified and you set task_index=1, it will SEGFAULT.

Here is what I did:
Worker Nodes setup:

import tensorflow as tf

cluster_spec = tf.ClusterSpec({"worker": ["tensorflow-worker0:2222",
                                           "tensorflow-worker1:2222"],
                                "ps": ["tensorflow-master0:2222"]})
server_def = tf.ServerDef(cluster=cluster_spec.as_cluster_def(),
                          job_name="worker",
                          task_index=1,  # 1 = tensorflow-worker1; use 0 on tensorflow-worker0
                          protocol="grpc")
server = tf.GrpcServer(server_def)
server.join()

ps Node Setup:

import tensorflow as tf

cluster_spec = tf.ClusterSpec({"worker": ["tensorflow-worker0:2222",
                                           "tensorflow-worker1:2222"],
                                "ps": ["tensorflow-master0:2222"]})
server_def = tf.ServerDef(cluster=cluster_spec.as_cluster_def(),
                          job_name="ps",
                          task_index=0,
                          protocol="grpc")
server = tf.GrpcServer(server_def)
server.join()

How to deploy a job:

import tensorflow as tf

# Parameters live on the ps task; the computation runs on worker 1.
with tf.device("/job:ps/task:0"):
    weights0 = tf.Variable(tf.random_normal(shape=[1024, 512]))
    bias0 = tf.Variable(tf.zeros(shape=[512]))

with tf.device("/job:worker/task:1"):
    inputs = tf.random_normal(shape=[10, 1024])
    l0 = tf.nn.relu(tf.matmul(inputs, weights0) + bias0)
    l1 = tf.nn.relu(tf.matmul(l0, tf.transpose(weights0)))
    loss = tf.nn.l2_loss(l1 - inputs)
    train_op = tf.train.AdamOptimizer().minimize(loss)

with tf.Session("grpc://tensorflow-master0:2222") as sess:
    # Initialize the variables once, outside the training loop.
    sess.run(tf.initialize_all_variables())
    for _ in range(1000):
        _, l = sess.run([train_op, loss])
        print(l)

Regarding deployment systems: it would be nice to have a simple Ansible deployment for this.

mrry commented:

@jramapuram: Thanks for pointing this out. We have a fix to the docs coming later today. (Also, these interfaces are subject to a little bit of churn before they freeze in the next release, so please bear with us!)

The segfault is rather embarrassing, so we should fix that. Can you please create an issue with the exact code that reproduces it?

Regarding Ansible, I don't have any experience with that platform, but we would be glad to accept contributions... feel free to open another issue to suggest that. (#1686 suggests adding support for Slurm, for example.)

LiorZ commented:

Great! Amazing work

JinXinDeep commented:

@mrry Thanks for the distributed version of TensorFlow as of v0.8. In my opinion, TensorFlow is very flexible: it can do model parallelism, data parallelism, or mixed parallelism, although the examples are for data parallelism.

For example, for model parallelism, there is typically a cluster consisting of a number of distributed nodes (e.g. machines) for model training. If we use the in-graph mode, a main program can be used to define the tasks for all the nodes. To reduce communication overhead and do model parallelism efficiently, each node includes a parameter server (ps) task containing the sub-model parameters and a worker task corresponding to the computations of that sub-model; each node holds a different sub-model, assigned by the main program, and these sub-models collectively form the whole model. Each node's ps task will receive computation results from at least one node's worker in the cluster. In model parallelism, each node's worker task typically does the computation corresponding to its ps task's sub-model, and the ps task in each node updates its sub-model according to the training results it receives.
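
To make that concrete, a rough in-graph sketch along those lines (v0.8-era API; the cluster layout, shapes, and two-way split are just illustrative, and each node is assumed to run its own tf.train.Server with this cluster spec):

import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["node0:2222", "node1:2222"],
                                "worker": ["node0:2223", "node1:2223"]})
# (each node separately runs tf.train.Server(cluster, job_name=..., task_index=...))

# Sub-model 0: parameters on node0's ps task, computation on node0's worker.
with tf.device("/job:ps/task:0"):
    w0 = tf.Variable(tf.random_normal([1024, 512]))
with tf.device("/job:worker/task:0"):
    x = tf.random_normal([64, 1024])
    h0 = tf.nn.relu(tf.matmul(x, w0))

# Sub-model 1: parameters on node1's ps task, computation on node1's worker.
with tf.device("/job:ps/task:1"):
    w1 = tf.Variable(tf.random_normal([512, 10]))
with tf.device("/job:worker/task:1"):
    logits = tf.matmul(h0, w1)
    loss = tf.nn.l2_loss(logits)

# One main program defines the whole graph and drives it through one session;
# gradients flow back across the sub-models and each ps task updates its part.
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
with tf.Session("grpc://node0:2223") as sess:
    sess.run(tf.initialize_all_variables())
    _, l = sess.run([train_op, loss])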

Is that right? Thanks!

raghav20 commented:

I am wondering how I can run distributed TensorFlow on top of Spark.
