Comments (7)
Yes, we need to define which nodes to use for the parameter servers and workers when we run a distributed TensorFlow job.
Currently we have an example in one of our pull requests, https://github.com/fplk/FfDL/tree/merge_20180514_1536/etc/examples/tf-distributed, which provides the launcher.py script to configure the networking and communication between the different nodes, so users can do minimal work to enable distributed training.
We will merge this pull request soon, after we finish our code scan and reviews. Sorry for the inconvenience.
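For context, a distributed TF job needs a cluster spec that names which replicas act as parameter servers ("ps") and which act as workers; the sketch below shows the kind of TF 1.x setup a launcher script like this automates. The hostnames, ports, and hard-coded role are hypothetical placeholders, not FfDL's actual values:

```python
import tensorflow as tf

# Cluster spec mapping job names to replica addresses (TF 1.x API).
# Addresses here are made-up placeholders.
cluster = tf.train.ClusterSpec({
    "ps": ["learner-0.example.com:2222"],
    "worker": ["learner-1.example.com:2222", "learner-2.example.com:2222"],
})

# Each replica starts an in-process server for its own role and index;
# a real launcher derives job_name/task_index from the replica's rank
# instead of hard-coding them as done here.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
```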
@Tomcli Thanks for your good work. However, according to the example https://github.com/fplk/FfDL/tree/merge_20180514_1536/etc/examples/tf-distributed, the parameter server also takes up GPU resources. In the native TensorFlow framework the learner container doesn't distinguish between ps and worker, which will be a problem. How should we deal with it?
@cloustone Unfortunately yes, the ps also takes a GPU resource, so in the current implementation you end up wasting a GPU running the parameter server even though it's not required.
I guess we could do interesting things like starting a sidecar container in these pods and having the ps run in the sidecar containers on some of the pods.
We are still thinking through the best approach for implementing this. The above is one thought. Another is to use the same container and expose the ps as a separate process; another is to run the ps as separate standalone pod(s). We have also thought about enhancing our job monitor component and maybe having it take on the responsibility of the ps.
As you can see, nothing is concrete here, since the TF distributed approach seems to be rapidly evolving as well.
Honestly, we are thinking about the best way to capture all of these in a clean manner (see https://www.youtube.com/watch?v=bRMGoPqsn20), especially the mirrored strategy, which brings ring all-reduce natively to TF.
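To make the "same container, separate process" idea concrete, here is a minimal hypothetical sketch of keeping the ps process off the GPU by masking CUDA devices before TensorFlow initializes. The JOB_NAME/TASK_INDEX environment variables are illustrative, not FfDL's actual interface, and note that this only avoids GPU contention inside the pod; it does not release the GPU from the pod's Kubernetes allocation:

```python
import os

# Hypothetical env vars a launcher could export; not FfDL's actual interface.
role = os.environ.get("JOB_NAME", "worker")
task_index = int(os.environ.get("TASK_INDEX", "0"))

if role == "ps":
    # Hide CUDA devices before importing TensorFlow so the ps process never
    # touches the pod's GPU. The GPU is still allocated to the pod by
    # Kubernetes, so this avoids contention but not the scheduling cost.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf

# Placeholder addresses, as in the earlier sketch.
cluster = tf.train.ClusterSpec({
    "ps": ["learner-0.example.com:2222"],
    "worker": ["learner-1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name=role, task_index=task_index)

if role == "ps":
    server.join()  # a parameter server only serves variables; block forever
```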
I am not sure I follow "in the native TensorFlow framework, the learner container doesn't distinguish ps and worker". The launcher.py code is responsible for that: https://github.com/fplk/FfDL/blob/merge_20180514_1536/etc/examples/tf-distributed/launcher.py#L42. It basically marks the first n (by default 1) replicas as ps and the others as workers.
I can provide more in-depth detail if you think the above does not make sense.
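For illustration, the rank-to-role mapping described above boils down to something like the following sketch; the names are illustrative, not the exact code in launcher.py:

```python
# NUM_PS replicas become parameter servers; the rest become workers.
NUM_PS = 1  # default: the first replica is the ps

def assign_role(rank, num_ps=NUM_PS):
    """Map a replica's rank to a TensorFlow (job_name, task_index) pair."""
    if rank < num_ps:
        return "ps", rank            # ranks 0..num_ps-1 -> parameter servers
    return "worker", rank - num_ps   # remaining ranks -> workers

# With 3 replicas: 0 -> ('ps', 0), 1 -> ('worker', 0), 2 -> ('worker', 1)
print([assign_role(r) for r in range(3)])
```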
Thanks @atinsood
@cloustone In short, yes, in the current implementation there is a penalty to be paid for using a GPU for the PS. We would love to schedule a call and discuss options with your team.
Thanks @animeshsingh @atinsood
We understand the current status and hope a great PR or improvement will be available soon.
Our team wants to push FfDL into a production environment; however, there is a lot of work to be done. We should keep in close touch, and we also hope to get help from the FfDL team.
Anyway, it's a little difficult for our team to speak English fluently, so mail or WeChat would be a better way :)
My mail address: [email protected]
Thanks @cloustone
The next PR we have in the pipeline for the 0.1 release candidate, #79, is much closer to what you need. I will send a mail to your address and we can iterate.
Closed with #79