support for tensorflow distribution (ffdl, closed)

ibm commented on July 22, 2024
support for tensorflow distribution

from ffdl.

Comments (7)

Tomcli commented on July 22, 2024

Yes, we need to define which node we want to use for parameter servers and workers when we run the TensorFlow distributed job.

Currently we have an example in one of our pull requests: https://github.com/fplk/FfDL/tree/merge_20180514_1536/etc/examples/tf-distributed. In it we provide a launcher.py script that configures the networking and communication between the different nodes, so users need to do only minimal work to enable distributed training.
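As a rough illustration of the kind of configuration a launcher has to produce for each node, here is a minimal sketch that builds a `TF_CONFIG`-style JSON string (the cluster/task layout TensorFlow reads for distributed jobs). It is not the FfDL launcher.py. The function name `build_tf_config`, the hostnames `learner-0` to `learner-2`, and port 2222 are assumptions made up for the example.

```python
import json


def build_tf_config(ps_hosts, worker_hosts, task_type, task_index, port=2222):
    """Return a TF_CONFIG-style JSON string for one replica.

    ps_hosts / worker_hosts: lists of hostnames (hypothetical names here).
    task_type: "ps" or "worker"; task_index: position within that list.
    """
    cluster = {
        "ps": ["%s:%d" % (host, port) for host in ps_hosts],
        "worker": ["%s:%d" % (host, port) for host in worker_hosts],
    }
    if task_type not in cluster:
        raise ValueError("task_type must be 'ps' or 'worker'")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task_index out of range for %s" % task_type)
    return json.dumps({"cluster": cluster,
                       "task": {"type": task_type, "index": task_index}})


# Hypothetical three-replica job: one parameter server, two workers.
# This is the configuration the second worker would receive.
config = build_tf_config(["learner-0"], ["learner-1", "learner-2"], "worker", 1)
print(config)
```

Each replica would get the same `cluster` section and a different `task` section, typically exported as the `TF_CONFIG` environment variable before the training script starts.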

We will soon merge this pull request after we finish our code scan and reviews. Sorry for the inconvenience.

cloustone commented on July 22, 2024

@Tomcli Thanks for the good work. However, according to the example https://github.com/fplk/FfDL/tree/merge_20180514_1536/etc/examples/tf-distributed, the parameter server also takes up GPU resources. With the native TensorFlow framework, the learner container doesn't distinguish between ps and worker, which will be a problem. How do you plan to deal with it?

atinsood commented on July 22, 2024

@cloustone Unfortunately yes, the ps also takes up a GPU resource, so in the current implementation you end up wasting a GPU on running the parameter server even though it's not required.

I guess we can probably do interesting things like start a sidecar container in these pods and have ps run in the sidecar containers on some of the pods.

We are still thinking through the best approach to implementing this. The above is one thought. Another is to use the same container and expose the ps as a separate process. A third is to run the ps as separate standalone pod(s). We have also thought about enhancing our job monitor component and maybe having it take on the responsibility of the ps.

As you can see, there is nothing concrete here yet, since the TF distributed approach seems to be rapidly evolving as well.

Honestly, we are thinking about the best way to capture all of these in a clean manner (see https://www.youtube.com/watch?v=bRMGoPqsn20), especially the mirrored strategy, which brings ring reduce natively to TF.

I am not sure I follow "for tensorflow native framework, the learner container doesn't distinguish ps and worker, it will be a problem. how to deal with it?" The launcher.py code is responsible for that: https://github.com/fplk/FfDL/blob/merge_20180514_1536/etc/examples/tf-distributed/launcher.py#L42. It basically marks the first n (by default 1) replicas as ps and the others as workers.
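The "first n replicas are ps, the rest are workers" rule can be sketched as follows. This is only an illustration of the rule as described, not the code from launcher.py; the function name `assign_role` and the four-replica job are hypothetical.

```python
def assign_role(replica_index, num_ps=1):
    """Map a replica's ordinal to a (task_type, task_index) pair.

    The first num_ps replicas become parameter servers; the remaining
    replicas become workers, re-indexed so the first worker is index 0.
    """
    if replica_index < 0:
        raise ValueError("replica_index must be non-negative")
    if replica_index < num_ps:
        return ("ps", replica_index)
    return ("worker", replica_index - num_ps)


# Hypothetical job with four identical learner replicas and the default num_ps.
roles = [assign_role(i) for i in range(4)]

# Every replica uses the same pod spec, so each ps replica reserves
# the same GPU allocation as a worker even though it does not need it.
ps_replicas = sum(1 for task_type, _ in roles if task_type == "ps")
```

Because the role is derived only from the replica ordinal, the container image and resource request stay identical across replicas, which is why the ps ends up holding a GPU.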

I can provide more in-depth detail if the above does not make sense.

animeshsingh commented on July 22, 2024

Thanks @atinsood

@cloustone In short, yes: in the current implementation there is a penalty to be paid, since the PS uses a GPU. We would love to schedule a call and discuss options with your team.

cloustone commented on July 22, 2024

Thanks @animeshsingh @atinsood
We understand the current status and hope a good PR or improvement will become available.

Our team wants to push FfDL into a production environment, but there is still a lot of work to be done. We should keep in close touch, and we also hope to get help from the FfDL team.
That said, it's a little difficult for our team to speak English fluently, so mail or WeChat would be a better way :)

My mail address: [email protected]

animeshsingh commented on July 22, 2024

Thanks @cloustone
The next PR we have in the pipeline for the 0.1 release candidate, #79, is much closer to what you need. I will send an email to your address and we can iterate.

Tomcli commented on July 22, 2024

Closed with #79
