Comments (7)
Yes, we need to define which nodes to use for the parameter servers and workers when we run a distributed TensorFlow job.
Currently we have an example in one of our pull requests, https://github.com/fplk/FfDL/tree/merge_20180514_1536/etc/examples/tf-distributed, which provides the launcher.py script to configure the networking and communication between the different nodes, so users can do minimal work to enable distributed training.
We will merge this pull request soon, after we finish our code scan and reviews. Sorry for the inconvenience.
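For context, a distributed TF job needs a cluster spec that names which replicas act as parameter servers ("ps") and which act as workers; the sketch below shows the kind of TF 1.x setup a launcher script like this automates. The hostnames, ports, and hard-coded role are hypothetical placeholders, not FfDL's actual values:

```python
import tensorflow as tf

# Cluster spec mapping job names to replica addresses (TF 1.x API).
# Addresses here are made-up placeholders.
cluster = tf.train.ClusterSpec({
    "ps": ["learner-0.example.com:2222"],
    "worker": ["learner-1.example.com:2222", "learner-2.example.com:2222"],
})

# Each replica starts an in-process server for its own role and index;
# a real launcher derives job_name/task_index from the replica's rank
# instead of hard-coding them as done here.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
```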
@Tomcli Thanks for your good work. However, according to the example https://github.com/fplk/FfDL/tree/merge_20180514_1536/etc/examples/tf-distributed, the parameter server also takes up GPU resources. In the native TensorFlow framework the learner container doesn't distinguish between ps and worker, which will be a problem. How should we deal with it?
@cloustone Unfortunately yes, the ps also takes a GPU resource, so in the current implementation you end up wasting a GPU running the parameter server even though it's not required.
I guess we could do interesting things like starting a sidecar container in these pods and having the ps run in the sidecar containers on some of the pods.
We are still thinking through the best approach for implementing this. The above is one thought. Another is to use the same container and expose the ps as a separate process; another is to run the ps as separate standalone pod(s). We have also thought about enhancing our job monitor component and maybe having it take on the responsibility of the ps.
As you can see, nothing is concrete here, since the TF distributed approach seems to be rapidly evolving as well.
Honestly, we are thinking about the best way to capture all of these in a clean manner (see https://www.youtube.com/watch?v=bRMGoPqsn20), especially the mirrored strategy, which brings ring all-reduce natively to TF.
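To make the "same container, separate process" idea concrete, here is a minimal hypothetical sketch of keeping the ps process off the GPU by masking CUDA devices before TensorFlow initializes. The JOB_NAME/TASK_INDEX environment variables are illustrative, not FfDL's actual interface, and note that this only avoids GPU contention inside the pod; it does not release the GPU from the pod's Kubernetes allocation:

```python
import os

# Hypothetical env vars a launcher could export; not FfDL's actual interface.
role = os.environ.get("JOB_NAME", "worker")
task_index = int(os.environ.get("TASK_INDEX", "0"))

if role == "ps":
    # Hide CUDA devices before importing TensorFlow so the ps process never
    # touches the pod's GPU. The GPU is still allocated to the pod by
    # Kubernetes, so this avoids contention but not the scheduling cost.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf

# Placeholder addresses, as in the earlier sketch.
cluster = tf.train.ClusterSpec({
    "ps": ["learner-0.example.com:2222"],
    "worker": ["learner-1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name=role, task_index=task_index)

if role == "ps":
    server.join()  # a parameter server only serves variables; block forever
```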
I am not sure I follow "in the native TensorFlow framework, the learner container doesn't distinguish ps and worker". The launcher.py code is responsible for that: https://github.com/fplk/FfDL/blob/merge_20180514_1536/etc/examples/tf-distributed/launcher.py#L42. It basically marks the first n (by default 1) replicas as ps and the others as workers.
I can provide more in-depth detail if you think the above does not make sense.
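For illustration, the rank-to-role mapping described above boils down to something like the following sketch; the names are illustrative, not the exact code in launcher.py:

```python
# NUM_PS replicas become parameter servers; the rest become workers.
NUM_PS = 1  # default: the first replica is the ps

def assign_role(rank, num_ps=NUM_PS):
    """Map a replica's rank to a TensorFlow (job_name, task_index) pair."""
    if rank < num_ps:
        return "ps", rank            # ranks 0..num_ps-1 -> parameter servers
    return "worker", rank - num_ps   # remaining ranks -> workers

# With 3 replicas: 0 -> ('ps', 0), 1 -> ('worker', 0), 2 -> ('worker', 1)
print([assign_role(r) for r in range(3)])
```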
Thanks @atinsood
@cloustone In short, yes, in the current implementation there is a penalty to be paid for using a GPU for the PS. We would love to schedule a call and discuss options with your team.
Thanks @animeshsingh @atinsood
We understand the current status and hope a great PR or improvement will be available soon.
Our team wants to push FfDL into a production environment; however, there is a lot of work to be done. We should keep in close touch, and we also hope to get help from the FfDL team.
Anyway, it's a little difficult for our team to speak English fluently, so mail or WeChat would be a better way :)
My mail address: [email protected]
Thanks @cloustone
The next PR we have in the pipeline for the 0.1 release candidate, #79, is much closer to what you need. I will send a mail to your address and we can iterate.
Closed with #79