Comments (4)
As I understand it, there is a single update op placed on the PS device, and at every step each worker calls this op once the ops it depends on have finished.
I'm sorry, I clicked the close button because there was some water on my touchpad...
from benchmarks.
@suiyuan2009 I am asking the team about this problem.
Edit: It looks like Reed is looking at it and responded to the post in tensorflow/tensorflow. Keeping this one assigned to me so you can ping me if things get stale.
We should probably only increment global_step as chief. That way there will be no performance impact from locking, and it makes more sense IMO for the global_step to increment once per step.
@zheng-xq what do you think? Also, can this problem cause a deadlock, or would it just occasionally cause global_step to be incremented less often than it should?
@suiyuan2009 can you give the exact command line arguments you used on each worker and PS to run tf_cnn_benchmarks?
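To make the chief-only suggestion above concrete, here is a toy sketch in plain Python threads, not TensorFlow code. It only models the counting: the function name `run_training`, the `chief_only` switch, and the choice of task index 0 as chief are assumptions for illustration, and the lock merely stands in for whatever locking the shared variable uses.

```python
import threading


def run_training(num_workers, num_steps, chief_only):
    """Toy model: return the final global_step after all workers finish."""
    global_step = 0
    lock = threading.Lock()  # stand-in for locking on the shared variable

    def worker(task_index):
        nonlocal global_step
        is_chief = task_index == 0  # assumption: task 0 acts as chief
        for _ in range(num_steps):
            # ... a real worker would compute and apply gradients here ...
            if chief_only and not is_chief:
                continue  # non-chief workers never touch global_step
            with lock:
                global_step += 1

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return global_step


if __name__ == "__main__":
    print(run_training(num_workers=4, num_steps=100, chief_only=False))
    print(run_training(num_workers=4, num_steps=100, chief_only=True))
```

With every worker incrementing, the counter ends at workers times steps; with only the chief incrementing, it ends at the number of steps and the other workers never contend for the lock.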
I lost the job history. It happens when there are many workers; the commands were like this.
ps
python3 /home/dongziming/repos/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --model resnet101 --server_protocol grpc+verbs --num_batches 600 --num_gpus 4 \
  --variable_update distributed_replicated --local_parameter_device gpu --ps_hosts 10.9.8.120:34213 \
  --worker_hosts 10.9.8.119:33421 --job_name ps --task_index 0
worker
python3 /home/dongziming/repos/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --model resnet101 --server_protocol grpc+verbs --num_batches 600 --num_gpus 4 \
  --variable_update distributed_replicated --local_parameter_device gpu --ps_hosts 10.9.8.120:34213 \
  --worker_hosts 10.9.8.119:33421 --job_name worker --task_index 0
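Since the two commands differ only in `--job_name` (and, with more tasks, `--task_index`), a small helper can generate them for a larger cluster. This is a rough sketch: `build_command` and its parameters are hypothetical names, the flag values are copied from the commands above, and joining several hosts with commas is an assumption about how the host lists are passed.

```python
def build_command(job_name, task_index, ps_hosts, worker_hosts,
                  script="tf_cnn_benchmarks.py"):
    """Return the argv list for one PS or worker task (hypothetical helper)."""
    return [
        "python3", script,
        "--model", "resnet101",
        "--server_protocol", "grpc+verbs",
        "--num_batches", "600",
        "--num_gpus", "4",
        "--variable_update", "distributed_replicated",
        "--local_parameter_device", "gpu",
        # Assumption: multiple hosts are given as one comma-separated value.
        "--ps_hosts", ",".join(ps_hosts),
        "--worker_hosts", ",".join(worker_hosts),
        "--job_name", job_name,
        "--task_index", str(task_index),
    ]


if __name__ == "__main__":
    ps_hosts = ["10.9.8.120:34213"]
    worker_hosts = ["10.9.8.119:33421"]
    # One command per task in each job.
    for job, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for index in range(len(hosts)):
            print(" ".join(build_command(job, index, ps_hosts, worker_hosts)))
```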