Comments (15)
@pavelgolik I remember you had a similar problem, and solved it somehow? But you did not push any fixes to the code?
from returnn-experiments.
I'm not sure if this is the same error - the one I reported was related to the XLA devices. My (temporary) workaround was to filter them out by adding `if "XLA" in dev.device_type: continue` at the beginning of the for loop in the function `get_tf_list_local_devices`, here: https://github.com/rwth-i6/returnn/blob/master/TFUtil.py#L3382
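A minimal sketch of that filter, with a stand-in for TF's `DeviceAttributes` objects (the real loop in `get_tf_list_local_devices` iterates over the result of `device_lib.list_local_devices()`; the `Dev` type here is just for illustration):

```python
from collections import namedtuple

# Stand-in for TF's DeviceAttributes; only name/device_type matter here.
Dev = namedtuple("Dev", ["name", "device_type"])

def filter_xla_devices(devices):
    """Drop all XLA_* devices, keeping only plain CPU/GPU devices."""
    result = []
    for dev in devices:
        if "XLA" in dev.device_type:  # the suggested workaround
            continue
        result.append(dev)
    return result

devices = [
    Dev("/device:CPU:0", "CPU"),
    Dev("/device:GPU:0", "GPU"),
    Dev("/device:XLA_CPU:0", "XLA_CPU"),
    Dev("/device:XLA_GPU:0", "XLA_GPU"),
]
print([d.name for d in filter_xla_devices(devices)])
# -> ['/device:CPU:0', '/device:GPU:0']
```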
Anyway, this error is related to `MaxBytesInUse`. @Shilpil, are you setting `CUDA_VISIBLE_DEVICES` explicitly?
Yes, we also tried that by running it as:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./rnn.py 2019-librispeech-system/attention/base2.conv2l.specaug.curric3.config
```
Another possible workaround was to set `tf_log_memory_usage = False`.
The error:

```
InvalidArgumentError (see above for traceback): Cannot assign a device for operation global_tensor_mem_usage_deviceGPU1/MaxBytesInUse: node global_tensor_mem_usage_deviceGPU1/MaxBytesInUse (defined at /home/ubuntu/returnn/TFUtil.py:7748) was explicitly assigned to /device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1, /job:localhost/replica:0/task:0/device:XLA_GPU:2, /job:localhost/replica:0/task:0/device:XLA_GPU:3 ]. Make sure the device specification refers to a valid device.
```
So I wonder: why does TF have GPU:0 and XLA_GPU:0 available, but for the second GPU only XLA_GPU:1, not GPU:1?
I found TF #30748, where one Google dev states:

> XLA creates an XLA_GPU device for every [GPU] present on the system, whereas TF creates a GPU device only for GPUs suitable for compute (i.e. ignores slower GPUs), so that could explain what you're seeing. TF logs out "Ignoring visible gpu device" when it does this enhanced filtering, so you should see it in your logs.
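The mismatch in the error above can be checked mechanically. A hypothetical helper that, given the device names TF reports, lists the GPU indices that exist only as XLA_GPU (i.e. GPUs that TF filtered out for compute):

```python
import re

def xla_only_gpu_indices(device_names):
    """Return GPU indices that appear as XLA_GPU but not as plain GPU."""
    gpus, xla_gpus = set(), set()
    for name in device_names:
        m = re.search(r"device:(XLA_)?GPU:(\d+)$", name)
        if m:
            (xla_gpus if m.group(1) else gpus).add(int(m.group(2)))
    return sorted(xla_gpus - gpus)

# Device list taken from the error message above:
devices = [
    "/job:localhost/replica:0/task:0/device:CPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:0",
    "/job:localhost/replica:0/task:0/device:XLA_CPU:0",
    "/job:localhost/replica:0/task:0/device:XLA_GPU:0",
    "/job:localhost/replica:0/task:0/device:XLA_GPU:1",
    "/job:localhost/replica:0/task:0/device:XLA_GPU:2",
    "/job:localhost/replica:0/task:0/device:XLA_GPU:3",
]
print(xla_only_gpu_indices(devices))  # -> [1, 2, 3]
```

If this list is non-empty, the log should also contain the "Ignoring visible gpu device" message mentioned in the quote.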
Btw, is this for multi-GPU training? How are you actually doing this? Just making all the GPUs available to RETURNN will not be helpful at all (how would it be?). It will not use them.
Please check our documentation about multi-GPU training for how to actually do multi-GPU training.
Hi,
I installed Horovod and tested it by running:

```
mpirun -np 8 -mca pml ob1 -mca btl ^openib python3 returnn/demos/demo-horovod-mpi.py
```

But when I run the LibriSpeech config file with the following command:

```
mpirun -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x HOROVOD_TIMELINE -x DEBUG -mca pml ob1 -mca btl ^openib python3 rnn.py 2019-librispeech-system/attention/base2.conv2l.specaug.curric3.config ++use_horovod 1
```
I get this error:

```
InvalidArgumentError (see above for traceback): Cannot assign a device for operation GetDeviceAttr_1: node GetDeviceAttr_1 (defined at :67) was explicitly assigned to /job:localhost/replica:0/task:0/device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. Make sure the device specification refers to a valid device.
	 [[node GetDeviceAttr_1 (defined at :67) ]]
```
When I set `-np` to 1 it runs fine, but the other 3 GPUs are not utilized.
Can you post the full stdout somewhere?
In the function `get_device_attr`, there is `dev = dev.replace(":XLA_", ":")`. When you remove that, does it work then?
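For illustration, here is what that replace maps XLA device names to:

```python
# The line in get_device_attr rewrites XLA device names to plain ones:
dev = "/job:localhost/replica:0/task:0/device:XLA_GPU:1"
print(dev.replace(":XLA_", ":"))
# -> /job:localhost/replica:0/task:0/device:GPU:1
```

So an op placed via the rewritten name ends up assigned to GPU:1, which fails in a process that only sees GPU:0 plus XLA_GPU:1 - consistent with the error above.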
The idea in Horovod is that for N GPUs (either on a single host, or even multiple hosts), it will start N instances of the program (or Returnn), and each single instance only has access to one single GPU, and all the instances communicate with each other through special Horovod TF ops. So in your output, it seems like there is one instance of Returnn which actually sees two GPUs (GPU:0 and XLA_GPU:1).
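That one-GPU-per-process setup can be sketched without Horovod installed. The usual pattern is that each local rank is restricted to exactly one GPU, e.g. via `CUDA_VISIBLE_DEVICES` (inside Horovod the common TF equivalent is setting `visible_device_list` to `hvd.local_rank()`). The helper below is hypothetical, just to show the mapping:

```python
import os

def env_for_local_rank(local_rank, base_env=None):
    """Build the environment for one MPI/Horovod worker so that it
    sees exactly one GPU, which it then addresses as GPU:0."""
    env = dict(base_env if base_env is not None else os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(local_rank)
    return env

# 4 workers on one host -> each worker sees a different single GPU:
for rank in range(4):
    env = env_for_local_rank(rank, base_env={})
    print("rank", rank, "-> CUDA_VISIBLE_DEVICES =", env["CUDA_VISIBLE_DEVICES"])
```

With this setup, no single process should ever see both GPU:0 and XLA_GPU:1 at the same time.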
@pavelgolik Maybe you have some idea how this happens?
CC'ing @Gerstenberger, in case you might also have some idea.
Hi,
I am attaching the full output: output.txt
I have a small query: if we upgrade the AWS instance from 1 GPU to 4 GPUs, do we need to reinstall Horovod? Could that be the reason this error is thrown?
Thank you all for helping me with this issue.
Please try the workaround mentioned in #28 (comment).

> If we upgrade the instance on AWS from 1 GPU to 4 GPUs do we need to reinstall Horovod?

No, I don't think so.
I am attaching the full output.
output.txt
Thanks.
Some notes:

- `++use_horovod 4`: You don't need 4 here, this is just a True/False flag.
- `horovod_reduce_type = "param"` is probably more efficient (see the multi-GPU documentation).
- You use TF 1.13.2. Did you try another version? E.g. TF 1.14? Or maybe also an older one.
- In the function `get_device_attr`, there is `dev = dev.replace(":XLA_", ":")`. When you remove that, does it work then? If not, can you post the output log for that as well?
The workaround from #28 (comment) worked, thank you. It started running after I put `if "XLA" in dev.device_type: continue` at the beginning of the for loop in the function `get_tf_list_local_devices`.
But I could not see any acceleration in the timing. Is there anything I am missing?
Here is the command I used for 4 GPUs:

```
mpirun -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x HOROVOD_TIMELINE -x DEBUG -mca pml ob1 -mca btl ^openib python3 rnn.py 2019-librispeech-system/attention/base2.conv2l.specaug.curric3.config ++use_horovod 1
```
Here is the log for 4 GPUs:

```
pretrain epoch 43, step 20, cost:ctc 1.3570787483922613, cost:output/output_prob 1.0764964977539009, error:ctc 0.2901119335438125, error:decision 0.0, error:output/output_prob 0.21921641280641777, loss 652.19824, max_size:classes 48, max_size:data 1584, num_seqs 6, 1.883 sec/step, elapsed 0:01:38, exp. remaining 0:41:37, complete 3.80%
```

This is the log for 1 GPU:

```
pretrain epoch 43, step 73, cost:ctc 1.245547996649293, cost:output/output_prob 1.0171804541269012, error:ctc 0.2908366620540619, error:decision 0.0, error:output/output_prob 0.21115538477897644, loss 567.9448, max_size:classes 45, max_size:data 1626, num_seqs 6, 0.998 sec/step, elapsed 0:01:37, exp. remaining 0:41:00, complete 3.80%
```

In both of them it says 3.80% complete and the remaining time is around 41 minutes.
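For a rough comparison, assuming each of the 4 workers really processes distinct data, the raw throughput implied by the two log lines above is:

```python
# 4-GPU run: 4 workers x 6 seqs each, 1.883 sec/step
# 1-GPU run: 1 worker x 6 seqs, 0.998 sec/step
throughput_4gpu = 4 * 6 / 1.883   # seqs/sec across all workers
throughput_1gpu = 1 * 6 / 0.998
speedup = throughput_4gpu / throughput_1gpu
print(round(throughput_4gpu, 2), round(throughput_1gpu, 2), round(speedup, 2))
# -> 12.75 6.01 2.12
```

So the per-step slowdown from ~1.0 to ~1.9 sec already eats about half of the 4x parallelism, even before asking whether the workers actually split the epoch.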
Please help me if I am going wrong at any step.
I checked the GPU usage with the `nvidia-smi` command, and all 4 GPUs were being used to the maximum extent. No other programs were running in parallel.
Please check the documentation about multi-GPU training. I guess (but I cannot tell from your output) that there is a bottleneck in the dataset. Did you follow the documentation? Relevant here is this part of the log:

```
train epoch 1, finished after 2941 steps, 0:28:58 elapsed (99.3% computing time)
```

Look at the computing time in particular. That number measures how much relative time was spent inside TF `session.run`. If it is below 90% or so, it means that time was wasted elsewhere, e.g. on dataset loading.
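As a sanity check on how to read that number, assuming "computing time" is the time spent inside `session.run` divided by the total elapsed time:

```python
def computing_time_percent(session_run_seconds, elapsed_seconds):
    """Relative time spent inside TF session.run, as reported in the log."""
    return 100.0 * session_run_seconds / elapsed_seconds

# 0:28:58 elapsed = 1738 sec; 99.3% computing time implies
# roughly 0.993 * 1738 ~= 1726 sec were spent inside session.run:
print(round(computing_time_percent(1726.0, 1738.0), 1))  # -> 99.3
```

A value of 99.3% like in the quoted line means almost no time is lost outside TF, so the dataset would not be the bottleneck in that run.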
If that is the case, you can try to use `HDFDataset` instead. There is the tool `hdf_dump.py` to convert your dataset into an HDF dataset.
Otherwise, what `horovod_reduce_type` do you use now? And what `horovod_param_sync_step`? These are all very relevant (please check the documentation).
In the function `get_device_attr`, there is `dev = dev.replace(":XLA_", ":")`. When you remove that (instead of the workaround `if "XLA" in dev.device_type: continue`), does it work then? If not, can you post the output log for that as well?
Closing this now.
Please open a new issue in the Returnn repo (not in this experiments repo) if there are still problems, although I think this is fixed, or not?