
Comments (15)

albertz avatar albertz commented on May 24, 2024

@pavelgolik I remember you had a similar problem and solved it somehow? But you did not push any fixes to the code?

pavelgolik avatar pavelgolik commented on May 24, 2024

I'm not sure if this is the same error - the one I reported was related to the XLA devices. My (temporary) workaround was to filter them out by adding if "XLA" in dev.device_type: continue at the beginning of the for loop in the function get_tf_list_local_devices here: https://github.com/rwth-i6/returnn/blob/master/TFUtil.py#L3382
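
For illustration, a minimal standalone sketch of that kind of filtering (it queries TF's device_lib directly; this is just the idea, not the actual RETURNN code):

from tensorflow.python.client import device_lib

def get_non_xla_local_devices():
    """List the local TF devices, skipping the XLA_* entries (the workaround idea)."""
    devs = []
    for dev in device_lib.list_local_devices():
        if "XLA" in dev.device_type:  # e.g. XLA_CPU, XLA_GPU
            continue
        devs.append(dev)
    return devs

print([d.name for d in get_non_xla_local_devices()])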

Anyway, this error is related to MaxBytesInUse. @Shilpil, are you setting CUDA_VISIBLE_DEVICES explicitly?

Shilpil avatar Shilpil commented on May 24, 2024

Yes, we also tried that by running it as:
CUDA_VISIBLE_DEVICES = 0,1,2,3 ./rnn.py 2019-librispeech-system/attention/base2.conv2l.specaug.curric3.config

pavelgolik avatar pavelgolik commented on May 24, 2024

Another possible workaround was to set tf_log_memory_usage = False.

albertz avatar albertz commented on May 24, 2024

The error:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation global_tensor_mem_usage_deviceGPU1/MaxBytesInUse: node global_tensor_mem_usage_deviceGPU1/MaxBytesInUse (defined at /home/ubuntu/returnn/TFUtil.py:7748) was explicitly assigned to /device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1, /job:localhost/replica:0/task:0/device:XLA_GPU:2, /job:localhost/replica:0/task:0/device:XLA_GPU:3 ]. Make sure the device specification refers to a valid device.

So I wonder: why does TF have GPU:0 & XLA_GPU:0 available, but only XLA_GPU:1 and not GPU:1?

albertz avatar albertz commented on May 24, 2024

I found TF #30748, where one Google dev states:

XLA creates an XLA_GPU device for every GPU present on the system, whereas TF creates a GPU device only for GPUs suitable for compute (i.e. it ignores slower GPUs), so that could explain what you're seeing. TF logs out "Ignoring visible gpu device" when it does this enhanced filtering, so you should see it in your logs.

Btw, is this for multi-GPU training? How are you actually doing this? Just making all the GPUs available to Returnn will not help at all (how would it?). It will not use them.

Please check our documentation about multi-GPU training for how to actually do it.

Shilpil avatar Shilpil commented on May 24, 2024

Hi,
I installed Horovod and tested it by running mpirun -np 8 -mca pml ob1 -mca btl ^openib python3 returnn/demos/demo-horovod-mpi.py

But when I run the LibriSpeech config file with the following command:

mpirun -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x HOROVOD_TIMELINE -x DEBUG -mca pml ob1 -mca btl ^openib python3 rnn.py 2019-librispeech-system/attention/base2.conv2l.specaug.curric3.config ++use_horovod 1

I get this error:

InvalidArgumentError (see above for traceback): Cannot assign a device for operation GetDeviceAttr_1: node GetDeviceAttr_1 (defined at :67) was explicitly assigned to /job:localhost/replica:0/task:0/device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. Make sure the device specification refers to a valid device.
[[node GetDeviceAttr_1 (defined at :67) ]]

When I set -np to 1, it runs fine, but the other 3 GPUs are not utilized.

albertz avatar albertz commented on May 24, 2024

Can you post the full stdout somewhere?

In the function get_device_attr, there is dev = dev.replace(":XLA_", ":"). When you remove that, does it work then?

The idea in Horovod is that for N GPUs (either on a single host, or across multiple hosts), it will start N instances of the program (or Returnn); each instance has access to only one single GPU, and all the instances communicate with each other through special Horovod TF ops. So in your output, it seems like there is one instance of Returnn which actually sees two GPUs (GPU:0 and XLA_GPU:1).
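
For illustration, a minimal sketch of this one-process-per-GPU setup with the TF 1.x Horovod API (this follows the standard Horovod example code, it is not the RETURNN implementation):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU, launched e.g. via mpirun -np 4

# Pin each process to exactly one GPU, selected by its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

with tf.Session(config=config) as session:
    # Each process only sees "its" GPU here; the processes exchange
    # gradients/parameters through Horovod ops (e.g. hvd.DistributedOptimizer).
    print("rank %d of %d uses local GPU %d" % (hvd.rank(), hvd.size(), hvd.local_rank()))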

@pavelgolik Maybe you have some idea how this happens?

kazuki-irie avatar kazuki-irie commented on May 24, 2024

CC'ing @Gerstenberger, in case you have some idea as well.

Shilpil avatar Shilpil commented on May 24, 2024

Hi,
I am attaching the full output.
output.txt

I have a small query:
If we upgrade the instance on AWS from 1 GPU to 4 GPUs, do we need to reinstall Horovod?
Can this be the reason the error is thrown?

Thank you all for helping me with this issue.

pavelgolik avatar pavelgolik commented on May 24, 2024

Please try the work-around mentioned in #28 (comment)

If we upgrade the instance on AWS from 1 GPU to 4 GPUs do we need to reinstall Horovod?

No, I don't think so.

albertz avatar albertz commented on May 24, 2024

I am attaching the full output.
output.txt

Thanks.

Some notes:

  • ++use_horovod 4: You don't need 4 here, this is just a True/False flag.
  • horovod_reduce_type = "param" is probably more efficient (see the multi-GPU documentation and the config sketch below).
  • You are using TF 1.13.2. Did you try another version, e.g. TF 1.14, or maybe also an older one?
  • In the function get_device_attr, there is dev = dev.replace(":XLA_", ":"). When you remove that, does it work then? If not, can you post the output log for that as well?
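
For reference, a sketch of the relevant RETURNN config entries mentioned in this thread (a RETURNN config is a Python file; the concrete values here are only examples, please check the multi-GPU documentation for what fits your setup):

# Multi-GPU training via Horovod (started with mpirun, one process per GPU).
use_horovod = True                # a True/False flag, not the number of GPUs
horovod_reduce_type = "param"     # reduce parameters instead of gradients
horovod_param_sync_step = 100     # example value only; see the documentation

# Workaround mentioned earlier in this thread for the MaxBytesInUse error:
tf_log_memory_usage = False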

Shilpil avatar Shilpil commented on May 24, 2024

Using the workaround from #28 (comment) worked, thank you. It started running after I put if "XLA" in dev.device_type: continue at the beginning of the for loop in the function get_tf_list_local_devices.
But I could not see any acceleration in the timing. Is there anything I am missing?

Here is the command I used for 4 GPUs:
mpirun -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x HOROVOD_TIMELINE -x DEBUG -mca pml ob1 -mca btl ^openib python3 rnn.py 2019-librispeech-system/attention/base2.conv2l.specaug.curric3.config ++use_horovod 1

Here is the log for 4 GPUs:
pretrain epoch 43, step 20, cost:ctc 1.3570787483922613, cost:output/output_prob 1.0764964977539009, error:ctc 0.2901119335438125, error:decision 0.0, error:output/output_prob 0.21921641280641777, loss 652.19824, max_size:classes 48, max_size:data 1584 , num_seqs 6, 1.883 sec/step, elapsed 0:01:38, exp. remaining 0:41:37, complete 3.80%

This is the log for 1 GPU:
pretrain epoch 43, step 73, cost:ctc 1.245547996649293, cost:output/output_prob 1.0171804541269012, error:ctc 0.2908366620540619, error:decision 0.0, error:output/output_prob 0.21115538477897644, loss 567.9448, max_size:classes 45, max_size:data 1626, num_seqs 6, 0.998 sec/step, elapsed 0:01:37, exp. remaining 0:41:00, complete 3.80%

In both of them it says 3.80% complete and the remaining time is around 41 minutes.
Please help me if I am going wrong at any step.

I checked the GPU usage with the nvidia-smi command, and all 4 GPUs were being used to the maximum extent. No other programs were running in parallel.

albertz avatar albertz commented on May 24, 2024

Please check the documentation about multi-GPU training. I guess (but I cannot tell from your output) that there is a bottleneck in the dataset. Did you follow the documentation? What is relevant here is to check this part of the log:

train epoch 1, finished after 2941 steps, 0:28:58 elapsed (99.3% computing time)

Look at the computing time in particular. That number measures how much relative time was spent inside TF session.run. If it is below 90% or so, it means that some time was wasted elsewhere, e.g. in the dataset loading.

If that is the case, you can try to use HDFDataset instead. There is the tool hdf_dump.py to convert your dataset into an HDF dataset.
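
For illustration, once the data has been dumped, the train dataset entry in the config could look roughly like this (a sketch only; the file path is a placeholder):

# Sketch: read the training data from a pre-dumped HDF file,
# so that data loading is cheap and less likely to be the bottleneck.
train = {
    "class": "HDFDataset",
    "files": ["/path/to/train.hdf"],  # placeholder; output of hdf_dump.py
}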

Otherwise, what horovod_reduce_type do you use now? And what horovod_param_sync_step? These are all very relevant (please check the documentation).

In the function get_device_attr, there is dev = dev.replace(":XLA_", ":"). When you remove that, does it work then? (Instead of the workaround if "XLA" in dev.device_type: continue.) If not, can you post the output log for that as well?
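
To make that concrete, the suggested experiment is simply to drop the renaming of the XLA device strings, roughly like this (an illustrative stand-in, not the actual TFUtil code):

def normalize_device_name(dev):
    """Illustrative stand-in for the device-name handling in get_device_attr."""
    # Current behavior maps e.g. "/device:XLA_GPU:1" to "/device:GPU:1".
    # The suggestion is to remove that line and keep the name as TF reports it:
    # dev = dev.replace(":XLA_", ":")
    return dev

print(normalize_device_name("/job:localhost/replica:0/task:0/device:XLA_GPU:1"))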

albertz avatar albertz commented on May 24, 2024

Closing this now.
Please open a new issue in the Returnn repo (not in this experiments repo) if there are still problems, although I think this is fixed now, isn't it?
