
Comments (15)

albertz avatar albertz commented on May 24, 2024

@pavelgolik I remember you had a similar problem and solved it somehow? But you did not push any fixes to the code?

pavelgolik avatar pavelgolik commented on May 24, 2024

I'm not sure if this is the same error - the one I reported was related to the XLA devices. My (temporary) workaround was to filter them out by adding if "XLA" in dev.device_type: continue at the beginning of the for loop in the function get_tf_list_local_devices here: https://github.com/rwth-i6/returnn/blob/master/TFUtil.py#L3382
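
For illustration, a minimal standalone sketch of that kind of filtering (it queries TF's device_lib directly; this is just the idea, not the actual RETURNN code):

from tensorflow.python.client import device_lib

def get_non_xla_local_devices():
    """List the local TF devices, skipping the XLA_* entries (the workaround idea)."""
    devs = []
    for dev in device_lib.list_local_devices():
        if "XLA" in dev.device_type:  # e.g. XLA_CPU, XLA_GPU
            continue
        devs.append(dev)
    return devs

print([d.name for d in get_non_xla_local_devices()])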

Anyway, this error is related to MaxBytesInUse. @Shilpil, are you setting CUDA_VISIBLE_DEVICES explicitly?

Shilpil avatar Shilpil commented on May 24, 2024

Yes, we also tried that by running it as:
CUDA_VISIBLE_DEVICES = 0,1,2,3 ./rnn.py 2019-librispeech-system/attention/base2.conv2l.specaug.curric3.config

pavelgolik avatar pavelgolik commented on May 24, 2024

Another possible workaround was to set tf_log_memory_usage = False.

albertz avatar albertz commented on May 24, 2024

The error:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation global_tensor_mem_usage_deviceGPU1/MaxBytesInUse: node global_tensor_mem_usage_deviceGPU1/MaxBytesInUse (defined at /home/ubuntu/returnn/TFUtil.py:7748) was explicitly assigned to /device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1, /job:localhost/replica:0/task:0/device:XLA_GPU:2, /job:localhost/replica:0/task:0/device:XLA_GPU:3 ]. Make sure the device specification refers to a valid device.

So I wonder: why does TF have GPU:0 & XLA_GPU:0 available, but only XLA_GPU:1 and not GPU:1?

albertz avatar albertz commented on May 24, 2024

I found TF #30748, where one Google dev states:

XLA creates an XLA_GPU device for every GPU present on the system, whereas TF creates a GPU device only for GPUs suitable for compute (i.e. it ignores slower GPUs), so that could explain what you're seeing. TF logs out "Ignoring visible gpu device" when it does this enhanced filtering, so you should see it in your logs.

Btw, is this for multi-GPU training? How are you actually doing this? Just making all the GPUs available to Returnn will not help at all (how would it?). It will not use them.

Please check our documentation about multi-GPU training for how to actually do it.

Shilpil avatar Shilpil commented on May 24, 2024

Hi,
I installed Horovod and tested it by running mpirun -np 8 -mca pml ob1 -mca btl ^openib python3 returnn/demos/demo-horovod-mpi.py

But when I run the LibriSpeech config file with the following command:

mpirun -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x HOROVOD_TIMELINE -x DEBUG -mca pml ob1 -mca btl ^openib python3 rnn.py 2019-librispeech-system/attention/base2.conv2l.specaug.curric3.config ++use_horovod 1

I get this error:

InvalidArgumentError (see above for traceback): Cannot assign a device for operation GetDeviceAttr_1: node GetDeviceAttr_1 (defined at :67) was explicitly assigned to /job:localhost/replica:0/task:0/device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. Make sure the device specification refers to a valid device.
[[node GetDeviceAttr_1 (defined at :67) ]]

When I set -np to 1, it runs fine, but the other 3 GPUs are not utilized.

albertz avatar albertz commented on May 24, 2024

Can you post the full stdout somewhere?

In the function get_device_attr, there is dev = dev.replace(":XLA_", ":"). When you remove that, does it work then?

The idea in Horovod is that for N GPUs (either on a single host, or across multiple hosts), it will start N instances of the program (or Returnn); each instance has access to only one single GPU, and all the instances communicate with each other through special Horovod TF ops. So in your output, it seems like there is one instance of Returnn which actually sees two GPUs (GPU:0 and XLA_GPU:1).
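
For illustration, a minimal sketch of this one-process-per-GPU setup with the TF 1.x Horovod API (this follows the standard Horovod example code, it is not the RETURNN implementation):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU, launched e.g. via mpirun -np 4

# Pin each process to exactly one GPU, selected by its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

with tf.Session(config=config) as session:
    # Each process only sees "its" GPU here; the processes exchange
    # gradients/parameters through Horovod ops (e.g. hvd.DistributedOptimizer).
    print("rank %d of %d uses local GPU %d" % (hvd.rank(), hvd.size(), hvd.local_rank()))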

@pavelgolik Maybe you have some idea how this happens?

kazuki-irie avatar kazuki-irie commented on May 24, 2024

CC'ing @Gerstenberger, in case you have some idea as well.

Shilpil avatar Shilpil commented on May 24, 2024

Hi,
I am attaching the full output.
output.txt

I have a small query:
If we upgrade the instance on AWS from 1 GPU to 4 GPUs, do we need to reinstall Horovod?
Can this be the reason the error is thrown?

Thank you all for helping me with this issue.

pavelgolik avatar pavelgolik commented on May 24, 2024

Please try the work-around mentioned in #28 (comment)

If we upgrade the instance on AWS from 1 GPU to 4 GPUs do we need to reinstall Horovod?

No, I don't think so.

albertz avatar albertz commented on May 24, 2024

I am attaching the full output.
output.txt

Thanks.

Some notes:

  • ++use_horovod 4: You don't need 4 here, this is just a True/False flag.
  • horovod_reduce_type = "param" is probably more efficient (see the multi-GPU documentation and the config sketch below).
  • You are using TF 1.13.2. Did you try another version, e.g. TF 1.14, or maybe also an older one?
  • In the function get_device_attr, there is dev = dev.replace(":XLA_", ":"). When you remove that, does it work then? If not, can you post the output log for that as well?
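
For reference, a sketch of the relevant RETURNN config entries mentioned in this thread (a RETURNN config is a Python file; the concrete values here are only examples, please check the multi-GPU documentation for what fits your setup):

# Multi-GPU training via Horovod (started with mpirun, one process per GPU).
use_horovod = True                # a True/False flag, not the number of GPUs
horovod_reduce_type = "param"     # reduce parameters instead of gradients
horovod_param_sync_step = 100     # example value only; see the documentation

# Workaround mentioned earlier in this thread for the MaxBytesInUse error:
tf_log_memory_usage = False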

Shilpil avatar Shilpil commented on May 24, 2024

Using the workaround from #28 (comment) worked, thank you. It started running after I put if "XLA" in dev.device_type: continue at the beginning of the for loop in the function get_tf_list_local_devices.
But I could not see any acceleration in the timing. Is there anything I am missing?

Here is the command I used for 4 GPUs:
mpirun -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x HOROVOD_TIMELINE -x DEBUG -mca pml ob1 -mca btl ^openib python3 rnn.py 2019-librispeech-system/attention/base2.conv2l.specaug.curric3.config ++use_horovod 1

Here is the log for 4 GPUs:
pretrain epoch 43, step 20, cost:ctc 1.3570787483922613, cost:output/output_prob 1.0764964977539009, error:ctc 0.2901119335438125, error:decision 0.0, error:output/output_prob 0.21921641280641777, loss 652.19824, max_size:classes 48, max_size:data 1584 , num_seqs 6, 1.883 sec/step, elapsed 0:01:38, exp. remaining 0:41:37, complete 3.80%

This is the log for 1 GPU:
pretrain epoch 43, step 73, cost:ctc 1.245547996649293, cost:output/output_prob 1.0171804541269012, error:ctc 0.2908366620540619, error:decision 0.0, error:output/output_prob 0.21115538477897644, loss 567.9448, max_size:classes 45, max_size:data 1626, num_seqs 6, 0.998 sec/step, elapsed 0:01:37, exp. remaining 0:41:00, complete 3.80%

In both of them it says 3.80% complete and the remaining time is around 41 minutes.
Please help me if I am going wrong at any step.

I checked the GPU usage with the nvidia-smi command, and all 4 GPUs were being used to the maximum extent. No other programs were running in parallel.

albertz avatar albertz commented on May 24, 2024

Please check the documentation about multi-GPU training. I guess (but I cannot tell from your output) that there is a bottleneck in the dataset. Did you follow the documentation? What is relevant here is to check this part of the log:

train epoch 1, finished after 2941 steps, 0:28:58 elapsed (99.3% computing time)

Look at the computing time in particular. That number measures how much relative time was spent inside TF session.run. If it is below 90% or so, it means that some time was wasted elsewhere, e.g. in the dataset loading.

If that is the case, you can try to use HDFDataset instead. There is the tool hdf_dump.py to convert your dataset into an HDF dataset.
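
For illustration, once the data has been dumped, the train dataset entry in the config could look roughly like this (a sketch only; the file path is a placeholder):

# Sketch: read the training data from a pre-dumped HDF file,
# so that data loading is cheap and less likely to be the bottleneck.
train = {
    "class": "HDFDataset",
    "files": ["/path/to/train.hdf"],  # placeholder; output of hdf_dump.py
}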

Otherwise, what horovod_reduce_type do you use now? And what horovod_param_sync_step? These are all very relevant (please check the documentation).

In the function get_device_attr, there is dev = dev.replace(":XLA_", ":"). When you remove that, does it work then? (Instead of the workaround if "XLA" in dev.device_type: continue.) If not, can you post the output log for that as well?
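
To make that concrete, the suggested experiment is simply to drop the renaming of the XLA device strings, roughly like this (an illustrative stand-in, not the actual TFUtil code):

def normalize_device_name(dev):
    """Illustrative stand-in for the device-name handling in get_device_attr."""
    # Current behavior maps e.g. "/device:XLA_GPU:1" to "/device:GPU:1".
    # The suggestion is to remove that line and keep the name as TF reports it:
    # dev = dev.replace(":XLA_", ":")
    return dev

print(normalize_device_name("/job:localhost/replica:0/task:0/device:XLA_GPU:1"))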

albertz avatar albertz commented on May 24, 2024

Closing this now.
Please open a new issue in the Returnn repo (not in this experiments repo) if there are still problems, although I think this is fixed now, isn't it?
