Hi, I am training on my own 5000h corpus using the librispeech setup on 1 GPU with no cha…

loss nan and cost nan while running my own corpus using librispeech sets

rwth-i6 commented on May 24, 2024

loss nan and cost nan while running my own corpus using librispeech sets

from returnn-experiments.

Comments (10)

yanghongjiazheng commented on May 24, 2024

It happens on pretrain epoch 38 step 3379

Spotlight0xff commented on May 24, 2024

Hi,
what config are you using? this one?
Also what TF/CUDA/CUDNN versions do you have?

albertz commented on May 24, 2024

What TF version do you use? Can you try with TF 2.3? (Maybe related)

Note that the learning rate warmup is only for the first 10 epochs (or 15 epochs after my later change). Warmup is not the same as pretrain. Check learning_rates in your config, which defines the warmup. Do you already have it for 15 epochs? You might try to increase it even more. Or use a newer config, like the one linked by @Spotlight0xff.
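To illustrate what a longer warmup could look like, here is a minimal sketch of a per-epoch learning rate list, as it might appear in a RETURNN config (which is a Python file). The helper name linear_warmup and the concrete values (0.0003 start, 0.0008 peak, 15 epochs) are assumptions for illustration only, not taken from the config discussed in this thread; check your own config for the actual numbers.

```python
def linear_warmup(start_lr, peak_lr, num_epochs):
    """Return one learning rate per epoch, rising linearly from start_lr to peak_lr.

    Hypothetical helper for illustration; not part of RETURNN itself.
    """
    if num_epochs < 2:
        return [peak_lr]
    step = (peak_lr - start_lr) / (num_epochs - 1)
    return [start_lr + i * step for i in range(num_epochs)]


# Assumed values, for illustration only.
learning_rate = 0.0008  # learning rate used once the warmup is over
learning_rates = linear_warmup(0.0003, learning_rate, 15)  # warmup over the first 15 epochs
```

To lengthen the warmup further, raise the last argument (for example to 20 or 25). A lower starting value is another thing to try if the NaN still shows up early in training.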

yanghongjiazheng commented on May 24, 2024

> what config are you using? this one?

I use this one

> Also what TF/CUDA/CUDNN versions do you have?

TF version is 1.8
CUDA version is 9.0

yanghongjiazheng commented on May 24, 2024

The warmup in my config is still 10 epochs.

yanghongjiazheng commented on May 24, 2024

So the NaN problem is due to the large amount of training data?

yanghongjiazheng commented on May 24, 2024

I encountered this problem after adding 3000h of training data. When I trained with the same configuration on the 2000h corpus, the NaN problem did not happen.

albertz commented on May 24, 2024

I would recommend using one of our newer configs and increasing the learning rate warmup.

Spotlight0xff commented on May 24, 2024

Also you should update your TF and CUDA.

christophmluscher commented on May 24, 2024

This issue seems outdated. I will close. If necessary feel free to reopen.
