
Comments (5)

albertz commented on May 24, 2024

It seems a very related question was also asked in #42.
I have not tried unidirectional LSTMs in the encoder yet, so I don't know. You probably should play around with all the available hyperparameters, e.g.:

  • Learning rate (initial at warmup, and highest learning rate after warmup)
  • Learning rate warmup length (num epochs)
  • Pretraining. E.g. the starting number of layers (try 2). The initial time reduction (try increasing it, e.g. 6, 8, 16, or even 32). Try making the pretraining longer (more repetitions). Etc.
  • No or less SpecAugment in the beginning.
  • Higher batch size in the beginning. Or gradient accumulation in the beginning.
  • Curriculum learning, i.e. the epoch_wise_filter option.
  • ...

Let this smallest network, with the highest time reduction, a high batch size, and little or no SpecAugment, train like that for as long as needed before increasing anything. This small network should first reach some halfway decent score. Only when you see that should the pretraining increase the depth and other things, and only carefully and slowly (such that the network does not totally break again).
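
To make this concrete, here is a rough sketch of how some of these knobs could look together in a RETURNN-style config. Only learning_rate / learning_rates / pretrain follow the exact pattern used in the configs below; accum_grad_multiple_step, the epoch_wise_filter thresholds, and the stub construction algo are illustrative assumptions and need to be adapted to your own setup:

    import numpy


    def custom_construction_algo(idx, net_dict):
        # Placeholder: the real construction algo (as in the returnn-experiments
        # configs) would shrink the network for early pretrain epochs (fewer layers,
        # higher time reduction) and stop pretraining once the full network is built.
        return net_dict


    # Learning rate warmup: linear ramp over the first 15 (sub)epochs, then the peak rate.
    learning_rate = 0.0008
    learning_rates = list(numpy.linspace(0.0003, learning_rate, num=15))

    # Pretraining: more repetitions keep each pretrain construction step alive for longer.
    pretrain = {"repetitions": 5, "construction_algo": custom_construction_algo}

    # Gradient accumulation (assumed option) to emulate a larger effective batch size
    # in the beginning.
    accum_grad_multiple_step = 2

    # Curriculum learning: epoch_wise_filter is an option of the train dataset and
    # restricts which sequences are used in early epochs (the threshold here is a
    # made-up placeholder).
    epoch_wise_filter = {(1, 5): {"max_mean_len": 50}}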


manish-kumar-garg commented on May 24, 2024

Thanks @albertz for suggesting these.
I trained the following models up to the end of pretraining (45 epochs), with the following loss observations:

  1. Base model - asr_2018_attention - with the following hyperparameters:
     pretrain = {"repetitions": 5, "construction_algo": custom_construction_algo}
     learning_rate = 0.0008
     learning_rates = list(numpy.linspace(0.0003, learning_rate, num=15))  # warmup
     (loss curve plot)

  2. Uni LSTM, size 1024, with all hyperparameters the same as 1:
     (loss curve plot)

  3. Uni LSTM, size 1024, with all hyperparameters the same as 1 except:
     pretrain = {"repetitions": 7, "construction_algo": custom_construction_algo}
     (loss curve plot)

  4. Uni LSTM, size 1024, with all hyperparameters the same as 1 except:
     learning_rate = 0.0005
     (loss curve plot)

  5. Uni LSTM, size 1024, with all hyperparameters the same as 1 except a warmup length of 10 epochs:
     learning_rates = list(numpy.linspace(0.0003, learning_rate, num=10))  # warmup
     (loss curve plot)

  6. Uni LSTM, size 1024, with all hyperparameters the same as 1 except a warmup length of 20 epochs:
     learning_rates = list(numpy.linspace(0.0003, learning_rate, num=20))  # warmup
     In this case the loss becomes NaN after 10 epochs.

  7. Uni LSTM, size 1536, with all hyperparameters the same as 1:
     (loss curve plot)

All models use global attention.
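
(As a small aside, only restating the configs above: the warmup is a linear ramp over the first num epochs, so experiments 5 and 6 differ only in how quickly the learning rate reaches its peak.)

    import numpy

    learning_rate = 0.0008
    warmup_10 = list(numpy.linspace(0.0003, learning_rate, num=10))  # experiment 5
    warmup_20 = list(numpy.linspace(0.0003, learning_rate, num=20))  # experiment 6
    # warmup_10 increases by ~5.6e-5 per epoch, warmup_20 by ~2.6e-5 per epoch.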


manish-kumar-garg commented on May 24, 2024

Seems like decreasing the learning rate helps.
Also, increasing the LSTM cell size to 1536 helps a bit, but not much.

What other combinations do you suggest trying next?


albertz commented on May 24, 2024

All of the things I already wrote (here), but basically everything else as well.


manish-kumar-garg commented on May 24, 2024

Lowering the learning rate to lr=0.0005 with lr_init=0.0002 worked for me.
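
For reference, a sketch of the working settings in the same style as the configs above (the warmup length of 15 is carried over from the base config as an assumption):

    import numpy

    learning_rate = 0.0005
    learning_rates = list(numpy.linspace(0.0002, learning_rate, num=15))  # warmup from lr_init=0.0002
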
Thanks!
