Comments (20)

albertz avatar albertz commented on May 24, 2024

Yes, this config trains from scratch.
There is also a small conv NN before the first encoder LSTM layer. Apart from that and SpecAugment, I think the configs are quite similar.
Maybe make a diff to see the exact differences.
The training time in this config is still the same number of epochs (12.5 full epochs), and the training speed should be similar (SpecAugment is very fast; the conv NN adds some small overhead here).
Note though that with SpecAugment, you can in principle train much longer, until convergence, i.e. make the learning rate scheduling more conservative. By training e.g. twice as long (25 full epochs), or even longer, you can still gain a lot of improvement. This is also what the original SpecAugment paper reports: they train for 600 full epochs! (One month of training time on a TPU cluster...) But even when training as long as before (12.5 epochs), you should still see a big improvement with SpecAugment.
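
Concretely, SpecAugment zeroes out random time spans and frequency bands of the input features. Here is a minimal numpy sketch (illustrative only; the actual transform function used by this config lives in the config file itself, and the mask sizes here are assumed):

```python
import numpy as np

def spec_augment(feats, max_time_w=20, max_freq_w=8, num_masks=2, rng=None):
    """Minimal SpecAugment-style masking sketch.

    feats: (time, freq) feature matrix; returns a copy with random
    time spans and frequency bands zeroed out.
    """
    rng = rng or np.random.default_rng()
    out = feats.copy()
    t, f = out.shape
    for _ in range(num_masks):
        w = int(rng.integers(0, max_time_w + 1))      # time-mask width
        t0 = int(rng.integers(0, max(t - w, 0) + 1))  # time-mask start
        out[t0:t0 + w, :] = 0.0
        w = int(rng.integers(0, max_freq_w + 1))      # freq-mask width
        f0 = int(rng.integers(0, max(f - w, 0) + 1))  # freq-mask start
        out[:, f0:f0 + w] = 0.0
    return out
```

Since the masking costs only a couple of slice assignments per sequence, it adds essentially no training-time overhead.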

Note that this config is not described in the corresponding paper, as I only later added it, for reference. But it is described (briefly) in our 2019 ASRU paper. (More configs here.)

from returnn-experiments.

akshatdewan avatar akshatdewan commented on May 24, 2024

Thanks! This sounds great, I will see how my training goes. Will post any updates here.

akshatdewan avatar akshatdewan commented on May 24, 2024

train-scores.data.txt
I just started a SpecAugment training with base2.conv2l.specaug.curric3, and I am seeing something curious here: the dev_error_ctc does not seem to be going down as fast as in my other trainings (without SpecAugment). It is still at 0.92 after 26 "epochs". Am I reading it wrong, or is this expected? Thanks!

akshatdewan avatar akshatdewan commented on May 24, 2024

Please disregard my previous comment, I think my data might be a mess.

akshatdewan avatar akshatdewan commented on May 24, 2024

Hello!

In order to reproduce your results, I tried training only with librispeech data.
I used:
returnn commit rwth-i6/returnn@bea4cb5 (bea4cb578a8c93c7d59a4d7e4898dc3eeaa042d0)
returnn-experiments commit 98cea81
base2.conv2l.specaug.curric3.config.

The train-scores.data does not show much improvement, and the training breaks at 53 "epochs" with the error below:

Model seems broken, got inf or nan score.
Accumulated scores: NumbersDict({'error:decision': 0.0, 'cost:ctc': inf, 'cost:output/output_prob': 304599.4338989258, 'error:ctc': 72092.0, 'error:output/output_prob': 57407.0, 'loss': inf})
Exception Exception('Inf/nan score in step 282.',) in step 282.

Are you familiar with this issue?
Thanks!

albertz avatar albertz commented on May 24, 2024

Can you check #34? There was a similar discussion.

akshatdewan avatar akshatdewan commented on May 24, 2024

Thanks Albert!
I am using Tesla V100 SXM2 32 GB for my experiments. TF 1.12.
I ran experiments with a higher number of steps (15 instead of 10) in the learning rate warmup, and also reduced the minimum learning rate to 0.0001. I will let it run for a few epochs and see.

akshatdewan avatar akshatdewan commented on May 24, 2024

Hello, after changing the number of warmup steps, I no longer get NaNs, but the training still seems to be off. Does this train-scores.txt make sense to you? Is it consistent with your experiments? The error rate is not coming down very fast.

albertz avatar albertz commented on May 24, 2024

It looks like it never really converges properly (from those scores). E.g. your last epoch:

88: EpochData(learningRate=0.00020334926626632013, error={
'dev_error_ctc': 0.9141972329849649,
'dev_error_decision': 0.0,
'dev_error_output/output_prob': 0.7254169391549435,
'dev_score_ctc': 6.506660251327039,
'dev_score_output/output_prob': 3.7826016050155156,
'train_error_ctc': 0.9481592299677715,
'train_error_decision': 0.0,
'train_error_output/output_prob': 0.7328658090259365,
'train_score_ctc': 6.724799936911294,
'train_score_output/output_prob': 3.7869126781015656,
}),

The CTC score/error should be much lower; the CTC error should definitely be below 50%. This should also happen fairly early in training, maybe after 10 epochs (depending on the epoch split, but e.g. such that one epoch corresponds to ~50h of train data).
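
To make the sub-epoch bookkeeping concrete (numbers assumed for illustration: 960h of LibriSpeech training data, epoch split of 20):

```python
total_hours = 960   # full LibriSpeech train set (assumed)
epoch_split = 20    # config's partitioning of one full epoch (assumed)

# One "epoch" in the training log is one partition of the full corpus:
hours_per_subepoch = total_hours / epoch_split   # ~48h, close to the ~50h above

# "12.5 full epochs" of training then corresponds to this many log "epochs":
subepochs_for_12_5_full = 12.5 * epoch_split
```

This is why epoch counts like 26 or 53 in the logs above are quoted: they are sub-epochs, not passes over the full corpus.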

Is this now for the original LibriSpeech data, or your own data? I would double-check that your data is correct, e.g. using dump-dataset. Also compare that to a run with the original LibriSpeech data.

Play around more with the pretraining (see custom_construction_algo). Let it start with 2 BLSTM encoder layers (StartNumLayers = 2), and increase the number of repetitions for this first pretrain step (there is something like idx = max(idx - 3, 0) # repeat first; increase that, i.e. maybe idx = max(idx - 6, 0)). You also have pretrain = {"repetitions": 5, ...}, which means that for the first 6 * 5 = 30 epochs, it will use that same network (2 encoder layers, 512 dims). This small encoder should converge fast, so 30 epochs should be more than enough. If the scores don't go down during these first 30 epochs, something is wrong.
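
A standalone sketch of that layer-growing schedule (hypothetical helper, not the actual custom_construction_algo from the config; orig_num_layers = 6 and the dimension-growing factor are assumed from the diff further below):

```python
def pretrain_schedule(idx, orig_num_layers=6, start_num_layers=2,
                      initial_dim_factor=0.5, repeat_first=6):
    """Return (num_lstm_layers, dim_frac) for pretrain construction step idx."""
    idx = max(idx - repeat_first, 0)  # repeat the first (small) network
    num_layers = min(idx + start_num_layers, orig_num_layers)
    # Layer dimensions grow linearly from initial_dim_factor to 1.0
    # as layers are added:
    grow_frac = 1.0 - float(orig_num_layers - num_layers) / (orig_num_layers - start_num_layers)
    dim_frac = initial_dim_factor + (1.0 - initial_dim_factor) * grow_frac
    return num_layers, dim_frac
```

With pretrain repetitions of 5, construction steps 0..6 all map to the same small 2-layer network, which is therefore used for the first few dozen epochs before deeper networks are constructed.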

akshatdewan avatar akshatdewan commented on May 24, 2024

The above scores are on the original LibriSpeech dataset. I did run dump-dataset in the past and it seemed to run fine; I will re-confirm that. Thanks, I will play around with custom_construction_algo.

albertz avatar albertz commented on May 24, 2024

You can also try to set the initial learning rate (of the learning rate warmup) lower, e.g. 0.0001.
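
In these configs the warmup is the learning_rates list; a sketch of what that looks like (the base learning_rate value of 0.0008 is assumed here for illustration):

```python
import numpy as np

learning_rate = 0.0008  # base rate; the actual value comes from the config

# Linear warmup from a low initial rate up to the base rate over the
# first 20 sub-epochs; afterwards the normal scheduling takes over.
learning_rates = list(np.linspace(0.0001, learning_rate, num=20))
```

Lowering the first value of this list (e.g. to 0.0001 as suggested) often helps when the loss blows up to inf/nan early in training.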

akshatdewan avatar akshatdewan commented on May 24, 2024

I already tried that (lower initial learning rate) in another experiment but that did not help either.

akshatdewan avatar akshatdewan commented on May 24, 2024

I ran dump-dataset and everything seems fine.

To re-validate my copy of the LibriSpeech data, I ran another experiment with an older config file, and the training error is coming down as expected. Here is the train-scores.data.

Your suggestion to "increase the number of repetitions for this first pretrain step" does not seem to lead to convergence either.
base2.conv2l.specaug.curric3.txt
train-scores.txt

Is there some glaring mistake in my config file, or do I just need to experiment more with the hyperparameters?

albertz avatar albertz commented on May 24, 2024

What is the difference between your older config, and your new config? Can you post a diff? (Skip the SpecAugment part, if that is also included there.)

akshatdewan avatar akshatdewan commented on May 24, 2024

I am very embarrassed: I was mistakenly using the audio-feature mean file for std_dev as well. I am redoing the experiment and hope that solves the issue.
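
For context, the features are normalized as (x - mean) / std, so pointing the std_dev option at the mean file silently rescales every channel by the wrong amount. A minimal numpy sketch of the effect (not RETURNN's actual dataset code):

```python
import numpy as np

def normalize(feats, mean, std_dev):
    # Per-channel feature normalization; mean and std_dev are loaded
    # from separate statistics files in the real setup.
    return (feats - mean) / std_dev

# If std_dev accidentally holds the mean vector, every channel is
# divided by the wrong value, which can stall or destabilize training.
```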

akshatdewan avatar akshatdewan commented on May 24, 2024

Unfortunately, I see no improvement after fixing my mean/std_dev bug either. Please find a diff below; < lines (red) are the old config and > lines (green) the new one (after removing the SpecAugment functions from the new config):

21c21
< if int(os.environ.get("DEBUG", "0")):
---
> if int(os.environ.get("RETURNN_DEBUG", "0")):
41a42
>         "use_cache_manager": not debug_mode,
56a58
>                     'use_new_filter': True, 
175c177,178
< "source": {"class": "eval", "eval": "tf.clip_by_value(source(0), -3.0, 3.0)"},
---
> "source": {"class": "eval", "eval": "self.network.get_config().typed_value('transform')(source(0), network=self.network)"},
> "source0": {"class": "split_dims", "axis": "F", "dims": (-1, 1), "from": "source"},  # (T,40,1)
177,179c180,189
< "lstm0_fw" : { "class": "rec", "unit": "nativelstm2", "n_out" : LstmDim, "direction": 1, "from": ["source"] },
< "lstm0_bw" : { "class": "rec", "unit": "nativelstm2", "n_out" : LstmDim, "direction": -1, "from": ["source"] },
< "lstm0_pool": {"class": "pool", "mode": "max", "padding": "same", "pool_size": (2,), "from": ["lstm0_fw", "lstm0_bw"], "trainable": False},
---
> # Lingvo: ep.conv_filter_shapes = [(3, 3, 1, 32), (3, 3, 32, 32)],  ep.conv_filter_strides = [(2, 2), (2, 2)]
> "conv0": {"class": "conv", "from": "source0", "padding": "same", "filter_size": (3, 3), "n_out": 32, "activation": None, "with_bias": True},  # (T,40,32)
> "conv0p": {"class": "pool", "mode": "max", "padding": "same", "pool_size": (1, 2), "from": "conv0"},  # (T,20,32)
> "conv1": {"class": "conv", "from": "conv0p", "padding": "same", "filter_size": (3, 3), "n_out": 32, "activation": None, "with_bias": True},  # (T,20,32)
> "conv1p": {"class": "pool", "mode": "max", "padding": "same", "pool_size": (1, 2), "from": "conv1"},  # (T,10,32)
> "conv_merged": {"class": "merge_dims", "from": "conv1p", "axes": "static"},  # (T,320)
> 
> "lstm0_fw" : { "class": "rec", "unit": "nativelstm2", "n_out" : LstmDim, "direction": 1, "from": ["conv_merged"] },
> "lstm0_bw" : { "class": "rec", "unit": "nativelstm2", "n_out" : LstmDim, "direction": -1, "from": ["conv_merged"] },
> "lstm0_pool": {"class": "pool", "mode": "max", "padding": "same", "pool_size": (3,), "from": ["lstm0_fw", "lstm0_bw"], "trainable": False},
187c197
< "lstm2_pool": {"class": "pool", "mode": "max", "padding": "same", "pool_size": (2,), "from": ["lstm2_fw", "lstm2_bw"], "trainable": False},
---
> "lstm2_pool": {"class": "pool", "mode": "max", "padding": "same", "pool_size": (1,), "from": ["lstm2_fw", "lstm2_bw"], "trainable": False},
219c229
<     "s": {"class": "rnn_cell", "unit": "LSTMBlock", "from": ["prev:target_embed", "prev:att"], "n_out": 1000},  # transform
---
>     "s": {"class": "rec", "unit": "nativelstm2", "from": ["prev:target_embed", "prev:att"], "n_out": 1000},  # transform
255,257c265,266
<     # We will first construct layer-by-layer, starting with 2 layers.
<     # Initially, we will use a higher reduction factor, and at the end, we will reduce it.
<     # Also, we will initially have not label smoothing.
---
>     StartNumLayers = 2
>     InitialDimFactor = 0.5
265,276c274,278
<     num_lstm_layers = idx + 2  # idx starts at 0. start with 2 layers
<     if idx == 0:
<         net_dict["lstm%i_fw" % (orig_num_lstm_layers - 1)]["dropout"] = 0
<         net_dict["lstm%i_bw" % (orig_num_lstm_layers - 1)]["dropout"] = 0
<     if idx >= 1:
<         num_lstm_layers -= 1  # repeat like idx=0, but now with dropout
<     # We will start with a higher reduction factor initially, for better convergence.
<     red_factor = 2 ** 5
<     if num_lstm_layers == orig_num_lstm_layers + 1:
<         # Use original reduction factor now.
<         num_lstm_layers = orig_num_lstm_layers
<         red_factor = orig_red_factor
---
>     net_dict["#config"] = {}
>     if idx < 4:
>         net_dict["#config"]["batch_size"] = 15000
>     idx = max(idx - 6, 0)  # repeat first
>     num_lstm_layers = idx + StartNumLayers  # idx starts at 0. start with N layers
280,296c282,285
<     # Use label smoothing only at the very end.
<     net_dict["output"]["unit"]["output_prob"]["loss_opts"]["label_smoothing"] = 0
<     # Other options during pretraining.
<     if idx == 0:
<       net_dict["#config"] = {"max_seq_length": {"classes": 60}}
<       net_dict["#repetition"] = 10
<     # Leave the last lstm layer as-is, but only modify its source.
<     net_dict["lstm%i_fw" % (orig_num_lstm_layers - 1)]["from"] = ["lstm%i_pool" % (num_lstm_layers - 2)]
<     net_dict["lstm%i_bw" % (orig_num_lstm_layers - 1)]["from"] = ["lstm%i_pool" % (num_lstm_layers - 2)]
<     if red_factor > orig_red_factor:
<         for i in range(num_lstm_layers - 2):
<             net_dict["lstm%i_pool" % i]["pool_size"] = (2,)
<         # Increase last pool-size to get the initial reduction factor.
<         assert red_factor % (2 ** (num_lstm_layers - 2)) == 0
<         last_pool_size = red_factor // (2 ** (num_lstm_layers - 2))
<         # Increase last pool-size to get the same encoder-seq-length folding.
<         net_dict["lstm%i_pool" % (num_lstm_layers - 2)]["pool_size"] = (last_pool_size,)
---
>     if num_lstm_layers == 2:
>         net_dict["lstm0_pool"]["pool_size"] = (orig_red_factor,)        
>     # Skip to num layers.
>     net_dict["encoder"]["from"] = ["lstm%i_fw" % (num_lstm_layers - 1), "lstm%i_bw" % (num_lstm_layers - 1)]
298c287
<     for i in range(num_lstm_layers - 1, orig_num_lstm_layers - 1):
---
>     for i in range(num_lstm_layers, orig_num_lstm_layers):
301c290,302
<         del net_dict["lstm%i_pool" % i]
---
>         del net_dict["lstm%i_pool" % (i - 1)]
>     # Thus we have layers 0 .. (num_lstm_layers - 1).
>     layer_idxs = list(range(0, num_lstm_layers))
>     layers = ["lstm%i_fw" % i for i in layer_idxs] + ["lstm%i_bw" % i for i in layer_idxs]
>     grow_frac = 1.0 - float(orig_num_lstm_layers - num_lstm_layers) / (orig_num_lstm_layers - StartNumLayers)
>     dim_frac = InitialDimFactor + (1.0 - InitialDimFactor) * grow_frac
>     for layer in layers:
>         net_dict[layer]["n_out"] = int(net_dict[layer]["n_out"] * dim_frac)
>         if "dropout" in net_dict[layer]:
>             net_dict[layer]["dropout"] *= dim_frac
>     net_dict["enc_value"]["dims"] = (AttNumHeads, int(EncValuePerHeadDim * dim_frac * 0.5) * 2)
>     # Use label smoothing only at the very end.
>     net_dict["output"]["unit"]["output_prob"]["loss_opts"]["label_smoothing"] = 0
304c305
< pretrain = {"repetitions": 5, "construction_algo": custom_construction_algo} #reduced number of reps
---
> pretrain = {"repetitions": 5, "copy_param_mode": "subset", "construction_algo": custom_construction_algo}
312a314
> accum_grad_multiple_step = 2
315c317
< stop_on_nonfinite_train_score = False
---
> #stop_on_nonfinite_train_score = False
319c321,322
< learning_rates = list(numpy.linspace(0.0003, learning_rate, num=10))  # warmup
---
> learning_rates = list(numpy.linspace(0.0001, learning_rate, num=20))  # warmup
> min_learning_rate = learning_rate / 50.

albertz avatar albertz commented on May 24, 2024

For reference, again, I think this is the new config, right?
And this is the old config, I guess.

Unfortunately, I see no improvement after fixing my mean/std_dev bug either. Please find a diff below; < lines (red) are the old config and > lines (green) the new one (after removing the SpecAugment functions from the new config)

I'm stripping that down to the relevant parts.

56a58
>                     'use_new_filter': True,

You might play around with this and related settings (epoch_wise_filter in the dataset). This is basically the curriculum learning, where the idea is that you only use the clean (simpler) and shorter sequences initially in training.
Did you remove parts of the diff here? Otherwise this is wrong: e.g. max_mean_len is different, and there are also several steps now. Please double-check; this is very important.

Next the diff covers the network, i.e. these changes:

  • SpecAugment
  • Small initial convolution network
  • Time reduction factor 6 (3 * 2; was 2 * 2 * 2 before)
  • Use NativeLstm2 in decoder

And then:

  • Slightly different pretraining logic. But the new one should be ok. It's exactly like the new config I linked?
    You can try to play around more with that. E.g. increase the initial batch size.
  • accum_grad_multiple_step = 2
  • longer learning rate warmup, starting with lower learning rate
  • min_learning_rate (probably minor effect)
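
The time-reduction change in the list above is just the product of the per-layer pool sizes (values taken from the pool_size entries in the diff):

```python
from math import prod

# lstm0/1/2_pool sizes along the time axis:
old_pools = [2, 2, 2]  # old config: overall time-reduction factor 8
new_pools = [3, 2, 1]  # new config: factor 6 (the conv frontend only pools in frequency)

assert prod(old_pools) == 8
assert prod(new_pools) == 6
```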

akshatdewan avatar akshatdewan commented on May 24, 2024

I was surprised to see this morning that even though there was not much improvement until epoch 36, the training error suddenly dropped at epoch 37, and it now seems to be decreasing as expected. train-scores.data.txt

So I am assuming the biggest problem was my mistake of using the mean file as std_dev.

Did you remove parts of the diff here? Otherwise this is wrong: e.g. max_mean_len is different, and there are also several steps now. Please double-check; this is very important.

To make the fewest possible changes to the config file, I had removed the other steps. Since the network now seems to be learning well, I will put the other steps back in and retrain with the latest config file.

Many thanks for your help.

albertz avatar albertz commented on May 24, 2024

I would recommend that in the epoch_wise_filter you also try the multiple steps, basically with the settings from the new config I linked. In my experiments, playing around with this was very fragile and had a huge effect on the final performance, on how fast it would converge, and on whether it would converge at all. So if the error only goes down at epoch 37, this does not sound optimal to me; it should go down much sooner.

akshatdewan avatar akshatdewan commented on May 24, 2024

Yes, thanks! Now I am using the epoch_wise_filter exactly as in the config file you linked:

d["epoch_wise_filter"] = {
    (1, 5): {
        'use_new_filter': True,
        'max_mean_len': 50,  # chars
        'subdirs': ['train-clean-100', 'train-clean-360']},
    (5, 10): {
        'use_new_filter': True,
        'max_mean_len': 150,  # chars
        'subdirs': ['train-clean-100', 'train-clean-360']},
    (11, 20): {
        'use_new_filter': True,
        'subdirs': ['train-clean-100', 'train-clean-360']},
}
