rwth-i6 / returnn-experiments

experiments with RETURNN

returnn-experiments's Introduction

This repo contains the configs and related files to be used with RETURNN (called CRNN earlier/internally) and RASR (called Sprint internally) for data preprocessing and decoding.

To use the RETURNN configs with other data, replace the train/dev config settings, which specify the train and dev corpus data. At the moment, they will use the ExternSprintDataset interface to get the preprocessed data out of RASR. You can also use other dataset implementations provided by RETURNN (see RETURNN doc / source code), e.g. the HDF format directly.
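
For example, a minimal sketch of such a replacement using the HDF-based dataset (the file names are placeholders):

train = {"class": "HDFDataset", "files": ["/path/to/train.hdf"]}
dev = {"class": "HDFDataset", "files": ["/path/to/dev.hdf"]}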

Most of these configs belong to publications from our group, Human Language Technology and Pattern Recognition Group (i6), RWTH Aachen University, Germany.

returnn-experiments's People

Contributors

albertz, christophmluscher, dependabot[bot], gerstenberger, jacktemaki, jotix16, kazuki-irie, marvin84, michelwi, mmz33, neolegends, ngtom, papar22, spotlight0xff, squarenabla, uralik, zettelkasten, zhouw321


returnn-experiments's Issues

2018-asr-attention/librispeech/attention/exp3.ctc.lm.config: target 'bpe' unknown

Hello,

I modified the returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/ pipeline to train an encoder-attention-decoder system on MEGS.
The result was a WER of ~20%. To improve the WER, I wanted to combine it with a language model.
So I modified 2018-asr-attention/librispeech/lm/i512_m2048_m2048.sgd_b64_lr0_cl2.newbobabs.d0.2.config to train a language model on MEGS.

The 2018-asr-attention/librispeech/lm/README says one should modify 2018-asr-attention/librispeech/attention/exp3.ctc.lm.config to combine the language model with the encoder-attention-decoder system.
Doing so and running inference leads to this error:

  File "/home/returnn_testenv/TFNetworkLayer.py", line 970, in _static_get_target_value
    line: assert network.extern_data.has_data(target), "target %r unknown" % target
    locals:
      network = <local> <TFNetwork '<network via transform_config_dict>' parent_net=<TFNetwork 'root' train=False search> train=False>
      network.extern_data = <local> <ExternData data={'classes': Data(name='classes', shape=(None,), dtype='int32', sparse=True, dim=10045, available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:classes']), 'data': Data(name='data', shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])}>
      network.extern_data.has_data = <local> <bound method ExternData.has_data of <ExternData data={'classes': Data(name='classes', shape=(None,), dtype='int32', sparse=True, dim=10045, available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:classes']), 'data': Data(name='data', shape=(None, 40), batch_shape_meta=[B,T|'ti...
      target = <local> 'bpe'
AssertionError: target 'bpe' unknown

The reason for the error is that the target "bpe" is not part of the network's extern data.

The extern data looks like this:

<ExternData data={'classes': Data(name='classes', shape=(None,), dtype='int32', sparse=True, dim=10045, available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:classes']), 'data': Data(name='data', shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])}>

I am a little lost at this point. Does anyone have an idea what the actual problem is?
Thanks in advance!
Thanks in advance!
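
One plausible direction, sketched under the assumption that the combined config declares its inputs via extern_data: the LM part refers to a target named "bpe", so the search config would need to declare that target there, with the dim matching your BPE vocabulary size, e.g.:

extern_data = {
    "data": {"shape": (None, 40)},  # audio features
    "bpe": {"shape": (None,), "dim": 10025, "sparse": True},  # assumed vocab size
}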

Shallow fusion of LSTM LM

Hi,

I trained an LSTM LM using this config file and also an attention model using this config. The attention model alone gives impressive results (18.3% WER) on a test set composed of accented speech from multiple speakers in a conference setting. I am using it to build an ASR system for a UN specialized agency, with no intention of commercializing it.

Now I am trying to figure out how I can integrate the LM using shallow fusion (as mentioned in your paper "Improved training of end-to-end attention models for speech recognition"). There does not seem to be a layer_class for it like the KenLM one used here.

Could you please help me understand whether I need to define some special class in the RecLayer file, or whether I can do it differently?

Thanks for your great work.

Akshat
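
For reference, shallow fusion in these setups is done directly in the config rather than via a dedicated layer class: the LM is loaded as a subnetwork inside the decoder rec layer, and its log-probability is combined log-linearly with the ASR posterior in an eval layer. A rough sketch following the pattern of exp3.ctc.lm.config (the layer names, checkpoint path, and LM scale 0.3 are assumptions):

# inside the decoder "rec" unit of the config:
"lm_output": {"class": "subnetwork", "from": ["prev:output"],
              "load_on_init": "/path/to/lm/checkpoint",  # placeholder path
              "subnetwork": lm_network},  # the trained LM's network dict
"combo_output_prob": {"class": "eval", "from": ["output_prob", "lm_output"],
                      # log-linear combination with an assumed LM scale of 0.3
                      "eval": "safe_log(source(0)) + 0.3 * safe_log(source(1))"},
"output": {"class": "choice", "from": ["combo_output_prob"],
           "input_type": "log_prob", "beam_size": 12, "target": "bpe"},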

Asking for help with training an LM using RETURNN

Hi,
I added more training data for better performance. Because the corpus has changed, I want to train a new LM on my data. But the exact format of the data, such as "train": "/work/asr3/irie/data/librispeech/lm_bpe/librispeech-lm-norm.bpe.txt.gz", is not provided on Git. Can you offer me an example?
Second, when I run ./returnn/rnn.py **.config, a FileNotFoundError occurred: err_msg = "No such file or directory: 'cf'", len = 31. So I'd like to know the purpose of def cf(filename) in the config file. What is the cf command for?
Last but not least, I want to know why train_num_seqs = 40418260 when there are only 281241 sentences in the train dataset.
Thanks for your great work.
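
On the cf question: cf is an i6-internal cache-manager command that copies a file to fast local storage and prints the cached path. Outside the i6 cluster it does not exist, so a common workaround is to turn the config's helper into a no-op:

def cf(filename):
    """Cache-manager stub: outside the i6 cluster, just use the path as-is."""
    return filename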

query regarding lm_fusion_config

Hi guys, I am using https://github.com/rwth-i6/returnn-experiments/blob/master/2018-asr-attention/librispeech/attention/exp3.ctc.lm.config for LM fusion on my custom dataset. Without LM fusion, the hyps are good. But with LM fusion, for some utterances the predicted hyp is too long, mainly repeating the last symbols of the utterance. It happens with only a few utterances.
For example:
ref: play bi@@ king in ch@@ ob@@ k
hyp: play bi@@ king in show book
hyp_fusion: play bi@@ king in show bo@@ k bo@@ k bo@@ k bo@@ k bo@@ k bo@@ k bo@@ k bo@@ k
Any pointers to debug this issue?

GPU memory utilization puzzle

Hello,

I am using a Tesla V100 SXM2 32 GB and a GeForce GTX 1080 (separately) for my experiments with LibriSpeech, and I see some strange behaviour that I am failing to wrap my head around.

During my training experiments, for the same config file base2.conv2l.specaug.curric3.config, with a batch_size of 10000 and max_seqs of 200, I end up filling the GPU memory of both of the above-mentioned GPUs, even though the num_seqs processed per step is around 6 for both of them.

Training Experiment 1:
GPU - Tesla V100 SXM2 32 GB
GPU Memory used 31392 MB
Config file - base2.conv2l.specaug.curric3.config
batch_size - 10,000
max_seqs - 200
num_seqs/step ~ 6

Training Experiment 2:
GPU - GeForce GTX 1080
GPU Memory used 7989 MB
Config file - base2.conv2l.specaug.curric3.config
batch_size - 10,000
max_seqs - 200
num_seqs/step ~ 6

Can anyone help me understand what could be going on?
Thanks!
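
A likely explanation (an assumption about TensorFlow defaults, not a diagnosis of this particular setup): TF reserves almost all free GPU memory at startup, so the reported usage reflects the allocator pool, not the model's actual working set. If your RETURNN version supports tf_session_opts, on-demand allocation can be requested like this:

tf_session_opts = {"gpu_options": {"allow_growth": True}}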

loss nan and cost nan while running my own corpus using librispeech sets

Hi, I am training my own 5000 h corpus using the LibriSpeech setup on 1 GPU with no changes in configuration.
I am getting the logs below after warm-up. I have seen issue #34, but there the problem happened during warm-up, where changing the warm-up steps helps. What about my problem?
I hope you can help.

pretrain epoch 38, step 3374, cost:ctc 1.445432382337188, cost:output/output_prob 0.7627274919020337, error:ctc 0.29677418898791075, error:decision 0.0, error:output/output_prob 0.15967741690110415, loss 1369.0591, max_size:classes 25, max_size:data 1072, mem_usage:GPU:0 9.2GB, num_seqs 46, 0.932 sec/step, elapsed 1:33:33, exp. remaining 0:43:00, complete 68.50%
pretrain epoch 38, step 3375, cost:ctc 1.3635925442755905, cost:output/output_prob 0.6688257869468188, error:ctc 0.3215926594566554, error:decision 0.0, error:output/output_prob 0.15313936164602637, loss 1327.1692, max_size:classes 25, max_size:data 1083, mem_usage:GPU:0 9.2GB, num_seqs 46, 0.947 sec/step, elapsed 1:33:34, exp. remaining 0:42:59, complete 68.52%
pretrain epoch 38, step 3376, cost:ctc 1.702117337969625, cost:output/output_prob 0.9609764886744969, error:ctc 0.3573883334174752, error:decision 0.0, error:output/output_prob 0.21649485582020134, loss 2324.8809, max_size:classes 28, max_size:data 897, mem_usage:GPU:0 9.2GB, num_seqs 55, 0.927 sec/step, elapsed 1:33:37, exp. remaining 0:42:57, complete 68.54%
pretrain epoch 38, step 3377, cost:ctc 1.9066916597477077, cost:output/output_prob 1.2203216964220545, error:ctc 0.3672680299496278, error:decision 0.0, error:output/output_prob 0.22809277649503204, loss 2426.5625, max_size:classes 29, max_size:data 881, mem_usage:GPU:0 9.2GB, num_seqs 45, 0.800 sec/step, elapsed 1:33:38, exp. remaining 0:42:55, complete 68.57%
pretrain epoch 38, step 3378, cost:ctc 1.8165156518388414, cost:output/output_prob 1.2114319657550254, error:ctc 0.3988764085806906, error:decision 0.0, error:output/output_prob 0.2485955081647262, loss 2155.8987, max_size:classes 26, max_size:data 1340, mem_usage:GPU:0 9.2GB, num_seqs 37, 1.020 sec/step, elapsed 1:33:40, exp. remaining 0:42:54, complete 68.59%
pretrain epoch 38, step 3379, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9493293667910622, error:decision 0.0, error:output/output_prob 0.9493293667910622, loss nan, max_size:classes 27, max_size:data 1310, mem_usage:GPU:0 9.2GB, num_seqs 34, 0.969 sec/step, elapsed 1:33:42, exp. remaining 0:42:53, complete 68.60%
pretrain epoch 38, step 3380, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9534482723101974, error:decision 0.0, error:output/output_prob 0.9534482723101974, loss nan, max_size:classes 27, max_size:data 1837, mem_usage:GPU:0 9.2GB, num_seqs 27, 1.130 sec/step, elapsed 1:33:43, exp. remaining 0:42:52, complete 68.62%

option to provide fp16 for layer weights

Is there an option in RETURNN configs to initialize layer weights as floating point 16 instead of floating point 32? Or could it be modified in the RETURNN code somehow? I need it for testing model optimization.

Returnn doing layer optimizations

I was trying to debug a model I implemented in RETURNN, for which I needed the attention and RNN outputs at each timestep.

For this, I created a new layer with this line

 "att_debug": {"class": "copy", "from": ["prev:att_debug", "att"]},  # (B, H*V)

and added it after this line.

Later I was planning to add att_debug to the fetch dictionary.

But I was not able to find att_debug anywhere in the network topology. If I do

print( self.engine.network.layers["output"].cell.net.layers.keys() )

in this function, it doesn't show att_debug.

I think somehow RETURNN is optimizing this layer out, as it is not needed for the loss calculation.

I have tried various things like setting "optimize_move_layers_out" to False, but was not successful.

Is there a way to turn off optimizations in RETURNN?

Thanks in advance
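
One way to keep such a debug layer alive, sketched with the standard RETURNN flag for forcing a layer to be constructed even when nothing depends on it:

"att_debug": {"class": "copy", "from": ["prev:att_debug", "att"],
              "is_output_layer": True},  # prevents the layer from being pruned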

Error running 23_recog.sh with pre-trained model

I am trying to test out running the pretrained model on my system. I have downloaded the model from http://www-i6.informatik.rwth-aachen.de/~zeyer/models/librispeech/enc-dec/2018.zeyer.exp3.ctc/ and renamed the file network.238.data-00000-of-00001 to train-scores.data and put it in /data/exp-returnn, as well as moved and renamed the exp3.ctc.config to /returnn.config.

Upon running ./23_recog.sh I get the output:

You can run this already while it is still training.
Run it a final time when the training is finished.

experiment=returnn-pretrained
+ experiment=returnn-pretrained
test -e data/exp-$experiment  # experiment existing?
+ test -e data/exp-returnn-pretrained
test -e data/exp-$experiment/train-scores.data  # some epochs trained?
+ test -e data/exp-returnn-pretrained/train-scores.data

Get us some recommended epochs to do recog on.
epochs=$(./tools/recommend-recog-epochs.py --experiment $experiment)
./tools/recommend-recog-epochs.py --experiment $experiment
++ ./tools/recommend-recog-epochs.py --experiment returnn-pretrained
Experiment config: returnn-pretrained.config
EXCEPTION
Traceback (most recent call last):
  File "./tools/recommend-recog-epochs.py", line 46, in <module>
    line: train_scores_data = open(train_scores_fn).read()
    locals:
      train_scores_data = <not found>
      open = <builtin> <built-in function open>
      train_scores_fn = <local> 'data/exp-returnn-pretrained/train-scores.data', len = 45
      read = <not found>
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    line: return codecs.ascii_decode(input, self.errors)[0]
    locals:
      codecs = <global> <module 'codecs' from '/usr/lib/python3.5/codecs.py'>
      codecs.ascii_decode = <global> <built-in function ascii_decode>
      input = <local> b'{\xba[\xbegQ\x0f\xbf\x0b\xd4\xc0\xbe\x00\xe3:\xbe\xdbm\xfd\xbeaT\x1d\xbf\r\x14\x80>\xa4\xde\xe7\xbd\x0f\xd1\xff=\xd0\r\x86<X3\x03\xbfnv\xd7\xbe\xe7b\x89>\x05:\x86\xbe\x07G\xc7>\xda\xf6\x9d\xbe\x93\xf4\x85\xbe\xbe\x93\x8b\xbe\xfb\xeb\x06>\x9e\xe7\xa3>\xa8\x93\\=\x97O\xb5=\xfc\xd3\x96\xbe\x8e\xe3..., len = 751448632
      self = <local> <encodings.ascii.IncrementalDecoder object at 0x7f761d520ac8>
      self.errors = <local> 'strict', len = 6
UnicodeDecodeError: 'ascii' codec can't decode byte 0xba in position 1: ordinal not in range(128)
+ epochs=

The UnicodeDecodeError: 'ascii' codec can't decode byte 0xba in position 1: ordinal not in range(128) seems to be a problem loading data from the train-scores.data file. Is there something wrong in my workflow?

Audio + Text embedding alignment

Is it possible to create a vector space containing words in textual and audio form and aligning them so that both share the same vector representation?
The idea is that text-word-embeddings could be improved in further steps.

Same wav file has totally different beam scores when assigned to different batches

First of all, thanks for your great work. I have just encountered a problem: the same wav file will have totally different beam scores when it is assigned to different batches. The output_layer is "output" and the beam size is 16. The beam score is -4.7 when there is only one audio in the batch, and -103 when there are 26 audios, etc. Can you explain why this happens? Thanks a lot.

Output Shape for Intermediate Layers of Librispeech-Attention Config

Hi,
I was going through this attention config provided by you and trying to figure out the output shape of the intermediate layers by looking into the layer implementations. There are also several comments in the config, perhaps indicating the same (like (B, enc-T, H)).

However, at some places it seemed to me that the shape should be time-major where the comments mention it as batch-major. For example, since the output of recurrent layers is time-major, all of encoder, enc_ctx, inv_fertility and enc_value (as the axis argument is F) seemed time-major to me, going by the layer implementation. But the comment in the config states enc_value to be batch-major. I also ran the code and checked the log, which had batch_dim_axis as 1 for the output of these layers. Am I missing something here?

I also wanted to ask if there is any way to check the intermediate shapes at run time. I found a similar issue here which mentioned DumpLayer for this purpose. I checked out this layer, but it seemed to me that it was for Theano only.

Thanks for sharing the code and the framework.
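
Besides the Theano-only DumpLayer, the TF backend also has a "print" layer that can be interposed to log a tensor (including its shape) at run time, if available in your RETURNN version; a minimal sketch, with the layer name being an arbitrary choice:

"enc_value_print": {"class": "print", "from": ["enc_value"]},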

feeding decoded output into training network

How can we use the decision layer during training? I want to use the decoded beams generated from a first sub-network (frozen) as input to a second sub-network (trainable), so my complete network has one pre-trained and one trainable part. The "decision" layer can be used only during search time, as written in the RETURNN code.

EXCEPTION NetworkConstructionDependencyLoopException: Error: There is a dependency loop on layer 'accum_att_weights'.

2019-asr-local-attention configs.

I am getting NetworkConstructionDependencyLoopException while running 2019-asr-local-attention librispeech setup.
Full stdout here

A few lines from where the error starts:

EXCEPTION
NetworkConstructionDependencyLoopException: Error: There is a dependency loop on layer 'accum_att_weights'.
Construction stack (most recent first):
  accum_att_weights
  weight_feedback
  energy_in
  energy_tanh
  energy
  energy_reinterpreted
  att_weights
  att0
  att
  s
  readout_in
  readout
  output_prob

EXCEPTION
CannotHandleUndefinedSourcesException: 's_transformed': cannot handle undefined sources without defined out_type.
{'activation': None,
 'loss': None,
 'n_out': 1024,
 'network': <TFNetwork 'root/output:rec-subnet' parent_net=<TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>> train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>,
 'size_target': None,
 'sources': [None],
 'target': None,
 'with_bias': False}

Exception creating layer root/'output' of class RecLayer with opts:

char lm using the default config

I tried training a char LM using the default LM config with a vocab size of 39. I have a few queries:

  • How do I add the space character to the vocab?
  • I set "word_based": False; are any other changes required in the config to train a char model?

Also, after the first epoch it shows 'dev_score_output:exp': 2.9312321849827905, which doesn't seem reasonable.

n-gram shallow fusion

Your paper mentions shallow fusion of an n-gram LM with the end-to-end model; are the experiment details available in the repo?

LM fusion for a CTC model

I have trained a CTC model for LibriSpeech; the model config is similar to the encoder of this config:

https://github.com/rwth-i6/returnn-experiments/blob/master/2018-asr-attention/librispeech/full-setup-attention/returnn.config

I want to know 2 things

  1. How do I get beam search output during inference? (Something similar to https://github.com/rwth-i6/returnn-experiments/blob/master/2018-asr-attention/librispeech/full-setup-attention/tools/search.py)
  2. How can I combine it with the LM? (Is there any way provided by returnn?)

My efforts in these directions as of now are

  1. I'm able to get beam output by using tf.nn.ctc_beam_search_decoder inside the ctc layer and then pulling this operation with a script similar to get-attention-weights.py. I want to know if there's a better way.
  2. I can't find any good solution for this problem apart from using some external beam search decoder code.

Thanks a lot in advance.
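
For the first point, the plain TensorFlow decoder mentioned above looks roughly like this (tensor names are assumptions; inputs must be time-major logits):

import tensorflow as tf

# ctc_logits: [T, B, num_labels + 1] time-major logits from the CTC layer
# seq_lens: [B] encoder output lengths
decoded, log_probs = tf.nn.ctc_beam_search_decoder(
    inputs=ctc_logits, sequence_length=seq_lens, beam_width=16, top_paths=1)
best_path = tf.sparse.to_dense(decoded[0])  # [B, max_len] label ids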

some questions about lm config file

I hope to reproduce the results in this great paper: "Improved training of end-to-end attention models for speech recognition".

I need to train 2 models: 1) the seq2seq model, 2) an external LM, and they use the same vocabulary file.

I want to know which corpus is used to generate the vocabulary file ("trans.bpe.vocab.lm.txt").
Is it from the speech transcription corpus (only 40 MB),
or from the corpus of the external LM (e.g. the LibriSpeech LM corpus file is 4 GB)?

I hope I have expressed it clearly. Looking forward to your reply, thanks.

loss nan and cost nan while running full librispeech setup

Hi, I am training the full LibriSpeech setup on 1 GPU with no changes in configuration.
I am getting the logs below. Is this expected?

pretrain epoch 8, step 634, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9799331023823471, error:decision 0.0, error:output/output_prob 0.9799331023823471, loss nan, max_size:classes 57, max_size:data 1595, mem_usage:GPU:0 3.8GB, num_seqs 12, 3.044 sec/step, elapsed 0:30:28, exp. remaining 0:19:15, complete 61.27%
pretrain epoch 8, step 635, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9800000237300992, error:decision 0.0, error:output/output_prob 0.9800000237300992, loss nan, max_size:classes 52, max_size:data 1636, mem_usage:GPU:0 3.8GB, num_seqs 12, 3.082 sec/step, elapsed 0:30:31, exp. remaining 0:19:12, complete 61.39%
pretrain epoch 8, step 636, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9806678476743401, error:decision 0.0, error:output/output_prob 0.9806678476743401, loss nan, max_size:classes 59, max_size:data 1683, mem_usage:GPU:0 3.8GB, num_seqs 11, 3.192 sec/step, elapsed 0:30:34, exp. remaining 0:19:03, complete 61.60%
pretrain epoch 8, step 637, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9802513704635203, error:decision 0.0, error:output/output_prob 0.9802513704635203, loss nan, max_size:classes 57, max_size:data 1640, mem_usage:GPU:0 3.8GB, num_seqs 11, 3.111 sec/step, elapsed 0:30:37, exp. remaining 0:18:56, complete 61.78%
pretrain epoch 8, step 638, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.980902785086073, error:decision 0.0, error:output/output_prob 0.980902785086073, loss nan, max_size:classes 59, max_size:data 1677, mem_usage:GPU:0 3.8GB, num_seqs 11, 3.173 sec/step, elapsed 0:30:41, exp. remaining 0:18:51, complete 61.93%
pretrain epoch 8, step 639, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9810126828961073, error:decision 0.0, error:output/output_prob 0.9810126828961073, loss nan, max_size:classes 59, max_size:data 1656, mem_usage:GPU:0 3.8GB, num_seqs 12, 3.182 sec/step, elapsed 0:30:44, exp. remaining 0:18:48, complete 62.03%
pretrain epoch 8, step 640, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9810996705200523, error:decision 0.0, error:output/output_prob 0.9810996705200523, loss nan, max_size:classes 57, max_size:data 1695, mem_usage:GPU:0 3.8GB, num_seqs 11, 3.207 sec/step, elapsed 0:30:47, exp. remaining 0:18:46, complete 62.12%
pretrain epoch 8, step 641, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9812792513985187, error:decision 0.0, error:output/output_prob 0.9812792513985187, loss nan, max_size:classes 58, max_size:data 1590, mem_usage:GPU:0 3.8GB, num_seqs 12, 3.055 sec/step, elapsed 0:30:50, exp. remaining 0:18:44, complete 62.20%

some questions about lm config file

Thanks for helping me answer these two questions:
1)
returnn-experiments/2018-asr-attention/librispeech/lm/i512_m2048_m2048.sgd_b64_lr0_cl2.newbobabs.d0.2.config

line26: cached_fn = check_output(["cf", filename]).strip().decode("utf8")

what does the "cf" mean in this line?

When I run RETURNN with this config file, it reports the following error:
"CalledProcessError: Command '['cf', 'data/librispeech/lm_bpe/dev.clean.other.bpe.txt.gz']' returned non-zero exit status 1"

What is the cause of this?

Looking forward to your reply!

Transformer LM training issues

I am able to replicate the single-GPU scores with the LibriSpeech corpus, but the training is slow: 23-24 hours per epoch. I tried 4-GPU training using Horovod; after 30 epochs, train/dev perplexities are around 170. Any suggestions for improving the convergence of multi-GPU training?

Have you tried multi-head attention on Librispeech?

Hi
I found that I can set multi-head attention attributes in the config file (AttNumHeads).
Have you tried AttNumHeads greater than 1 for ASR on LibriSpeech?
I wonder if the rather bad performance on the test-other dataset could be alleviated by multi-head attention.
Thank you in advance :)

Audio coverage issues

Hi,

I trained models on around 1500 h of audio data (LibriSpeech + WIPO proprietary) and obtained good performance (around 18% WER on the proprietary test set).

But sometimes we encounter an issue where the coverage of the audio is very low and a lot of text (5-6 seconds of speech) is missing. I just wanted to ask if any of you has encountered similar challenges.

Best
Akshat

Reusing parameters inside rec layer

Hi,
I am following this config for E2E ASR. I want to add another decoder which shares the s layer (the rnn_cell inside the output layer) between the two decoders. I tried reuse_params with the following options:

kernel: {"reuse_layer" : "base:output", "custom" : (lambda reuse_layer, **kwargs: reuse_layer.params["s/rec/lstm_cell/kernel"])}

Similarly for the bias variable. However, it does not work. I tried to share parameters in output_prob and other linear layers in the same way, and it worked. If it is useful: I am using an older version of RETURNN.

Also, can we reuse layers inside the rec-subnet in a better way? What if we want to reuse the entire rec-subnet?

Implementation of MoChA for ASR

Is this the config for implementing MoChA - Monotonic Chunkwise Attention?
2019-asr-local-attention/librispeech/local-heuristic.argmax.win02.exp3.ctc.config

Loading a saved Returnn model from its .meta file

Hi. I am trying to load a TensorFlow meta graph from a saved checkpoint using TensorFlow version 1.15, to convert it to a SavedModel for TensorFlow Serving. I am using the following code:

import tensorflow as tf
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
import sys

if len(sys.argv) != 2:
    print("Usage: " + sys.argv[0] + " save_dir")
    exit(1)
export_dir = sys.argv[1]
builder = tf.compat.v1.saved_model.builder.SavedModelBuilder(export_dir)
sigs = {}
with tf.Session(graph=tf.Graph()) as sess:
    new_saver = tf.train.import_meta_graph("./serv_test/model.238.meta")
    new_saver.restore(sess, tf.train.latest_checkpoint("./serv_test"))
    graph = tf.get_default_graph()
    input_audio = graph.get_tensor_by_name('inference/default/wav:0')
    output_hyps = graph.get_tensor_by_name('inference/default/Reshape_7:0')
    sigs[signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY] = \
        tf.saved_model.signature_def_utils.predict_signature_def(
            {"in": input_audio}, {"out": output_hyps})
    builder.add_meta_graph_and_variables(sess, [tag_constants.SERVING],
                                         signature_def_map=sigs)
builder.save()

But I am getting the following error in the import_meta_graph line:

Traceback (most recent call last):
  File "xport.py", line 16, in <module>
    new_saver=tf.train.import_meta_graph("./serv_test/model.238.meta")
  File "/home/ubuntu/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1453, in import_meta_graph
    **kwargs)[0]
  File "/home/ubuntu/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1477, in _import_meta_graph_with_return_elements
    **kwargs))
  File "/home/ubuntu/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/meta_graph.py", line 809, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "/home/ubuntu/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "/home/ubuntu/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 501, in _import_graph_def_internal
    graph._c_graph, serialized, options)  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered
 'NativeLstm2' in binary running on ip-10-1-21-241. Make sure the Op and Kernel
 are registered in the binary running in this process. Note that if you are loading a
 saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler`
 should be done before importing the graph, as contrib ops are lazily registered when
 the module is first accessed.

Is there any way to get around this error? Is it because of the custom-built ops used in RETURNN? Is there any way to make a RETURNN model TensorFlow-servable?
Thanks.
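
The NotFoundError arises because the graph uses RETURNN's custom op NativeLstm2, which is compiled separately and must be registered in the process before the meta graph is imported. A sketch (the .so path is a placeholder; RETURNN's tools/compile_native_op.py can build the library):

import tensorflow as tf

# register RETURNN's native LSTM kernel before import_meta_graph
tf.load_op_library("/path/to/NativeLstm2.so")  # placeholder path
new_saver = tf.train.import_meta_graph("./serv_test/model.238.meta")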

Coverage term

In the original paper on shallow fusion (https://arxiv.org/abs/1612.02695), the authors used a coverage term to achieve better results.

Did you guys also try it?

If yes, were there any improvements, and can you share the corresponding config?

Training Configuration for TEDLIUMv2

Hi, thanks for a great framework and experiments.

I'm reproducing the TEDLIUMv2 LSTM results in Table 4 from the paper "A comparison of Transformer and LSTM encoder decoder models for ASR".

I'm using the data preprocessing pipeline from here. My configuration file is here. (It is almost the same as the LibriSpeech configuration; I rewrote it for TEDLIUMv2.)

I expected a test WER of about 10.8%, but I can only get 11.5%-12.0% from several runs. Am I missing something? Could you kindly share the training configs for TEDLIUMv2?

Thanks.

Question about 2018-asr-librispeech dev = get_dataset("dev", subset=3000)

I have encountered this problem when I add more audio data to the training and keep subset=3000: the training loss goes to NaN and never converges. But when I enlarge the subset to 10000, the problem disappears. The small dev subset makes the model break.

And when I set the subset to a fixed number, the toolkit randomly selects some audio to form the dev set. Is the dev set then the same for all epochs during training?

N-best hyps

Hello,
I am sorry for a probably silly question, but is there a parameter that I could specify to output the n-best instead of the 1-best hypotheses? I was using tools/search.py, which just calls returnn/rnn.py with --task search, and in the rnn.py implementation I could not find a parameter like nbest. Thanks!

Change in accuracy on changing batch size during test time

Hello,
I am encountering a weird problem: the accuracy of the test result (for a variant of 2018-asr-attention) changes depending on the batch size at test time. For a smaller batch size, there is an increase in repetition errors. Just wondering if you guys have an idea about what could be going on here?
Thanks

Getting WER worse than expected

Hi
I tested 2018-asr-attention/librispeech/full-setup-attention on my own setup and got a WER of 4.82 on the LibriSpeech test-clean data, as below.

(tensorflow3.5) [myoungji.han@eodssr1 exp-returnn]$ cat search.test-clean.ep250.beam12.recog.scoring.wer
4.82

This is worse than your reported score of 3.82.
I hope you can give me a clue if there is something wrong.
Thank you in advance.

CTC - No valid path found

Hi,

I am trying to reproduce your results for the LibriSpeech dataset (train-clean-100 + train-clean-360 + train-other-500). I follow all the step scripts here https://github.com/rwth-i6/returnn-experiments/tree/master/2018-asr-attention/librispeech/full-setup-attention before running 22_train.sh, and they all seem to run fine without errors.

But I am getting
2018-07-17 16:19:35.497434: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
messages every few batches. Should I worry about it? Or is something going wrong with my training? https://stackoverflow.com/questions/45130184/ctc-loss-error-no-valid-path-found suggests that it is caused when the length of the label sequence is greater than the length of the input sequence. I am filtering out audio sequences shorter than 1 sec and longer than 15 secs.

Thanks
Best!
Akshat
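
The linked explanation matches CTC's hard constraint: a valid path exists only if the label sequence is no longer than the number of encoder frames, and the encoder in this setup downsamples time (by a total factor of 6, if unchanged). A hedged sanity-check sketch over a hypothetical segment list:

red = 6  # assumed total time-reduction factor of the encoder
bad = [seg for seg in segments  # `segments` is a hypothetical list of dicts
       if len(seg["bpe_labels"]) > seg["num_frames"] // red]
print("segments that cannot have a valid CTC path:", len(bad))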

Question about 2020-rnn-transducer

Dear author, I found the config files for the <2020-rnn-transducer> experiment, which is based on the SWB corpus. I tried using this corpus to reproduce your results, but I am not familiar with the feature extraction process and have not built and installed the RASR toolkit successfully yet. I also do not understand why you used gammatone features rather than MFCC features. May I use the LibriSpeech corpus and MFCC features to reproduce your results? Or could you share some config files for transducer experiments using the LibriSpeech corpus and MFCC features? Thanks a lot.

specAugment policy and schedules

Hi,

I wanted to run an experiment with the LD augmentation policy (as described in the Google Brain paper) along with the D learning schedule.

I was wondering what would be the right way of doing something like that with base2.conv2l.specaug.curric3.config.

I was thinking of doing:

  1. Two additional masks in the transform function, simply by calling random_mask two more times (see the sketch after this list).
  2. Slowing down the warm-up by increasing num from 10 to 20 or 40
  3. Reducing the exponential LR decay newbob_learning_rate_decay from 0.9 to 0.95
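
For step 1, a minimal sketch (random_mask and its arguments are the ones already defined in the config's transform function; the exact min_num/max_num/max_dims values here are assumptions):

x = random_mask(x, batch_axis=data.batch_dim_axis, axis=data.time_dim_axis,
                min_num=0, max_num=2, max_dims=20)  # one extra time mask
x = random_mask(x, batch_axis=data.batch_dim_axis, axis=data.time_dim_axis,
                min_num=0, max_num=2, max_dims=20)  # and a second one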

Would it be a reasonable thing to do?

Thanks

Cannot assign a device for operation global_tensor_mem_usage_deviceGPU1/MaxBytesInUse

We are trying to run Librispeech corpus with the latest base2.conv2l.specaug.curric3 config file.
It used to take 0.5 hours per epoch on 1 Tesla V100 GPU, so we upgraded to 4 GPUs. Suddenly it started throwing this error:

Caused by op 'global_tensor_mem_usage_deviceGPU1/MaxBytesInUse', defined at:
  File "./rnn.py", line 655, in <module>
    main(sys.argv)
  File "./rnn.py", line 643, in main
    execute_main_task()
  File "./rnn.py", line 453, in execute_main_task
    engine.train()
  File "/home/ubuntu/returnn/TFEngine.py", line 1213, in train
    self.train_epoch()
  File "/home/ubuntu/returnn/TFEngine.py", line 1321, in train_epoch
    trainer.run(report_prefix=("pre" if self.is_pretrain_epoch() else "") + "train epoch %s" % self.epoch)
  File "/home/ubuntu/returnn/TFEngine.py", line 522, in run
    fetches_dict = self._get_fetches_dict()
  File "/home/ubuntu/returnn/TFEngine.py", line 133, in _get_fetches_dict
    with_summary=True, with_size=True)
  File "/home/ubuntu/returnn/TFNetwork.py", line 1027, in get_fetches_dict
    d["mem_usage:%s" % os.path.basename(dev.name.replace("/device:", "/"))] = TFUtil.mem_usage_for_dev(dev.name)
  File "/home/ubuntu/returnn/TFUtil.py", line 7752, in mem_usage_for_dev
    return global_tensor(get, "mem_usage_%s" % scope_name)
  File "/home/ubuntu/returnn/TFUtil.py", line 5805, in global_tensor
    v = f()
  File "/home/ubuntu/returnn/TFUtil.py", line 7748, in get
    return bytes_in_use()
  File "/home/ubuntu/returnn/.rwth/lib/python3.5/site-packages/tensorflow/contrib/memory_stats/python/ops/memory_stats_ops.py", line 41, in MaxBytesInUse
    return gen_memory_stats_ops.max_bytes_in_use()
  File "/home/ubuntu/returnn/.rwth/lib/python3.5/site-packages/tensorflow/contrib/memory_stats/ops/gen_memory_stats_ops.py", line 214, in max_bytes_in_use
    "MaxBytesInUse", name=name)
  File "/home/ubuntu/returnn/.rwth/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/returnn/.rwth/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/returnn/.rwth/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/ubuntu/returnn/.rwth/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Cannot assign a device for operation global_tensor_mem_usage_deviceGPU1/MaxBytesInUse: node global_tensor_mem_usage_deviceGPU1/MaxBytesInUse (defined at /home/ubuntu/returnn/TFUtil.py:7748) was explicitly assigned to /device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1, /job:localhost/replica:0/task:0/device:XLA_GPU:2, /job:localhost/replica:0/task:0/device:XLA_GPU:3 ]. Make sure the device specification refers to a valid device.
         [[node global_tensor_mem_usage_deviceGPU1/MaxBytesInUse (defined at /home/ubuntu/returnn/TFUtil.py:7748) ]]

We tried debugging by running basic commands like nvidia-smi and it seems fine:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   45C    P0    52W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   47C    P0    52W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   45C    P0    54W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   46C    P0    57W / 300W |      0MiB / 16130MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

We are using tensorflow-gpu (1.13.2) and Cuda V7.5.17.

I will be thankful for any help regarding this issue.
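
One workaround sketch, under the assumption that the cause is device placement: a single RETURNN/TF process sees all four GPUs, the mem-usage op gets pinned to GPU:1, but the session only exposes GPU:0 as a regular device. RETURNN does not split one training across several GPUs in a single process without Horovod, so restricting each process to one GPU avoids the bad placement:

import os
# set before TensorFlow initializes; give each (Horovod) worker its own GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"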

specAugment Implementation

Many thanks for the SpecAugment implementation.

Regarding base2.conv2l.specaug.curric3 here: is my understanding correct that it is not continued training from anything?

Also, is there a significant difference in training time for base2.conv2l.specaug.curric3 as compared to, let's say, base2?

Output text alignment

Hello,

Is it currently possible in RETURNN to train and consequently generate an output alignment between subword units and audio?

Thanks

Implement a unidirectional variant of local attention

Hi,
I am trying to implement a unidirectional variant of the local attention model in the 2018-asr-attention directory. I have made some changes to the config file according to my understanding. Could you verify whether they are correct? The changed config file can be found at the following Drive link:
https://drive.google.com/open?id=1-H0nJgCZn0vr56rgJ4M6A9ZYAB0bWPCA

When I start the training I get the following logs. The loss seems to be very high: for the bidirectional model the loss starts in the range of 5-6k for me, but here it begins in the range of 100k.

pretrain epoch 1, step 0, cost:ctc 370.774999040268, cost:output/output_prob 9.212651966398369, error:ctc 17.81853248248808, error:decision 0.0, error:output/output_prob 0.9999999811407178, loss 98416.805, max_size:classes 12, max_size:data 437, mem_usage:GPU:0 772.1MB, num_seqs 45, 7.013 sec/step, elapsed 0:00:20, exp. remaining 0:00:31, complete 38.81%
pretrain epoch 1, step 1, cost:ctc 284.98298864587923, cost:output/output_prob 9.212257950929825, error:ctc 14.188172029796988, error:decision 0.0, error:output/output_prob 0.9999999990686774, loss 109440.63, max_size:classes 15, max_size:data 361, mem_usage:GPU:0 4.5GB, num_seqs 45, 1.343 sec/step, elapsed 0:00:21, exp. remaining 0:00:30, complete 41.12%
pretrain epoch 1, step 2, cost:ctc 290.689638091957, cost:output/output_prob 9.212171886229044, error:ctc 13.878237992990762, error:decision 0.0, error:output/output_prob 0.9974093013443053, loss 115762.1, max_size:classes 11, max_size:data 457, mem_usage:GPU:0 4.5GB, num_seqs 43, 1.492 sec/step, elapsed 0:00:22, exp. remaining 0:00:28, complete 44.16%
pretrain epoch 1, step 3, cost:ctc 293.1260531155767, cost:output/output_prob 9.212529359078815, error:ctc 14.327272824011743, error:decision 0.0, error:output/output_prob 1.0000000067520887, loss 83143.11, max_size:classes 13, max_size:data 697, mem_usage:GPU:0 4.5GB, num_seqs 28, 1.295 sec/step, elapsed 0:00:24, exp. remaining 0:00:26, complete 47.32%
pretrain epoch 1, step 4, cost:ctc 260.3412321177311, cost:output/output_prob 9.211921857411653, error:ctc 12.447256781160831, error:decision 0.0, error:output/output_prob 0.9999999515712261, loss 127768.2, max_size:classes 17, max_size:data 429, mem_usage:GPU:0 4.6GB, num_seqs 43, 1.548 sec/step, elapsed 0:00:25, exp. remaining 0:00:25, complete 50.61%
pretrain epoch 1, step 5, cost:ctc 318.70524216529157, cost:output/output_prob 9.211640645493048, error:ctc 15.151261280290782, error:decision 0.0, error:output/output_prob 1.0000000512227416, loss 78044.21, max_size:classes 14, max_size:data 669, mem_usage:GPU:0 5.1GB, num_seqs 21, 1.246 sec/step, elapsed 0:00:26, exp. remaining 0:00:22, complete 54.14%
pretrain epoch 1, step 6, cost:ctc 299.3549390775952, cost:output/output_prob 9.211832857916477, error:ctc 14.26940580131486, error:decision 0.0, error:output/output_prob 0.9999999585561454, loss 67576.125, max_size:classes 14, max_size:data 1096, mem_usage:GPU:0 5.1GB, num_seqs 18, 1.654 sec/step, elapsed 0:00:28, exp. remaining 0:00:21, complete 56.69%
pretrain epoch 1, step 7, cost:ctc 293.0398462318408, cost:output/output_prob 9.210876611768981, error:ctc 13.748554605990648, error:decision 0.0, error:output/output_prob 0.9942196309566498, loss 104578.75, max_size:classes 15, max_size:data 702, mem_usage:GPU:0 5.1GB, num_seqs 28, 1.451 sec/step, elapsed 0:00:30, exp. remaining 0:00:19, complete 60.10%
pretrain epoch 1, step 8, cost:ctc 254.12312227828806, cost:output/output_prob 9.211448004022031, error:ctc 12.09428558475338, error:decision 0.0, error:output/output_prob 0.9999999892897904, loss 92167.1, max_size:classes 18, max_size:data 663, mem_usage:GPU:0 5.1GB, num_seqs 28, 1.229 sec/step, elapsed 0:00:31, exp. remaining 0:00:18, complete 62.41%
pretrain epoch 1, step 9, cost:ctc 316.1771712676091, cost:output/output_prob 9.21173871878409, error:ctc 14.632000694982708, error:decision 0.0, error:output/output_prob 1.0000000474974513, loss 81347.23, max_size:classes 19, max_size:data 691, mem_usage:GPU:0 5.3GB, num_seqs 19, 1.290 sec/step, elapsed 0:00:32, exp. remaining 0:00:17, complete 65.57%
pretrain epoch 1, step 10, cost:ctc 278.35383669496514, cost:output/output_prob 9.21164610571941, error:ctc 11.460377529263496, error:decision 0.0, error:output/output_prob 1.0000000149011612, loss 76204.85, max_size:classes 19, max_size:data 1015, mem_usage:GPU:0 5.3GB, num_seqs 19, 1.624 sec/step, elapsed 0:00:34, exp. remaining 0:00:15, complete 68.25%
pretrain epoch 1, step 11, cost:ctc 283.5369720542076, cost:output/output_prob 9.211048180620764, error:ctc 12.168604605831206, error:decision 0.0, error:output/output_prob 0.9825581358745694, loss 100705.32, max_size:classes 17, max_size:data 796, mem_usage:GPU:0 5.5GB, num_seqs 25, 1.489 sec/step, elapsed 0:00:35, exp. remaining 0:00:14, complete 70.56%
pretrain epoch 1, step 12, cost:ctc 294.0424982657896, cost:output/output_prob 9.211014433250512, error:ctc 11.176151985302567, error:decision 0.0, error:output/output_prob 0.9972899928689002, loss 111900.55, max_size:classes 18, max_size:data 744, mem_usage:GPU:0 5.5GB, num_seqs 26, 1.467 sec/step, elapsed 0:00:37, exp. remaining 0:00:13, complete 73.84%
pretrain epoch 1, step 13, cost:ctc 264.9904200157962, cost:output/output_prob 9.211146379502793, error:ctc 10.400467950617895, error:decision 0.0, error:output/output_prob 0.997658038046211, loss 117084.07, max_size:classes 19, max_size:data 739, mem_usage:GPU:0 5.5GB, num_seqs 27, 1.461 sec/step, elapsed 0:00:38, exp. remaining 0:00:12, complete 76.28%
pretrain epoch 1, step 14, cost:ctc 279.6960028297326, cost:output/output_prob 9.21092234193111, error:ctc 10.493362792767584, error:decision 0.0, error:output/output_prob 0.9845132706686854, loss 130585.93, max_size:classes 20, max_size:data 683, mem_usage:GPU:0 5.7GB, num_seqs 29, 1.521 sec/step, elapsed 0:00:40, exp. remaining 0:00:11, complete 78.10%
pretrain epoch 1, step 15, cost:ctc 327.895143393187, cost:output/output_prob 9.21084384552745, error:ctc 11.577777936821803, error:decision 0.0, error:output/output_prob 0.9968254105187953, loss 106188.38, max_size:classes 18, max_size:data 915, mem_usage:GPU:0 5.8GB, num_seqs 21, 1.584 sec/step, elapsed 0:00:41, exp. remaining 0:00:10, complete 80.66%
pretrain epoch 1, step 16, cost:ctc 287.5264747359033, cost:output/output_prob 9.21073743537363, error:ctc 10.535791894420981, error:decision 0.0, error:output/output_prob 1.000000013038516, loss 136795.86, max_size:classes 20, max_size:data 706, mem_usage:GPU:0 5.8GB, num_seqs 28, 1.509 sec/step, elapsed 0:00:43, exp. remaining 0:00:09, complete 82.60%
pretrain epoch 1, step 17, cost:ctc 285.89704467331467, cost:output/output_prob 9.210682287022337, error:ctc 8.432024161331356, error:decision 0.0, error:output/output_prob 0.9909365549683571, loss 97680.66, max_size:classes 20, max_size:data 1021, mem_usage:GPU:0 5.8GB, num_seqs 19, 1.595 sec/step, elapsed 0:00:44, exp. remaining 0:00:08, complete 84.31%
pretrain epoch 1, step 18, cost:ctc 290.6208765563606, cost:output/output_prob 9.210362345630415, error:ctc 9.50226285494864, error:decision 0.0, error:output/output_prob 0.9773756079375744, loss 132525.4, max_size:classes 21, max_size:data 764, mem_usage:GPU:0 5.8GB, num_seqs 26, 1.869 sec/step, elapsed 0:00:46, exp. remaining 0:00:07, complete 86.74%
pretrain epoch 1, step 19, cost:ctc 289.32279153186573, cost:output/output_prob 9.210678139116453, error:ctc 9.320099179632962, error:decision 0.0, error:output/output_prob 0.9950372127350421, loss 120308.99, max_size:classes 24, max_size:data 881, mem_usage:GPU:0 6.0GB, num_seqs 22, 1.619 sec/step, elapsed 0:00:48, exp. remaining 0:00:05, complete 89.05%
pretrain epoch 1, step 20, cost:ctc 280.0556132506572, cost:output/output_prob 9.21039583990057, error:ctc 10.322946036234498, error:decision 0.0, error:output/output_prob 0.968838513828814, loss 102110.91, max_size:classes 23, max_size:data 1038, mem_usage:GPU:0 6.3GB, num_seqs 19, 1.835 sec/step, elapsed 0:00:50, exp. remaining 0:00:04, complete 91.36%
pretrain epoch 1, step 21, cost:ctc 254.93665059748307, cost:output/output_prob 9.209757156185105, error:ctc 7.915057765785604, error:decision 0.0, error:output/output_prob 0.9787644603056832, loss 136827.84, max_size:classes 23, max_size:data 677, mem_usage:GPU:0 6.3GB, num_seqs 27, 1.672 sec/step, elapsed 0:00:51, exp. remaining 0:00:03, complete 93.31%

pretrain epoch 1 'dev' eval, step 31, cost:ctc 277.11975923537466, cost:output/output_prob 9.208977020054363, error:ctc 7.104166878387332, error:decision 0.0, error:output/output_prob 0.9791666958481073, loss 109950.234, max_size:classes 42, max_size:data 1952, mem_usage:GPU:0 7.0GB, num_seqs 10, 2.828 sec/step, elapsed 0:01:55, exp. remaining 0:08:21, complete 18.73%
pretrain epoch 1 'dev' eval, step 32, cost:ctc 273.59442138671875, cost:output/output_prob 9.209274291992188, error:ctc 7.88671875, error:decision 0.0, error:output/output_prob 0.970703125, loss 144795.5, max_size:classes 40, max_size:data 1409, mem_usage:GPU:0 7.0GB, num_seqs 14, 2.666 sec/step, elapsed 0:01:58, exp. remaining 0:08:16, complete 19.23%
pretrain epoch 1 'dev' eval, step 33, cost:ctc 256.75234398911925, cost:output/output_prob 9.208727096917642, error:ctc 7.4961538531351835, error:decision 0.0, error:output/output_prob 0.9711538470583037, loss 138299.75, max_size:classes 41, max_size:data 1299, mem_usage:GPU:0 7.0GB, num_seqs 14, 2.445 sec/step, elapsed 0:02:00, exp. remaining 0:08:10, complete 19.77%
pretrain epoch 1 'dev' eval, step 34, cost:ctc 285.8596498768602, cost:output/output_prob 9.208559764520942, error:ctc 8.386117246001959, error:decision 0.0, error:output/output_prob 0.9761388413608074, loss 136026.44, max_size:classes 42, max_size:data 1447, mem_usage:GPU:0 7.0GB, num_seqs 13, 2.358 sec/step, elapsed 0:02:03, exp. remaining 0:08:03, complete 20.30%
pretrain epoch 1 'dev' eval, step 35, cost:ctc 269.1395396027401, cost:output/output_prob 9.209745830636166, error:ctc 8.055670401314273, error:decision 0.0, error:output/output_prob 0.9731959123164415, loss 134999.4, max_size:classes 39, max_size:data 1395, mem_usage:GPU:0 7.0GB, num_seqs 14, 2.831 sec/step, elapsed 0:02:05, exp. remaining 0:07:59, complete 20.80%
pretrain epoch 1 'dev' eval, step 36, cost:ctc 259.18438174932635, cost:output/output_prob 9.208568567526925, error:ctc 8.137362669571303, error:decision 0.0, error:output/output_prob 0.9725274763768539, loss 146542.55, max_size:classes 37, max_size:data 1191, mem_usage:GPU:0 7.0GB, num_seqs 16, 2.836 sec/step, elapsed 0:02:08, exp. remaining 0:07:53, complete 21.40%
pretrain epoch 1 'dev' eval, step 37, cost:ctc 252.84458955970877, cost:output/output_prob 9.208074668781308, error:ctc 7.913344890112057, error:decision 0.0, error:output/output_prob 0.9722703642910346, loss 151204.39, max_size:classes 38, max_size:data 1139, mem_usage:GPU:0 7.0GB, num_seqs 17, 2.572 sec/step, elapsed 0:02:11, exp. remaining 0:07:47, complete 21.93%
pretrain epoch 1 'dev' eval, step 38, cost:ctc 250.15737703586183, cost:output/output_prob 9.20930364439414, error:ctc 7.438461545389146, error:decision 0.0, error:output/output_prob 0.9673076932085678, loss 134870.67, max_size:classes 40, max_size:data 1285, mem_usage:GPU:0 7.0GB, num_seqs 15, 2.609 sec/step, elapsed 0:02:13, exp. remaining 0:07:43, complete 22.43%
pretrain epoch 1 'dev' eval, step 39, cost:ctc 271.8697553540187, cost:output/output_prob 9.209092128607153, error:ctc 8.836169932968915, error:decision 0.0, error:output/output_prob 0.9744680542498827, loss 132107.06, max_size:classes 38, max_size:data 1355, mem_usage:GPU:0 7.0GB, num_seqs 14, 2.146 sec/step, elapsed 0:02:16, exp. remaining 0:07:37, complete 22.93%
pretrain epoch 1 'dev' eval, step 40, cost:ctc 261.506115385273, cost:output/output_prob 9.209192646439988, error:ctc 7.9118939489126205, error:decision 0.0, error:output/output_prob 0.9691629558801651, loss 122904.76, max_size:classes 41, max_size:data 1448, mem_usage:GPU:0 7.0GB, num_seqs 13, 2.436 sec/step, elapsed 0:02:18, exp. remaining 0:07:31, complete 23.50%
pretrain epoch 1 'dev' eval, step 41, cost:ctc 274.07519103557206, cost:output/output_prob 9.209490461988025, error:ctc 8.77160467277281, error:decision 0.0, error:output/output_prob 0.9711933862417936, loss 137676.36, max_size:classes 39, max_size:data 1281, mem_usage:GPU:0 7.0GB, num_seqs 15, 2.441 sec/step, elapsed 0:02:21, exp. remaining 0:07:25, complete 24.03%
pretrain epoch 1 'dev' eval, step 42, cost:ctc 252.70003694258233, cost:output/output_prob 9.20855387423012, error:ctc 7.676635749056005, error:decision 0.0, error:output/output_prob 0.9700934876454995, loss 140121.1, max_size:classes 37, max_size:data 1178, mem_usage:GPU:0 7.0GB, num_seqs 16, 2.736 sec/step, elapsed 0:02:23, exp. remaining 0:07:20, complete 24.60%
pretrain epoch 1 'dev' eval, step 43, cost:ctc 270.67411690674635, cost:output/output_prob 9.20859987210747, error:ctc 7.644913419382648, error:decision 0.0, error:output/output_prob 0.9673704151064159, loss 145818.9, max_size:classes 36, max_size:data 1207, mem_usage:GPU:0 7.0GB, num_seqs 16, 2.465 sec/step, elapsed 0:02:26, exp. remaining 0:07:14, complete 25.17%
pretrain epoch 1 'dev' eval, step 44, cost:ctc 280.80680964180283, cost:output/output_prob 9.209749808459833, error:ctc 8.78251619799994, error:decision 0.0, error:output/output_prob 0.9701492765452713, loss 136017.77, max_size:classes 35, max_size:data 1272, mem_usage:GPU:0 7.0GB, num_seqs 15, 2.668 sec/step, elapsed 0:02:28, exp. remaining 0:07:10, complete 25.70%
pretrain epoch 1 'dev' eval, step 45, cost:ctc 254.6034903952077, cost:output/output_prob 9.20842834551786, error:ctc 7.482638944638893, error:decision 0.0, error:output/output_prob 0.96875000721775, loss 151955.66, max_size:classes 38, max_size:data 1031, mem_usage:GPU:0 7.0GB, num_seqs 18, 2.503 sec/step, elapsed 0:02:31, exp. remaining 0:07:05, complete 26.23%
pretrain epoch 1 'dev' eval, step 46, cost:ctc 274.09750769345555, cost:output/output_prob 9.209082607156233, error:ctc 8.13664611428976, error:decision 0.0, error:output/output_prob 0.9710145108401775, loss 136837.08, max_size:classes 38, max_size:data 1201, mem_usage:GPU:0 7.0GB, num_seqs 16, 2.307 sec/step, elapsed 0:02:33, exp. remaining 0:06:58, complete 26.87%
pretrain epoch 1 'dev' eval, step 47, cost:ctc 290.25258018526074, cost:output/output_prob 9.209130957663774, error:ctc 8.91721111512743, error:decision 0.0, error:output/output_prob 0.9716775366105139, loss 137452.92, max_size:classes 34, max_size:data 1332, mem_usage:GPU:0 7.0GB, num_seqs 15, 2.507 sec/step, elapsed 0:02:36, exp. remaining 0:06:53, complete 27.43%
pretrain epoch 1 'dev' eval, step 48, cost:ctc 258.8656995520141, cost:output/output_prob 9.2095162026435, error:ctc 7.991323314607143, error:decision 0.0, error:output/output_prob 0.9739696439355612, loss 123582.67, max_size:classes 37, max_size:data 1326, mem_usage:GPU:0 7.0GB, num_seqs 15, 2.818 sec/step, elapsed 0:02:39, exp. remaining 0:06:47, complete 28.07%
pretrain epoch 1 'dev' eval, step 49, cost:ctc 272.2239783266559, cost:output/output_prob 9.208625928783931, error:ctc 8.412573975510895, error:decision 0.0, error:output/output_prob 0.966601213440299, loss 143249.19, max_size:classes 34, max_size:data 1123, mem_usage:GPU:0 7.0GB, num_seqs 17, 2.734 sec/step, elapsed 0:02:41, exp. remaining 0:06:41, complete 28.70%
pretrain epoch 1 'dev' eval, step 50, cost:ctc 258.8279444774871, cost:output/output_prob 9.209226809130655, error:ctc 8.533468355191872, error:decision 0.0, error:output/output_prob 0.977687603328377, loss 132142.33, max_size:classes 35, max_size:data 1229, mem_usage:GPU:0 7.0GB, num_seqs 16, 3.092 sec/step, elapsed 0:02:44, exp. remaining 0:06:38, complete 29.27%
pretrain epoch 1 'dev' eval, step 51, cost:ctc 257.45671048380245, cost:output/output_prob 9.208775578838413, error:ctc 8.383104426320642, error:decision 0.0, error:output/output_prob 0.9685658500529826, loss 135732.73, max_size:classes 34, max_size:data 1152, mem_usage:GPU:0 7.0GB, num_seqs 17, 2.869 sec/step, elapsed 0:02:47, exp. remaining 0:06:31, complete 29.97%
pretrain epoch 1 'dev' eval, step 52, cost:ctc 257.5612085307803, cost:output/output_prob 9.209717567658572, error:ctc 8.740594332106411, error:decision 0.0, error:output/output_prob 0.9702970599755645, loss 134719.31, max_size:classes 39, max_size:data 1120, mem_usage:GPU:0 7.0GB, num_seqs 17, 2.591 sec/step, elapsed 0:02:50, exp. remaining 0:06:27, complete 30.53%
pretrain epoch 1 'dev' eval, step 53, cost:ctc 288.9668045128128, cost:output/output_prob 9.208448120739831, error:ctc 8.214442108757794, error:decision 0.0, error:output/output_prob 0.9628008864820004, loss 136266.1, max_size:classes 34, max_size:data 1224, mem_usage:GPU:0 7.0GB, num_seqs 16, 2.542 sec/step, elapsed 0:02:52, exp. remaining 0:06:22, complete 31.13%
pretrain epoch 1 'dev' eval, step 54, cost:ctc 266.2692113663579, cost:output/output_prob 9.209002644777911, error:ctc 8.604803629685193, error:decision 0.0, error:output/output_prob 0.9628821113146842, loss 126169.02, max_size:classes 33, max_size:data 1180, mem_usage:GPU:0 7.0GB, num_seqs 16, 2.219 sec/step, elapsed 0:02:55, exp. remaining 0:06:16, complete 31.77%
pretrain epoch 1 'dev' eval, step 55, cost:ctc 266.5517305701014, cost:output/output_prob 9.208483576998958, error:ctc 7.970425241626799, error:decision 0.0, error:output/output_prob 0.9667282934533432, loss 149186.28, max_size:classes 32, max_size:data 1015, mem_usage:GPU:0 7.0GB, num_seqs 19, 2.677 sec/step, elapsed 0:02:57, exp. remaining 0:06:10, complete 32.40%

When RETURNN is training with multiple GPU using horovod, is it necessary to decline the batch_size?

Dear friends, thanks for your great work. I have tried to train RETURNN with multiple GPUs using Horovod. However, the result is not quite good: multiple GPUs didn't save much time for us.
So I suspect that the multi-GPU training is actually several single-GPU trainings combined: every worker still handles its own data, and after each epoch they communicate and refine the weights? In that case, we effectively train the model for a multiple of the epochs. Is it then necessary to decrease the batch_size to get the expected multi-GPU training results?

where to get the vocab file for the pretrained librispeech language model

Hi,
I was wondering where I could find the vocab file for the pretrained language model, since I only found the language model itself for LibriSpeech.

data_files = {
    "train": "/work/asr3/irie/data/librispeech/lm_bpe/librispeech-lm-norm.bpe.txt.gz",
    "cv": "/work/asr3/irie/data/librispeech/lm_bpe/dev.clean.other.bpe.txt.gz",
    "test": "/work/asr3/irie/data/librispeech/lm_bpe/test.clean.other.txt.gz"}
vocab_file = "/work/asr3/irie/data/librispeech/lm_bpe/trans.bpe.vocab.lm.txt"

For the train, cv, and test files, I could use apply_bpe.py to generate the corresponding datasets, but if I want to use the pretrained model, I should use the same vocabulary file as the pretrained model. Not sure if I am missing something.

Training only a specific Layer

I am trying to perform model adaptation for an encoder-decoder model (similar to the full attention setup). For this, I'm freezing (trainable=False) all the layers except a couple of layers (let's say output_prob and readout). I have 2 questions regarding this:

  1. Is there an easy way to achieve this in the config? Right now I'm adding trainable=False to every layer except output_prob and readout and running the config.

  2. I noticed that the training speed was similar for my initial config and my config with only 2 layers trainable. I think that's because the gradient is being calculated for all layers. Is there any way to speed up the training, given that I only need to train 2 layers, and those are towards the end of the network?

Thanks in advance for your answer.

query regarding LM data preprocessing

Hi @albertz, in a RETURNN LM config, what do batch_size, max_seq_length, and max_seqs refer to? In my understanding, batch_size should be the maximum number of BPE symbols in a batch, max_seq_length should be similar (per sequence), and max_seqs should be the maximum number of sentences in a batch. Is my understanding correct? I saw that in the LibriSpeech BPE training data there are many sentences with length ~500; how are those sentences handled? Are they ignored, or do they go into batches with fewer sentences?
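
A hedged reading of these settings, as commonly used in RETURNN LM configs (the values are just examples):

batch_size = 900       # max total tokens (here: BPE symbols) across the whole batch
max_seqs = 32          # max number of sequences in one batch
max_seq_length = 502   # sequences longer than this are skipped entirely, not truncated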

Explanation of the ASR loss averaging

I am looking for the explanation of the following case (it's hard for me to find it in the code):
after calculating the CE loss in the ASR model, the loss is accumulated over all the timesteps of the decoder. My question is: is this loss averaged over the timesteps before the optimizer step, or not? Where can I find it in the code?

Thank you
