Comments (9)
Which config is that? This one?
Which Returnn version?
On which data is this trained? Can you reproduce that with our pretrained model? Can you share the audio? Or reproduce that on some Librispeech sequence?
In batched decoding, i.e. when multiple sequences are together in the search, the option max_seq_len of the RecLayer is treated a bit differently. E.g. in the config mentioned above, we have "max_seq_len": "max_len_from('base:encoder')". max_len_from will be the max length over the whole batch, i.e. not per individual sequence. That means the decoder search will potentially run for more decoder steps until it picks the best sequence. And for all finished sequences in a batch, we do length normalization, i.e. the score of a finished sequence is normalized such that it is comparable to the non-finished sequences. Effectively, this means that for your specific sequence, you might have a high length-norm factor in the score. Unfortunately it is not so easy to get the un-normalized score back.
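As a minimal sketch of the batched behavior (plain Python with hypothetical frame counts, not RETURNN code), the step limit is one number per batch, taken from the longest encoder sequence:

```python
# Hypothetical encoder lengths (in frames) for one batch of four sequences:
encoder_lens = [493, 84, 689, 310]

# "max_len_from('base:encoder')" yields a single limit for the whole batch,
# i.e. the maximum over all sequences, not a per-sequence limit:
max_seq_len = max(encoder_lens)
print(max_seq_len)  # 689

# So every hypothesis in this batch may be expanded for up to 689 decoder
# steps, even when its own audio is much shorter.
```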
The length norm is part of the ChoiceLayer. See the related code. Specifically, if a sequence has ended, score_t / t == score_{t-1} / (t-1), thus score_t = score_{t-1} * (t / (t-1)), i.e. we multiply the factor (t / (t-1)) in every step. So you can check the longest sequence from this search batch and the sequence length of your sequence of interest, and then you know how often this factor was multiplied.
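A minimal sketch in plain Python (the step counts here are hypothetical, not from the config) shows how the per-step factors telescope:

```python
# Hypothetical example: the hypothesis ended at step t_end = 5,
# the search for the whole batch ran on to step T = 20.
t_end, T = 5, 20
score = -6.5  # un-normalized log score at the step the hypothesis ended

# The factor (t / (t - 1)) is multiplied in at every further step,
# which keeps score_t / t constant for a finished hypothesis.
for t in range(t_end + 1, T + 1):
    score *= t / (t - 1)

# The factors telescope to T / t_end, so the final stored score is
# score_end * T / t_end = -6.5 * 20 / 5 = -26.0.
print(round(score, 6))  # -26.0
```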
Or you could maybe try to disable the length normalization, and see if the score is the same in that case. But the overall WER might be worse.
from returnn-experiments.
Yes, that is the config. The Returnn version is c28b6f4f (29 May 2019). The training data contains Librispeech and other English corpora. Because I have changed some details of the config, I haven't tried this with your pretrained model. The audio lasts 4.93s, and the minimum length of the audios in the batch is 0.84s. I'm wondering whether the 0.0 padding will influence the search result?
You might also want to try a newer Returnn version. As said, I assume this is due to the length normalization, which will give different final scores depending on how big the batch is. I explained above how you can convert one final score into the other.
Or you could maybe try to disable the length normalization, and see if the score is the same in that case. But the overall WER might be worse.
Thanks. But how can I disable the length normalization?
There is the flag length_normalization in the ChoiceLayer, which you can simply set to False.
In the config, that looks sth like this:
...
'output': {'class': 'choice', ..., 'length_normalization': False, ...},
...
Besides, the longest audio in the batch lasts 6.89s, and the audio that I am interested in lasts 4.93s. When I calculate the score multiplied by the factor, the result should be -6.5 rather than -103.
Besides, the longest audio in the batch lasts 6.89s, and the audio that I am interested in lasts 4.93s. When I calculate the score multiplied by the factor, the result should be -6.5 rather than -103.
You should look at the longest decoded label length, not the audio length.
E.g. if the longest label length is 100, and your specific label length is 5, you would get the factor 100/5, which is close to what you observe.
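As a rough sanity check with the two scores from this thread (the implied label lengths are only illustrative), the ratio of the scores gives the length-norm factor that must have been applied:

```python
# Scores reported above: the expected un-normalized score and the observed one.
expected, observed = -6.5, -103.0

# Under the length normalization described earlier, the two scores differ
# by the factor T / t (longest decoded label length over this sequence's
# own label length).
factor = observed / expected
print(round(factor, 2))  # 15.85

# So the longest hypothesis in the batch should have about 16x as many
# labels as the sequence of interest -- a label-length ratio, not the
# audio-duration ratio 6.89 / 4.93.
```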
Thanks a lot. I have disabled the length_normalization function, and that does work; the difference gets smaller. However, there is still a small gap. Actually, I have tried the Google masking schedule, referring to SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Has your team tried this augmentation? If so, how does it work for you? I tried it, and it brought a 2.66% WER reduction on test-other. The masking method masks parts of the MFCCs to zero, which is the same as the padding, so I think this gap may come from that. I'll do several experiments to find out. Thanks again for your help!
Thanks a lot. I have disabled the length_normalization function, and that does work; the difference gets smaller. However, there is still a small gap.
This might also be due to different max seq length used during the search (as I explained earlier).
Actually, I have tried the Google masking schedule, referring to SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Has your team tried this augmentation? If so, how does it work for you? I tried it, and it brought a 2.66% WER reduction on test-other. The masking method masks parts of the MFCCs to zero, which is the same as the padding, so I think this gap may come from that. I'll do several experiments to find out. Thanks again for your help!
Do you use this also during decoding? It likely has some randomness in it, so this might just be due to a different random seed?
Yes, we also use SpecAugment. E.g. see here, or here, or here.
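For reference, the time/frequency masking from the SpecAugment paper can be sketched in a few lines of NumPy. This is a simplified illustration, not the exact policy or the RETURNN implementation; the function name, mask counts and widths are all made up here:

```python
import numpy as np

def spec_augment(features, max_time_mask=20, max_freq_mask=8, num_masks=2,
                 rng=np.random.default_rng()):
    """Zero out random time and frequency stripes of a (time, freq) array."""
    x = features.copy()
    T, F = x.shape
    for _ in range(num_masks):
        # Time mask: zero a random span of frames.
        w = rng.integers(0, max_time_mask + 1)
        t0 = rng.integers(0, max(T - w, 1))
        x[t0:t0 + w, :] = 0.0
        # Frequency mask: zero a random band of feature channels.
        w = rng.integers(0, max_freq_mask + 1)
        f0 = rng.integers(0, max(F - w, 1))
        x[:, f0:f0 + w] = 0.0
    return x

feats = np.random.randn(100, 40)   # e.g. 100 frames of 40-dim MFCCs
masked = spec_augment(feats)
print(masked.shape)  # (100, 40)
```

Note that the masked regions are set to 0.0, the same value as the batch padding, which is exactly the overlap raised in the question above.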