Comments (9)

albertz avatar albertz commented on June 4, 2024

Which config is that? This one?
Which Returnn version?
On which data is this trained? Can you reproduce that with our pretrained model? Can you share the audio? Or reproduce that on some Librispeech sequence?

In batched decoding, i.e. when multiple sequences go through the search together, the option max_seq_len of the RecLayer is treated a bit differently. E.g. in the mentioned config above, we have "max_seq_len": "max_len_from('base:encoder')". max_len_from will be the max length of the whole batch, i.e. not per individual sequence. I.e. the decoder search will potentially run for more decoder steps, until it picks the best sequence. And for all finished sequences in a batch, we do length normalization, i.e. the score of a finished sequence is normalized such that it is comparable to non-finished sequences. This effectively means that for your specific sequence, you might have a high length-norm factor in it. Unfortunately it's not so easy to get the un-normalized score.
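For illustration, that option sits roughly here in the decoder definition of such a config (a sketch; apart from max_seq_len, the keys shown are just placeholders for the usual attention setup):

'output': {
    'class': 'rec', 'from': [], 'target': 'classes',
    # Max length over the whole batch, not per individual sequence:
    'max_seq_len': "max_len_from('base:encoder')",
    'unit': {
        # ... decoder sub-network ...
    }},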
The length norm is part of the ChoiceLayer. See the related code. Specifically, if a sequence has ended, we keep score_t/t == score_{t-1}/(t-1), thus score_t = score_{t-1} * (t/(t-1)), i.e. we multiply in the factor (t/(t-1)) at every step. So you can check the longest sequence from this search batch, and the sequence length of your sequence of interest, and then you know how often this factor was multiplied.
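As a standalone sanity check of that arithmetic (plain Python, not Returnn code; normalized_score is a made-up name for illustration):

def normalized_score(raw_score, ended_at_step, last_step):
    # A sequence that finished at step ended_at_step keeps getting the
    # factor t/(t-1) applied at every later step, which maintains
    # score_t / t == score_{t-1} / (t-1) for finished sequences.
    score = raw_score
    for t in range(ended_at_step + 1, last_step + 1):
        score *= t / (t - 1)
    return score

# The product telescopes to last_step / ended_at_step:
print(normalized_score(-6.5, 5, 100))  # ~ -130.0 (up to float rounding)
print(-6.5 * 100 / 5)                  # -130.0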
Or you could maybe try to disable the length normalization, and see if the score is the same in that case. But the overall WER might be worse.

yanghongjiazheng avatar yanghongjiazheng commented on June 4, 2024

Yes, it is that config. The Returnn version is c28b6f4f (29 May 2019). The training data contains Librispeech and other English speech corpora. Because I have changed some details of the config, I haven't tried this with your pretrained model. The audio lasts 4.93 s, and the minimum length of the audios in the batch is 0.84 s. I'm wondering whether the 0.0 padding will influence the search result?

albertz avatar albertz commented on June 4, 2024

You might want to try a newer Returnn version also.
As said, I assume this is due to the length normalization, which will yield different final scores depending on how big the batch is. I explained above how you can convert one final score into another.
Or you could maybe try to disable the length normalization, and see if the score is the same in that case. But the overall WER might be worse.

yanghongjiazheng avatar yanghongjiazheng commented on June 4, 2024

Thanks. But how can I disable the length normalization?

albertz avatar albertz commented on June 4, 2024

There is the flag length_normalization in the ChoiceLayer, which you can simply set to False.
In the config, that looks something like this:

...
'output': {'class': 'choice', ..., 'length_normalization': False, ...},
...

yanghongjiazheng avatar yanghongjiazheng commented on June 4, 2024

Besides, the longest audio in the batch lasts 6.89 s, and the audio I am interested in lasts 4.93 s. When I calculate the score multiplied by that factor, the result should be -6.5 rather than -103.

albertz avatar albertz commented on June 4, 2024

Besides, the longest audio in the batch lasts 6.89 s, and the audio I am interested in lasts 4.93 s. When I calculate the score multiplied by that factor, the result should be -6.5 rather than -103.

You should look at the longest decoded label length, not the audio length.
E.g. if the longest label length is 100, and your specific label length is 5, you would get the factor 100/5, which is close to what you observe.
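For the concrete scores in this thread, the implied factor would be (a quick check, assuming the explanation above):

print(-103 / -6.5)  # ~15.8, i.e. the longest decoded hypothesis would be
                    # roughly 16x as long (in labels) as the one of interest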

yanghongjiazheng avatar yanghongjiazheng commented on June 4, 2024

Thanks a lot. I have disabled length_normalization, and that does work: the difference gets closer. However, there is still a small gap. Actually, I have tried the Google masking schedule from SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Has your team tried this augmentation? If so, how does it work for you? I tried it, and it brought a 2.66% WER reduction on test-other. The masking method sets parts of the MFCCs to zero, which is the same as the padding, so I think the gap may come from this. I'll do several experiments to find out. Thanks again for your help!

albertz avatar albertz commented on June 4, 2024

Thanks a lot. I have disabled length_normalization, and that does work: the difference gets closer. However, there is still a small gap.

This might also be due to different max seq length used during the search (as I explained earlier).

Actually, I have tried the Google masking schedule from SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Has your team tried this augmentation? If so, how does it work for you? I tried it, and it brought a 2.66% WER reduction on test-other. The masking method sets parts of the MFCCs to zero, which is the same as the padding, so I think the gap may come from this. I'll do several experiments to find out. Thanks again for your help!

Do you use this also during decoding? You likely have some randomness in it, so this might just be due to a different random seed?

Yes, we also use SpecAugment. E.g. see here, or here, or here.
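For reference, the core of SpecAugment-style masking can be sketched like this (a minimal NumPy sketch with made-up parameter names, not the Returnn implementation linked above):

import numpy as np

def specaugment_mask(feats, num_time_masks=2, max_time_mask=20,
                     num_freq_masks=2, max_freq_mask=8, rng=np.random):
    # feats: array of shape (time, feature_dim), e.g. MFCCs.
    # Masked regions become zeros, i.e. they look exactly like
    # zero-padding, which is the point raised above.
    feats = feats.copy()
    num_frames, num_feats = feats.shape
    for _ in range(num_time_masks):
        width = rng.randint(0, max_time_mask + 1)
        start = rng.randint(0, max(1, num_frames - width))
        feats[start:start + width, :] = 0.0
    for _ in range(num_freq_masks):
        width = rng.randint(0, max_freq_mask + 1)
        start = rng.randint(0, max(1, num_feats - width))
        feats[:, start:start + width] = 0.0
    return feats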
