Comments (9)

albertz avatar albertz commented on June 4, 2024

Which config is that? This one?
Which Returnn version?
On which data is this trained? Can you reproduce that with our pretrained model? Can you share the audio? Or reproduce that on some Librispeech sequence?

In batched decoding, i.e. when multiple sequences go through the search together, the option max_seq_len of the RecLayer is treated a bit differently. E.g. in the mentioned config above, we have "max_seq_len": "max_len_from('base:encoder')". max_len_from will be the max length of the whole batch, i.e. not per individual sequence. I.e. the decoder search will potentially run for more decoder steps, until it picks the best sequence. And for all finished sequences in a batch, we do length normalization, i.e. the score of a finished sequence is normalized such that it is comparable to non-finished sequences. This effectively means that for your specific sequence, you might have a high length-norm factor in it. Unfortunately it's not so easy to get the un-normalized score.
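For illustration, that option sits roughly here in the decoder definition of such a config (a sketch; apart from max_seq_len, the keys shown are just placeholders for the usual attention setup):

'output': {
    'class': 'rec', 'from': [], 'target': 'classes',
    # Max length over the whole batch, not per individual sequence:
    'max_seq_len': "max_len_from('base:encoder')",
    'unit': {
        # ... decoder sub-network ...
    }},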
The length norm is part of the ChoiceLayer. See the related code. Specifically, if a sequence has ended, we keep score_t/t == score_{t-1}/(t-1), thus score_t = score_{t-1} * (t/(t-1)), i.e. we multiply in the factor (t/(t-1)) at every step. So you can check the longest sequence from this search batch, and the sequence length of your sequence of interest, and then you know how often this factor was multiplied.
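As a standalone sanity check of that arithmetic (plain Python, not Returnn code; normalized_score is a made-up name for illustration):

def normalized_score(raw_score, ended_at_step, last_step):
    # A sequence that finished at step ended_at_step keeps getting the
    # factor t/(t-1) applied at every later step, which maintains
    # score_t / t == score_{t-1} / (t-1) for finished sequences.
    score = raw_score
    for t in range(ended_at_step + 1, last_step + 1):
        score *= t / (t - 1)
    return score

# The product telescopes to last_step / ended_at_step:
print(normalized_score(-6.5, 5, 100))  # ~ -130.0 (up to float rounding)
print(-6.5 * 100 / 5)                  # -130.0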
Or you could maybe try to disable the length normalization, and see if the score is the same in that case. But the overall WER might be worse.

yanghongjiazheng avatar yanghongjiazheng commented on June 4, 2024

Yes, it is that config. The Returnn version is c28b6f4f (29 May 2019). The training data contains Librispeech and other English speech corpora. Because I have changed some details of the config, I haven't tried this with your pretrained model. The audio lasts 4.93 s, and the minimum length of the audios in the batch is 0.84 s. I'm wondering whether the 0.0 padding will influence the search result?

albertz avatar albertz commented on June 4, 2024

You might want to try a newer Returnn version also.
As said, I assume this is due to the length normalization, which will yield different final scores depending on how big the batch is. I explained above how you can convert one final score into another.
Or you could maybe try to disable the length normalization, and see if the score is the same in that case. But the overall WER might be worse.

yanghongjiazheng avatar yanghongjiazheng commented on June 4, 2024

Thanks. But how can I disable the length normalization?

albertz avatar albertz commented on June 4, 2024

There is the flag length_normalization in the ChoiceLayer, which you can simply set to False.
In the config, that looks something like this:

...
'output': {'class': 'choice', ..., 'length_normalization': False, ...},
...

yanghongjiazheng avatar yanghongjiazheng commented on June 4, 2024

Besides, the longest audio in the batch lasts 6.89 s, and the audio I am interested in lasts 4.93 s. When I calculate the score multiplied by that factor, the result should be -6.5 rather than -103.

albertz avatar albertz commented on June 4, 2024

Besides, the longest audio in the batch lasts 6.89 s, and the audio I am interested in lasts 4.93 s. When I calculate the score multiplied by that factor, the result should be -6.5 rather than -103.

You should look at the longest decoded label length, not the audio length.
E.g. if the longest label length is 100, and your specific label length is 5, you would get the factor 100/5, which is close to what you observe.
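For the concrete scores in this thread, the implied factor would be (a quick check, assuming the explanation above):

print(-103 / -6.5)  # ~15.8, i.e. the longest decoded hypothesis would be
                    # roughly 16x as long (in labels) as the one of interest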

yanghongjiazheng avatar yanghongjiazheng commented on June 4, 2024

Thanks a lot. I have disabled length_normalization, and that does work: the difference gets closer. However, there is still a small gap. Actually, I have tried the Google masking schedule from SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Has your team tried this augmentation? If so, how does it work for you? I tried it, and it brought a 2.66% WER reduction on test-other. The masking method sets parts of the MFCCs to zero, which is the same as the padding, so I think the gap may come from this. I'll do several experiments to find out. Thanks again for your help!

albertz avatar albertz commented on June 4, 2024

Thanks a lot. I have disabled length_normalization, and that does work: the difference gets closer. However, there is still a small gap.

This might also be due to different max seq length used during the search (as I explained earlier).

Actually, I have tried the Google masking schedule from SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Has your team tried this augmentation? If so, how does it work for you? I tried it, and it brought a 2.66% WER reduction on test-other. The masking method sets parts of the MFCCs to zero, which is the same as the padding, so I think the gap may come from this. I'll do several experiments to find out. Thanks again for your help!

Do you use this also during decoding? You likely have some randomness in it, so this might just be due to a different random seed?

Yes, we also use SpecAugment. E.g. see here, or here, or here.
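For reference, the core of SpecAugment-style masking can be sketched like this (a minimal NumPy sketch with made-up parameter names, not the Returnn implementation linked above):

import numpy as np

def specaugment_mask(feats, num_time_masks=2, max_time_mask=20,
                     num_freq_masks=2, max_freq_mask=8, rng=np.random):
    # feats: array of shape (time, feature_dim), e.g. MFCCs.
    # Masked regions become zeros, i.e. they look exactly like
    # zero-padding, which is the point raised above.
    feats = feats.copy()
    num_frames, num_feats = feats.shape
    for _ in range(num_time_masks):
        width = rng.randint(0, max_time_mask + 1)
        start = rng.randint(0, max(1, num_frames - width))
        feats[start:start + width, :] = 0.0
    for _ in range(num_freq_masks):
        width = rng.randint(0, max_freq_mask + 1)
        start = rng.randint(0, max(1, num_feats - width))
        feats[:, start:start + width] = 0.0
    return feats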
