
Comments (10)

albertz avatar albertz commented on May 25, 2024

The vocabulary created by the apply_bpe script is the same as for the pretrained model.
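To illustrate the point, a BPE-segmented corpus determines its vocabulary deterministically, so segmenting with the same apply_bpe codes yields the same token set. A minimal sketch of deriving a token-to-index vocabulary (the dict form RETURNN vocab files use) from BPE-segmented lines — the special symbols and sample corpus here are illustrative assumptions, not the exact pipeline output:

```python
def build_vocab(lines, special=("<s>", "</s>", "<unk>")):
    """Collect unique BPE tokens from segmented lines and assign
    consecutive indices; special symbols get the lowest indices."""
    vocab = {}
    for tok in special:
        vocab[tok] = len(vocab)
    for line in lines:
        for tok in line.split():
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

# Hypothetical BPE-segmented text ("@@" marks a subword continuation):
corpus = ["the qu@@ ick fox", "the la@@ zy dog"]
vocab = build_vocab(corpus)
```

Running the same deterministic procedure on the same segmented data always reproduces the same mapping, which is why the vocab matches the pretrained model's.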

from returnn-experiments.

Slyne avatar Slyne commented on May 25, 2024


albertz avatar albertz commented on May 25, 2024

I'm not sure what you mean by librispeech-lm-norm.txt.
You just use the vocabulary (vocab_file) which you created via full-setup-attention.
You don't need anything else.
You can use the configs here to do recognition with your attention model together with a LM (e.g. the pretrained model).


Slyne avatar Slyne commented on May 25, 2024


albertz avatar albertz commented on May 25, 2024

I don't quite understand. If you use a pretrained model, you don't need to train it. The config you are referring to is for training (which you don't need, because it is already trained). For recognition, you need a different config, which I linked already.
The vocabulary which you pasted is the correct one.


Slyne avatar Slyne commented on May 25, 2024


Slyne avatar Slyne commented on May 25, 2024

And the config file seems to be a training configuration: you can see that task="train" is set in the configuration file.


albertz avatar albertz commented on May 25, 2024

Ah, I thought that you want to use the pretrained model.
If you want to train it yourself, then yes, you need that config (btw, yes, there is task='train', but still in many cases, the config file is used for both training + inference; any option there can be overwritten via command line).
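As a sketch of such a command-line override (assuming the usual RETURNN entry point; the config filename and option values here are illustrative, not taken from this thread):

```shell
# Train with the config as-is (task="train" inside the file):
python3 returnn/rnn.py lm_config.py

# Reuse the same config for inference by overriding options
# on the command line with RETURNN's ++option syntax:
python3 returnn/rnn.py lm_config.py ++task search ++load_epoch 80
```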
I just checked and you are right, the vocab used for the LmDataset has a different format. But I think it should be straightforward to convert from one to the other. Or to add support for the other format. Actually I wonder why we have that at all. I uploaded that vocab file here.
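A conversion between the two formats can be sketched as follows, assuming the attention-model vocab is a Python dict literal mapping token to index (as RETURNN vocab files typically are) and the other format is a plain word list, one token per line in index order. The exact file layouts are assumptions:

```python
import ast

def dict_vocab_to_word_list(vocab_path, out_path):
    """Convert a dict-literal vocab file (token -> index) into a
    plain word-list file, one token per line, ordered by index."""
    with open(vocab_path) as f:
        vocab = ast.literal_eval(f.read())
    tokens = sorted(vocab, key=vocab.get)  # index order
    with open(out_path, "w") as f:
        for tok in tokens:
            f.write(tok + "\n")
```

The reverse direction is just `{tok: i for i, tok in enumerate(lines)}` over the word list.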
The train files (data_files in config) are generated somehow from the LibriSpeech LM training data. We might have done some post processing / normalization. @kazuki-irie can give you more details on that. Actually, maybe he can just upload his normalization scripts also here.


Slyne avatar Slyne commented on May 25, 2024


albertz avatar albertz commented on May 25, 2024

No problem. I extended the Readme a bit (here).
Maybe @kazuki-irie can later add some of the post processing scripts.
Closing this now.

