Comments (30)
I am not sure what you are doing and what problem you are experiencing. Btw, here you can find the dictionaries for en-es:
- spm_model_es.vocab
- spm_model_es.txt
- spm_model_es.model
- spm_model_en.vocab
- spm_model_en.txt
- spm_model_en.model
from fbk-fairseq.
The TSV is formatted as in the example above. You can create it using DeepPavlov (https://docs.deeppavlov.ai/en/0.9.0/features/models/ner.html), as we have done (with the model ner_ontonotes_bert_mult).
Thanks a lot 😭
no problem, let me know if you have more issues or need any help.
I am closing it for now, feel free to reopen if you need anything else. Thanks.
Using the command in the README and the dictionaries you uploaded, I can't get a good BLEU score on the speech_to_text_tagged task in en-es.
This is my training script:
```shell
python train.py datasetdir \
    --train-subset train_st --valid-subset dev_st \
    --save-dir datasetdir \
    --num-workers 2 --max-update 100000 \
    --max-tokens 15000 \
    --user-dir examples/speech_to_text \
    --task speech_to_text_ctc_tagged --config-yaml config_st.yaml \
    --criterion ctc_multi_loss --underlying-criterion cross_entropy_with_tags \
    --label-smoothing 0.1 --tags-loss-weight 1.0 \
    --arch conformer_with_tags \
    --ctc-encoder-layer 4 --ctc-weight 0.5 --ctc-compress-strategy avg \
    --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
    --warmup-updates 25000 \
    --clip-norm 10.0 \
    --seed 9 --update-freq 8 --patience 5 --keep-last-epochs 7 \
    --skip-invalid-size-inputs-valid-test --find-unused-parameters
```
Hope to get your help, thank you very much🙏
Well, it seems your model is not working at all; BLEU is nearly 0.0. My best guess is that you have some problems with either your training or inference data. Can you send me the logs of your training and the full generate output? You can also send them to me via email (you can find my address in the paper). Also, please check and send me your config_st.yaml. Thanks.
Thanks for your suggestion. Here are my training logs, my config_st.yaml, and the full generate output. Thanks a lot!
config_st.txt
generate-tst-COMMON_st.txt
Well, there is definitely something wrong with your training data. The ctc_loss is 0, which is weird, and the ppl on the dev set is very high. In the generate output the model always repeats the same things, and the training loss on the training set is also very high. I can send you my logs, but the main problem is definitely your training data: please check it, and maybe try to regenerate it. The other weird thing is the 0 ctc_loss; I am not sure why you have that. I also do not understand where the tags_loss in your logs comes from. If you have changed the code, be careful that you have not introduced issues, e.g. in the collater, which may also be the cause of your problem.
Thanks for discovering my problem. I haven't changed the code. My dataset is MuST-C en-es and the training command comes from "fbk_works/JOINT_ST_NER2023.md". Is there something wrong with my training command? I'll double-check my preprocessing and training data. Thanks again!
I see, I think there is nothing wrong with your training command, but if you send me the full log of your training I can double check. The problem is that your ppl is very high, which means that the training is not working, so I am confident that it is a data issue.
Check your TSV, mine looks like:
id audio n_frames src_text tgt_text speaker
ted_1_0 /storage/MT/mgaido/corpora/MuST-C/tagged/en-es/fbank.zip:54923124817:921088 2878 And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful. I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night. Muchas gracias <PERSON>Chris</PERSON>. Y es en verdad un gran honor tener la oportunidad de venir a este escenario por <ORDINAL>segunda</ORDINAL> vez. Estoy extremadamente agradecido. He quedado conmovido por esta conferencia, y deseo agradecer a todos ustedes sus amables comentarios acerca de lo que tenía que decir <TIME>la otra noche</TIME>. spk.1
ted_1_1 /storage/MT/mgaido/corpora/MuST-C/tagged/en-es/fbank.zip:54785873214:714688 2233 And I say that sincerely, partly because (Mock sob) I need that. (<PERSON>Laughter</PERSON>) Y digo eso sinceramente, en parte porque — (Sollozos fingidos) — ¡lo necesito! (<PERSON>Risas</PERSON>) ¡Pónganse en mi posición! spk.1
ted_1_2 /storage/MT/mgaido/corpora/MuST-C/tagged/en-es/fbank.zip:33455963128:451328 1410 (Laughter) Now I have to take off my shoes or boots to get on an airplane! (Laughter) (Applause) Volé en el avión vicepresidencial por <DATE>ocho años</DATE>. ¡Ahora tengo que quitarme mis zapatos o botas para subirme a un avión! (<PERSON>Risas</PERSON>) (<PERSON>Aplausos</PERSON>) spk.1
ted_1_3 /storage/MT/mgaido/corpora/MuST-C/tagged/en-es/fbank.zip:17195051777:131328 410 I'll tell you <CARDINAL>one</CARDINAL> quick story to illustrate what that's been like for me. Les diré una rápida historia para ilustrar lo que ha sido para mí. spk.1
ted_1_4 /storage/MT/mgaido/corpora/MuST-C/tagged/en-es/fbank.zip:31898357685:124928 390 (Laughter) It's a true story — every bit of this is true. Es una historia verdadera — cada parte de esto es verdad. spk.1
You also need to find out why the CTC loss is 0. That usually happens when the transcript is longer than the input, which should not be the case here. So there is something wrong with your data, either in the preprocessing or in the loading. I would recommend starting a debugger and checking what you have in the forward of cross_entropy_with_tags.py. You can also create a script that loads your data with a SpeechToTextDatasetTagged and check that the length of the input audio is the expected one (it should roughly be the number of milliseconds / 10) and that the transcript/translation are loaded correctly.
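These length checks can be sketched as follows (my own sketch, stdlib only: the 10 ms hop is the usual filterbank setting, while the 4x subsampling factor is an assumption that depends on the actual encoder and on CTC compression):

```python
def expected_frames(duration_ms: int, hop_ms: int = 10) -> int:
    # Filterbank features are computed every hop_ms milliseconds,
    # so n_frames should be roughly duration_ms / hop_ms.
    return duration_ms // hop_ms

def ctc_feasible(n_frames: int, target_len: int, subsampling: int = 4) -> bool:
    # The CTC loss is only defined when the encoder output is at least as
    # long as the target sequence; otherwise implementations typically
    # return 0 or infinity. The subsampling factor here is an assumption.
    return n_frames // subsampling >= target_len

# ted_1_3 above has 410 frames, i.e. roughly a 4.1-second utterance.
print(expected_frames(4100))              # 410
print(ctc_feasible(410, target_len=50))   # True
print(ctc_feasible(410, target_len=500))  # False
```

If `ctc_feasible` is False for most of your utterances, the transcripts (or the audio offsets in the TSV) are almost certainly wrong.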
I have found a problem: there are no NER labels in my dataset.
We have labelled the data with Deeppavlov, as described in the paper.
I am sorry for this, and thank you very much for your help. I will investigate the CTC loss problem further.
I am closing this as it has been stale for a while. Feel free to reopen if anything else is needed. Thanks.
I'm very sorry for my mistake. I found that I didn't create a new training TSV with the NER tags. This may be the main reason for the 0 BLEU. Would you mind sharing your tagged training TSV with me? I don't know the format of the additional NER tags. Sincere thanks!
Thank you very much!
I got different NER tags than you did, which is so weird.
It is not different, I just formatted it differently: I converted the BIO format (the DeepPavlov output you see) into the format I told you about, wrapping the text with tags.
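That conversion can be sketched like this (a minimal illustration of my own, not the exact script used for the paper): collapse runs of B-/I- token tags into inline <TAG>...</TAG> markup, leaving O tokens untouched:

```python
def bio_to_inline(tokens, tags):
    """Wrap BIO-tagged token spans with inline <TAG>...</TAG> markup."""
    out = []
    span, label = [], None

    def flush():
        # Close the currently open entity span, if any.
        nonlocal span, label
        if span:
            out.append("<{0}>{1}</{0}>".format(label, " ".join(span)))
            span, label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            span, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            span.append(token)
        else:  # an "O" tag (or an I- tag without a matching B-)
            flush()
            out.append(token)
    flush()
    return " ".join(out)

print(bio_to_inline(
    ["Muchas", "gracias", "Chris", "."],
    ["O", "O", "B-PERSON", "O"],
))  # Muchas gracias <PERSON>Chris</PERSON> .
```

Joining the wrapped tokens with spaces is a simplification; detokenization would be needed to exactly reproduce the TSV rows shown earlier in the thread.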
But in my result, almost all words get an NER tag rather than O.
Mmmmh.... this is weird indeed. I am not sure why this is happening. My script is:
```python
from deeppavlov import configs, build_model
import sys

CHUNK_SIZE = 1000

ner_model = build_model(configs.ner.ner_ontonotes_bert_mult)


def ner(inputs):
    res = ner_model(inputs)
    tokens = res[0]
    nes = res[1]
    for s_i in range(len(tokens)):
        outs = []
        for t_i, token in enumerate(tokens[s_i]):
            ne = nes[s_i][t_i]
            outs.append((t_i, token, ne))
            # If a chain of I- tags extends back to the start of the
            # sentence without a B-, promote the first token's tag to B-.
            if nes[s_i][t_i].startswith("I-"):
                i = 1
                while nes[s_i][t_i - i].startswith("I-"):
                    i += 1
                if i > t_i:
                    outs[0] = (outs[0][0], outs[0][1], "B-" + outs[0][2].split("-")[1])
                else:
                    assert nes[s_i][t_i - i].startswith("B-"), \
                        "{} /// {}".format(str(nes), str(tokens))
        # One "index<TAB>token<TAB>tag" line per token, sentences
        # separated by a blank line.
        for o in outs:
            print("{}\t{}\t{}".format(o[0], o[1], o[2]))
        print("")


# Read sentences from stdin and tag them in chunks.
lines = []
for line in sys.stdin:
    lines.append(line.strip())
    if len(lines) >= CHUNK_SIZE:
        ner(lines)
        lines = []
if len(lines) > 0:
    ner(lines)
```
😆You did me a big favor. Thank you very much! I have submitted this question to deeppavlov.
I re-downloaded the NER model and followed your script. However, the DeepPavlov authors have updated their code, and it seems difficult to reproduce the label example you gave.
Results may be a bit different (likely better), but apart from that, everything should be fine.
All right, let me reorganize my training process. How can I get dev_ep_netagged.tsv?
The same way as the training set, using the dev set of Europarl-ST.
Do you mean Europarl-ST's dev TSV, or Europarl-ST's train TSV after NER tagging?
As dev set, we used Europarl-ST dev set with NER.
Thanks. If I want to train only on MuST-C, how should I change the training script?
You just need to put the name of the TSV for MuST-C in --train-subset. Similarly, for the --valid-subset parameter you can specify the TSV for the dev set of MuST-C (as well as that of any other dataset you might want to use).
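For example (the basenames below are placeholders for whatever your preprocessing produced; every other flag stays exactly as in the training command earlier in the thread):

```shell
# datasetdir must contain train_mustc_st.tsv and dev_mustc_st.tsv
# (placeholder names), plus config_st.yaml and the dictionaries.
python train.py datasetdir \
    --train-subset train_mustc_st \
    --valid-subset dev_mustc_st \
    --save-dir datasetdir
# ...followed by the remaining flags of the original command.
```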