Comments (12)
1. The dataset was provided by my teacher and includes Lang-8 and NUCLE (version 3.2). 5458 sentence pairs from NUCLE were taken out to be used as the development data. The training data includes 132M sentence pairs.
2. I will try using interactive.py instead of generate.py.
3. You mean I need to turn the test set (the conll14st-test.tok.src file) into the --testpref by running the command:
python3.5 $FAIRSEQPY/preprocess.py --source-lang src --target-lang trg --trainpref processed/train --validpref processed/dev --testpref processed/dev --nwordssrc 30000 --nwordstgt 30000 --destdir processed/bin
4. Can you explain what the /training/processed/bin directory is for?
5. If I use that version of Fairseq-py (which uses PyTorch 0.2.0), do I need to compile and install PyTorch from source instead of installing via pip? And do any other parameters need to be changed?
from mlconvgec2018.
- Use interactive.py instead of generate.py to decode the test set if you are using the latest Fairseq-py version. I was saying that, alternatively, you can use generate.py itself if you had used conll14st-test for --testpref while doing preprocessing. The reason, I believe, is that in the current Fairseq-py, generate.py automatically uses the test.src-trg.{src,trg}.{bin,idx} files within the processed/bin directory to perform decoding, while interactive.py decodes any input file that is passed through standard input.
- The training/processed/bin directory contains the binarized and indexed versions of the training, development and test datasets for faster loading during training, validation and testing. It also contains the vocabulary files (dict.src.txt and dict.trg.txt).
- Yes, I had to compile PyTorch from source, since the Fairseq-py version that I used required the ATen library, which was only available in the GitHub version of PyTorch and not in the official release back then.
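Since interactive.py reads from standard input, the generate.py call in run.sh could be adapted along these lines. This is a sketch only: the paths and beam size are illustrative (not from the repository), and the exact flags should be checked against the fairseq-py version you installed. It is written as a dry run that assembles and prints the command so it can be inspected before running.

```shell
# Illustrative paths; adjust to your checkout. interactive.py reads one
# BPE-segmented source sentence per line from standard input.
FAIRSEQPY=software/fairseq-py              # hypothetical fairseq-py checkout
MODEL=training/models/mlconv/model1000     # hypothetical trained checkpoint
DATA_BIN=training/processed/bin            # binarized data and vocabularies
BEAM=12                                    # illustrative beam size

CMD="python3.5 $FAIRSEQPY/interactive.py --no-progress-bar \
--path $MODEL --beam $BEAM --nbest $BEAM $DATA_BIN"

# Print the command instead of running it, so it can be inspected first.
echo "$CMD < input.bpe.txt > output.bpe.nbest.txt"
```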
from mlconvgec2018.
Use the output.tok.txt file. We use the M2 scorer, which is the standard scorer used for evaluating the CoNLL-2014 shared task systems. Note that the evaluation of some sentences can take a long time with the standard scorer.
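For reference, the M2 scorer reports precision, recall, and F0.5 computed over edit counts. The metric itself (not the scorer's implementation) is easy to state; TP, FP, and FN below stand for true-positive, false-positive, and false-negative edit counts:

```shell
# F0.5 from edit counts, the measure the M2 scorer reports.
# F0.5 weights precision more heavily than recall:
#   F0.5 = (1 + 0.5^2) * P * R / (0.5^2 * P + R)
f_half() {  # args: TP FP FN
    awk -v tp="$1" -v fp="$2" -v fn="$3" 'BEGIN {
        p = tp / (tp + fp)                  # precision
        r = tp / (tp + fn)                  # recall
        printf "%.4f\n", 1.25 * p * r / (0.25 * p + r)
    }'
}
# Example: f_half 30 10 20 gives F0.5 for P = 0.75, R = 0.60
```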
from mlconvgec2018.
Thank you very much. I have encountered a new problem: the number of sentences in the output.tok.txt file differs from the number of sentences in the conll14st-test.tok.src file. The number of sentences in output.tok.txt is 5458, the same as the validation set. Can you help me?
I would be obliged if you could reply at your earliest convenience. Thanks a lot in advance for your time and attention.
from mlconvgec2018.
I used the command:
./run.sh ./data/test/conll14st-test/conll14st-test.tok.src ./data/test/conll14st-test/output 0 ./training/models/mlconv/model1000
which runs these steps:
$SCRIPTS_DIR/apply_bpe.py -c $TRAINING_DIR/models/bpe_model/train.bpe.model < $input_file > $output_dir/input.bpe.txt
# running fairseq on the test data
CUDA_VISIBLE_DEVICES=$device python3.5 $FAIRSEQPY/generate.py --no-progress-bar --path $models --beam $beam --nbest $beam --workers $threads $TRAINING_DIR/processed/bin < $output_dir/input.bpe.txt > $output_dir/output.bpe.nbest.txt --skip-invalid-size-inputs-valid-test
from mlconvgec2018.
The flag --interactive is necessary while running fairseq on a custom input test set.
CUDA_VISIBLE_DEVICES=$device python3.5 $FAIRSEQPY/generate.py --no-progress-bar --path $models --beam $beam --nbest $beam --interactive --workers $threads $MODEL_DIR/data_bin < $output_dir/input.bpe.txt > $output_dir/output.bpe.nbest.txt
from mlconvgec2018.
Thanks a lot in advance for your time and attention. I have summarized the problems I encountered; I think this is a version problem.
First, I used the download.sh file in the software directory to download fairseq-py (github: https://github.com/shamilcm/fairseq-py), but when I ran the command "python setup.py build", there was an error: cffi.error.VerificationError: CompileError: command 'x86_64-linux-gnu-gcc' failed with exit status 1.
So I changed the fairseq-py version (github: https://github.com/facebookresearch/fairseq-py.git), and this error did not appear.
But then I found a problem: the parameters do not correspond. When I ran the command "./run.sh ./data/test/conll14st-test/conll14st-test.tok.src ./data/test/conll14st-test/output 0 ./training/models/mlconv/model1000", there were two errors. The first was generate.py: error: unrecognized arguments: --interactive, so I removed the flag --interactive.
The other error was Exception: Sample #10 has size (src=1, dst=1) but max size is 1022. Skip this example with --skip-invalid-size-inputs-valid-test, so I added the flag --skip-invalid-size-inputs-valid-test. The command then ran successfully, but the number of sentences in the output.tok.txt file differs from the number of sentences in the conll14st-test.tok.src file.
Can you help me? Thank you very much.
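As a quick diagnostic for the mismatch described above, the line counts of the source file and the system output can be compared directly. A minimal sketch (file names follow the thread; substitute your own paths):

```shell
# Check that decoding produced one output sentence per input sentence.
check_counts() {  # args: source_file output_file
    src=$(grep -c '' "$1")   # line count of the source side
    out=$(grep -c '' "$2")   # line count of the system output
    if [ "$src" -eq "$out" ]; then
        echo "OK: $src sentences"
    else
        echo "MISMATCH: src=$src out=$out"
    fi
}
# Usage: check_counts conll14st-test.tok.src output.tok.txt
```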
from mlconvgec2018.
Oh, OK. The version of Fairseq-py in the download.sh script compiles only against an earlier version of PyTorch (0.2.0) that is itself compiled from source.
In recent versions of fairseq-py, the developers have replaced generate.py --interactive with a separate script, interactive.py:
https://github.com/facebookresearch/fairseq-py/blob/master/interactive.py
from mlconvgec2018.
1. So, you mean I can use PyTorch 0.3.0 and remove the flag --interactive? How should I fix the problem that the number of sentences in the output.tok.txt file differs from the number of sentences in the conll14st-test.tok.src file?
2. I also tested with the pre-trained models by running the command "./run.sh ./data/test/conll14st-test/conll14st-test.m2 ./log/ 0 ./models/mlconv_embed/ eolm", and got the same error: the number of sentences in the output.tok.txt file differs from the number of sentences in the conll14st-test.tok.src file.
Thank you very much.
from mlconvgec2018.
- If you use the recent version of Fairseq-py (which uses PyTorch 0.3.0), you should use the script interactive.py (https://github.com/facebookresearch/fairseq-py/blob/master/interactive.py) instead of generate.py.
- If you run run.sh with the recent version of Fairseq-py, and not the one mentioned in the download.sh script, you may encounter this error. This is because generate.py does not have the --interactive flag anymore. I believe it will use the test set within the processed/bin directory and not the one that is provided through standard input. In our training script, we pass the development data itself to the --testpref flag. See mlconvgec2018/training/preprocess.sh, line 41 in 3f270bc.
Btw, where did you obtain the 5458-sentence development set from? Did you download and process the training data yourself?
from mlconvgec2018.
Hello again. I also tried to test the models using run.sh, but ran into the same problem. I want to get the M2 scores, which run.sh does not produce; its output is output.bpe.nbest.txt. How can I get those scores with the trained models?
I will follow the new fairseq implementation.
Any help is appreciated.
Thanks
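For reference, one route from output.bpe.nbest.txt to scoreable text is to keep the best hypothesis per sentence and undo the BPE segmentation, then feed the result to the M2 scorer. This is a sketch under assumptions: it assumes fairseq's "H-<id><TAB><score><TAB><hypothesis>" n-best line format and "@@ " as the BPE continuation marker; check run.sh for the repository's own post-processing before relying on it.

```shell
# Keep the first (best-scoring) hypothesis for each sentence id from a
# fairseq n-best file, in sentence order, and strip BPE markers.
best_from_nbest() {  # arg: n-best file from fairseq
    awk -F '\t' '
        /^H-/ {
            id = substr($1, 3) + 0          # numeric sentence id from "H-<id>"
            if (!(id in hyp)) hyp[id] = $3  # first hypothesis per id is best
            if (id > max) max = id
        }
        END { for (i = 0; i <= max; i++) print hyp[i] }
    ' "$1" | sed 's/@@ //g'                 # undo BPE segmentation
}
# Usage: best_from_nbest output.bpe.nbest.txt > output.tok.txt
```

The resulting output.tok.txt can then be scored against the gold annotations with the M2 scorer.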
from mlconvgec2018.
Thank you for the wonderful source code.
I have a favor to ask of you.
The only GPU I can use is the Colab GPU, so I cannot pretrain the model myself and would like to use the pre-trained one:
https://tinyurl.com/yd6wvhgw/mlconvgec2018/models
In addition to dict.src.txt, which is published at the link above, can I also download test.src-trg.src.bin, test.src-trg.src.idx, etc.?
I am referring to https://github.com/kanekomasahiro/bert-gec.
from mlconvgec2018.