nusnlp / mlconvgec2018

Code and model files for the paper: "A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction" (AAAI-18).

License: GNU General Public License v3.0

Shell 40.04% Python 44.22% Perl 15.73%

mlconvgec2018's People

Contributors

gurunathparasaram, shamilcm

mlconvgec2018's Issues

BPE code used in source and target data

Hi, Shamil Chollampatt.
In the Model and Training Details section of your paper, you state that "Each of the source and target vocabularies consists of 30K most frequent BPE tokens from the source and target side of the parallel data, respectively." However, according to this line in the preprocessing script (training/preprocess.sh), it seems that you learn the BPE codes from the target-side data only and then apply them to both the source and target data.
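For reference, a minimal sketch of the behaviour being described, using subword-nmt-style scripts like the apply_bpe.py bundled in scripts/ (file names here are illustrative, not necessarily those used in training/preprocess.sh):

# Learn the BPE codes from the target side only ...
python learn_bpe.py -s 30000 < train.tok.trg > train.bpe.model
# ... and then apply the same codes to both source and target.
python scripts/apply_bpe.py -c train.bpe.model < train.tok.src > train.bpe.src
python scripts/apply_bpe.py -c train.bpe.model < train.tok.trg > train.bpe.trg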

About training data

I created training data using your scripts (prepare_data.sh and preprocess.sh), but it contains a lot of noise. Did I make a mistake in using the scripts?

How do you do validation / stop training?

Hi,
In the Model and Training Details section of your paper, you state that "We use early stopping and select the best model based on the F_0.5 score on the development set." But to my knowledge, fairseq cannot run validation based on F_0.5, so my questions (for clarification) are:

  • In fairseq, is early stopping only triggered when the current learning rate falls below --min-lr?
  • After fairseq training finishes, the model parameters are saved as checkpoint_1.pt, checkpoint_2.pt, ..., checkpoint_best.pt, checkpoint_last.pt. Do you use the checkpoint_best.pt file directly, or do you manually select a checkpoint among all the *.pt files as the "best model" based on the F_0.5 score on the development set (see the sketch below)?
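For what it's worth, here is a minimal sketch of the second option: decoding the development set with every saved checkpoint and keeping the one with the highest F_0.5 according to the M2 scorer. All paths (checkpoint directory, dev files, m2scorer location) are illustrative assumptions.

# Decode the dev set with each checkpoint, then score each output with the M2 scorer.
for ckpt in models/mlconv/model1000/checkpoint*.pt; do
  name=$(basename "$ckpt" .pt)
  ./run.sh dev.tok.src outputs/$name 0 "$ckpt"
  python2 m2scorer/scripts/m2scorer.py outputs/$name/output.tok.txt dev.m2
done
# Keep the checkpoint whose output.tok.txt achieves the highest F_0.5.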

Many thanks

ImportError: cannot import name 'libbleu'

Hi there. When I ran ./run.sh I got the error below.
I've been working on this issue for five days and still cannot figure it out, so please help me!

Here is what I ran:
./run.sh data/test/conll14st-test/conll14st-test.tok.src outputs 0 models/mlconv/model1.pt

And here is what I got

14st-test.tok.src outputs 0 models/mlconv/model1.pt
++ source paths.sh
+++++ dirname paths.sh
++++ cd .
++++ pwd
+++ BASE_DIR=/mnt/data/okabe/mlconvgec2018
+++ DATA_DIR=/mnt/data/okabe/mlconvgec2018/data
+++ MODEL_DIR=/mnt/data/okabe/mlconvgec2018/models
+++ SCRIPTS_DIR=/mnt/data/okabe/mlconvgec2018/scripts
+++ SOFTWARE_DIR=/mnt/data/okabe/mlconvgec2018/software
++ '[' 4 -ge 4 ']'
++ input_file=data/test/conll14st-test/conll14st-test.tok.src
++ output_dir=outputs
++ device=0
++ model_path=models/mlconv/model1.pt
++ '[' 4 -eq 6 ']'
++ [[ -d models/mlconv/model1.pt ]]
++ [[ -f models/mlconv/model1.pt ]]
++ models=models/mlconv/model1.pt
++ FAIRSEQPY=/mnt/data/okabe/mlconvgec2018/software/fairseq-py
++ NBEST_RERANKER=/mnt/data/okabe/mlconvgec2018/software/nbest-reranker
++ beam=12
++ nbest=12
++ threads=12
++ mkdir -p outputs
++ /mnt/data/okabe/mlconvgec2018/scripts/apply_bpe.py -c /mnt/data/okabe/mlconvgec2018/models/bpe_model/train.bpe.model
++ CUDA_VISIBLE_DEVICES=0
++ python /mnt/data/okabe/mlconvgec2018/software/fairseq-py/generate.py --no-progress-bar --path models/mlconv/model1.pt --beam 12 --nbest 12 --interactive --workers 12 /mnt/data/okabe/mlconvgec2018/models/data_bin
ERROR: missing libbleu.so. run python setup.py install
Traceback (most recent call last):
File "/mnt/data/okabe/mlconvgec2018/software/fairseq-py/generate.py", line 13, in
from fairseq import bleu, options, utils, tokenizer
File "/mnt/data/okabe/mlconvgec2018/software/fairseq-py/fairseq/bleu.py", line 18, in
raise e
File "/mnt/data/okabe/mlconvgec2018/software/fairseq-py/fairseq/bleu.py", line 14, in
from fairseq import libbleu
ImportError: cannot import name 'libbleu'

As the error says it cannot import libbleu, I assume it is caused by something in the fairseq installation.
But I followed the steps in the guideline and installed PyTorch from source, so I have no idea what is wrong.

I'll put my environment info here just in case:
Environment: miniconda on Linux
PyTorch version: 1.3.0
Python version: 3.6.2
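For what it's worth, the error message itself points at the usual fix: the bundled fairseq-py needs its C extensions (including libbleu) built inside the checkout. A hedged sketch:

# Rebuild the bundled fairseq-py so that libbleu is compiled
# (this is what the "run python setup.py install" hint refers to).
cd software/fairseq-py
python setup.py build
python setup.py install

As a separate caveat, the fairseq-py snapshot fetched by download.sh is quite old, so a recent PyTorch such as 1.3.0 may fail to build it at all; in that case the version mismatch, not the build step, is the thing to address.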

Thank you.

Fairseq-py Installation Issue

Hi!
My error is related to a question asked before in:
#14 (comment)
I have PyTorch 0.4.1 and am trying to run
python setup.py install
or
python setup.py build
for fairseq-py, but I get the same issue with both. My installed PyTorch is CUDA-enabled, as you suggested in the linked comment...
(screenshot of the error)
I have been stuck on this for more than a week. Can you please tell me what I am doing wrong? My log file for python setup.py install is: log.txt
and for python setup.py build is: log.txt

Output directory error while running run.sh in google colab

Hi there,
I'm running this command in google colab:
!chmod +x run.sh
!./run.sh '/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/data/tmp/conll14st-test-data/noalt/official-2014.0.conll.ann' '/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/abc' 1 '/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/models/mlconv/model1.pt'

And this is what I get:

++ source paths.sh
+++++ dirname paths.sh
++++ cd .
++++ pwd
+++ BASE_DIR='/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018'
+++ DATA_DIR='/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/data'
+++ MODEL_DIR='/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/models'
+++ SCRIPTS_DIR='/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/scripts'
+++ SOFTWARE_DIR='/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/software'
++ '[' 4 -ge 4 ']'
++ input_file='/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/data/tmp/conll14st-test-data/noalt/official-2014.0.conll.ann'
++ output_dir='/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/abc'
++ device=1
++ model_path='/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/models/mlconv/model1.pt'
++ '[' 4 -eq 6 ']'
++ [[ -d /content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/models/mlconv/model1.pt ]]
++ [[ -f /content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/models/mlconv/model1.pt ]]
++ models='/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/models/mlconv/model1.pt'
++ FAIRSEQPY='/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/software/fairseq-py'
++ NBEST_RERANKER='/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/software/nbest-reranker'
++ beam=12
++ nbest=12
++ threads=12
++ mkdir -p /content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/abc
mkdir: cannot create directory ‘/content/gdrive/My’: Operation not supported

I'm getting errors because mkdir does not handle the spaces in the path. How should I fix it?
Thank you
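The failure comes from the space in "/content/gdrive/My Drive" being split into separate words when the path is expanded unquoted. A minimal illustration of the difference (a generic shell sketch, not run.sh verbatim):

output_dir='/content/gdrive/My Drive/Colab Notebooks/mlconvgec2018/abc'
mkdir -p $output_dir     # unquoted: word-splits into "/content/gdrive/My", "Drive/Colab", ...
mkdir -p "$output_dir"   # quoted: creates the intended directory

Quoting every use of $output_dir (and $input_file, $model_path) inside run.sh, or simply working from a path without spaces, avoids the problem.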

spell check

The paper reports a result with spell checking (Table 1).
May I know whether this repository contains the spell-checking model?
Thanks a lot.

Python error: <stdin> is a directory, cannot continue

Hello,
I get an error when I run the command "./run.sh /home/.../mlconvgec2018/data/test/conll14st-test /home/.../mlconvgec2018/outputs 1 /home/...mlconvgec2018/models/mlconv". It shows "Python error: <stdin> is a directory, cannot continue". Can you help me?
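For comparison, the invocations that work elsewhere on this page pass a tokenized source file, not the test directory, as the first argument, since run.sh pipes that file into apply_bpe.py via stdin. A hedged example:

# The first argument must be a file, not a directory.
./run.sh data/test/conll14st-test/conll14st-test.tok.src outputs 1 models/mlconv/model1.pt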

About testing models

I used a trained model to decode the CoNLL-2014 test set, and the output directory contains four files: input.bpe.txt, output.bpe.nbest.txt, output.bpe.txt, and output.tok.txt. Which file should I use for evaluation, and with which script? Thank you very much.
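A hedged sketch of the usual evaluation: output.tok.txt appears to be the final, BPE-merged one-best output, so it is the natural file to score against the official CoNLL-2014 annotations with the M2 scorer (the m2scorer path below is an illustrative assumption):

# Score the final tokenized output against the CoNLL-2014 gold M2 file.
python2 m2scorer/scripts/m2scorer.py outputs/output.tok.txt data/test/conll14st-test/conll14st-test.m2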

About training

When I run ./train.sh, I encounter this problem:
1.
(python3.6_env) [renhongkai@xxkx-gpu1 training]# ./train.sh

  • set -e
  • source ../paths.sh
    ++++ dirname ../paths.sh
    +++ cd ..
    +++ pwd
    ++ BASE_DIR=/home/renhongkai/project/mlconvgec2018
    ++ DATA_DIR=/home/renhongkai/project/mlconvgec2018/data
    ++ MODEL_DIR=/home/renhongkai/project/mlconvgec2018/models
    ++ SCRIPTS_DIR=/home/renhongkai/project/mlconvgec2018/scripts
    ++ SOFTWARE_DIR=/home/renhongkai/project/mlconvgec2018/software
  • FAIRSEQPY=/home/renhongkai/project/mlconvgec2018/software/fairseq-py
  • SEED=1000
  • DATA_BIN_DIR=processed/bin
  • OUT_DIR=models/mlconv/model1000/
  • mkdir -p models/mlconv/model1000/
  • PYTHONPATH=/home/renhongkai/project/mlconvgec2018/software/fairseq-py:
  • CUDA_VISIBLE_DEVICES=0,1,2
  • python3.5 /home/renhongkai/project/mlconvgec2018/software/fairseq-py/train.py --save-dir models/mlconv/model1000/ --encoder-embed-dim 500 --decoder-embed-dim 500 --decoder-out-embed-dim 500 --dropout 0.2 --clip-norm 0.1 --lr 0.25 --min-lr 1e-4 --encoder-layers '[(1024,3)] * 7' --decoder-layers '[(1024,3)] * 7' --momentum 0.99 --max-epoch 100 --batch-size 32 --no-progress-bar --seed 1000 processed/bin
    Traceback (most recent call last):
    File "/home/renhongkai/project/mlconvgec2018/software/fairseq-py/train.py", line 15, in
    from fairseq import data, options, utils
    File "/home/renhongkai/project/mlconvgec2018/software/fairseq-py/fairseq/options.py", line 11, in
    from fairseq import models
    File "/home/renhongkai/project/mlconvgec2018/software/fairseq-py/fairseq/models/init.py", line 14, in
    from . import fconv, lstm
    File "/home/renhongkai/project/mlconvgec2018/software/fairseq-py/fairseq/models/fconv.py", line 16, in
    from fairseq.modules import BeamableMM, GradMultiply, LearnedPositionalEmbedding, LinearizedConvolution
    File "/home/renhongkai/project/mlconvgec2018/software/fairseq-py/fairseq/modules/init.py", line 10, in
    from .conv_tbc import ConvTBC
    File "/home/renhongkai/project/mlconvgec2018/software/fairseq-py/fairseq/modules/conv_tbc.py", line 13, in
    from fairseq import utils
    File "/home/renhongkai/project/mlconvgec2018/software/fairseq-py/fairseq/utils.py", line 19, in
    from fairseq import criterions, progress_bar, tokenizer
    File "/home/renhongkai/project/mlconvgec2018/software/fairseq-py/fairseq/progress_bar.py", line 17, in
    from tqdm import tqdm
    ImportError: No module named 'tqdm'
But I have already run pip install tqdm, and I can import tqdm in python:

(python3.6_env) [renhongkai@xxkx-gpu1 training]# python
Python 3.6.3 (default, Oct 6 2017, 08:44:35)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.

import tqdm
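Note that the pasted trace shows train.sh invoking python3.5 while the active conda environment is Python 3.6.3, so a package installed into the 3.6 environment is invisible to the interpreter the script actually runs. Two hedged ways to reconcile them:

# Either install tqdm for the interpreter that train.sh calls ...
python3.5 -m pip install tqdm
# ... or edit train.sh so that it invokes the environment's interpreter (e.g. python3.6 or python3).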

2. When I test with the pre-trained models, I encounter this problem:

(python3.6_env) [renhongkai@xxkx-gpu1 mlconvgec2018]# ./run.sh ./data/test/conll14st-test/conll14st-test.m2 ./log/log1.txt 0,1 ./models/mlconv_embed/model1.pt eolm
++ source paths.sh
+++++ dirname paths.sh
++++ cd .
++++ pwd
+++ BASE_DIR=/home/renhongkai/project/mlconvgec2018
+++ DATA_DIR=/home/renhongkai/project/mlconvgec2018/data
+++ MODEL_DIR=/home/renhongkai/project/mlconvgec2018/models
+++ SCRIPTS_DIR=/home/renhongkai/project/mlconvgec2018/scripts
+++ SOFTWARE_DIR=/home/renhongkai/project/mlconvgec2018/software
++ '[' 5 -ge 4 ']'
++ input_file=./data/test/conll14st-test/conll14st-test.m2
++ output_dir=./log/log1.txt
++ device=0,1
++ model_path=./models/mlconv_embed/model1.pt
++ '[' 5 -eq 6 ']'
++ '[' -d ./models/mlconv_embed/model1.pt ']'
++ '[' -f ./models/mlconv_embed/model1.pt ']'
++ model=./models/mlconv_embed/model1.pt
++ FAIRSEQPY=/home/renhongkai/project/mlconvgec2018/software/fairseq-py
++ NBEST_RERANKER=/home/renhongkai/project/mlconvgec2018/software/nbest-reranker
++ beam=12
++ nbest=12
++ threads=12
++ mkdir -p ./log/log1.txt
++ /home/renhongkai/project/mlconvgec2018/scripts/apply_bpe.py -c /home/renhongkai/project/mlconvgec2018/models/bpe_model/train.bpe.model
++ CUDA_VISIBLE_DEVICES=0,1
++ python3.5 /home/renhongkai/project/mlconvgec2018/software/fairseq-py/generate.py --no-progress-bar --path --beam 12 --nbest 12 --interactive --workers 12 /home/renhongkai/project/mlconvgec2018/models/data_bin
ERROR: missing libbleu.so. run python setup.py install
Traceback (most recent call last):
File "/home/renhongkai/project/mlconvgec2018/software/fairseq-py/generate.py", line 12, in
from fairseq import bleu, data, options, tokenizer, utils
File "/home/renhongkai/project/mlconvgec2018/software/fairseq-py/fairseq/bleu.py", line 18, in
raise e
File "/home/renhongkai/project/mlconvgec2018/software/fairseq-py/fairseq/bleu.py", line 14, in
from fairseq import libbleu
ImportError: cannot import name 'libbleu'

But I can import libbleu in python:

(python3.6_env) [renhongkai@xxkx-gpu1 mlconvgec2018]# python
Python 3.6.3 (default, Oct 6 2017, 08:44:35)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.

from fairseq import libbleu

Can anyone help me resolve this? Thank you.

Using interactive instead of generate to evaluate

When I try to test my trained model with interactive.py I get empty results... it's probably something I'm missing; until now I've been using generate.py and it worked fine.

This is what I'm running:
python3.6 /workspace/mlconvgec2018/software/fairseq-py/interactive.py --path models/mlconv_embed/model_exp51000/checkpoint_best.pt --beam 1000 --nbest 1000 processed/bin
< $output_dir/input.bpe.txt > $output_dir/output.bpe.nbest.txt

In $output_dir/input.bpe.txt I have the input after applying the BPE model, and $output_dir/output.bpe.nbest.txt holds the output of interactive.py.

When I check the output file:
cat output.bpe.nbest.txt
Namespace(beam=1000, buffer_size=1, cpu=False, data=['processed/bin'], diverse_beam_groups=1, diverse_beam_strength=0.5, fp16=False, fp16_init_scale=128, fp16_scale_window=None, gen_subset='test', left_pad_source='True', left_pad_target='False', lenpen=1, log_format=None, log_interval=1000, max_len_a=0, max_len_b=200, max_sentences=1, max_source_positions=1024, max_target_positions=1024, max_tokens=None, min_len=1, model_overrides='{}', nbest=1000, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, num_shards=1, path='models/mlconv_embed/model_exp51000/checkpoint_best.pt', prefix_size=0, print_alignment=False, quiet=False, raw_text=False, remove_bpe=None, replace_unk=None, sampling=False, sampling_temperature=1, sampling_topk=-1, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', unkpen=0, unnormalized=False, upsample_primary=1)
| [src] dictionary: 28264 types
| [trg] dictionary: 28200 types
| loading model(s) from models/mlconv_embed/model_exp51000/checkpoint_best.pt
| Found 17435/28264 types in embedding file.
| Found 17409/28200 types in embedding file.
| Type the input sentence and press return:
| WARNING: 1 samples have invalid sizes and will be skipped, max_positions=(1022, 1022), first few sample ids=[0]

Any ideas?

Pre-trained embeddings

Hi, when I run the script train_embed.sh something goes wrong. I am using PyTorch 0.4.1 and fairseq 0.5.0. Could you help me? Thank you.

  • set -e
  • source ../paths.sh
    ++++ dirname ../paths.sh
    +++ cd ..
    +++ pwd
    ++ BASE_DIR=/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5
    ++ DATA_DIR=/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/data
    ++ MODEL_DIR=/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/models
    ++ SCRIPTS_DIR=/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/scripts
    ++ SOFTWARE_DIR=/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/software
  • FAIRSEQPY=/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/software/fairseq-py
  • EMBED_PATH=/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/models/embeddings/wiki_model.vec
  • '[' '!' -f /search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/models/embeddings/wiki_model.vec ']'
  • SEED=1000
  • DATA_BIN_DIR=processed/bin
  • OUT_DIR=models/mlconv_embed/model1000/
  • mkdir -p models/mlconv_embed/model1000/
  • PYTHONPATH=/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/software/fairseq-py:
  • CUDA_VISIBLE_DEVICES=7
  • python3.6 /search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/software/fairseq-py/train.py --save-dir models/mlconv_embed/model1000/ --encoder-embed-dim 500 --encoder-embed-path /search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/models/embeddings/wiki_model.vec --decoder-embed-dim 500 --decoder-embed-path /search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/models/embeddings/wiki_model.vec --decoder-out-embed-dim 500 --dropout 0.2 --clip-norm 0.1 --lr 0.25 --min-lr 1e-4 --encoder-layers '[(1024,3)] * 7' --decoder-layers '[(1024,3)] * 7' --momentum 0.99 --max-epoch 100 --batch-size 96 --no-progress-bar --seed 1000 --arch fconv processed/bin
    Namespace(arch='fconv', clip_norm=0.1, criterion='cross_entropy', data='processed/bin', decoder_attention='True', decoder_embed_dim=500, decoder_embed_path='/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/models/embeddings/wiki_model.vec', decoder_layers='[(1024,3)] * 7', decoder_out_embed_dim=500, device_id=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.2, encoder_embed_dim=500, encoder_embed_path='/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/models/embeddings/wiki_model.vec', encoder_layers='[(1024,3)] * 7', fp16=False, keep_interval_updates=-1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.25], lr_scheduler='reduce_lr_on_plateau', lr_shrink=0.1, max_epoch=100, max_sentences=96, max_sentences_valid=96, max_source_positions=1024, max_target_positions=1024, max_tokens=6000, max_update=0, min_lr=0.0001, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=True, no_save=False, normalization_constant=0.5, optimizer='nag', raw_text=False, restore_file='checkpoint_last.pt', save_dir='models/mlconv_embed/model1000/', save_interval=1, save_interval_updates=0, seed=1000, sentence_avg=False, share_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', train_subset='train', update_freq=[1], valid_subset='valid', validate_interval=1, weight_decay=0.0)
    | [src] dictionary: 29693 types
    | [trg] dictionary: 29793 types
    | processed/bin train 1298911 examples
    | processed/bin valid 5448 examples
    Traceback (most recent call last):
    File "/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/software/fairseq-py/train.py", line 353, in
    main(args)
    File "/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/software/fairseq-py/train.py", line 38, in main
    model = task.build_model(args)
    File "/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/software/fairseq-py/fairseq/tasks/fairseq_task.py", line 43, in build_model
    return models.build_model(args, self)
    File "/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/software/fairseq-py/fairseq/models/init.py", line 25, in build_model
    return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
    File "/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/software/fairseq-py/fairseq/models/fconv.py", line 67, in build_model
    encoder_embed_dict = utils.parse_embedding(args.encoder_embed_path)
    File "/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/software/fairseq-py/fairseq/utils.py", line 260, in parse_embedding
    embed_dict[pieces[0]] = torch.Tensor([float(weight) for weight in pieces[1:]])
    File "/search/odin/liuxiaolong2019/mlconvgec2018-fairseq0.5/software/fairseq-py/fairseq/utils.py", line 260, in
    embed_dict[pieces[0]] = torch.Tensor([float(weight) for weight in pieces[1:]])
    ValueError: could not convert string to float: 'l2018)
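The traceback suggests a malformed line in the embeddings file: each data line is expected to be a token followed by its 500 weights, and one line apparently yields the non-numeric field 'l2018). A quick hedged check for such lines (assuming 500-dimensional vectors, i.e. 501 whitespace-separated fields per line, with the fastText header on line 1):

# Print any line of the .vec file that does not have exactly 501 fields.
awk 'NR > 1 && NF != 501' models/embeddings/wiki_model.vec | head

Tokens that themselves contain spaces, or a truncated final line, are the usual culprits; removing or regenerating those lines should let the embedding parser run through.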

AssertionError when testing

I used the version of fairseq provided by the ./download.sh script in the software dir, but ran into trouble:

/local/path/mlconvgec2018/software/fairseq-py/generate.py", line 43, in main
    dataset = data.load_dataset(args.data, [args.gen_subset], args.source_lang, args.target_lang)
  File "/local/path/mlconvgec2018/software/fairseq-py/fairseq/data.py", line 54, in load_dataset
    assert src is not None and dst is not None, 'Source and target languages should be provided'

AssertionError: Source and target languages should be provided

So I modified the line in run.sh and added --source-lang en --target-lang en, but now it asks me for a dictionary.
Can you help?

'Levenshtein greater than source size' when training re-ranker

Hi!

I am trying to train your model but I keep running into different errors when trying to train the re-ranker. If I use the dev.m2 file (which contains all the original annotations, without applying BPE), I can finish training but I get a lot of "Levenshtein distance is greater than source size" warnings. I did the initial preprocessing of the dataset I'm using with my own scripts (mostly for correcting the sentences), so I'm not sure whether the dev.m2 file used is the same as mine. I have checked the prepare_data.sh script and it seems it might just be the original annotation file? I have also tried using the .src and .trg dev sets, which are split with BPE unlike the .m2 file, and the error changes to 'segmentation fault' (which makes sense as the nbest outputs have been 'de-bpe-ized'). This is what I get when I try using the src file just in case:

+ python2.7 /home/slima/projects/mlconvgec2018/software/nbest-reranker/train.py -i /DATA/slima_data/models/dev.output.tok.nbest.reformat.augmented.txt -r processed/dev.src -c /DATA/slima_data/models/training//rerank_config.ini --threads 12 --tuning-metric m2 --predictable-seed -o /DATA/slima_data/models/training/ --moses-dir ../../mosesdecoder --no-add-weight
[INFO] [01-10-2019 10:45:29] Arguments:
[INFO] [01-10-2019 10:45:29] alg: mert
[INFO] [01-10-2019 10:45:29] command: /home/slima/projects/mlconvgec2018/software/nbest-reranker/train.py -i /DATA/slima_data/models/dev.output.tok.nbest.reformat.augmented.txt -r processed/dev.src -c /DATA/slima_data/models/training//rerank_config.ini --threads 12 --tuning-metric m2 --predictable-seed -o /DATA/slima_data/models/training/ --moses-dir ../../mosesdecoder --no-add-weight
[INFO] [01-10-2019 10:45:29] init_value: 0.05
[INFO] [01-10-2019 10:45:29] input_config: /DATA/slima_data/models/training//rerank_config.ini
[INFO] [01-10-2019 10:45:29] input_nbest: /DATA/slima_data/models/dev.output.tok.nbest.reformat.augmented.txt
[INFO] [01-10-2019 10:45:29] metric: m2
[INFO] [01-10-2019 10:45:29] moses_dir: ../../mosesdecoder
[INFO] [01-10-2019 10:45:29] no_add_weight: True
[INFO] [01-10-2019 10:45:29] out_dir: /DATA/slima_data/models/training/
[INFO] [01-10-2019 10:45:29] pred_seed: True
[INFO] [01-10-2019 10:45:29] ref_paths: processed/dev.src
[INFO] [01-10-2019 10:45:29] threads: 12
[INFO] [01-10-2019 10:45:29] Reading weights from config file
[INFO] [01-10-2019 10:45:29] Feature weights: ['F0= 0.5', 'EditOps0= 0.2 0.2 0.2']
[INFO] [01-10-2019 10:45:29] Extracting stats and features
[WARNING] The optional arguments of extractor are not used yet
[INFO] [01-10-2019 10:45:29] Executing command: ../../mosesdecoder/bin/extractor --sctype M2SCORER --scconfig ignore_whitespace_casing:true -r processed/dev.src -n /DATA/slima_data/models/training//augmented.nbest --scfile /DATA/slima_data/models/training//statscore.data --ffile /DATA/slima_data/models/training//features.data
Binary write mode is NOT selected
Scorer type: M2SCORER
name: ignore_whitespace_casing value: true
Segmentation fault (core dumped)
[INFO] [01-10-2019 10:45:29] Running MERT
[INFO] [01-10-2019 10:45:29] Command: ../../mosesdecoder/bin/mert -d 4 -S /DATA/slima_data/models/training//statscore.data -F /DATA/slima_data/models/training//features.data --ifile /DATA/slima_data/models/training//init.opt --threads 12 -r 1 --sctype M2SCORER --scconfig ignore_whitespace_casing:true
shard_size = 0 shard_count = 0
Seeding random numbers with 1
name: ignore_whitespace_casing value: true
Data::m_score_type M2Scorer
Data::Scorer type from Scorer: M2Scorer
Loading Data from: /DATA/slima_data/models/training//statscore.data and /DATA/slima_data/models/training//features.data
loading feature data from /DATA/slima_data/models/training//features.data
loading score data from /DATA/slima_data/models/training//statscore.data
Data loaded : [Wall 0.000461 CPU 0.000457] seconds.
Creating a pool of 12 threads
terminate called recursively
terminate called after throwing an instance of 'std::runtime_error'
Aborted (core dumped)
[INFO] [01-10-2019 10:45:29] Optimization complete.
Traceback (most recent call last):
  File "/home/slima/projects/mlconvgec2018/software/nbest-reranker/train.py", line 93, in <module>
    assert os.path.isfile('weights.txt')
AssertionError

Any idea of what the problem might be? I think it's probably that the file I used for the -r argument of the re-ranker train.py script might be different to the one you used because of the preprocessing. I also trained my own BPE model and my own embeddings. As you can see in the error excerpt, right now I'm trying to train using only eo features, if that matters.

Thank you in advance, and thank you for making your code open too. It's very helpful.

Edit: nevermind, it was my fault because I forgot I shuffled the dev sentences when preprocessing them. Thank you anyway and sorry!

Edit 2: closed too early because I thought it was just a stupid mistake but the problem persists... should the m2 file not include the annotations?

How to use language model (94Bcclm.trie)?

Hi,
When I run this command:

./run.sh data/test/conll14st-test/conll14st-test.tok.src \
             outputs_embed_eolm \
             5 \
             models/mlconv_embed \
             models/reranker_weights/mlconv_embed_4ens_eo_lm.weights.txt \
             eolm

I have encountered this problem:
(screenshot)
It looks like the format of the language model is wrong.
So I googled for suggestions; they suggest changing the "*.trie" file to an "*.arpa" file.
But then there is a new problem:
(screenshot)
It seems that I need the vocabulary of the language model, but there is no vocab file.

Finally, could you tell me how to use the language model file (94Bcclm.trie)?
Thank you.
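If the .trie file is a KenLM trie binary (which the extension suggests), converting it to ARPA should not be necessary, since the reranker's language-model feature presumably loads the binary directly. A hedged sanity check is to query the model with KenLM's query tool (the kenlm path and model location below are assumptions):

# Sanity-check the trie binary with KenLM's query tool.
echo "this is a test sentence" | kenlm/bin/query 94Bcclm.trie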

ImportError: cannot import name 'libbleu' from 'fairseq'

Hi. When I ran ./run.sh I got the error below.
I saw that others faced the same problem, but their Python version was 3.6; my Python version is 3.7.2.
Please help me!

Here is what I ran:
./run.sh data/test/conll14st-test/conll14st-test.tok.src test 0 models/mlconv/model1.pt

and the error looks like this:
++ source paths.sh
+++++ dirname paths.sh
++++ cd .
++++ pwd
+++ BASE_DIR=/Volumes/SHIMIN/research/mlconvgec2018
+++ DATA_DIR=/Volumes/SHIMIN/research/mlconvgec2018/data
+++ MODEL_DIR=/Volumes/SHIMIN/research/mlconvgec2018/models
+++ SCRIPTS_DIR=/Volumes/SHIMIN/research/mlconvgec2018/scripts
+++ SOFTWARE_DIR=/Volumes/SHIMIN/research/mlconvgec2018/software
++ '[' 4 -ge 4 ']'
++ input_file=data/test/conll14st-test/conll14st-test.tok.src
++ output_dir=test
++ device=0
++ model_path=models/mlconv/model1.pt
++ '[' 4 -eq 6 ']'
++ [[ -d models/mlconv/model1.pt ]]
++ [[ -f models/mlconv/model1.pt ]]
++ models=models/mlconv/model1.pt
++ FAIRSEQPY=/Volumes/SHIMIN/research/mlconvgec2018/software/fairseq-py
++ NBEST_RERANKER=/Volumes/SHIMIN/research/mlconvgec2018/software/nbest-reranker
++ beam=12
++ nbest=12
++ threads=12
++ mkdir -p test
++ /Volumes/SHIMIN/research/mlconvgec2018/scripts/apply_bpe.py -c /Volumes/SHIMIN/research/mlconvgec2018/models/bpe_model/train.bpe.model
++ CUDA_VISIBLE_DEVICES=0
++ python3 /Volumes/SHIMIN/research/mlconvgec2018/software/fairseq-py/generate.py --no-progress-bar --path models/mlconv/model1.pt --beam 12 --nbest 12 --interactive --workers 12 /Volumes/SHIMIN/research/mlconvgec2018/models/data_bin
ERROR: missing libbleu.so. run python setup.py install
Traceback (most recent call last):
File "/Volumes/SHIMIN/research/mlconvgec2018/software/fairseq-py/generate.py", line 13, in
from fairseq import bleu, options, utils, tokenizer
File "/Volumes/SHIMIN/research/mlconvgec2018/software/fairseq-py/fairseq/bleu.py", line 18, in
raise e
File "/Volumes/SHIMIN/research/mlconvgec2018/software/fairseq-py/fairseq/bleu.py", line 14, in
from fairseq import libbleu
ImportError: cannot import name 'libbleu' from 'fairseq' (/Volumes/SHIMIN/research/mlconvgec2018/software/fairseq-py/fairseq/__init__.py)

I tried downloading the latest version of fairseq-py, but the error persists unchanged.

Thanks for the help!

Accuracy of trained model?

I've trained the mlconv model using train_embed.sh, with hyperparameters in the script.
The training ended without error in 5 epochs.

But I can't reproduce the F0.5 score from the paper.
My model achieved an F0.5 score of 0.18
(far from the reported F0.5 of 0.45).
The result (output.tok.txt) was also terrible.

Has anyone run into this problem?

Here's my training log


  • set -e
  • source ../paths.sh
    ++++ dirname ../paths.sh
    +++ cd ..
    +++ pwd
    ++ BASE_DIR=/home/account/torch_gec/mlconvgec2018
    ++ DATA_DIR=/home/account/torch_gec/mlconvgec2018/data
    ++ MODEL_DIR=/home/account/torch_gec/mlconvgec2018/models
    ++ SCRIPTS_DIR=/home/account/torch_gec/mlconvgec2018/scripts
    ++ SOFTWARE_DIR=/home/account/torch_gec/mlconvgec2018/software
  • FAIRSEQPY=/home/account/torch_gec/mlconvgec2018/software/fairseq-py
  • EMBED_PATH=/home/account/torch_gec/mlconvgec2018/models/embeddings/wiki_model.vec
  • '[' '!' -f /home/account/torch_gec/mlconvgec2018/models/embeddings/wiki_model.vec ']'
  • SEED=1000
  • DATA_BIN_DIR=processed/bin
  • OUT_DIR=models/mlconv_embed/model1000/
  • mkdir -p models/mlconv_embed/model1000/
  • PYTHONPATH=/home/account/torch_gec/mlconvgec2018/software/fairseq-py:
  • CUDA_VISIBLE_DEVICES=0
  • python3.5 /home/account/torch_gec/mlconvgec2018/software/fairseq-py/train.py --save-dir models/mlconv_embed/model1000/ --encoder-embed-dim 500 --encoder-embed-path /home/account/torch_gec/mlconvgec2018/models/embeddings/wiki_model.vec --decoder-embed-dim 500 --decoder-embed-path /home/account/torch_gec/mlconvgec2018/models/embeddings/wiki_model.vec --decoder-out-embed-dim 500 --dropout 0.2 --clip-norm 0.1 --lr 0.25 --min-lr 1e-4 --encoder-layers '[(1024,3)] * 7' --decoder-layers '[(1024,3)] * 7' --momentum 0.99 --max-epoch 100 --batch-size 32 --seed 1000 processed/bin
    Namespace(arch='fconv', batch_size=32, clip_norm=0.1, data='processed/bin', decoder_attention='True', decoder_embed_dim=500, decoder_embed_path='/home/account/torch_gec/mlconvgec2018/models/embeddings/wiki_model.vec', decoder_layers='[(1024,3)] * 7', decoder_out_embed_dim=500, dropout=0.2, encoder_embed_dim=500, encoder_embed_path='/home/account/torch_gec/mlconvgec2018/models/embeddings/wiki_model.vec', encoder_layers='[(1024,3)] * 7', force_anneal=0, label_smoothing=0, log_interval=1000, lr=0.25, lrshrink=0.1, max_epoch=100, max_positions=1024, max_tokens=0, min_lr=0.0001, model='fconv', momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, restore_file='checkpoint_last.pt', sample_without_replacement=0, save_dir='models/mlconv_embed/model1000/', save_interval=-1, seed=1000, source_lang=None, target_lang=None, test_batch_size=32, test_subset='test', train_subset='train', valid_batch_size=32, valid_script=None, valid_subset='valid', weight_decay=0.0, workers=1)
    | [src] dictionary: 30004 types
    | [trg] dictionary: 30004 types
    | processed/bin valid 5448 examples
    | processed/bin train 1298763 examples
    | processed/bin test 5448 examples
    | using 1 GPUs (with max tokens per GPU = None)
    | model fconv
    | Loading encoder embeddings from /home/account/torch_gec/mlconvgec2018/models/embeddings/wiki_model.vec
    | Found 25760/30004 types in embeddings file.
    | Loading decoder embeddings from /home/account/torch_gec/mlconvgec2018/models/embeddings/wiki_model.vec
    | Found 25678/30004 types in embeddings file.
    training epoch: 1
    | epoch 001: 0%| | 0/40587 [00:00<?, ?it/s]/home/account/torch_gec/mlconvgec2018/software/fairseq-py/fairseq/models/fconv.py:172: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
    x = F.softmax(x.view(sz[0] * sz[1], sz[2]))
    | epoch 001 | train loss 2.77 | train ppl 6.82 | s/checkpoint 4717 | words/s 5024 | words/batch 584 | bsz 32 | lr 0.250000 | clip 100% | gnorm 0.7062
    | epoch 001 | valid on 'valid' subset: 0%| | 0/2738 [00:00<?, ?it/s]/home/account/torch_gec/mlconvgec2018/software/fairseq-py/fairseq/utils.py:143: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
    return Variable(tensor, volatile=volatile)
    /home/account/torch_gec/mlconvgec2018/software/fairseq-py/fairseq/multiprocessing_trainer.py:213: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
    return loss.data[0]
    | epoch 001 | valid on 'valid' subset | valid loss 10.83 | valid ppl 1823.06
    | epoch 001 | saving checkpoint
    | epoch 001 | saving best checkpoint
    | epoch 001 | saving last checkpoint
    training epoch: 2
    | epoch 002 | train loss 2.12 | train ppl 4.35 | s/checkpoint 4726 | words/s 5014 | words/batch 584 | bsz 32 | lr 0.250000 | clip 100% | gnorm 0.4339
    | epoch 002 | valid on 'valid' subset | valid loss 12.48 | valid ppl 5704.88
    | epoch 002 | saving checkpoint
    | epoch 002 | saving last checkpoint
    training epoch: 3
    | epoch 003 | train loss 1.80 | train ppl 3.47 | s/checkpoint 4731 | words/s 5009 | words/batch 584 | bsz 32 | lr 0.025000 | clip 100% | gnorm 0.3773
    | epoch 003 | valid on 'valid' subset | valid loss 11.59 | valid ppl 3079.46
    | epoch 003 | saving checkpoint
    | epoch 003 | saving last checkpoint
    training epoch: 4
    | epoch 004 | train loss 1.73 | train ppl 3.31 | s/checkpoint 4734 | words/s 5006 | words/batch 584 | bsz 32 | lr 0.002500 | clip 100% | gnorm 0.3848
    | epoch 004 | valid on 'valid' subset | valid loss 10.10 | valid ppl 1096.12
    | epoch 004 | saving checkpoint
    | epoch 004 | saving best checkpoint
    | epoch 004 | saving last checkpoint
    training epoch: 5
    | epoch 005 | train loss 1.72 | train ppl 3.29 | s/checkpoint 4742 | words/s 4997 | words/batch 584 | bsz 32 | lr 0.002500 | clip 100% | gnorm 0.3883
    | epoch 005 | valid on 'valid' subset | valid loss 12.75 | valid ppl 6908.31
    | epoch 005 | saving checkpoint
    | epoch 005 | saving last checkpoint
    training epoch: 6
    | epoch 006 | train loss 1.71 | train ppl 3.28 | s/checkpoint 4744 | words/s 4995 | words/batch 584 | bsz 32 | lr 0.000250 | clip 100% | gnorm 0.3894
    | epoch 006 | valid on 'valid' subset | valid loss 10.83 | valid ppl 1823.29
    | epoch 006 | saving checkpoint
    | epoch 006 | saving last checkpoint
    | done training in 28644.4 seconds
    /home/account/torch_gec/mlconvgec2018/software/fairseq-py/fairseq/utils.py:143: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
    return Variable(tensor, volatile=volatile)
    /home/account/torch_gec/mlconvgec2018/software/fairseq-py/fairseq/models/fconv.py:172: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
    x = F.softmax(x.view(sz[0] * sz[1], sz[2]))
    /home/account/torch_gec/mlconvgec2018/software/fairseq-py/fairseq/sequence_generator.py:357: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
    probs = F.softmax(decoder_out[:, -1, :]).data
    | Test on test with beam=1: BLEU4 = 79.83, 90.3/82.8/76.5/70.9 (BP=1.000, ratio=0.991, syslen=148117, reflen=146808)
    | Test on test with beam=5: BLEU4 = 80.41, 91.0/83.4/77.1/71.5 (BP=1.000, ratio=0.992, syslen=147983, reflen=146808)
    | Test on test with beam=10: BLEU4 = 80.58, 91.1/83.6/77.3/71.7 (BP=1.000, ratio=0.991, syslen=148081, reflen=146808)
    | Test on test with beam=20: BLEU4 = 80.64, 91.1/83.6/77.4/71.8 (BP=1.000, ratio=0.991, syslen=148140, reflen=146808)

TypeError: iter() returned non-iterator of type 'NBestList'

Hi, I encountered a TypeError when I ran run_trained_model.sh with the EO reranker. The error message is as follows:
Traceback (most recent call last):
  File "/mnt/yangzonglin-gec/projects/mlconvgec2018/software/nbest-reranker/rerank.py", line 59, in <module>
    for group in input_aug_nbest:
TypeError: iter() returned non-iterator of type 'NBestList'
Screenshot from 2019-03-11 11-13-37

I also noticed that the train_reranker.sh script printed a lot of messages saying: "Levenshtein distance is greater than source size."
Screenshot from 2019-03-11 11-16-02

Error loading state_size in FConvModel : size mismatch in encoder-decoder weights

@shamilcm @gurunath-p

Hello again. I was trying something new with the mlconvgec models: I trained them with the latest fairseq, using a different BPE (bert) and the moses tokenizer.

Training went fine, but there seems to be an error in generate when decoding the JFLEG test data to get GLEU scores.

python generate.py test/jfleg --path checkpoints/lang8-nucle-bert-moses/checkpoint_best.pt --batch-size 128 --beam 5 --nbest 12 --lang-model-data data-bin/wiki103 --lang-model-path data-bin/wiki103/wiki103.pt --source-lang en --target-lang gec --bpe bert --tokenizer moses --dataset-impl raw

The error is a size mismatch in the layers.
(screenshot)

Any help would be highly appreciated. Thanks in advance.

The size of the training dataset?

In this project, 2,210,277 sentence pairs are used, including Lang-8 and NUCLE, but the paper says 1.3M sentence pairs were used for training. I would like to know the exact sizes of the Lang-8 and NUCLE data you used to train the models.

Error with checkpoint

Hi,

I have this problem when running python generate.py with the old version of fairseq-py (I used a model from https://github.com/nusnlp/mlconvgec2018 and trained with pre-trained word embeddings) and a new version of PyTorch. Could you let me know how to solve it? (It seems similar to #52.)

Thank you for your support

michelle:~$ screen -r 117073.pts-19

++ source paths.sh
+++++ dirname paths.sh
++++ cd .
++++ pwd
+++ BASE_DIR=/home/michelle/mlc/mlconvgec2018
+++ DATA_DIR=/home/michelle/mlc/mlconvgec2018/data
+++ MODEL_DIR=/home/michelle/mlc/mlconvgec2018/models
+++ SCRIPTS_DIR=/home/michelle/mlc/mlconvgec2018/scripts
+++ SOFTWARE_DIR=/home/michelle/mlc/mlconvgec2018/software
++ '[' 4 -ge 4 ']'
++ input_file=/home/michelle/mlc/mlconvgec2018/data/dev.tok.src
++ output_dir=/home/michelle/mlc/mlconvgec2018/result/augment_y_1
++ device=2
++ model_path=/home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000
++ '[' 4 -eq 6 ']'
++ '[' -d /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000 ']'
+++ tr '\n' ' '
+++ sed 's| ([^$])| --path \1|g'
+++ ls /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint10.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint11.pt
/home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint12.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint13.pt /home/v
trinh/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint1.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint2.pt /home/michelle/mlc
/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint3.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint4.pt /home/michelle/mlc/mlconvge
c2018/training/models/mlconv_embed/model1000/checkpoint5.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint6.pt /home/michelle/mlc/mlconvgec2018/tra
ining/models/mlconv_embed/model1000/checkpoint7.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint8.pt /home/michelle/mlc/mlconvgec2018/training/mod
els/mlconv_embed/model1000/checkpoint9.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_best.pt /home/michelle/mlc/mlconvgec2018/training/models/m
lconv_embed/model1000/checkpoint_last.pt
++ models='/home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint10.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/check
point11.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint12.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model100
0/checkpoint13.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint1.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/mo
del1000/checkpoint2.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint3.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_emb
ed/model1000/checkpoint4.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint5.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint6.pt
--path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint7.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint8.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint9.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_best.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_last.pt
/home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint10.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint11.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint12.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint13.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint1.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint2.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint3.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint4.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint5.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint6.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint7.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint8.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint9.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_best.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_last.pt
++ FAIRSEQPY=/home/michelle/mlc/mlconvgec2018/software/fairseq-py
++ NBEST_RERANKER=/home/michelle/mlc/mlconvgec2018/software/nbest-reranker
++ beam=12
++ nbest=12
++ threads=12
++ mkdir -p /home/michelle/mlc/mlconvgec2018/result/augment_y_1
++ /home/michelle/mlc/mlconvgec2018/scripts/apply_bpe.py -c /home/michelle/mlc/mlconvgec2018/models/bpe_model/train.bpe.model
++ CUDA_VISIBLE_DEVICES=2
++ python3.6 /home/michelle/mlc/mlconvgec2018/software/fairseq-py/generate.py --no-progress-bar --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint10.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint11.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint12.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint13.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint1.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint2.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint3.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint4.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint5.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint6.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint7.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint8.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint9.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_best.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_last.pt --beam 12 --nbest 12 --interactive --workers 12 /home/michelle/mlc/mlconvgec2018/models/data_bin
Traceback (most recent call last):
File "/home/michelle/anaconda3/envs/michelle/lib/python3.6/site-packages/torch/nn/modules/module.py", line 514, in load_state_dict
own_state[name].copy_(param)
RuntimeError: inconsistent tensor size, expected tensor [30004 x 500] and src [28799 x 500] to have the same number of elements, but got 15002000 and 14399500 elements respectively at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/TH/generic/THTensorCopy.c:86

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/michelle/mlc/mlconvgec2018/software/fairseq-py/generate.py", line 167, in
main()
File "/home/michelle/mlc/mlconvgec2018/software/fairseq-py/generate.py", line 41, in main
models, dataset = utils.load_ensemble_for_inference(args.path, args.data)
File "/home/michelle/mlc/mlconvgec2018/software/fairseq-py/fairseq/utils.py", line 128, in load_ensemble_for_inference
model.load_state_dict(state['model'])
File "/home/michelle/anaconda3/envs/michelle/lib/python3.6/site-packages/torch/nn/modules/module.py", line 519, in load_state_dict
.format(name, own_state[name].size(), param.size()))
RuntimeError: While copying the parameter named encoder.embed_tokens.weight, whose dimensions in the model are torch.Size([30004, 500]) and whose dimensions in the checkpoint are torch.Size([28799, 500]).
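The mismatch (a 30004 x 500 embedding expected by the loaded dictionaries versus 28799 x 500 in the checkpoint) indicates that generate.py is reading dictionaries that differ from the ones the model was trained on: run.sh points at the pre-packaged models/data_bin, which matches the released models, not a model trained on your own preprocessed data. A hedged sketch of decoding against the matching dictionaries (the processed/bin path is an illustrative assumption):

# Decode with the data_bin directory produced by YOUR preprocessing run,
# not the pre-packaged models/data_bin shipped with the released models.
CUDA_VISIBLE_DEVICES=2 python3.6 software/fairseq-py/generate.py --no-progress-bar \
  --path training/models/mlconv_embed/model1000/checkpoint_best.pt \
  --beam 12 --nbest 12 --interactive --workers 12 \
  training/processed/bin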

About prepare_data.sh

I saw that we have to remove empty target sentences from the NUCLE development data.
Do we have to do the same for the NUCLE training data?
Thank you very much.

General question: GEC data preprocessing

Dear @ALL,

I'm sorry, this is not an issue, just a general question about GEC data pre-processing.

I'm a little confused about the standard GEC dataset format (error-annotated data in M2 format). How can we use the correction labels on the target side to improve the GEC model, instead of discarding them and feeding the data to the model as a plain parallel dataset?

Files in outputs Dir

Is the file best_with-spellcheck.conll14st-test.tok.out in the outputs dir the one corresponding to the best results in the paper? When I run the m2scorer with this file and the gold annotations, I only get F0.5 < 30.

Running mlconvgec2018 for inference?

OS : Ubuntu 16.04
Python Version : 3.5
CUDA : 8

Whenever I set up fairseq-py by running python3 setup.py install (the one downloaded by the download.sh script),
I get:

cmd_obj.run()
File "setup.py", line 49, in run
  conv_tbc.build()
File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/utils/ffi/__init__.py", line 184, in build
  _build_extension(ffi, cffi_wrapper_name, target_dir, verbose)
File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/utils/ffi/__init__.py", line 108, in _build_extension
  outfile = ffi.compile(tmpdir=tmpdir, verbose=verbose, target=libname)
File "/home/ubuntu/.local/lib/python3.5/site-packages/cffi/api.py", line 697, in compile
  compiler_verbose=verbose, debug=debug, **kwds)
File "/home/ubuntu/.local/lib/python3.5/site-packages/cffi/recompiler.py", line 1520, in recompile
  compiler_verbose, debug)
File "/home/ubuntu/.local/lib/python3.5/site-packages/cffi/ffiplatform.py", line 22, in compile
  outputfilename = _build(tmpdir, ext, compiler_verbose, debug)
File "/home/ubuntu/.local/lib/python3.5/site-packages/cffi/ffiplatform.py", line 58, in _build
  raise VerificationError('%s: %s' % (e.__class__.__name__, e))
cffi.error.VerificationError: CompileError: command 'x86_64-linux-gnu-gcc' failed with exit status 1

But as per issue #2 I used the latest fairseq; now, however, I am running into a lot of conflicts, like:

bash run.sh data/test/conll14st-test/conll14st-test.tok.src ~/mlconvgec2018/ 0 models/
+ source paths.sh
++++ dirname paths.sh
+++ cd .
+++ pwd
++ BASE_DIR=/home/ubuntu/mlconvgec2018
++ DATA_DIR=/home/ubuntu/mlconvgec2018/data
++ MODEL_DIR=/home/ubuntu/mlconvgec2018/models
++ SCRIPTS_DIR=/home/ubuntu/mlconvgec2018/scripts
++ SOFTWARE_DIR=/home/ubuntu/mlconvgec2018/software
+ '[' 4 -ge 4 ']'
+ input_file=data/test/conll14st-test/conll14st-test.tok.src
+ output_dir=/home/ubuntu/mlconvgec2018/
+ device=0
+ model_path=models/
+ '[' 4 -eq 6 ']'
+ [[ -d models/ ]]
++ ls 'models//*pt'
++ tr '\n' ' '
++ sed 's| \([^$]\)| --path \1|g'
ls: cannot access 'models//*pt': No such file or directory
+ models=
+ echo

+ FAIRSEQPY=/home/ubuntu/mlconvgec2018/software/fairseq-py
+ NBEST_RERANKER=/home/ubuntu/mlconvgec2018/software/nbest-reranker
+ beam=12
+ nbest=12
+ threads=12
+ mkdir -p /home/ubuntu/mlconvgec2018/
+ /home/ubuntu/mlconvgec2018/scripts/apply_bpe.py -c /home/ubuntu/mlconvgec2018/models/bpe_model/train.bpe.model
+ CUDA_VISIBLE_DEVICES=0
+ python3.5 /home/ubuntu/mlconvgec2018/software/fairseq-py/generate.py --no-progress-bar --path --beam 12 --nbest 12 --interactive --workers 12 /home/ubuntu/mlconvgec2018/models/data_bin
usage: generate.py [-h] [--no-progress-bar] [--log-interval N]
                   [--log-format {json,none,simple,tqdm}] [--seed N] [--fp16]
                   [--task TASK] [--skip-invalid-size-inputs-valid-test]
                   [--max-tokens N] [--max-sentences N] [--gen-subset SPLIT]
                   [--num-shards N] [--shard-id ID] [--path FILE]
                   [--remove-bpe [REMOVE_BPE]] [--cpu] [--quiet] [--beam N]
                   [--nbest N] [--max-len-a N] [--max-len-b N] [--min-len N]
                   [--no-early-stop] [--unnormalized] [--no-beamable-mm]
                   [--lenpen LENPEN] [--unkpen UNKPEN]
                   [--replace-unk [REPLACE_UNK]] [--score-reference]
                   [--prefix-size PS] [--sampling] [--sampling-topk PS]
                   [--sampling-temperature N] [--diverse-beam-groups N]
                   [--diverse-beam-strength N] [--print-alignment]
                   [--model-overrides DICT]
generate.py: error: argument --path: expected one argument

Now the question is: how should I use ./run.sh to run inference?
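For reference, the invocation pattern used elsewhere on this page passes a specific checkpoint file (or a directory that actually contains *.pt files) as the fourth argument; with models= expanded to nothing, generate.py's --path receives no value, which is exactly the "expected one argument" error above. A hedged example:

# Fourth argument: a concrete model checkpoint, not an empty models/ directory.
bash run.sh data/test/conll14st-test/conll14st-test.tok.src outputs 0 models/mlconv/model1.pt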

reranker error

When I run ./run.sh conlltest2014/test2014.en conlltest2014_reranker/ 0,1 models/mlconv models/reranker_weights/mlconv_embed_4ens_eo_lm.weights.txt 'eolm', something goes wrong:

++ python2 /home/nlp/WJF/mlconvgec2018/software/nbest-reranker/rerank.py -i conlltest2014_reranker//output.tok.nbest.reformat.augmented.txt -w models/reranker_weights/mlconv_embed_4ens_eo_lm.weights.txt -o conlltest2014_reranker/ --clean-up
[INFO] [02-08-2019 10:43:35]  Arguments:
[INFO] [02-08-2019 10:43:35]    clean_up: True
[INFO] [02-08-2019 10:43:35]    command: /home/nlp/WJF/mlconvgec2018/software/nbest-reranker/rerank.py -i conlltest2014_reranker//output.tok.nbest.reformat.augmented.txt -w models/reranker_weights/mlconv_embed_4ens_eo_lm.weights.txt -o conlltest2014_reranker/ --clean-up
[INFO] [02-08-2019 10:43:35]    input_nbest: conlltest2014_reranker//output.tok.nbest.reformat.augmented.txt
[INFO] [02-08-2019 10:43:35]    out_dir: conlltest2014_reranker/
[INFO] [02-08-2019 10:43:35]    quiet: False
[INFO] [02-08-2019 10:43:35]    weights: models/reranker_weights/mlconv_embed_4ens_eo_lm.weights.txt
Traceback (most recent call last):
  File "/home/nlp/WJF/mlconvgec2018/software/nbest-reranker/rerank.py", line 72, in <module>
    output_1best.write(group[sorted_indices[0]].hyp + "\n")
IndexError: list index out of range

I use fairseq 0.5 and torch 0.4. What is the problem? Thank you.

Error with pre-trained word embeddings

Hi,

When I run a test with your pre-trained word embeddings:
./run.sh "/home/michelle/mlc/mlconvgec2018/data/test/conll14st-test/conll14st-test.tok.src" "/home/michelle/mlc/test" 2 "/home/michelle/mlc/mlconvgec2018/models/mlconv_embed"

I get the error below. Could you please let me know how to solve it? And how do I get the M2 score instead of the BLEU score?

(michelle) michelle@k:~/mlc/mlconvgec2018$ ./run.sh "/home/michelle/mlc/mlconvgec2018/data/test/conll14st-test/conll14st-test.tok.src" "/home/michelle/mlc/test" 2 "/home/michelle/mlc/mlconvgec2018/models/mlconv_embed"
++ source paths.sh
+++++ dirname paths.sh
++++ cd .
++++ pwd
+++ BASE_DIR=/home/michelle/mlc/mlconvgec2018
+++ DATA_DIR=/home/michelle/mlc/mlconvgec2018/data
+++ MODEL_DIR=/home/michelle/mlc/mlconvgec2018/models
+++ SCRIPTS_DIR=/home/michelle/mlc/mlconvgec2018/scripts
+++ SOFTWARE_DIR=/home/michelle/mlc/mlconvgec2018/software
++ '[' 4 -ge 4 ']'
++ input_file=/home/michelle/mlc/mlconvgec2018/data/test/conll14st-test/conll14st-test.tok.src
++ output_dir=/home/michelle/mlc/test
++ device=2
++ model_path=/home/michelle/mlc/mlconvgec2018/models/mlconv_embed
++ '[' 4 -eq 6 ']'
++ '[' -d /home/michelle/mlc/mlconvgec2018/models/mlconv_embed ']'
+++ ls /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model1.pt /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model2.pt /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model3.pt /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model4.pt
+++ tr '\n' ' '
+++ sed 's| ([^$])| --path \1|g'
++ models='/home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model1.pt --path /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model2.pt --path /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model3.pt --path /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model4.pt '
++ echo /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model1.pt --path /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model2.pt --path /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model3.pt --path /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model4.pt
/home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model1.pt --path /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model2.pt --path /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model3.pt --path /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model4.pt
++ FAIRSEQPY=/home/michelle/mlc/mlconvgec2018/software/fairseq-py
++ NBEST_RERANKER=/home/michelle/mlc/mlconvgec2018/software/nbest-reranker
++ beam=12
++ nbest=12
++ threads=12
++ mkdir -p /home/michelle/mlc/test
++ /home/michelle/mlc/mlconvgec2018/scripts/apply_bpe.py -c /home/michelle/mlc/mlconvgec2018/models/bpe_model/train.bpe.model
++ CUDA_VISIBLE_DEVICES=2
++ python3.6 /home/michelle/mlc/mlconvgec2018/software/fairseq-py/generate.py --no-progress-bar --path /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model1.pt --path /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model2.pt --path /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model3.pt --path /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model4.pt --beam 12 --nbest 12 --interactive --workers 12 /home/michelle/mlc/mlconvgec2018/models/data_bin
Traceback (most recent call last):
File "/home/michelle/mlc/mlconvgec2018/software/fairseq-py/generate.py", line 167, in
main()
File "/home/michelle/mlc/mlconvgec2018/software/fairseq-py/generate.py", line 41, in main
models, dataset = utils.load_ensemble_for_inference(args.path, args.data)
File "/home/michelle/mlc/mlconvgec2018/software/fairseq-py/fairseq/utils.py", line 127, in load_ensemble_for_inference
model = build_model(args, dataset)
File "/home/michelle/mlc/mlconvgec2018/software/fairseq-py/fairseq/utils.py", line 31, in build_model
return getattr(models, args.model).build_model(args, dataset)
File "/home/michelle/mlc/mlconvgec2018/software/fairseq-py/fairseq/models/fconv.py", line 541, in build_model
dictionary=dataset.src_dict
File "/home/michelle/mlc/mlconvgec2018/software/fairseq-py/fairseq/models/fconv.py", line 100, in init
self.embed_tokens = load_embeddings(embed_path, dictionary, self.embed_tokens)
File "/home/michelle/mlc/mlconvgec2018/software/fairseq-py/fairseq/models/fconv.py", line 22, in load_embeddings
with open(embed_path) as f_embed:
FileNotFoundError: [Errno 2] No such file or directory: '/home.local/shamil/wiki/wiki.bpe.fasttext/model.vec'

Requesting NUCLE dataset 3.2

Hello shamilcm,

I am glad to research further in the direction of your paper. I am doing a thesis at the University of Hildesheim. I would like to request the link to the NUCLE 3.2 tar file. I filled in the registration on the NUS website, but did not get any reply.

Please send it to my mail: [email protected]
