kanyun-inc / fairseq-gec Goto Github PK

Source code for paper: Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data

License: Other

Python 98.10% C++ 0.44% Shell 0.91% Lua 0.54%

grammar nlp

fairseq-gec's People

Stargazers

Watchers

Forkers

yuxianmeng gurunathparasaram wangwang110 ianliyi1996 nonva salensoft jariojose binbinjin katsumata420 tcxdgit wulouzhu hnn123 mtfelix xiaoshengjun vijaynadimpalli arvindsg huhaven lndoremi jessewong333 jind11 qyz-thu melisa-writer suhe36 pidugusundeep anuragsharma111 slbinilkumar danhorvath geor7 kkxcam bfsujason awasthiabhijeet vandykai lkfo415579 dongwookshin vtrn988 bcmi220 young1993 ralph-bot rajevv mm2116 owenluo123 qujingying wxrui neverneverendup tanmaypandey7 aayc oscarwang114 wanghia emptybeauty521 sukannyapurkayastha cmathx zhengyuan-nlp aleksandervainer zhanzq johnson7788 zsqrt overwindows kent0304 appleyc aiejf e3u3 wkwkgg knitts kkeleve sophiarong youngerzjj

fairseq-gec's Issues

Questions regarding checkpoints and interactive.sh

To run interactive.sh, it requires a model file ./out_big_art/models_denoise/checkpoint5.pt. I created such directory and copied the checkpoint from out/model/checkpoint5.pt and it can still run and give seemingly good results. How is this supposedly existing ./out/models_denoise/checkpoint5.pt different from checkpoints in out/model/, i.e., can I use checkpoints from out/model to run interactive.sh and will it give different results? And what are the ./out/modelscheckpointema.pt for? How are they different from normal checkpoints in the same directory?

Thank you so much for all the amazing work and gracious help!

Some problems about run pretrain.sh

Hello, I want to use subtoken instead of the original token to modify this code, and then after generating training data, the following error occurred when running the code pretrain.sh.

Exception: process 1 terminated with signal SIGKILL

Have you had any similar error? And how much memory did you use when you ran 100 million data?

bash train.sh got error

I run bash train.sh 0 name ... but I got error:

RuntimeError: Output 0 of SplitBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.

Can help me ?

Is negative loss normal?

I trained my own model as the train.sh scripts, but found loss is negative after many epochs. I'm wondering is negative loss possible to appear or not?

How to make training data.

Thank you for your excellent job of the GEC task. I am troubled with preprocessing of training data.
Please tell me how to preprocess training data. Specifically, I want to know how to make training data.

About the validation set

Thanks for your excellent work!
After running train.sh, I found that there were about 5,000 valid samples. Since there are only 1381 sentences in CoNLL-2013 test set, can you please tell me where those valid samples come from?

How to make data/train_merge and data/valid?

Hello, thank you for your great work.
I am trying to replicate this work and do further work,
using this copy-attention model for GEC as a baseline model.

Currently I am trying to switch the dataset (to Korean).
I already have the preprocessed data.

But, I have trouble making the input for the model.
I think that I have to make things that are in out/data_bin,
which are [train/valid].src-tgt.[src/tgt].[bin/idx] and [train/valid].label.[src/tgt].txt.

By analyzing the code, I found that I can make the label file and the binary file by running preprocess.sh.
But I always get the error:
FileNotFoundError: [Errono 2] No such file or directory: 'data/train_merge.src'
by further analyzing your code, I found that we need "trainpref" and "validpref", which are listed as
'data/train_merge' and 'data/valid', to generate label and binary file.
but I couldn't find the code that generates this, which means that I have to make this by myself.

My question is this.
Overall: How can I make input for training the model?

How can I make data/train_merge.src, data/train_merge.tgt, data/valid, and the alignfile(data/train_merge.forward)? What are the formats?
(example: dict.src.txt have [word] [frequency] format of the training data)
Is there other files that I need other than data/train_merge.[src/forward], data/valid, and dicts/dict.src.txt, In order to run preprocess.sh and get all the inputs needed?

Also, it will be so much helpful if you tell me the general process about how to run the model with different preprocessed dataset.
Here, "preprocessed" means that I have done all of this(#14) to make training data, and I now have a clean sentence pair of [grammatically correct/ grammatically incorrect] dataset.

Thank you for reading my question. I will be waiting for your answer.

Multi-Task

Hello~ We are so interested in this paper's thoughts, while we cannot find the code of multi-task, and if you will release this code next~

running `bash generate.sh` creates model and results folders but there is nothing in them

Anyone else ran into this issue?

Got an Error while retraining model using ./train.sh

Hi,
I am trying to retrain the given model with a new dataset for my thesis. Preprocessing worked fine but now I get the following error when trying to run train.sh:

neg_target = target.new_tensor(target).masked_fill_(target_label, self.padding_idx)
RuntimeError: The expanded size of the tensor (384) must match the existing size (832) at non-singleton dimension 0. Target sizes: [384]. Tensor sizes: [832]

I didn't change anything besides the pretrained model path in train.sh.

I previously fixed this error

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 1, 1536]], which is output 0 of AddBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

by changing q *= self.scaling to q = q * self.scaling in line 109 of the multihead_attention.py of fairseq.

Thank you.

Question about difference of formula of composition_score and copy_score between paper and code

As defined in Equation 5 of the paper, the final probability distribution P_t is a mix of generation distribution and copy distribution.
p_t(w) = (1−α_copy)∗p_gen(w)+(α_copy)∗p_copy(w)
However, I found the formula used in code as follows:
composite_scores = copy_alpha * composite_scores
copy_scores = (1 - copy_alpha) * copy_attn
I am concered about whether the difference would influence the effect of model and how it would effect on the performance.

Question about Single Model Ablation Study on CoNLL 2014 Test Data Set in the paper

I experimented with your program and data. However, I can not get the result like the paper.
What kind of system does Transformer of the item of Model include?

Got an error while run `sh generate.sh`

generate.py: error: unrecognized arguments: --copy-ext-dict

May I know what‘s the reason?

Gpu predicts switching to Cpu

Hello, I want to change the GPU to CPU and change those parameters during the prediction and training？

How do I test the trained model with custom sentences?

I followed the mentioned steps in readme and succesfully trained the model. Now I want to test custom sentences and get the proposed hypothesis of the model

question about the sentences copy task

First，thank you for your good work，I want to know， in sentence-copy task，how to construct one batch， half of the correct sentence pairs and half of the edited sentence pairs？ or one batch only containing the correct sentence pairs and the next batch only containing the edited sentence pairs

Cross entropy loss giving RuntimeError: CUDA error: device-side assert triggered

Hi,

I tried to train a model from scratch on the data that I have downloaded from the link mentioned on your readme. However, when I am trying to train the model I am facing a runtime error. The issue I think is perhaps due to the format of data because when I tried to run the same code on a different dataset (eng-german), the code seems to run without any glitch.
Currently, I am using python 3.7 and pytorch version 1.3.
Any help would be deeply appreciated.

How to run this model

Thx for your program.
I have watched your instruction video in ACL, and I want to run this program in my research. But I donnot know how to use it.
Thx.

Question about model parameters, UNK words, loss function and spell error correction system

Hi,
I find some parameters, learning rate and weight decay, mentioned in Section 5.2 are not consistent with the released train.sh script. So, for all single models shown in Table 5, how do you set for these parameters when you did Single Model Ablation Study?

Besides, if all single models shown in Table 5 use the edit-weighted MLE? And "Ignoring UNK words as edits" means replacing the with the source word, i.e., using the "--replace-unk" parameter or just dropping the token.

I notice you use the a statistical-based spell error correction system to pre-process the training data. How can I find this system?

Why do you need alignfile for our implementation? Please document better.

An alignfile is needed to train the model on new data, however in the alignfile I see:

source ./config.sh
mkdir data_align

trainpref='data/train_merge'
trainpref='data/valid'

python scripts/build_sym_alignment.py --fast_align_dir ~/software/fast_align/build/ --mosesdecoder_dir fakkk --source_file $trainpref.src --target_file $trainpref.tgt --output_dir data_align 

cp data_align/align.forward $trainpref.forward
cp data_align/align.backward $trainpref.backward

rm -rf data_align

I'm assuming setting trainpref twice is a bug, so should I remove it? Do I need a validation alignfile as well? Why is an alignfile even necessary? When I actually do run it I get:

sh: 1: /h/user/software/fast_align/build/fast_align: not found
Traceback (most recent call last):
  File "scripts/build_sym_alignment.py", line 101, in <module>
    main()
  File "scripts/build_sym_alignment.py", line 75, in main
    assert os.system(fwd_fast_align_cmd) == 0
AssertionError
cp: cannot stat 'data_align/align.backward': No such file or directory

Why aren't your paths relative instead of absolute? Which commit of the fast_align implementation are you using, I'm assuming just https://github.com/clab/fast_align master?

Please actually test your code for the use-cases you're advertising before releasing. For reference, https://github.com/kanekomasahiro/bert-gec is also based on the fairseq library and manages to get up and running painlessly in minutes. Please improve your documentation if you actually want people to adopt your work!!!!!

Get raw train/validation text dataset

Is there any method could transform the binary train/vali dataset files to raw text?

Inference results using "generate.sh"

Hi ☺

Thank you for your great work in advance.

I tried to generate inference results using your pretrained model and my dataset (It consists of two files: source and target).

But after preprocessing ("preprocess.sh"), I've encountered a problem when i run "generate.sh":

>>> bash generate.sh 0 _expr_ht
_last
Traceback (most recent call last):
  File "generate.py", line 196, in <module>
    cli_main()
  File "generate.py", line 192, in cli_main
    main(args)
  File "generate.py", line 111, in main
    hypos = task.inference_step(generator, models, sample, prefix_tokens)
  File "/data02/jeiyoon_park/fairseq-gec/fairseq/tasks/fairseq_task.py", line 243, in inference_step
    return generator.generate(models, sample, prefix_tokens=prefix_tokens)
  File "/data02/jeiyoon_park/anaconda3/envs/ydfu/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data02/jeiyoon_park/fairseq-gec/fairseq/sequence_generator.py", line 382, in generate
    scores.view(bsz, beam_size, -1)[:, :, :step],
  File "/data02/jeiyoon_park/fairseq-gec/fairseq/search.py", line 83, in step
    torch.div(self.indices_buf, vocab_size, out=self.beams_buf)
RuntimeError: result type Float can't be cast to the desired output type Long

I checked generated "outputema_last.nbest.txt" file and found some logs

Label file not found: /data02/jeiyoon_park/fairseq-gec/ht_result/test.label.src.txt
Label file not found: /data02/jeiyoon_park/fairseq-gec/ht_result/test.label.tgt.txt

I just want to infer the results using your pretrained model and my dataset, not train a model

Could you let me know how to infer the results?

Thank you!

Could you provide training and validating data?

I download the file from your google drive but I cannot find raw training data. Or could you provide your preprocess method in detail?

I can't run running bash train.sh

Hello , :)

I got this error when tried to run the train.sh , how to fix this issue ?

`aiman@ta:~/fairseq-gec-master$ bash train.sh 0 aiman
fatal: not a git repository (or any of the parent directories): .git
GIT: unknown unknown
2021-12-01 20:14:16
--------------------------------------------------------------------------------
Namespace(adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer', attention_dropout=0.2, bucket_cap_mb=25, clip_norm=2.0, copy_attention=True, copy_attention_dropout=0.2, copy_attention_heads=1, copy_ext_dict=False, cpu=False, criterion='cross_entropy', curriculum=0, data=['out/data_bin'], ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.2, ema_decay=0.9999, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.001], lr_period_updates=73328.0, lr_scheduler='triangular', lr_shrink=0.95, max_epoch=9, max_lr=0.004, max_sentences=64, max_sentences_valid=64, max_source_positions=1024, max_target_positions=1024, max_tokens=3000, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-05, momentum=0.99, no_ema=False, no_epoch_checkpoints=False, no_progress_bar=True, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='nag', optimizer_overrides='{}', positive_label_weight=1.2, pretrained_model='./out/models_pretrain/checkpoint9.pt', raw_text=False, relu_dropout=0.2, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='out/modelsaiman', save_interval=1, save_interval_updates=0, seed=4321, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, shrink_min=True, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, weight_decay=0.0)
Traceback (most recent call last):
  File "train.py", line 435, in <module>
    cli_main()
  File "train.py", line 431, in cli_main
    main(args)
  File "train.py", line 42, in main
    task = tasks.setup_task(args)
  File "/home/aiman/fairseq-gec-master/fairseq/tasks/__init__.py", line 19, in setup_task
    return TASK_REGISTRY[args.task].setup_task(args)
  File "/home/aiman/fairseq-gec-master/fairseq/tasks/translation.py", line 112, in setup_task
    raise Exception('Could not infer language pair, please provide it explicitly')
Exception: Could not infer language pair, please provide it explicitly`

Kind regards
Aiman Solyman

Question about evaluation?

Can I do the evaluations without any training? If so, where can I find the script g.sh? Or do I have to train it with the pre-trained model first and then it will generate a g.sh script? Or do I have to create a g.sh script myself by calling some of the scripts? Thank you for the gracious help!

Hi could you link the final model for inference

Hi thank you for your work. Could you link the final model to use at inference for comparison?

The different number of test data.

This code provides a test set with 1331 examples. However, it should be 1312 I think.

Details about training corpus, availability of pre-trained model and data

Thanks for open sourcing the code. I have a few queries:

For pretraining and adding noise, 'train_1b.tgt' is used. Is this the same as concatenating all the files from . Was any further pre-processing done?
Can a link to the pre-processed data and pre-trained files be provided?
Thanks!

How should I use pretrained model to process new data

I want to use the pretrained model checkpoint9.pt directly to process the new data, how do I organize the code.(The methods I find on baidu cannot be used, so I ask you for some suggestion ,like torch.load() function)

How to train this on a new language?

Hello,

thanks for sharing the repository. What are the steps to follow to train this model on the GEC task for languages different from english (german, italian, french, etc...)?

Details about the token-level labeling Task

Thank you for your excellent job of the GEC task. I saw you mentioned the token-level labeling task. I didn't find it in this code. Could you give me more details about how to combine the token-level labeling task with the present work. Thank you for your amazing works again.

Where is "g.sh"?

According README.md, for evaluating on the CoNLL-2014 testset, it should run the following script:

sh g.sh \${device_id} \${experiment_name}

But, where is the "g.sh"?? I don't find it in this repository.

The paper mentions that an ensemble model achieved the state of the art results however there is no mention of how the seperate models were trained

The readme only mentions of a training one single model if I'm not wrong. How to go about training 4 models as mentioned by the results table of your paper?

About the model performance

I have tried this code without pretrain and the performance is presented as follow:
P:0.6580 R:0.2570 F0.5:0.5015
However, I find the paper report this performance should be about 54(Transformer + Copy).

How to visualize attention as mentioned in paper for my task(Hindi GEC)?

How to visualize attention as mentioned in the paper for my task(Hindi GEC)? Like as shown in the figure.

Details about training the best single model described in the paper

In the Copynet paper, in Table-5: Single Model ablation study, it is mentioned that F-score for the copy-augmented transformer architecture with the denoising auto-encoder(pre-trained) is 58.80.

I tried downloading the code, pre-trained models and the data you have provided in the repo and got an F-score of 52.76
Does the current repo with the given train.sh, pre-trained model file: checkpoint9.pt and data provided in the download link replicate the result of 58.8 F-score with 9 epochs as mentioned in the train.sh script?
Should we change any of the hyper-parameters to achieve the F-score of 58.8 mentioned in the paper?

Thanks.

About pre-train the DAE

I find you generated 9 noised data from one-billion data. Do you use all these data to pre-train? And, in the paper "We set Λ = 3 when we train the denoising auto-encoder, and set Λ =[1; 1:8] when we train GEC models", but in the code, Λ = 1.3, which has better performance?

Questions about training data

I found two fields in the training data, source_label and target_label, each item consists of 0 and 1, and has a length equal to src_tokens and target.
Can you please explain the meaning of these two fields? How are they used during training？

when single node with multi gpus , it does not run .

hi ,
question descirption :
states = torch.load(args.pretrained_model)['model'] --> error line
error message:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
please give me a hand .

questions about multi-tasks

I didn't find multi-tasks learning in the code , I wonder whether the code will be released later ?

Could you share the fairseq commits id where you init your code?

I'd appreciate it if you could share the fairseq commits id where you initialize your code, since I could remain a clean commits line from the original fairseq repository.

Dropout causing negative loss values.

Hello,

Pytorch implements inverted dropout which scales up the values. This means that the weights in the MultiHeadAttention module from fairseq can go over 1 since dropout is applied after the softmax.

Since the model uses the attention weights from the copy module to compute the final probability distribution, p_t(w), values over 1 are also possible. In turn, this makes it possible for the cross entropy loss to be negative. Cross entropy expect a probability distribution which is not what is given during the training of this model.

Is it something that you considered?

I've noticed that no negative loss occur using your training data which surprise me. Maybe there is something I'm not seeing in the way you compute the loss. It seems to be related to the labels made from the train.forward file. However, using my own training data (and your code), it does happen.