kanyun-inc / fairseq-gec
Source code for paper: Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data
License: Other
To run interactive.sh, a model file ./out_big_art/models_denoise/checkpoint5.pt is required. I created such a directory and copied the checkpoint from out/model/checkpoint5.pt, and it still runs and gives seemingly good results. How is this supposedly existing ./out/models_denoise/checkpoint5.pt different from the checkpoints in out/model/, i.e., can I use checkpoints from out/model to run interactive.sh, and will that give different results? And what are the ./out/models/checkpoint…ema.pt files for? How are they different from the normal checkpoints in the same directory?
Thank you so much for all the amazing work and gracious help!
Hello, I want to modify this code to use subtokens instead of the original tokens. After generating the training data, the following error occurred when running pretrain.sh:
Exception: process 1 terminated with signal SIGKILL
Have you encountered a similar error? And how much memory did you use when you ran the 100 million training examples?
I ran bash train.sh 0 name, but I got this error:
RuntimeError: Output 0 of SplitBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
Can you help me?
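For reference, this is the kind of change the error message seems to be asking for, shown on a toy example (my own sketch of the pattern, not the actual line in this repository):

```python
import torch

# Toy sketch of the pattern behind the error (not the repository's actual code):
# torch.split returns multiple views, and modifying one of them in place is what
# triggers "Output 0 of SplitBackward0 is a view and is being modified inplace".
x = torch.randn(4, 6, requires_grad=True)
a, b = torch.split(x, 3, dim=1)

# a.mul_(2.0)      # in-place on the view -> the RuntimeError above
a = a * 2.0        # out-of-place replacement avoids it
(a.sum() + b.sum()).backward()
```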
I trained my own model with the train.sh script, but found that the loss becomes negative after many epochs. I'm wondering whether it is possible for the loss to be negative or not.
Thank you for your excellent work on the GEC task. I am having trouble with the preprocessing of the training data.
Please tell me how to preprocess the training data. Specifically, I want to know how the training data is constructed.
Thanks for your excellent work!
After running train.sh, I found that there were about 5,000 valid samples. Since there are only 1381 sentences in the CoNLL-2013 test set, can you please tell me where those valid samples come from?
Hello, thank you for your great work.
I am trying to replicate this work and do further work,
using this copy-attention model for GEC as a baseline model.
Currently I am trying to switch the dataset (to Korean).
I already have the preprocessed data.
But I have trouble making the input for the model.
I think I have to create the files that are in out/data_bin,
which are [train/valid].src-tgt.[src/tgt].[bin/idx] and [train/valid].label.[src/tgt].txt.
By analyzing the code, I found that I can make the label file and the binary file by running preprocess.sh.
But I always get the error:
FileNotFoundError: [Errno 2] No such file or directory: 'data/train_merge.src'
By further analyzing your code, I found that we need "trainpref" and "validpref", which are set to 'data/train_merge' and 'data/valid', to generate the label and binary files.
But I couldn't find the code that generates these files, which means I have to make them myself.
My question is this.
Overall: how can I make the input for training the model?
Also, it would be very helpful if you could describe the general process for running the model with a different preprocessed dataset.
Here, "preprocessed" means that I have done all of this (#14) to make the training data, and I now have a clean dataset of [grammatically correct / grammatically incorrect] sentence pairs.
Thank you for reading my question. I will be waiting for your answer.
Hello~ We are very interested in the ideas of this paper, but we cannot find the code for the multi-task learning part. Will you release this code later?
Has anyone else run into this issue?
Hi,
I am trying to retrain the given model with a new dataset for my thesis. Preprocessing worked fine but now I get the following error when trying to run train.sh:
neg_target = target.new_tensor(target).masked_fill_(target_label, self.padding_idx)
RuntimeError: The expanded size of the tensor (384) must match the existing size (832) at non-singleton dimension 0. Target sizes: [384]. Tensor sizes: [832]
I didn't change anything besides the pretrained model path in train.sh.
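To make the error easier to discuss, here is a minimal standalone reproduction of the same message with toy tensors (my own illustration, not the repository's code):

```python
import torch

# masked_fill_ needs the mask to be broadcastable to the tensor being filled,
# so an 832-element label mask cannot be applied to a 384-element target.
target = torch.zeros(384, dtype=torch.long)
target_label = torch.zeros(832, dtype=torch.bool)

try:
    target.new_tensor(target).masked_fill_(target_label, 0)
except RuntimeError as e:
    print(e)  # "The expanded size of the tensor (384) must match the existing size (832) ..."
```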
I previously fixed this error
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 1, 1536]], which is output 0 of AddBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
by changing q *= self.scaling to q = q * self.scaling in line 109 of fairseq's multihead_attention.py.
Thank you.
As defined in Equation 5 of the paper, the final probability distribution P_t is a mix of generation distribution and copy distribution.
p_t(w) = (1 − α_copy) · p_gen(w) + α_copy · p_copy(w)
However, I found that the formula used in the code is as follows:
composite_scores = copy_alpha * composite_scores
copy_scores = (1 - copy_alpha) * copy_attn
I am concerned about whether this difference would influence the behavior of the model and how it would affect performance.
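To spell out what I mean, this is how I read Equation 5, on toy tensors (my own sketch, not the repository's code); in the snippet above, the coefficients seem to be applied the other way around, which looks swapped unless copy_alpha there is defined as 1 − α_copy:

```python
import torch

# Toy version of Equation 5: p_t(w) = (1 - alpha_copy) * p_gen(w) + alpha_copy * p_copy(w)
vocab_size = 6
p_gen = torch.softmax(torch.randn(vocab_size), dim=-1)    # generation distribution
p_copy = torch.softmax(torch.randn(vocab_size), dim=-1)   # copy (attention) distribution
alpha_copy = torch.tensor(0.3)                            # balancing factor from the copy gate

p_t = (1 - alpha_copy) * p_gen + alpha_copy * p_copy      # mixture as written in the paper
assert torch.isclose(p_t.sum(), torch.tensor(1.0))        # still a valid distribution
```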
I experimented with your program and data. However, I cannot get results like those in the paper.
Which systems does the "Transformer" entry in the Model column include?
generate.py: error: unrecognized arguments: --copy-ext-dict
May I know what's the reason?
Hello, I want to run prediction and training on CPU instead of GPU. Which parameters do I need to change?
I followed the steps mentioned in the readme and successfully trained the model. Now I want to test custom sentences and get the model's proposed hypotheses.
First, thank you for your good work. I want to know how a batch is constructed in the sentence-copy task: is each batch made of half correct sentence pairs and half edited sentence pairs, or does one batch contain only correct sentence pairs and the next batch only edited sentence pairs?
Hi,
I tried to train a model from scratch on the data that I downloaded from the link mentioned in your readme. However, when I try to train the model I run into a runtime error. I think the issue is perhaps due to the format of the data, because when I ran the same code on a different dataset (English-German), it ran without any glitch.
Currently, I am using Python 3.7 and PyTorch 1.3.
Any help would be deeply appreciated.
Thanks for your program.
I have watched your presentation video from ACL, and I want to use this program in my research, but I don't know how to use it.
Thanks.
Hi,
I find that some parameters mentioned in Section 5.2, the learning rate and weight decay, are not consistent with the released train.sh script. So, for all the single models shown in Table 5, how did you set these parameters in the Single Model Ablation Study?
Besides, do all the single models shown in Table 5 use the edit-weighted MLE? And does "Ignoring UNK words as edits" mean replacing the UNK token with the source word, i.e., using the "--replace-unk" parameter, or just dropping the token?
I notice you use a statistics-based spelling error correction system to pre-process the training data. Where can I find this system?
An alignfile is needed to train the model on new data; however, in the alignfile I see:
source ./config.sh
mkdir data_align
trainpref='data/train_merge'
trainpref='data/valid'
python scripts/build_sym_alignment.py --fast_align_dir ~/software/fast_align/build/ --mosesdecoder_dir fakkk --source_file $trainpref.src --target_file $trainpref.tgt --output_dir data_align
cp data_align/align.forward $trainpref.forward
cp data_align/align.backward $trainpref.backward
rm -rf data_align
I'm assuming setting trainpref twice is a bug, so should I remove it? Do I need a validation alignfile as well? Why is an alignfile even necessary? When I actually do run it I get:
sh: 1: /h/user/software/fast_align/build/fast_align: not found
Traceback (most recent call last):
File "scripts/build_sym_alignment.py", line 101, in <module>
main()
File "scripts/build_sym_alignment.py", line 75, in main
assert os.system(fwd_fast_align_cmd) == 0
AssertionError
cp: cannot stat 'data_align/align.backward': No such file or directory
Why aren't your paths relative instead of absolute? Which commit of the fast_align implementation are you using? I'm assuming just https://github.com/clab/fast_align master?
Please actually test your code for the use-cases you're advertising before releasing. For reference, https://github.com/kanekomasahiro/bert-gec is also based on the fairseq library and manages to get up and running painlessly in minutes. Please improve your documentation if you actually want people to adopt your work!!!!!
Is there any method that could transform the binary train/valid dataset files back to raw text?
Hi ☺
Thank you for your great work in advance.
I tried to generate inference results using your pretrained model and my own dataset (it consists of two files: source and target).
But after preprocessing ("preprocess.sh"), I encountered a problem when I run "generate.sh":
>>> bash generate.sh 0 _expr_ht
_last
Traceback (most recent call last):
File "generate.py", line 196, in <module>
cli_main()
File "generate.py", line 192, in cli_main
main(args)
File "generate.py", line 111, in main
hypos = task.inference_step(generator, models, sample, prefix_tokens)
File "/data02/jeiyoon_park/fairseq-gec/fairseq/tasks/fairseq_task.py", line 243, in inference_step
return generator.generate(models, sample, prefix_tokens=prefix_tokens)
File "/data02/jeiyoon_park/anaconda3/envs/ydfu/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/data02/jeiyoon_park/fairseq-gec/fairseq/sequence_generator.py", line 382, in generate
scores.view(bsz, beam_size, -1)[:, :, :step],
File "/data02/jeiyoon_park/fairseq-gec/fairseq/search.py", line 83, in step
torch.div(self.indices_buf, vocab_size, out=self.beams_buf)
RuntimeError: result type Float can't be cast to the desired output type Long
I checked the generated "outputema_last.nbest.txt" file and found these logs:
Label file not found: /data02/jeiyoon_park/fairseq-gec/ht_result/test.label.src.txt
Label file not found: /data02/jeiyoon_park/fairseq-gec/ht_result/test.label.tgt.txt
I just want to run inference using your pretrained model and my dataset, not train a model.
Could you let me know how to run the inference?
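In the meantime, this is the workaround I am experimenting with for the torch.div error itself (my own guess for newer PyTorch versions, not an official fix from the authors); the equivalent change would go around line 83 of fairseq/search.py:

```python
import torch

# beams_buf is a LongTensor, and recent torch.div performs true division, so it
# cannot write its float result into a Long `out` buffer.
indices_buf = torch.tensor([5, 17, 42])        # stand-in for self.indices_buf
beams_buf = torch.empty(3, dtype=torch.long)   # stand-in for self.beams_buf
vocab_size = 10

# torch.div(indices_buf, vocab_size, out=beams_buf)  # RuntimeError on recent PyTorch
torch.div(indices_buf, vocab_size, rounding_mode='floor', out=beams_buf)  # PyTorch >= 1.8
# or equivalently: beams_buf = indices_buf // vocab_size
print(beams_buf)  # tensor([0, 1, 4])
```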
Thank you!
I downloaded the files from your Google Drive but I cannot find the raw training data. Could you provide your preprocessing method in detail?
Hello :)
I got this error when I tried to run train.sh. How can I fix this issue?
`aiman@ta:~/fairseq-gec-master$ bash train.sh 0 aiman
fatal: not a git repository (or any of the parent directories): .git
GIT: unknown unknown
2021-12-01 20:14:16
--------------------------------------------------------------------------------
Namespace(adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer', attention_dropout=0.2, bucket_cap_mb=25, clip_norm=2.0, copy_attention=True, copy_attention_dropout=0.2, copy_attention_heads=1, copy_ext_dict=False, cpu=False, criterion='cross_entropy', curriculum=0, data=['out/data_bin'], ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.2, ema_decay=0.9999, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.001], lr_period_updates=73328.0, lr_scheduler='triangular', lr_shrink=0.95, max_epoch=9, max_lr=0.004, max_sentences=64, max_sentences_valid=64, max_source_positions=1024, max_target_positions=1024, max_tokens=3000, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-05, momentum=0.99, no_ema=False, no_epoch_checkpoints=False, no_progress_bar=True, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='nag', optimizer_overrides='{}', positive_label_weight=1.2, pretrained_model='./out/models_pretrain/checkpoint9.pt', raw_text=False, relu_dropout=0.2, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='out/modelsaiman', save_interval=1, save_interval_updates=0, seed=4321, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, shrink_min=True, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, weight_decay=0.0)
Traceback (most recent call last):
File "train.py", line 435, in <module>
cli_main()
File "train.py", line 431, in cli_main
main(args)
File "train.py", line 42, in main
task = tasks.setup_task(args)
File "/home/aiman/fairseq-gec-master/fairseq/tasks/__init__.py", line 19, in setup_task
return TASK_REGISTRY[args.task].setup_task(args)
File "/home/aiman/fairseq-gec-master/fairseq/tasks/translation.py", line 112, in setup_task
raise Exception('Could not infer language pair, please provide it explicitly')
Exception: Could not infer language pair, please provide it explicitly`
Kind regards
Aiman Solyman
Can I do the evaluations without any training? If so, where can I find the script g.sh? Or do I have to train it with the pre-trained model first and then it will generate a g.sh script? Or do I have to create a g.sh script myself by calling some of the scripts? Thank you for the gracious help!
Hi, thank you for your work. Could you link the final model to use at inference time for comparison?
This code provides a test set with 1331 examples. However, I think it should be 1312.
I want to use the pretrained model checkpoint9.pt directly to process new data. How should I organize the code? (The methods I found on Baidu did not work, so I am asking you for suggestions, e.g., about the torch.load() function.)
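For context, this is as far as I got on my own, inspecting the checkpoint with torch.load (a sketch of what I tried, assuming the usual fairseq checkpoint layout with the weights stored under the 'model' key):

```python
import torch

# Peek inside the pretrained checkpoint (assumes the standard fairseq layout,
# where the parameters live under the 'model' key of the saved dict).
ckpt_path = './out/models_pretrain/checkpoint9.pt'   # where I placed checkpoint9.pt

state = torch.load(ckpt_path, map_location='cpu')
print(list(state.keys()))             # top-level keys of the checkpoint dict
model_state = state['model']          # parameter name -> tensor
for name, tensor in list(model_state.items())[:5]:
    print(name, tuple(tensor.shape))
```

But I am not sure how to go from this state dict to actually correcting new sentences; is generate.sh or interactive.sh the intended entry point?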
Hello,
thanks for sharing the repository. What are the steps to follow to train this model on the GEC task for languages other than English (German, Italian, French, etc.)?
Thank you for your excellent work on the GEC task. I saw you mentioned the token-level labeling task, but I didn't find it in this code. Could you give me more details about how to combine the token-level labeling task with the present work? Thank you again for your amazing work.
According to README.md, to evaluate on the CoNLL-2014 test set, one should run the following script:
sh g.sh \${device_id} \${experiment_name}
But where is "g.sh"? I can't find it in this repository.
If I'm not wrong, the readme only mentions training one single model. How do I go about training the 4 models mentioned in the results table of your paper?
I have tried this code without pre-training and the performance is as follows:
P: 0.6580 R: 0.2570 F0.5: 0.5015
However, I find that the paper reports this performance should be about 54 (Transformer + Copy).
In the paper, in Table 5 (Single Model Ablation Study), it is mentioned that the F-score for the copy-augmented transformer architecture with the denoising auto-encoder (pre-trained) is 58.80.
Do the train.sh script, the pre-trained model file checkpoint9.pt, and the data provided in the download link replicate the result of 58.8 F-score with the 9 epochs mentioned in the train.sh script? Thanks.
I see that you generated 9 noised versions of the one-billion-word data. Do you use all of these data for pre-training? Also, the paper says "We set Λ = 3 when we train the denoising auto-encoder, and set Λ = [1, 1.8] when we train GEC models", but in the code Λ = 1.3. Which setting has better performance?
I found two fields in the training data, source_label and target_label. Each consists of 0s and 1s and has a length equal to src_tokens and target, respectively.
Can you please explain the meaning of these two fields? How are they used during training?
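To make the question concrete, this is how I am currently reading the two fields (a toy example of my guess, which is exactly what I would like to have confirmed):

```python
# A guess: one 0/1 flag per token, aligned with src_tokens and target respectively,
# where 1 marks a token involved in an edit.
src_tokens   = ["She", "go",   "to", "school", "yesterday", "."]
source_label = [  0,     1,     0,     0,        0,          0 ]

target       = ["She", "went", "to", "school", "yesterday", "."]
target_label = [  0,     1,     0,     0,        0,          0 ]

assert len(source_label) == len(src_tokens)
assert len(target_label) == len(target)
```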
Hi,
Question description:
states = torch.load(args.pretrained_model)['model'] --> error line
Error message:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
Please give me a hand.
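For what it's worth, mapping the checkpoint onto the CPU first seems like it should avoid touching the busy GPUs (my own workaround attempt, not a verified fix):

```python
import torch

# Map the checkpoint to CPU so torch.load does not allocate memory on the
# busy/unavailable CUDA devices; the tensors can be moved to a free GPU later.
pretrained_model = './out/models_pretrain/checkpoint9.pt'   # same file as args.pretrained_model

states = torch.load(pretrained_model, map_location='cpu')['model']
```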
I didn't find the multi-task learning in the code. I wonder whether that code will be released later?
I'd appreciate it if you could share the fairseq commit id from which you initialized your code, so that I can keep a clean commit history relative to the original fairseq repository.
Hello,
PyTorch implements inverted dropout, which scales up the surviving values. This means that the attention weights in fairseq's MultiheadAttention module can go over 1, since dropout is applied after the softmax.
Since the model uses the attention weights from the copy module to compute the final probability distribution, p_t(w), values over 1 are also possible there. In turn, this makes it possible for the cross-entropy loss to be negative: cross entropy expects a probability distribution, which is not what is given during the training of this model.
Is this something that you considered?
I've noticed that no negative loss occurs using your training data, which surprises me. Maybe there is something I'm not seeing in the way you compute the loss; it seems to be related to the labels made from the train.forward file. However, using my own training data (and your code), it does happen.
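Here is a tiny numeric illustration of what I mean (my own sketch, not the repository's code):

```python
import torch

# Inverted dropout rescales surviving entries by 1/(1-p), so a softmax output
# that summed to 1 can end up with individual entries above 1 after dropout,
# and a "cross entropy" computed from such values can be negative.
probs = torch.tensor([0.7, 0.2, 0.1])    # a valid probability distribution
p = 0.5
mask = torch.tensor([1.0, 0.0, 1.0])     # one possible dropout mask
dropped = probs * mask / (1 - p)         # what torch.nn.functional.dropout does in training
print(dropped)                           # tensor([1.4000, 0.0000, 0.2000]) -- an entry > 1
print(-torch.log(dropped[0]))            # about -0.3365, a negative "loss" term
```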
Thank you.