grammatical / magec-wnut2019

Models and training scripts for the English, German and Russian MAGEC systems described in R. Grundkiewicz, M. Junczys-Dowmunt: Minimally-Augmented Grammatical Error Correction, W-NUT 2019.

Languages: Shell 40.88%, Python 47.19%, Makefile 11.93%

magec-wnut2019's People

Contributors: snukky

Forkers: saiprasanth385

magec-wnut2019's Issues

Word weighting

Hi,
I'm in the process of training my own German model with your project. I managed to create most of the files needed, but I wondered how you do the word weighting for the w2 file. What exactly is the process behind that?
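A plausible way to produce such per-token weight files is to diff each source/target sentence pair and give edited target tokens a higher weight than unchanged ones. This is only a sketch of that idea, not the repository's actual script: the diff method (difflib matching blocks) and the weight value 3.0 are assumptions.

```python
import difflib

def token_weights(src_tokens, trg_tokens, edit_weight=3.0):
    """Assign one weight per target token: 1.0 for tokens inside a span
    that matches the source unchanged, edit_weight for edited tokens.
    The edit_weight value is a placeholder, not the paper's setting."""
    weights = [edit_weight] * len(trg_tokens)
    sm = difflib.SequenceMatcher(a=src_tokens, b=trg_tokens)
    for block in sm.get_matching_blocks():
        # Tokens covered by a matching block are unchanged -> weight 1.0
        for j in range(block.b, block.b + block.size):
            weights[j] = 1.0
    return weights

src = "he go to school yesterday".split()
trg = "he went to school yesterday".split()
print(token_weights(src, trg))  # -> [1.0, 3.0, 1.0, 1.0, 1.0]
```

Writing one such weight line per target sentence would yield a file in the per-token weight format that Marian accepts for weighted training.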

Question regarding validation script during fine-tuning

As described in the paper, I completed pre-training of the low-resource model on 100 million sentences using Marian (without edit weighting) and got an F-score of 25 on the WI-LOCNESS dev set (the experiments in the paper achieved around 26.7). While fine-tuning (with edit weighting), I observe that the validation score for the translation metric reported by the validation script starts at around 4 and increases only slowly during training on the WI-LOCNESS dev set, whereas the same model.npz, when used for inference with marian-decoder on the WI dev set via the command:
cat devset.err | ../../marian-dev/build/marian-decoder -c config.yml -d 1 --quiet-translation -o $OP_FILE
yields an F-score of 25-27. The config can be found here: config.log

  • I observe that the F-scores from the translation metric obtained via validate.sh and those from running inference with the marian-decoder command are in entirely different ranges. Why do the scores vary? Is this expected?
  • I checked the outputs generated by the model (before and after remove_repetitions.py) while using the validation script in devset.out, and they seem to differ from those generated with marian-decoder. Would using the marian-decoder command above in the validation script help me reach the 38 F-score on the WI-LOCNESS dev set after fine-tuning, as achieved in the paper?
  • Any other suggestions for fine-tuning the model? Thanks in advance.
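For context on the scores being compared above: GEC evaluation conventionally reports F0.5, which weights precision twice as much as recall. The function below is just the standard F-beta formula; the TP/FP/FN counts in the example are made up for illustration, and the actual discrepancy between validate.sh and marian-decoder would come from how each pipeline produces and matches edits, not from the formula itself.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Standard F-beta score from true-positive, false-positive and
    false-negative edit counts. beta=0.5 favours precision, as is
    conventional in GEC evaluation (M2 / ERRANT scorers)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative counts only: 30 correct edits, 20 spurious, 50 missed.
print(round(100 * f_beta(30, 20, 50), 1))  # -> 53.6
```

Because precision dominates, even a small change in how outputs are post-processed (e.g. remove_repetitions.py) can shift F0.5 noticeably between two otherwise similar decoding setups.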

Handling of Special Characters

Hi,

I've noticed that the performance of your German model isn't very good when it comes to special characters such as & or * in text. (I haven't tested the other languages.)
I've also fine-tuned the model with my own examples (about 300,000 lines) that include samples with special characters. Still, the transformer was not really able to handle them. In most cases the special character just gets turned into a letter or word. Sometimes it gets a bit crazier and the transformer produces a random long sequence of letters or word repetitions from the special characters.

Is that a behaviour you have observed as well, and do you think it could be solved with more examples?
I know that might be hard to answer with certainty, but I just wanted to know your opinion about it.
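One workaround sometimes used for symbols a model tends to corrupt is to shield them: replace them with placeholder tokens before decoding and restore them afterwards. The sketch below illustrates that idea only; the SPECIALS mapping and placeholder format are assumptions and not part of the MAGEC pipeline.

```python
import re

# Hypothetical mapping from fragile characters to placeholder tokens.
SPECIALS = {"&": "<amp>", "*": "<ast>"}

def mask(text):
    """Replace special characters with placeholders before decoding."""
    for ch, tok in SPECIALS.items():
        text = text.replace(ch, f" {tok} ")
    return re.sub(r"\s+", " ", text).strip()

def unmask(text):
    """Restore the original characters after decoding."""
    for ch, tok in SPECIALS.items():
        text = text.replace(tok, ch)
    return re.sub(r"\s+", " ", text).strip()

s = "Tom * Jerry & Co."
masked = mask(s)            # "Tom <ast> Jerry <amp> Co."
assert unmask(masked) == s  # round-trips cleanly
```

This only helps if the placeholder tokens survive decoding unchanged, which would need to be verified for the vocabulary actually used; retraining with the placeholders present in the data is the safer variant.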

Unable to download models

Following the instructions

cd models/de
bash download.sh

I'm unable to download the model file. Could you please fix this?
