
awesome-align's People

Contributors

bitpogo, deckname, jgcb00, zdou0830, znculee


awesome-align's Issues

Can I use the XLM-R model?

Hi
Thank you, awesome-align team, for the nice tool and the demo. I am trying to use XLM-R with your tool instead of M-BERT. I get an error with this line:
out_src = model(ids_src.unsqueeze(0), output_hidden_states=True)[2][align_layer][0, 1:-1]

My guess is that the hidden states and model outputs for XLM-R are different. I think the index for the word embeddings in XLM-R is [-1]. Could you please give me some direction on how to solve this, if it is possible?

Thanks
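For reference, here is a minimal sketch of how the relevant hidden states can be pulled out of XLM-R with the Hugging Face transformers API (the bundled awesome_align modeling code may index its outputs differently, and the model name and align_layer below are only placeholders):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

align_layer = 8  # placeholder; the best layer for XLM-R may differ from mBERT's
ids_src = tokenizer("awesome-align is awesome !", return_tensors="pt")["input_ids"]

with torch.no_grad():
    outputs = model(ids_src, output_hidden_states=True)
    # hidden_states is a tuple with the embedding layer first and one entry per
    # transformer layer; addressing it by name avoids guessing a positional
    # index such as [2] or [-1].
    out_src = outputs.hidden_states[align_layer][0, 1:-1]  # drop <s> and </s>

print(out_src.shape)  # (num_subword_tokens, hidden_size)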

Packaging

Hey,

First of all, thanks for the awesome work on awesome-align. I am currently working on a small annotation-projection project which makes use of awesome-align. However, it would be great if you at least scoped the auxiliary files (like modeling.py) into a dedicated package folder such as awesomealign. Thanks!

pip packaging?

With the new setup.py (#7), it might be nice to have awesome-align installable through pip as well.

Index out of range

I'm trying to fine-tune on a subtitles corpus. I keep getting a 'list index out of range' error on (b2w_src[i], b2w_tgt[j]), even on some of the examples in this repo.

My command line:

! cd awesome-align && CUDA_VISIBLE_DEVICES=0 python run_align.py \
    --output_file=../eval2.txt \
    --model_name_or_path=../my-custom-model \
    --data_file=examples/enfr.src-tgt \
    --extraction 'softmax' \
    --batch_size 32 \

The error:

Loading the dataset...
Extracting:   0% 0/447 [00:00<?, ?it/s]Traceback (most recent call last):
  File "run_align.py", line 194, in <module>
    main()
  File "run_align.py", line 191, in main
    word_align(args, model, tokenizer)
  File "run_align.py", line 101, in word_align
    word_aligns_list = model.get_aligned_word(ids_src, ids_tgt, bpe2word_map_src, bpe2word_map_tgt, args.device, 0, 0, align_layer=args.align_layer, extraction=args.extraction, softmax_threshold=args.softmax_threshold, test=True)
  File "/content/awesome-align/modeling.py", line 680, in get_aligned_word
    aligns.add( (b2w_src[i], b2w_tgt[j]) )
IndexError: list index out of range
Extracting:   0% 0/447 [00:00<?, ?it/s]

I tried shortening the arrays, but there always seems to be some issue.
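For what it's worth, a quick sanity check (not part of awesome-align; the path below is a placeholder) can narrow down which input lines tend to trigger this kind of IndexError, e.g. lines without a proper ' ||| ' separator or with an empty side:

data_file = "examples/enfr.src-tgt"  # placeholder path

with open(data_file, encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        parts = line.strip().split(" ||| ")
        if len(parts) != 2:
            print(f"line {lineno}: missing or malformed ' ||| ' separator")
            continue
        src, tgt = parts
        if not src.split() or not tgt.split():
            print(f"line {lineno}: empty source or target side")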

Continue with checkpoint

Hi, I was using the script for supervised training. My dataset is fairly large, and some checkpoints were saved during training. However, the training was not completed, and I was wondering how I can use the checkpoints to continue the training.

I tried setting the checkpoint as the model name, but it seems like the training starts from the beginning. Here is the script I used:

python3 run_train.py \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=checkpoint-12000 \
    --extraction 'softmax' \
    --do_train \
    --train_so \
    --train_data_file=$TRAIN_FILE \
    --train_gold_file=$TRAIN_GOLD_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 5 \
    --learning_rate 1e-4 \
    --save_steps 2000

Trying to train "ixa-ehu/ixambert-base-cased" model

Hi!

You have done a great job!! I have been training two different models: the one mentioned in the title ("ixa-ehu/ixambert-base-cased") and multilingual BERT (cased). With mBERT I didn't have any problems with training; however, when I try to train the other model it reports a shape mismatch related to the vocabulary size.

In the config file of the "ixa-ehu/ixambert-base-cased" model, the vocabulary size is as follows:
08/18/2022 09:41:28 - INFO - awesome_align.configuration_utils - Model config BertConfig {
"architectures": null,
"attention_probs_dropout_prob": 0.1,
"bos_token_id": null,
"do_sample": false,
"eos_token_ids": null,
"finetuning_task": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 2,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": null,
"repetition_penalty": 1.0,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 119099
}

When I begin training I get this error:
Iteration: 0%| | 0/40000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/mnt/datuak/virtualenvs/transformers/bin/awesome-train", line 8, in <module>
    sys.exit(main())
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/awesome_align/run_train.py", line 848, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/awesome_align/run_train.py", line 370, in train
    loss = model(inputs_src=inputs_src, labels_src=labels_src)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/awesome_align/modeling.py", line 660, in forward
    masked_lm_loss = loss_fct(prediction_scores_src.view(-1, self.config.vocab_size), labels_src.view(-1))
RuntimeError: shape '[-1, 119101]' is invalid for input of size 5716752

As you can see, the vocab_size has increased by 2, from 119099 to 119101. This is due to the CLS and SEP tokens; however, I don't know why I get this error. I have tried to manually decrease the vocab_size in the code, but this leads to some other errors when I extract the alignments.

I leave you here the awesome-train command I have used for training:
CUDA_VISIBLE_DEVICES=1 awesome-train \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=ixa-ehu/ixambert-base-cased \
    --extraction 'softmax' \
    --do_train \
    --train_mlm \
    --train_tlm \
    --train_tlm_full \
    --train_so \
    --train_psi \
    --train_co \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 10000 \
    --max_steps 40000

Could you please help me solve this issue?

Thanks!
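A small check (an assumption-laden sketch, not a fix) that can confirm where the 119099 vs. 119101 mismatch comes from is to compare the tokenizer's vocabulary size with the vocab_size stored in the model config, using the standard Hugging Face transformers API:

from transformers import AutoConfig, AutoTokenizer

name = "ixa-ehu/ixambert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
config = AutoConfig.from_pretrained(name)

print("tokenizer vocab:", len(tokenizer))       # counts any extra special tokens
print("config vocab_size:", config.vocab_size)  # what the checkpoint's embedding matrix expects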

About the experiment in Table 8

Hello, I noticed that you compared your results with GIZA++/fast_align in the annotation projection experiment. How did you get the parallel English-Spanish sentences to train these models? Did you use Google Translate to translate the English training dataset into Spanish?

Not able to reproduce the AERs in the table

Hi, I'm new to this toolkit and trying to run a simple test with your pretrained model from the table (the last row: "Ours (multilingually fine-tuned w/ --train_co, softmax)"). I use this model as follows:
python tools/aer.py examples/roen.gold examples/roen.awesome-align.out
The result I got is: examples/roen.awesome-align.out: 59.5% (45.3%/36.6%/5014), F-Measure: 0.405.
For the other language pairs, I got En-Fr -> 42.9%, Ja-En -> 81.3%, Zh-En -> 69.0%, which are all much worse compared to the numbers reported in your table.
I also tried running the awesome-align command with the *.src-tgt data under "examples/" and the same model, as described in the README, and the results are similar to (or slightly worse than) the ones shown above.
Could you let me know what the issue might be?

Repeated single-sentence inferences on an in-memory model?

Ideally I'd like to keep the model in memory and call it with something approaching the syntax used by SimAlign:

myaligner = SentenceAligner(model="model_path", token_type="bpe", **model_parameters)

# ... and later ...

while True:
    alignments = myaligner.get_word_aligns(src_sentence_as_list_of_strings, trg_sentence_as_list_of_strings)
    # ... wait until next request comes in ...

Is there a way to do this? The use case is one where a user requests alignments from a GUI, so they can't be pre-computed in a batch.
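One way to get this kind of interface is to wrap the approach from the project's Colab demo (mBERT hidden states at layer 8, bidirectional softmax with a probability threshold) in a small class that keeps the model loaded. The class below is a sketch, not an official awesome-align API; the InMemoryAligner name and the defaults are my own:

import itertools
import torch
from transformers import BertModel, BertTokenizer

class InMemoryAligner:
    """Keeps the model in memory; one call per sentence pair."""

    def __init__(self, model_name="bert-base-multilingual-cased",
                 align_layer=8, threshold=1e-3):
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertModel.from_pretrained(model_name)
        self.model.eval()
        self.align_layer = align_layer
        self.threshold = threshold

    def _encode(self, words):
        # subword-tokenize each word and remember which word each subword belongs to
        subtokens = [self.tokenizer.tokenize(w) for w in words]
        sub2word = [i for i, toks in enumerate(subtokens) for _ in toks]
        ids = [self.tokenizer.convert_tokens_to_ids(t) for t in subtokens]
        ids = self.tokenizer.prepare_for_model(
            list(itertools.chain(*ids)), return_tensors="pt",
            truncation=True, max_length=self.tokenizer.model_max_length)["input_ids"]
        return ids, sub2word

    def get_word_aligns(self, src_words, tgt_words):
        ids_src, s2w_src = self._encode(src_words)
        ids_tgt, s2w_tgt = self._encode(tgt_words)
        with torch.no_grad():
            out_src = self.model(ids_src.unsqueeze(0), output_hidden_states=True
                                 ).hidden_states[self.align_layer][0, 1:-1]
            out_tgt = self.model(ids_tgt.unsqueeze(0), output_hidden_states=True
                                 ).hidden_states[self.align_layer][0, 1:-1]
            dot = torch.matmul(out_src, out_tgt.transpose(-1, -2))
            prob_st = torch.nn.functional.softmax(dot, dim=-1)
            prob_ts = torch.nn.functional.softmax(dot, dim=-2)
            inter = (prob_st > self.threshold) * (prob_ts > self.threshold)
        return sorted({(s2w_src[i], s2w_tgt[j])
                       for i, j in torch.nonzero(inter, as_tuple=False).tolist()})

# usage: load once, then call repeatedly as requests come in
aligner = InMemoryAligner()
print(aligner.get_word_aligns("awesome-align is awesome !".split(),
                              "牛对齐 是 牛 !".split()))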

No caching

I am trying to run the script as suggested in the README. I have started it a couple of times, and every time the model needs to be downloaded again and the features need to be recreated. There is no caching at all.

I like this library and its potential (if the results prove reproducible), but the implementation is not very robust and feels a bit hacky and untested. This is often the case in research projects (mine included), but if the intent is to open-source it and have people use it, it should probably be more robust. I am sure that if you get in touch with the people over at transformers, they can help with better integration in the library and maybe even add the architecture to the library itself! You can tag me there.

De-En dataset is missing

Hello,

I could not find the De-En dataset at the provided link. It is not in the examples folder either. Where can I find this dataset?

Thank you

Torch.save() for large training Dataset

Hello,

I know this is not directly related to awesome-align, but I have a large training set of 10M source/target pairs and it takes 4 hours to process them before the actual model training even starts. So I used the --cache_data option, but when the code reaches torch.save(self.examples, cache_fn) it just takes forever and uses all my RAM (60 GB is not enough on a VM; on my 32 GB PC it swaps to HDD and then takes forever to complete). Do you know of an alternative way to save the dataset and reuse it before training, to skip the preprocessing overhead?

Many thanks
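One generic workaround (not an awesome-align feature; the paths and shard size below are placeholders) is to split the big "src ||| tgt" file into shards, so that each shard can be preprocessed, cached and trained on separately instead of serializing 10M examples in a single torch.save call:

def shard_parallel_file(path, out_prefix, lines_per_shard=1_000_000):
    """Write path's lines into out_prefix.000, out_prefix.001, ... shards."""
    shard, count, out = 0, 0, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if out is None or count >= lines_per_shard:
                if out is not None:
                    out.close()
                out = open(f"{out_prefix}.{shard:03d}", "w", encoding="utf-8")
                shard, count = shard + 1, 0
            out.write(line)
            count += 1
    if out is not None:
        out.close()

shard_parallel_file("train.src-tgt", "train.src-tgt.shard")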

Wrong mapping with non-matching sentences

Hi awesome-align team,

First, thanks for the great tool. It has really great potential.

I am following your Colab demo, and I tried to align English to Arabic.
Here are the 2 sentences:

src = 'I will meet you there. It is a very cool weather today.'
tgt = 'سوف أقابلك هناك.'

The Arabic sentence matches the first English sentence in src, i.e. "I will meet you there".
The second sentence in src "It is a very cool weather today." doesn't exist in Arabic.

When I run the code, I get a very strange result, and I am not sure where the culprit is.

This is what I get:

I===سوف
I===أقابلك
will===أقابلك
meet===أقابلك
you===أقابلك
there.===هناك.
today.===هناك.

For some reason, most of the second English sentence is not showing up, and there are also wrong mappings; for example, "today." is wrongly mapped to "هناك." ("there").

If I remove the second sentence in src, the result looks really good.

I want to use Awesome-Align to detect non-matching strings in a bilingual dataset, so I can exclude the wrong and non-aligned sentences.

Is there a way to add alignment scores, so it is easy to filter out badly aligned sentence pairs?

Also, is there a way to visualize the mapping? Something similar to SimAlign mapping.

After all, it could be that Awesome-Align is not designed for my purpose, but I hope you consider this idea in a future release.

Thanks in advance for your support, and thanks for the awesome tool :-)
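There is no built-in sentence-level score that I know of, but a rough heuristic can be computed from the same probability matrices the Colab demo already produces (softmax_srctgt and softmax_tgtsrc): measure how much of each sentence is covered by confident bidirectional alignments, and filter out pairs whose coverage is low. A sketch, with the threshold as a placeholder:

import torch

def coverage_score(prob_srctgt, prob_tgtsrc, threshold=1e-3):
    # Fraction of source and target (subword) positions that take part in at
    # least one bidirectional alignment; low values suggest the two sides do
    # not really match. A heuristic, not part of awesome-align.
    inter = (prob_srctgt > threshold) * (prob_tgtsrc > threshold)
    src_covered = inter.any(dim=-1).float().mean()
    tgt_covered = inter.any(dim=-2).float().mean()
    return float((src_covered + tgt_covered) / 2)

# e.g. coverage_score(softmax_srctgt, softmax_tgtsrc) with the demo's matrices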

Can I evaluate separately?

Hi,

I got an AttributeError: 'DataParallel' object has no attribute 'get_aligned_word' while using multiple GPUs to fine-tune the model. I noticed that this error occurs at the beginning of evaluation, and even though the evaluation failed, a trained model was still created. My question is: can I evaluate this newly created model separately?

I was thinking something like below. Am I missing anything?
./awesome-train --output_dir=path/to/output --model_name_or_path=trained-model-with-failed-evaluation --do_eval --eval_data_file=$EVAL_FILE

Thanks!

Is it possible to incorporate POS tag info to aid alignment?

Hello and many thanks for sharing the project

I have an open question/discussion: would it be possible to incorporate the POS information of each token during training? For example, by having a new loss function that tries to minimize POS tag mismatching from source to target token. This comes from the idea that if a source token is a Noun in the source language, it will most likely also be a Noun in the target language. Same would go for Verbs etc. or other high-level POS tags. What are your thoughts on this?

Thank you
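awesome-align does not implement anything like this, but as a thought experiment, a POS-mismatch penalty could look like the toy sketch below: given a source-to-target alignment probability matrix and integer POS tag ids for both sides, it measures how much probability mass lands on target words with a different tag. All names and numbers here are illustrative only:

import torch

def pos_mismatch_penalty(prob_srctgt, pos_src, pos_tgt):
    # prob_srctgt: [S, T] alignment probabilities (each source row sums to 1)
    # pos_src:     [S] integer POS tag ids for the source words
    # pos_tgt:     [T] integer POS tag ids for the target words
    mismatch = (pos_src.unsqueeze(1) != pos_tgt.unsqueeze(0)).float()  # [S, T]
    return (prob_srctgt * mismatch).sum(dim=-1).mean()

# toy example: 3 source words, 2 target words, NOUN=0, VERB=1
probs = torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
print(pos_mismatch_penalty(probs, torch.tensor([0, 1, 0]), torch.tensor([0, 1])))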

Multiple GPUs

When using multiple GPUs, the train script does not work.

AttributeError: 'DataParallel' object has no attribute 'get_aligned_word'
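This is the generic PyTorch behaviour: DataParallel only forwards __call__/forward, so custom methods such as get_aligned_word live on the wrapped module and have to be reached through .module. Whether unwrapping like this is the only change the training script needs is an assumption:

import torch

def unwrap(model):
    # return the underlying model when it is wrapped in DataParallel
    return model.module if isinstance(model, torch.nn.DataParallel) else model

# e.g. inside the evaluation loop:
# word_aligns = unwrap(model).get_aligned_word(ids_src, ids_tgt, ...)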

problem with mask_token

I'm trying to fine-tune on parallel data, and I get an error in the mask_token function in run_train.py.

The error message is: Mask tensor can only take 0 and 1 values. (Line 146).

Could you offer some advice on how to fix this?

Alignment prob matrix visualization tool?

Hi,

Thank you so much for the great alignment tool! I was wondering whether it would be possible for you to share the tools/code you use to produce an alignment probability matrix like the one in the picture below?

[attached image: alignment probability matrix]

Thanks!
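I don't know what the authors used for the figure, but a minimal matplotlib sketch (the function name and styling are my own) can plot an alignment probability matrix, such as the demo's softmax_srctgt, as a heatmap:

import matplotlib.pyplot as plt
import torch

def plot_alignment(prob, src_tokens, tgt_tokens):
    # prob: [len(src_tokens), len(tgt_tokens)] tensor or array of probabilities
    prob = prob.detach().cpu().numpy() if torch.is_tensor(prob) else prob
    fig, ax = plt.subplots()
    im = ax.imshow(prob, cmap="Blues")
    ax.set_xticks(range(len(tgt_tokens)))
    ax.set_xticklabels(tgt_tokens, rotation=90)
    ax.set_yticks(range(len(src_tokens)))
    ax.set_yticklabels(src_tokens)
    fig.colorbar(im, ax=ax)
    plt.tight_layout()
    plt.show()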

Some questions

A quick question: can the already fine-tuned Chinese-English model do word alignment in the English-Chinese direction, or does it need to be re-fine-tuned (unsupervised) for the English-Chinese direction?

Training is increasing my AER

I have been training bert-base-cased with WMT data for De-En, Fr-En, Es-En, Ja-En, Ro-En and Zh-En, and my AER has increased:
enfr_output: 14.5% (87.3%/82.9%/5906)
jaen_output: 72.1% (49.3%/19.5%/5358)
zhen_output: 72.1% (70.0%/17.4%/2802)
roen_output: 34.5% (78.8%/56.1%/4409)

From the base model:
enfr_output: 5.6% (94.7%/94.0%/6015)
jaen_output: 45.6% (67.9%/45.4%/9079)
zhen_output: 17.9% (82.8%/81.3%/11173)
roen_output: 27.9% (85.2%/62.5%/4548)

Can anyone help me figure out where it went wrong?

AttributeError: Can't pickle local object 'word_align.<locals>.collate'

Hi,

When I try to extract the alignments using the command given in the readme here, I get the following error:

Loading the dataset...
Extracting: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "C:\Users\I355109\Anaconda3\Scripts\awesome-align-script.py", line 33, in <module>
    sys.exit(load_entry_point('awesome-align==0.1.6', 'console_scripts', 'awesome-align')())
  File "C:\Users\I355109\Anaconda3\lib\site-packages\awesome_align-0.1.6-py3.8.egg\awesome_align\run_align.py", line 294, in main
  File "C:\Users\I355109\Anaconda3\lib\site-packages\awesome_align-0.1.6-py3.8.egg\awesome_align\run_align.py", line 171, in word_align
  File "C:\Users\I355109\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 352, in __iter__
    return self._get_iterator()
  File "C:\Users\I355109\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 294, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\I355109\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 801, in __init__
    w.start()
  File "C:\Users\I355109\Anaconda3\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\I355109\Anaconda3\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\I355109\Anaconda3\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\I355109\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\I355109\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'word_align.<locals>.collate'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\I355109\Anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\I355109\Anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
Extracting: 0it [00:01, ?it/s]

Can you please help me figure out if I am doing anything wrong?
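For context, the traceback comes from Windows' spawn start method: every DataLoader worker has to pickle its collate function, and a closure defined inside word_align (word_align.<locals>.collate) cannot be pickled. Below is a generic illustration of the patterns that avoid this (not a patch to run_align.py; whether the script exposes a worker-count option is not something I can confirm):

import torch
from torch.utils.data import DataLoader, Dataset

class PairDataset(Dataset):
    def __init__(self, pairs): self.pairs = pairs
    def __len__(self): return len(self.pairs)
    def __getitem__(self, idx): return self.pairs[idx]

def collate(batch):
    # module-level function, so Windows' spawn start method can pickle it
    return list(batch)

if __name__ == "__main__":
    ds = PairDataset([("a cat", "un chat"), ("a dog", "un chien")])
    # option 1: keep workers but use a picklable, module-level collate_fn
    # option 2: num_workers=0, so no worker subprocess (and no pickling) at all
    loader = DataLoader(ds, batch_size=2, num_workers=0, collate_fn=collate)
    for batch in loader:
        print(batch)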

How to test fine tuned model on parallel data?

Hi, I have collected about 2.75 million parallel Bengali-English sentence pairs. I have already trained on 1000 sentences for testing purposes, but I don't understand how I can test my trained model. How can I run the demo with my trained model? How can I load my trained model?
It would be a great help for me if anyone can help me with this issue.
Thanks in advance for the help.

Should inputs be tokenized only for training/evaluation sets?

Hello,

Your README states:

Inputs should be tokenized and each line is a source language sentence and its target language translation, separated by (|||). You can see some examples in the examples folder.

Is this the case only for the training set and the optional evaluation set? During inference/prediction, do we also need to pass the source/target pair, pretokenized? Your demo uses the example:

src = 'awesome-align is awesome !'
tgt = '牛对齐 是 牛 !'

where the ! is pretokenized, as there is a space between it and the previous word ("awesome" in this case). Also, does this requirement stem from the original mBERT or is this your implementation requirement? Thank you!

Reproduction of Ja-En

Hi, thanks a lot for providing this code.

I tried to reproduce the result Ja-En (fine-tune, bilingual) but got worse results (AER= 44.3). I used the following command:

CUDA_VISIBLE_DEVICES=0 python awesome_align/run_train.py \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_mlm \
    --train_tlm \
    --train_tlm_full \
    --train_so \
    --train_psi \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 10000

I omitted "max_steps 40000" to ensure that the model is trained for one full epoch (but the checkpoint-40000 model produced a similar result: AER 44.7). For the training data, I concatenated KFTT train, dev, tune and test (all pre-tokenized), which amounts to 444k sentences as written in the paper. As "examples/jaen.src-tgt" is lowercased, I fine-tuned mBERT on the training data both with and without lowercasing, and got similar results.

I also used the same command to reproduce the result for Zh-En, and I got results similar to those reported in the paper (AER = 15.0).

Training with gold alignment

Hi,

Thanks again for such a great alignment tool!

I was trying to train the model with gold alignments, but I keep getting the following error message:

awesome-train: error: unrecognized arguments: --train_gold_file=examples/enfr.gold

I used the Supervised Settings example command from the README. My TRAIN_FILE is formatted as tokenized src ||| tgt, and the TRAIN_GOLD_FILE contains gold alignment pairs (e.g. enfr.gold).

Am I missing anything?

Thanks!

A bug (maybe)

Hi, dear Ziyi,
I found that in your code the BERT output weights are not tied to the input embeddings, as can be seen here (in detail, the code doesn't set the weight of BertLMPredictionHead.decoder to be the same as the weight of BertModel.embeddings).
I think you may have deliberately modified the BERT LM source code this way. Why? Will it influence the final result?

Extracting dataset and AttributeError

I'm trying to run
py run_align.py --output_file=D:\MT\dataFiles\NO-BA.txt --model_name_or_path=bert-base-multilingual-cased --data_file=D:\MT\dataFiles\NO-BA-output.txt --extraction 'softmax' --batch_size 32
and getting this error:

"Loading the dataset...
Extracting: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "run_align.py", line 297, in <module>
    main()
  File "run_align.py", line 294, in main
    word_align(args, model, tokenizer)
  File "run_align.py", line 171, in word_align
    for batch in dataloader:
  File "C:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 368, in __iter__
    return self._get_iterator()
  File "C:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 314, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 927, in __init__
    w.start()
  File "C:\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python38\lib\multiprocessing\context.py", line 326, in _Popen
    return Popen(process_obj)
  File "C:\Python38\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Python38\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'word_align.<locals>.collate'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Python38\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
Extracting: 0it [00:01, ?it/s]"

I'm not sure what I am doing wrong or how I can get the script to run smoothly.
I tried a couple of solutions I found on the internet, but without success.

Error reading fine-tuned models

I am trying to load one of your multilingually fine-tuned models to use it on a new language pair.
I downloaded both the with and without --train_co models to my Google Drive and tried to launch them from Colab with:

%%bash
DATA_FILE=/content/drive/MyDrive/WordAlign/gold1/ruzh.char.src-tgt
MODEL_NAME_OR_PATH=/content/drive/MyDrive/WordAlign/models/aa_fine_model
OUTPUT_FILE=/content/drive/MyDrive/WordAlign/AA_output/gold1.char.aa_fine.phar

CUDA_VISIBLE_DEVICES=0 python /content/drive/MyDrive/WordAlign/awesome/run_align.py \
    --output_file=$OUTPUT_FILE \
    --model_name_or_path=$MODEL_NAME_OR_PATH \
    --data_file=$DATA_FILE \
    --extraction 'softmax' \
    --batch_size 32 \

I get the following error (this is for w/o --train_co; the other one produces similar output except byte and position are different in the last line):

2021-02-06 16:23:39.143584: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "/content/drive/MyDrive/WordAlign/awesome/run_align.py", line 194, in <module>
    main()
  File "/content/drive/MyDrive/WordAlign/awesome/run_align.py", line 167, in main
    config = config_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
  File "/content/drive/MyDrive/WordAlign/awesome/configuration_utils.py", line 175, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/content/drive/MyDrive/WordAlign/awesome/configuration_utils.py", line 227, in get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/content/drive/MyDrive/WordAlign/awesome/configuration_utils.py", line 313, in _dict_from_json_file
    text = reader.read()
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 52: invalid start byte

When I try to use base BERT without fine-tuning, with MODEL_NAME_OR_PATH=bert-base-multilingual-cased, everything works as expected.

Relationship between num_train_epochs and max_steps

I did training with num_train_epochs=1 and max_steps=20000. It did 1 epoch of 20k steps, all good.
Then I did training with num_train_epochs=2 and max_steps=20000. I expected it to do 2 epochs of 20k steps each, but instead it only did 20k steps total.

So if I want to train longer, should I just change max_steps to, say, 40000 and leave num_train_epochs=1? But what does num_train_epochs do then?

Trying to train on an existing model

Hi!
This is a really great tool and it's been fun using it.
I am trying to train the model 'bert-base-multilingual-uncased' using a tokenized dataset in the correct format. But every time I run the script, it loads the file and weights and promptly stops, as some weights of the pre-trained model aren't initialised.
This is the message I get:

07/21/2022 13:48:50 - WARNING - main - Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
07/21/2022 13:48:50 - INFO - awesome_align.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-config.json from cache at /Users/devi/.cache/torch/awesome-align/45629519f3117b89d89fd9c740073d8e4c1f0a70f9842476185100a8afe715d1.65df3cef028a0c91a7b059e4c404a975ebe6843c71267b67019c0e9cfa8a88f0
07/21/2022 13:48:50 - INFO - awesome_align.configuration_utils - Model config BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": null,
"directionality": "bidi",
"do_sample": false,
"eos_token_ids": null,
"finetuning_task": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 2,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"repetition_penalty": 1.0,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 119547
}

07/21/2022 13:48:51 - INFO - awesome_align.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt from cache at /Users/devi/.cache/torch/awesome-align/96435fa287fbf7e469185f1062386e05a075cadbf6838b74da22bf64b080bc32.99bcd55fc66f4f3360bc49ba472b940b8dcf223ea6a345deb969d607ca900729
07/21/2022 13:48:52 - INFO - awesome_align.modeling_utils - loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-pytorch_model.bin from cache at /Users/devi/.cache/torch/awesome-align/5b5b80054cd2c95a946a8e0ce0b93f56326dff9fbda6a6c3e02de3c91c918342.7131dcb754361639a7d5526985f880879c9bfd144b65a0bf50590bddb7de9059
07/21/2022 13:48:56 - INFO - awesome_align.modeling_utils - Weights of BertForMaskedLM not initialized from pretrained model: ['cls.predictions.decoder.bias', 'psi_cls.bias', 'psi_cls.transform.weight', 'psi_cls.transform.bias', 'psi_cls.decoder.weight', 'psi_cls.decoder.bias']
07/21/2022 13:48:56 - INFO - awesome_align.modeling_utils - Weights from pretrained model not used in BertForMaskedLM: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
07/21/2022 13:48:56 - INFO - main - Training/evaluation parameters Namespace(train_data_file=‘de-en_tmx_align.txt', output_dir='align/train_model', train_mlm=False, train_tlm=False, train_tlm_full=False, train_so=False, train_psi=False, train_co=False, train_gold_file=None, eval_gold_file=None, ignore_possible_alignments=False, gold_one_index=False, cache_data=False, align_layer=8, extraction='softmax', softmax_threshold=0.001, eval_data_file='examples/deen_param_test', should_continue=False, model_name_or_path='bert-base-multilingual-cased', mlm_probability=0.15, config_name=None, tokenizer_name=None, cache_dir=None, block_size=512, do_train=False, do_eval=False, per_gpu_train_batch_size=2, per_gpu_eval_batch_size=2, gradient_accumulation_steps=4, learning_rate=2e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=0, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, overwrite_output_dir=False, overwrite_cache=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, n_gpu=0, device=device(type='cpu'))

Please help with a solution, or let me know if I'm doing something wrong!
Thanks

A small question about the training data format

Because I have never trained a pre-trained model before, I have a small question about what the parallel-data input format for TRAIN_FILE=/path/to/train/file looks like. Do you need a separator between src and tgt? What is the format? In addition, can you fine-tune the xlm-roberta-large model?
In addition, can you fine-tune the xlm-roberta-large model?

The xlm-roberta-base directory contains these files: config.json, gitattributes, pytorch_model.bin, sentencepiece.bpe.model, tokenizer.json

This error occurred while running:

10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/vocab.txt. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/added_tokens.json. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/special_tokens_map.json. We won't load it.
10/26/2021 19:03:46 - INFO - awesome_align.tokenization_utils - Didn't find file xlm-roberta-base/tokenizer_config.json. We won't load it.

I really want to continue training the xlm-roberta-base model on bilingual data with the TLM task, and would appreciate your advice.

Setting seq-to-seq model as our pretrained model

Is it possible to load a seq-to-seq model to make word alignments with this work?
I'm stuck on getting proper out_src and out_tgt layers to work with for the next step.
I know the implementation targets mBERT only, but I'm trying to see whether it is possible to work the same way with seq-to-seq models.
If you have any hints on which direction I need to go, or code to share, please do.
This is not for a paper; I'm just curious.
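awesome-align itself targets (m)BERT-style encoders, but as a direction to explore, the encoder of a seq-to-seq model exposes the same kind of layer-wise hidden states through the Hugging Face transformers API. The model name and layer index below are just examples, not recommendations:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "Helsinki-NLP/opus-mt-en-fr"   # example seq-to-seq model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
model.eval()

align_layer = 4                        # arbitrary; would need tuning per model
batch = tokenizer("awesome-align is awesome !", return_tensors="pt")

with torch.no_grad():
    enc_out = model.get_encoder()(**batch, output_hidden_states=True)
    out_src = enc_out.hidden_states[align_layer][0]   # [seq_len, hidden_size]

print(out_src.shape)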

Training details

Hello, I'm interested in your work, and I'd like to know the training parameters for the De-En dataset. It seems too big: max_steps 40000 does not seem to use all of the De-En training data (it has about 1,900,000 pairs, equal to 237550 steps if the total batch size is 8, as the scripts show). So I'd like to know the exact parameters for the De-En training, such as batch size and the real number of training steps. Thanks!

Speed up the extraction process

Is there a way to speed up the extraction process for alignments, like a parameter to parallelize things? Right now I'm extracting alignments for a dataset of 25k samples and it takes more than 1.5 hours on CPU. If there's a way to run this on a GPU, that would also be helpful. Below is the current set of parameters I'm using:

awesome-align \
    --output_file=$align_dest/$align_fn \
    --model_name_or_path=bert-base-multilingual-cased \
    --data_file=$trans_fn \
    --extraction 'softmax' \
    --cache_dir ../cache/ \
    --batch_size 32

How can I resume training from a checkpoint?

Hi,
First of all, thanks for these amazing tools and the amazing documentation. I am training awesome-align on a parallel corpus. My dataset is quite large. Is there any way to resume training from a checkpoint? During training, many checkpoints are created in the output directory (checkpoint-4000, checkpoint-8000, etc.). How can I use those checkpoints?

Thanks in advance.

Some confusions

Hello, I have read your paper and GitHub project carefully. I find that you do not use a large Chinese-English parallel corpus in your paper, but only the annotated Chinese-English data. Is that so?

Statistics between models

Do you have any statistics on model performance comparing mBERT, model_without_co and model_with_co?
I see that you report alignment error rates comparing your models with other aligners, so I'm wondering if you have any information on how your models differ from mBERT.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Hello,

I tried to run alignments using the provided model (w/out train_co) and the example data (zhen.src-tgt), but am receiving an error as shown below:

DATA_FILE=./examples/zhen.src-tgt
MODEL_NAME_OR_PATH=./model_without_co/pytorch_model.bin
OUTPUT_FILE=./output/zhen.awesome-align.out

CUDA_VISIBLE_DEVICES=0 python3 run_align.py \
    --output_file=$OUTPUT_FILE \
    --model_name_or_path=$MODEL_NAME_OR_PATH \
    --data_file=$DATA_FILE \
    --extraction 'softmax' \
    --batch_size 32 \

Traceback (most recent call last):
  File "run_align.py", line 194, in <module>
    main()
  File "run_align.py", line 167, in main
    config = config_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
  File "/Users/xxx/Downloads/awesome-align-master/configuration_utils.py", line 175, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/Users/xxx/Downloads/awesome-align-master/configuration_utils.py", line 227, in get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/Users/xxx/Downloads/awesome-align-master/configuration_utils.py", line 313, in _dict_from_json_file
    text = reader.read()
  File "/Users/xxx/.pyenv/versions/3.7.9/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

How to get test results?

Hi, thanks for your work. I downloaded the model [Ours (multilingually fine-tuned w/o --train_co, softmax)] that you released, but the test results in the Chinese-English direction are different from those described in the paper.

I used the following command:
DATA_FILE=/share/awesome-align/examples/zhen.src-tgt
MODEL_NAME_OR_PATH=/share/awesome-align/model_without_co
OUTPUT_FILE=/share/awesome-align/out/output.src-tgt
OUTPUT_WORD_FILE=/share/awesome-align/out/output.words.src-tgt

CUDA_VISIBLE_DEVICES=0 awesome-align \
--output_file=$OUTPUT_FILE \
--model_name_or_path=$MODEL_NAME_OR_PATH \
--data_file=$DATA_FILE \
--extraction 'softmax' \
--output_word_file=$OUTPUT_WORD_FILE \
--batch_size 32

python tools/aer.py examples/zhen.gold out/output.src-tgt

The test results were as follows:
out/output.src-tgt: 69.3% (30.8%/30.6%/11385) F-Measure: 0.307

Some questions

Hi, after reading your paper and code, I have a few questions I'd like to ask.
1. Why is alignment based on XLM-R so much worse than with mBERT? On other cross-lingual tasks, such as classification, XLM-R should far outperform mBERT, but on the alignment task XLM-R performs poorly. What might the reason be?
2. I noticed that in the code, src and tgt are encoded separately. Intuitively, if src and tgt were concatenated and encoded together, wouldn't that give more interaction information? But when I tried this myself, it actually performed worse. Have you tried this approach, and why might it perform worse?
