
csebuetnlp / crosssum


This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs" published in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL’23), July 9-14, 2023.

cross-lingual-summarization cross-lingual-transfer multilingual-nlp

crosssum's Issues

Fine-tuning on HF using Seq2SeqTrainer

I need to fine-tune csebuetnlp/mT5_m2m_crossSum using the Seq2SeqTrainer on Hugging Face. What special symbols need to be inserted during tokenization of input-output pairs?
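For what it's worth, the model card's published inference snippet suggests that no special marker symbols go into the article text itself; the output language is instead selected by using the target-language token id (from the config's `langid_map`) as the decoder start token. A minimal preprocessing sketch under that assumption — the `tokenize` callable and the language-id map below are stand-in placeholders, not the real tokenizer API:

```python
def build_m2m_features(tokenize, article, summary, target_lang, langid_map,
                       max_source_len=512, max_target_len=84):
    """Build one training example for Seq2SeqTrainer.

    Assumption (hedged): the input text carries no special marker tokens;
    the output language is controlled by passing the target-language token
    id as decoder_start_token_id, mirroring the model card's generate() call.
    """
    return {
        "input_ids": tokenize(article)[:max_source_len],
        "labels": tokenize(summary)[:max_target_len],
        # Forward this id to the model as decoder_start_token_id.
        "decoder_start_token_id": langid_map[target_lang],
    }

# Toy usage with a stub whitespace "tokenizer" and a made-up language-id map.
stub_tokenize = lambda text: [hash(w) % 1000 for w in text.split()]
features = build_m2m_features(stub_tokenize, "a long source article ...",
                              "a short summary", "english", {"english": 250099})
```

The point of the sketch is only the shape of the features: the target language never appears inside `input_ids`.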

Questions about replicating m2m model

Hi there, happy new year and thanks for releasing the code for your nice work! I have some questions about the training configs and I hope you could clarify them for me.

Specifically, I would like to replicate the released m2m model and am using the provided trainer.sh. Since I use two A100s, I set PER_DEVICE_TRAIN_BATCH_SIZE=32 to keep the effective batch size at 256. I keep the rest of the configs intact.
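For reference, the effective batch size here is just the product of per-device batch size, GPU count, and gradient accumulation steps; the accumulation value of 4 below is an assumption used only to make the arithmetic concrete:

```python
def effective_batch_size(per_device_batch, num_gpus, grad_accum_steps):
    # Number of examples contributing to a single optimizer update.
    return per_device_batch * num_gpus * grad_accum_steps

# Two A100s with PER_DEVICE_TRAIN_BATCH_SIZE=32 and an assumed 4
# accumulation steps give the target effective batch size of 256.
print(effective_batch_size(32, 2, 4))  # -> 256
```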

Then I run inference on the Chinese Simplified-English test set with the checkpoint at 25K steps. It gets 21.65 ROUGE-L, while the released m2m model gets 26.75.

After inspecting the model outputs, I found that my replicated model sometimes generates summaries in non-target languages. For example, on the Chinese Simplified-English test set, around 10% of the generated summaries are in Chinese, while the released model generates only English summaries. This may explain the performance gap above.

Another observation is that this problem is more severe with the checkpoint at 20K steps, so I wonder if it is due to underfitting and may vanish with more training steps (e.g., 30K). I have not validated this assumption yet, as I would like to adopt your original training configs if possible.

I would appreciate it if you could shed some light on how to correctly replicate your m2m model. Are there any particular training configs I should adopt? It would also help if you could share which checkpoint (i.e., how many training steps) the released m2m model corresponds to.

Many thanks!

Some questions about training settings

Dear authors,
I have read your paper many times; it's awesome work!
I tried to reproduce your model for my future work. However, I can't get the same results as the models you provided.
Would you mind sharing the training settings so that I can better reproduce your work? For example: number of training steps, learning rate, and scheduler type.
Thanks a lot!

Enhanced model

Hi,

Thank you for this great contribution to the problem of cross-lingual summarization.

I evaluated both mT5_m2m_crossSum and mT5_m2m_crossSum_enhanced in a few language pairs and confirmed that the enhanced model actually has significantly higher ROUGE scores. I am curious to learn more about the specific training process of the enhanced model. Was the improvement primarily the result of an extensive hyperparameter search, or did you employ any additional training techniques that were not detailed in the paper?

Thank you in advance for your response.

Best,
Diogo

Something about Input and output

Hello, I have read your paper four times; awesome work!
I'm confused about the input and output. If we want to summarize an English article into both French and Chinese, can we use the same model for both, or do we need to train two models: one for English-French and another for English-Chinese?

How was Pearson correlation calculated in the experiments?

Hi, thanks for the great work!

There are a few details regarding the correlation of LaSE in the paper that I did not quite understand. For each target language, the top-5 source languages were used to evaluate LaSE's correlation with ROUGE-2 in the out-of-language scenario.

Let's assume those 5 languages are lan1, lan2, ..., lan5, with the target language being tgt_lan0. I'm assuming the procedure is as follows: generate summaries in the target language from the 5 source languages to obtain 5 prediction sets pred1, pred2, ..., pred5. Aggregate those prediction sets, evaluate the LaSE score against the references corresponding to each source language (ref1, ref2, ..., ref5), then calculate ROUGE-2 against ref0. In total, we have len(pred1) + len(pred2) + ... + len(pred5) scores each for LaSE and ROUGE-2.

After this, we calculate the Pearson correlation between the two 1D arrays formed from these two sets of scores. Is this interpretation correct? If so, since scores for different reference-prediction pairs might not be directly comparable (e.g., a score of 0.5 might be bad for some pairs but good for others), do you think aggregating them this way is suboptimal?
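Under this interpretation, the aggregation and correlation step would look something like the sketch below. The score values are invented placeholders, and `pearson` is a plain implementation equivalent to `scipy.stats.pearsonr`:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-example scores from the five source languages, already
# concatenated into one flat list per metric
# (length = len(pred1) + len(pred2) + ... + len(pred5)).
lase_scores = [0.62, 0.41, 0.55, 0.70, 0.33]
rouge2_scores = [0.18, 0.09, 0.14, 0.22, 0.07]
r = pearson(lase_scores, rouge2_scores)
```

Whether pooling across source languages like this is appropriate is exactly the question being asked: the correlation mixes pairs whose score scales may differ.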

Could you help clarify this @Tahmid04 ?
