
Make the Best of Cross-lingual Transfer: Evidence from POS Tagging with over 100 Languages (ACL 2022)

Home Page: https://aclanthology.org/2022.acl-long.529/


xpos's Introduction

Make the Best of Cross-lingual Transfer: Evidence from POS Tagging with over 100 Languages

Wietse de Vries, Martijn Wieling, Malvina Nissim

Abstract: Cross-lingual transfer learning with large multilingual pre-trained models can be an effective approach for low-resource languages with no labeled training data. Existing evaluations of zero-shot cross-lingual generalisability of large pre-trained models use datasets with English training data, and test data in a selection of target languages. We explore a more extensive transfer learning setup with 65 different source languages and 105 target languages for part-of-speech tagging. Through our analysis, we show that pre-training of both source and target language, as well as matching language families, writing systems, word order systems, and lexical-phonetic distance significantly impact cross-lingual performance.

Status: Published at ACL 2022. Final version: https://aclanthology.org/2022.acl-long.529/

@inproceedings{de-vries-etal-2022-make,
    title = "Make the Best of Cross-lingual Transfer: Evidence from {POS} Tagging with over 100 Languages",
    author = "de Vries, Wietse  and
      Wieling, Martijn  and
      Nissim, Malvina",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.529",
    pages = "7676--7685",
}

Demo and Models

The main results and discussion of the paper are based on the predictions of 65 fine-tuned XLM-RoBERTa models. A demo and the 65 models are available in a Hugging Face Space.

Code

Dependencies

If you use Conda environments, you can replicate the exact dependency versions that were used for the experiments:

conda create -n xpos --file conda-linux-64.lock  # if 64-bit Linux
conda create -n xpos --file conda-osx-arm64.lock  # if Apple Silicon
conda activate xpos

Training

You can then train the models with:

python src/train.py udpos --learning_rate=5e-5 --eval_steps=1000 --per_device_batch_size=10 --max_steps=1000 --multi

Tip: append --dry_run to the previous command to only download and cache base models and data without training any models.

Tip: append --language_source {lang_code} (to any command) to only train for one source language. Check language codes in configs/udpos.yaml.

Cross-lingual prediction

Predictions for the best trained models for every target language can be generated with:

python src/predict.py udpos

Tip: append --language_source {lang_code} or --language_target {lang_code} to generate predictions for specific languages.

Tip: append --digest {digest} to generate predictions for a specific training configuration. The digest is the random string of 8 characters in the output path of each model.
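The repo does not document the exact output path layout, so as an illustrative sketch only (the path below is hypothetical), picking the digest out of a model's output path could look like this:

```python
from pathlib import Path

# Hypothetical output path; the real layout may differ. The digest is the
# 8-character alphanumeric component the README refers to.
path = Path("output/udpos/en/a1b2c3d4/pytorch_model.bin")
digest = next(part for part in path.parts if len(part) == 8 and part.isalnum())
print(digest)  # a1b2c3d4
```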

Cross-lingual results

Export a CSV with test accuracies for every source/target combination with:

python src/results.py udpos -a -e results.csv

Tip: Just like with training and prediction, you can restrict the export to specific languages or a specific digest.

Models

Export trained models with:

python src/export.py udpos -e models

Tip: Just like with training and prediction, you can restrict the export to specific languages or a specific digest.

xpos's People

Contributors: wietsedv

xpos's Issues

Tokenization/merging

Hi there

Skimmed through the paper and the demo and absolutely love it. Great work! While playing with the demo, I came across a small but important bug. To be honest, I do not think this is in your codebase but simply a quirk in the Hugging Face tokenizers. If it is on their side, I'll escalate to their repo.

When I run the following sentence through the demo, I find that the "retokenization"/merging has an error in it.

Soms wil ik naar de vogeltjes kijken , maar wil jij dat wel ?

The output is as below, where "jij dat" is merged into "jijdat". This is quite undesirable, of course.

(screenshot of the demo output)

Looking at app.py, it seems that you get the output back from the pipeline, but I just want to check whether you changed anything in particular about the XLM-R tokenizer. Otherwise, I assume it's tokenizer.decode() that is messing things up, particularly its clean_up_tokenization_spaces parameter (untested, might be wrong).
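The kind of merging bug described above can be reproduced without any library at all. XLM-R uses a SentencePiece tokenizer in which a leading "▁" marks a word boundary; the sketch below (not the repo's actual code) shows how dropping that marker per token before joining fuses adjacent words:

```python
# SentencePiece-style tokens: a leading "▁" (U+2581) marks the start of a word.
tokens = ["▁jij", "▁dat"]

# Naive merge: strip the marker from each token, then concatenate.
# The word boundary is lost, fusing "jij" and "dat".
naive = "".join(t.lstrip("▁") for t in tokens)

# Boundary-aware merge: turn each "▁" into a space, then trim.
correct = "".join(tokens).replace("▁", " ").strip()

print(naive)    # jijdat
print(correct)  # jij dat
```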

how to draw a heatmap

Hi @wietsedv,

Thanks for your great work!

Could you please point me to the code for drawing the heatmaps in your paper?

Best,
Chiyu

Having trouble reimplementing the code

Hi @wietsedv,
I'm currently working on implementing the transfer score regression using each language feature, as outlined in the paper. However, I'm encountering some challenges in this process:

  1. The code does not work with the currently attached data.
  2. There are no results for the writing system type.
  3. Additionally, I'm faced with multiple regression expressions and am unsure which one would be most appropriate to use.

Could you look into the writing system type results?
I'm also looking forward to your answers to my other questions.
Thank you.
