License: MIT License


Prompt4BLI

This repository is the official PyTorch implementation of the following paper:

Yaoyiran Li, Anna Korhonen, and Ivan Vulić. 2023. On Bilingual Lexicon Induction with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023). [Paper] [OpenReview]

Prompt4BLI addresses the Bilingual Lexicon Induction (BLI) / Word Translation task with autoregressive Large Language Models (LLMs). We demonstrate, for the first time, that prompting multilingual LLMs for BLI outperforms traditional BLI approaches, which rely on calculating cross-lingual word embeddings (CLWEs). While prompting off-the-shelf LLMs already establishes new state-of-the-art BLI performance on many language pairs (our main experimental setup), the Prompt4BLI repo also provides code for BLI-oriented fine-tuning, which can further improve the results (a side experiment, demonstrated on smaller-scale LLMs).

Traditional methods rely on learning parameterized CLWE mappings or cross-lingual word-pair scoring functions and usually tackle BLI in three setups: (1) Supervised, 5K seed translation pairs; (2) Semi-Supervised, 1K seed translation pairs; (3) Unsupervised, no seed translation pairs (cf. our previous work, ContrastiveBLI and BLICEr). In contrast, Prompt4BLI uses only off-the-shelf LLMs, requiring neither fine-tuning nor updates to any learnable parameters. Our work considers the following prompting setups:

  • Few-Shot Prompting: We retrieve a subset of the seed translation pairs (via nearest-neighbour retrieval) as in-context examples for prompting. This corresponds to the traditional Supervised and Semi-Supervised BLI setups, where the seed bilingual dictionary contains 5K and 1K pairs respectively.
  • Zero-Shot Prompting: No in-context examples are used. This corresponds to the traditional Unsupervised BLI setup.
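The few-shot retrieval step can be sketched as follows. This is an illustrative sketch, not the repository's exact implementation: the function name and the use of cosine similarity over source-word embeddings are assumptions.

```python
import numpy as np

def retrieve_in_context_pairs(query_vec, seed_vecs, seed_pairs, n_shot=5):
    """Return the n_shot seed translation pairs whose source-word
    embeddings are nearest (by cosine similarity) to the query word."""
    # Normalise so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    s = seed_vecs / np.linalg.norm(seed_vecs, axis=1, keepdims=True)
    sims = s @ q
    top = np.argsort(-sims)[:n_shot]
    return [seed_pairs[i] for i in top]

# Toy example: 3 seed pairs with 2-d "embeddings".
seed_pairs = [("hund", "dog"), ("katze", "cat"), ("haus", "house")]
seed_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.05])
print(retrieve_in_context_pairs(query, seed_vecs, seed_pairs, n_shot=2))
```

The retrieved pairs are then inserted into the prompt as in-context examples, ordered by similarity to the query word.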

Dependencies

  • PyTorch>=1.10.1
  • Transformers>=4.28.1

LLMs Used in Our Work

LLM          (Hugging Face) Model ID
mT5-small "google/mt5-small"
mT5-base "google/mt5-base"
mT5-large "google/mt5-large"
mT5-xl "google/mt5-xl"
mT5-xxl "google/mt5-xxl"
mT0-small "bigscience/mt0-small"
mT0-base "bigscience/mt0-base"
mT0-large "bigscience/mt0-large"
mT0-xl "bigscience/mt0-xl"
mT0-xxl "bigscience/mt0-xxl"
XGLM-564M "facebook/xglm-564M"
XGLM-1.7B "facebook/xglm-1.7B"
XGLM-2.9B "facebook/xglm-2.9B"
XGLM-4.5B "facebook/xglm-4.5B"
XGLM-7.5B "facebook/xglm-7.5B"
mGPT "sberbank-ai/mGPT"
LLaMA-7B "huggyllama/llama-7b"
LLaMA-13B "huggyllama/llama-13b"
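The table above can be turned into a small lookup helper. This is a convenience sketch, not code from the repo; note that mT5/mT0 are encoder-decoder models while XGLM, mGPT and LLaMA are decoder-only, so they need different Transformers Auto classes.

```python
# Hugging Face model IDs for the LLMs used in the paper.
LLM_IDS = {
    "mt5-small": "google/mt5-small", "mt5-base": "google/mt5-base",
    "mt5-large": "google/mt5-large", "mt5-xl": "google/mt5-xl",
    "mt5-xxl": "google/mt5-xxl",
    "mt0-small": "bigscience/mt0-small", "mt0-base": "bigscience/mt0-base",
    "mt0-large": "bigscience/mt0-large", "mt0-xl": "bigscience/mt0-xl",
    "mt0-xxl": "bigscience/mt0-xxl",
    "xglm-564M": "facebook/xglm-564M", "xglm-1.7B": "facebook/xglm-1.7B",
    "xglm-2.9B": "facebook/xglm-2.9B", "xglm-4.5B": "facebook/xglm-4.5B",
    "xglm-7.5B": "facebook/xglm-7.5B",
    "mgpt": "sberbank-ai/mGPT",
    "llama-7b": "huggyllama/llama-7b", "llama-13b": "huggyllama/llama-13b",
}

def is_seq2seq(name: str) -> bool:
    """mT5/mT0 are encoder-decoder; the others are decoder-only."""
    return name.startswith(("mt5", "mt0"))

def load(name: str):
    """Load a (model, tokenizer) pair from the Hugging Face Hub."""
    from transformers import (AutoModelForCausalLM,
                              AutoModelForSeq2SeqLM, AutoTokenizer)
    model_id = LLM_IDS[name]
    cls = AutoModelForSeq2SeqLM if is_seq2seq(name) else AutoModelForCausalLM
    return cls.from_pretrained(model_id), AutoTokenizer.from_pretrained(model_id)
```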

Data

Following ContrastiveBLI and BLICEr, our data is derived from XLING (8 languages, 56 BLI directions in total) and PanLex-BLI (15 lower-resource languages, 210 BLI directions in total).

Get XLING data:

sh get_xling_data.sh

For PanLex-BLI, please see ./get_panlex_data, where we provide the code for deriving the monolingual word embeddings.
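The monolingual embeddings used in these benchmarks are typically distributed in the standard word2vec/fastText text format (a header line `<count> <dim>`, then one `word v1 ... vd` line per word). A minimal loader might look like the sketch below; `load_vec` is an illustrative helper, not a function from this repo.

```python
import io
import numpy as np

def load_vec(fobj, top_k=200000):
    """Parse word embeddings in the word2vec .vec text format."""
    n, dim = map(int, fobj.readline().split())
    words, vecs = [], []
    for _ in range(min(n, top_k)):
        parts = fobj.readline().rstrip().split(" ")
        words.append(parts[0])
        vecs.append(np.asarray(parts[1:], dtype=np.float32))
    return words, np.vstack(vecs)

# Toy "file" with 2 words, 3 dimensions.
toy = io.StringIO("2 3\nhund 0.1 0.2 0.3\nkatze 0.4 0.5 0.6\n")
words, vecs = load_vec(toy)
print(words, vecs.shape)
```

Limiting to the top-k most frequent words (the `top_k` cutoff) is a common memory-saving step when working with these embedding files.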

Run the Code

Prepare BLI Data and Extract In-Context Examples for Few-Shot Prompting (XLING):

python run_extract_vocabularies.py
python run_extract_bli_data.py

Prepare BLI Data and Extract In-Context Examples for Few-Shot Prompting (PanLex-BLI):

python run_extract_vocabularies_panlex.py
python run_extract_bli_data_panlex.py

(Optional) Run BLI-Oriented LLM Fine-Tuning (define LLM dirs, learning rate, batch size, and random seed in run_training.py):

python run_prepare_training_data.py
python run_training.py
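For fine-tuning, each seed translation pair is rendered as a text-to-text training example. The template below is an illustrative assumption; the actual templates are defined in run_prepare_training_data.py.

```python
def make_training_example(src_word, tgt_word,
                          src_lang="German", tgt_lang="English"):
    """Format one seed translation pair as a (prompt, target) text pair
    for text-to-text fine-tuning. The template here is illustrative only."""
    prompt = f"Translate from {src_lang} to {tgt_lang}: {src_word}=>"
    return prompt, tgt_word

prompt, target = make_training_example("hund", "dog")
print(prompt, target)
```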

Run BLI Evaluation (define seed dictionary size, n_shot, LLM dir, and language pairs to evaluate manually in run_bli.py):

python run_bli.py
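At evaluation time, the retrieved in-context examples and the query word are assembled into a single prompt; with an empty example list this degrades to the zero-shot setup. The template below is an illustrative assumption, not necessarily one of the exact templates in run_bli.py.

```python
def build_prompt(word, in_context_pairs,
                 src_lang="German", tgt_lang="English"):
    """Build a few-shot BLI prompt from (source, target) word pairs;
    an empty pair list yields a zero-shot prompt."""
    lines = [f"The {src_lang} word '{s}' in {tgt_lang} is {t}."
             for s, t in in_context_pairs]
    lines.append(f"The {src_lang} word '{word}' in {tgt_lang} is")
    return "\n".join(lines)

demo = build_prompt("hund", [("katze", "cat")])
print(demo)
```

The LLM's continuation of the final line is then parsed as the predicted translation.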

Citation

Please cite our paper if you find Prompt4BLI useful.

@inproceedings{li-etal-2023-bilingual,
    title     = {On Bilingual Lexicon Induction with Large Language Models},
    author    = {Li, Yaoyiran and Korhonen, Anna and Vuli{\'c}, Ivan},
    booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},    
    year      = {2023}
}
