Teaching Large Language Models an Unseen Language on the Fly

License: MIT License

ZhuangBench

Data and code for the following papers:

ACL'24 Findings (Full-Length Paper): Teaching Large Language Models an Unseen Language on the Fly

ICLR'24 Tiny Paper: Can LLMs Learn a New Language on the Fly? A Case Study on Zhuang

Dataset

We present ZhuangBench, a collection of NLP resources for Zhuang (壮语), a low-resource language spoken in China.

It consists of a Zhuang-Chinese dictionary, a Zhuang-Chinese parallel corpus, and a Zhuang-Chinese machine translation test set.

Important: Preventing Test Set Contamination

We encrypted the source files of ZhuangBench in data.zip to prevent test set contamination. The password is zhuangbench.
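For readers who prefer to unpack the archive from Python, the following is a minimal sketch rather than part of the released code. The helper name extract_archive and the output directory data are assumptions made for illustration. Note that Python's standard zipfile module can only decrypt the legacy ZIP encryption scheme; if data.zip turns out to use AES encryption, an external tool such as 7-Zip would be needed instead.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, password, out_dir="data"):
    """Extract a password-protected zip archive and return its member names.

    Hypothetical helper for illustration; it is not part of the ZhuangBench code.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        # zipfile expects the password as bytes, not str.
        zf.extractall(path=out, pwd=password.encode("utf-8"))
        return zf.namelist()


if __name__ == "__main__":
    # Only runs when data.zip is present in the current directory.
    if Path("data.zip").exists():
        for name in extract_archive("data.zip", "zhuangbench"):
            print(name)
```

Keeping the extracted files out of public repositories and web pages preserves the purpose of the encryption.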

List of files:

  • dictionary_za2zh.jsonl: Zhuang-Chinese dictionary.
  • dictionary_zh2za.jsonl: Chinese-Zhuang dictionary.
  • parallel_corpus.json: Zhuang-Chinese parallel corpus.
  • test_translation_set.json: Zhuang-Chinese machine translation test set.
  • preprocessed/dictionary_za2zh_web+giza.jsonl: Zhuang-Chinese dictionary augmented with BLI from Giza++.
  • preprocessed/dictionary_zh2za_web+giza+synonym.jsonl: Chinese-Zhuang dictionary augmented with BLI from Giza++ and synonyms.
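The files listed above come in two formats, JSON Lines (.jsonl) and plain JSON (.json). A minimal sketch of loading both is shown here. The function names load_jsonl and load_json and the data/ directory are assumptions, and because the record fields are not documented in this README, the sketch only inspects the first dictionary entry instead of assuming a schema.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file: one JSON value per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_json(path):
    """Read a file that holds a single JSON document."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Assumed location of the extracted files; adjust to your setup.
    dict_path = Path("data") / "dictionary_za2zh.jsonl"
    if dict_path.exists():
        entries = load_jsonl(dict_path)
        print(len(entries), "entries")
        if entries:
            print(entries[0])  # inspect the actual fields before relying on them
```

Opening the files with encoding="utf-8" matters here, since the entries contain Chinese text.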

Beta Version

Our ICLR'24 Tiny Paper uses a beta version of the dataset, ZhuangBench-Beta. We provide the data in data-beta-version.zip (password: zhuangbench-beta). This data is for archival purposes only. We recommend using the newer data in data.zip, which is larger and includes typo corrections.

Code

We provide the code for DiPMT++ to reproduce the results in the paper.

Install the dependencies:

pip install -r requirements.txt

Use the scripts in ./scripts to run the LLMs and evaluate the results.

License

The license for the code and data is MIT.

Citation

@article{zhang2024teaching,
  title={Teaching Large Language Models an Unseen Language on the Fly},
  author={Zhang, Chen and Liu, Xiao and Lin, Jiuheng and Feng, Yansong},
  journal={arXiv preprint arXiv:2402.19167},
  year={2024}
}
@inproceedings{zhang2024can,
  title={Can {LLM}s Learn a New Language on the Fly? A Case Study on Zhuang},
  author={Chen Zhang and Mingxu Tao and Quzhe Huang and Zhibin Chen and Yansong Feng},
  booktitle={The Second Tiny Papers Track at ICLR 2024},
  year={2024},
}

zhuangbench's Issues

Could you please provide the source code of fine-tuning LLaMA-2-7B-chat?

Hi @luciusssss

Many thanks for releasing the source code of your great paper. In Table 1 of your paper, you reported the performance of the LLaMA-2-7b-chat model after fine-tuning. If possible, could you please release the fine-tuning source code? I am a beginner at fine-tuning LLMs.
Thank you very much for your consideration and your help!
