The qameleon from sabyaghossh

View Code? Open in Web Editor NEW

QAmeleon introduces synthetic multilingual QA data using PaLM, a 540B large language model. This dataset was generated by prompt tuning PaLM with only five examples per language. We use the synthetic data to finetune downstream QA models leading to improved accuracy in comparison to English-only and translation-based baselines.

qameleon's Introduction

QAmeleon introduces synthetic multilingual QA data contaning in 8 langauges using PaLM-540B, a large language model. This dataset was generated by prompt tuning PaLM with only five examples per language. We use the synthetic data to finetune downstream QA models leading to improved accuracy in comparison to English-only and translation-based baselines.

Data available at https://storage.googleapis.com/qameleon/qamelon_pt_accepted.csv

More details can be found in the paper which can be cited as follows:

@misc{agrawal2022qameleon,
      title={QAmeleon: Multilingual QA with Only 5 Examples}, 
      author={Priyanka Agrawal and Chris Alberti and Fantine Huot and Joshua Maynez and Ji Ma and Sebastian Ruder and Kuzman Ganchev and Dipanjan Das and Mirella Lapata},
      year={2022},
      eprint={2211.08264},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

This dataset contains a total of 47173 Question Answer instances across 8 langauges, following is the count per language.

Language	Count
ar	6966
bn	6084
fi	5028
id	6797
ko	6471
ru	5557
sw	5597
te	4673
Total	47173

The QAmeleon dataset is released under the CC-BY 4.0 license.

Recommend Projects

sabyaghossh / qameleon Goto Github PK

qameleon's Introduction

qameleon's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent