AfricaNLP resources

A list of the resources we developed in collaboration with LSV and Masakhane during my doctoral studies and beyond.

Labelled Datasets for AfricaNLP

| Dataset Name | NLP Task | Publication | Languages covered |
|---|---|---|---|
| MasakhaNER | named entity recognition | MasakhaNER: Named Entity Recognition for African Languages | amh, hau, ibo, kin, lug, luo, pcm, swa, wol, yor |
| MAFAND-MT | machine translation | A Few Thousand Translations Go a Long Way | amh, bam, bbj, ewe, fon, hau, ibo, kin, lug, luo, mos, nya, pcm, sna, swa, tsn, twi, wol, xho, yor, zul |
| ANTC | news-topic classification | multilingual adaptive fine-tuning (MAFT) | lin, pcm, mlg, som, zul |
| MENYO-20K | machine translation | MENYO-20k: A Multi-domain English–Yoruba Corpus for Machine Translation | yor |
| NaijaSenti | sentiment classification | NaijaSenti: A Nigerian Twitter Sentiment Corpus | hau, ibo, pcm, yor |
| Hausa and Yoruba News Topic | news-topic classification | Transfer Learning and Distant Supervision for Multilingual Transformer Models | hau, yor |
| Hausa VOA NER | named entity recognition | Transfer Learning and Distant Supervision for Multilingual Transformer Models | hau, yor |
| Yoruba GV NER | named entity recognition | Massive vs. Curated Word Embeddings for Low-Resourced Languages | yor |

Unlabelled Corpus for AfricaNLP

Multilingual Pre-trained Language Models

The models below were created by applying multilingual adaptive fine-tuning (MAFT) to a distilled XLM-R model, XLM-R, mT5, ByT5, and mBART. For each model we list its size (in millions) and its architecture. The models cover the following 20 languages: afr, amh, ara, eng, fra, hau, ibo, mlg, nya, orm, pcm, kin, run, sna, som, sot, swa, xho, yor, zul.

| Model | Size | Architecture |
|---|---|---|
| AfroXLMR-mini | 117M | Masked LM |
| AfroXLMR-small | 140M | Masked LM |
| AfroXLMR-base | 270M | Masked LM |
| AfroXLMR-large | 550M | Masked LM |
| AfriMT5 | 580M | Seq-to-Seq |
| AfriByT5 | 580M | Seq-to-Seq |
| AfriMBART | 610M | Seq-to-Seq |

Language Adaptive Fine-tuning (LAFT) Models

The following pre-trained language models (PLMs) were created by adapting a multilingual model to a single language using a monolingual corpus in that language. The monolingual corpora used to create them are described in the MasakhaNER paper and the MAFT paper.

| Language | mBERT | XLM-R-base | XLM-R-large |
|---|---|---|---|
| amh | Davlan/bert-base-multilingual-cased-finetuned-amharic | Davlan/xlm-roberta-base-finetuned-amharic | |
| hau | Davlan/bert-base-multilingual-cased-finetuned-hausa | Davlan/xlm-roberta-base-finetuned-hausa | |
| ibo | Davlan/bert-base-multilingual-cased-finetuned-igbo | Davlan/xlm-roberta-base-finetuned-igbo | |
| kin | Davlan/bert-base-multilingual-cased-finetuned-kinyarwanda | Davlan/xlm-roberta-base-finetuned-kinyarwanda | |
| lin | | Davlan/xlm-roberta-base-finetuned-lingala | |
| lug | Davlan/bert-base-multilingual-cased-finetuned-luganda | Davlan/xlm-roberta-base-finetuned-luganda | |
| luo | Davlan/bert-base-multilingual-cased-finetuned-luo | Davlan/xlm-roberta-base-finetuned-luo | |
| mlg | | | |
| nya | | Davlan/xlm-roberta-base-finetuned-chichewa | |
| pcm | Davlan/bert-base-multilingual-cased-finetuned-naija | Davlan/xlm-roberta-base-finetuned-naija | |
| sna | | Davlan/xlm-roberta-base-finetuned-shona | |
| som | | Davlan/xlm-roberta-base-finetuned-somali | |
| swa | Davlan/bert-base-multilingual-cased-finetuned-swahili | Davlan/xlm-roberta-base-finetuned-swahili | |
| wol | Davlan/bert-base-multilingual-cased-finetuned-wolof | Davlan/xlm-roberta-base-finetuned-wolof | |
| xho | | Davlan/xlm-roberta-base-finetuned-xhosa | |
| yor | Davlan/bert-base-multilingual-cased-finetuned-yoruba | Davlan/xlm-roberta-base-finetuned-yoruba | |
| zul | | Davlan/xlm-roberta-base-finetuned-zulu | |
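
The model names in the table above follow a regular pattern, so a small helper can build them from a language code. The sketch below assumes the names are Hugging Face Hub model IDs and that the `transformers` library is installed; neither is stated explicitly here. The helper names `laft_model_id` and `load_laft_model` are hypothetical, introduced only for illustration.

```python
# Language code -> name suffix, copied from the model names in the table above.
LAFT_SUFFIX = {
    "amh": "amharic", "hau": "hausa", "ibo": "igbo", "kin": "kinyarwanda",
    "lin": "lingala", "lug": "luganda", "luo": "luo", "nya": "chichewa",
    "pcm": "naija", "sna": "shona", "som": "somali", "swa": "swahili",
    "wol": "wolof", "xho": "xhosa", "yor": "yoruba", "zul": "zulu",
}

# Languages that have an mBERT-based model listed in the table.
MBERT_LANGS = {"amh", "hau", "ibo", "kin", "lug", "luo", "pcm", "swa", "wol", "yor"}

# Short backbone keys (hypothetical) -> base model names used in the table.
BACKBONES = {
    "mbert": "bert-base-multilingual-cased",
    "xlmr-base": "xlm-roberta-base",
}


def laft_model_id(lang, backbone="xlmr-base"):
    """Return the model name listed in the table for a language and backbone."""
    if backbone not in BACKBONES:
        raise ValueError(f"unknown backbone: {backbone}")
    if lang not in LAFT_SUFFIX:
        raise ValueError(f"no LAFT model listed for language: {lang}")
    if backbone == "mbert" and lang not in MBERT_LANGS:
        raise ValueError(f"no mBERT LAFT model listed for language: {lang}")
    return f"Davlan/{BACKBONES[backbone]}-finetuned-{LAFT_SUFFIX[lang]}"


def load_laft_model(lang, backbone="xlmr-base"):
    """Load tokenizer and masked-LM model (assumes a Hugging Face Hub ID)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    model_id = laft_model_id(lang, backbone)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    return tokenizer, model
```

Calling `load_laft_model("hau")` would download the weights on first use, so it needs network access; `laft_model_id("yor", "mbert")` only builds the name and runs offline.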

FastText Embeddings for African languages

We provide word embeddings of higher quality than the pre-trained FastText embeddings trained on Common Crawl and Wikipedia. Although we did not evaluate the quality on all the languages, our evaluation on Yoruba and Twi shows that they perform better on word similarity tasks. The FastText embeddings are trained on curated data from JW300, the Bible, VOA, BBC, and other news websites. Details of the data sources are in my PhD dissertation.
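
For readers unfamiliar with word similarity tasks, here is a minimal sketch of how such an evaluation is commonly scored: compute the cosine similarity of each word pair's vectors and report the Spearman rank correlation with human ratings. The exact protocol used for Yoruba and Twi is not described here, so treat this as an illustration under that assumption. The function names and the input layout (a plain dict of vectors and a list of rated pairs) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def word_similarity_score(vectors, rated_pairs):
    """Spearman correlation between model similarities and human ratings.

    vectors: dict mapping word -> list of floats (hypothetical layout).
    rated_pairs: list of (word1, word2, human_rating) tuples.
    Pairs with a word missing from `vectors` are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(model_scores), ranks(human_scores))
```

A higher score means the embedding's similarity ordering agrees more closely with the human ratings.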

We trained the FastText embeddings using Gensim 3.8.1. All embedding models can be downloaded from Zenodo; the links are listed below.

| Language | Model |
|---|---|
| amh | Amharic FastText |
| bam | Bambara FastText |
| bbj | Ghomala FastText |
| ewe | Ewe FastText |
| fon | Fon FastText |
| hau | Hausa FastText |
| ibo | Igbo FastText |
| kin | Kinyarwanda FastText |
| lug | Luganda FastText |
| luo | Luo FastText |
| mos | Mossi FastText |
| nya | Chichewa FastText |
| pcm | Nigerian-Pidgin FastText |
| sna | Shona FastText |
| swa | Swahili FastText |
| tsn | Setswana FastText |
| twi | Twi FastText |
| wol | Wolof FastText |
| xho | Xhosa FastText |
| yor | Yoruba FastText |
| zul | Zulu FastText |
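
Once a model is downloaded, it can likely be opened with Gensim, since the embeddings were trained with it. The sketch below assumes the downloaded file is in Gensim's native format (the one written by `model.save`), which is not confirmed here; if the files are in another format, a different loader is needed. The function names and any path passed to them are hypothetical.

```python
from pathlib import Path


def load_fasttext(model_path):
    """Load a FastText model saved in Gensim's native format (assumed)."""
    path = Path(model_path)
    if not path.exists():
        raise FileNotFoundError(f"download the model from Zenodo first: {path}")
    # Imported lazily so the path check works without Gensim installed.
    from gensim.models import FastText

    return FastText.load(str(path))


def nearest_words(model, word, topn=5):
    """Return the `topn` words most similar to `word`, without scores."""
    return [neighbour for neighbour, _score in model.wv.most_similar(word, topn=topn)]
```

For example, `nearest_words(load_fasttext(path), word)` would return the nearest neighbours of `word` for whichever language's model `path` points to. Models trained with Gensim 3.8.1 may need a compatible Gensim version to load.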
