masakhane-community's Introduction

Masakhane - A living collection of NLP projects for Africans, by Africans

MASAKHANE is a research effort for NLP for African languages that is OPEN SOURCE, CONTINENT-WIDE, DISTRIBUTED and ONLINE. This GitHub repository houses the data, code, results and research for building open baseline NLP results for African languages.

Website: masakhane.io

Our Mission

Masakhane is a grassroots organisation whose mission is to strengthen and spur NLP research in African languages, for Africans, by Africans. Despite the fact that 2000 of the world’s languages are African, African languages are barely represented in technology. The tragic past of colonialism has been devastating for African languages in terms of their support, preservation and integration. This has resulted in a technological space that does not understand our names, our cultures, our places, our history.

Masakhane roughly translates to “We build together” in isiZulu. Our goal is for Africans to shape and own these technological advances towards human dignity, well-being and equity, through inclusive community building, open participatory research and multidisciplinarity.

Our Values

  • Umuntu Ngumuntu Ngabantu - loosely translated from isiZulu, this means “a person is a person through another person” or “I am because you are”. This philosophy calls for collaboration, participation and community. It proposes relationality over individualism, for stronger social cohesion towards sustainable communities. It holds that we share our successes and that one’s personhood is evaluated based on one’s contributions to the community.

  • African-centricity - We centralize the narratives of Africans as a remedy to the effects of Eurocentrism on our beliefs. This way we reassert a new way of looking at information from an African perspective and shun any attempts to devalue our knowledge and stories.

  • Ownership - We believe that Africans should be in charge of owning, driving and participating in the NLP research process, rather than as observers or data providers.

  • Openness - We believe in sharing our ideas and progress openly, especially on the African continent, for Africans. We’re against research that takes African contributions or data and puts them behind a paywall that is infeasible for Africans to access.

  • Multidisciplinarity - We truly believe in participation from all fields and levels of experience, and that multidisciplinarity leads to a more robust and more inclusive society.

  • Everyone has valuable knowledge - We believe that each person’s individual experiences have value, and that each person is worth listening to and has something to contribute.

  • Kindness - We believe that being considerate, friendly and generous within our community is the best way to support it and encourage more inclusivity.

  • Responsibility - We believe that each person in the technology process has an ethical responsibility for what they produce in the world. For this reason, we actively reckon with the ethical impacts of our work.

  • Data sovereignty - We believe Africans should be able to decide what data represents our communities globally, retain ultimate ownership of that data, and know how it is used.

  • Reproducibility - We believe in reproducible research. As a result, we publish our code and data from our research so that others can reproduce and build upon it.

  • Sustainability - We believe that sustainability is necessary for societal change - that small daily efforts, over a long time, are what truly change the world. To that end, we aim for sustainability of our work by being fully integrated with technological stakeholders, to ensure the community continues to thrive into the future.

Goals

  • For Africa: To build and facilitate a community of NLP researchers, connect and grow it, spurring and sharing further research, build helpful tools for applications in government, medicine, science and education, to enable language preservation and increase its global visibility and relevance.

  • For NLP Research: To build data sets and tools to facilitate NLP research on African languages, and to pose new research problems to enrich the NLP research landscape.

  • For the global research community: To discover best practices for distributed research, to be applied by other emerging research communities.

Progress

How can I contribute?

There are many ways to contribute to MASAKHANE.

  1. TRAIN A MODEL - Contribute a trained model and related code for your language
  2. ANALYSIS - Contribute analysis of data/models for any African languages. You do not need any technical experience for this! If you're a linguist, we can pair you up with an NLP practitioner and you can help contribute analysis
  3. DATA - Help build or find datasets for your language
  4. DOCUMENTATION - Help document our discussions and progress. This is VERY much needed. Or contribute to documentation of the base "notebook" to improve the experience of others
  5. MENTORSHIP - Provide advice or help tune models for their languages and datasets, or help people get started
  6. ADMIN - Working with so many researchers can be quite a challenge! Help out with administrative tasks
  7. COMPUTE - Help with infrastructure and compute! Do you have spare compute to donate? Let us know! We're always looking for more!
  8. BRAINSTORM - Join our weekly meetings, provide advice or ideas
  9. STORY-TELLING - Tell our stories to the world by doing talks about the community, contributing to our Medium publication, or engaging with media outlets
  10. MLOps & ML Engineering - Do you enjoy delving into the MLOps side of machine learning? Are you a software developer looking to hone your ML engineering abilities? Join us to help build tools to support our reproducibility, data gathering, and model sharing!

Want more details? Check out our current initiatives

How do I join?

  1. Join our Slack

  2. Request to join our Google Group - this will add you to our weekly meetings

  3. So we can feature you on our webpage masakhane.io, please fill in our membership form HERE:

Please be patient when waiting for a response from our email address; we're very behind on our administration in the time of COVID-19.

Where do I start?

  • If you're on Slack, you'll see a number of channels which reflect our initiatives (described below). Join them and start engaging
  • Every week, we have an open meeting for our members. These are described on our meeting agenda, where you can learn about the format, and add and vote on topics. Make sure you've joined our Google Group
  • If you're not sure what value you can add, check out our growing message board to see if there are any tasks you can pick up!

Initiatives

Every week we have more ideas, and more impromptu projects emerge. Keen on any initiatives? Join our Slack and find the respective group.

Working on a Masakhane initiative that is not listed here? Please add it with a PR ❤️

Keen to help on any of these initiatives? Please see our message board

| Initiative | Description | Slack Channel | Repository |
|---|---|---|---|
| Machine Translation Benchmarks | Continued expansion and iterations on our language benchmarks as documented on the main GitHub README | #benchmarks | HERE |
| NER Datasets and Benchmarks | We're busy releasing datasets and research around NER | #ner | HERE |
| Dataset Creation | We never have enough data. More is always needed. We have a number of members finding creative ways to build datasets. | #datasetcreation | |
| Reproducibility | The goal is to ensure reproducibility and comparability of models and results. | #reproducibility | |
| Takalani NLP | Development of Language Models for South African languages | #takalani-nlp | |
| Wazobia | Yoruba, Igbo, Hausa and Nigerian languages NMT | #wazobia | |
| Multilingual Chatbot | Developing multilingual chatbots | #multilingual-dialogue | |
| Transfer Learning | Transfer Learning & Multilingual Expansion of Benchmarks | #transfer-learning | |
| Evaluation of Masakhane Models | How good are the Masakhane models? How can we measure it, besides looking at BLEU scores? | #evaluation | |
| Text-to-speech | Corpora and models for text to speech synthesis (TTS) from audio bibles in Ewe, Hausa, Lingala, Asante Twi, Akuapem Twi and Yoruba | #bible-speech | HERE |

Code of Conduct

See Code of Conduct

masakhane-community's People

Contributors

alpoktem, cdleong, dadelani, hackmd-deploy, ignatiusezeani, juliakreutzer, kpu, oldladypants, poppingtonic, ruohoruotsi, tosingithub

masakhane-community's Issues

Masakhane Speaker Topics

Sometimes it's nice to get someone in to come do a talk about a topic that is relevant to the community. If you have something you're specifically interested in learning more about, please post the Topic and a brief motivation as to why you think it would be important to the community.

As the community, please 👍 the topics that are most important for you right now (try not to upvote everything :P That won't help us prioritize which talks to organise first)

Custom Data Notebook: Spaces in file paths can cause issues with bash commands

For example, /content/drive/My Drive/masakhane/$src-$tgt-$tag can cause issues, but also the following situation caused an error for me:

source_file = f"/content/drive/My Drive/Research/Hani Machine Translation/hni_story_corpus/v2/hani_story_corpus_train.{source_language}"
target_file = f"/content/drive/My Drive/Research/Hani Machine Translation/hni_story_corpus/v2/hani_story_corpus_train.{target_language}"

# They should both have the same length.
! wc -l $source_file
! wc -l $target_file

Mitigations we could do:

"MyDrive" instead of "My Drive" helps

Actually, it seems you can just change from using My Drive to MyDrive paths, which helps a lot so long as there aren't spaces elsewhere in the path, e.g. in my case where Hani Machine Translation was in the path to train.eng and train.hni

Add quotes around bash variables

For example:

! wc -l "$source_file" instead of ! wc -l $source_file

and

! head "$source_file"* instead of ! head $source_file*

but this doesn't completely solve it, and can get complicated when we've got some of the more complex cases later in the notebook, like

!cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"

or within the yaml file:

#load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
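
As an alternative to quoting each variable by hand, here is a minimal sketch of two standard-library approaches. It is an illustration, not what the notebook currently does. The helper names `quoted_command` and `run_without_shell` are hypothetical; `shlex.quote` and `subprocess.run` are the real standard-library calls.

```python
import shlex
import subprocess


def quoted_command(program: str, *args: str) -> str:
    """Build a shell command string in which every argument is safely quoted."""
    return " ".join([shlex.quote(program), *(shlex.quote(a) for a in args)])


def run_without_shell(program: str, *args: str) -> str:
    """Run a program from an argument list; no shell is involved,
    so spaces inside an argument are never word-split."""
    result = subprocess.run(
        [program, *args], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical example path containing a space.
source_file = "/content/drive/My Drive/masakhane/train.en"
print(quoted_command("wc", "-l", source_file))
```

Passing an argument list avoids the quoting problem entirely, while `quoted_command` is only useful where a single command string is unavoidable.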

Warn the user about whitespace

Add a section that checks all the paths for whitespace and warns the user, suggesting that it may be easier to remove the spaces from the path.
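
A check like this could be sketched as follows. The function names and the example paths are assumptions made for illustration; the notebook's actual variable names may differ.

```python
def paths_with_whitespace(paths):
    """Return the paths (as strings) that contain any whitespace character."""
    return [str(p) for p in paths if any(ch.isspace() for ch in str(p))]


def warn_about_whitespace(paths):
    """Print a warning for each offending path and return the offenders."""
    offenders = paths_with_whitespace(paths)
    for p in offenders:
        print(
            f"WARNING: {p!r} contains whitespace, which can break bash "
            "commands in this notebook. Consider renaming the folder."
        )
    return offenders


# Hypothetical paths, for illustration only.
warn_about_whitespace([
    "/content/drive/My Drive/masakhane/train.en",
    "/content/drive/MyDrive/masakhane/train.en",
])
```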

Do all our file manipulations with Python

We could rewrite a lot of these to use pathlib
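
As a rough sketch of what that might look like, the two helpers below stand in for the `wc -l` and `cp -r` calls shown earlier. The names `count_lines` and `copy_model_dir` are hypothetical, and `dirs_exist_ok` assumes Python 3.8 or newer.

```python
import shutil
from pathlib import Path


def count_lines(path):
    """Count lines in a file without going through bash.

    Unlike `wc -l`, this also counts a final line that lacks a trailing newline.
    """
    with Path(path).open("rb") as f:
        return sum(1 for _ in f)


def copy_model_dir(src_dir, dst_dir):
    """Copy a directory tree, roughly like `cp -r src/* dst/`.

    Paths are passed as Python objects, so spaces need no quoting.
    Note that this also copies hidden files, which the shell glob would skip.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    shutil.copytree(Path(src_dir), dst, dirs_exist_ok=True)
    return dst
```

Because no shell ever parses the path, "My Drive" and any other spaces in it stop mattering for these operations.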

See also pjreddie/darknet#1672 and https://stackoverflow.com/questions/56640534/cannot-open-train-txt-with-white-space-my-drivehe
