This project forked from dsfsi/dsfsi-dataset-template

The dataset contains editions from the South African government magazine Vuk'uzenzele. Data was scraped from PDFs that have been placed in the data/raw folder. The PDFs were obtained from the Vuk'uzenzele website.

Home Page: https://doi.org/10.5281/zenodo.7598539

License: MIT License

Python 44.03% Makefile 1.41% Dockerfile 0.30% Shell 3.94% Jupyter Notebook 50.32%
african-languages africannlp south-africa african-language-data-liberation-front aldlf africanlp language dataset nlproc dsfsi-datasets

vukuzenzele-nlp's Introduction

The Vuk'uzenzele South African Multilingual Corpus

Github: https://github.com/dsfsi/vukuzenzele-nlp/

Zenodo: DOI

Arxiv Preprint: arXiv

Give Feedback: DSFSI Resource Feedback Form

About dataset

The dataset contains editions from the South African government magazine Vuk'uzenzele, created by the Government Communication and Information System (GCIS). Data was scraped from PDFs that have been placed in the data/raw folder. The PDFs were obtained from the Vuk'uzenzele website.

The datasets contain government magazine editions in 11 languages, namely:

Language      Code     Language     Code
English       (eng)    Sepedi       (sep)
Afrikaans     (afr)    Setswana     (tsn)
isiNdebele    (nbl)    Siswati      (ssw)
isiXhosa      (xho)    Tshivenda    (ven)
isiZulu       (zul)    Xitsonga     (tso)
Sesotho       (nso)

Number of Aligned Pairs with Cosine Similarity Score >= 0.65

src_lang trg_lang num_aligned_pairs
ssw xho 2202
ssw zul 2183
xho zul 2102
nso xho 2081
nso tso 2071
ssw tso 2034
nso ssw 2021
tsn tso 2020
tsn xho 2009
tso xho 2009
nso tsn 2002
ssw tsn 1987
tso zul 1957
nso zul 1953
tsn zul 1933
eng zul 1923
eng tso 1923
eng nso 1867
eng ssw 1821
afr xho 1816
eng xho 1801
nbl sep 1795
sep ven 1794
afr ssw 1783
eng tsn 1772
afr zul 1769
afr nso 1746
nbl ven 1699
afr eng 1661
afr tsn 1631
afr tso 1617
afr sep 551
afr ven 498
afr nbl 491
nso sep 410
nso ven 352
sep tso 326
sep tsn 319
tso ven 307
sep ssw 305
sep xho 300
ssw ven 290
tsn ven 285
nbl ssw 282
nbl nso 266
ven xho 260
eng sep 258
nbl xho 250
sep zul 249
nbl tso 238
eng ven 234
nbl tsn 230
nbl zul 226
ven zul 225
eng nbl 184
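
Counts like those above come from filtering alignment output on its similarity score. The following is a minimal sketch of that filter, not the repository's own code. The column names `src_text`, `tgt_text` and `cosine_score` are hypothetical, and the real CSV files in data/sentence_align_output may use different headers.

```python
import csv
import io

THRESHOLD = 0.65  # similarity cut-off used for the table above


def count_aligned_pairs(csv_file, score_column="cosine_score", threshold=THRESHOLD):
    """Count rows whose similarity score is at or above the threshold."""
    reader = csv.DictReader(csv_file)
    return sum(1 for row in reader if float(row[score_column]) >= threshold)


# Tiny in-memory example standing in for one alignment CSV (made-up rows,
# hypothetical column names).
sample = io.StringIO(
    "src_text,tgt_text,cosine_score\n"
    "a,b,0.91\n"
    "c,d,0.65\n"
    "e,f,0.40\n"
)
print(count_aligned_pairs(sample))  # 2
```

To use it on a real file, pass an open file handle and set `score_column` to whatever header the CSV actually uses.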

The dataset is present in several forms on the repo. Generally the dataset is split by edition, e.g. 2020-01-ed1.
The data directory is broken down as follows:

./data
├── external                # Data external to this repo
├── interim                 # I am not really sure - looks like interim in regards to processed.
├── processed               # The data from scraping the raw pdfs
├── raw                     # The raw pdfs of the Vuk'uzenzele magazine
├── sentence_align_output   # The output (csv) of the sentence alignment with LASER language encoders
└── simple_align_output     # The output (csv) of a simple one to one sentence alignment

The dataset is split by edition in the data/processed folder.
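
Edition folder names such as 2020-01-ed1 can be parsed into their parts when iterating over data/processed. This is a minimal sketch, assuming the name reads as year, month and edition number. The `Edition` type and `parse_edition` helper are illustrative and are not part of the repository.

```python
import re
from typing import NamedTuple

# Pattern for edition folder names such as "2020-01-ed1"
# (assumed to mean <year>-<month>-ed<edition number>).
EDITION_RE = re.compile(r"^(?P<year>\d{4})-(?P<month>\d{2})-ed(?P<edition>\d+)$")


class Edition(NamedTuple):
    year: int
    month: int
    edition: int


def parse_edition(name: str) -> Edition:
    """Split an edition folder name into year, month and edition number."""
    match = EDITION_RE.match(name)
    if match is None:
        raise ValueError(f"not an edition folder name: {name!r}")
    return Edition(int(match["year"]), int(match["month"]), int(match["edition"]))


print(parse_edition("2020-01-ed1"))  # Edition(year=2020, month=1, edition=1)
```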

Disclaimer

This dataset contains machine-readable data extracted from PDF documents, from https://www.vukuzenzele.gov.za/, provided by the Government Communication and Information System (GCIS). While efforts were made to ensure the accuracy and completeness of this data, there may be errors or discrepancies between the original publications and this dataset. No warranties, guarantees or representations are given in relation to the information contained in the dataset. The members of the Data Science for Societal Impact Research Group bear no responsibility and/or liability for any such errors or discrepancies in this dataset. The Government Communication and Information System (GCIS) bears no responsibility and/or liability for any such errors or discrepancies in this dataset. It is recommended that users verify all information contained herein before making decisions based upon this information.

Authors

  • Vukosi Marivate - @vukosi
  • Andani Madodonga
  • Daniel Njini
  • Richard Lastrucci
  • Isheanesu Dzingirai
  • Jenalea Rajab

Citation

Paper

Preparing the Vuk'uzenzele and ZA-gov-multilingual South African multilingual corpora

@inproceedings{lastrucci-etal-2023-preparing,
    title = "Preparing the Vuk{'}uzenzele and {ZA}-gov-multilingual {S}outh {A}frican multilingual corpora",
    author = "Richard Lastrucci and Isheanesu Dzingirai and Jenalea Rajab and Andani Madodonga and Matimba Shingange and Daniel Njini and Vukosi Marivate",
    booktitle = "Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.rail-1.3",
    pages = "18--25"
}

Dataset

Vukosi Marivate, Andani Madodonga, Daniel Njini, Richard Lastrucci, Isheanesu Dzingirai, Jenalea Rajab. The Vuk'uzenzele South African Multilingual Corpus, 2023

@dataset{marivate_vukosi_2023_7598540,
    author = {Marivate, Vukosi and Njini, Daniel and Madodonga, Andani and Lastrucci, Richard and Dzingirai, Isheanesu and Rajab, Jenalea},
    title = {The Vuk'uzenzele South African Multilingual Corpus},
    month = feb,
    year = 2023,
    publisher = {Zenodo},
    doi = {10.5281/zenodo.7598539},
    url = {https://doi.org/10.5281/zenodo.7598539}
}

Licences

vukuzenzele-nlp's People

Contributors

lastrucci01, pkhoboko, andanimadodonga, vukosim, daniel-ind, peaceaz, derwin-ngomane, idzingirai

vukuzenzele-nlp's Issues

Extract 2020 Stories

Description

Extract the 2022 stories from the PDFs and make them ready.

Tasks

Include specific tasks in the order they need to be done in. Include links to specific lines of code where the task should happen at.

  • Run extraction pipeline to interim
  • Clean up interim to processed
  • Commit, do a pull request

Update 2022 PDFs

Description

Fetch latest 2022 PDFs up to December 2022

Tasks

  • Go to vukuzenzele website
  • Download the missing PDFs for 2022 and add them to data/raw in the repo
  • Commit, create pull request

Update citation

Update the citation on the readme to reference the final paper instead of the preprint.

Download Vukuzenzele 2023 PDFs

Description

Download the Vukuzenzele PDF editions in all the SA languages from Jan to Dec 2023 and place them in /data/raw.

Tasks

  • Download the Vukuzenzele PDF editions in all the SA languages
  • Place the PDF files in the /data/raw folder following the same folder structure as past editions, (2023-01-ed1, etc)

Sentence Alignment

Description

Perform sentence alignment on the data/processed folder using LASER sentence embedders

Tasks

  • Clean/preprocess data and write to .txt

  • Perform sentence embeddings to produce sentence vectors

  • Align sentences using above vectors

  • Automation of this process
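
The embed-then-align steps above can be sketched as follows. This is an illustrative greedy matcher over precomputed sentence vectors, not the alignment function used in this repository. The toy 2-d vectors stand in for real LASER embeddings, and the 0.65 threshold is taken from the README table.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_vecs, tgt_vecs, threshold=0.65):
    """Greedily pair each source sentence with its most similar unused target."""
    used = set()
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_score = None, threshold
        for j, v in enumerate(tgt_vecs):
            score = cosine(u, v)
            if j not in used and score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j, round(best_score, 3)))
    return pairs


# Toy vectors standing in for sentence embeddings of two languages.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.0, 2.0], [3.0, 0.1]]
print(align(src, tgt))  # [(0, 1, 0.999), (1, 0, 1.0)]
```

A greedy pass is order-dependent: an early source sentence can claim a target that a later one matches better. It is shown here only because it is short.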

Update Sentence Alignment

Following the rework of the sentence alignment function in the gov-za cabinet statements repo, I think Vukuzenzele could benefit from the updated function as well.

  • Update sentence alignment function
  • Rerun on data
  • Check Action still works

Lang Code Mismatch: nso vs sot vs sep

The processed annotations feature txts for both sep & nso, but none for sot

This causes conflicts later in the alignment process when trying to decide which language goes with which language model.

  • Investigate where error originates
  • Is Sesotho under nso? or Sepedi under nso?
  • Figure out & Implement fix

Make comprehensive Annotation Instructions

In need of a doc to explain thoroughly:

  • How to extract stories from the pdf?
  • What is annotating Vukuzenzele stories?
  • What is the format required for the final annotated product?

Manually Extract 1 story that is in all languages

  • Check the PDFs and see which story has been extracted in all languages
  • Copy and paste the story into 11 .txt files (one for each language)
  • Develop a naming convention
  • Note: Must be able to align sentences in future.
  • Push example into repository
  • Document any challenges.

Sentence Align Action Failure

Description

The sentence align action failed on pip install; it looks like a version mismatch on tensorboard.

Screenshots

image

Tasks

  • Investigate the issue
  • Update task
  • Implement Fix

Incorrect formatting of the final versions of files

Description

We have a template of how we want the processed files to look. @PKhoboko you have submitted incorrect formats. See the screenshots below.

Screenshots

Incorrect
image

  • Not cleaned
  • Irrelevant text not removed
  • Does not use our structure
  • Please see the initial email we sent you with instructions.

Correct
image

Files

Your file

Proper formatting

Tasks

  • Fix all submitted files by 12 May 2023
  • Use git to pull and push to GitHub (create a branch for the fix, then open a merge request)
  • Make sure the files are named correctly.

Make Data statement & README

Description

Data statement & README for the repo

Tasks

  • Make a Data statement in the root folder
  • Make a readme in the root folder

Convert /data/processed txts into a JSON doc

Description

It would be beneficial to have the text data as a json object.
The processed vukuzenzele has the following structure.
title\n\nauthor\n\ntext
The file name has the month & edition number.

Tasks

  • Convert the data/processed/ to a JSON file.
[
   {
      "title": "",
      "month": "",
      "edition_no": "",
      "author": "",
      "text": ""
   },
   {},
   {}
]
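
One way to approach this conversion is sketched below. It assumes each processed file follows the title, blank line, author, blank line, text layout described above. It also assumes the month and edition number can be read from a `<year>-<month>-ed<n>` fragment in the file name, which is a guess based on the edition folder names. The function names are illustrative only.

```python
import json
import re
from pathlib import Path


def story_to_record(path: Path) -> dict:
    """Turn one processed story file into a JSON-ready dict."""
    # Assumed layout: title, blank line, author, blank line, text.
    # A file with fewer than three parts raises ValueError here.
    title, author, text = path.read_text(encoding="utf-8").split("\n\n", 2)
    # Assumed file name fragment "<year>-<month>-ed<n>"; adjust to the real names.
    match = re.search(r"\d{4}-(\d{2})-ed(\d+)", path.name)
    month, edition_no = match.groups() if match else ("", "")
    return {
        "title": title.strip(),
        "month": month,
        "edition_no": edition_no,
        "author": author.strip(),
        "text": text.strip(),
    }


def convert(processed_dir: str, out_file: str) -> None:
    """Write every .txt story under processed_dir to a single JSON array."""
    records = [story_to_record(p) for p in sorted(Path(processed_dir).rglob("*.txt"))]
    Path(out_file).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

Calling `convert("data/processed", "vukuzenzele.json")` would then produce one array with an object per story. The output file name is a placeholder. `ensure_ascii=False` keeps characters outside ASCII readable in the output.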
