
batermj / malaya-dataset

This project forked from huseinzol05/malay-dataset


Text corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html

License: Apache License 2.0

Jupyter Notebook 99.83% Python 0.17%

malaya-dataset's Introduction


MIT License


Malaya-Dataset: we gather Bahasa Malaysia corpora! This repository stores the corpora for Malaya. We will keep updating this repository over time.

How do we gather these corpora?

  1. For news, articles and subtitles, we use crawlers. The code is available in Malaya/crawler.
  2. For Bahasa translations, we mostly use Google Translate. The code is available in Malaya/translator.
  3. For social media, I capture most of the live data from Twitter, Facebook and Instagram using crawlers, then search it with Elasticsearch queries.


Corpus

200K English-Malay

Total size: 6.9 MB

90k synonyms

Total size: 4.7 MB

English-Malay translation

Total size: 91.2 MB

Articles

Total size: 3.1 MB

  1. Filem
  2. Kerajaan
  3. Pembelajaran
  4. Pendidikan
  5. Sekolah

Audience Nationality

Total size: 246 KB

  1. constituency
  2. national

Dependency

Total size: 9.5 MB

Dictionary, 24550 unique words

Total size: 428 KB

Emotion

Total size: 8.5 MB

  1. Anger
  2. Fear
  3. Joy
  4. Love
  5. Sadness
  6. Surprise

Entities, JSON

Total size: 1.1 MB

  1. OTHER - Other
  2. law - law, regulation, related law documents, documents, etc
  3. location - location, place
  4. organization - organization, company, government, facilities, etc
  5. person - person, group of people, beliefs, etc
  6. quantity - numbers, quantity
  7. time - date, day, time, etc
  8. event - unique events that happened, etc

Fake News

Total size: 68.2 MB

  1. Negative
  2. Positive

Gender

Total size: 2.2 MB

  1. Unknown
  2. Male
  3. Female
  4. Brand

Insincere question

Total size: 60.4 MB

  1. Negative
  2. Positive

Irony

Total size: 465 KB

  1. Positive
  2. Negative

Karangan sekolah

Total size: 221 KB

Language-detection, Wikipedia

Total size: 26.2 MB

(array(['OTHER', 'ara', 'ber', 'bul', 'ces', 'cmn', 'dan', 'deu', 'ell',
        'eng', 'epo', 'fin', 'fra', 'heb', 'hun', 'ind', 'ita', 'jpn',
        'kor', 'lat', 'lit', 'mar', 'mkd', 'nld', 'pol', 'por', 'rus',
        'spa', 'srp', 'swe', 'toki', 'tur', 'ukr', 'zlm'], dtype='<U5'),
 array([37910, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000,
        50000, 10000, 10000, 10000, 10000, 10000, 57327, 10000, 10000,
         3687, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000,
        10000, 10000, 10000, 10000, 10000, 10000, 53692]))
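
The pair of arrays above looks like a per-language sentence count, in the layout that NumPy's `unique(..., return_counts=True)` returns: sorted labels first, matching counts second. A minimal standard-library sketch of the same tally follows; the `labels` list is a made-up toy sample, not data from this corpus.

```python
from collections import Counter

# Hypothetical label column: one language code per sentence.
labels = ["zlm", "eng", "ind", "zlm", "OTHER", "eng", "zlm"]

counter = Counter(labels)

# Sort the class names so the layout mirrors the two arrays above
# (uppercase "OTHER" sorts before the lowercase codes).
classes = sorted(counter)
counts = [counter[name] for name in classes]

for name, count in zip(classes, counts):
    print(name, count)
```

A tally like this makes class imbalance easy to spot, for example the much smaller `kor` class in the counts above.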

News, crawled

Total size: 28.9 MB

Complete list (51 news topics)
  1. Cuti sekolah
  2. isu 1MDB
  3. isu agama
  4. isu agong
  5. isu agrikultur
  6. isu air
  7. isu anwar ibrahim
  8. isu artis
  9. isu astro
  10. isu bahasa melayu
  11. isu barisan nasional
  12. isu cikgu
  13. isu cukai
  14. isu cyberjaya
  15. isu dunia
  16. isu ekonomi
  17. isu gst
  18. isu harakah
  19. isu harga
  20. isu icerd
  21. isu imigren
  22. isu kapitalis
  23. isu kerajaan
  24. isu kesihatan
  25. isu kuala lumpur
  26. isu lgbt
  27. isu mahathir
  28. isu makanan
  29. isu malaysia airlines
  30. isu malaysia
  31. isu minyak
  32. isu najib razak
  33. isu pelajar
  34. isu pelakon
  35. isu pembangkang
  36. isu perkauman
  37. isu permainan
  38. isu pertanian
  39. isu politik
  40. isu rosmah
  41. isu sabah
  42. isu sarawak
  43. isu sosial media
  44. isu sultan melayu
  45. isu teknologi
  46. isu TM
  47. isu ubat
  48. isu universiti
  49. isu wan azizah
  50. peluang pekerjaan
  51. perkahwinan

Normalize

Total size: 2.6 MB

Sentiment News

Total size: 496 KB

  1. Positive
  2. Negative

Sentiment Twitter

Total size: 50.6 MB

  1. Positive
  2. Negative

Sentiment Multidomain

Total size: 159 KB

  1. Amazon review, Positive and Negative
  2. IMDB review, Positive and Negative
  3. Yelp review, Positive and Negative

Part-of-Speech

Total size: 3.1 MB

  1. ADJ - Adjective, kata sifat
  2. ADP - Adposition
  3. ADV - Adverb, kata keterangan
  4. ADX - Auxiliary verb, kata kerja tambahan
  5. CCONJ - Coordinating conjunction, kata hubung
  6. DET - Determiner, kata penentu
  7. NOUN - Noun, kata nama
  8. NUM - Number, nombor
  9. PART - Particle
  10. PRON - Pronoun, kata ganti
  11. PROPN - Proper noun, kata ganti nama khas
  12. SCONJ - Subordinating conjunction
  13. SYM - Symbol
  14. VERB - Verb, kata kerja
  15. X - Other

Polarity

Total size: 1.3 MB

  1. Positive
  2. Negative

Political landscape

Total size: 2 MB

  1. Kerajaan
  2. Pembangkang

Question-Answer

Total size: 2.5 MB

1 mary pergi ke taman. 2 mary pergi ke dapur. 3 husein kembali ke pejabat.
4 husein perjalanan ke lorong. 5 jeff kembali ke bilik tidur. 6 fred berpindah ke lorong.
7 husein berpindah ke bilik mandi. 8 jeff kembali ke taman. 9 jeff kembali ke dapur.
10 fred kembali ke taman. 11 mary mendapat bola sepak di sana. 12 mary menyerahkan bola sepak kepada jeff.
13 apa yang mary berikan kepada jeff? <> bola sepak <> 12.
14 husein kembali ke lorong. 15 jeff kembali ke bilik tidur. 16 apa yang mary berikan kepada jeff? <> bola sepak <> 12.
17 fred berpindah ke bilik mandi. 18 mary mengambil susu di sana. 19 apa yang mary berikan kepada jeff? <> bola sepak <> 12.
20 fred pergi ke dapur. 21 mary menyerahkan susu itu kepada fred. 22 siapa yang memberikan susu itu kepada fred? <> mary <> 21.
23 fred berpindah ke lorong. 24 jeff pergi ke pejabat. 25 siapa yang mary memberikan susu itu? <> fred <> 21
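
The sample above appears to follow a bAbI-style layout: numbered statements, with questions written as `question <> answer <> supporting line number`. The sketch below parses that layout under this assumption. The function name `parse_story` and the output dictionary keys are hypothetical, and the actual files in the repository may be organised differently.

```python
import re

SAMPLE = """\
11 mary mendapat bola sepak di sana. 12 mary menyerahkan bola sepak kepada jeff.
13 apa yang mary berikan kepada jeff? <> bola sepak <> 12.
14 husein kembali ke lorong."""


def parse_story(text):
    """Split numbered items into statements and questions (assumed layout)."""
    # Split before every "<number> <word>" token. The supporting line number
    # after "<>" is never followed by a word, so it stays inside its question.
    items = re.split(r"\s+(?=\d+\s+[^\W\d])", text.strip())
    parsed = []
    for item in items:
        number, body = item.split(None, 1)
        if "<>" in body:
            question, answer, support = (part.strip() for part in body.split("<>"))
            parsed.append({
                "id": int(number),
                "type": "question",
                "text": question,
                "answer": answer,
                "support": int(support.rstrip(".")),
            })
        else:
            parsed.append({"id": int(number), "type": "statement", "text": body})
    return parsed


for row in parse_story(SAMPLE):
    print(row)
```

Each question keeps the number of the statement that justifies its answer, which is useful when checking a model's reasoning and not only its final answer.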

Sarcastic news-headline

Total size: 1.78 MB

  1. Positive
  2. Negative

Stemmer

Total size: 6.5 MB

  1. News stemming
  2. Wikipedia stemming

Subjectivity

Total size: 1.4 MB

  1. Positive
  2. Negative

Toxicity

Total size: 70 MB

Toxicity is multilabel, so a sigmoid-based output is preferred.

  1. toxic
  2. severe toxic
  3. obscene
  4. threat
  5. insult
  6. identity hate
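
Because one text can carry several of these labels at once, each label gets its own independent sigmoid score, unlike a softmax that forces a single winner. A minimal sketch, assuming a model that emits one raw score (logit) per label; the logits and the 0.5 threshold are illustrative values, not outputs of any Malaya model.

```python
import math

LABELS = ["toxic", "severe toxic", "obscene", "threat", "insult", "identity hate"]


def sigmoid(x):
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Return every label whose sigmoid probability reaches the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [label for label, p in zip(LABELS, probs) if p >= threshold]


# Hypothetical logits for a single comment.
logits = [2.1, -1.3, 0.4, -3.0, 1.2, -0.7]
print(predict_labels(logits))
```

The probabilities are not required to sum to one, so a comment can be flagged with any number of labels, including none.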

Subtitle

Total size: 1.5 MB

Suggestion

  1. Always apply text augmentation, such as swapping words using synonyms or a thesaurus. We gathered some synonyms in case you want to use them: 90k synonyms.
  2. Malaya also provides an interface for text augmentation using word2vec, Malaya-text-augmentation.
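
Synonym swapping can be sketched in a few lines. The example below is only an illustration: the `SYNONYMS` dictionary is a tiny hand-written stand-in for the 90k synonyms file (whose actual format is not shown here), and `augment` is a hypothetical helper, not part of Malaya.

```python
import random

# Hypothetical stand-in for the synonym corpus: word -> list of alternatives.
SYNONYMS = {
    "besar": ["agung", "raya"],
    "cantik": ["indah", "jelita"],
}


def augment(sentence, synonyms, prob=0.5, seed=None):
    """Replace each word that has synonyms with probability `prob`."""
    rng = random.Random(seed)  # a seed makes the augmentation reproducible
    out = []
    for word in sentence.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


print(augment("rumah itu besar dan cantik", SYNONYMS, prob=1.0, seed=0))
```

Running the function several times with different seeds yields several variants of one training sentence while keeping its label unchanged.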

Citation

  1. Please cite the repository if you use these corpora.
  2. Please at least email us before distributing this data. Remember that we are giving away all this hard work for free.
  3. What you see is just the data, but nobody sees how much it cost us to make it public.

Donation

  1. Husein really needs money to survive; he is still human. 7053174643, CIMB Click, Husein Zolkepli

Contributors

huseinzol05
