Unofficial Python Program for SymSpell4Burmese and Ngram Spelling Corrector

License: MIT License

Spelling Correction for Burmese Language using Statistical Methods

Author: Thura Aung

README in Myanmar language: https://github.com/ThuraAung1601/mySpellCorrect/blob/main/myanmarREADME.md

This spelling correction mini-project is one of my pet projects. It uses statistical approaches (n-gram and SymSpell) rather than a rule-based one.

This is not the first use of SymSpell for the Myanmar (မြန်မာ) language: there is a systematically researched conference paper entitled SymSpell4Burmese [3]. This project is therefore an unofficial implementation of SymSpell4Burmese.
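For readers unfamiliar with SymSpell, the core idea is to index every dictionary word under the strings obtained by deleting characters from it, and to look up a misspelling through its own delete variants. The following is a minimal sketch of that idea only; it is not the code in this repository or the SymSpell4Burmese implementation. The function names (`deletes`, `build_index`, `suggest`) and the toy English dictionary are assumptions made for illustration, and the final edit-distance check that a full SymSpell implementation performs on each candidate is omitted.

```python
from collections import defaultdict


def deletes(word, max_distance=1):
    """Return the word plus every string reachable by deleting up to max_distance characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        next_frontier = set()
        for w in frontier:
            for i in range(len(w)):
                next_frontier.add(w[:i] + w[i + 1:])
        results |= next_frontier
        frontier = next_frontier
    return results


def build_index(frequencies, max_distance=1):
    """Map each delete variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for word in frequencies:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index


def suggest(term, index, frequencies, max_distance=1):
    """Return dictionary words sharing a delete variant with term, most frequent first."""
    candidates = set()
    for variant in deletes(term, max_distance):
        candidates |= index.get(variant, set())
    return sorted(candidates, key=lambda w: -frequencies[w])


# Hypothetical toy dictionary (word -> frequency), for illustration only.
frequencies = {"hello": 10, "help": 5, "world": 7}
index = build_index(frequencies)
print(suggest("helo", index, frequencies))
```

Python strings are indexed by code point, so on Burmese text a single deletion removes one code point rather than a whole syllable.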

It can correct alphabet errors such as ( ကြက်ဉ -> ကြက်ဥ ). However, some of the most common spelling errors are not corrected yet, possibly because of the dictionary I used.

Dictionaries were built from myPOS ver3.0 [4] using the myWord [5] tool. myPOS was collected from various domains and may contain some noise (common spelling errors), because the main purpose of the corpus was to assign POS tags correctly to the language as it is used on a daily basis.
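As a rough illustration of what a frequency dictionary contains, the sketch below counts unigrams and bigrams from word-segmented sentences. It is an assumption-laden example, not how myWord builds its dictionaries: the function name `count_ngrams` is hypothetical, and the on-disk format of the dictionary files under ./data is not shown here.

```python
from collections import Counter


def count_ngrams(lines):
    """Count unigrams and bigrams in whitespace-segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        words = line.split()
        unigrams.update(words)
        # Pair each word with its successor to form bigrams.
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


# Hypothetical toy corpus; real input would be word-segmented Myanmar text.
unigrams, bigrams = count_ngrams(["a b a", "b a"])
print(unigrams.most_common())
print(bigrams.most_common())
```

Because the counts come straight from the corpus, any misspelling that occurs in the corpus also ends up in the dictionary, which is the noise problem described above.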

Dictionaries with correct spellings, especially for alphabet and phonetic errors, would help a lot.

Some common spellings also deviate from the standard intentionally, e.g. သယ်ရင်း -> သူငယ်ချင်း

There are also typos, e.g. မှ ူးမတ်​​ငေနာပတိ -> မှူးမတ်​သေနာပတိ

For these, a deep learning approach is needed [6].

If you want to use mySpell directly as I programmed it, Myanmar sentences must be word-segmented first.

Usage

Install requirements

pip install -r requirements.txt

To correct a whole corpus:

python ./mySpellCorrect.py -i test.txt

Check the parameters using the help command:

python mySpellCorrect.py -h
usage: mySpellCorrect.py [-h] [-cp CORPUS] [-ut UNIGRAM_DICT] [-bt BIGRAM_DICT] [-i INPUT] [-o OUTPUT] [-m MODE]

Statistical Spelling Correction for Burmese language

optional arguments:
  -h, --help            show this help message and exit
  -cp CORPUS, --corpus CORPUS
                        corpus file for n-gram
  -ut UNIGRAM_DICT, --unigram_dict UNIGRAM_DICT
                        unigram frequency dictionary file
  -bt BIGRAM_DICT, --bigram_dict BIGRAM_DICT
                        bigram frequency dictionary file
  -i INPUT, --input INPUT
                        input file
  -o OUTPUT, --output OUTPUT
                        output file
  -m MODE, --mode MODE  s for symspell and n for n-gram spelling correction
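
The n mode relies on n-gram statistics. One simple way such statistics can be used, shown here as a sketch under assumptions and not as the logic in mySpellCorrect.py, is to prefer the candidate that most often follows the previous word and to fall back to plain word frequency. The function name `pick_candidate` and the toy counts are hypothetical.

```python
def pick_candidate(previous_word, candidates, unigrams, bigrams):
    """Pick the candidate with the highest bigram count after previous_word.

    Ties (including candidates never seen after previous_word) are broken
    by unigram count.
    """
    return max(
        candidates,
        key=lambda w: (bigrams.get((previous_word, w), 0), unigrams.get(w, 0)),
    )


# Hypothetical toy counts, for illustration only.
unigrams = {"cat": 5, "cot": 8}
bigrams = {("the", "cat"): 3}
print(pick_candidate("the", ["cat", "cot"], unigrams, bigrams))
print(pick_candidate("a", ["cat", "cot"], unigrams, bigrams))
```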

You can also import it into other programs as below:

from mySpellCorrect import ngramSpell
sentence = "မ ဟုတ် ဘူးး"
ngramSpell(sentence)

>> ['မ ဟုတ် ဘူး']

from mySpellCorrect import mySymSpell
sentence = "မ ဟုတ် ဘူးး"
mySymSpell(sentence)

>> ['မ ဟုတ် ဘူး']
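
When importing the functions, a small wrapper can apply them to many sentences at once. The sketch below assumes, based only on the examples above, that `ngramSpell` and `mySymSpell` take one word-segmented sentence and return a list of strings. The helper `correct_lines` is hypothetical and not part of the repository, and a stand-in corrector is used so the example runs without the package.

```python
def correct_lines(lines, corrector):
    """Apply a sentence-level corrector to each non-empty line."""
    corrected = []
    for line in lines:
        sentence = line.strip()
        if not sentence:
            continue  # skip blank lines
        result = corrector(sentence)
        # The examples above show a list being returned; join it back into text.
        corrected.append(" ".join(result) if isinstance(result, list) else str(result))
    return corrected


def fake_corrector(sentence):
    """Stand-in for ngramSpell or mySymSpell (assumption: returns a list)."""
    return [sentence.replace("xx", "x")]


print(correct_lines(["a bxx c\n", "\n", "d e\n"], fake_corrector))
```

With the package installed, passing `ngramSpell` or `mySymSpell` in place of `fake_corrector` should work the same way, provided they behave as in the examples above.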

A GUI demonstration is also available:

streamlit run gui.py

I used Streamlit for the GUI demonstration.

Demo

Citation

Note that none of the data I have used is my property; please check the licenses.

If you want to use any data or dictionary under the ./data folder in your research, we'd appreciate it if you cite the following three references:

  • Khin War War Htike, Ye Kyaw Thu, Zuping Zhang, Win Pa Pa, Yoshinori Sagisaka and Naoto Iwahashi, "Comparison of Six POS Tagging Methods on 10K Sentences Myanmar Language (Burmese) POS Tagged Corpus", at 18th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2017), April 17~23, 2017, Budapest, Hungary.*
  • Zar Zar Hlaing, Ye Kyaw Thu, Myat Myo Nwe Wai, Thepchai Supnithi, Ponrudee Netisopakul, "Myanmar POS resource extension effects on automatic tagging methods", In Proceedings of the 15th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2020), Nov 18 to Nov 20, 2020, Bangkok, Thailand, pp. 189-194.*
  • myWord: Syllable, Word and Phrase Segmenter for Burmese, Ye Kyaw Thu, Sept 2021, GitHub Link: https://github.com/ye-kyaw-thu/myWord**

References:

* I used myPOS ver3 (without POS tags) for building the dictionaries.

** I built the dictionaries using the myWord tool.
