
ye-kyaw-thu / sylbreak


Syllable segmentation tool for Myanmar language (Burmese) by Ye.

License: Apache License 2.0

myanmar burmese syllable regular-expressions word-segmentation

sylbreak's Introduction

sylbreak

Myanmar language (Burmese) README

Syllable segmentation is an important preprocessing step for many natural language processing (NLP) tasks such as romanization, transliteration and grapheme-to-phoneme (g2p) conversion.

"sylbreak" is a syllable segmentation tool for Myanmar language (Burmese) text encoded in Unicode (e.g. typed with the Myanmar3 or Padauk fonts). I used only one short regular expression (RE), as follows:

$line =~ s/((?<!$ssSymbol)[$myConsonant](?![$aThat$ssSymbol])|[$enChar$otherChar])/$sep$1/g;

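The key condition is: a consonant that does not follow a subscript symbol AND is not followed by the a-That character or a subscript symbol. The separator is inserted before every such consonant, and before every character in $enChar or $otherChar.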

Here, variables are declared as follows:

my $myConsonant = "က-အ";
my $enChar = "a-zA-Z0-9";
my $otherChar = "ဣဤဥဦဧဩဪဿ၌၍၏၀-၉၊။!-\/:-\@\[-`{-~\\s";
my $ssSymbol = "\x{1039}";  # subscript (stacking) symbol, U+1039
my $aThat = "\x{103A}";     # a-That character, U+103A
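To make the rule concrete, here is a minimal JavaScript sketch of the same RE. It is an illustration, not the repository's own script: the helper name `sylbreak` and its `sep` parameter are assumptions, and the Myanmar characters are written as `\u` escapes so the code points are unambiguous. It relies on look-behind support, which Node.js provides.

```javascript
// Character classes mirroring the Perl variables, written as \u escapes.
const myConsonant = "\u1000-\u1021"; // က-အ
const enChar = "a-zA-Z0-9";
const otherChar =
  "\u1023\u1024\u1025\u1026\u1027\u1029\u102A\u103F" + // independent vowels, great sa
  "\u104C\u104D\u104F\u1040-\u1049\u104A\u104B" +      // symbols, digits, punctuation
  "!-/:-@\\[-`{-~\\s";                                  // ASCII punctuation, whitespace
const ssSymbol = "\u1039"; // subscript (stacking) symbol
const aThat = "\u103A";    // a-That character

const BREAK_PATTERN = new RegExp(
  "((?<!" + ssSymbol + ")[" + myConsonant + "](?![" + aThat + ssSymbol + "])" +
  "|[" + enChar + otherChar + "])", "g");

// Hypothetical helper: insert `sep` before every syllable-initial character.
function sylbreak(text, sep = "|") {
  return text.replace(BREAK_PATTERN, (ch) => sep + ch);
}

console.log(sylbreak("\u1019\u103C\u1014\u103A\u1019\u102C")); // မြန်မာ -> |မြန်|မာ
console.log(sylbreak("\u1017\u102F\u1012\u1039\u1013"));       // ဗုဒ္ဓ -> |ဗုဒ္ဓ
```

In the first word, န is followed by a-That, so it stays attached to the preceding syllable. In the second, ဒ is followed by the subscript symbol and ဓ comes right after it, so neither starts a new syllable.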

Fig. Visualization of sylbreak RE

The shell (sylbreak.sh), Perl (sylbreak.pl) and Python (sylbreak.py) scripts can be run as they are; no installation is needed.

Enjoy syllable breaking!

Ye@Lab

Demo/Explanation

In the paper titled "An Algorithm for Myanmar Syllable Segmentation based on the Official Standard Myanmar Unicode Text" presented at the ICCA-2023 conference, the authors make the following statement in Section VI, Performance Evaluation:

Furthermore, we compared the correctness of our algorithm with an existing algorithm, sylbreak3. As stated in Section II, the drawback of the sylbreak3 algorithm is that it cannot correctly segment syllables that contain consonants, ‘်’ and ‘့’. To evaluate this, we tested another set of 165 common syllables in 8 random Myanmar sentences shown Table IX. The results obtained should be seen in the Table X.

According to this experiment, it can be clearly seen that the sylbreak3 algorithm can correctly segment all Myanmar syllables including Parli and digits but it fails in detecting the boundary of syllables composed of ‘်’ and ‘့’.

The statement that "sylbreak fails in detecting the boundary of syllables that composed of ‘်’ and ‘့’" is wrong. When I read their paper carefully, I found that the test data was not typed in the correct Unicode order for the Myanmar language. Specifically, they typed Auk-ka-myit ("့") and then A-that ("်") instead of A-that ("်") and then Auk-ka-myit ("့"). I assume they got wrong segmentation results because of this. In fact, the sylbreak tool works well if the user provides Myanmar text typed in the correct order based on the Unicode standard.

Here is a video in which I explain this in detail by comparing the example words from their paper. Although the explanation is in the Myanmar language, I hope everyone can follow it.

Video Link: https://vimeo.com/864665740?share=copy
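
The ordering effect described above can be reproduced with a short JavaScript sketch. It keeps only the consonant rule of the sylbreak RE, and the `reorderAsatDot` helper is a hypothetical pre-processing step, not part of the tool; it assumes the input should follow the A-that then Auk-ka-myit order that this README treats as correct.

```javascript
// Reduced form of the sylbreak RE: only the consonant rule is kept here.
const BREAK = /(?<!\u1039)[\u1000-\u1021](?![\u103A\u1039])/g;
const sylbreak = (text, sep = "|") => text.replace(BREAK, (ch) => sep + ch);

// Hypothetical pre-processing step (an assumption, not part of sylbreak):
// swap "Auk-ka-myit, A-that" (U+1037 U+103A) into "A-that, Auk-ka-myit".
const reorderAsatDot = (text) => text.replace(/\u1037\u103A/g, "\u103A\u1037");

const asatFirst = "\u1019\u103C\u1004\u103A\u1037"; // မြင့် typed as င + ် + ့
const dotFirst = "\u1019\u103C\u1004\u1037\u103A";  // same word typed as င + ့ + ်

console.log(sylbreak(asatFirst));                // kept as one syllable
console.log(sylbreak(dotFirst));                 // wrongly split before င
console.log(sylbreak(reorderAsatDot(dotFirst))); // one syllable again
```

With Auk-ka-myit typed first, the consonant င is no longer directly followed by A-that, so the negative look-ahead does not stop the break.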

Acknowledgement

Thanks to Swan Htet Aung, who reported my typo in $otherChar (ဥဥ ---> ဥဦ).
The sylbreak RE example programs for Java and JavaScript were written by Chan Mrate Ko Ko.

Reference

  1. Dr. Thein Tun, Acoustic Phonetics and The Phonology of the Myanmar Language
  2. Romanization: https://en.wikipedia.org/wiki/Romanization
  3. Myanmar Unicode: http://unicode.org/charts/PDF/U1000.pdf
  4. Syllable segmentation algorithm of Myanmar text: http://gii2.nagaokaut.ac.jp/gii/media/share/20080901-ZMM%20Presentation.pdf
  5. Zin Maung Maung and Yoshiki Mikami, "A rule-based syllable segmentation of Myanmar Text", in Proceedings of the IJCNLP-08 workshop of NLP for Less Privileged Languages, January, 2008, Hyderabad, India, pp. 51-58.
  6. Tin Htay Hlaing, "Manually constructed context-free grammar for Myanmar syllable structure", in Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL '12), Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 32-37.
  7. Ye Kyaw Thu, Andrew Finch, Yoshinori Sagisaka and Eiichiro Sumita, "A Study of Myanmar Word Segmentation Schemes for Statistical Machine Translation", in Proceedings of the 11th International Conference on Computer Applications (ICCA 2013), February 26~27, 2013, Yangon, Myanmar, pp. 167-179.
  8. Ye Kyaw Thu, Andrew Finch, Win Pa Pa, and Eiichiro Sumita, "A Large-scale Study of Statistical Machine Translation Methods for Myanmar Language", in Proceedings of SNLP2016, February 10-12, 2016, Phranakhon Si Ayutthaya, Thailand.
  9. Regular Expression: https://en.wikipedia.org/wiki/Regular_expression
  10. Debuggex Beta: https://www.debuggex.com/
  11. Run UTN11 normalization on Myanmar text? harfbuzz/harfbuzz#494

sylbreak's People

Contributors

chanmratekoko, saithiha2100, sengkyaut, swanhtet1992, ye-kyaw-thu


sylbreak's Issues

Myanmar Language Syllable Break with PHP

Hello,

For SEO purposes, I need to run this function in a PHP project. The repository covers many languages, but unfortunately not PHP, so I did some research and wrote the PHP snippet below. I hope this issue and the attached snippet are useful to the repository maintainer.

<?php
$string = 'သီဟိုဠ်မှ ဉာဏ်ကြီးရှင်သည် အာယုဝဍ္ဎနဆေးညွှန်းစာကို ဇလွန်ဈေးဘေး ဗာဒံပင်ထက် အဓိဋ္ဌာန်လျက် ဂဃနဏဖတ်ခဲ့သည်။';
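// Break before a consonant, independent vowel, great sa or punctuation mark, a run of digits, or a run of non-Myanmar characters,
// unless it follows U+1039 or is followed by (an optional medial U+103E/U+103B and then) U+1039, U+103A or U+1037.
$pattern = "/(?:(?<!\\x{1039})([\\x{1000}-\\x{102A}\\x{103F}\\x{104A}-\\x{104F}]|[\\x{1040}-\\x{1049}]+|[^\\x{1000}-\\x{104F}]+)(?![\\x{103E}\\x{103B}]?[\\x{1039}\\x{103A}\\x{1037}]))/uim";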
$replacement = '|$1';
echo preg_replace($pattern, $replacement, $string);
//Output |သီ|ဟိုဠ်|မှ| |ဉာဏ်|ကြီး|ရှင်|သည်| |အာ|ယု|ဝဍ္ဎ|န|ဆေး|ညွှန်း|စာ|ကို| |ဇ|လွန်|ဈေး|ဘေး| |ဗာ|ဒံ|ပင်|ထက်| |အ|ဓိဋ္ဌာန်|လျက်| |ဂ|ဃ|န|ဏ|ဖတ်|ခဲ့|သည်|။
?>

The regular expression is not supported in Safari on Mac, or in Safari and Chrome on iOS and iPadOS

Issue
Sylbreak is simple, easy to use, and really awesome. However, while using it in my project I found that it does not work in Safari on Mac, or in Safari and Chrome on iOS and iPadOS. After some research, I found that look-behind assertions ((?<= ) and (?<! )) are not supported in these browsers.

My fix
As I am not an expert in regular expressions, I tried my best to find a fix. I am not sure it is the right way, but it works on my devices. I hope this bug report and suggested solution are helpful.

//Original
const BREAK_PATTERN = new RegExp("((?<!" + ssSymbol + ")[" + myConsonant + "](?![" + aThat + ssSymbol + "])" + "|[" + enChar + otherChar + "])", "mg");
//My Fix
const BREAK_PATTERN = new RegExp("((?!" + ssSymbol + ")[" + myConsonant + "](?![" + aThat + ssSymbol + "])" + "|[" + enChar + otherChar + "])", "mg");


Reference https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp#Browser_compatibility
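
One caveat about the suggested fix: a `(?!ssSymbol)` look-ahead placed directly before the consonant class can never fail, because the subscript symbol is not itself a consonant. The "not after a subscript symbol" condition is therefore dropped, and stacked consonants get split. The following is a hedged sketch of a look-behind-free alternative: it optionally consumes a preceding subscript symbol and decides in a replacer callback. The names are hypothetical, and only the consonant rule is shown (the `enChar`/`otherChar` alternative is omitted for brevity).

```javascript
const C = "\u1000-\u1021"; // consonants က-အ
const SS = "\u1039";       // subscript symbol
const ASAT = "\u103A";     // a-That

// The look-ahead-only variant from the issue, consonant rule only.
const lookaheadOnly = new RegExp(
  "(?!" + SS + ")[" + C + "](?![" + ASAT + SS + "])", "g");

// Look-behind-free sketch: group 1 captures a preceding subscript symbol if present.
const noLookbehind = new RegExp(
  "(" + SS + ")?([" + C + "](?![" + ASAT + SS + "]))", "g");

// Hypothetical helper: leave stacked consonants alone, break before the rest.
function sylbreakNoLookbehind(text, sep = "|") {
  return text.replace(noLookbehind, (match, ss, ch) => (ss ? match : sep + ch));
}

const word = "\u1017\u102F\u1012\u1039\u1013"; // ဗုဒ္ဓ
console.log(word.replace(lookaheadOnly, (ch) => "|" + ch)); // |ဗုဒ္|ဓ (stack is split)
console.log(sylbreakNoLookbehind(word));                    // |ဗုဒ္ဓ (stack is kept)
```

The callback form avoids look-behind entirely, so it should also run in engines that reject the `(?<!...)` syntax.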
