
Thai Natural Language Processing in Python.

License: Apache License 2.0


pythainlp's Introduction

PyThaiNLP: Thai Natural Language Processing in Python

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK but with a focus on the Thai language.

For details in Thai, see README_TH.MD.

News

You can now contact the PyThaiNLP team or ask them any questions.

Version	Description	Status
5.0.4	Stable	Change Log
`dev`	Release Candidate for 5.1	Change Log

Getting Started

PyThaiNLP requires Python 3.7+. Python 2.7 users can use PyThaiNLP 1.6. See 2.0 change log | Upgrading from 1.7 | Upgrading ThaiNER from 1.7
PyThaiNLP Get Started notebook | API document | Tutorials
Official website | PyPI | Facebook page
Who uses PyThaiNLP?
Model cards - for technical details, caveats, and ethical considerations of the models developed and used in PyThaiNLP

Capabilities

PyThaiNLP provides standard linguistic analysis for the Thai language and standard Thai locale utility functions. Some of these functions are also available via the command-line interface (run thainlp in your shell).

Partial list of features:

Convenient character and word classes, like Thai consonants (pythainlp.thai_consonants), vowels (pythainlp.thai_vowels), digits (pythainlp.thai_digits), and stop words (pythainlp.corpus.thai_stopwords), comparable to constants like string.ascii_letters, string.digits, and string.punctuation
Linguistic unit segmentation at different levels: sentence (sent_tokenize), word (word_tokenize), and subword (subword_tokenize)
Part-of-speech tagging (pos_tag)
Spelling suggestion and correction (spell and correct)
Phonetic algorithm and transliteration (soundex and transliterate)
Collation (sorted by dictionary order) (collate)
Number read out (num_to_thaiword and bahttext)
Datetime formatting (thai_strftime)
Fix for text typed with the wrong Thai-English keyboard layout (eng_to_thai, thai_to_eng)
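
As a rough illustration of how such character-class constants can be used, the sketch below defines a local THAI_DIGITS string and uses it the way one would use string.digits. THAI_DIGITS is a stand-in assumed to hold the same ten characters as pythainlp.thai_digits; it is not imported from the library, and the two helper functions are hypothetical names chosen for this example.

```python
import string

# Stand-in for pythainlp.thai_digits (assumed contents: Thai digits zero to nine).
THAI_DIGITS = "๐๑๒๓๔๕๖๗๘๙"

# Translation table mapping each Thai digit to the ASCII digit at the same index.
_THAI_TO_ARABIC = str.maketrans(THAI_DIGITS, string.digits)


def thai_digits_to_arabic(text: str) -> str:
    """Replace Thai digits in text with ASCII digits; other characters are kept."""
    return text.translate(_THAI_TO_ARABIC)


def contains_thai_digit(text: str) -> bool:
    """Return True if any character of text is a Thai digit."""
    return any(ch in THAI_DIGITS for ch in text)
```

The same pattern (membership tests and translation tables built from a constant string) should apply to the consonant and vowel constants as well.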

Installation

pip install --upgrade pythainlp

This will install the latest stable release of PyThaiNLP.

Install different releases:

Stable release: pip install --upgrade pythainlp
Pre-release (nearly ready): pip install --upgrade --pre pythainlp
Development (likely to break things): pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip

Installation Options

Some functionalities, like Thai WordNet, may require extra packages. To install those requirements, specify a set of [name] immediately after pythainlp:

pip install pythainlp[extra1,extra2,...]

Possible extras:

full (install everything)
attacut (to support attacut, a fast and accurate tokenizer)
benchmarks (for word tokenization benchmarking)
icu (for ICU, International Components for Unicode, support in transliteration and tokenization)
ipa (for IPA, International Phonetic Alphabet, support in transliteration)
ml (to support ULMFiT models for classification)
thai2fit (for Thai word vector)
thai2rom (for machine-learnt romanization)
wordnet (for Thai WordNet API)

For dependency details, look at the extras variable in setup.py.

Data Directory

Some additional data, like word lists and language models, may be automatically downloaded during runtime.
PyThaiNLP caches these data under the directory ~/pythainlp-data by default.
The data directory can be changed by specifying the environment variable PYTHAINLP_DATA_DIR.
See the data catalog (db.json) at https://github.com/PyThaiNLP/pythainlp-corpus
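
The lookup order described above can be sketched as follows. This is a minimal illustration, not the library's own code: resolve_data_dir is a hypothetical helper, and only the environment variable name and the default directory come from the text above.

```python
import os
from pathlib import Path


def resolve_data_dir(env=None) -> Path:
    """Return the data directory: PYTHAINLP_DATA_DIR if set, else ~/pythainlp-data.

    `env` is any mapping of environment variables; it defaults to os.environ
    and exists only so the function can be exercised without touching the
    real environment.
    """
    env = os.environ if env is None else env
    custom = env.get("PYTHAINLP_DATA_DIR")
    if custom:
        # An explicit setting wins over the default location.
        return Path(custom).expanduser()
    return Path.home() / "pythainlp-data"
```

Calling resolve_data_dir() with no arguments reads the real environment, so it is a quick way to check which directory a given shell session would point to under these assumptions.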

Command-Line Interface

Some of PyThaiNLP's functionality can be used from the command line with the thainlp command.

For example, to display a catalog of datasets:

thainlp data catalog

To show usage help:

thainlp help

Licenses

	License
PyThaiNLP source codes and notebooks	Apache Software License 2.0
Corpora, datasets, and documentations created by PyThaiNLP	Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0)
Language models created by PyThaiNLP	Creative Commons Attribution 4.0 International Public License (CC-by)
Other corpora and models that may be included in PyThaiNLP	See Corpus License

Contribute to PyThaiNLP

Please fork and create a pull request :)
For style guides and other information, including references to algorithms we use, please refer to our contributing page.

Who uses PyThaiNLP?

You can read INTHEWILD.md.

Citations

If you use PyThaiNLP in your project or publication, please cite the library as follows:

Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, & Pattarawat Chormai. (2016, Jun 27). PyThaiNLP: Thai Natural Language Processing in Python. Zenodo. http://doi.org/10.5281/zenodo.3519354

or by BibTeX entry:

@misc{pythainlp,
    title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
    author = "Phatthiyaphaibun, Wannaphong  and
      Chaovavanich, Korakot  and
      Polpanumas, Charin  and
      Suriyawongkul, Arthit  and
      Lowphansirikul, Lalita  and
      Chormai, Pattarawat",
    month = jun,
    year = "2016",
    doi = {10.5281/zenodo.3519354},
    publisher = {Zenodo},
    url = {http://doi.org/10.5281/zenodo.3519354}
}

Our NLP-OSS 2023 paper:

Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai Natural Language Processing in Python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore, Singapore. Empirical Methods in Natural Language Processing.

and its BibTeX entry:

@inproceedings{phatthiyaphaibun-etal-2023-pythainlp,
    title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
    author = "Phatthiyaphaibun, Wannaphong  and
      Chaovavanich, Korakot  and
      Polpanumas, Charin  and
      Suriyawongkul, Arthit  and
      Lowphansirikul, Lalita  and
      Chormai, Pattarawat  and
      Limkonchotiwat, Peerat  and
      Suntorntip, Thanathip  and
      Udomcharoenchaikit, Can",
    editor = "Tan, Liling  and
      Milajevs, Dmitrijs  and
      Chauhan, Geeticka  and
      Gwinnup, Jeremy  and
      Rippeth, Elijah",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Empirical Methods in Natural Language Processing",
    url = "https://aclanthology.org/2023.nlposs-1.4",
    pages = "25--36",
    abstract = "We present PyThaiNLP, a free and open-source natural language processing (NLP) library for Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai language. We first provide a brief historical context of tools for Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provided as well as datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.",
}

Logo	Description
	Since 2019, our contributors Korakot Chaovavanich and Lalita Lowphansirikul have been supported by VISTEC-depa Thailand Artificial Intelligence Research Institute.
	MacStadium supports us with a free Mac Mini M1 for running CI builds.

pythainlp's Issues

please install hunspell : hunspell dictionary th_TH

I installed th_TH.dic and th_TH.aff into ~/Library/Spelling and then tried running this code:

# -*- coding: utf-8 -*-
from pythainlp.spell import *
a=spell("สี่เหลียม")
print(a) # ['สี่เหลี่ยม', 'เสียเหลี่ยม', 'เหลี่ยม']

please install hunspell
None

Do I need to change or do anything else to get this code to run?
Thank you.

Add Windows 10 dependency installation to documentation

Installing pythainlp on Windows requires pyicu, which is very hard to install.
The easiest way is to use a wheel.

Go to http://www.lfd.uci.edu/~gohlke/pythonlibs/#pyicu and download the wheel that matches your Python. For example,
I use 64-bit Python 3.6.1 on Windows, so I use
PyICU-1.9.7-cp36-cp36m-win_amd64.whl
pip install PyICU-1.9.7-cp36-cp36m-win_amd64.whl
pip install pythainlp

Soundex bug

https://github.com/wannaphongcom/pythainlp/blob/c1f414236c7351f0a2fe7d62f0d7f33da19de606/pythainlp/soundex.py#L44

Really appreciate your great work providing soundex alg for Thai.
len(c) should be len(s), right?

Python 2.7 is back!

Python 2.7 is supported again in PyThaiNLP 1.5.

The test build at https://travis-ci.org/wannaphongcom/pythainlp/builds/252411596 passes.

ImportError

I'm on Mac OS X, using the Python 3.6 that ships with Anaconda. I installed everything as described in the instructions, then tried the sample code and got this error:

>>> from pythainlp.tokenize import word_tokenize
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/arsapol/anaconda3/lib/python3.6/site-packages/pythainlp/__init__.py", line 10, in <module>
    from pythainlp.romanization import *
  File "/Users/arsapol/anaconda3/lib/python3.6/site-packages/pythainlp/romanization/__init__.py", line 3, in <module>
    import icu
  File "/Users/arsapol/anaconda3/lib/python3.6/site-packages/icu/__init__.py", line 40, in <module>
    from .docs import *
  File "/Users/arsapol/anaconda3/lib/python3.6/site-packages/icu/docs.py", line 23, in <module>
    from _icu import *
ImportError: dlopen(/Users/arsapol/anaconda3/lib/python3.6/site-packages/_icu.cpython-36m-darwin.so, 2): Library not loaded: libicui18n.54.dylib
  Referenced from: /Users/arsapol/anaconda3/lib/python3.6/site-packages/_icu.cpython-36m-darwin.so
  Reason: image not found
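
The error above names the ICU library file the compiled extension expects, so the required ICU major version can be read straight from the message. The following is a small sketch of that idea; required_icu_major is a hypothetical helper written for this example, not part of PyICU or PyThaiNLP.

```python
import re
from typing import Optional

# Matches names such as "libicui18n.54.dylib" and captures the version part.
_ICU_DYLIB = re.compile(r"libicu\w+\.(\d+)\.dylib")


def required_icu_major(error_message: str) -> Optional[str]:
    """Return the ICU major version named in a 'Library not loaded' message.

    Returns None when the message does not mention an ICU dylib.
    """
    match = _ICU_DYLIB.search(error_message)
    return match.group(1) if match else None
```

Comparing that number with the ICU version actually installed is one way to confirm a version mismatch before reinstalling anything.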

Edit: I have fixed it by following the instructions at this link:

@BradyHammond @robooo , I have a solution for all of us; On 11 Dec 2016, successfully installed polyglot on my Mac computer with Anaconda. My specs:

macOS Sierra version 10.12.1
Anaconda, conda version 4.2.13
execute conda --version in terminal to see your version
Installed from here: https://repo.continuum.io/archive/Anaconda3-4.2.0-MacOSX-x86_64.sh
Python version 3.5.2 (see this Gist to get Polyglot working in Python 2.7)
I made a Github Gist that automatically installs Polyglot on a Mac Computer running Anaconda. To run my gist just cut and paste this:

wget https://gist.githubusercontent.com/linwoodc3/8704bbf6d1c6130dda02bbc28967a9e6/raw/91d8e579c0fa66399ab1959c9aa94ea09c3eb539/polyglotOnMacOSX.sh -O polyglotOnMacOSX.sh && bash polyglotOnMacOSX.sh
I will also list the exact steps, but first, here are some important notes:

Key Facts/Problem Areas

So far, this only works with Python 3.5
My GitHub Gist has a fix for Python 2.7, but it's more involved. Will try to create pyicu merge request
Must use easy_install pyicu and not pip install pyicu
installing pyicu with pip install pyicu always leads to Library not loaded... error
I believe this has something to do with easy_install using a Python egg for install; have to research difference between pip and easy_install
Must use conda install -c ccordoba12 icu=54.1
Found at https://anaconda.org/ccordoba12/icu
ensure that you have icu 54.1.1 and not icu 58.1 or icu 54.1.0. Just using conda install icu will not work!!
The biggest problem is that brew install icu4c uses version 58.1
Uses wrong version as you can see here: http://brewformulas.org/Icu4c. We need version 54.1 based on the output of each of our errors:
Library not loaded: libicui18n.54.dylib
Again, you can use my Github Gist to automatically test this on your machine. If you want to manually try the steps, they are

Exact Steps I used (tested several times)

First, started from clean version of Anaconda, HomeBrew, etc. This is optional but it worked for me:
brew uninstall --force icu4c
brew update
find $(brew --cache) -mindepth 1 -print -delete
Fresh install of anaconda :

rm -rf anaconda # removes existing Anaconda install
Command Line Installer for Python 3 OS X downloaded on 11 Dec 2016 from: https://repo.continuum.io/archive/Anaconda3-4.2.0-MacOSX-x86_64.sh

Now the steps after Anaconda is installed:

conda create -n icutest requests --no-deps -y
source activate icutest
conda install -c ccordoba12 icu=54.1
https://anaconda.org/ccordoba12/icu
conda install ipython jupyter notebook
ipython
!easy_install pyicu
exit
pip install polyglot
ipython
from polyglot.text import Text, Word
Or, you can use my gist to automatically test the install:

wget https://gist.githubusercontent.com/linwoodc3/8704bbf6d1c6130dda02bbc28967a9e6/raw/91d8e579c0fa66399ab1959c9aa94ea09c3eb539/polyglotOnMacOSX.sh -O polyglotOnMacOSX.sh && bash polyglotOnMacOSX.sh

TODO

Add spelling of words that contain vowels, following the Royal Institute of Thailand
Split out the word segmentation system
Upload to PyPI
Documentation
PyICU
Thai POS taggers
Move data files into corpus
Test system
Python 2

Thai syllable segmentation ?

How should Thai text be segmented into syllables?

list PyThaiNLP 1.2

Refactor the code and improve performance
Add nltk.Text()
Add sentiment
Support Python 2.7
Make the API resemble NLTK
Improve the documentation and warn about APIs that will be deprecated in the next version
Installation instructions for Windows

[nltk_data] Error loading omw: HTTP Error 403: Forbiden

The following error messages appear:
[nltk_data] Error loading omw: HTTP Error 403: Forbiden
[nltk_data] Error loading wordnet: HTTP Error 403: Forbiden

Tokenization still works, but only with 'mm'.

'newmm' does not work: there are strings left over from git version control in the code, and I had to delete them myself.
I'm not sure whether the resulting logic is correct, but I assume listcut is not called anywhere except in the main function.

To use 'deepcut' you must run pip install deepcut first; this should probably be added to the manual.
Tokenizing with deepcut takes noticeably longer than with 'mm'.

List PyThaiNLP 1.4

List pythainlp.keywords

Add an API named find_keyword for simple keyword extraction 86171ac
Code testing system

Why is ICU the default?

I feel that ICU is a very poor word segmentation tool. I'd like to know why it was chosen as the default.

Error ascii cannot decode

Error example.

Add TNC Freq to pythainlp.corpus

data from https://github.com/korakot/thainlp/
pythainlp.corpus.tnc.get_word_frequency_all

pythainlp.tokenize.sent_tokenize ?

How should we split Thai text into sentences?

Sentence splitting also has to rely on grammar.

Tone detector for syllables - Determine Thai tone rule for syllable?

Hello! How could I determine tone rule for syllable? Thank you

Get some errors

I installed pyicu and then pythainlp.
After that I tried running test.py and got the following error:

Traceback (most recent call last):
  File "test.py", line 5, in <module>
    from pythainlp.segment import segment
  File "/home/benkit/TH/pythainlp/pythainlp/__init__.py", line 12, in <module>
    from . import postaggers
  File "/home/benkit/TH/pythainlp/pythainlp/postaggers/__init__.py", line 2, in <module>
    from .text import tag
  File "/home/benkit/TH/pythainlp/pythainlp/postaggers/text.py", line 16, in <module>
    data1 =data()
  File "/home/benkit/TH/pythainlp/pythainlp/postaggers/text.py", line 14, in data
    model = json.load(handle)
  File "/usr/lib/python3.4/json/__init__.py", line 265, in load
    return loads(fp.read(),
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 6: invalid start byte

Do you have any suggestions?
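
One possible cause of an error like this, which the report does not confirm, is a data file saved in a legacy Thai encoding rather than UTF-8. In the cp874 and TIS-620 code pages, byte 0xa1 is the letter ก, while in UTF-8 it cannot start a character. The sketch below shows a way to test that hypothesis on raw bytes; decode_with_fallback and the sample payload are made up for this example.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp874")):
    """Return (text, encoding) for the first candidate encoding that decodes raw.

    The candidate list is an assumption for Thai text; adjust it as needed.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up JSON payload stored as cp874: the Thai letter becomes the byte 0xa1.
sample = '{"a":"ก"}'.encode("cp874")
text, used_encoding = decode_with_fallback(sample)
```

If the fallback encoding is the one that succeeds, re-saving the file as UTF-8 (or opening it with an explicit encoding argument) would be the natural next step to try.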

Unicode Error

Hi,
thanks for fixing the import error!

tried to run your sample code now,
but still errors.

a) pythainlp/pythainlp/test/__init__.py", line 36 -- missing closing parenthesis .. easy to fix

but now:

[gerhard@localhost pythainlp]$ python test_gerhard.py
/home/gerhard/pythainlp/pythainlp/segment/dict.py:23: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if string == "":
Traceback (most recent call last):
  File "test_gerhard.py", line 6, in <module>
    b = segment(a)
  File "/home/gerhard/pythainlp/pythainlp/segment/dict.py", line 10, in segment
    result = tokenize(string, lines, "")
  File "/home/gerhard/pythainlp/pythainlp/segment/dict.py", line 27, in tokenize
    if string.startswith(pref):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

when using:

# -*- coding: utf-8 -*-

# ตัดคำ
from pythainlp.segment import segment
a = 'ฉันรักภาษาไทยเพราะฉันเป็นคนไทย'
b = segment(a)

I am not sure if this is a problem with my system, or general one ..

Cheers, Gerhard
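
The byte 0xe0 in the error above is the first byte of every Thai character when encoded as UTF-8, so the message suggests that UTF-8 bytes were being decoded with the ASCII codec. The sketch below reproduces that situation in Python 3 terms; it illustrates the mechanism only and is not the fix applied in the library.

```python
text = "ฉันรักภาษาไทย"
raw = text.encode("utf-8")

# Thai characters (U+0E00 to U+0E7F) all start with byte 0xE0 in UTF-8.
first_byte = raw[0]


def try_ascii(data: bytes) -> str:
    """Decode data as ASCII, returning the error text instead of raising."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return "failed: " + str(exc)


ascii_result = try_ascii(raw)   # fails on the very first byte
decoded = raw.decode("utf-8")   # decoding with the right codec round-trips
```

On Python 2, the usual remedy for this class of error was to work with unicode objects (for example u'...' literals) rather than byte strings.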

"newmm" tokenizes white spaces

Hi,

First of all thanks for the great package 💯

I just came across one thing I think should be changed:

from pythainlp import word_tokenize
print(word_tokenize('ข้อความภาษาไทย some english'))
# ['ข้อความ', 'ภาษาไทย', ' ', 'some', ' ', 'english']

print(word_tokenize('ข้อความภาษาไทย some english',engine='icu'))
# ['ข้อความ', 'ภาษา', 'ไทย', 'some', 'english']

The icu way of doing seems more logical. Of course I can simply use " ".join(txt).split() but I think not putting whitespaces in the list by default makes more sense.

Also a quick note about newmm, it seems way better than icu but in practice it cannot really be used because it is so slow. Because it's the new default I think it could be interesting to try making it faster, have you tested making all the loops faster with Cython ?
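
For readers who want the whitespace-free behaviour today, the post-processing step can be written as a small filter over the token list. This sketch uses the token list quoted above as input and does not call PyThaiNLP; drop_whitespace_tokens is a hypothetical helper name.

```python
def drop_whitespace_tokens(tokens):
    """Return tokens with empty and whitespace-only entries removed."""
    return [tok for tok in tokens if tok.strip()]


# Output of the default engine as quoted in the report above.
tokens = ['ข้อความ', 'ภาษาไทย', ' ', 'some', ' ', 'english']
cleaned = drop_whitespace_tokens(tokens)
```

Unlike " ".join(tokens).split(), this filter leaves any token that contains an internal space untouched.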

List PyThaiNLP 1.6

Add an API that lets users segment words with their own dictionary 6d939fb
Simplify the complicated file structure and remove unused files
Improve word segmentation performance by storing the Trie file for reuse 830553e
Thai word collation system #55
Thai syllable segmentation system e0f37c9
Error romanization module #52

List PyThaiNLP 1.3

Improve the documentation
Add warnings for APIs that will be deprecated, and announce API changes that make the API resemble NLTK
Completely fix bug #14

List PyThaiNLP 1.5

import in python 2.7

import in python 2.7 9f0a1f2

>>> from pythainlp.segment import segment
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3-new\envs\python2\lib\site-packages\pythainlp\__init__.py", line 12, in <module>
    from . import postaggers
  File "C:\Anaconda3-new\envs\python2\lib\site-packages\pythainlp\postaggers\__init__.py", line 2, in <module>
    from .text import tag
  File "C:\Anaconda3-new\envs\python2\lib\site-packages\pythainlp\postaggers\text.py", line 7, in <module>
    import nltk.tag, nltk.data
  File "C:\Anaconda3-new\envs\python2\lib\site-packages\nltk\__init__.py", line 128, in <module>
    from nltk.chunk import *
  File "C:\Anaconda3-new\envs\python2\lib\site-packages\nltk\chunk\__init__.py", line 155, in <module>
    from nltk.data import load
  File "C:\Anaconda3-new\envs\python2\lib\site-packages\nltk\data.py", line 77, in <module>
    if 'APPENGINE_RUNTIME' not in os.environ and os.path.expanduser('~/') != '~/':
  File "C:\Anaconda3-new\envs\python2\lib\ntpath.py", line 311, in expanduser
    return userhome + path[i:]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc7 in position 9: ordinal not in range(128)
>>> from pythainlp.rank import rank
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3-new\envs\python2\lib\site-packages\pythainlp\__init__.py", line 6, in <module>
    from . import romanization
ImportError: cannot import name romanization
>>> from pythainlp.romanization import romanization
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3-new\envs\python2\lib\site-packages\pythainlp\__init__.py", line 6, in <module>
    from . import romanization
ImportError: cannot import name romanization
>>> b=romanization("แมว")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'romanization' is not defined
>>> from pythainlp.number import numtowords
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3-new\envs\python2\lib\site-packages\pythainlp\__init__.py", line 6, in <module>
    from . import romanization
ImportError: cannot import name romanization

Import Error

Hi @wannaphongcom ,

just installed the latest version of pythainlp on Redhat linux from the git sources.

When I try to use it I have the following problem:

import pythainlp
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/pythainlp-1.1-py2.7.egg/pythainlp/__init__.py", line 6, in <module>
    from pythainlp.romanization import *
TypeError: Item in ``from list'' not a string

From a quick web search it seems like there could be a python 2 / 3 issue with "unicode_literals".

Happy to get any feedback on how to solve this.

Cheers, Gerhard

Change PyICU code default to other

from https://www.facebook.com/groups/408004796247683/permalink/476021666112662/

License for corpus

Hi,

Are all *.txt files under https://github.com/wannaphongcom/pythainlp/tree/pythainlp1.4/pythainlp/corpus folder licensed under the Thai WordNet license (LICENSE_THA_WN)? What's the difference between thaiword.txt and new-thaidict.txt?

test error

Error romanization module

romanization("ณ ระนอง",engine='royin')

AttributeError Traceback (most recent call last)
in ()
----> 1 romanization("ณ ระนอง",engine='royin')

C:\ProgramData\Anaconda3\lib\site-packages\pythainlp\romanization\__init__.py in romanization(data, engine)
15 elif engine=='pyicu':
16 from .pyicu import romanization
---> 17 return romanization(data)

C:\ProgramData\Anaconda3\lib\site-packages\pythainlp\romanization\royin.py in romanization(text)
542 '''
543 d=re.search(consonants_thai,text,re.U)
--> 544 text=re.sub(d.group(0),consonants[d.group(0)][0],text,flags=re.U)
545 listtext=list(text)
546 #print(listtext,0)

AttributeError: 'NoneType' object has no attribute 'group'

Following the example gives a Unicode error

UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-4: ordinal not in range(128)
What should I do?

Python 3.6.1

fix royin romanization

There is a problem that needs to be fixed.

Cannot install pythainlp

I cannot install pythainlp.
I used the command pip install pythainlp

The error shown is:
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\Users\sine\AppData\Local\Temp\pip-build-ffw6hyzb\pyicu\setup.py", line 33, in <module>
    ''')
RuntimeError:
Please set the ICU_VERSION environment variable to the version of
ICU you have installed.

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in D:\Users\sine\AppData\Local\Temp\pip-build-ffw6hyzb\pyicu\

P.S. I downloaded PyICU-1.9.7-cp36-cp36m-win_amd64 but don't know where to put it.

Bug in TCC

There is an error in cluster segmentation.

Redesign pythainlp.corpus

The database in pythainlp.corpus is becoming too large. The solution is to redesign pythainlp.corpus.

Add TTC Freq to pythainlp.corpus

Data from https://github.com/korakot/thainlp
pythainlp.corpus.ttc.get_word_frequency_all

bug in mm

from pythainlp.tokenize import word_tokenize
text = "แมวกินปลาแมวมันชอบนอนนอนกลางวันนอนแล้วนอนอีกเป็นสัตว์ที่ขี้เกียจจริงๆเลยแมวแต่แมวมันเข้ากับคนได้ดีฉันชอบแมว"
print(word_tokenize(text,engine='mm'))

['แมว', 'กิน', 'ปลา', 'แมว', 'มัน', 'ชอบ', 'นอน', 'นอน', 'กลางวัน', 'นอน', 'แล้ว', 'นอน', 'อีก', 'เป็น', 'สัตว์', 'ที่', 'ขี้เกียจ', 'จริงๆ', 'เลย', 'แมว', 'แต่', 'NOT_แมว', 'NOT_มัน', 'NOT_เข้ากับ', 'NOT_คน', 'NOT_ได้ดี', 'NOT_ฉัน', 'NOT_ชอบ', 'NOT_แมว']

Using mm is not recommended because it is still under development.

PyThaiNLP in Jython

Make PyThaiNLP run in Jython.

No start date or deadline has been set.

Precision of each word segmentation engine.

May I ask about the accuracy of each engine (6 engines) for Thai tokenization? I already know the figures for deepcut, but I don't know where to find them for the others. I've searched for articles and found nothing.

I'd like to use them as a reference when choosing a word segmentation engine.

Thank you.

Long Term Support (LTS) version ?

Which PyThaiNLP version should be the Long Term Support (LTS) release, and how long should it be supported?

add Python 2.7

Can't import from pythainlp.tokenize on Mac OS X

I have a problem like this:
Traceback (most recent call last):
  File "/Users/ying/PycharmProjects/untitled/SentenceAligner/auto_Pretreatment/taiyu.py", line 1, in <module>
    from pythainlp.tokenize import tcc
  File "/Users/ying/anaconda/lib/python3.6/site-packages/pythainlp/__init__.py", line 15, in <module>
    from pythainlp.date import now
  File "/Users/ying/anaconda/lib/python3.6/site-packages/pythainlp/date/__init__.py", line 3, in <module>
    import icu
  File "/Users/ying/anaconda/lib/python3.6/site-packages/icu/__init__.py", line 40, in <module>
    from .docs import *
  File "/Users/ying/anaconda/lib/python3.6/site-packages/icu/docs.py", line 23, in <module>
    from _icu import *
ImportError: dlopen(/Users/ying/anaconda/lib/python3.6/site-packages/_icu.cpython-36m-darwin.so, 2): Library not loaded: libicui18n.54.dylib
  Referenced from: /Users/ying/anaconda/lib/python3.6/site-packages/_icu.cpython-36m-darwin.so
  Reason: image not found

Thai Soundex (Thai sound system)?

Would it be possible for us to build a Thai Soundex using one of these approaches?

Approach 1: http://guru.sanook.com/1520/
Approach 2: https://linux.thai.net/~thep/soundex/soundex.html

Library not loaded: libicui18n.54.dylib

>>> from pythainlp.tokenize import word_tokenize
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/qianchen/.pyenv/versions/anaconda2-4.4.0/lib/python2.7/site-packages/pythainlp/__init__.py", line 15, in <module>
    from pythainlp.date import now
  File "/Users/qianchen/.pyenv/versions/anaconda2-4.4.0/lib/python2.7/site-packages/pythainlp/date/__init__.py", line 3, in <module>
    import icu
  File "/Users/qianchen/.pyenv/versions/anaconda2-4.4.0/lib/python2.7/site-packages/icu/__init__.py", line 42, in <module>
    from docs import *
  File "/Users/qianchen/.pyenv/versions/anaconda2-4.4.0/lib/python2.7/site-packages/icu/docs.py", line 23, in <module>
    from _icu import *
ImportError: dlopen(/Users/qianchen/.pyenv/versions/anaconda2-4.4.0/lib/python2.7/site-packages/_icu.so, 2): Library not loaded: libicui18n.54.dylib
  Referenced from: /Users/qianchen/.pyenv/versions/anaconda2-4.4.0/lib/python2.7/site-packages/_icu.so
  Reason: image not found

When I was trying to word_tokenize, I met this error.
This is my python version: Python 2.7.13 |Anaconda 4.4.0 (x86_64)| (default, Dec 20 2016, 23:05:08)
My mac os version is 10.11.6

Thanks for your attention.

list PyThaiNLP 1.1

Drop support for Python 2.7
Add a module for the letters ก-ฮ
Add a time-related module
Add Windows installation documentation
Add an English description
Add English documentation
Others

Cannot segment sentences that contain other languages mixed in

English (without space)

>>> a = "ผมชอบพูดภาษาไทยคำEnglishคำ"
>>> b = segment(a)
>>> b
['ผม', 'ชอบ', 'พูด', 'ภาษา', 'ไทย', 'คำEnglishคำ']

English (with space)

>>> a = "ผมชอบพูดไทยคำ English คำ"
>>> b = segment(a)
>>> b
['ผม', 'ชอบ', 'พูด', 'ไทย', 'คำ English คำ']

Chinese

>>> a =  "ผมมาจาก泰国ครับ"
>>> b = segment(a)
>>> b
['ผม', 'มา', 'จาก泰国ครับ']
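
One common pre-processing idea for mixed-script input like the cases above is to split the text into runs of Thai and non-Thai characters first, and pass only the Thai runs to the segmenter. The sketch below shows that splitting step with the standard library only. It is an illustration under the assumption that the Thai Unicode block (U+0E00 to U+0E7F) is a good enough test for "Thai", and it is not how PyThaiNLP itself resolved this issue.

```python
import re

# A run of Thai-block characters, or a run of non-Thai, non-space characters.
_SCRIPT_RUNS = re.compile(r"[\u0e00-\u0e7f]+|[^\u0e00-\u0e7f\s]+")


def split_by_script(text: str):
    """Split text into alternating Thai and non-Thai runs, dropping whitespace."""
    return _SCRIPT_RUNS.findall(text)


def is_thai_run(run: str) -> bool:
    """Return True if the run starts with a character from the Thai block."""
    return "\u0e00" <= run[0] <= "\u0e7f"
```

Each run for which is_thai_run returns True would then go to the word segmenter, while the other runs are kept as single tokens.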

Romanization does not work with my name

from pythainlp.romanization import romanization

b=romanization("ณัฐชนน")
print(b) #ṇạṭ̄h chnn 

b=romanization("นัด") + romanization("ชะ") + romanization("โนน") # cannot use the vowel โอะ here ToT
print(b) # nạdchanon: better when romanized syllable by syllable

b="natchanon" # the correct result
print(b)

Would it be possible for us to port http://pioneer.chula.ac.th/~awirote/resources/thai-romanization.html?

Some features do not display output in Thai

Some features do not display their output in Thai. I'm not sure whether I need to install something else on my machine first.

Segmentation does not display its output in Thai:

>>> # Thai Segment 
... from pythainlp.segment import segment
>>> a = 'ฉันรักภาษาไทยเพราะฉันเป็นคนไทย'
>>> b = segment(a)
>>> print(b)
['\xe0\xb8\x89', '\xe0\xb8\xb1', '\xe0\xb8\x99\xe0', '\xb8\xa3\xe0', '\xb8\xb1\xe0\xb8\x81', '\xe0\xb8\xa0', '\xe0\xb8\xb2\xe0', '\xb8\xa9', '\xe0\xb8\xb2']

The same goes for the POS taggers:
>>> # Thai Postaggers
... from pythainlp.postaggers import tag
>>> print(tag('คุณกำลังประชุม'))
[('\xe0\xb8\x84', None), ('\xe0\xb8\xb8\xe0\xb8', None), ('\x93\xe0\xb8\x81\xe0\xb8', None)]

And for word counting as well:
>>> # Find the number word of the most
... from pythainlp.rank import rank
>>> aa = rank(b)
>>> print(aa)
Counter({'\xe0\xb8\x89': 1, '\xb8\xa9': 1, '\xe0\xb8\xb2\xe0': 1, '\xe0\xb8\xa0': 1, '\xb8\xa3\xe0': 1, '\xb8\xb1\xe0\xb8\x81': 1, '\xe0\xb8\x99\xe0': 1, '\xe0\xb8\xb2': 1, '\xe0\xb8\xb1': 1})

As for keyboard-layout conversion, texttoeng() raises an error:

>>> # Fix the printer forgot to change the language
... from pythainlp.change import *
>>> a="l;ylfu8iy["
>>> a=texttothai(a)
>>> b="นามรสนอำันี"
>>> b=texttoeng(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pythainlp/change/__init__.py", line 34, in texttoeng
    data2+=a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
>>> print(a)
สวัสดีครับ
>>> print(b)
นามรสนอำันี
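
The keyboard-layout conversion shown in the session above is, at its core, a per-character lookup. The sketch below builds a deliberately partial table from only the key pairs that can be read off that example (the input "l;ylfu8iy[" and its printed result), so it is not a full keyboard map and not the library's implementation; both function names are hypothetical.

```python
# Key pairs observable from the example above; every other key is left unmapped.
_PAIRS = {
    "l": "ส", ";": "ว", "y": "ั", "f": "ด", "u": "ี",
    "8": "ค", "i": "ร", "[": "บ",
}

_ENG_TO_THAI = str.maketrans(_PAIRS)
_THAI_TO_ENG = str.maketrans({thai: eng for eng, thai in _PAIRS.items()})


def eng_keys_to_thai(text: str) -> str:
    """Map English-layout keystrokes to Thai characters; unmapped characters pass through."""
    return text.translate(_ENG_TO_THAI)


def thai_keys_to_eng(text: str) -> str:
    """Inverse mapping, from Thai characters back to English-layout keystrokes."""
    return text.translate(_THAI_TO_ENG)
```

Because str.translate works on Unicode strings, this approach avoids the byte-versus-text mixing that the traceback above points to.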

Reading numbers out in Thai works normally:

>>> # Read a number to text in Thai language
... from pythainlp.number import numtowords
>>> print("5611116.50")
5611116.50
>>> print(numtowords(5611116.50))
ห้าล้านหกแสนหนึ่งหมื่นหนึ่งพันหนึ่งร้อยสิบหกบาทห้าสิบสตางค์

macOS Sierra 10.12.4
Installed as described in the README:

$ brew install icu4c --force
$ brew link --force icu4c
$ CFLAGS=-I/usr/local/opt/icu4c/include LDFLAGS=-L/usr/local/opt/icu4c/lib pip install pythainlp

Word Vectors for Thai

Create word vectors for Thai

Clean wikipedia dump
Train language model (LSTM with dropouts)
Extract embeddings as word vectors
Repackage model for classification
Test classification on BEST

Can't symlink icu4c on OS X

Hi.

For some reason I can't symlink icu4c via brew link --force icu4c on OS X. Then, I installed it via brew install icu4c --HEAD and symlink(?) via

  echo 'export PATH="/usr/local/opt/icu4c/bin:$PATH"' >> ~/.bash_profile
  echo 'export PATH="/usr/local/opt/icu4c/sbin:$PATH"' >> ~/.bash_profile

Then I get this error when installing pythainlp:

Please set the ICU_VERSION environment variable to the version of ICU you have installed.

Any ideas?

Thai word collation system

Status: not yet closed
Target: PyThaiNLP 1.6
The word-sorting code in pythainlp.collation.collation uses PyICU, and PyICU is inconvenient to install on many systems.
The goal is to build a new Thai word-sorting system to replace the PyICU-based code.
Example of the desired ordering: ['งวด','งม'] -> ['งม','งวด']
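
A pure-Python sort key is one way to get dictionary-like ordering without PyICU. The sketch below is a simplified approach and only an assumption about what such a replacement could look like: it ignores tone marks on the first pass and moves a leading vowel (เ แ โ ใ ไ) behind the consonant it is written before, so that words sort under their initial consonant. It does not cover every rule of Thai dictionary order.

```python
import re

# Tone marks and related signs (U+0E47 to U+0E4C), ignored in the primary key.
_MARKS = re.compile(r"[\u0e47-\u0e4c]")
# A leading vowel (U+0E40 to U+0E44) followed by a consonant (U+0E01 to U+0E2E).
_LEADING_VOWEL = re.compile(r"([\u0e40-\u0e44])([\u0e01-\u0e2e])")


def thai_sort_key(word: str):
    """Build a sort key: consonant-first base form, then the original word as tie-breaker."""
    base = _MARKS.sub("", word)
    base = _LEADING_VOWEL.sub(r"\2\1", base)  # swap vowel and consonant
    return (base, word)


def thai_sorted(words):
    """Return words sorted with thai_sort_key."""
    return sorted(words, key=thai_sort_key)
```

With this key, thai_sorted(['ขา', 'เกา']) places 'เกา' first because it sorts under ก, whereas a plain code-point sort would place it after 'ขา'.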