
nlp_thai_resources's Introduction

Thai NLP Resource

A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora. Pull requests are always welcome.

Libraries/Services

Thai Character Cluster

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| JTCC | Thai Character Cluster | Java | | GPL-3.0 | Wittawat |
| TCC | Thai Character Cluster | Python | | Apache 2.0 | Wannaphong |

Sentiment Analysis

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| sentiment_analysis_thai | | | | | JagerV3 |

Soundex

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| PyThaiNLP | | Python 3 | LK82 + Udom83 | Apache 2.0 | Korakot, GitHub |

Word Segmentation

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Chamkho | Lao/Thai word segmentation | Rust | | LGPL | GitHub |
| CutKum | Thai word segmentation with Deep Learning in Tensorflow. RNN. | Python | 93% F-measure. | MIT | Pucktada, GitHub |
| CutThai | Thai word segmentation written in coffee-script | Coffee-script | | MIT | Pureexe/cutthai GitHub |
| DeepCut | A Thai word tokenization library using Deep Neural Network. CNN. | Python | 98.8% F-measure. | MIT | rkcosmos, GitHub |
| Lexto: Thai Lexeme Tokenizer | | Java | | LGPL | NECTEC |
| Lexto | | Python 2 | | LGPL | GitHub |
| Lexto | | Python 3 | | LGPL | GitHub |
| Multi-Candidate-Word-Segmentation | Multi Candidate Word Segmentation for Thai language | Python, RNN, LSTM | 97.0% F-measure (Word Level), 98.95% F-measure (Boundary Level) | MIT | paper, GitHub |
| PyThaiNLP | | Python 3 | Maximal matching and various other engines | Apache 2.0 | GitHub |
| Swath | SWATH (Smart Word Analysis for THai) is a word segmentation program for Thai | C | Longest Matching, Maximal Matching and Part-of-Speech Bigram. | GPL | Paisarn Charoenpornsawat, CMU |
| SynThai | Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM. | Python | 99.2% F-measure | MIT | KenjiroAI, GitHub |
| Thai Language Toolkit (tltk) | Based on a paper by Wirote Aroonmanakun in 2002. Word segmentation is based on a maximum collocation approach. Syllable segmentation is based on 3-gram statistics. (Dataset is included) | Python | 97.86% F-measure. (It was tested on a different test set, so it is not fair to compare it with other models.) | GPLv3 | PyPI |
| Wordcut | Thai word breaker for Node.js | JavaScript, Node.JS | | LGPL-3.0 | veer66, GitHub |
| wordcutpy | A simple Thai word tokenizer written in 1 Python file | Python 3 | | LGPL-3.0 | veer66, GitHub |
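
Several entries in this section list longest matching or maximal matching among their features. As a rough illustration of the dictionary-based idea only, here is a minimal sketch of greedy longest matching in Python. It is not the implementation used by Swath, PyThaiNLP, or any other library listed here. The function name `longest_matching` and the toy dictionary `TOY_DICT` are assumptions made up for this example.

```python
from typing import Iterable, List


def longest_matching(text: str, dictionary: Iterable[str]) -> List[str]:
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts there.
    If no dictionary word matches, emit a single character as an unknown token.
    """
    words = set(dictionary)
    max_len = max((len(w) for w in words), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in words:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts at position i.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy dictionary, invented for this example (hypothetical, not a real lexicon).
TOY_DICT = ["ตา", "ตาก", "กลม", "ลม"]

print(longest_matching("ตากลม", TOY_DICT))  # ['ตาก', 'ลม']
```

The string ตากลม can be read as ตา|กลม or ตาก|ลม. The greedy strategy always commits to the longest prefix, so it returns ตาก|ลม whatever the context. Maximal matching instead compares candidate segmentations, typically preferring the one with the fewest words, and the neural models in the table learn boundaries from annotated data.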

Part of Speech Tagging (POS Tagging)

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Chart-POS | Thai POS Tagger | C | | All rights reserved | AIAT, KINDML, Thanaruk T., tchayintr, Demo at iApp |
| Jitar+NAiST | A simple Trigram HMM part-of-speech tagger | Java | | | Ver66, Jitar + NAiST, 1 + NAiST, 2 |
| SynThai | Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM. | Python | 0.9163 F-measure. RNN. LSTM | MIT | KenjiroAI, GitHub |

Named Entity Recognition

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Named Entity Tagging (Thai NEST) | Thai Named Entity tagging Specification and Tools | | | GPL | KINDML, SIIT, AIAT |
| ThaiNER | Thai Named Entity Recognition for PyThaiNLP | Python | | Apache 2.0 (code) & CC BY 3.0 (Dataset) | ThaiNER |

News Structure Tagging

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| News Structure Tagging Program | Thai News Structure Tagging Program | | Metadata tagging, Structure tagging, Automatic News Title Generation | GPL | AIAT |

Syntactic Parsing & Tools

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Chart-parser | Extract Syntactic Structure from POS Tagged Sentence. | C | | All rights reserved | AIAT, KINDML, Thanaruk T., tchayintr, Demo at iApp |
| Grammar Processing | Labelled Brackets -> Context Free Grammars (CFGs) | Python | Transform and compute probability | | tchayintr |

Word Embedding

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| kobkrit-word-embedding | Tensorflow implementation of Thai word embedding | Python | Source code, Example, Word distance graph | LGPL | Kobkrit V. |

Question Answering (Machine Comprehension)

| Service | Description | License | Author & Link |
|---|---|---|---|
| Thai Machine Comprehension (ThaiMC) | Bidirectional Attention Flow | Copyright (As the service) | iApp-AI |

Emojification

| Service | Description | License | Author & Link |
|---|---|---|---|
| Thai Emotification | LSTM | GPL | Demo at iApp-AI and Source, GitHub |

Corpus and Dataset

Dictionaries / Translation Pairs

| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| LEXiTRON | Thai<->English Dictionary | | TH->EN, EN->TH | LEXiTRON License | NECTEC |
| Transliteration Corpus | | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
| Yaitron | LEXiTRON in machine readable format (XML) | | TH->EN, EN->TH | LEXiTRON License | Veer66 Schema, Data & Conversion Code |

Downloadable Text Corpus

| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| Click Bait Sentences | Thai Click Bait Sentence | 330 sent. (90.7KB) | | MIT | Wannaphongcom |
| InterBEST 2009/2010 | | 5M words | Word Seg. | CC BY-NC-SA 3.0 TH | NECTEC |
| ORCHID | | 30K sent. | Word Seg., POS Tagged. | CC BY-NC-SA 3.0 TH | NECTEC |
| Prime Minister 29 | Prime Minister 29's Speech Sentences | 338KB | Word segmented, Named Entity tagged | MIT | Wannaphongcom |
| thai-jokes-corpus | Cleaned Thai Jokes Corpus | 457 jokes | | GPLv3 | iApp Technology |
| Thai named entity corpora | Named entity corpora by Wirote Aroonmanakun's students | 266KB-1.5MB | Syllable seg., word seg., Named Entity tagged | GPLv3 (not sure, but tltk is using this license) | นัชชา ถิระสาโรช Data; ศศิวิมล กาลันสีมา Data; ณัฐดาพร เลิศชีวะ Data |
| THAI-NEST | Thai-NEST: Thai Named Entity tagging Specification and Tools | 45K+ Named Entity tokens | Named Entity tagged | LGPL | KINDML |
| Thai Sentimental Word List | Thai Sentimental Words List | 52KB | Separated words as Adj, V | MIT | Wannaphongcom |
| Thai Wikipedia | Formal Articles | 1.49GB (~213.1 MB compressed) | XML | GFDL | WIKIPEDIA |
| Thai WordNet | THE CONSTRUCTION OF THAI WORDNET OF 1ST ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD AND WITH DICTIONARIES OF DIFFERENT COMPILATIONAL APPROACHES (ธนนท์ หลีน้อย); THE CONSTRUCTION OF THAI WORDNET OF 2ND ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD : A STUDY OF THE DIVERSITY OF MEANINGS AFFECTING TRANSLATIONAL ACCURACY (ปริศนา อัครพุทธิพร) | | WordNet | N/A | ธนนท์ หลีน้อย 2008; ปริศนา อัครพุทธิพร Data 2008 |
| TNC Top-5000 Words | Word frequency | 5,000 words | Frequency of Thai words in various genres, EXCEL | All rights reserved | CHULA |
| Toxicity in Thai Tweet Corpus | Tokyo Metropolitan University Natural Language Processing Group | | Each tweet is labeled as toxic or non-toxic | CC BY-NC 4.0 | tmu-nlp |
| Wisesight Sentiment Corpus | Social media messages with sentiment labels (positive, neutral, negative, question). | ~26,700 messages | Sentiment label, Question label | Public domain | PyThaiNLP |

Web Query Text Corpus

| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| Thai National Corpus 2 | | 32M words | Query text by genre, domain | All rights reserved | CHULA |
| Thai Medical Document | | 3,594 docs | Document and dynamic keyword map | All rights reserved | KINDML, SIIT |
| Southeast Asian Languages Library | Thai News, Web Text, Pop Music, Literature, Toponyms | 20M chars | Phrase around a search text | | SEALang |
| HSE Thai Corpus | Modern texts written in Thai language (mostly news websites) | 50M tokens | Query by word form, lexeme, translation, grammatical attributes, lexical attributes | | HSE School of Linguistics |

Parallel Corpus

| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| TALPCo | TUFS Asian Language Parallel Corpus | 1327 sent. | Open parallel corpus consisting of Japanese sentences and their translations into Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese and English | CC BY 4.0 | TALPCo |

Pre-trained Language Models

| Pre-trained Model | Description | Size | Dimensions | License | Link |
|---|---|---|---|---|---|
| fastText | Skip-Gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
| thai2fit | ULMFit on Wikipedia. Perplexity of 46.80959 with 60,002 embeddings. | 70MB | 300 | MIT | thai2vec / PyThaiNLP |
| thbert | Yet another pre-trained BERT particularly in Thai | | | Apache 2.0 | tchayintr |

Benchmarks

Thai Text Classification Benchmarks

Tools

Corpus extractors

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| BEST2010 cooker | A tool for extracting segmented words from Thai segmented BEST2010 corpus | Python3 | Extracting segmented words, features, and data divisions | Apache 2.0 | tchayintr |

Not found? Try looking at another Thai NLP awesome list or resource, such as this one:

https://resources.aiat.or.th/

Acknowledgements

nlp_thai_resources's People

Contributors

bact, bi89, c4n, cstorm125, ekapolc, kobkrit, p16i, tchayintr, veer66, wannaphong


nlp_thai_resources's Issues

Unclear results for Multi-Candidate Word Segmentation using Bi-directional LSTM Neural Networks

Hi,

I found the paper "Multi-Candidate Word Segmentation using Bi-directional LSTM Neural Networks" through your repository. While the approach in the paper sounds interesting, the details mentioned in the experiment setup sound very strange to me.

[Image: Screenshot 2019-06-17 14 35 27]

As you're one of the authors, I wonder how the algorithm would perform if we tried it on the whole test set. Obviously, the numbers would look different. Would you have time to do that and update the plot and the numbers in the table accordingly?

Thank you very much in advance for your clarification.

The figure is taken from https://drive.google.com/file/d/1x8JmqQFlbMev0fiqCx7wM5YU-twxHmTu/view.

Link for ORCHID corpus is 404

It seems the link for the ORCHID corpus returns a 404. I've googled a bit (query: "site:nectec.or.th orchid corpus") and it looks like the page was taken down sometime in mid-2020.

There is a copy of the page on archive.org, but the download links unfortunately were not archived.
https://web.archive.org/web/20180630023610/https://www.nectec.or.th/corpus/index.php?league=pm

Given that this seems to be a small corpus, would it be possible to mirror it in this repo?
