AttaCut: Fast and Reasonably Accurate Word Tokenizer for Thai

How does AttaCut look like?

TL;DR: 3-Layer Dilated CNN on syllable and character features. It’s 6x faster than DeepCut (SOTA) while its WL-f1 on BEST is 91%, only 2% lower.

Installation

$ pip install attacut

Remarks: Windows users need to install PyTorch before the command above. Please consult PyTorch.org for more details.

Usage

Command-Line Interface

$ attacut-cli -h
AttaCut: Fast and Reasonably Accurate Word Tokenizer for Thai

Usage:
  attacut-cli <src> [--dest=<dest>] [--model=<model>]
  attacut-cli [-v | --version]
  attacut-cli [-h | --help]

Arguments:
  <src>             Path to input text file to be tokenized

Options:
  -h --help         Show this screen.
  --model=<model>   Model to be used [default: attacut-sc].
  --dest=<dest>     If not specified, it'll be <src>-tokenized-by-<model>.txt
  -v --version      Show version

High-Level API

from attacut import tokenize, Tokenizer

# tokenize `txt` using our best model `attacut-sc`
words = tokenize(txt)

# alternatively, an AttaCut tokenizer might be instantiated directly, allowing
# one to specify whether to use `attacut-sc` or `attacut-c`.
atta = Tokenizer(model="attacut-sc")
words = atta.tokenize(txt)

For better efficiency, we recommend using attacut-cli. Please consult our Google Colab tutorial for more detials.

Benchmark Results

Belows are brief summaries. More details can be found on our benchmarking page.

Tokenization Quality

Speed

Retraining on Custom Dataset

Please refer to our retraining page

Related Resources

Acknowledgements

This repository was initially done by Pattarawat Chormai, while interning at Dr. Attapol Thamrongrattanarit's NLP Lab, Chulalongkorn University, Bangkok, Thailand. Many people have involed in this project. Complete list of names can be found on Acknowledgement.

Can't install on Windows

I used pip install https://github.com/PyThaiNLP/attacut/archive/master.zip on Windows but it has a installation problems. torch can't install on Windows.

>pip install https://github.com/PyThaiNLP/attacut/archive/master.zip
Collecting https://github.com/PyThaiNLP/attacut/archive/master.zip
  Downloading https://github.com/PyThaiNLP/attacut/archive/master.zip
     \ 2.4MB 1.1MB/s
Requirement already satisfied: docopt==0.6.2 in c:\users\tc\anaconda3\lib\site-packages (from attacut==0.0.3.dev0) (0.6.2)
Collecting fire==0.1.3 (from attacut==0.0.3.dev0)
  Downloading https://files.pythonhosted.org/packages/5a/b7/205702f348aab198baecd1d8344a90748cb68f53bdcd1cc30cbc08e47d3e/fire-0.1.3.tar.gz
Collecting nptyping==0.2.0 (from attacut==0.0.3.dev0)
  Downloading https://files.pythonhosted.org/packages/a5/0f/9b44a1866c7911d03329669d82d2ebb1b8e6dac15803fdb6588549a44193/nptyping-0.2.0-py3-none-any.whl
Collecting numpy==1.17.0 (from attacut==0.0.3.dev0)
  Downloading https://files.pythonhosted.org/packages/26/26/73ba03b2206371cdef62afebb877e9ba90a1f0dc3d9de22680a3970f5a50/numpy-1.17.0-cp37-cp37m-win_amd64.whl (12.8MB)
     |████████████████████████████████| 12.8MB 3.3MB/s
Requirement already satisfied: python-crfsuite==0.9.6 in c:\users\tc\anaconda3\lib\site-packages (from attacut==0.0.3.dev0) (0.9.6)
Collecting pyyaml==5.1.2 (from attacut==0.0.3.dev0)
  Downloading https://files.pythonhosted.org/packages/bc/3f/4f733cd0b1b675f34beb290d465a65e0f06b492c00b111d1b75125062de1/PyYAML-5.1.2-cp37-cp37m-win_amd64.whl (215kB)
     |████████████████████████████████| 225kB 3.2MB/s
Requirement already satisfied: six==1.12.0 in c:\users\tc\anaconda3\lib\site-packages (from attacut==0.0.3.dev0) (1.12.0)
Collecting ssg==0.0.4 (from attacut==0.0.3.dev0)
  Downloading https://files.pythonhosted.org/packages/05/e0/226b4fb9144d80a3efc474e581097d77abc4e8c3ce8e751469cb1c25e671/ssg-0.0.4-py3-none-any.whl (473kB)
     |████████████████████████████████| 481kB 2.2MB/s
Collecting torch==1.2.0 (from attacut==0.0.3.dev0)
  ERROR: Could not find a version that satisfies the requirement torch==1.2.0 (from attacut==0.0.3.dev0) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch==1.2.0 (from attacut==0.0.3.dev0)

pythainlp / attacut Goto Github PK

attacut's Introduction

AttaCut: Fast and Reasonably Accurate Word Tokenizer for Thai

How does AttaCut look like?

Installation

Usage

Command-Line Interface

High-Level API

Benchmark Results

Tokenization Quality

Speed

Retraining on Custom Dataset

Related Resources

Acknowledgements

attacut's People

Contributors

Stargazers

Watchers

Forkers

attacut's Issues

Tasks

Documentations (now available on https://pythainlp.github.io/attacut)

Testing & Final Benchmark

Recommend Projects

Recommend Topics

Recommend Org