
Thai word segmentation with bi-directional RNN

This repository contains code for preprocessing data, training a model, and inferring word boundaries in Thai text with a bi-directional recurrent neural network. The model achieves a precision of 98.94%, a recall of 99.28% and an F1 score of 99.11%. Please see the blog post for a detailed description of the model.

Requirements

  • Python 3.4
  • TensorFlow 1.4
  • NumPy 1.13
  • scikit-learn 0.18

Files

  • preprocess.py: Preprocess corpus for model training
  • train.py: Train the Thai word segmentation model
  • predict_example.py: Example usage of the model to segment Thai words
  • saved_model: Pretrained model weights
  • thainlplib/labeller.py: Methods for preprocessing the corpus
  • thainlplib/model.py: Methods for training the model

Note that the InterBEST 2009 corpus is not included, but can be downloaded from the NECTEC website.

Usage

To try the prediction demo, run python3 predict_example.py. To preprocess the data and then train and save the model, put the data files under the data directory and run python3 preprocess.py followed by python3 train.py.
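The model labels each character of the input, marking where words begin. A minimal sketch of decoding such per-character begin-of-word labels into words (the function name and the dummy labels below are illustrative, not part of this repository):

```python
def decode_segments(text, labels):
    """Split text into words given per-character labels,
    where label 1 marks the first character of a word."""
    words, current = [], ""
    for ch, is_start in zip(text, labels):
        if is_start and current:
            words.append(current)  # a new word begins; flush the previous one
            current = ""
        current += ch
    if current:
        words.append(current)  # flush the trailing word
    return words

# Dummy labels for illustration (real labels come from the model's predictions):
print(decode_segments("thaiword", [1, 0, 0, 0, 1, 0, 0, 0]))  # → ['thai', 'word']
```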

Bug fixes and updates

  • 3/10/2019: Switched license to MIT
  • 1/6/2018: Fixed a bug in preprocess.py that split the data incorrectly. The model was retrained, achieving a precision of 98.94%, a recall of 99.28% and an F1 score of 99.11%. Thank you Ekkalak Thongthanomkul for the bug report.
  • 1/6/2018: Load the model variables with signature names in predict_example.py.

Contributors

  • Jussi Jousimo
  • Natsuda Laokulrat
  • Ben Carr
  • Ekkalak Thongthanomkul
  • Vee Satayamas

License

MIT

Copyright (c) Sertis Co., Ltd., 2019


thai-word-segmentation's Issues

Duplicate writing of training data in preprocess.py?

x, y = process_line(line)
p = random.random()
example = make_sequence_example(x, y)
training_writer.write(example.SerializeToString())
if p <= training_proportion:
    training_writer.write(example.SerializeToString())
else:
    validation_writer.write(example.SerializeToString())

At line number 47, it looks like the data is not split correctly. I am not sure, but it seems this line should be removed, as we should write each example only once based on the random variable p.
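A minimal, self-contained sketch of the fix suggested above, with the unconditional write removed so that each example lands in exactly one set (the helper below is illustrative, not part of the repository):

```python
import random

def split_examples(examples, training_proportion):
    """Assign each example to exactly one of (training, validation),
    writing it once rather than twice as in the original loop."""
    training, validation = [], []
    for example in examples:
        if random.random() <= training_proportion:
            training.append(example)   # goes to the training set only
        else:
            validation.append(example)  # or to the validation set only
    return training, validation
```

With the extra unconditional training write removed, the training and validation sets partition the corpus instead of overlapping.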

Hello

Hello. After training finishes, the model is not saved. What should I do?

Sentence segmentation

Hi,
I was looking for a tool for Thai sentence segmentation, but there seems to be no readily available one. Papers propose methods for use cases such as Thai-English translation [1] or disambiguating space characters as sentence markers [2] based on the number of verbs (morphemes), rules, or discourse analysis.

Do you know a tool for this task or is it possible to use your word segmentation tool as part of a toolchain to do sentence segmentation?

Thank you in advance for any reply. Best regards.

PS: Using your tool to insert space characters between each "word" seems to improve Google Translate results (for single sentences). :-) Well, my sample size was not that large...

[1] http://www.aclweb.org/anthology/W10-3602
[2] http://pioneer.netserv.chula.ac.th/~awirote/ling/snlp2007-wirote.pdf

Installing requirements

Hi,
when simply using pip install -r requirements.txt, it fails for me because scikit-learn depends on scipy.
Would it help if the ordering were different (or if you removed scipy)? A manual install works fine if I run pip install scikit-learn after the other packages are installed...

Excerpt from the log (using the requirements.txt file):

...
ImportError: Scientific Python (SciPy) is not installed.
  scikit-learn requires SciPy >= 0.9.
...
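One possible workaround, assuming an older pip that installs requirements strictly in file order, is to list scipy explicitly before scikit-learn. The ordering below is a sketch; the version pins are taken from the Requirements section:

```
numpy>=1.13
scipy>=0.9
tensorflow>=1.4
scikit-learn>=0.18
```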

Problem when deploying the model with Tensorflow serving

I have finished deploying the saved model with TensorFlow Serving on the server side, but I have a problem on the client when trying to match the input tensor format with the saved model.

The save_model function in model.py:

inputs = {
    'inputs': tf.saved_model.utils.build_tensor_info(tf.identity(self.tf_tokens_batch, 'inputs')),
    'lengths': tf.saved_model.utils.build_tensor_info(tf.identity(self.tf_lengths_batch, 'lengths')),
    'training': tf.saved_model.utils.build_tensor_info(tf.identity(self.tf_training, 'training'))
}

outputs = {
    'outputs': tf.saved_model.utils.build_tensor_info(tf.identity(self.tf_masked_prediction, 'outputs'))
}

client.py (my implementation)

text = "ทดสอบ"
inputs = [ThaiWordSegmentLabeller.get_input_labels(text)]
lengths = [len(text)]
request.model_spec.name = 'word'
request.model_spec.signature_name = 'word_segmentation'

request.inputs['inputs'].CopyFrom(tf.contrib.util.make_tensor_proto(values=inputs,dtype=tf.int64))
request.inputs['lengths'].CopyFrom(tf.contrib.util.make_tensor_proto(values=lengths,dtype=tf.int64))
request.inputs['training'].CopyFrom(tf.contrib.util.make_tensor_proto(values=False,dtype=tf.bool))

Output

grpc.framework.interfaces.face.face.AbortionError: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="You must feed a value for placeholder tensor 'Placeholder_1' with dtype bool
         [[Node: Placeholder_1 = Placeholder[_output_shapes=[[]], dtype=DT_BOOL, shape=[], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]")

Best Regards,

f1-score evaluation

Hi,

While working on #8, it seems to me that the F1-score evaluation is based on flattened true and predicted labels. For example, given 2 samples of lengths 7 and 20, the current code flattens the labels to shape (27,) and computes the score on that. However, I think this could overestimate the value.

To illustrate, I've made a notebook using random data. You can see there that the average F-score is slightly lower than the F-score computed on the flattened data.
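A toy illustration of this effect in plain Python (the labels below are made up): the long, well-predicted sample dominates the flattened score, which comes out higher than the per-sample average.

```python
def f1(true, pred):
    """Binary F1 score over parallel 0/1 label lists."""
    tp = sum(t == 1 and p == 1 for t, p in zip(true, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(true, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(true, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A short, poorly predicted sample and a long, perfectly predicted one.
true_a, pred_a = [1, 0, 1], [0, 1, 1]        # per-sample F1 = 0.5
true_b, pred_b = [1, 0] * 10, [1, 0] * 10    # per-sample F1 = 1.0

flattened = f1(true_a + true_b, pred_a + pred_b)          # 11/12 ≈ 0.917
averaged = (f1(true_a, pred_a) + f1(true_b, pred_b)) / 2  # 0.75
```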

Looking forward to your thoughts on this.

Lao word segmentation Issue

I saw your report with such high accuracy. I am a Lao student studying for a master's degree in computer science in China. Your project is very interesting, and I have an idea to do the same for Lao word segmentation and POS tagging. What should the Lao corpus that I prepare look like? I want to try to build such a thing for the Lao language.

Please give me any ideas.
