Vietnamese Named Entity Recognition (vie-ner-lstm)


Code by Thai-Hoang Pham at Alt Inc.

1. Introduction

vie-ner-lstm is a fast implementation of the system described in the paper "The Importance of Automatic Syntactic Features in Vietnamese Named Entity Recognition". The system recognizes named entities in Vietnamese text and is written in Python 2.7. Its architecture consists of two bidirectional LSTM layers followed by a feed-forward neural network. A softmax function then predicts the output sequence.
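The final softmax step can be illustrated with a minimal sketch in plain Python. It assumes the network has already produced one row of raw scores per token; the tag list and the score values are hypothetical and are not taken from the repository.

```python
import math

# Hypothetical tag inventory, for illustration only.
# The real label set comes from the training data.
TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_tags(score_rows, tags=TAGS):
    """Pick the most probable tag for each token independently."""
    predictions = []
    for scores in score_rows:
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        predictions.append(tags[best])
    return predictions


# Two tokens with made-up scores: the first favours B-LOC, the second O.
rows = [
    [0.1, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
    [3.0, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0],
]
print(predict_tags(rows))
```

Each token is labelled independently here, which matches a per-token softmax output layer. The scores themselves would come from the LSTM and feed-forward layers.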

Our system achieved an F1 score of 92.05% on the VLSP standard test set. The table below shows the performance of the system with each feature set.

Word2vec POS Chunk Regex F1
62.87%
x 74.02%
x x 85.90%
x x 86.79%
x x 74.13%
x x x x 92.05%
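This README does not say how the F1 score is computed. The sketch below assumes the common entity-level convention for NER, where a predicted entity counts as correct only if both its span and its type match the gold annotation exactly. The function names are illustrative and the code is written for Python 3.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    entities = set()
    start, etype = None, None
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def f1_score(gold_sentences, pred_sentences):
    """Entity-level F1 over parallel lists of BIO tag sequences."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = extract_entities(gold), extract_entities(pred)
        correct += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [["B-LOC", "O", "B-PER", "I-PER"]]
pred = [["B-LOC", "O", "B-PER", "O"]]
# One of two gold entities is matched exactly, so precision = recall = 0.5.
print(f1_score(gold, pred))
```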

2. Installation

This software depends on NumPy and Keras. You must install them before using vie-ner-lstm.

The simplest way to install them is with pip:

	# pip install -U numpy keras

3. Usage

3.1. Data

The input data format of vie-ner-lstm follows the VLSP 2016 campaign format. The dataset has four columns: word, POS tag, chunk tag, and named entity tag. For details, see the sample data in the 'data' directory. The table below shows an example Vietnamese sentence from the VLSP dataset.

Word POS Chunk NER
Từ E B-PP O
Singapore NNP B-NP B-LOC
, CH O O
chỉ R O O
khoảng N B-NP O
vài L B-NP O
chục M B-NP O
phút Nu B-NP O
ngồi V B-VP O
phà N B-NP O
V B-VP O
dến V B-VP O
được R O O
Batam NNP B-NP B-LOC
. CH O O

To access the full dataset of the VLSP 2016 campaign, you need to sign the user agreement of the VLSP consortium.
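The four-column format described above can be read with a short parser. This is a minimal sketch with two assumptions that this README does not confirm: columns are separated by tabs, and sentences are separated by blank lines. Check the sample files in the 'data' directory before relying on it. The function name read_vlsp is illustrative.

```python
def read_vlsp(lines):
    """Group column-formatted lines into sentences.

    Each sentence is a list of (word, pos, chunk, ner) tuples.
    Assumes tab-separated columns and blank lines between sentences.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError("expected 4 columns, got %d: %r" % (len(fields), line))
        current.append(tuple(fields))
    if current:
        sentences.append(current)
    return sentences


sample = "Từ\tE\tB-PP\tO\nSingapore\tNNP\tB-NP\tB-LOC\n\nBatam\tNNP\tB-NP\tB-LOC\n"
for sentence in read_vlsp(sample.splitlines()):
    print([(word, ner) for word, _pos, _chunk, ner in sentence])
```

Splitting on tabs rather than on any whitespace keeps a word intact if it contains spaces.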

3.2. Command-line Usage

Run vie-ner-lstm with the following command:

	$ bash ner.sh

Arguments in the ner.sh script:

  • word_dir: path for word dictionary
  • vector_dir: path for vector dictionary
  • train_dir: path for training data
  • dev_dir: path for development data
  • test_dir: path for testing data
  • num_lstm_layer: number of LSTM layers used in this system
  • num_hidden_node: number of hidden nodes in a hidden LSTM layer
  • dropout: dropout rate for input data (a float between 0 and 1)
  • batch_size: size of the input batch used for training
  • patience: patience value used for early stopping during training
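The patience argument can be illustrated with a small sketch of early stopping. It assumes the usual meaning: training stops once the development score has failed to improve for `patience` consecutive epochs. Which metric the script monitors is not stated in this README, and the EarlyStopper class below is illustrative rather than part of the repository.

```python
class EarlyStopper:
    """Stop training after `patience` epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def should_stop(self, dev_score):
        """Record one epoch's development score (higher is better)."""
        if dev_score > self.best:
            self.best = dev_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopper(patience=2)
# Made-up development scores, one per epoch.
for epoch, score in enumerate([0.70, 0.75, 0.74, 0.73, 0.76], start=1):
    if stopper.should_stop(score):
        print("stopping after epoch %d" % epoch)
        break
```

With a patience of 2, the loop stops after the fourth epoch because epochs three and four both fail to beat the best score of 0.75. Keras ships an EarlyStopping callback with a patience argument that follows the same idea.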

Note: The first time you run vie-ner-lstm, the system automatically downloads Vietnamese word embeddings from the internet. This may take a long time because the embedding set is about 1 GB. If the automatic download fails, you can download the files manually from here (vector, word) and put them into the embedding directory.
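The download-if-missing behaviour described in the note can be sketched as follows. This is not the repository's own code: the function name ensure_file, the file path, and the URL are placeholders, and the sketch targets Python 3 (urllib.request does not exist in Python 2.7).

```python
import os
import urllib.request


def ensure_file(path, url, fetch=urllib.request.urlretrieve):
    """Download `url` to `path` unless the file already exists.

    Returns True if a download happened, False if the file was present.
    `fetch` is injectable so the logic can be exercised without a network.
    """
    if os.path.exists(path):
        return False
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    fetch(url, path)
    return True


# Hypothetical usage; substitute the real download link and file name:
# ensure_file("embedding/vectors.npy", "https://example.invalid/vectors.npy")
```

Checking for the file first means a manual download placed in the embedding directory is picked up without any network access.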

4. References

Thai-Hoang Pham, Phuong Le-Hong, "The Importance of Automatic Syntactic Features in Vietnamese Named Entity Recognition", Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation (PACLIC 31), 2017.

@inproceedings{Pham:2017,
  title={The Importance of Automatic Syntactic Features in Vietnamese Named Entity Recognition},
  author={Thai-Hoang Pham and Phuong Le-Hong},
  booktitle={Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation},
  year={2017},
}

5. Contact

Thai-Hoang Pham < [email protected] >

Alt Inc, Hanoi, Vietnam

vie-ner-lstm's Issues

Beginner

Hi Hoàng,

I am new to ML and am learning about NER for Vietnamese. I cloned your code and ran it, but I have a few questions that I hope you can help with:

  • The word embeddings file is 1 GB. I would like to learn how to create that file, because I want to learn and understand this from scratch. If possible, could you share some links or documents on how to create it? I want to build the file myself in order to understand it.
  • About NER training: can I reuse the results after training? I imagine building a public RESTful API that uses the trained results. Is this feasible? If so, how should I go about it?
    Thank you, Hoàng

How to download data?

How can I download data to test this tool? Why does the test data have the NER column? And what is the meaning of development data?

export model

I want to export the model and use it, but I do not know how to save the state after training. Please help me. Thank you.

How can I understand this graph in your introduction?

I can't understand this graph.
Is the sequence in the first input layer "Anh roi EU", with each word passed to the word embedding layer and then fed to the LSTM? Or is the word "Anh" transformed into [Anh, pos_tagging, chunk_tagging, regex_tagging], passed to the word embedding layer, and then fed to the LSTM, while the words "roi" and "EU" help to update the parameters of the model?
Could you answer and explain in more detail?
