xxcharles / partial-crfsuite

This project forked from oneplus/partial-crfsuite

CRFsuite with partial annotation. Used in our paper 'Domain adaptation for CRF-based Chinese word segmentation using free annotations'

Home Page: https://aclweb.org/anthology/D/D14/D14-1093.pdf

License: Other

CRFsuite for partially annotated data

This CRF toolkit is a fork of CRFsuite, a fast implementation of Conditional Random Fields (CRFs), extended to support partially annotated training data.

Partially annotated data

In sequence labeling problems, we may have access to data in which the labels for certain subsequences are known or constrained to a smaller set, while the labels for the rest of the sequence are unknown. Figure 1 illustrates an example of such data.

Figure 1

In character-based Chinese word segmentation (which, like named entity recognition, assigns a label to each token, here one of B, M, E, and S), if we know that a subsequence of characters forms a word, the labels of those characters and of the characters at its boundaries can be constrained to a smaller set.

In the above example, 狐岐山 (the Huqi Mountain) in the unannotated sentence is recognized as a word. As a result, we obtain a partially annotated sentence in which the segmentation ambiguity of the characters 狐 (fox), 岐 (branch road) and 山 (mountain) is resolved (狐 being the beginning, 岐 being the middle and 山 being the end of the same word). At the same time, the segmentation ambiguity of the surrounding characters 在 (at) and 救 (save) is reduced (在 being either a single-character word or the end of a multi-character word, and 救 being either a single-character word or the beginning of a multi-character word).
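The constraint rule described above can be sketched in a few lines of Python. This is only an illustration, not code from this repository: the function name constrain_labels and the half-open (start, end) span convention are assumptions made for the example.

```python
FULL = {"b", "m", "e", "s"}


def constrain_labels(chars, word_spans):
    """Return one set of allowed labels per character.

    word_spans is a list of half-open (start, end) index pairs, each marking
    a subsequence that is known to form a word (hypothetical convention).
    """
    allowed = [set(FULL) for _ in chars]
    for start, end in word_spans:
        # Characters inside a known word are fully determined.
        if end - start == 1:
            allowed[start] = {"s"}
        else:
            allowed[start] = {"b"}
            for i in range(start + 1, end - 1):
                allowed[i] = {"m"}
            allowed[end - 1] = {"e"}
        # The character before the word must end a word: e or s.
        if start > 0:
            allowed[start - 1] &= {"e", "s"}
        # The character after the word must begin a word: b or s.
        if end < len(chars):
            allowed[end] &= {"b", "s"}
    return allowed


def format_labels(labels):
    """Join a label set with '|' in the fixed order b, m, e, s."""
    return "|".join(sorted(labels, key="bmes".index))


chars = list("在狐岐山救治")
allowed = constrain_labels(chars, [(1, 4)])  # 狐岐山 is a known word
print(" ".join(format_labels(a) for a in allowed))
# prints: e|s b m e b|s b|m|e|s
```

Characters that are neither inside nor adjacent to a known word keep the full label set, which is how the unknown part of the sequence is represented.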

For more details, please refer to our EMNLP 2014 paper.

Using partial-crfsuite

Compiling and running

Since this toolkit is a fork of CRFsuite, it is compiled and run in exactly the same way as the original software. If you are not familiar with CRFsuite, refer to its official website, which provides detailed documentation.

Input data format

The major difference between CRFsuite and partial-crfsuite is that the latter accepts training data with fuzzy or multiple labels. In the input training data, labels come in the first column, as in CRFsuite. To represent multiple labels, all the possible labels are joined with a | delimiter. Taking the partially annotated sentence in Figure 1 as an example, the training data can be:

e|s	u[-1]=_bos_	u[0]=在	u[1]=狐
b	u[-1]=在	u[0]=狐	u[1]=歧
m	u[-1]=狐	u[0]=歧	u[1]=山
e	u[-1]=歧	u[0]=山	u[1]=救
b|s	u[-1]=山	u[0]=救	u[1]=治
b|m|e|s	u[-1]=救	u[0]=治	u[1]=碧
b|m|e|s	u[-1]=治	u[0]=碧	u[1]=瑶
b|m|e|s	u[-1]=碧	u[0]=瑶	u[1]=,
b|m|e|s	u[-1]=瑶	u[0]=,	u[1]=_eos_
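Lines in this format can be generated with a short script. The following Python sketch assumes the per-character label sets are already available and uses only the three unigram features shown above; the function name to_crfsuite_lines is hypothetical, and a real feature extractor would likely emit more features.

```python
def to_crfsuite_lines(chars, allowed):
    """Build one tab-separated training line per character.

    chars   -- list of characters in the sentence
    allowed -- list of label sets, one per character (an assumed input)
    """
    padded = ["_bos_"] + list(chars) + ["_eos_"]
    lines = []
    for i, labels in enumerate(allowed):
        # First column: all possible labels joined by '|', in b, m, e, s order.
        label_field = "|".join(sorted(labels, key="bmes".index))
        # Remaining columns: previous, current, and next character.
        features = [f"u[{k}]={padded[i + 1 + k]}" for k in (-1, 0, 1)]
        lines.append("\t".join([label_field] + features))
    return lines


chars = list("在狐歧山救")
allowed = [{"e", "s"}, {"b"}, {"m"}, {"e"}, {"b", "s"}]
for line in to_crfsuite_lines(chars, allowed):
    print(line)
```

A fully annotated sentence is simply the special case in which every label set contains exactly one label.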

Training

We have prepared example training data in ./partial-data/train.crfsuite, which you can use to train a model.

./frontend/crfsuite learn -m train.model -a lbfgs partial-data/train.crfsuite

So far, only the lbfgs algorithm is supported by partial-crfsuite; the other learning algorithms have not been tested. Tests and patches are welcome :-)

Test

You can use the following command to test the model trained in the previous section. An evaluation script is also provided to measure segmentation performance.

./frontend/crfsuite tag -m train.model partial-data/test.crfsuite | ./partial-data/eval.py -r ./partial-data/test.reference
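For readers who want to see how segmentation performance is typically scored, here is a minimal Python sketch of word-level precision, recall, and F-score over b/m/e/s tag sequences. It is not the eval.py script shipped in ./partial-data, whose internals are not described here; the function names are hypothetical, and the sketch assumes well-formed tag sequences.

```python
def tags_to_spans(tags):
    """Convert b/m/e/s tags into a set of half-open (start, end) word spans.

    A word is closed at every 'e' or 's' tag; any trailing characters after
    the last such tag are grouped into one final word.
    """
    spans, start = set(), 0
    for i, tag in enumerate(tags):
        if tag in ("e", "s"):
            spans.add((start, i + 1))
            start = i + 1
    if start < len(tags):
        spans.add((start, len(tags)))
    return spans


def prf(gold_tags, pred_tags):
    """Word-level precision, recall, and F-score for one sentence."""
    gold, pred = tags_to_spans(gold_tags), tags_to_spans(pred_tags)
    correct = len(gold & pred)  # words with identical boundaries
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = "s b m e s".split()
pred = "s b e s s".split()
print(prf(gold, pred))
```

A predicted word counts as correct only when both of its boundaries match the reference, so a single wrong tag can cost more than one word.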

LICENSE

For licensing details, please consult the LICENSE section in the original README file.
