crf's Introduction

About

This project is an implementation of linear-chain Conditional Random Fields (CRF) in pure Java, with no third-party dependencies (except for Log4j).

The CRF is demonstrated on the task of part-of-speech (POS) tagging.

This is a free, open-source project that can also be used for commercial purposes. See the LICENSE file.

Advantages

  1. Self contained: all the algorithms are implemented in the package itself.
  2. Free open source, also for commercial use.
  3. Relatively efficient: although this is a "by the book" implementation, several techniques were devised and implemented to significantly improve run-time efficiency.
  4. Clear, readable and well-documented code, which may also be useful for educational purposes.

Self contained

The only required third-party library is Log4j. See http://logging.apache.org/log4j/1.2/ (note that Maven downloads it automatically; no user action is required).

The algorithms, including function optimization with the L-BFGS algorithm, the forward-backward algorithm and the Viterbi algorithm, are fully implemented in the code.
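
To give a feel for one of these algorithms, here is a minimal, self-contained sketch of Viterbi decoding over a linear chain, working with additive (log-space) scores. It is not taken from the project's code: the class name ViterbiSketch, the method decode and the layout of the score arrays are assumptions made purely for illustration.

```java
import java.util.Arrays;

/** Illustrative sketch only; not the project's implementation. */
final class ViterbiSketch {

    /**
     * Finds the highest-scoring tag sequence.
     *
     * @param emission   emission[t][j]: log-score of tag j at position t
     * @param transition transition[i][j]: log-score of moving from tag i to tag j
     * @return the best tag index for every position
     */
    static int[] decode(double[][] emission, double[][] transition) {
        int length = emission.length;
        if (length == 0) return new int[0];
        int tags = emission[0].length;

        double[][] best = new double[length][tags]; // best score of a path ending in tag j at t
        int[][] back = new int[length][tags];       // previous tag on that best path

        best[0] = Arrays.copyOf(emission[0], tags);
        for (int t = 1; t < length; t++) {
            for (int j = 0; j < tags; j++) {
                int argmax = 0;
                double max = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < tags; i++) {
                    double score = best[t - 1][i] + transition[i][j];
                    if (score > max) { max = score; argmax = i; }
                }
                best[t][j] = max + emission[t][j];
                back[t][j] = argmax;
            }
        }

        // Pick the best final tag, then follow the back-pointers to the start.
        int last = 0;
        for (int j = 1; j < tags; j++) {
            if (best[length - 1][j] > best[length - 1][last]) last = j;
        }
        int[] path = new int[length];
        path[length - 1] = last;
        for (int t = length - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }
}
```

The sketch runs in time proportional to (sequence length) x (number of tags squared), which is the usual cost of exact decoding in a first-order chain.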

Compile and run

The project can be compiled with Java SE 8.

This is a Maven project; compile it by running mvn compile.

To use this project as a library in another Maven project, add the following to the other project's POM file:

<dependency>
    <groupId>com.github.asher-stern</groupId>
    <artifactId>CRF</artifactId>
    <version>1.2.0</version>
</dependency>

Entry points

For those who are interested only in the CRF but not the POS-tagger, an example entry point is com.asher_stern.crf.crf.run.ExampleMain.

Note that this entry point is only a skeleton: the user should copy it and implement the feature generator and the other components required to run the CRF on the user's specific problem.

Those who are interested in the POS-tagging example can use the following two entry points: com.asher_stern.crf.postagging.demo.TrainAndEvaluate and com.asher_stern.crf.postagging.demo.UsePosTagger.

The first entry point is for training the POS-tagger and evaluating it, and the second is for running it on test examples.

Note that training requires the Penn Treebank corpus. It should be provided as a directory with no subdirectories that contains ".mrg" files, where each file contains parse trees.
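
The expected directory layout can be checked before training with a few lines of standard Java. The following is only a sketch: the class CorpusDirectoryCheck and the method listMrgFiles are hypothetical helpers, not part of this project.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper; not part of the project. */
final class CorpusDirectoryCheck {

    /**
     * Returns the ".mrg" files directly inside the given directory, sorted by path.
     * Throws IllegalArgumentException if the directory contains a subdirectory.
     */
    static List<Path> listMrgFiles(Path directory) throws IOException {
        try (Stream<Path> entries = Files.list(directory)) {
            List<Path> all = entries.sorted().collect(Collectors.toList());
            for (Path entry : all) {
                if (Files.isDirectory(entry)) {
                    throw new IllegalArgumentException("Unexpected subdirectory: " + entry);
                }
            }
            return all.stream()
                    .filter(p -> p.getFileName().toString().endsWith(".mrg"))
                    .collect(Collectors.toList());
        }
    }
}
```

An empty result from such a check would suggest that the path does not point at the directory that directly holds the ".mrg" files.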

A note about the POS-tagging example

Please note that this is not a state-of-the-art POS tagger, since it does not employ state-of-the-art features. Rather, this POS-tagger uses very simplistic features, and is intended to exemplify CRF.

Enjoy!

crf's People

Contributors

asher-stern

crf's Issues

Convergence criterion is wrong

The L-BFGS convergence criterion is wrong: it requires the difference between subsequent function values to be small.

This leads to too many L-BFGS iterations and makes the run time far too long.
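
The issue does not say which criterion should be used instead. One commonly used test for quasi-Newton optimizers compares the gradient norm with the norm of the current point. The sketch below puts that test next to the function-value test described above; the class ConvergenceSketch, its method names and the epsilon parameter are illustrative assumptions, not the project's API or its eventual fix.

```java
/** Illustrative sketch only; names and thresholds are assumptions. */
final class ConvergenceSketch {

    /** Euclidean norm of a vector. */
    static double norm(double[] v) {
        double sum = 0.0;
        for (double x : v) sum += x * x;
        return Math.sqrt(sum);
    }

    /** Gradient-based test: ||gradient|| <= epsilon * max(1, ||point||). */
    static boolean gradientConverged(double[] point, double[] gradient, double epsilon) {
        return norm(gradient) <= epsilon * Math.max(1.0, norm(point));
    }

    /** Function-value test as described in the issue: |f_k - f_(k-1)| <= epsilon. */
    static boolean valueDifferenceConverged(double previousValue, double currentValue, double epsilon) {
        return Math.abs(currentValue - previousValue) <= epsilon;
    }
}
```

Whether the gradient-based test would actually reduce the number of iterations here depends on the objective and on the chosen epsilon, so it should be treated as something to experiment with.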

Hello, I would like your e-mail address or other contact information.

Hi,
I am a student from China, and I am studying your CRF code.
However, I can't get the training data, the Penn Treebank corpus.
I know the corpus is not free, so I would just like 10 to 20 sentences from the corpus for testing.
Could you send me a few sentences to test with?
Thank you very much!

LiKun, Zhengzhou University, China

Make sure values don't become non-finite

It can happen (and in fact has happened to me) that values calculated during the forward-backward algorithm become larger than Double.MAX_VALUE and are represented as POSITIVE_INFINITY.

We can't really compute with these values; they lead to NaN values and wreck the training procedure.
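
The effect is easy to reproduce with plain double arithmetic. The short demo below is independent of the project (the class name NonFiniteDemo is made up for illustration) and shows an overflow turning into infinity, and infinity turning into NaN.

```java
/** Standalone demo of overflow to infinity and of NaN arising from infinities. */
final class NonFiniteDemo {

    /** A product that exceeds the largest finite double. */
    static double overflowedProduct() {
        return Double.MAX_VALUE * 2.0;
    }

    public static void main(String[] args) {
        double big = overflowedProduct();
        System.out.println(Double.isInfinite(big));    // true: the product overflowed
        System.out.println(Double.isNaN(big - big));   // true: Infinity - Infinity
        System.out.println(Double.isNaN(big / big));   // true: Infinity / Infinity
        System.out.println(Double.isNaN(big * 0.0));   // true: Infinity * 0
    }
}
```

Once a single NaN appears, every arithmetic result that depends on it is NaN as well, which is why one overflow can spoil a whole training run.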

I propose using the maximum finite double values in these cases. Add the following lines to CrfUtilities.safeAdd():

if (variable == Double.POSITIVE_INFINITY) variable = Double.MAX_VALUE;
if (variable == Double.NEGATIVE_INFINITY) variable = -Double.MAX_VALUE;

And also provide a function CrfUtilities.safeMultiply():

/**
 * Multiplies two doubles, replacing an infinite result with the largest
 * finite value of the same sign. Throws CrfException if the result is NaN.
 */
public static double safeMultiply(double variable, final double val2) {
    final double oldValue = variable;
    variable *= val2;
    // Clamp overflow to the finite range.
    if (variable == Double.POSITIVE_INFINITY) variable = Double.MAX_VALUE;
    if (variable == Double.NEGATIVE_INFINITY) variable = -Double.MAX_VALUE;
    // After clamping, only NaN can fail this check.
    if (!Double.isFinite(variable)) {
        // Note that we have not added the check for
        // ((val2 < 0.0) && (oldValue < variable)) || ((val2 > 0.0) && (oldValue > variable)),
        // because floating point arithmetic is inexact.
        throw new CrfException("Error: multiplying value to \"double\" variable yielded unexpected results. "
                + "variable was: " + String.format("%1$.3f", oldValue) + ", value to multiply was: " + String.format("%1$.3f", val2));
    }
    return variable;
}

And use these functions in all places where we do floating point calculations.

I don't know too much about the internals of this project, so I would like your feedback on this.
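
For experimenting with the idea outside the project, here is a self-contained sketch of the same clamping approach applied to a running product. The names ClampedArithmeticSketch, clampedMultiply and productOf are hypothetical, and IllegalStateException stands in for the project's CrfException.

```java
/** Illustrative sketch of the clamping proposal; not the project's code. */
final class ClampedArithmeticSketch {

    /** Multiplies a and b, saturating at +/- Double.MAX_VALUE instead of overflowing. */
    static double clampedMultiply(double a, double b) {
        double product = a * b;
        if (product == Double.POSITIVE_INFINITY) return Double.MAX_VALUE;
        if (product == Double.NEGATIVE_INFINITY) return -Double.MAX_VALUE;
        if (Double.isNaN(product)) {
            // Stand-in for CrfException in this standalone sketch.
            throw new IllegalStateException("NaN from " + a + " * " + b);
        }
        return product;
    }

    /** Multiplies all factors together using the clamped operation. */
    static double productOf(double[] factors) {
        double result = 1.0;
        for (double factor : factors) {
            result = clampedMultiply(result, factor);
        }
        return result;
    }
}
```

Note the trade-off: a clamped value stays finite but is no longer the true product, so later results computed from it are saturated approximations rather than exact values.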
