Giter Site home page Giter Site logo

mitlm's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

mitlm's Issues

Tokens beginning with # cause a crash when using count files

(Reporting this here as well as https://code.google.com/p/mitlm/issues/detail?id=44 in case github gets more attention these days)

The crash only happens if the ngram order is higher than 1, and only if the # occurs at the start of a token.

I'm guessing this is because it interprets a # at the beginning of a line in a text counts file as a comment and skips it, meaning a unigram beginning with a # is missing from the term dictionary when it's encountered in a later bigram.

What steps will reproduce the problem?

$ estimate-ngram -wc counts -text <(echo 'a #hashtag')
0.001   Loading corpus /dev/fd/63...
0.002   Smoothing[1] = ModKN
0.002   Smoothing[2] = ModKN
0.002   Smoothing[3] = ModKN
0.002   Set smoothing algorithms...
0.002   Saving counts to counts...

$ cat counts
<s>     1
a       1
#hashtag        1
<s> a   1
a #hashtag      1
#hashtag </s>   1
<s> a #hashtag  1
a #hashtag </s> 1

$ estimate-ngram -counts counts -wl lm.arpa
0.001   Loading counts counts...
estimate-ngram: src/NgramModel.cpp:800: void mitlm::NgramModel::_ComputeBackoffs(): Assertion `allTrue(backoffs != NgramVector::Invalid)' failed.
Aborted (core dumped)

What version of the product are you using? On what operating system?

Built from latest master on github. Ubuntu 14.04.1

Crash where suffix of an ngram not present in count file

What steps will reproduce the problem?

Create a large counts file in which there is an ngram (e.g. "foo bar baz") whose suffix ngram ("bar baz") doesn't exist earlier in the file.

Run estimate-ngram -wl lm.arpa -counts counts on it.

Note this doesn't always happen consistently for me with smaller count files, but seems to replicate fairly consistently with larger (or at least middle-sized) files.

What is the expected output? What do you see instead?

I'd ideally expect it allow a language model to be built in this case, even if it means removing/skipping over the ngram in question, or making some assumption about the count for the missing suffix (e.g. same as the higher-order ngram).

I realise that these missing suffixes won't occur if I use MITLM itself to compute the counts from a corpus, however if dealing with large amounts of count-based source data from some other tools/sources (e.g. MapReduce jobs), it's possible for these kinds of constraints to be violated accidentally due to data corruption or bugs beyond your control, and so it would be convenient if MITLM could cope gracefully with these cases.

Alternatively if this is a WONTFIX then it would be good to at least document what the constraint is on acceptable input for counts files, and give a more friendly error message if the constraint is violated, so people know how to fix up their input files in order to get MITLM to work.

Currently what you see is:

estimate-ngram: src/NgramModel.cpp:811: void mitlm::NgramModel::_ComputeBackoffs(): Assertion `allTrue(backoffs != NgramVector::Invalid)' failed.
Aborted (core dumped)

What version of the product are you using? On what operating system?

Built from latest github master, Ubuntu 14.04.1

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.