clojure-opennlp's Issues

treebank make-tree uses clojure reader, chokes on some tokens from natural language

Not a show stopper.

The parsing code here gets treebank strings from OpenNLP. The treebank strings
are very nearly s-expressions and are parsed as such. They are only "very nearly"
s-expressions because some tokens cannot be read by Clojure.
The code here uses the Clojure reader, so it throws when it sees a token it cannot read.
The general idea of going from treebank strings into trees of Clojure objects is
still worth pursuing. However, doing it perfectly will require either some pre-processing
or a modified reader.

Not every token that isn't a sequence is a symbol. Numbers, for example, are happily read by the reader but are not symbols. That's OK. The problem is that some tokens from natural text cannot be lexed by the reader at all. Time values like "2:30", for example, appear in natural-language input without quotes. Clojure tries to read anything that starts with a numeral as some kind of number, and the colon throws it off. Since the OpenNLP tokenizer leaves 2:30 as one token rather than splitting it into 2 : 30, Clojure throws.

My boss at work is a classic AI Lisp hacker and recommends not using the reader for things that are not Lisp s-expressions. He mentioned Lisp code he has that basically does the same thing, but can be modified to deal with this case. We work with his academic nephew, who is more familiar with the Clojure dialect. He suspects the use of Lisp features not available in Clojure. I'll check it out. Hopefully we can get the two of them in on the GitHub community fun.
(BTW, the features involve macro-related changes to the reader (table?).)

It's a plug-in fix. A modified reader would work with the same interface as read-string
and would quote odd tokens like 2:30 that the Clojure reader doesn't like, turning them into strings.
Using the existing reader for now is fine.
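
One way to sidestep the reader entirely is a small hand-written parser. The sketch below is not the library's make-tree: tokenize-treebank and parse-treebank are hypothetical names, and every tag and token is kept as a plain string so that tokens like 2:30 need no special handling. It assumes balanced parentheses and does no error handling.

```clojure
(defn tokenize-treebank
  "Split a treebank string into \"(\", \")\" and bare-token strings."
  [s]
  (re-seq #"\(|\)|[^\s()]+" s))

(defn parse-treebank
  "Parse a treebank string into nested vectors of strings.
  Uses an explicit stack of partially built nodes instead of the Clojure reader."
  [s]
  (loop [tokens (tokenize-treebank s)
         stack  [[]]]                       ; bottom entry collects top-level nodes
    (if-let [[t & more] (seq tokens)]
      (case t
        "(" (recur more (conj stack []))    ; open a new node
        ")" (let [node   (peek stack)       ; close it and attach to its parent
                  stack' (pop stack)]
              (recur more (conj (pop stack') (conj (peek stack') node))))
        (recur more (conj (pop stack) (conj (peek stack) t))))
      (first (peek stack)))))

(parse-treebank "(TOP (S (NP (CD 2:30)) (. .)))")
;; => ["TOP" ["S" ["NP" ["CD" "2:30"]] ["." "."]]]
```

Keeping tokens as strings trades the symbol-based representation for robustness; converting tags to symbols or keywords afterwards is a separate, optional step.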

Proposal for treebank-parser tree structure

Hey @dakrone,

I am particularly interested in the treebank-parser.

One cool representation would be a one-to-one translation from the string representation of the tree into a Clojure list, with the first element being the tag and the rest being the chunk!
This would be visually easier to understand, and would stick with Lisp's usual representation of data!
This could be done using some reader tricks:

(load-string (str "(quote "
                  (first (treebank-parser ["This is a sentence ."]))
                  ")"))
;;=> (TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN sentence))) (. .)))

But it would be better to have it generated when the parse is being done...
Whadda ya think ?
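
A variation on the snippet above, as a sketch: read-string reads the string as data without evaluating it, so neither the quote wrapper nor load-string is needed. The parse string here is hard-coded from the output shown above rather than produced by treebank-parser, and parse-str and tree are illustrative names.

```clojure
;; Parse string copied from the example output above (not computed here).
(def parse-str
  "(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN sentence))) (. .)))")

;; read-string returns a nested list of symbols and never evaluates it.
(def tree (read-string parse-str))

(first tree)                         ; the root tag, as a symbol
(-> tree second second second)       ; the first leaf node: (DT This)
```

This inherits the reader's limitations. For instance, commas are whitespace to the Clojure reader, so a (, ,) node reads as an empty list.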

Tokenizing not happening perfectly

My code is taken from the README:

(use 'clojure.pprint) ; just for this documentation
(use 'opennlp.nlp)

(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))

(pprint (pos-tag (tokenize "john macharty quits")))
;; the verb is tagged as a noun
(["john" "NN"] ["macharty" "NN"] ["quits" "NNS"])

;; here, too, the verb is tagged as a noun
(pprint (pos-tag (tokenize "bl joshi quits")))
(["bl" "JJ"] ["joshi" "NNP"] ["quits" "NNS"])

The verb "quits" is tagged as a plural noun (NNS). Please see the comments in the code.

Am I doing something wrong?
As far as I can see, we are using the latest version of OpenNLP.

Is there an online demo of OpenNLP, like Stanford's at http://nlp.stanford.edu:8080/parser/index.jsp, that could be used to compare them?

How to deal with indeterminacy?

Evaluating (treebank-parser ["What can happen in a second ."]) using the set-up in the README file here, I get the following parse:

(TOP
 (SBARQ
  (WHNP (WP What))
  (SQ
   (VP (MD can) (VP (VB happen) (PP (IN in) (NP (DT a) (JJ second))))))
  (. .)))

Actually I'm pretty sure the JJ should be an NN. Is this alternative known to the OpenNLP engine at some point in its parse, and if so, can I get it to report on the known alternative(s)?

bare clojure.java.io/readers, writers, input-streams etc etc all over tools/train.clj

None of the clojure.java.io readers, writers, input-streams, or output-streams in train.clj are wrapped in the with-open macro. This strikes me as very odd, because they are handled correctly in all the other namespaces but not in train.clj, which uses them the most! Unless I'm missing something important, this should be fixed ASAP. It took 3 minutes to fix in my fork.
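
For illustration, the pattern in question looks like this. It is a generic sketch, not code from train.clj, and copy-lines is a hypothetical helper.

```clojure
(require '[clojure.java.io :as io])

(defn copy-lines
  "Copy src to dest line by line. with-open closes both the reader and the
  writer when the body exits, whether normally or by exception."
  [src dest]
  (with-open [rdr (io/reader src)
              wtr (io/writer dest)]
    (doseq [line (line-seq rdr)]
      (.write ^java.io.Writer wtr (str line "\n")))))
```

A bare (io/reader src) outside with-open stays open until it is closed explicitly or garbage-collected, which can leak file handles in long-running training code.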

NullPointerException when chunk-filter encounters a phrase with {:tag nil}

The chunker occasionally outputs a chunk with a nil tag in cases where the chunk isn't part of a detected phrase, such as a sentence that starts with a coordinating conjunction like "And".

(use 'clojure.pprint)
(use 'opennlp.nlp)
(use 'opennlp.treebank)
(use 'opennlp.tools.filters)

(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))

(pprint
  (noun-phrases
    (chunker
      (pos-tag
        (tokenize "And when the party entered the assembly room, it consisted of only five altogether; Mr. Bingley, his two sisters, the husband of the eldest, and another young man.")))))

Results in:

NullPointerException   java.util.regex.Matcher.getTextLength (:-1)

Because the first phrase has the nil tag:

(pprint (noun-phrases
          '({:phrase ["And"], :tag nil})))

For reference, bin/opennlp ChunkerME en-chunker.bin handles the same text this way, not putting the coordinating conjunction in a phrase at all:

And_CC when_WRB the_DT party_NN entered_VBD the_DT assembly_NN room,_NN it_PRP consisted_VBD of_IN five_CD altogether._.
=>
 And_CC [ADVP when_WRB ] [NP the_DT party_NN ] [VP entered_VBD ] [NP the_DT assembly_NN room,_NN ] [NP it_PRP ] [VP consisted_VBD ] [PP of_IN ] [NP five_CD ] altogether._.

The nil tag is probably a good way to represent this, except for the fact that re-find throws an exception when passed a nil string.

I've fixed this in my fork by removing nil phrases before filtering, but this has the side-effect of making it impossible to filter to select the nil phrases themselves. This may be an acceptable trade-off. I'm not sure.

(defmacro fixed-chunk-filter
  "Declare a filter for treebank-chunked lists with the given name and regex."
  [n r]
  (let [docstring (str "Given a list of treebank-chunked elements, "
                       "return only the " n " in a list.")]
    `(defn ~n
       ~docstring
       [elements#]
       (filter (fn [t#] (re-find ~r (:tag t#))) 
               (remove #(nil? (:tag %)) elements#)))))
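
An alternative sketch that avoids the trade-off: guard the re-find call instead of removing the nil-tagged chunks, so they are never matched by a regex but can still be selected on their own. The names tag-filter, noun-phrases*, and untagged are hypothetical, and the regex is an assumption rather than the one the library uses.

```clojure
(defn tag-filter
  "Return a function that keeps the chunks whose :tag matches regex r.
  Chunks with a nil :tag never match, and re-find is never called on nil."
  [r]
  (fn [elements]
    (filter (fn [t]
              (when-let [tag (:tag t)]
                (re-find r tag)))
            elements)))

;; Assumed regex, for illustration only.
(def noun-phrases* (tag-filter #"^NP$"))

;; The nil-tagged chunks remain selectable with a plain predicate.
(def untagged (partial filter (comp nil? :tag)))

(def chunks [{:phrase ["And"] :tag nil}
             {:phrase ["the" "party"] :tag "NP"}
             {:phrase ["entered"] :tag "VP"}])

(noun-phrases* chunks)
;; => ({:phrase ["the" "party"], :tag "NP"})
(untagged chunks)
;; => ({:phrase ["And"], :tag nil})
```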

IOException Mark invalid java.BufferedReader.reset (BufferedReader.java:505)

I'm getting this error when I try to use the following sample code from the readme:

(with-open [rdr (clojure.java.io/reader "/tmp/bigfile")]
  (let [sentences (sentence-seq rdr get-sentences)]
    ;; process your lazy seq of sentences however you desire
    (println "first 5 sentences:")
    (clojure.pprint/pprint (take 5 sentences))))

My file exists and I'm able to do (slurp "/tmp/bigfile").
I'm new to Clojure, so I'm sorry if it's a basic Java interop issue. That said, I successfully imported the get-sentences and sentence-seq functions and have otherwise been able to use the library without problems.
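
For context, "Mark invalid" is the message java.io.BufferedReader throws from reset when more characters were read after mark than the read-ahead limit allows and the buffer had to be refilled. Whether sentence-seq runs into exactly this path (for example, on a file with a long stretch of text before a sentence boundary) is an assumption. The sketch below only reproduces the JDK behaviour in isolation; reset-after-reading is a hypothetical helper.

```clojure
(import '(java.io BufferedReader StringReader IOException))

(defn reset-after-reading
  "Mark the start of a reader with the given read-ahead limit, read n chars,
  then try to reset. Returns :ok, or the exception message if reset fails.
  The 2-char buffer is deliberately tiny so that the limit is actually enforced."
  [text limit n]
  (with-open [rdr (BufferedReader. (StringReader. text) 2)]
    (.mark rdr limit)
    (dotimes [_ n] (.read rdr))
    (try
      (.reset rdr)
      :ok
      (catch IOException e
        (.getMessage e)))))

(reset-after-reading "abcdefgh" 8 3)  ; stays within the limit
;; => :ok
(reset-after-reading "abcdefgh" 1 3)  ; reads past the limit
;; => "Mark invalid"
```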

CompilerException clojure.lang.ArityException

I am trying to use the library but am getting an error. I am working on OS X Yosemite (version 10.10.1) and installed OpenNLP using brew install apache-opennlp.

(defproject firstattempt "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.7.0"]
                 [clojure-opennlp "0.3.3"]])

user=> (use 'clojure.pprint)
nil
user=> (use 'opennlp.nlp)
nil
user=> (use 'opennlp.treebank)

CompilerException clojure.lang.ArityException: Wrong number of args (2) passed to: StringReader, compiling:(abnf.clj:189:28) 

The chunker needs punctuation to work properly

Using the definitions of tokenize, pos-tag, and chunker from the readme, and 1.5.1 versions of the model files, the following behaviour is observed:

(-> "I am looking for a good way to annotate this English text."
    tokenize pos-tag chunker phrases)
;; => (["I"] ["am" "looking"] ["for"] ["a" "good" "way"] ["to" "annotate"] ["this" "English" "text"])

;; cf. the same operation, when the text is not full-stop terminated:
(-> "I am looking for a good way to annotate this English text"
    tokenize pos-tag chunker phrases)
;; => (["I"] ["am" "looking"] ["for"] ["a" "good" "way"] ["to" "annotate"] ["this" "English"])

The pos-tag output seems correct however.
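
Until the cause is found, one possible workaround, judging only from the two outputs above, is to terminate the text before tokenizing. ensure-full-stop is a hypothetical helper, not part of clojure-opennlp, and the appended "." will show up as an extra token in the tokenize and pos-tag output.

```clojure
(require '[clojure.string :as str])

(defn ensure-full-stop
  "Trim trailing whitespace and append \".\" unless the text already ends
  in sentence-final punctuation (., ! or ?)."
  [text]
  (let [trimmed (str/trimr text)]
    (if (re-find #"[.!?]$" trimmed)
      trimmed
      (str trimmed "."))))

(ensure-full-stop "I am looking for a good way to annotate this English text")
;; => "I am looking for a good way to annotate this English text."
```

It would slot in at the front of the pipeline, as in (-> text ensure-full-stop tokenize pos-tag chunker phrases).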

NoClassDefFoundError for instaparse when creating uberjar

Hey,

There seems to be an error when you try to run the uberjar:

Exception in thread "main" java.lang.NoClassDefFoundError: instaparse/print$parser__GT_str (wrong name: instaparse/print$Parser__GT_str)

The issue is due to the outdated dependency on instaparse. Updating the dependency should solve it.

opennlp library

The Java OpenNLP library is missing from the dependencies in project.clj.

build-posdictionary is broken?

Hi,

I tried to train a new language model and found that build-posdictionary is not working.

Here's the snippet of code that I'm using:

(def tagdict (build-posdictionary "jv-tagdict"))
(def pos-model (train-pos-tagger "jv" "workdir/jv-pos.train" tagdict))

I'm using opennlp-tools "1.5.3", clojure "1.5.1" and clojure-opennlp "0.3.1-SNAPSHOT". Here's the error message:

Exception in thread "main" java.lang.ClassCastException: java.io.BufferedReader cannot be cast to java.io.InputStream, compiling:(jv-pos-learn.clj:23:14)
...
Caused by: java.lang.ClassCastException: java.io.BufferedReader cannot be cast to java.io.InputStream
    at opennlp.tools.train$build_posdictionary.invoke(train.clj:49)
...

Does anyone have any ideas?

Thanks.
Jim
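
The trace suggests that a java.io.BufferedReader (which is what clojure.java.io/reader returns) is being passed where OpenNLP expects a java.io.InputStream. The contents of train.clj line 49 are not shown here, so the following only illustrates the distinction; stream-kinds is a hypothetical helper. If that reading of the trace is right, opening the dictionary with clojure.java.io/input-stream instead would be the likely fix.

```clojure
(require '[clojure.java.io :as io])

(defn stream-kinds
  "Open path both ways and report which result is a java.io.InputStream."
  [path]
  (with-open [r  (io/reader path)         ; character stream (BufferedReader)
              in (io/input-stream path)]  ; byte stream (BufferedInputStream)
    {:reader       (instance? java.io.InputStream r)
     :input-stream (instance? java.io.InputStream in)}))
```

Called on any readable file, the :reader entry is false and the :input-stream entry is true, which matches the cast failure in the trace.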

Custom Feature generation impossible via 'make-name-finder'

Hi there,

It seems that make-name-finder does not take into account the several constructors of NameFinderME. More specifically, there is no way to use the constructor that accepts a custom feature generator. I propose the following, which is not a breaking change:

(defmethod make-name-finder TokenNameFinderModel
  [model & {:keys [feature-generator]}] ;;optional arg - defaults to nil
  (fn name-finder
    [tokens & contexts]
    {:pre [(seq tokens)
           (every? #(= (class %) String) tokens)]}
    (let [finder (NameFinderME. model feature-generator *beam-size*) ;can be nil - no problem
          matches (.find finder (into-array String tokens))
          probs (seq (.probs finder))]
      (with-meta
        (distinct (Span/spansToStrings matches (into-array String tokens)))
        {:probabilities probs})))) 

Jim
