dakrone / clojure-opennlp

749.0 64.0 81.0 32.82 MB

Natural Language Processing in Clojure (opennlp)

License: Eclipse Public License 1.0

Clojure 100.00%

clojure-opennlp's Issues

treebank make-tree uses clojure reader, chokes on some tokens from natural language

Not a show stopper.

The parsing code here gets treebank strings from OpenNLP. The treebank strings
are very nearly s-expressions and are parsed as such. They are only "very nearly"
s-expressions, not perfectly so because of tokens that are not parsed by clojure.
The code here uses the Clojure reader, so it crashes when it sees a token it doesn't like.
The general idea of going from treebank strings into trees of clojure objects is
still worth pursuing. However, doing it perfectly will require either some pre-processing
or a modified reader.

Not everything that isn't a sequence is a symbol, and not all scalars are symbols. Numbers, for example, are happily read by the reader but are not symbols. That's OK. The problem is that some tokens from natural text cannot be lexed by the Clojure reader at all. Time values like "2:30" are one example: they appear in the natural language input without quotes, Clojure tries to read anything that starts with a numeral as some kind of number, and the colon throws it off. Since the OpenNLP tokenizer leaves 2:30 intact rather than splitting it into 2 : 30, the reader throws.

My boss at work is a classic AI LISP hack and recommends not using the reader for things that are not Lisp s-expressions. He mentioned Lisp code he has that does basically the same thing, but can be modified to deal with this case. We work with his academic nephew who is more familiar with the Clojure dialect. He suspects the use of Lisp features not available in Clojure. I'll check it out. Hopefully we can get the two of them in on the GitHub community fun.
(BTW, the features involve macro-related changes to the reader (table?))

It's a plug-in fix. A modified reader would work with the same interface as read-string,
and quote odd stuff like 2:30 that the Clojure reader doesn't like, turning each such token into a string.
Using the existing reader for now is fine.
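
For illustration, here is a minimal sketch of the "don't use the reader" route: a small hand-written parser that only treats parentheses as structure and keeps every other token as a plain string, so "2:30" never reaches the Clojure reader. The names tokenize-treebank and parse-treebank are hypothetical and are not part of clojure-opennlp. The sketch assumes the treebank string is balanced and that leaves contain no literal parentheses.

```clojure
(defn tokenize-treebank
  "Split a treebank string into \"(\", \")\" and leaf tokens (hypothetical helper)."
  [s]
  (re-seq #"\(|\)|[^\s()]+" s))

(defn parse-treebank
  "Parse a treebank string into nested vectors of strings.
  Every tag and word stays a string, so odd tokens are never read as Clojure."
  [s]
  (loop [tokens (tokenize-treebank s)
         stack  [[]]]                       ; stack of partially built nodes
    (if-let [[t & more] (seq tokens)]
      (case t
        "(" (recur more (conj stack []))    ; open a new node
        ")" (let [node   (peek stack)       ; close it and attach to its parent
                  outer  (pop stack)]
              (recur more (conj (pop outer) (conj (peek outer) node))))
        (recur more (conj (pop stack) (conj (peek stack) t))))
      (first (peek stack)))))

(parse-treebank "(TOP (NP (DT the) (NN meeting)) (PP (IN at) (NP (CD 2:30))))")
```

Keeping everything as strings loses the symbol representation, but callers could convert tags to symbols or keywords afterwards if they want that.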

Proposal for treebank-parser tree structure

Hey @dakrone,

I am particularly interested in the treebank-parser.

One cool representation would be a one-to-one translation of the tree's string representation into a Clojure list, with the first element being the tag and the rest being the chunk!
This would be easier to read, and it sticks with Lisp's usual way of representing data.
It could be done using some reader tricks:

(load-string (str "(quote "
                  (first (treebank-parser ["This is a sentence ."]))
                  ")"))
;;=> (TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN sentence))) (. .)))

But it would be better to have it generated while the parse is being done...
What do you think?

java.io.FileNotFoundException: Could not locate opennlp/nlp__init.class or opennlp/nlp.clj on classpath

I keep getting this issue. It seems like it might be because the opennlp.jar file doesn't exist. This blog http://writequit.org/blog/?p=365 says it can be found here:
http://github.com/dakrone/clojure-opennlp/tree/master/lib/
but the directory doesn't seem to exist...

Anyone have any ideas?

Upgrading to OpenNLP 1.6

Are there any plans to upgrade to the latest stable version?

Tokenizing not happening perfectly

My code is taken from the README:

(use 'clojure.pprint) ; just for this documentation
(use 'opennlp.nlp)

(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))

;; The verb "quits" is tagged as a noun (NNS).
(pprint (pos-tag (tokenize "john macharty quits")))
;; (["john" "NN"] ["macharty" "NN"] ["quits" "NNS"])

;; Here too, the verb is tagged as a noun.
(pprint (pos-tag (tokenize "bl joshi quits")))
;; (["bl" "JJ"] ["joshi" "NNP"] ["quits" "NNS"])

The verb "quits" is predicted to be a noun. Please see the comments in the code.

Am I doing something wrong?
As far as I can see, we are using the latest version of OpenNLP.

Is there an online testing resource for OpenNLP, like Stanford's http://nlp.stanford.edu:8080/parser/index.jsp, that we could use to compare them?

Could you include the models for dates, organizations, money, location, and time?

These seem easy to bring in and similar to the name recognizer, and would be super useful to people in industry trying to use some basic NLP.

How to deal with indeterminacy?

Evaluating (treebank-parser ["What can happen in a second ."]) using the set-up in the README file here, I get the following parse:

(TOP
 (SBARQ
  (WHNP (WP What))
  (SQ
   (VP (MD can) (VP (VB happen) (PP (IN in) (NP (DT a) (JJ second))))))
  (. .)))

Actually I'm pretty sure the JJ should be an NN. Is this alternative known to the OpenNLP engine at some point in its parse, and if so, can I get it to report on the known alternative(s)?

bare clojure.java.io/readers, writers, input-streams etc etc all over tools/train.clj

None of the clojure.java.io readers, writers, input-streams, or output-streams inside train.clj are wrapped in the 'with-open' macro. This strikes me as very odd, because they are handled correctly in all the other namespaces but not in train.clj, which uses them the most! Unless I'm missing something important, this should be fixed ASAP. It took 3 minutes to fix in my fork.
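
For anyone unfamiliar with the pattern, a minimal sketch of what "wrapped with with-open" means. The function count-lines is a hypothetical example, not code from train.clj. with-open closes the reader when the body exits, including when it throws, so the lazy line-seq has to be fully consumed inside the body.

```clojure
(require '[clojure.java.io :as io])

(defn count-lines
  "Count the lines in a file. The reader is closed when with-open exits,
  even if the body throws (hypothetical example function)."
  [path]
  (with-open [rdr (io/reader path)]
    ;; count forces the lazy seq while the reader is still open
    (count (line-seq rdr))))
```

A bare (io/reader path) without with-open leaves the underlying file handle open until something closes it explicitly.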

NullPointerException when chunk-filter encounters a phrase with {:tag nil}

The chunker occasionally outputs a chunk with a nil tag in cases where the chunk isn't part of a detected phrase, such as a sentence that starts with a coordinating conjunction like "And".

(use 'clojure.pprint)
(use 'opennlp.nlp)
(use 'opennlp.treebank)
(use 'opennlp.tools.filters)

(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))

(pprint
  (noun-phrases
    (chunker
      (pos-tag
        (tokenize "And when the party entered the assembly room, it consisted of only five altogether; Mr. Bingley, his two sisters, the husband of the eldest, and another young man.")))))

Results in:

NullPointerException   java.util.regex.Matcher.getTextLength (:-1)

Because the first phrase has the nil tag:

(pprint (noun-phrases
          '({:phrase ["And"], :tag nil})))

For reference, bin/opennlp ChunkerME en-chunker.bin handles the same text this way, not putting the coordinating conjunction in a phrase at all:

And_CC when_WRB the_DT party_NN entered_VBD the_DT assembly_NN room,_NN it_PRP consisted_VBD of_IN five_CD altogether._.
=>
 And_CC [ADVP when_WRB ] [NP the_DT party_NN ] [VP entered_VBD ] [NP the_DT assembly_NN room,_NN ] [NP it_PRP ] [VP consisted_VBD ] [PP of_IN ] [NP five_CD ] altogether._.

The nil tag is probably a good way to represent this, except for the fact that re-find throws an exception when passed a nil string.

I've fixed this in my fork by removing nil phrases before filtering, but this has the side-effect of making it impossible to filter to select the nil phrases themselves. This may be an acceptable trade-off. I'm not sure.

(defmacro fixed-chunk-filter
  "Declare a filter for treebank-chunked lists with the given name and regex."
  [n r]
  (let [docstring (str "Given a list of treebank-chunked elements, "
                       "return only the " n " in a list.")]
    `(defn ~n
       ~docstring
       [elements#]
       (filter (fn [t#] (re-find ~r (:tag t#))) 
               (remove #(nil? (:tag %)) elements#)))))
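
An alternative sketch that avoids the trade-off: guard the regex match against a nil tag instead of dropping nil phrases up front, and offer a separate selector for the untagged phrases. The names filter-by-tag and untagged are hypothetical, and these are plain functions, not the library's chunk-filter macro.

```clojure
(defn filter-by-tag
  "Keep only the chunks whose :tag matches re. Chunks with a nil :tag are
  skipped before re-find is called, so no NullPointerException is thrown."
  [re elements]
  (filter (fn [{:keys [tag]}] (and tag (re-find re tag))) elements))

(defn untagged
  "Select the chunks that have no tag at all, e.g. a leading \"And\"."
  [elements]
  (filter (comp nil? :tag) elements))

(def chunks
  [{:phrase ["And"] :tag nil}
   {:phrase ["the" "party"] :tag "NP"}
   {:phrase ["entered"] :tag "VP"}])

(filter-by-tag #"^NP$" chunks)
(untagged chunks)
```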

IOException Mark invalid java.io.BufferedReader.reset (BufferedReader.java:505)

I'm getting this error when I try to use the following sample code from the readme:

(with-open [rdr (clojure.java.io/reader "/tmp/bigfile")]
  (let [sentences (sentence-seq rdr get-sentences)]
    ;; process your lazy seq of sentences however you desire
    (println "first 5 sentences:")
    (clojure.pprint/pprint (take 5 sentences))))

My file exists and I'm able to do (slurp "/tmp/bigfile").
I'm new to Clojure, so I'm sorry if it's a basic Java interop issue. I did successfully import the get-sentences and sentence-seq functions, and have otherwise been able to use the library without problems.
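
Some background on the message itself, as a sketch using only the JDK: BufferedReader.reset throws IOException "Mark invalid" when more characters have been read since mark than the read-ahead limit passed to mark allowed, and the buffer had to be refilled. Whether sentence-seq hits exactly this path is an assumption here, not something confirmed from the library source. The function reset-after-overrun is a hypothetical demo.

```clojure
(import '(java.io BufferedReader StringReader IOException))

(defn reset-after-overrun
  "Mark with a 1-char read-ahead limit, read three chars through a
  2-char buffer (forcing a refill past the limit), then try to reset.
  Returns :reset-ok or the IOException message."
  []
  (let [rdr (BufferedReader. (StringReader. "abcdefghij") 2)]
    (.mark rdr 1)
    (dotimes [_ 3] (.read rdr))
    (try
      (.reset rdr)
      :reset-ok
      (catch IOException e
        (.getMessage e)))))

(reset-after-overrun)
```

If the library does rely on mark and reset, a sentence longer than the read-ahead limit would be one plausible trigger.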

CompilerException clojure.lang.ArityException

I am trying to use the library but I'm getting an error. I am working on OS X Yosemite (version 10.10.1) and installed OpenNLP using brew install apache-opennlp.

(defproject firstattempt "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.7.0"]
                 [clojure-opennlp "0.3.3"]])

user=> (use 'clojure.pprint)
nil
user=> (use 'opennlp.nlp)
nil
user=> (use 'opennlp.treebank)

CompilerException clojure.lang.ArityException: Wrong number of args (2) passed to: StringReader, compiling:(abnf.clj:189:28)

The chunker needs punctuation to work properly

Using the definitions of tokenize, pos-tag, and chunker from the readme, and 1.5.1 versions of the model files, the following behaviour is observed:

(-> "I am looking for a good way to annotate this English text."
    tokenize pos-tag chunker phrases)
;; => (["I"] ["am" "looking"] ["for"] ["a" "good" "way"] ["to" "annotate"] ["this" "English" "text"])

;; cf. the same operation, when the text is not full-stop terminated:
(-> "I am looking for a good way to annotate this English text"
    tokenize pos-tag chunker phrases)
;; => (["I"] ["am" "looking"] ["for"] ["a" "good" "way"] ["to" "annotate"] ["this" "English"])

The pos-tag output seems correct however.

NoClassDefFoundError for instaparse when creating uberjar

Hey,

There seems to be an error when you try to run the uberjar:

Exception in thread "main" java.lang.NoClassDefFoundError: instaparse/print$parser__GT_str (wrong name: instaparse/print$Parser__GT_str)

The issue is due to the outdated dependency on instaparse. Updating the dependency should solve the issue.

opennlp library

The Java OpenNLP lib is missing from the dependencies in project.clj.

build-posdictionary is broken?

Hi,

I tried to train a new language model and found that "build-posdictionary" is not working.

Here are the snippets of code that I'm using:

(def tagdict (build-posdictionary "jv-tagdict"))
(def pos-model (train-pos-tagger "jv" "workdir/jv-pos.train" tagdict))

I'm using opennlp-tools "1.5.3", clojure "1.5.1" and clojure-opennlp "0.3.1-SNAPSHOT". Here's the error message:

Exception in thread "main" java.lang.ClassCastException: java.io.BufferedReader cannot be cast to java.io.InputStream, compiling:(jv-pos-learn.clj:23:14)
...
Caused by: java.lang.ClassCastException: java.io.BufferedReader cannot be cast to java.io.InputStream
    at opennlp.tools.train$build_posdictionary.invoke(train.clj:49)
...

Does anyone have any ideas?

Thanks.
Jim
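
For context on the ClassCastException, a small sketch of the two clojure.java.io openers: io/reader returns a character-oriented java.io.BufferedReader, while io/input-stream returns a byte-oriented java.io.InputStream. A Java method that declares an InputStream parameter will reject the former. Which of the two train.clj line 49 should use is not confirmed here; stream-kinds is a hypothetical helper that just shows the types.

```clojure
(require '[clojure.java.io :as io])

(defn stream-kinds
  "Open the same file both ways and report which Java types come back."
  [path]
  (with-open [in  (io/input-stream path)   ; byte-oriented
              rdr (io/reader path)]        ; character-oriented
    {:input-stream?           (instance? java.io.InputStream in)
     :reader-is-input-stream? (instance? java.io.InputStream rdr)
     :reader-class            (class rdr)}))
```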

Upgrade to OpenNLP 1.5.2

Currently there are issues with the trainer with 1.5.2.

Custom Feature generation impossible via 'make-name-finder'

Hi there,

It seems that 'make-name-finder' does not take into account the several constructors in NameFinderME.java. More specifically, there is no way to use the constructor that accepts a custom feature generator. I propose the following, which is not a breaking change:

(defmethod make-name-finder TokenNameFinderModel
  [model & {:keys [feature-generator]}] ;;optional arg - defaults to nil
  (fn name-finder
    [tokens & contexts]
    {:pre [(seq tokens)
           (every? #(= (class %) String) tokens)]}
    (let [finder (NameFinderME. model feature-generator *beam-size*) ;can be nil - no problem
          matches (.find finder (into-array String tokens))
          probs (seq (.probs finder))]
      (with-meta
        (distinct (Span/spansToStrings matches (into-array String tokens)))
        {:probabilities probs}))))

Jim
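
One caveat worth checking, sketched with a stand-in multimethod: a defmulti passes all call arguments to its dispatch function, so if the dispatch function is plain class (an assumption, not checked against the library source), calling with extra keyword arguments would fail with an arity error before the method is reached. A variadic dispatch function avoids that. The name make-finder and the String dispatch value are hypothetical and only mimic the shape of the proposal.

```clojure
;; Dispatch on the class of the first argument and ignore any options.
(defmulti make-finder (fn [model & _opts] (class model)))

(defmethod make-finder String
  [model & {:keys [feature-generator]}]   ; optional, defaults to nil
  {:model model
   :feature-generator feature-generator})

;; Existing one-argument callers keep working:
(make-finder "model")

;; New callers can opt in to a custom generator:
(make-finder "model" :feature-generator :my-generator)
```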
