
exist-db / exist-stanford-nlp


XQuery wrapper around the Stanford CoreNLP pipeline

License: GNU Lesser General Public License v2.1


exist-stanford-nlp's Introduction

author: Loren Cahlander, North Carolina, United States of America <[email protected]>
title: Stanford CoreNLP Wrapper for eXist-db

exist-stanford-nlp


Introduction

This application is a wrapper around the Stanford CoreNLP pipeline for eXist-db.

Why

Loren was between projects when it came to light, during an eXist-db weekly conference call, that the previous implementations of Stanford NLP and Named Entity Recognition were not compatible with version 5.x of eXist-db. Loren took this project on while looking for his next engagement, so please see the contributions section at the end of this article.

Requirements

  • eXist-db: 5.0.0, with at least 4 GB of memory

For Building from Source

  • maven: 3.6.0
  • java: 8
  • (node: 12)
  • (polymer-cli: 1.9.11)

Building from Source

All dependencies, including node.js and Polymer, are managed by maven. Simply run mvn clean package to generate a .xar file inside the target/ directory, then follow the installation instructions below.

When developing web components, you can navigate to the src/main/polymer directory and execute polymer-cli commands.

For more information, see the Polymer README.

Testing

To run unit tests (Java, XQuery, web components) locally, use: mvn test.

Support for integration tests, namely Web Component Tester, is TBD.

Installing the Application

  1. Open the eXist-db Dashboard

  2. Login as the administrator

  3. Select Stanford Natural Language Processing

    GUI install

Loading Languages

The application is installed without language files out of the box (OOTB). The language files must be loaded after installation: click on the Setup tab, then click on the language(s) that you want to load.

When a language has been loaded, a checkmark appears on its button.

Properties

The properties files within the JAR file are transformed into JSON documents; entries that point to data files loaded into the database are rewritten to the URL of the corresponding database resource.
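As an illustrative sketch (mirroring the German properties document shown later in this README), a classpath-relative model entry ends up pointing at the copy of the model stored in the database:

```json
{
    "pos.model": "http://localhost:8080/exist/apps/stanford-nlp/data/edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger"
}
```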

Defaults

The pipeline's default properties assume that the english jar file is loaded on the classpath. Since the english jar is instead loaded into the database, it is important to have a defaults JSON document that points to the english files in the database.

The defaults are loaded into /db/apps/stanford-nlp/data/StanfordCoreNLP-english.json.
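A minimal sketch of parsing English text with those defaults (assuming the defaults document exists at the path above; the sample sentence is arbitrary):

```xquery
xquery version "3.1";

import module namespace nlp = "http://exist-db.org/xquery/stanford-nlp";

(: Load the default English properties stored in the database :)
let $properties := json-doc("/db/apps/stanford-nlp/data/StanfordCoreNLP-english.json")
return nlp:parse("The quick brown fox jumps over the lazy dog.", $properties)
```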

User Interface

Named Entity Recognition

This user interface allows the user to enter text in the textbox and select the language; after the text is submitted, the resulting NER view color-codes the text to identify the named entities.

NLP

RESTful API

Natural Language Processing

Named Entity Recognition

XQuery Function Modules

Natural Language Processing

xquery version "3.1";

import module namespace nlp="http://exist-db.org/xquery/stanford-nlp";

let $properties := json-doc("/db/apps/stanford-nlp/data/StanfordCoreNLP-german.json")

let $text := "Juliana kommt aus Paris. Das ist die Hauptstadt von Frankreich. " ||
             "In diesem Sommer macht sie einen Sprachkurs in Freiburg. Das ist " ||
             "eine Universitätsstadt im Süden von Deutschland."

return nlp:parse($text, $properties)

The properties JSON document for German is:

{
    "ner.applyNumericClassifiers": "false",
    "depparse.language": "german",
    "ner.useSUTime": "false",
    "ner.applyFineGrained": "false",
    "tokenize.language": "de",
    "parse.model": "http://localhost:8080/exist/apps/stanford-nlp/data/edu/stanford/nlp/models/lexparser/germanFactored.ser.gz",
    "pos.model": "http://localhost:8080/exist/apps/stanford-nlp/data/edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger",
    "ner.model": "http://localhost:8080/exist/apps/stanford-nlp/data/edu/stanford/nlp/models/ner/german.conll.germeval2014.hgc_175m_600.crf.ser.gz",
    "annotators": [
        "tokenize",
        "ssplit",
        "pos",
        "ner",
        "parse"
    ],
    "depparse.model": "http://localhost:8080/exist/apps/stanford-nlp/data/edu/stanford/nlp/models/parser/nndep/UD_German.gz"
}

This returns an XML document of the parsed text.

<StanfordNLP>
    <sentences>
        <sentence id="1">
            <tokens>
                <token id="1">
                    <word>Juliana</word>
                    <CharacterOffsetBegin>0</CharacterOffsetBegin>
                    <CharacterOffsetEnd>7</CharacterOffsetEnd>
                    <POS>NE</POS>
                    <NER>PERSON</NER>
                </token>
                <token id="2">
                    <word>kommt</word>
                    <CharacterOffsetBegin>8</CharacterOffsetBegin>
                    <CharacterOffsetEnd>13</CharacterOffsetEnd>
                    <POS>VVFIN</POS>
                    <NER>O</NER>
                </token>
                <token id="3">
                    <word>aus</word>
                    <CharacterOffsetBegin>14</CharacterOffsetBegin>
                    <CharacterOffsetEnd>17</CharacterOffsetEnd>
                    <POS>APPR</POS>
                    <NER>O</NER>
                </token>
                <token id="4">
                    <word>Paris</word>
                    <CharacterOffsetBegin>18</CharacterOffsetBegin>
                    <CharacterOffsetEnd>23</CharacterOffsetEnd>
                    <POS>NE</POS>
                    <NER>LOCATION</NER>
                </token>
                <token id="5">
                    <word>.</word>
                    <CharacterOffsetBegin>23</CharacterOffsetBegin>
                    <CharacterOffsetEnd>24</CharacterOffsetEnd>
                    <POS>$.</POS>
                    <NER>O</NER>
                </token>
            </tokens>
            <parse>(ROOT
  (S (NE Juliana) (VVFIN kommt)
    (PP (APPR aus) (NE Paris))
    ($. .)))

</parse>
        </sentence>
        <sentence id="2">
            <tokens>
                <token id="1">
                    <word>Das</word>
                    <CharacterOffsetBegin>25</CharacterOffsetBegin>
                    <CharacterOffsetEnd>28</CharacterOffsetEnd>
                    <POS>PDS</POS>
                    <NER>O</NER>
                </token>
                <token id="2">
                    <word>ist</word>
                    <CharacterOffsetBegin>29</CharacterOffsetBegin>
                    <CharacterOffsetEnd>32</CharacterOffsetEnd>
                    <POS>VAFIN</POS>
                    <NER>O</NER>
                </token>
                <token id="3">
                    <word>die</word>
                    <CharacterOffsetBegin>33</CharacterOffsetBegin>
                    <CharacterOffsetEnd>36</CharacterOffsetEnd>
                    <POS>ART</POS>
                    <NER>O</NER>
                </token>
                <token id="4">
                    <word>Hauptstadt</word>
                    <CharacterOffsetBegin>37</CharacterOffsetBegin>
                    <CharacterOffsetEnd>47</CharacterOffsetEnd>
                    <POS>NN</POS>
                    <NER>O</NER>
                </token>
                <token id="5">
                    <word>von</word>
                    <CharacterOffsetBegin>48</CharacterOffsetBegin>
                    <CharacterOffsetEnd>51</CharacterOffsetEnd>
                    <POS>APPR</POS>
                    <NER>O</NER>
                </token>
                <token id="6">
                    <word>Frankreich</word>
                    <CharacterOffsetBegin>52</CharacterOffsetBegin>
                    <CharacterOffsetEnd>62</CharacterOffsetEnd>
                    <POS>NE</POS>
                    <NER>LOCATION</NER>
                </token>
                <token id="7">
                    <word>.</word>
                    <CharacterOffsetBegin>62</CharacterOffsetBegin>
                    <CharacterOffsetEnd>63</CharacterOffsetEnd>
                    <POS>$.</POS>
                    <NER>O</NER>
                </token>
            </tokens>
            <parse>(ROOT
  (S (PDS Das) (VAFIN ist)
    (NP (ART die) (NN Hauptstadt)
      (PP (APPR von) (NE Frankreich)))
    ($. .)))

</parse>
        </sentence>
        <sentence id="3">
            <tokens>
                <token id="1">
                    <word>In</word>
                    <CharacterOffsetBegin>64</CharacterOffsetBegin>
                    <CharacterOffsetEnd>66</CharacterOffsetEnd>
                    <POS>APPR</POS>
                    <NER>O</NER>
                </token>
                <token id="2">
                    <word>diesem</word>
                    <CharacterOffsetBegin>67</CharacterOffsetBegin>
                    <CharacterOffsetEnd>73</CharacterOffsetEnd>
                    <POS>PDAT</POS>
                    <NER>O</NER>
                </token>
                <token id="3">
                    <word>Sommer</word>
                    <CharacterOffsetBegin>74</CharacterOffsetBegin>
                    <CharacterOffsetEnd>80</CharacterOffsetEnd>
                    <POS>NN</POS>
                    <NER>O</NER>
                </token>
                <token id="4">
                    <word>macht</word>
                    <CharacterOffsetBegin>81</CharacterOffsetBegin>
                    <CharacterOffsetEnd>86</CharacterOffsetEnd>
                    <POS>VVFIN</POS>
                    <NER>O</NER>
                </token>
                <token id="5">
                    <word>sie</word>
                    <CharacterOffsetBegin>87</CharacterOffsetBegin>
                    <CharacterOffsetEnd>90</CharacterOffsetEnd>
                    <POS>PPER</POS>
                    <NER>O</NER>
                </token>
                <token id="6">
                    <word>einen</word>
                    <CharacterOffsetBegin>91</CharacterOffsetBegin>
                    <CharacterOffsetEnd>96</CharacterOffsetEnd>
                    <POS>ART</POS>
                    <NER>O</NER>
                </token>
                <token id="7">
                    <word>Sprachkurs</word>
                    <CharacterOffsetBegin>97</CharacterOffsetBegin>
                    <CharacterOffsetEnd>107</CharacterOffsetEnd>
                    <POS>NN</POS>
                    <NER>O</NER>
                </token>
                <token id="8">
                    <word>in</word>
                    <CharacterOffsetBegin>108</CharacterOffsetBegin>
                    <CharacterOffsetEnd>110</CharacterOffsetEnd>
                    <POS>APPR</POS>
                    <NER>O</NER>
                </token>
                <token id="9">
                    <word>Freiburg</word>
                    <CharacterOffsetBegin>111</CharacterOffsetBegin>
                    <CharacterOffsetEnd>119</CharacterOffsetEnd>
                    <POS>NE</POS>
                    <NER>LOCATION</NER>
                </token>
                <token id="10">
                    <word>.</word>
                    <CharacterOffsetBegin>119</CharacterOffsetBegin>
                    <CharacterOffsetEnd>120</CharacterOffsetEnd>
                    <POS>$.</POS>
                    <NER>O</NER>
                </token>
            </tokens>
            <parse>(ROOT
  (S
    (PP (APPR In) (PDAT diesem) (NN Sommer))
    (VVFIN macht) (PPER sie)
    (NP (ART einen) (NN Sprachkurs)
      (PP (APPR in) (NE Freiburg)))
    ($. .)))

</parse>
        </sentence>
        <sentence id="4">
            <tokens>
                <token id="1">
                    <word>Das</word>
                    <CharacterOffsetBegin>121</CharacterOffsetBegin>
                    <CharacterOffsetEnd>124</CharacterOffsetEnd>
                    <POS>PDS</POS>
                    <NER>O</NER>
                </token>
                <token id="2">
                    <word>ist</word>
                    <CharacterOffsetBegin>125</CharacterOffsetBegin>
                    <CharacterOffsetEnd>128</CharacterOffsetEnd>
                    <POS>VAFIN</POS>
                    <NER>O</NER>
                </token>
                <token id="3">
                    <word>eine</word>
                    <CharacterOffsetBegin>129</CharacterOffsetBegin>
                    <CharacterOffsetEnd>133</CharacterOffsetEnd>
                    <POS>ART</POS>
                    <NER>O</NER>
                </token>
                <token id="4">
                    <word>Universitätsstadt</word>
                    <CharacterOffsetBegin>134</CharacterOffsetBegin>
                    <CharacterOffsetEnd>151</CharacterOffsetEnd>
                    <POS>NN</POS>
                    <NER>O</NER>
                </token>
                <token id="5">
                    <word>im</word>
                    <CharacterOffsetBegin>152</CharacterOffsetBegin>
                    <CharacterOffsetEnd>154</CharacterOffsetEnd>
                    <POS>APPRART</POS>
                    <NER>O</NER>
                </token>
                <token id="6">
                    <word>Süden</word>
                    <CharacterOffsetBegin>155</CharacterOffsetBegin>
                    <CharacterOffsetEnd>160</CharacterOffsetEnd>
                    <POS>NN</POS>
                    <NER>O</NER>
                </token>
                <token id="7">
                    <word>von</word>
                    <CharacterOffsetBegin>161</CharacterOffsetBegin>
                    <CharacterOffsetEnd>164</CharacterOffsetEnd>
                    <POS>APPR</POS>
                    <NER>O</NER>
                </token>
                <token id="8">
                    <word>Deutschland</word>
                    <CharacterOffsetBegin>165</CharacterOffsetBegin>
                    <CharacterOffsetEnd>176</CharacterOffsetEnd>
                    <POS>NE</POS>
                    <NER>LOCATION</NER>
                </token>
                <token id="9">
                    <word>.</word>
                    <CharacterOffsetBegin>176</CharacterOffsetBegin>
                    <CharacterOffsetEnd>177</CharacterOffsetEnd>
                    <POS>$.</POS>
                    <NER>O</NER>
                </token>
            </tokens>
            <parse>(ROOT
  (S (PDS Das) (VAFIN ist)
    (NP (ART eine) (NN Universitätsstadt)
      (PP (APPRART im) (NN Süden)
        (PP (APPR von) (NE Deutschland))))
    ($. .)))

</parse>
        </sentence>
    </sentences>
</StanfordNLP>

Named Entity Recognition

There is an XQuery library module that takes the output of the NLP pipeline and surrounds the named entities with the appropriate tags.

xquery version "3.1";

import module namespace ner = "http://exist-db.org/xquery/stanford-nlp/ner";

let $text := "Juliana kommt aus Paris. Das ist die Hauptstadt von Frankreich. " ||
             "In diesem Sommer macht sie einen Sprachkurs in Freiburg. Das ist " ||
             "eine Universitätsstadt im Süden von Deutschland."
   
return ner:query-text-as-xml($text, "de")

With the results:

<ner>
    <PERSON>Juliana</PERSON> kommt aus <LOCATION>Paris</LOCATION>.
Das ist die Hauptstadt von <LOCATION>Frankreich</LOCATION>.
In diesem Sommer macht sie einen Sprachkurs in <LOCATION>Freiburg</LOCATION>.
Das ist eine Universitätsstadt im Süden von <LOCATION>Deutschland</LOCATION>.</ner>

Future Developments

Any requests for features should be submitted to https://github.com/lcahlander/exist-stanford-nlp/issues

About the Author

Loren is an independent contractor, so his contributions to the Open Source community are on his own time. If you appreciate his contributions to the NoSQL and Natural Language Processing communities, then please either contract him for a project or send a contribution via PayPal to his company at [email protected].

exist-stanford-nlp's People

Contributors

adamretter, dependabot[bot], duncdrum, lcahlander, marmoure, open-collective-bot[bot]


exist-stanford-nlp's Issues

Automate build and configure CI

This automates more building steps and dependency installation via maven.
Also add CI for automated build.
Unit test scaffold for java and xquery is already there, (sans actual test).
As for integration testing, there are some open questions about the use of WCT (which will not test contents as deployed from inside eXist) vs. Cypress (which will).

eXist-db has a Saucelabs account, so either one could be made to run on CI.

nlp:parse return type mismatch: element() vs. document-node()

Using stanford-nlp-0.5.1, I noticed that the function documentation states that nlp:parse() returns an element(), but in my testing it returns a document-node(). Indeed, I received an error to this effect when passing text to nlp:parse() using the arrow operator. Here's the code:

xquery version "3.1";

import module namespace nlp="http://exist-db.org/xquery/stanford-nlp";

"Hello World!" => nlp:parse(
    map {
        "annotators" : "tokenize, ssplit",
        "tokenize.language" : "en"
    }
)

Error:

err:XPTY0004 document-node()(Hello05World611!1112) is not a sub-type of element() [source: xquery version "3.1"; import module namespace nlp="http://exist-db.org/xquery/stanford-nlp"; "Hello World!" => nlp:parse( map { "annotators" : "tokenize, ssplit", "tokenize.language" : "en" } )] In function: nlp:parse(xs:string, map(*)?) [-1:-1:String]

Looking at this another way, the following code returns true() instead of the expected false():

xquery version "3.1";

import module namespace nlp="http://exist-db.org/xquery/stanford-nlp";

nlp:parse(
    "Hello World!",
    map {
        "annotators" : "tokenize, ssplit",
        "tokenize.language" : "en"
    }
) instance of element()

Screenshot of the function documentation (attached to the issue).

Three other quick observations about the function documentation, if I may:

  1. The $properties cardinality is listed as optional, i.e., map(*)? and $properties?. I think it should probably be exactly one, e.g., map(*) and $properties, because judging by exist.log, calling nlp:parse("Hello World!", ()) with an empty sequence for the 2nd parameter seems to invoke a default pipeline, which includes the most memory-intensive coref step and triggers a "GC overhead limit exceeded" on my system with the default 2 GB of memory allocated to eXist. Better, I think, to err on the side of returning nothing than to risk triggering a memory overflow. How about requiring exactly one map that contains at least the "annotators" entry? Then, if no annotator is supplied, the function would return an empty sequence. Just an idea.

  2. The description of the $properties parameter should be revised, from:

    The path to the serialized classifier to load. Should point to a binary resource stored within the database

    to:

    A map containing properties for the NLP pipeline. Typically, at least map { "annotators": "tokenize, ssplit" } should be provided. Properties can also be loaded from a JSON file via json-doc().

  3. I'd suggest changing $properties to $options. The XQuery 3.1 spec uses $options for all functions that take a map for this purpose. See https://www.w3.org/TR/xpath-functions-31/, with 32 instances of $options and 0 instances of $properties. A totally stylistic suggestion though.
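The first suggestion could be sketched as a thin wrapper around the library function; local:parse-safe is a hypothetical helper, not part of the module:

```xquery
xquery version "3.1";

import module namespace nlp = "http://exist-db.org/xquery/stanford-nlp";

(: Hypothetical guard: only run the pipeline when "annotators" is supplied,
   avoiding the memory-hungry default pipeline :)
declare function local:parse-safe($text as xs:string, $options as map(*)) {
    if (map:contains($options, "annotators"))
    then nlp:parse($text, $options)
    else ()
};

local:parse-safe("Hello World!", map { "annotators": "tokenize, ssplit" })
```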

ner:classify-node complains about node()

The function documentation led me to expect the following to work.


Test

import module namespace ner = "http://exist-db.org/xquery/stanford-nlp/ner";

let $test := <p>克林顿说,华盛顿将逐步落实对韩国的经济援助。金大中对克林顿的讲话报以掌声:克林顿总统在会谈中重申,他坚定地支持韩国摆脱经济危机。</p>

return
  ner:classify-node($test)

Expected

<p>
<PERSON>克林顿</PERSON>
说,
<STATE_OR_PROVINCE>华盛顿</STATE_OR_PROVINCE>
将逐步落实对
<COUNTRY>韩国</COUNTRY>
的经济援助。
<PERSON>金大中</PERSON>
对
<PERSON>克林顿</PERSON>
的讲话报以掌声:
<PERSON>克林顿</PERSON>
<TITLE>总统</TITLE>
在会谈中重申,他坚定地支持
<COUNTRY>韩国</COUNTRY>
摆脱经济危机。
</p>

Actual

err:XPTY0004 Type error: the node name should evaluate to a single item [at line 64, column 5, source: /exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
In function:
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [108:8:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:classify(xs:string, map(*)) [88:51:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:dispatch(node()?, map(*)) [23:12:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:classify-node(node()) [23:7:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]

System

5.3.0-SNAPSHOT
docker:latest

[BUG] io.lacuna.bifurcan.Maps$Entry cannot be cast to java.util.Map$Entry in eXist 5.3.0-SNAPSHOT

Describe the bug

The nlp:parse() function raises an error under eXist 5.3.0-SNAPSHOT (current develop HEAD):

io.lacuna.bifurcan.Maps$Entry cannot be cast to java.util.Map$Entry

The error is not present in eXist 5.2.0.

Expected behavior

The library should work under eXist 5.3.0-SNAPSHOT.

To Reproduce

As instructed in the README:

  1. Build and install the .xar
  2. Run /db/apps/stanford-nlp/modules/load-languages.xq
  3. Run the following query:
xquery version "3.1";

import module namespace nlp="http://exist-db.org/xquery/stanford-nlp";

let $text := 
    "This application is a wrapper around the Stanford CoreNLP pipeline for 
    eXist-db. The application is installed without language files OOTB. The 
    files need to be loaded after installation. The pipeline uses default 
    properties that assume that the english jar file is loaded in the classpath.
    Since the english jar is loaded into the database it is important to have a 
    defaults JSON document that points to the english files in the database."
let $properties := 
    map { 
        "annotators": "tokenize,ssplit"
    }
return
    nlp:parse($text, $properties)

Instead of returning an XML document of the parsed text, the query produces an error. From exist.log:


2021-04-19 17:21:52,287 [qtp698673041-69] ERROR (XQueryServlet.java [process]:550) - io.lacuna.bifurcan.Maps$Entry cannot be cast to java.util.Map$Entry 
java.lang.ClassCastException: io.lacuna.bifurcan.Maps$Entry cannot be cast to java.util.Map$Entry
	at org.exist.xquery.nlp.StanfordNLPFunction.eval(StanfordNLPFunction.java:86) ~[stanford-nlp-0.7.0-SNAPSHOT.jar:0.7.0-SNAPSHOT]
	at org.exist.xquery.BasicFunction.eval(BasicFunction.java:73) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.InternalFunctionCall.eval(InternalFunctionCall.java:62) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.DebuggableExpression.eval(DebuggableExpression.java:58) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.LetExpr.eval(LetExpr.java:110) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.LetExpr.eval(LetExpr.java:110) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:71) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.PathExpr.eval(PathExpr.java:279) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:71) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.XQuery.execute(XQuery.java:261) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
        ...

Unit Test

The following xqsuite test produces the error described above in eXist 5.3.0-SNAPSHOT.

xquery version "3.1";

module namespace t="http://exist-db.org/xquery/test";

import module namespace load-language = "http://exist-db.org/xquery/stanford-nlp/load-language" at "/db/apps/stanford-nlp/modules/load-language.xqm";

import module namespace config = "http://exist-db.org/apps/stanford-nlp/config";
import module namespace nlp="http://exist-db.org/xquery/stanford-nlp";

declare namespace test="http://exist-db.org/xquery/xqsuite";

(: uncomment if you haven't already loaded the languages :)
(: 
declare
    %test:setUp
function t:setup() {
    load-language:process($config:corenlp-model-url || "english.jar")
};
:)

declare
    %test:assertEquals(5)
function t:test() {
    let $text := 
        "This application is a wrapper around the Stanford CoreNLP pipeline for 
        eXist-db. The application is installed without language files OOTB. The 
        files need to be loaded after installation. The pipeline uses default 
        properties that assume that the english jar file is loaded in the classpath.
        Since the english jar is loaded into the database it is important to have a 
        defaults JSON document that points to the english files in the database."
    let $properties := 
        map { 
            "annotators": "tokenize,ssplit"
        }
    return
        nlp:parse($text, $properties)//sentence => count()
};

The test suite returns:

<testsuite package="http://exist-db.org/xquery/test" timestamp="2021-04-19T17:29:28.154-04:00"
    tests="1" failures="0" errors="1" pending="0" time="PT0.006S">
    <testcase name="test" class="t:test">
        <error type="java:java.lang.ClassCastException"
            message="io.lacuna.bifurcan.Maps$Entry cannot be cast to java.util.Map$Entry"/>
    </testcase>
</testsuite>

In eXist 5.2.0, the test passes:

<testsuite package="http://exist-db.org/xquery/test" timestamp="2021-04-19T17:33:32.065-04:00"
    tests="1" failures="0" errors="0" pending="0" time="PT0.146S">
    <testcase name="test" class="t:test"/>
</testsuite>

When modifying the pom.xml file's exist.version to read 5.3.0-SNAPSHOT:

https://github.com/eXist-db/exist-stanford-nlp/blob/master/pom.xml#L65

... mvn clean package fails during compilation:

[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /Users/joe/workspace/exist-stanford-nlp/src/main/java/org/exist/xquery/nlp/StanfordNLPFunction.java:[84,83] incompatible types: java.util.Iterator<io.lacuna.bifurcan.IEntry<org.exist.xquery.value.AtomicValue,org.exist.xquery.value.Sequence>> cannot be converted to java.util.Iterator<java.util.Map.Entry<org.exist.xquery.value.AtomicValue,org.exist.xquery.value.Sequence>>
[INFO] 1 error
[INFO] -------------------------------------------------------------

Context (please always complete the following information):

  • OS: macOS 11.2.3
  • eXist-db Version: eXist 5.3.0-SNAPSHOT 70610262be66f7f7ebda574d6d1206dfaf48444f 20210419134637 (develop HEAD)
  • Java Version: OpenJDK 1.8.0_282 (liberica-jdk-8-full)
  • App Version: 0.7.0-SNAPSHOT (master HEAD)

Additional context

  • How is eXist-db installed? built from source
  • Any custom changes in e.g. conf.xml? none

build troubles

For one, we are bitten by the https bug in the archetype:

[ERROR] Failed to execute goal on project stanford-nlp: Could not resolve dependencies for project org.exist-db:stanford-nlp:jar:0.6.0-SNAPSHOT: Failed to collect dependencies at org.exist-db:exist-core:jar:5.0.0 -> org.exist-db.thirdparty.com.thaiopensource:jing:jar:20151127

test/resources/conf.xml contains an invalid section; I don't think the file is necessary at all:

<builtin-modules>
  <module uri="https://my-organisation.com/exist-db/ns/app/my-java-module" class="org.exist.xquery.ner.ExampleModule"/>
</builtin-modules>

  • my IDE shows a number of code-smell warnings regarding unused imports and the like, plus some deprecation warnings
  • because of eXist-db/exist#3725 the app won't run on 5.3.0-SNAPSHOT
  • CoreNLP has had a major version jump, which we could try to follow
  • testing isn't working
  • CI produces false positives

Support comma-delimited properties as arrays

Currently, properties are supplied as a single string with inline comma delimiters:

map { 
  "annotators" : "tokenize, ssplit, pos, lemma, ner, parse, coref" 
}

For modularity, it would be preferable to supply these as arrays:

map { 
  "annotators" : ["tokenize", "ssplit", "pos", "lemma", "ner", "parse", "coref"]
}

Sequences would work too, but that would prevent saving option sets as JSON files in the database, since sequences exist only in the XDM, not in JSON.
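In the meantime, array support could be layered on with a small shim that flattens array-valued entries back into the comma-delimited strings the library currently expects (local:flatten-options is a hypothetical helper, not part of the module):

```xquery
xquery version "3.1";

(: Hypothetical shim: join array-valued entries into comma-delimited strings :)
declare function local:flatten-options($options as map(*)) as map(*) {
    map:merge(
        map:for-each($options, function($key, $value) {
            map:entry($key,
                if ($value instance of array(*))
                then string-join($value?*, ",")
                else $value)
        })
    )
};

local:flatten-options(
    map { "annotators": ["tokenize", "ssplit", "pos"] }
)
(: yields map { "annotators": "tokenize,ssplit,pos" } :)
```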

Many annotators invoked when only 2 were specified

Using v0.5.2, I would expect the following code to invoke only the two specified annotators, tokenize and ssplit:

xquery version "3.1";

import module namespace nlp="http://exist-db.org/xquery/stanford-nlp";

nlp:parse(
    "Hello World!",
    map {
        "annotators" : "tokenize, ssplit",
        "tokenize.language" : "en"
    }
)

But judging by the logs, it also invokes pos, lemma, ner, depparse, and coref:

2020-02-04 05:47:08,947 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Searching for resource: StanfordCoreNLP.properties ... found. 
2020-02-04 05:47:08,962 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator tokenize 
2020-02-04 05:47:08,977 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator ssplit 
2020-02-04 05:47:08,982 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator pos 
2020-02-04 05:47:09,910 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.9 sec]. 
2020-02-04 05:47:09,910 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator lemma 
2020-02-04 05:47:09,912 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator ner 
2020-02-04 05:47:09,988 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - encoding=utf-8 
2020-02-04 05:47:13,506 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [3.4 sec]. 
2020-02-04 05:47:14,189 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.7 sec]. 
2020-02-04 05:47:15,543 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [1.4 sec]. 
2020-02-04 05:47:15,550 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1. 
2020-02-04 05:47:15,797 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt 
2020-02-04 05:47:23,126 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - ner.fine.regexner: Read 580704 unique entries out of 581863 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab, 0 TokensRegex patterns. 
2020-02-04 05:47:23,141 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - ner.fine.regexner: Read 4869 unique entries out of 4869 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab, 0 TokensRegex patterns. 
2020-02-04 05:47:23,142 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - ner.fine.regexner: Read 585573 unique entries from 2 files 
2020-02-04 05:47:49,838 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator depparse 
2020-02-04 05:47:50,110 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Loading depparse model: edu/stanford/nlp/models/parser/nndep/english_UD.gz ...  
2020-02-04 05:48:14,095 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - PreComputed 99996, Elapsed Time: 17.01 (s) 
2020-02-04 05:48:14,096 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Initializing dependency parser ... done [24.0 sec]. 
2020-02-04 05:48:14,101 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator coref 

(Even with 4 GB allocated to eXist, more than 30 minutes have passed since the query was submitted and it still has not returned a result; the iMac's 8-core CPU is pegged and eXist is unresponsive.)

[question] ner-module location

see #4
The ner-module.xqm is located at /exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm. Shouldn't it be inside /db/system/repo instead?
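One way to check where the EXPath package manager has actually placed the package is to query eXist-db's repo module from eXide. A minimal sketch, assuming eXist's standard `repo:list()` and `repo:get-root()` functions:

```xquery
xquery version "3.1";

(: List all installed EXPath packages together with the on-disk
   repository root, so the install location can be verified. :)
<repo root="{repo:get-root()}">
{
    for $pkg in repo:list()
    return <package uri="{$pkg}"/>
}
</repo>
```

If the stanford-nlp package URI appears in the list, the `content/` directory shown above is where its XQuery modules live on disk.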

[feature] corenlp output visualiser

CoreNLP provides an XSL stylesheet to visualize the output of the annotators.

Including it in the package to create a view should be relatively simple and would be a nice feature addition.
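A rough sketch of how such a view could be wired up inside eXist-db, using eXist's `transform:transform()` function. The document and stylesheet paths below are assumptions for illustration (CoreNLP's distribution ships a stylesheet named `CoreNLP-to-HTML.xsl`; store it wherever fits the package layout):

```xquery
xquery version "3.1";

(: Sketch: render CoreNLP annotator XML output as HTML by applying
   the stylesheet CoreNLP ships. Both paths are hypothetical and
   would need to match the actual package layout. :)
let $annotated := doc("/db/apps/stanford-nlp/samples/annotated.xml")
let $stylesheet := doc("/db/apps/stanford-nlp/resources/CoreNLP-to-HTML.xsl")
return
    transform:transform($annotated, $stylesheet, ())
```

Exposed through a small RESTXQ endpoint, this would give users a browsable view of any stored annotation result.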

[BUG] Build failure

Kudos, @lcahlander and @duncdrum!

For me, a fresh clone at commit e19f3bf fails when running mvn clean package. Does anyone have any suggestions?

Here's my console output. I'm using macOS 11.2.3, OpenJDK 1.8.0_282 (Liberica), and Maven 3.8.1.

% mvn clean package
[INFO] Scanning for projects...
[INFO] 
[INFO] ---------------------< org.exist-db:stanford-nlp >----------------------
[INFO] Building Stanford Natural Language Processing 0.7.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ stanford-nlp ---
[INFO] 
[INFO] --- maven-enforcer-plugin:1.2:enforce (enforce-maven) @ stanford-nlp ---
[INFO] 
[INFO] --- buildversion-plugin:1.0.3:set-properties (default) @ stanford-nlp ---
[INFO] 
[INFO] --- frontend-maven-plugin:1.9.1:install-node-and-npm (install node and npm) @ stanford-nlp ---
[INFO] Node v12.22.1 is already installed.
[INFO] NPM 7.9.0 is already installed.
[INFO] 
[INFO] --- frontend-maven-plugin:1.9.1:npm (npm version bump) @ stanford-nlp ---
[INFO] Running 'npm version --no-git-tag-version --allow-same-version=true 0.7.0-SNAPSHOT' in /Users/joe/workspace/exist-stanford-nlp/src/main/polymer
[INFO] v0.7.0-SNAPSHOT
[INFO] 
[INFO] --- frontend-maven-plugin:1.9.1:npm (npm install) @ stanford-nlp ---
[INFO] Running 'npm i' in /Users/joe/workspace/exist-stanford-nlp/src/main/polymer
[INFO] 
[INFO] up to date, audited 1528 packages in 4s
[INFO] 
[INFO] 30 packages are looking for funding
[INFO]   run `npm fund` for details
[INFO] 
[INFO] 29 vulnerabilities (17 low, 10 high, 2 critical)
[INFO] 
[INFO] To address issues that do not require attention, run:
[INFO]   npm audit fix
[INFO] 
[INFO] Some issues need review, and may require choosing
[INFO] a different dependency.
[INFO] 
[INFO] Run `npm audit` for details.
[INFO] 
[INFO] --- frontend-maven-plugin:1.9.1:npm (polymer build) @ stanford-nlp ---
[INFO] Running 'npm run build' in /Users/joe/workspace/exist-stanford-nlp/src/main/polymer
[INFO] 
[INFO] > stanford-nlp@0.7.0-SNAPSHOT build
[INFO] > polymer build
[INFO] 
[INFO] info:	Clearing build/ directory...
[INFO] info:	(esm-bundled) Building...
[INFO] info:	(es6-bundled) Building...
[INFO] info:	(es5-bundled) Building...
[INFO] error:	Uncaught exception: Error: not implemented
[INFO] error:	Error: not implemented
[INFO]     at Writable._write (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_writable.js:407:6)
[INFO]     at doWrite (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_writable.js:279:12)
[INFO]     at writeOrBuffer (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_writable.js:266:5)
[INFO]     at Writable.write (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_writable.js:211:11)
[INFO]     at DestroyableTransform.ondata (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_readable.js:572:20)
[INFO]     at DestroyableTransform.emit (events.js:314:20)
[INFO]     at readableAddChunk (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_readable.js:195:16)
[INFO]     at DestroyableTransform.Readable.push (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_readable.js:162:10)
[INFO]     at DestroyableTransform.Transform.push (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_transform.js:145:32)
[INFO]     at afterTransform (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_transform.js:101:12)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  26.895 s
[INFO] Finished at: 2021-04-12T15:12:51-04:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.9.1:npm (polymer build) on project stanford-nlp: Failed to run task: 'npm run build' failed. org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
