
exist-db / exist-stanford-nlp


XQuery wrapper around the Stanford CoreNLP pipeline

License: GNU Lesser General Public License v2.1


exist-stanford-nlp's Introduction

author: Loren Cahlander, North Carolina, United States of America <[email protected]>
title: Stanford CoreNLP Wrapper for eXist-db

exist-stanford-nlp


Introduction

This application is a wrapper around the Stanford CoreNLP pipeline for eXist-db.

Why

Loren was between projects when it came to light, during an eXist-db weekly conference call, that the previous implementations of Stanford NLP and Named Entity Recognition were not compatible with version 5.x of eXist-db. Loren took this project on while looking for his next engagement, so please see the contributions section at the end of this article.

Requirements

  • eXist-db: 5.0.0, with at least 4 GB of memory

For Building from Source

  • maven: 3.6.0
  • java: 8
  • (node: 12)
  • (polymer-cli: 1.9.11)

Building from Source

All dependencies, including node.js and Polymer, are managed by maven. Simply run mvn clean package to generate a .xar file inside the target/ directory, then follow the installation instructions below.

When developing web components, you can navigate to the src/main/polymer directory and execute polymer-cli commands.

For more information, see the Polymer README.

Testing

To run unit tests (Java, XQuery, web components) locally, use: mvn test.

Support for integration tests, namely Web Component Tester, is TBD.

Installing the Application

  1. Open the eXist-db Dashboard

  2. Login as the administrator

  3. Select Stanford Natural Language Processing

    GUI install

Loading Languages

The application is installed without language files out of the box (OOTB). The language files must be loaded after installation: click on the Setup tab, then click on the language(s) that you want to load.

When a language has been loaded, a checkmark appears on its button.

Properties

The properties files within the JAR file are transformed into JSON documents; entries that point to data files loaded into the database are rewritten to the URL of the corresponding database resource.
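As an illustrative sketch (mirroring the German properties document shown later in this README), a classpath-relative model entry ends up pointing at the copy of the model stored in the database:

```json
{
    "pos.model": "http://localhost:8080/exist/apps/stanford-nlp/data/edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger"
}
```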

Defaults

The pipeline's default properties assume that the english jar file is loaded on the classpath. Since the english jar is instead loaded into the database, it is important to have a defaults JSON document that points to the english files in the database.

The defaults are loaded into /db/apps/stanford-nlp/data/StanfordCoreNLP-english.json.
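A minimal sketch of parsing English text with those defaults (assuming the defaults document exists at the path above; the sample sentence is arbitrary):

```xquery
xquery version "3.1";

import module namespace nlp = "http://exist-db.org/xquery/stanford-nlp";

(: Load the default English properties stored in the database :)
let $properties := json-doc("/db/apps/stanford-nlp/data/StanfordCoreNLP-english.json")
return nlp:parse("The quick brown fox jumps over the lazy dog.", $properties)
```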

User Interface

Named Entity Recognition

This user interface allows the user to enter text in the textbox and select the language; after the text is submitted, the resulting NER view color-codes the text to identify the named entities.

NLP

RESTful API

Natural Language Processing

Named Entity Recognition

XQuery Function Modules

Natural Language Processing

xquery version "3.1";

import module namespace nlp="http://exist-db.org/xquery/stanford-nlp";

let $properties := json-doc("/db/apps/stanford-nlp/data/StanfordCoreNLP-german.json")

let $text := "Juliana kommt aus Paris. Das ist die Hauptstadt von Frankreich. " ||
             "In diesem Sommer macht sie einen Sprachkurs in Freiburg. Das ist " ||
             "eine Universitätsstadt im Süden von Deutschland."

return nlp:parse($text, $properties)

The properties JSON document for German is:

{
    "ner.applyNumericClassifiers": "false",
    "depparse.language": "german",
    "ner.useSUTime": "false",
    "ner.applyFineGrained": "false",
    "tokenize.language": "de",
    "parse.model": "http://localhost:8080/exist/apps/stanford-nlp/data/edu/stanford/nlp/models/lexparser/germanFactored.ser.gz",
    "pos.model": "http://localhost:8080/exist/apps/stanford-nlp/data/edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger",
    "ner.model": "http://localhost:8080/exist/apps/stanford-nlp/data/edu/stanford/nlp/models/ner/german.conll.germeval2014.hgc_175m_600.crf.ser.gz",
    "annotators": [
        "tokenize",
        "ssplit",
        "pos",
        "ner",
        "parse"
    ],
    "depparse.model": "http://localhost:8080/exist/apps/stanford-nlp/data/edu/stanford/nlp/models/parser/nndep/UD_German.gz"
}

This returns an XML document of the parsed text.

<StanfordNLP>
    <sentences>
        <sentence id="1">
            <tokens>
                <token id="1">
                    <word>Juliana</word>
                    <CharacterOffsetBegin>0</CharacterOffsetBegin>
                    <CharacterOffsetEnd>7</CharacterOffsetEnd>
                    <POS>NE</POS>
                    <NER>PERSON</NER>
                </token>
                <token id="2">
                    <word>kommt</word>
                    <CharacterOffsetBegin>8</CharacterOffsetBegin>
                    <CharacterOffsetEnd>13</CharacterOffsetEnd>
                    <POS>VVFIN</POS>
                    <NER>O</NER>
                </token>
                <token id="3">
                    <word>aus</word>
                    <CharacterOffsetBegin>14</CharacterOffsetBegin>
                    <CharacterOffsetEnd>17</CharacterOffsetEnd>
                    <POS>APPR</POS>
                    <NER>O</NER>
                </token>
                <token id="4">
                    <word>Paris</word>
                    <CharacterOffsetBegin>18</CharacterOffsetBegin>
                    <CharacterOffsetEnd>23</CharacterOffsetEnd>
                    <POS>NE</POS>
                    <NER>LOCATION</NER>
                </token>
                <token id="5">
                    <word>.</word>
                    <CharacterOffsetBegin>23</CharacterOffsetBegin>
                    <CharacterOffsetEnd>24</CharacterOffsetEnd>
                    <POS>$.</POS>
                    <NER>O</NER>
                </token>
            </tokens>
            <parse>(ROOT
  (S (NE Juliana) (VVFIN kommt)
    (PP (APPR aus) (NE Paris))
    ($. .)))

</parse>
        </sentence>
        <sentence id="2">
            <tokens>
                <token id="1">
                    <word>Das</word>
                    <CharacterOffsetBegin>25</CharacterOffsetBegin>
                    <CharacterOffsetEnd>28</CharacterOffsetEnd>
                    <POS>PDS</POS>
                    <NER>O</NER>
                </token>
                <token id="2">
                    <word>ist</word>
                    <CharacterOffsetBegin>29</CharacterOffsetBegin>
                    <CharacterOffsetEnd>32</CharacterOffsetEnd>
                    <POS>VAFIN</POS>
                    <NER>O</NER>
                </token>
                <token id="3">
                    <word>die</word>
                    <CharacterOffsetBegin>33</CharacterOffsetBegin>
                    <CharacterOffsetEnd>36</CharacterOffsetEnd>
                    <POS>ART</POS>
                    <NER>O</NER>
                </token>
                <token id="4">
                    <word>Hauptstadt</word>
                    <CharacterOffsetBegin>37</CharacterOffsetBegin>
                    <CharacterOffsetEnd>47</CharacterOffsetEnd>
                    <POS>NN</POS>
                    <NER>O</NER>
                </token>
                <token id="5">
                    <word>von</word>
                    <CharacterOffsetBegin>48</CharacterOffsetBegin>
                    <CharacterOffsetEnd>51</CharacterOffsetEnd>
                    <POS>APPR</POS>
                    <NER>O</NER>
                </token>
                <token id="6">
                    <word>Frankreich</word>
                    <CharacterOffsetBegin>52</CharacterOffsetBegin>
                    <CharacterOffsetEnd>62</CharacterOffsetEnd>
                    <POS>NE</POS>
                    <NER>LOCATION</NER>
                </token>
                <token id="7">
                    <word>.</word>
                    <CharacterOffsetBegin>62</CharacterOffsetBegin>
                    <CharacterOffsetEnd>63</CharacterOffsetEnd>
                    <POS>$.</POS>
                    <NER>O</NER>
                </token>
            </tokens>
            <parse>(ROOT
  (S (PDS Das) (VAFIN ist)
    (NP (ART die) (NN Hauptstadt)
      (PP (APPR von) (NE Frankreich)))
    ($. .)))

</parse>
        </sentence>
        <sentence id="3">
            <tokens>
                <token id="1">
                    <word>In</word>
                    <CharacterOffsetBegin>64</CharacterOffsetBegin>
                    <CharacterOffsetEnd>66</CharacterOffsetEnd>
                    <POS>APPR</POS>
                    <NER>O</NER>
                </token>
                <token id="2">
                    <word>diesem</word>
                    <CharacterOffsetBegin>67</CharacterOffsetBegin>
                    <CharacterOffsetEnd>73</CharacterOffsetEnd>
                    <POS>PDAT</POS>
                    <NER>O</NER>
                </token>
                <token id="3">
                    <word>Sommer</word>
                    <CharacterOffsetBegin>74</CharacterOffsetBegin>
                    <CharacterOffsetEnd>80</CharacterOffsetEnd>
                    <POS>NN</POS>
                    <NER>O</NER>
                </token>
                <token id="4">
                    <word>macht</word>
                    <CharacterOffsetBegin>81</CharacterOffsetBegin>
                    <CharacterOffsetEnd>86</CharacterOffsetEnd>
                    <POS>VVFIN</POS>
                    <NER>O</NER>
                </token>
                <token id="5">
                    <word>sie</word>
                    <CharacterOffsetBegin>87</CharacterOffsetBegin>
                    <CharacterOffsetEnd>90</CharacterOffsetEnd>
                    <POS>PPER</POS>
                    <NER>O</NER>
                </token>
                <token id="6">
                    <word>einen</word>
                    <CharacterOffsetBegin>91</CharacterOffsetBegin>
                    <CharacterOffsetEnd>96</CharacterOffsetEnd>
                    <POS>ART</POS>
                    <NER>O</NER>
                </token>
                <token id="7">
                    <word>Sprachkurs</word>
                    <CharacterOffsetBegin>97</CharacterOffsetBegin>
                    <CharacterOffsetEnd>107</CharacterOffsetEnd>
                    <POS>NN</POS>
                    <NER>O</NER>
                </token>
                <token id="8">
                    <word>in</word>
                    <CharacterOffsetBegin>108</CharacterOffsetBegin>
                    <CharacterOffsetEnd>110</CharacterOffsetEnd>
                    <POS>APPR</POS>
                    <NER>O</NER>
                </token>
                <token id="9">
                    <word>Freiburg</word>
                    <CharacterOffsetBegin>111</CharacterOffsetBegin>
                    <CharacterOffsetEnd>119</CharacterOffsetEnd>
                    <POS>NE</POS>
                    <NER>LOCATION</NER>
                </token>
                <token id="10">
                    <word>.</word>
                    <CharacterOffsetBegin>119</CharacterOffsetBegin>
                    <CharacterOffsetEnd>120</CharacterOffsetEnd>
                    <POS>$.</POS>
                    <NER>O</NER>
                </token>
            </tokens>
            <parse>(ROOT
  (S
    (PP (APPR In) (PDAT diesem) (NN Sommer))
    (VVFIN macht) (PPER sie)
    (NP (ART einen) (NN Sprachkurs)
      (PP (APPR in) (NE Freiburg)))
    ($. .)))

</parse>
        </sentence>
        <sentence id="4">
            <tokens>
                <token id="1">
                    <word>Das</word>
                    <CharacterOffsetBegin>121</CharacterOffsetBegin>
                    <CharacterOffsetEnd>124</CharacterOffsetEnd>
                    <POS>PDS</POS>
                    <NER>O</NER>
                </token>
                <token id="2">
                    <word>ist</word>
                    <CharacterOffsetBegin>125</CharacterOffsetBegin>
                    <CharacterOffsetEnd>128</CharacterOffsetEnd>
                    <POS>VAFIN</POS>
                    <NER>O</NER>
                </token>
                <token id="3">
                    <word>eine</word>
                    <CharacterOffsetBegin>129</CharacterOffsetBegin>
                    <CharacterOffsetEnd>133</CharacterOffsetEnd>
                    <POS>ART</POS>
                    <NER>O</NER>
                </token>
                <token id="4">
                    <word>Universitätsstadt</word>
                    <CharacterOffsetBegin>134</CharacterOffsetBegin>
                    <CharacterOffsetEnd>151</CharacterOffsetEnd>
                    <POS>NN</POS>
                    <NER>O</NER>
                </token>
                <token id="5">
                    <word>im</word>
                    <CharacterOffsetBegin>152</CharacterOffsetBegin>
                    <CharacterOffsetEnd>154</CharacterOffsetEnd>
                    <POS>APPRART</POS>
                    <NER>O</NER>
                </token>
                <token id="6">
                    <word>Süden</word>
                    <CharacterOffsetBegin>155</CharacterOffsetBegin>
                    <CharacterOffsetEnd>160</CharacterOffsetEnd>
                    <POS>NN</POS>
                    <NER>O</NER>
                </token>
                <token id="7">
                    <word>von</word>
                    <CharacterOffsetBegin>161</CharacterOffsetBegin>
                    <CharacterOffsetEnd>164</CharacterOffsetEnd>
                    <POS>APPR</POS>
                    <NER>O</NER>
                </token>
                <token id="8">
                    <word>Deutschland</word>
                    <CharacterOffsetBegin>165</CharacterOffsetBegin>
                    <CharacterOffsetEnd>176</CharacterOffsetEnd>
                    <POS>NE</POS>
                    <NER>LOCATION</NER>
                </token>
                <token id="9">
                    <word>.</word>
                    <CharacterOffsetBegin>176</CharacterOffsetBegin>
                    <CharacterOffsetEnd>177</CharacterOffsetEnd>
                    <POS>$.</POS>
                    <NER>O</NER>
                </token>
            </tokens>
            <parse>(ROOT
  (S (PDS Das) (VAFIN ist)
    (NP (ART eine) (NN Universitätsstadt)
      (PP (APPRART im) (NN Süden)
        (PP (APPR von) (NE Deutschland))))
    ($. .)))

</parse>
        </sentence>
    </sentences>
</StanfordNLP>

Named Entity Recognition

There is an XQuery library module that takes the output of the NLP pipeline and surrounds the named entities with the appropriate tags.

xquery version "3.1";

import module namespace ner = "http://exist-db.org/xquery/stanford-nlp/ner";

let $text := "Juliana kommt aus Paris. Das ist die Hauptstadt von Frankreich. " ||
             "In diesem Sommer macht sie einen Sprachkurs in Freiburg. Das ist " ||
             "eine Universitätsstadt im Süden von Deutschland."
   
return ner:query-text-as-xml($text, "de")

With the results:

<ner>
    <PERSON>Juliana</PERSON> kommt aus <LOCATION>Paris</LOCATION>.
Das ist die Hauptstadt von <LOCATION>Frankreich</LOCATION>.
In diesem Sommer macht sie einen Sprachkurs in <LOCATION>Freiburg</LOCATION>.
Das ist eine Universitätsstadt im Süden von <LOCATION>Deutschland</LOCATION>.</ner>

Future Developments

Any requests for features should be submitted to https://github.com/lcahlander/exist-stanford-nlp/issues

About the Author

Loren is an independent contractor, so his contributions to the Open Source community are on his own time. If you appreciate his contributions to the NoSQL and Natural Language Processing communities, then please either contract him for a project or send a contribution via PayPal to his company at [email protected].

exist-stanford-nlp's People

Contributors

adamretter, dependabot[bot], duncdrum, lcahlander, marmoure, open-collective-bot[bot]


exist-stanford-nlp's Issues

Automate build and configure CI

This automates more building steps and dependency installation via maven.
Also add CI for automated build.
Unit test scaffold for java and xquery is already there, (sans actual test).
As for integration testing, there are some open questions about the use of WCT (which will not test contents as deployed from inside eXist) vs. Cypress (which will).

eXist-db has a Saucelabs account, so either one could be made to run on CI.

nlp:parse return type mismatch: element() vs. document-node()

Using stanford-nlp-0.5.1, I noticed that the function documentation states that nlp:parse() returns an element(), but in my testing it returns a document-node(). Indeed, I received an error to this effect when passing text to nlp:parse() using the arrow operator. Here's the code:

xquery version "3.1";

import module namespace nlp="http://exist-db.org/xquery/stanford-nlp";

"Hello World!" => nlp:parse(
    map {
        "annotators" : "tokenize, ssplit",
        "tokenize.language" : "en"
    }
)

Error:

err:XPTY0004 document-node()(Hello05World611!1112) is not a sub-type of element() [source: xquery version "3.1"; import module namespace nlp="http://exist-db.org/xquery/stanford-nlp"; "Hello World!" => nlp:parse( map { "annotators" : "tokenize, ssplit", "tokenize.language" : "en" } )] In function: nlp:parse(xs:string, map(*)?) [-1:-1:String]

Looking at this another way, the following code returns true() instead of the expected false():

xquery version "3.1";

import module namespace nlp="http://exist-db.org/xquery/stanford-nlp";

nlp:parse(
    "Hello World!",
    map {
        "annotators" : "tokenize, ssplit",
        "tokenize.language" : "en"
    }
) instance of element()

Screenshot of the function documentation (attached to the issue).

Three other quick observations about the function documentation, if I may:

  1. The $properties cardinality is listed as optional, i.e., map(*)? and $properties?. I think it should probably be exactly one, e.g., map(*) and $properties, because judging by exist.log, calling nlp:parse("Hello World!", ()) with an empty sequence for the 2nd parameter seems to invoke a default pipeline, which includes the most memory-intensive coref step and triggers a "GC overhead limit exceeded" on my system with the default 2 GB of memory allocated to eXist. Better, I think, to err on the side of returning nothing than to risk triggering a memory overflow. How about requiring exactly one map that contains at least the "annotators" entry? Then, if no annotator is supplied, the function would return an empty sequence. Just an idea.

  2. The description of the $properties parameter should be revised, from:

    The path to the serialized classifier to load. Should point to a binary resource stored within the database

    to:

    A map containing properties for the NLP pipeline. Typically, at least map { "annotators": "tokenize, ssplit" } should be provided. Properties can also be loaded from a JSON file via json-doc().

  3. I'd suggest changing $properties to $options. The XQuery 3.1 spec uses $options for all functions that take a map for this purpose. See https://www.w3.org/TR/xpath-functions-31/, with 32 instances of $options and 0 instances of $properties. A totally stylistic suggestion though.
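The first suggestion could be sketched as a thin wrapper around the library function; local:parse-safe is a hypothetical helper, not part of the module:

```xquery
xquery version "3.1";

import module namespace nlp = "http://exist-db.org/xquery/stanford-nlp";

(: Hypothetical guard: only run the pipeline when "annotators" is supplied,
   avoiding the memory-hungry default pipeline :)
declare function local:parse-safe($text as xs:string, $options as map(*)) {
    if (map:contains($options, "annotators"))
    then nlp:parse($text, $options)
    else ()
};

local:parse-safe("Hello World!", map { "annotators": "tokenize, ssplit" })
```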

ner:classify-node complains about node()

The function documentation led me to expect the following to work.


Test

import module namespace ner = "http://exist-db.org/xquery/stanford-nlp/ner";

let $test := <p>克林顿说,华盛顿将逐步落实对韩国的经济援助。金大中对克林顿的讲话报以掌声:克林顿总统在会谈中重申,他坚定地支持韩国摆脱经济危机。</p>

return
  ner:classify-node($test)

Expected

<p>
<PERSON>克林顿</PERSON>
说,
<STATE_OR_PROVINCE>华盛顿</STATE_OR_PROVINCE>
将逐步落实对
<COUNTRY>韩国</COUNTRY>
的经济援助。
<PERSON>金大中</PERSON>
对
<PERSON>克林顿</PERSON>
的讲话报以掌声:
<PERSON>克林顿</PERSON>
<TITLE>总统</TITLE>
在会谈中重申,他坚定地支持
<COUNTRY>韩国</COUNTRY>
摆脱经济危机。
</p>

Actual

err:XPTY0004 Type error: the node name should evaluate to a single item [at line 64, column 5, source: /exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
In function:
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [78:13:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:enrich(xs:string, node()*) [108:8:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:classify(xs:string, map(*)) [88:51:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:dispatch(node()?, map(*)) [23:12:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]
	ner:classify-node(node()) [23:7:/exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm]

System

5.3.0-SNAPSHOT
docker:latest

[BUG] io.lacuna.bifurcan.Maps$Entry cannot be cast to java.util.Map$Entry in eXist 5.3.0-SNAPSHOT

Describe the bug

The nlp:parse() function raises an error under eXist 5.3.0-SNAPSHOT (current develop HEAD):

io.lacuna.bifurcan.Maps$Entry cannot be cast to java.util.Map$Entry

The error is not present in eXist 5.2.0.

Expected behavior

The library should work under eXist 5.3.0-SNAPSHOT.

To Reproduce

As instructed in the README:

  1. Build and install the .xar
  2. Run /db/apps/stanford-nlp/modules/load-languages.xq
  3. Run the following query:
xquery version "3.1";

import module namespace nlp="http://exist-db.org/xquery/stanford-nlp";

let $text := 
    "This application is a wrapper around the Stanford CoreNLP pipeline for 
    eXist-db. The application is installed without language files OOTB. The 
    files need to be loaded after installation. The pipeline uses default 
    properties that assume that the english jar file is loaded in the classpath.
    Since the english jar is loaded into the database it is important to have a 
    defaults JSON document that points to the english files in the database."
let $properties := 
    map { 
        "annotators": "tokenize,ssplit"
    }
return
    nlp:parse($text, $properties)

Instead of returning an XML document of the parsed text, the query produces an error. From exist.log:


2021-04-19 17:21:52,287 [qtp698673041-69] ERROR (XQueryServlet.java [process]:550) - io.lacuna.bifurcan.Maps$Entry cannot be cast to java.util.Map$Entry 
java.lang.ClassCastException: io.lacuna.bifurcan.Maps$Entry cannot be cast to java.util.Map$Entry
	at org.exist.xquery.nlp.StanfordNLPFunction.eval(StanfordNLPFunction.java:86) ~[stanford-nlp-0.7.0-SNAPSHOT.jar:0.7.0-SNAPSHOT]
	at org.exist.xquery.BasicFunction.eval(BasicFunction.java:73) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.InternalFunctionCall.eval(InternalFunctionCall.java:62) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.DebuggableExpression.eval(DebuggableExpression.java:58) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.LetExpr.eval(LetExpr.java:110) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.LetExpr.eval(LetExpr.java:110) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:71) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.PathExpr.eval(PathExpr.java:279) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:71) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
	at org.exist.xquery.XQuery.execute(XQuery.java:261) ~[exist-core-5.3.0-SNAPSHOT.jar:5.3.0-SNAPSHOT]
        ...

Unit Test

The following xqsuite test produces the error described above in eXist 5.3.0-SNAPSHOT.

xquery version "3.1";

module namespace t="http://exist-db.org/xquery/test";

import module namespace load-language = "http://exist-db.org/xquery/stanford-nlp/load-language" at "/db/apps/stanford-nlp/modules/load-language.xqm";

import module namespace config = "http://exist-db.org/apps/stanford-nlp/config";
import module namespace nlp="http://exist-db.org/xquery/stanford-nlp";

declare namespace test="http://exist-db.org/xquery/xqsuite";

(: uncomment if you haven't already loaded the languages :)
(: 
declare
    %test:setUp
function t:setup() {
    load-language:process($config:corenlp-model-url || "english.jar")
};
:)

declare
    %test:assertEquals(5)
function t:test() {
    let $text := 
        "This application is a wrapper around the Stanford CoreNLP pipeline for 
        eXist-db. The application is installed without language files OOTB. The 
        files need to be loaded after installation. The pipeline uses default 
        properties that assume that the english jar file is loaded in the classpath.
        Since the english jar is loaded into the database it is important to have a 
        defaults JSON document that points to the english files in the database."
    let $properties := 
        map { 
            "annotators": "tokenize,ssplit"
        }
    return
        nlp:parse($text, $properties)//sentence => count()
};

The test suite returns:

<testsuite package="http://exist-db.org/xquery/test" timestamp="2021-04-19T17:29:28.154-04:00"
    tests="1" failures="0" errors="1" pending="0" time="PT0.006S">
    <testcase name="test" class="t:test">
        <error type="java:java.lang.ClassCastException"
            message="io.lacuna.bifurcan.Maps$Entry cannot be cast to java.util.Map$Entry"/>
    </testcase>
</testsuite>

In eXist 5.2.0, the test passes:

<testsuite package="http://exist-db.org/xquery/test" timestamp="2021-04-19T17:33:32.065-04:00"
    tests="1" failures="0" errors="0" pending="0" time="PT0.146S">
    <testcase name="test" class="t:test"/>
</testsuite>

When modifying the pom.xml file's exist.version to read 5.3.0-SNAPSHOT:

https://github.com/eXist-db/exist-stanford-nlp/blob/master/pom.xml#L65

... mvn clean package fails during compilation:

[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /Users/joe/workspace/exist-stanford-nlp/src/main/java/org/exist/xquery/nlp/StanfordNLPFunction.java:[84,83] incompatible types: java.util.Iterator<io.lacuna.bifurcan.IEntry<org.exist.xquery.value.AtomicValue,org.exist.xquery.value.Sequence>> cannot be converted to java.util.Iterator<java.util.Map.Entry<org.exist.xquery.value.AtomicValue,org.exist.xquery.value.Sequence>>
[INFO] 1 error
[INFO] -------------------------------------------------------------

Context (please always complete the following information):

  • OS: macOS 11.2.3
  • eXist-db Version: eXist 5.3.0-SNAPSHOT 70610262be66f7f7ebda574d6d1206dfaf48444f 20210419134637 (develop HEAD)
  • Java Version: OpenJDK 1.8.0_282 (liberica-jdk-8-full)
  • App Version: 0.7.0-SNAPSHOT (master HEAD)

Additional context

  • How is eXist-db installed? built from source
  • Any custom changes in e.g. conf.xml? none

build troubles

For one, we are bitten by the https bug in the archetype:

[ERROR] Failed to execute goal on project stanford-nlp: Could not resolve dependencies for project org.exist-db:stanford-nlp:jar:0.6.0-SNAPSHOT: Failed to collect dependencies at org.exist-db:exist-core:jar:5.0.0 -> org.exist-db.thirdparty.com.thaiopensource:jing:jar:20151127

test/resources/conf.xml contains an invalid section; I don't think the file is necessary at all:

<builtin-modules>
  <module uri="https://my-organisation.com/exist-db/ns/app/my-java-module" class="org.exist.xquery.ner.ExampleModule"/>
</builtin-modules>

  • my IDE shows a number of code-smell warnings regarding unused imports and the like, plus some deprecation warnings
  • because of eXist-db/exist#3725 the app won't run on 5.3.0-SNAPSHOT
  • CoreNLP has had a major version jump, which we could try to follow
  • testing isn't working
  • CI produces false positives

Support comma-delimited properties as arrays

Currently, properties are supplied as a single string with inline comma delimiters:

map { 
  "annotators" : "tokenize, ssplit, pos, lemma, ner, parse, coref" 
}

For modularity, it would be preferable to supply these as arrays:

map { 
  "annotators" : ["tokenize", "ssplit", "pos", "lemma", "ner", "parse", "coref"]
}

Sequences would work too, but that would prevent saving option sets as JSON files in the database, since sequences exist only in the XDM, not in JSON.
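In the meantime, array support could be layered on with a small shim that flattens array-valued entries back into the comma-delimited strings the library currently expects (local:flatten-options is a hypothetical helper, not part of the module):

```xquery
xquery version "3.1";

(: Hypothetical shim: join array-valued entries into comma-delimited strings :)
declare function local:flatten-options($options as map(*)) as map(*) {
    map:merge(
        map:for-each($options, function($key, $value) {
            map:entry($key,
                if ($value instance of array(*))
                then string-join($value?*, ",")
                else $value)
        })
    )
};

local:flatten-options(
    map { "annotators": ["tokenize", "ssplit", "pos"] }
)
(: yields map { "annotators": "tokenize,ssplit,pos" } :)
```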

Many annotators invoked when only 2 were specified

Using v0.5.2, I would expect the following code to invoke only the two specified annotators, tokenize and ssplit:

xquery version "3.1";

import module namespace nlp="http://exist-db.org/xquery/stanford-nlp";

nlp:parse(
    "Hello World!",
    map {
        "annotators" : "tokenize, ssplit",
        "tokenize.language" : "en"
    }
)

But judging by the logs, it also invokes pos, lemma, ner, depparse, and coref:

2020-02-04 05:47:08,947 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Searching for resource: StanfordCoreNLP.properties ... found. 
2020-02-04 05:47:08,962 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator tokenize 
2020-02-04 05:47:08,977 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator ssplit 
2020-02-04 05:47:08,982 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator pos 
2020-02-04 05:47:09,910 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.9 sec]. 
2020-02-04 05:47:09,910 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator lemma 
2020-02-04 05:47:09,912 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator ner 
2020-02-04 05:47:09,988 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - encoding=utf-8 
2020-02-04 05:47:13,506 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [3.4 sec]. 
2020-02-04 05:47:14,189 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.7 sec]. 
2020-02-04 05:47:15,543 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [1.4 sec]. 
2020-02-04 05:47:15,550 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1. 
2020-02-04 05:47:15,797 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt 
2020-02-04 05:47:23,126 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - ner.fine.regexner: Read 580704 unique entries out of 581863 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab, 0 TokensRegex patterns. 
2020-02-04 05:47:23,141 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - ner.fine.regexner: Read 4869 unique entries out of 4869 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab, 0 TokensRegex patterns. 
2020-02-04 05:47:23,142 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - ner.fine.regexner: Read 585573 unique entries from 2 files 
2020-02-04 05:47:49,838 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator depparse 
2020-02-04 05:47:50,110 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Loading depparse model: edu/stanford/nlp/models/parser/nndep/english_UD.gz ...  
2020-02-04 05:48:14,095 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - PreComputed 99996, Elapsed Time: 17.01 (s) 
2020-02-04 05:48:14,096 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Initializing dependency parser ... done [24.0 sec]. 
2020-02-04 05:48:14,101 [qtp709874091-44] INFO  (SLF4JHandler.java [print]:88) - Adding annotator coref 

(Even with 4 GB allocated to eXist, more than 30 minutes have passed since the query was submitted and it still has not returned a result; the iMac's 8-core CPU is pegged and eXist is unresponsive.)

[question] ner-module location

see #4
The ner-module.xqm is located at /exist/etc/../data/expathrepo/stanford-nlp-0.5.8/content/ner-module.xqm. Shouldn't it be inside /db/system/repo instead?
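One way to check where the EXPath package manager has actually placed the package is to query eXist-db's repo module from eXide. A minimal sketch, assuming eXist's standard `repo:list()` and `repo:get-root()` functions:

```xquery
xquery version "3.1";

(: List all installed EXPath packages together with the on-disk
   repository root, so the install location can be verified. :)
<repo root="{repo:get-root()}">
{
    for $pkg in repo:list()
    return <package uri="{$pkg}"/>
}
</repo>
```

If the stanford-nlp package URI appears in the list, the `content/` directory shown above is where its XQuery modules live on disk.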

[feature] corenlp output visualiser

CoreNLP provides an XSL stylesheet to visualize the output of the annotators.

Including it in the package to create a view should be relatively simple and would be a nice feature addition.
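A rough sketch of how such a view could be wired up inside eXist-db, using eXist's `transform:transform()` function. The document and stylesheet paths below are assumptions for illustration (CoreNLP's distribution ships a stylesheet named `CoreNLP-to-HTML.xsl`; store it wherever fits the package layout):

```xquery
xquery version "3.1";

(: Sketch: render CoreNLP annotator XML output as HTML by applying
   the stylesheet CoreNLP ships. Both paths are hypothetical and
   would need to match the actual package layout. :)
let $annotated := doc("/db/apps/stanford-nlp/samples/annotated.xml")
let $stylesheet := doc("/db/apps/stanford-nlp/resources/CoreNLP-to-HTML.xsl")
return
    transform:transform($annotated, $stylesheet, ())
```

Exposed through a small RESTXQ endpoint, this would give users a browsable view of any stored annotation result.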

[BUG] Build failure

Kudos, @lcahlander and @duncdrum!

For me, a fresh clone at commit e19f3bf fails when running mvn clean package. Does anyone have any suggestions?

Here's my console output. I'm using macOS 11.2.3, OpenJDK 1.8.0_282 (Liberica), and Maven 3.8.1.

% mvn clean package
[INFO] Scanning for projects...
[INFO] 
[INFO] ---------------------< org.exist-db:stanford-nlp >----------------------
[INFO] Building Stanford Natural Language Processing 0.7.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ stanford-nlp ---
[INFO] 
[INFO] --- maven-enforcer-plugin:1.2:enforce (enforce-maven) @ stanford-nlp ---
[INFO] 
[INFO] --- buildversion-plugin:1.0.3:set-properties (default) @ stanford-nlp ---
[INFO] 
[INFO] --- frontend-maven-plugin:1.9.1:install-node-and-npm (install node and npm) @ stanford-nlp ---
[INFO] Node v12.22.1 is already installed.
[INFO] NPM 7.9.0 is already installed.
[INFO] 
[INFO] --- frontend-maven-plugin:1.9.1:npm (npm version bump) @ stanford-nlp ---
[INFO] Running 'npm version --no-git-tag-version --allow-same-version=true 0.7.0-SNAPSHOT' in /Users/joe/workspace/exist-stanford-nlp/src/main/polymer
[INFO] v0.7.0-SNAPSHOT
[INFO] 
[INFO] --- frontend-maven-plugin:1.9.1:npm (npm install) @ stanford-nlp ---
[INFO] Running 'npm i' in /Users/joe/workspace/exist-stanford-nlp/src/main/polymer
[INFO] 
[INFO] up to date, audited 1528 packages in 4s
[INFO] 
[INFO] 30 packages are looking for funding
[INFO]   run `npm fund` for details
[INFO] 
[INFO] 29 vulnerabilities (17 low, 10 high, 2 critical)
[INFO] 
[INFO] To address issues that do not require attention, run:
[INFO]   npm audit fix
[INFO] 
[INFO] Some issues need review, and may require choosing
[INFO] a different dependency.
[INFO] 
[INFO] Run `npm audit` for details.
[INFO] 
[INFO] --- frontend-maven-plugin:1.9.1:npm (polymer build) @ stanford-nlp ---
[INFO] Running 'npm run build' in /Users/joe/workspace/exist-stanford-nlp/src/main/polymer
[INFO] 
[INFO] > stanford-nlp@0.7.0-SNAPSHOT build
[INFO] > polymer build
[INFO] 
[INFO] info:	Clearing build/ directory...
[INFO] info:	(esm-bundled) Building...
[INFO] info:	(es6-bundled) Building...
[INFO] info:	(es5-bundled) Building...
[INFO] error:	Uncaught exception: Error: not implemented
[INFO] error:	Error: not implemented
[INFO]     at Writable._write (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_writable.js:407:6)
[INFO]     at doWrite (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_writable.js:279:12)
[INFO]     at writeOrBuffer (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_writable.js:266:5)
[INFO]     at Writable.write (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_writable.js:211:11)
[INFO]     at DestroyableTransform.ondata (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_readable.js:572:20)
[INFO]     at DestroyableTransform.emit (events.js:314:20)
[INFO]     at readableAddChunk (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_readable.js:195:16)
[INFO]     at DestroyableTransform.Readable.push (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_readable.js:162:10)
[INFO]     at DestroyableTransform.Transform.push (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_transform.js:145:32)
[INFO]     at afterTransform (/Users/joe/workspace/exist-stanford-nlp/src/main/polymer/node_modules/readable-stream/lib/_stream_transform.js:101:12)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  26.895 s
[INFO] Finished at: 2021-04-12T15:12:51-04:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.9.1:npm (polymer build) on project stanford-nlp: Failed to run task: 'npm run build' failed. org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
