rug-compling / alpino
Alpino parser and related tools for Dutch
License: GNU Lesser General Public License v2.1
A comma is missing on line 40.
Line 40 in b976582
In CLARIAH we are automatically collecting software metadata for all tools in CLARIAH-PLUS (and CLARIAH-CORE). This metadata is automatically and periodically harvested directly from the source code repositories, and this Alpino git repo is one of those sources. The advantage of this approach is that metadata is as close to the source as possible, reflects the actual software version, and developers retain full authorship and control without needing any middlemen.
The results are published daily on https://tools.clariah.nl/ and this will in turn be queried by other platforms (Ineo, CLARIN VLO) to disseminate the tools in portals for end-users.
The harvesting is set up in such a way that various existing metadata formats are supported and automatically converted. The whole idea is to burden developers as little as possible, use standards they already use, and prevent any unnecessary duplication of metadata fields. But Alpino is not currently using any scheme, so our harvester doesn't have much to fall back on, and as a consequence the metadata quality is rather poor.
In January a call went out to request all CLARIAH developers to take a look at this metadata and to improve upon it where needed (see CLARIAH/clariah-plus#143). Alpino has a long history in CLARIAH and CLARIN and is much used, so we'd like to have good metadata for it. Could you take a look at improving the metadata? Alpino's results are currently like this.
It would also help a lot if you could use GitHub's release mechanism (i.e. git tags) to tag releases of Alpino (with a semantic version).
Please see the contributing guidelines and the CLARIAH Software Metadata Requirements for in-depth instructions on what metadata to provide and how this can be accomplished.
Is it possible to have Alpino output the parse tree in the following format:
In: "Several theories about the higher prevalence in males have been investigated, but the cause of the difference is unconfirmed; one theory is that females are underdiagnosed."
Out: (S (S (S (NP (NP (JJ Several) (NNS theories)) (PP (IN about) (NP (NP (DT the) (JJR higher) (NN prevalence)) (PP (IN in) (NP (NNS males)))))) (VP (VBP have) (VP (VBN been) (VP (VBN investigated))))) (, ,) (CC but) (S (NP (NP (DT the) (NN cause)) (PP (IN of) (NP (DT the) (NN difference)))) (VP (VBZ is) (ADJP (JJ unconfirmed))))) (: ;) (S (NP (CD one) (NN theory)) (VP (VBZ is) (SBAR (IN that) (S (NP (NNS females)) (VP (VBP are) (ADJP (JJ underdiagnosed))))))) (. .))
This output is currently produced with AllenNLP and a minimal span-based neural constituency parser. However, as I'm also working with Dutch data, I intend to use the Alpino parser. If the above output format isn't available, I suspect I'll have to go over the XML output and work something out myself.
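Going over the XML output could look roughly like the sketch below. It is a minimal illustration, not an Alpino feature: the function names are made up for this example, the sample document is a trimmed-down version of the kind of `alpino_ds` XML Alpino produces, and the labels are Alpino's own `cat` and `pt` values rather than Penn Treebank tags. It assumes that ordering children by their `begin` attribute is good enough, which will not hold for discontinuous constituents, and it silently drops index-only nodes that carry no `word` attribute.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the alpino_ds layout (attributes reduced for brevity).
SAMPLE = """<alpino_ds version="1.6">
  <node begin="0" cat="top" end="2" id="0" rel="top">
    <node begin="0" cat="du" end="2" id="1" rel="--">
      <node begin="0" end="1" id="2" pt="tsw" rel="tag" word="hallo"/>
      <node begin="1" end="2" id="3" pt="n" rel="nucl" word="wereld"/>
    </node>
  </node>
</alpino_ds>"""


def to_brackets(node):
    """Render one <node> element (hypothetical helper) as a bracketed string."""
    children = node.findall("node")
    if not children:
        word = node.get("word")
        if word is None:
            # Index-only node without a surface word: nothing to print.
            return ""
        label = node.get("pt") or node.get("pos") or "X"
        return f"({label} {word})"
    # Assumption: surface order follows the "begin" attribute.
    ordered = sorted(children, key=lambda n: int(n.get("begin", "0")))
    parts = [text for text in (to_brackets(c) for c in ordered) if text]
    if not parts:
        return ""
    return f"({node.get('cat', 'X')} {' '.join(parts)})"


def alpino_xml_to_brackets(xml_text):
    """Convert a whole alpino_ds document to one bracketed tree string."""
    root = ET.fromstring(xml_text)
    return to_brackets(root.find("node"))


if __name__ == "__main__":
    print(alpino_xml_to_brackets(SAMPLE))
```

Words that themselves contain parentheses would need escaping before the result is fed to a bracket-tree reader; the sketch does not handle that.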
When I modify the Makefile.start_server script (line 10 in 7a2ea6e) and change assume_input_is_tokenized=off to assume_input_is_tokenized=on, the output becomes malformed.
For example:
$ make -f Makefile.start_server
PROLOGMAXSIZE=1500M /opt/Alpino-git233/bin/Alpino -notk -veryfast user_max=20000\
server_kind=parse\
server_port=42424\
assume_input_is_tokenized=on\
debug=1\
-init_dict_p\
batch_command=alpino_server\
2> /alpino_server.log &
$ telnet localhost 42424
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hallo wereld .
top/top|top/hd|hallo/[0,1]|127.0.0.1
hallo/[0,1]|tag/nucl|wereld/[1,2]|127.0.0.1
/[2,3]|127.0.0.1app|.
<?xml version="1.0" encoding="UTF-8"?>
<alpino_ds version="1.6">
<parser build="Alpino-x86_64-linux-glibc2.5-git233-sicstus" date="2021-02-04T16:52" cats="1" skips="0" />
<node begin="0" cat="top" end="3" id="0" rel="top">
<node begin="0" cat="du" end="3" id="1" rel="--">
<node begin="0" end="1" frame="tag" his="normal" his_1="normal" id="2" lcat="advp" lemma="hallo" pos="tag" postag="TSW()" pt="tsw" rel="tag" root="hallo" sense="hallo" word="hallo"/>
<node begin="1" cat="np" end="3" id="3" rel="nucl">
<node begin="1" end="2" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" his="normal" his_1="normal" id="4" lcat="np" lemma="wereld" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" rnum="sg" root="wereld" sense="wereld" word="wereld"/>
"/>pecial="hoofd" word=".m" positie="vrij" postag="TW(hoofd,vrij)" pt="tw" rel="app" root=".ssion" id="5" infl="both" lcat="detp" lemma=".
</node>
</node>
</node>
</sentence> sentid="127.0.0.1">hallo wereld .
</alpino_ds>
Connection closed by foreign host.
Keeping assume_input_is_tokenized set to off does give a correctly formatted sentence item: <sentence sentid="127.0.0.1">hallo wereld .</sentence>
I have to implement a work-around here anyway to support older Alpino-versions, so this isn't an issue for me. But I was wondering if there might be some setting I'm missing here to prevent this from happening? I couldn't figure out where in the Alpino-code this goes wrong.
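One thing that may be worth ruling out is the client rather than Alpino itself. The transcript above looks like what a terminal shows when a carriage return ends up inside the last token, and telnet typically terminates lines with CR LF. That is only a guess; nothing in this report confirms it. The sketch below sends the tokenized sentence over a plain socket with a bare newline instead. The function names are hypothetical, the port is the one from the Makefile call above, and the code assumes the server closes the connection after sending its response, as the telnet session suggests.

```python
import socket


def build_request(tokens):
    """Join pre-tokenized input into one line ending in a bare newline.

    str.strip() also removes stray carriage returns from each token.
    """
    cleaned = [t.strip() for t in tokens if t.strip()]
    return (" ".join(cleaned) + "\n").encode("utf-8")


def parse_with_server(tokens, host="localhost", port=42424, timeout=60.0):
    """Send one sentence to a running Alpino server and return its reply.

    Assumption: the server answers with one XML document and then closes
    the connection, so reading until EOF yields the full response.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request(tokens))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


if __name__ == "__main__":
    # Requires a server started as in the Makefile call shown above.
    print(parse_with_server(["hallo", "wereld", "."]))
```

If the sentence element is still malformed with this client, the carriage-return guess can be discarded and the problem lies elsewhere.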
The following input:
$ cd /tmp; mkdir -p example; echo '" Kijk " , zei Japi , " de " " Stad Gent . " " Je zag in de verte het water aan weerszijden van de boeg hoog opvliegen ; om de schroef zag je het woelen en bruisen en schuimen .' | $ALPINO_HOME/bin/Alpino number_analyses=1 end_hook=xml -parse -flag treebank example
Gives:
$ grep mwu example/1.xml
<node begin="14" cat="mwu(" "," ")" end="16" his="normal" his_1="longpunct" id="7" rel="--">
<node begin="9" cat="mwu(" " Stad Gent," " Stad Gent)" end="13" his="name" his_1="not_begin" id="19" rel="hd">
$ xmlwf example/1.xml
example/1.xml:11:32: not well-formed (invalid token)
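The grep output suggests that the double quotes inside the `cat` value are written without escaping, which would explain the xmlwf error. The sketch below illustrates this outside Alpino: it builds one element with the raw value and one with the quotes escaped as `&quot;`, then checks both for well-formedness. The helper names are invented for this example, and the raw `cat` string is inferred from the grep output rather than taken from Alpino's internals.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def attr(name, value):
    """Format name="value" with &, <, > and double quotes escaped."""
    return f'{name}="{escape(value, {chr(34): "&quot;"})}"'


def is_well_formed(xml_text):
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Value inferred from the first grep hit above (an assumption).
RAW_CAT = 'mwu(" "," ")'

# Unescaped, as in the reported output: the inner quotes end the attribute early.
BAD = f'<node begin="14" cat="{RAW_CAT}" end="16"/>'

# Escaped: the inner quotes become &quot; and the parser restores them on read.
GOOD = f'<node begin="14" {attr("cat", RAW_CAT)} end="16"/>'

if __name__ == "__main__":
    print(is_well_formed(BAD))
    print(is_well_formed(GOOD))
    print(ET.fromstring(GOOD).get("cat"))
```

A check like `is_well_formed` can also serve as a guard on the consumer side until the output is fixed.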