rug-compling / alpino

Alpino parser and related tools for Dutch

License: GNU Lesser General Public License v2.1

HTML 0.49% Shell 0.02% Makefile 0.24% C++ 0.82% Prolog 90.59% Perl 3.45% C 0.30% Go 0.08% Python 0.36% TeX 0.29% Tcl 0.26% PostScript 0.16% Roff 0.03% NSIS 0.01% XSLT 0.32% PHP 0.01% Java 0.03% XQuery 0.07% Batchfile 0.01% Raku 2.47%

alpino's Issues

Request to improve software metadata for CLARIAH

In CLARIAH we are automatically collecting software metadata for all tools in CLARIAH-PLUS (and CLARIAH-CORE). This metadata is automatically and periodically harvested directly from the source code repositories, and this Alpino git repo is one of those sources. The advantage of this approach is that metadata is as close to the source as possible, reflects the actual software version, and developers retain full authorship and control without needing any middlemen.

The results are published daily on https://tools.clariah.nl/ and this will in turn be queried by other platforms (Ineo, CLARIN VLO) to disseminate the tools in portals for end-users.

The harvesting is set up in such a way that various existing metadata formats are supported and automatically converted. The whole idea is to burden developers as little as possible, use standards they already use, and prevent any unnecessary duplication of metadata fields. But Alpino is not currently using any scheme, so our harvester doesn't have much to fall back on, and as a consequence the metadata quality is rather poor.

In January a call went out to request all CLARIAH developers to take a look at this metadata and to improve upon it where needed (see CLARIAH/clariah-plus#143). Alpino has a long history in CLARIAH and CLARIN and is much used, so we'd like to have good metadata for it. Could you take a look at improving the metadata? Alpino's results are currently like this.

It would also help a lot if you could use GitHub's release mechanism (i.e. git tags) to tag releases of Alpino (with a semantic version).

Please see the contributing guidelines and the CLARIAH Software Metadata Requirements for in-depth instructions on what metadata to provide and how this can be accomplished.

Possible to get parse tree output similar to AllenNLP?

Is it possible to have Alpino output the parse tree in the following format:

In: "Several theories about the higher prevalence in males have been investigated, but the cause of the difference is unconfirmed; one theory is that females are underdiagnosed."

Out: (S (S (S (NP (NP (JJ Several) (NNS theories)) (PP (IN about) (NP (NP (DT the) (JJR higher) (NN prevalence)) (PP (IN in) (NP (NNS males)))))) (VP (VBP have) (VP (VBN been) (VP (VBN investigated))))) (, ,) (CC but) (S (NP (NP (DT the) (NN cause)) (PP (IN of) (NP (DT the) (NN difference)))) (VP (VBZ is) (ADJP (JJ unconfirmed))))) (: ;) (S (NP (CD one) (NN theory)) (VP (VBZ is) (SBAR (IN that) (S (NP (NNS females)) (VP (VBP are) (ADJP (JJ underdiagnosed))))))) (. .))

This output is currently produced with AllenNLP and a minimal span-based neural constituency parser. However, as I'm also working with Dutch data, I intend to use the Alpino parser. If Alpino cannot produce the above output directly, I suspect I will have to go over the XML output and work something out myself.
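For the fallback of walking the XML output, a minimal sketch in Python could look like the following. It assumes the alpino_ds layout shown elsewhere on this page: nested node elements, a cat attribute on phrasal nodes, and word plus pt (or pos) on lexical nodes. The function names and the trimmed sample are made up for illustration. The labels are Alpino's own categories, not Penn Treebank tags, and because Alpino produces dependency structures that may be discontinuous, sorting children by begin only approximates a constituency tree.

```python
import xml.etree.ElementTree as ET


def to_brackets(node):
    """Render one Alpino <node> element as a bracketed string (illustrative helper)."""
    word = node.get("word")
    if word is not None:
        # Lexical node: use the pt tag if present, otherwise pos.
        label = node.get("pt") or node.get("pos") or "?"
        return f"({label.upper()} {word})"
    # Phrasal node: render the children in surface order.
    kids = sorted(node.findall("node"), key=lambda n: int(n.get("begin", 0)))
    parts = [p for p in (to_brackets(k) for k in kids) if p]
    if not parts:
        # Node with neither a word nor children (e.g. index-only): skip it.
        return ""
    label = node.get("cat") or "?"
    return f"({label.upper()} {' '.join(parts)})"


def alpino_xml_to_brackets(xml_text):
    """Convert a whole alpino_ds document to one bracketed string."""
    root = ET.fromstring(xml_text)
    return to_brackets(root.find("node"))


# Trimmed, hand-written sample in the alpino_ds layout (most attributes omitted).
SAMPLE = """<alpino_ds version="1.6">
  <node begin="0" cat="top" end="3" id="0" rel="top">
    <node begin="0" cat="du" end="3" id="1" rel="--">
      <node begin="0" end="1" id="2" pt="tsw" rel="tag" word="hallo"/>
      <node begin="1" cat="np" end="3" id="3" rel="nucl">
        <node begin="1" end="2" id="4" pt="n" rel="hd" word="wereld"/>
        <node begin="2" end="3" id="5" pt="tw" rel="app" word="."/>
      </node>
    </node>
  </node>
</alpino_ds>"""

print(alpino_xml_to_brackets(SAMPLE))
# (TOP (DU (TSW hallo) (NP (N wereld) (TW .))))
```

Tokens that themselves contain parentheses would need escaping before being written into this format; the sketch does not handle that.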

</sentence> sentid="" output when assume_input_is_tokenized=on

When I edit the Makefile.start_server script and change the line

assume_input_is_tokenized=off\

to assume_input_is_tokenized=on, the output becomes malformed.

For example:

$ make -f Makefile.start_server 
PROLOGMAXSIZE=1500M /opt/Alpino-git233/bin/Alpino -notk -veryfast user_max=20000\
            server_kind=parse\
            server_port=42424\
            assume_input_is_tokenized=on\
            debug=1\
            -init_dict_p\
            batch_command=alpino_server\
    	2> /alpino_server.log &

$ telnet localhost 42424
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hallo wereld .
top/top|top/hd|hallo/[0,1]|127.0.0.1
hallo/[0,1]|tag/nucl|wereld/[1,2]|127.0.0.1
/[2,3]|127.0.0.1app|.
<?xml version="1.0" encoding="UTF-8"?>
<alpino_ds version="1.6">
  <parser build="Alpino-x86_64-linux-glibc2.5-git233-sicstus" date="2021-02-04T16:52" cats="1" skips="0" />
  <node begin="0" cat="top" end="3" id="0" rel="top">
    <node begin="0" cat="du" end="3" id="1" rel="--">
      <node begin="0" end="1" frame="tag" his="normal" his_1="normal" id="2" lcat="advp" lemma="hallo" pos="tag" postag="TSW()" pt="tsw" rel="tag" root="hallo" sense="hallo" word="hallo"/>
      <node begin="1" cat="np" end="3" id="3" rel="nucl">
        <node begin="1" end="2" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" his="normal" his_1="normal" id="4" lcat="np" lemma="wereld" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" rnum="sg" root="wereld" sense="wereld" word="wereld"/>
"/>pecial="hoofd" word=".m" positie="vrij" postag="TW(hoofd,vrij)" pt="tw" rel="app" root=".ssion" id="5" infl="both" lcat="detp" lemma=".
      </node>
    </node>
  </node>
</sentence> sentid="127.0.0.1">hallo wereld .
</alpino_ds>
Connection closed by foreign host.

Keeping assume_input_is_tokenized set to off does give a correctly formatted sentence item: <sentence sentid="127.0.0.1">hallo wereld .</sentence>.

I have to implement a workaround here anyway to support older Alpino versions, so this isn't an issue for me. But I was wondering if there might be some setting I'm missing here to prevent this from happening? I couldn't figure out where in the Alpino code this goes wrong.
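One thing that may be worth ruling out: the garbled lines above look like what a terminal shows when a line contains a carriage return. telnet normally ends each line with CR LF, and with assume_input_is_tokenized=on the CR might stay attached to the last token. The sketch below is only a simulation of terminal rendering. The two raw lines are guesses at what the server might have sent, not captured output, and the helper name is made up.

```python
def render_like_terminal(raw):
    """Simulate how a terminal displays one line that contains carriage returns."""
    cells = []
    col = 0
    for ch in raw:
        if ch == "\r":
            # A carriage return moves the cursor back to column 0;
            # later characters overwrite what is already on the line.
            col = 0
            continue
        if col < len(cells):
            cells[col] = ch
        else:
            cells.append(ch)
        col += 1
    return "".join(cells)


# Hypothetical raw lines: the last token is ".\r" instead of ".".
raw_sentence = '  <sentence sentid="127.0.0.1">hallo wereld .\r</sentence>'
raw_triple = "wereld/[1,2]|hd/app|.\r/[2,3]|127.0.0.1"

print(render_like_terminal(raw_sentence))
# </sentence> sentid="127.0.0.1">hallo wereld .
print(render_like_terminal(raw_triple))
# /[2,3]|127.0.0.1app|.
```

Both simulated lines match the telnet session character for character. If this is indeed the cause, the XML itself may be well-formed and only its display is affected. Sending the sentence terminated by a bare newline, for instance from a script instead of telnet, would be one way to check.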

invalid xml due to double double quotes

The following input:

$ cd /tmp; mkdir -p example; echo '" Kijk " , zei Japi , " de " " Stad Gent . " " Je zag in de verte het water aan weerszijden van de boeg hoog opvliegen ; om de schroef zag je het woelen en bruisen en schuimen .' | $ALPINO_HOME/bin/Alpino number_analyses=1 end_hook=xml -parse -flag treebank example

Gives:

$ grep mwu example/1.xml
    <node begin="14" cat="mwu(" "," ")" end="16" his="normal" his_1="longpunct" id="7" rel="--">
          <node begin="9" cat="mwu(" " Stad Gent," " Stad Gent)" end="13" his="name" his_1="not_begin" id="19" rel="hd">
$ xmlwf example/1.xml
example/1.xml:11:32: not well-formed (invalid token)
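The xmlwf error is consistent with the double quotes inside the cat value not being escaped when the attribute is written. As a general illustration (this is not Alpino's own serialisation code, and the helper name is made up), a sketch in Python of escaping an attribute value so that it survives a round trip through an XML parser:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def xml_attr(name, value):
    """Return name="value" with &, <, > and double quotes escaped (illustrative helper)."""
    escaped = escape(value, {'"': "&quot;"})
    return f'{name}="{escaped}"'


# The cat value as it appears in the grep output above.
raw_cat = 'mwu(" "," ")'

cat_attr = xml_attr("cat", raw_cat)
element = f'<node begin="14" {cat_attr} end="16"/>'
print(element)
# <node begin="14" cat="mwu(&quot; &quot;,&quot; &quot;)" end="16"/>

# The escaped element parses, and the parser returns the original value.
parsed = ET.fromstring(element)
assert parsed.get("cat") == raw_cat
```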
