languagemachines / frog

Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package.

Home Page: https://languagemachines.github.io/frog

License: GNU General Public License v3.0

Shell 1.00% C++ 95.57% Makefile 0.30% M4 2.82% Dockerfile 0.31%
dutch nlp natural-language-processing lemmatiser pos-tagger dependency-parser named-entity-recognition computational-linguistics folia morphological-analyser

frog's People

Contributors

helmutg, irishx, kosloot, proycon

frog's Issues

frog server parses both headers and data when called with curl/wget

System: Linux Mint 18.1 (kernel: 4.4.0-53-generic)
curl version: 7.47.0
frog version: 0.13.8

Command entered:

curl -H "User-Agent:" -H "Accept:" -H "Content-Length:" -H "Content-Type:" -H "Host:" -d "Dit is een test.
EOT
" localhost:8001

Trace:

== Info: Rebuilt URL to: localhost:8001/
== Info: Trying 127.0.0.1...
== Info: Connected to localhost (127.0.0.1) port 8001 (#0)
=> Send header, 19 bytes (0x13)
0000: 50 4f 53 54 20 2f 20 48 54 54 50 2f 31 2e 31 0d POST / HTTP/1.1.
0010: 0a 0d 0a ...
=> Send data, 21 bytes (0x15)
0000: 44 69 74 20 69 73 20 65 65 6e 20 74 65 73 74 2e Dit is een test.
0010: 0a 45 4f 54 0a .EOT.
== Info: upload completely sent off: 21 out of 21 bytes
<= Recv data, 62 bytes (0x3e)
0000: 31 09 50 4f 53 54 09 50 4f 53 54 09 5b 50 4f 53 1.POST.POST.[POS
0010: 54 5d 09 53 50 45 43 28 64 65 65 6c 65 69 67 65 T].SPEC(deeleige
0020: 6e 29 09 31 2e 30 30 30 30 30 30 09 42 2d 50 45 n).1.000000.B-PE
0030: 52 09 42 2d 4e 50 09 30 09 52 4f 4f 54 0a R.B-NP.0.ROOT.
<= Recv data, 482 bytes (0x1e2)
0000: 32 09 2f 09 2f 09 5b 2f 5d 09 4c 45 54 28 29 09 2././.[/].LET().
0010: 31 2e 30 30 30 30 30 30 09 4f 09 49 2d 4e 50 09 1.000000.O.I-NP.
0020: 31 09 70 75 6e 63 74 0a 33 09 48 54 54 50 09 48 1.punct.3.HTTP.H
0030: 54 54 50 09 5b 48 54 54 50 5d 09 53 50 45 43 28 TTP.[HTTP].SPEC(
0040: 64 65 65 6c 65 69 67 65 6e 29 09 31 2e 30 30 30 deeleigen).1.000
0050: 30 30 30 09 42 2d 4f 52 47 09 49 2d 4e 50 09 31 000.B-ORG.I-NP.1
0060: 09 6d 6f 64 0a 34 09 2f 09 2f 09 5b 2f 5d 09 4c .mod.4././.[/].L
0070: 45 54 28 29 09 31 2e 30 30 30 30 30 30 09 4f 09 ET().1.000000.O.
0080: 4f 09 33 09 70 75 6e 63 74 0a 35 09 31 2e 31 09 O.3.punct.5.1.1.
0090: 31 2e 31 09 5b 31 2e 31 5d 09 53 50 45 43 28 73 1.1.[1.1].SPEC(s
00a0: 79 6d 62 29 09 31 2e 30 30 30 30 30 30 09 4f 09 ymb).1.000000.O.
00b0: 42 2d 4e 50 09 33 09 6d 6f 64 0a 0a 31 09 44 69 B-NP.3.mod..1.Di
00c0: 74 09 64 69 74 09 5b 64 69 74 5d 09 56 4e 57 28 t.dit.[dit].VNW(
00d0: 61 61 6e 77 2c 70 72 6f 6e 2c 73 74 61 6e 2c 76 aanw,pron,stan,v
00e0: 6f 6c 2c 33 6f 2c 65 76 29 09 30 2e 37 37 37 30 ol,3o,ev).0.7770
00f0: 38 35 09 4f 09 42 2d 4e 50 09 32 09 73 75 0a 32 85.O.B-NP.2.su.2
0100: 09 69 73 09 7a 69 6a 6e 09 5b 7a 69 6a 6e 5d 09 .is.zijn.[zijn].
0110: 57 57 28 70 76 2c 74 67 77 2c 65 76 29 09 30 2e WW(pv,tgw,ev).0.
0120: 39 39 39 38 39 31 09 4f 09 42 2d 56 50 09 30 09 999891.O.B-VP.0.
0130: 52 4f 4f 54 0a 33 09 65 65 6e 09 65 65 6e 09 5b ROOT.3.een.een.[
0140: 65 65 6e 5d 09 4c 49 44 28 6f 6e 62 65 70 2c 73 een].LID(onbep,s
0150: 74 61 6e 2c 61 67 72 29 09 30 2e 39 39 39 31 31 tan,agr).0.99911
0160: 33 09 4f 09 42 2d 4e 50 09 34 09 64 65 74 0a 34 3.O.B-NP.4.det.4
0170: 09 74 65 73 74 09 74 65 73 74 09 5b 74 65 73 74 .test.test.[test
0180: 5d 09 4e 28 73 6f 6f 72 74 2c 65 76 2c 62 61 73 ].N(soort,ev,bas
0190: 69 73 2c 7a 69 6a 64 2c 73 74 61 6e 29 09 30 2e is,zijd,stan).0.
01a0: 39 30 33 30 35 35 09 4f 09 49 2d 4e 50 09 32 09 903055.O.I-NP.2.
01b0: 70 72 65 64 63 0a 35 09 2e 09 2e 09 5b 2e 5d 09 predc.5.....[.].
01c0: 4c 45 54 28 29 09 31 2e 30 30 30 30 30 30 09 4f LET().1.000000.O
01d0: 09 4f 09 34 09 70 75 6e 63 74 0a 0a 52 45 41 44 .O.4.punct..READ
01e0: 59 0a Y.
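
The trace suggests that the server treats every byte on the socket as text to analyse: the request line "POST / HTTP/1.1" is tokenised and tagged before the actual sentence. A minimal sketch of a client that skips HTTP altogether is given below. It assumes, based only on the trace above, that the input is closed by a line containing EOT and that the reply ends with a line containing READY. The function names and the default host and port (taken from the curl command) are illustrative, not part of Frog.

```python
import socket


def build_request(text):
    """Frame plain text like the curl payload above: the text, a newline, then an EOT line."""
    return (text.rstrip("\n") + "\nEOT\n").encode("utf-8")


def read_reply(recv, terminator=b"READY\n"):
    """Collect chunks from recv() until the terminator line arrives; return the text before it."""
    buffer = b""
    while not buffer.endswith(terminator):
        chunk = recv(4096)
        if not chunk:  # connection closed before the terminator was seen
            break
        buffer += chunk
    if buffer.endswith(terminator):
        buffer = buffer[: -len(terminator)]
    return buffer.decode("utf-8")


def frog_request(text, host="localhost", port=8001):
    """Send one text to a running server over a raw TCP socket (no HTTP headers)."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_request(text))
        return read_reply(sock.recv)
```

With a server listening on port 8001, frog_request("Dit is een test.") would send the same 21 data bytes as the curl call, but without the header bytes that end up in the analysis.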

Frog Freeze

frog (0.13.6) freezes on input that contains long sequences of exclamation marks. To reproduce, create a file input.txt with the following content:
Nee!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Then run:
frog -t input.txt
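
Until the freeze itself is fixed, one possible stopgap is to shorten such runs before the text reaches frog. The sketch below is only a preprocessing workaround and changes the input text; the function name and the limit of three repeats are assumptions chosen for illustration.

```python
import re


def collapse_runs(text, chars="!", max_run=3):
    """Shorten any run of one repeated character from `chars` to at most `max_run` copies."""
    # ([!])\1{3,} matches four or more copies of the same character when max_run is 3.
    pattern = re.compile("([" + re.escape(chars) + r"])\1{" + str(max_run) + ",}")
    return pattern.sub(lambda match: match.group(1) * max_run, text)
```

For example, collapse_runs("Nee" + "!" * 40) leaves "Nee!!!", while shorter runs such as "Ja!!" pass through unchanged.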

Add a NER post-processing step

In addition to the gazetteers, there is a need for a list of NEs that are forced over the standard NEs in a post-processing step.

  • This must be optional.
  • NEs that were already assigned must be overwritten cleanly. Longer ones may be replaced.
  • The list must be compiled carefully, preferably containing only words that are NOT already tagged as a regular NE.
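
The requested behaviour can be sketched as follows. This is not Frog code: the function names, the dictionary format of the override list and the B-/I- tag handling are assumptions, modelled on the NER column seen in Frog's tabbed output. Existing entities that overlap an override are cleared completely, so no dangling I- tags remain.

```python
def _clear_overlapping(tags, start, end):
    """Reset every existing entity that overlaps tags[start:end] to 'O'."""
    while start > 0 and tags[start].startswith("I-"):
        start -= 1  # walk back to the B- tag that opens the entity
    while end < len(tags) and tags[end].startswith("I-"):
        end += 1  # include the tail of an entity that runs past the span
    for k in range(start, end):
        tags[k] = "O"


def apply_overrides(tokens, tags, overrides):
    """Force entity labels from `overrides` (token tuple -> label) over existing tags.

    The longest matching token sequence wins at each position.
    """
    result = list(tags)
    longest = max((len(key) for key in overrides), default=0)
    i = 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            label = overrides.get(tuple(tokens[i:i + n]))
            if label is not None:
                _clear_overlapping(result, i, i + n)
                result[i] = "B-" + label
                for j in range(i + 1, i + n):
                    result[j] = "I-" + label
                i += n
                break
        else:
            i += 1  # no override starts at this token
    return result
```

Making the step optional then comes down to calling apply_overrides only when an override list was configured.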

Order of rules leads to a failing reading

'Bezienswaardig' works fine:
V_*V 0 V 0 0 A_V*A 0 N 0 0 0 0 A_N*+De 0/P
==>
[ [ [ [be]V_*V [zie]V ]V [ns]A_V*A [ [waarde]N [ig]A_N* ]A ]A P ]A

But 'bezienswaardigheid':
V_*V 0 V 0 0 A_V*A 0 N 0 0 0 0 A_N*+De 0 N_A* 0 0 0/e

==>
[ [ [be]V_*V [zie]V ]V [ns]A_V*A [ [ [waarde]N [ig]A_N* ]A [heid]N_A* ]N e ]N
This looks nice, but the A_V*A rule of [ns] is NOT applicable, because the A 'waardig' has first been converted to the N 'waardigheid'.

This could be solved by applying the 'infix' rules before the prefix rules.
The consequences for MBMA are unclear in that case. In most cases it seems a bad idea.

Another solution:
create an A_V*N variant next to the A_V*A variant.
You then choose the reading in which 'bezie' is attached to 'waardigheid',
and not the one in which 'heid' is attached to 'bezienswaardig'.

Also doubtful.

"terminate called without an active exception"

Forwarding bug report mailed by Alex Bransen:

I can't get frog to work with FoLiA as input, but I want to check whether
that is caused by my installation or whether you see it too. When you have
a moment, could you run the attached FoLiA file through frog to see whether
you get an error as well? I get:

"terminate called without an active exception" (a really helpful error, too)

The command is: frog -x BAAC_A-11-0119.xml -X frogged-BAAC.xml

Strangely enough, when I change the -x flag to -t (so that it treats the XML
as plain text), it does work.

According to foliavalidator it is valid XML, by the way.

The input file is https://download.anaproy.nl/BAAC_A-11-0119.xml. I can reproduce the bug locally on the latest development version.

Frog gets progressively slower when running for hours, days

When running frog for a long time, performance decreases significantly.

For instance, I'm processing a 2.8GB file. These are some pv outputs at several moments:

frog: 15.8MiB  0:05:10 [39.2KiB/s] [>                                 ]  0% ETA  4:20:44:08
frog:  545MiB  3:29:02 [62.2KiB/s] [>                                 ]  2% ETA  5:13:26:11     
frog:  858MiB  5:23:19 [30.6KiB/s] [=>                                ]  4% ETA  5:09:10:47
frog: 3.05GiB 73:28:54 [    0 B/s] [====>                             ] 14% ETA 17:22:57:01

At first, the expected time is 4 days and 20 hours. After having run for 3 days, the expected time has gone up to almost 18 days.
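
The slowdown is easier to see when the pv snapshots are turned into throughput per interval instead of ETA values. The sketch below does that arithmetic on the four "frog:" lines quoted above; the helper names are made up for this example, and the sizes are taken at face value from the pv output.

```python
UNITS_IN_KIB = {"KiB": 1, "MiB": 1024, "GiB": 1024 * 1024}

# (amount, unit, elapsed time) copied from the four pv lines above
SAMPLES = [
    (15.8, "MiB", "0:05:10"),
    (545, "MiB", "3:29:02"),
    (858, "MiB", "5:23:19"),
    (3.05, "GiB", "73:28:54"),
]


def to_seconds(clock):
    """Convert an elapsed time written as H:MM:SS to seconds."""
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_kib(amount, unit):
    """Convert a size such as (545, 'MiB') to KiB."""
    return amount * UNITS_IN_KIB[unit]


def interval_rates(samples):
    """Average throughput in KiB/s between each pair of consecutive snapshots."""
    rates = []
    for (a0, u0, t0), (a1, u1, t1) in zip(samples, samples[1:]):
        transferred = to_kib(a1, u1) - to_kib(a0, u0)
        elapsed = to_seconds(t1) - to_seconds(t0)
        rates.append(transferred / elapsed)
    return rates
```

For the snapshots above, interval_rates(SAMPLES) comes out at roughly 44, 47 and 9.5 KiB/s, so the throughput over the last interval is about a fifth of what it was during the first hours.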

This is the script used:

#!/usr/bin/env bash
FILE=$1
# Input size in bytes; pv uses it to estimate progress.
FILE_SIZE=$(wc -c < "$FILE")
# Rough estimate: the frog output is assumed to be about 8 times the input size.
EXPECTED_FROG_SIZE=$((8 * FILE_SIZE))
BODY="${FILE%.*}"

echo "Processing: $FILE"
echo "Size: $FILE_SIZE"
echo "Writing output to: ${BODY}_frog.txt"

pv -s "${FILE_SIZE}" -cN in "${FILE}" | frog --skip=acmnp 2> "${BODY}_frog.log" | pv -s "${EXPECTED_FROG_SIZE}" -cN frog > "${BODY}_frog.txt"

Move Frog documentation sources to this git repo

They are still on Overleaf; it is best to keep everything together with the sources (so that documentation versions correspond to the source code).

Also: fix the python-frog error in the documentation. The explicit configuration file path is outdated and needs to be removed.

Always do deep morpheme analysis internally

The code can be simplified by ALWAYS performing deep morpheme analysis and extracting the flat structure from it for the 'classic' case.
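
The idea can be illustrated with a small sketch. It assumes, purely for illustration, that a deep analysis is available as nested lists of morpheme strings; MBMA's real data structures are different, and the function name is made up.

```python
def flatten(node):
    """Return the morphemes of a nested (deep) analysis as one flat list, left to right."""
    if isinstance(node, str):
        return [node]
    flat = []
    for child in node:
        flat.extend(flatten(child))
    return flat


# Nested-list rendering (an assumption) of the bracketing
# [ [ [ [be] [zie] ] [ns] [ [waarde] [ig] ] ] ] for 'bezienswaardig'
DEEP = [[["be", "zie"], "ns", ["waarde", "ig"]]]
```

Here flatten(DEEP) yields the five morphemes be, zie, ns, waarde, ig in order, which is the kind of flat result the 'classic' output needs.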

question on accuracy

Hello, this is not an issue, just a question about the accuracy of the different NLP tasks.
I'm interested in comparing different types of NLP annotators and their accuracy. How well does Frog do regarding accuracy on tokenisation, part-of-speech tagging, lemmatisation, morphological feature annotation and dependency parsing?
Are there numbers available that are comparable to the CoNLL 2017 shared task, for example obtained by training Frog on Dutch data from Universal Dependencies and then scoring the output with the evaluation script used by the shared task (available at https://github.com/ufal/conll2017/blob/master/evaluation_script/conll17_ud_eval.py)?

Frog can't deal with tokens that contain spaces

In historical Dutch, certain words may be written apart although they can be considered one token, for example "vol daen" (voldaan), and they are represented as a single <w> in FoLiA. Would the various Frog modules (mblem, mbpos etc.) be able to deal with spaces in tokens?

Implicit rules in MBMA/CELEX

CELEX contains quite a few implicit morphological rules. An example:

16501\buitenom\35\C\1\Y\Y\Y\buiten+om\PP\N\N((buiten)[P],(om)[P])[B]\N\N\N

In short, this says that 'buitenom' is a B built from two Ps.
The implicit rule is in the brackets:
P + P ==> B

This is not reflected in mbma-merged:
buitenom P 0 0 0 0 0 P 0

Error 1: no B result.
Error 2: no composition rule. In MBMA we could always do P + P ==> P (a PP compound), just as already exists for N + N.
You would then still miss that buitenom is apparently an adverb (according to CELEX).

With the new GLUE rule this can be solved:
buitenom B_^PP 0 0 0 0 0 P 0
buitenom==> [ [buiten]B_^PP [om]P ]B B

CELEX is also somewhat ambiguous, given 'buitenomgaan':
16502\buitenomgaan\0\C\1\Y\Y\Y\buiten+omga\B1\N\N((buiten)[B],((om)[P],(ga)[V])[V])[V]\N\N\Y

Here 'buiten' is directly a B and not a P.
MBMA does not create a deeper structure than:
buitenomgaan==> [ [buiten]B [om]P [ga]V i ]V V
The V is assigned on the basis of the inflection.

What CELEX intends here is, I think, more like this:
buitenomgaan B 0 0 0 0 0 V_^PV 0 V 0 0+Ian 0/i|0/tm
buitenomgaan==> [ [buiten]B [ [om]V_^PV [ga]V ]V PV-compound i ]V V

or even:
buitenomgaan B^PV 0 0 0 0 0 V_^PV 0 V 0 0+Ian 0/i|0/tm
buitenomgaan==> [ [buiten]B [ [om]V_^PV [ga]V ]V PV-compound i ]V V
With the same result, because the GLUE rule is not applied recursively, and the first can only apply after the second has succeeded.
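
To get an overview of which implicit rules occur, the bracketed CELEX structures can be parsed mechanically. The sketch below handles only the "(...)[TAG]" notation shown in the two entries above and lists one rule per non-leaf node; the function names and the dictionary layout are assumptions for this example, not part of MBMA.

```python
def parse_celex(s, i=0):
    """Parse one '(...)[TAG]' element starting at s[i]; return (node, next_index).

    A node is a dict with 'tag' plus either 'text' (a morpheme) or 'children'.
    """
    if s[i] != "(":
        raise ValueError("expected '(' at position %d" % i)
    i += 1
    node = {}
    if s[i] == "(":
        # Composite element: a comma-separated list of nested elements.
        children = []
        while True:
            child, i = parse_celex(s, i)
            children.append(child)
            if s[i] != ",":
                break
            i += 1
        node["children"] = children
    else:
        # Leaf element: everything up to the closing parenthesis is the morpheme.
        end = s.index(")", i)
        node["text"] = s[i:end]
        i = end
    if s[i] != ")":
        raise ValueError("expected ')' at position %d" % i)
    i += 1
    if s[i] != "[":
        raise ValueError("expected '[' at position %d" % i)
    end = s.index("]", i)
    node["tag"] = s[i + 1:end]
    return node, end + 1


def implicit_rules(node):
    """List (child tags, result tag) for every composite node, outermost first."""
    rules = []
    if "children" in node:
        rules.append((tuple(child["tag"] for child in node["children"]), node["tag"]))
        for child in node["children"]:
            rules.extend(implicit_rules(child))
    return rules
```

Applied to the structure of 'buitenom', this reports the single rule P + P ==> B; for 'buitenomgaan' it reports B + V ==> V followed by P + V ==> V.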

Performance issues on processing huge collections -> revise multithreading implementation

Processing of huge amounts of pre-tokenised FoLiA documents (for Nederlab) goes unexpectedly slow, despite disabling various modules (--skip=mcpa). In about 24 hours, about 90 documents have been processed.

Frog is called on a directory as follows (to eliminate initialisation overhead):

frog --skip=mcpa -override tokenizer.rulesFile=tokconfig-nld-historical --xmldir "." --threads 40 --testdir input/ -x

Log excerpt of a single document (rarity:/scratch/proycon/morr001cryp01_01.tok.folia.xml) in a long-running batch (days if not weeks):

frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 1 milliseconds and 510 microseconds
frog-:CGN tagging took:   42 seconds, 593 milliseconds and 430 microseconds
frog-:NER took:           30 seconds, 188 milliseconds and 829 microseconds
frog-:Mblem took:         0 seconds, 803 milliseconds and 711 microseconds
frog-:Frogging in total took: 89 seconds, 930 milliseconds and 350 microseconds
frog-:resulting FoLiA doc saved in ./mole002refe01_01_0005.tok.folia.xml
frog-:Frogging input/moll013albu01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 16 milliseconds and 169 microseconds
frog-:CGN tagging took:   263 seconds, 846 milliseconds and 415 microseconds
frog-:NER took:           184 seconds, 152 milliseconds and 171 microseconds
frog-:Mblem took:         5 seconds, 328 milliseconds and 128 microseconds
frog-:Frogging in total took: 559 seconds, 893 milliseconds and 805 microseconds
frog-:resulting FoLiA doc saved in ./moll013albu01_01.tok.folia.xml
frog-:Frogging input/mont003zome01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 47 milliseconds and 859 microseconds
frog-:CGN tagging took:   57 seconds, 762 milliseconds and 530 microseconds
frog-:NER took:           37 seconds, 773 milliseconds and 605 microseconds
frog-:Mblem took:         1 seconds, 304 milliseconds and 201 microseconds
frog-:Frogging in total took: 120 seconds, 269 milliseconds and 869 microseconds
frog-:resulting FoLiA doc saved in ./mont003zome01_01.tok.folia.xml
----------------------------- (my emphasis) ------------------------
frog-:Frogging input/morr001cryp01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 31 milliseconds and 59 microseconds
frog-:CGN tagging took:   322 seconds, 971 milliseconds and 28 microseconds
frog-:NER took:           728 seconds, 839 milliseconds and 11 microseconds
frog-:Mblem took:         20 seconds, 219 milliseconds and 307 microseconds
frog-:Frogging in total took: 1494 seconds, 657 milliseconds and 222 microseconds
-----------------------------------------------------
frog-:resulting FoLiA doc saved in ./morr001cryp01_01.tok.folia.xml
frog-:Frogging input/moul004vowe01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 7 milliseconds and 995 microseconds
frog-:CGN tagging took:   336 seconds, 695 milliseconds and 921 microseconds
frog-:NER took:           111 seconds, 161 milliseconds and 469 microseconds
frog-:Mblem took:         3 seconds, 0 milliseconds and 52 microseconds
frog-:Frogging in total took: 508 seconds, 408 milliseconds and 533 microseconds
frog-:resulting FoLiA doc saved in ./moul004vowe01_01.tok.folia.xml
frog-:Frogging input/mouw001brah01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 37 milliseconds and 979 microseconds
frog-:CGN tagging took:   30 seconds, 709 milliseconds and 748 microseconds
frog-:NER took:           28 seconds, 536 milliseconds and 162 microseconds
frog-:Mblem took:         0 seconds, 966 milliseconds and 700 microseconds
frog-:Frogging in total took: 75 seconds, 622 milliseconds and 890 microseconds
frog-:resulting FoLiA doc saved in ./mouw001brah01_01.tok.folia.xml
frog-:Frogging input/muld014janf01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!

(full log in rarity:/scratch/proycon/frog.log)

Comparison; a standalone run on only the highlighted document (without -nostdout) :

frog-:Frogging morr001cryp01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 17 milliseconds and 448 microseconds
frog-:CGN tagging took:   17 seconds, 26 milliseconds and 152 microseconds
frog-:NER took:           65 seconds, 981 milliseconds and 710 microseconds
frog-:Mblem took:         28 seconds, 174 milliseconds and 786 microseconds
frog-:Frogging in total took: 487 seconds, 310 milliseconds and 584 microseconds

I have some minor suggestions for better debugging:

  • Let frog output a date/timestamp before and after processing (this makes it easier to inspect long logs that cover days of running).
  • Is some normalisation of timings possible (in addition to the absolute numbers), e.g. dividing by the total number of tokens?
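
Until frog reports such figures itself, the normalisation can be done on the log afterwards. The sketch below parses the "took:" lines in the format shown in the excerpts above; the token count does not appear in the log, so it has to be supplied by the caller, and the function names are made up.

```python
import re

TIMING = re.compile(
    r"^\s*frog-:(?P<step>.+?)\s+took:\s+(?P<s>\d+) seconds, "
    r"(?P<ms>\d+) milliseconds and (?P<us>\d+) microseconds"
)


def parse_timing(line):
    """Return (step name, duration in microseconds), or None for other log lines."""
    match = TIMING.match(line)
    if match is None:
        return None
    micros = (
        int(match.group("s")) * 1_000_000
        + int(match.group("ms")) * 1_000
        + int(match.group("us"))
    )
    return match.group("step"), micros


def micros_per_token(line, token_count):
    """Normalise one timing line by a token count supplied by the caller."""
    parsed = parse_timing(line)
    if parsed is None or token_count <= 0:
        return None
    step, micros = parsed
    return step, micros / token_count
```

Working in integer microseconds avoids rounding surprises when the three parts of a timing are added up.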

Other possibility for testdir:

  • Parallelise document processing instead of modules? (if not too complicated)
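
As a rough illustration of document-level parallelism, the sketch below hands each document to a worker instead of splitting the modules over threads. The callable process_document is hypothetical: it stands for whatever processes a single file, for instance by starting one single-threaded frog process. Threads are sufficient in that case because the actual work happens outside Python.

```python
from concurrent.futures import ThreadPoolExecutor


def process_all(paths, process_document, max_workers=4):
    """Run process_document on every path, at most max_workers at a time.

    Returns a dict mapping each path to the result of its call.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the input paths
        return dict(zip(paths, pool.map(process_document, paths)))
```

One consequence of this design is that a slow document only blocks its own worker, whereas in a per-module scheme every thread waits for the slowest module of the current document.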

Testability: how to find the right frogtests?

Since the frog tests are in a separate repository, automatic testing of any arbitrary version (that includes old versions) is not possible without explicitly establishing which frogtests version is needed for which frog version. This is relevant for the debian package build process. How can we solve this?

The same issue applies to libfolia and ucto.

MBMA structural problem with (atie) (ief)

CELEX delivers rules such as:
(((associeer)[V],(atie)[N|V.])[N],(ief)[A|N.])[A]
(((restaureer)[V],(atie)[N|V.])[N],(ief)[A|N.])[A]

With intended readings such as:
associatief ==> [ [ [associeer]V [atie] ]N ] ] [ief] ] A

In mbma-merged these are all(?) wrong:
associatief V 0 0 0 0 0 0+Reer>at 0 A_N* 0 0/P
restauratief V 0 0 0 0 0 0 0+Reer>at 0 A_N* 0 0/P

There are at least 101 such entries.

The result is:
[ [associeer]V [ief]A_N* P ]A A

Solution 1:
associatief V 0 0 0 0 0 0+Reer>at N_V*+Hatie A_N* 0 0/P
gives:
[ [ [ [associeer]V [atie]N_V* ]N [ief]A_N* ]A P ]A A

Another solution:
associatief V 0 0 0 0 0 N_V*+Deer 0 0 0 A_N*+Rief>f/P

But this gives:
[ [ [associeer]V [atie]A_V* ]A [ief]P ]A A
I think this is a different BUG: the /P (positive) inflection has been converted to a P (preposition) tag.
OMG

avoid use of exit()

Debian advises against the use of exit().
We generally use it to signal start-up problems (missing files, syntax errors in the configuration and such).
These can be converted to exceptions, and maybe should be.
One problem is already clear: these exceptions must be caught INSIDE the OpenMP thread where the exception is thrown. So take care!

The OpenMP 3.0 specification states:
"A throw executed inside a parallel region must cause execution to resume within the same parallel region,
and the same thread that threw the exception must catch it."

@proycon : I will look into this.

Frog mblem crash: folia::ValueError: attribute 'class' is required for lemma (empty class passed)

mblem gets an empty class and crashes. The input document is in /vol/tensusers/proycon/vrie047wond01_01.tok.translated.folia.xml

frog-:Frogging froginput/vrie047wond01_01.tok.translated.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-mblem:attribute 'class' is required for lemma addLemma failed.
terminate called after throwing an instance of 'folia::ValueError'
  what():  attribute 'class' is required for lemma
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x2aaab3934700 (LWP 142807)]
0x00002aaaac941c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00002aaaac941c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00002aaaac945028 in __GI_abort () at abort.c:89
#2  0x00002aaaac233535 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00002aaaac2316d6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00002aaaac230799 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00002aaaac23134a in __gxx_personality_v0 () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00002aaaac4e6fd3 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7  0x00002aaaac4e735b in _Unwind_RaiseException () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#8  0x00002aaaac4e75dd in _Unwind_Resume_or_Rethrow () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#9  0x00002aaaac231969 in __cxa_rethrow () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00002aaaaad7c2c8 in Mblem::addLemma (this=this@entry=0x65b3e0, word=word@entry=0x64236a60, cls="") at mblem_mod.cxx:219
#11 0x00002aaaaad7c771 in Mblem::Classify (this=0x65b3e0, sword=<optimized out>) at mblem_mod.cxx:339
#12 0x00002aaaaad4a0b1 in FrogAPI::TestSentence () at FrogAPI.cxx:470
#13 0x00002aaaad8b034a in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#14 0x00002aaaac6f5184 in start_thread (arg=0x2aaab3934700) at pthread_create.c:312
#15 0x00002aaaaca08ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:11

MBMA nesting/glueing versus NOUN compounding

Problem:
The rule for circulatie works:
circulatie N 0 0 0+Rke>cu 0 V_N*+Heer N_V* 0 0 0/e
circulatie==> [ [ [ [cirkel]N [eer]V_N* ]V [atie]N_V* ]N e ]N N

But for geldcirculatie:
geldcirculatie N 0 0 0 N 0 0 0+Rke>cu 0 V_N*+Heer N_V* 0 0 0/e
geldcirculatie==> [ [ [ [ [geld]N [cirkel]N ]N NN-compound [eer]V_N* ]V NN-compound [atie]N_V* ]N e ]N N

Here 'geld' is wrongly attached to 'cirkel' first, forming an NN compound.
We must make sure that the noun 'circulatie' is built first, BEFORE the N+N rule applies.
So reverse the order? Perhaps a good idea to test this.

Frog sometimes segfaults when processing large batches

When processing large batches of FoLiA input and FoLiA output (though possibly also plain text input), Frog sometimes crashes with a segfault in a non-reproducible fashion. Thread safety issues seem to be the most likely suspect.

ner tool has no manpage

Not a major priority, but Debian's lintian complains about this during packaging: binary-without-manpage usr/bin/ner. It would be nice to have this fixed in a next release.

MBMA desoriëntatie

The CELEX reading of desoriëntatie is:
((des)[N|.N],(((Orient)[N],(eer)[V|N.])[V],(atie)[N|V.])[N])
or:
(((des)[V|.V],((Orient)[N],(eer)[V|N.])[V])[V],(atie)[N|V.])[N]

In MBMA terms, something like:
desoriëntatie N_N 0 0 N+RO>o 0 0 0 0 V_N+Heer N_V* 0 0 0/e
desoriëntatie V_^VV 0 0 N+RO>o 0 0 0 0 V_N*+Heer N_V* 0 0 0/e

But the results are not quite right:
desoriëntatie==> [ [ [ [ [des]N_N [Oriënt]N ]N [eer]V_N ]V [atie]N_V* ]N e ]N N
Here 'des' is glued to Orient, and not to 'oriëntatie'.

and, respectively:
desoriëntatie==> [ [des]V_^VV [ [ [Oriënt]N [eer]V_N* ]V [atie]N_V* ]N e ]N N
Here 'des' is not glued to the V at all, because the glue rule does not match and is not tried again after the * rules have been applied. And even then, the second * rule will already have converted the V into an N (oriëntatie).

I see no direct solution for this.

Frog exits with non-zero exit code and "Nothing done" but performs fine

When running like this on a directory with FoLiA input:

frog --skip=mcpa --xmldir "." --threads 40 --testdir input/ -x

Frog works fine but exits with non-zero exit code (bit of the reverse of the #31 situation we had earlier):

  frog 0.13.8 (c) CLTS, ILK 1998 - 2017
  CLST  - Centre for Language and Speech Technology,Radboud University
  ILK   - Induction of Linguistic Knowledge Research Group,Tilburg University
  based on [ucto 0.9.7, libfolia 1.9, timbl 6.4.10, ticcutils 0.16, mbt 3.2.17]
  frog-:config read from: /vol/customopt/lamachine.dev/share/frog/nld/frog.cfg
  frog-:configuration version = 0.12
  frog-mblem:Initiating lemmatizer...
  frog-ner-:loaded 2 additional Named Entities from/vol/customopt/lamachiine.dev/share/frog/nld/ner.known
  frog-tok-:Initiating tokeniser...
  frog-tok-:tokconfig-nld: version=0.2
  frog-ner-mbt-:  Reading the lexicon from: /vol/customopt/lamachine.dev/share/frog/nld/ner.data.lex.ambi.05 (73735 words).
  frog-ner-mbt-:  Read frequent words list from: /vol/customopt/lamachine.dev/share/frog/nld/ner.data.top1000 (1000 words).
  frog-ner-mbt-:  Reading case-base for known words from: /vol/customopt/lamachine.dev/share/frog/nld/ner.data.known.ddwdwfWawawaa... 
  frog-ner-mbt-:  case-base for known words read.
  frog-ner-mbt-:  Reading case-base for unknown words from: /vol/customopt/lamachine.dev/share/frog/nld/ner.data.unknown.chnppddwdwFawawaasss... 
  frog-pos-tagger-mbt-:  Reading the lexicon from: /vol/customopt/lamachine.dev/share/frog/nld/Frog.mbt.1.0.lex.ambi.05 (229170 words).
  frog-pos-tagger-mbt-:  Read frequent words list from: /vol/customopt/lamachine.dev/share/frog/nld/Frog.mbt.1.0.top500 (500 words).
  frog-pos-tagger-mbt-:  Reading case-base for known words from: /vol/customopt/lamachine.dev/share/frog/nld/Frog.mbt.1.0.known.dddwfWawa... 
  frog-ner-mbt-:  case-base for unknown word read
  frog-ner-mbt-:  Sentence delimiter set to 'EL'
  frog-ner-mbt-:  Beam size = 1
  frog-ner-mbt-:  Known Tree, Algorithm = IGTREE
  frog-ner-mbt-:  Unknown Tree, Algorithm = TRIBL
  frog-ner-mbt-:
  frog-pos-tagger-mbt-:  case-base for known words read.
  frog-pos-tagger-mbt-:  Reading case-base for unknown words from: /vol/customopt/lamachine.dev/share/frog/nld/Frog.mbt.1.0.unknown.chnppdddwFawasss... 
  frog-pos-tagger-mbt-:  case-base for unknown word read
  frog-pos-tagger-mbt-:  Sentence delimiter set to '<utt>'
  frog-pos-tagger-mbt-:  Beam size = 1
  frog-pos-tagger-mbt-:  Known Tree, Algorithm = IGTREE
  frog-pos-tagger-mbt-:  Unknown Tree, Algorithm = IB1
  frog-pos-tagger-mbt-:
  frog-:Initialization done.
  frog-:Frogging input/gent.folia.xml
  frog-tok-:ucto: --filter=NO is automaticly set. inputclass equals outputclass!
  frog-:tokenisation took:  0 seconds, 116 milliseconds and 851 microseconds
  frog-:CGN tagging took:   376 seconds, 223 milliseconds and 482 microseconds
  frog-:NER took:           610 seconds, 230 milliseconds and 163 microseconds
  frog-:Mblem took:         41 seconds, 830 milliseconds and 724 microseconds
  frog-:Frogging in total took: 1427 seconds, 68 milliseconds and 952 microseconds
  frog-:resulting FoLiA doc saved in ./gent.folia.xml
  Nothing done.

Double tabs in server response format when skipping parts of the processing

Hello,

When skipping parts of the processing, FrogAPI inserts double tabs. This makes it more difficult to parse the results. Would it be possible to return a single tab per column? That way you can just split a line on tabs and don't have to do complicated counting. I'm not sure whether any of the other Frog clients take this into account. It might be worth asking the developers of the clients for their opinion.

Regards,
Jeroen
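
For clients that have to cope with the current format, one option is to keep the column positions fixed and treat an empty field as "module skipped". The sketch below assumes the ten-column layout visible in the tabbed output elsewhere on this page; the column names are labels invented for this example, and the exact shape of a line with skipped modules is an assumption.

```python
# Labels are illustrative; the order follows the tabbed output shown in the server trace.
COLUMNS = [
    "index", "token", "lemma", "morph", "pos",
    "confidence", "ner", "chunk", "head", "deprel",
]


def parse_line(line):
    """Split one output line on single tabs; empty fields become None."""
    fields = line.rstrip("\n").split("\t")
    return {name: (value or None) for name, value in zip(COLUMNS, fields)}
```

Because str.split("\t") keeps empty strings between consecutive tabs, the remaining columns stay aligned with their names and no counting is needed.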

CGN or D-coi part-of-speech tag set

Hello,

On the main page it states that the following information is present in the Frog output:
PoS tag (CGN tagset; according to MBT)

After investigating output and the types in this tag set, I found that it does not contain the types SPEC(afkorting) and SPEC(symbool). These are present in the D-coi tag set (the only addition I believe).

This leads me to believe Frog tags according to the D-coi tag set instead of the CGN tag set. Am I right about this?

Kind regards,
Guido

Frogging other FoLiA tags than <s>

https://github.com/JessedeDoes This might be of interest to you too!?

When frogging FoLiA documents, we currently only examine sentences, which always seemed to make sense. But there are other FoLiA tags that bear text, like <head> and <note>.

Is it desirable to add an option to Frog to also work on those tags? For instance, the Meertens Institute seems to need to look deeper into the <head> nodes.

Attention: this would also mean a major change in FoLiA itself!
<head> and <note> cannot be annotated at the moment.
Especially for <head> I wonder if this is correct.
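
To judge how much text would be affected, one could first list the candidate elements in a document. The sketch below uses Python's standard XML parser rather than libfolia; the sample is a simplified fragment that has not been validated as FoLiA, and the function name is made up.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"

# Simplified, unvalidated fragment used only to demonstrate the lookup.
SAMPLE = (
    '<FoLiA xmlns="http://ilk.uvt.nl/folia"><text><div>'
    "<head><t>Titel</t></head>"
    "<p><s><t>Dit is een test.</t></s></p>"
    "<note><t>Voetnoot</t></note>"
    "</div></text></FoLiA>"
)


def text_bearing(xml_text, tags=("s", "head", "note")):
    """Return (tag, text) for each listed element, in document order.

    The text comes from a direct <t> child; it is None when there is none.
    """
    root = ET.fromstring(xml_text)
    wanted = {"{%s}%s" % (FOLIA_NS, tag): tag for tag in tags}
    found = []
    for element in root.iter():
        if element.tag in wanted:
            t = element.find("{%s}t" % FOLIA_NS)
            found.append((wanted[element.tag], t.text if t is not None else None))
    return found
```

Calling text_bearing(SAMPLE, tags=("s",)) reflects the current behaviour of looking at sentences only, while the default tags also pick up the head and the note.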

Frog crash "NON printable element" / no such text

terminate called after throwing an instance of 'folia::NoSuchText'
what(): no such text: NON printable element: lemma

Program received signal SIGABRT, Aborted.
0x00002aaaaaf24c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00002aaaaaf24c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00002aaaaaf28028 in __GI_abort () at abort.c:89
#2  0x00002aaab0b68535 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00002aaab0b666d6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00002aaab0b66703 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00002aaab0b66922 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00002aaab05dfed6 in folia::FoliaImpl::text (this=0x2aaabd03f9b0, cls="current", retaintok=<optimized out>, strict=<optimized out>) at folia_impl.cxx:1050
#7  0x00002aaab05dab1b in folia::FoliaImpl::check_text_consistency (this=0x2aaabd03f9b0) at folia_impl.cxx:807
#8  0x00002aaab05e3371 in folia::FoliaImpl::xml (this=0x2aaabd03f9b0, recursive=<optimized out>, kanon=false) at folia_impl.cxx:917
#9  0x00002aaab05e334c in folia::FoliaImpl::xml (this=0x1b96c020, recursive=<optimized out>, kanon=false) at folia_impl.cxx:909
#10 0x00002aaab05e334c in folia::FoliaImpl::xml (this=0x1b96b860, recursive=<optimized out>, kanon=false) at folia_impl.cxx:909
#11 0x00002aaab05e334c in folia::FoliaImpl::xml (this=0x29def50, recursive=<optimized out>, kanon=false) at folia_impl.cxx:909
#12 0x00002aaab05e334c in folia::FoliaImpl::xml (this=0x1b8b6e70, recursive=<optimized out>, kanon=false) at folia_impl.cxx:909
#13 0x00002aaab0612a54 in folia::Document::to_xmlDoc (this=this@entry=0x29deb50, nsLabel=..., kanon=false) at folia_document.cxx:1693
#14 0x00002aaab0612e81 in folia::Document::toXml (this=this@entry=0x29deb50, nsLabel=..., kanon=<optimized out>) at folia_document.cxx:1701
#15 0x00002aaab06131b2 in folia::Document::save (this=this@entry=0x29deb50, os=..., nsLabel="", kanon=kanon@entry=false) at folia_document.cxx:423
#16 0x00002aaab00110f9 in save (kanon=false, os=..., this=0x29deb50) at /vol/customopt/lamachine/include/libfolia/folia_document.h:93
#17 FrogAPI::Frogtostring (this=<optimized out>,
    s=" heel goed\n !stat PDF\n [pdf] Kobus: +4-2, hooiberg: +2, tribbel: +1, jlo: -2 totaal: +3\n ja scheelt ons heel veel werk\n f00f, niet met ofzo, dat zwakth.\n !stat XML\n [XML] Kobus: +7-1,"...) at FrogAPI.cxx:1324
#18 0x00002aaaafd77865 in __pyx_pf_4frog_4Frog_2process_raw (__pyx_v_text=<optimized out>, __pyx_v_self=0x2aaaaac53418) at frog_wrapper.cpp:3643
#19 __pyx_pw_4frog_4Frog_3process_raw (__pyx_v_self=0x2aaaaac53418, __pyx_v_text=<optimized out>) at frog_wrapper.cpp:3570
#20 0x00002aaaafd746d3 in __Pyx_PyObject_CallMethO (arg=<optimized out>, func=<optimized out>) at frog_wrapper.cpp:6728
#21 __Pyx_PyObject_CallOneArg (func=<optimized out>, arg=0x29dcee0) at frog_wrapper.cpp:6759
#22 0x00002aaaafd7ad24 in __pyx_pf_4frog_4Frog_6process (__pyx_v_text=0x29dcee0, __pyx_v_self=0x2aaaaac53418) at frog_wrapper.cpp:4331
#23 __pyx_pw_4frog_4Frog_7process (__pyx_v_self=0x2aaaaac53418, __pyx_v_text=<optimized out>) at frog_wrapper.cpp:4124
#24 0x00000000004b227a in PyEval_EvalFrameEx ()
#25 0x0000000000569f94 in PyEval_EvalCode ()
#26 0x00000000004cc07c in ?? ()
#27 0x000000000047baa1 in PyRun_FileExFlags ()
#28 0x000000000047be7e in PyRun_SimpleFileExFlags ()
#29 0x00000000005bf713 in Py_Main ()
#30 0x000000000047e351 in main ()
(gdb) quit
A debugging session is active.


frog-:tokenisation took:  0 seconds, 177 milliseconds and 173 microseconds
frog-:CGN tagging took:   4 seconds, 462 milliseconds and 760 microseconds
frog-:IOB chunking took:  1 seconds, 941 milliseconds and 66 microseconds
frog-:NER took:           7 seconds, 499 milliseconds and 314 microseconds
frog-:MBMA took:          0 seconds, 172 milliseconds and 882 microseconds
frog-:Mblem took:         0 seconds, 109 milliseconds and 435 microseconds
frog-:MWU resolving took: 0 seconds, 11 milliseconds and 394 microseconds
frog-:Parsing (prepare) took: 0 seconds, 8 milliseconds and 269 microseconds
frog-:Parsing (pairs)   took: 1 seconds, 283 milliseconds and 299 microseconds
frog-:Parsing (rels)    took: 0 seconds, 73 milliseconds and 834 microseconds
frog-:Parsing (dir)     took: 0 seconds, 93 milliseconds and 892 microseconds
frog-:Parsing (csi)     took: 6 seconds, 170 milliseconds and 575 microseconds
frog-:Parsing (total)   took: 7 seconds, 648 milliseconds and 655 microseconds
frog-:Frogging in total took: 15 seconds, 437 milliseconds and 760 microseconds
frog-:problem frogging: xaf
frog-:saving to file bla failed: no such text: NON printable element: lemma

Release frog 0.13.8?

Similar to LanguageMachines/ucto#33: the last release was in January and there has been considerable work since then (over 100 commits). Regular LaMachine users will not benefit from any of this until there is a release. I recommend releasing as soon as the version is deemed stable (along with frogdata, and after libfolia for FoLiA 1.5).

Repetition of inflections in long output

Sometimes inflection names are duplicated:
[[ga]verb[t]/present-tense/singular/2nd-person-verb/present-tense/singular/2nd-person-verb]verb

This should be:
[[ga]verb[t]/present-tense/singular/2nd-person-verb]verb

Input sentence: 'Het gaat om hunebedden.'

But there are worse problems. The alternative reading is:

[[ga]verb[t]/present-tense/singular/3rd-person-verb/present-tense/singular/2nd-person-verb/inversed]

A contradiction!? I think this reading should be rejected...
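
To make the two symptoms concrete, here is a minimal Python sketch that works only on the slash-separated inflection string shown above. It is not Frog or MBMA code: the function names are hypothetical, and treating any tag ending in `-person-verb` as a "person" feature is an assumption made for illustration.

```python
def dedupe_inflections(tags):
    """Drop repeated inflection names while keeping first-seen order."""
    seen = set()
    unique = []
    for tag in tags:
        if tag not in seen:
            seen.add(tag)
            unique.append(tag)
    return unique


def has_person_conflict(tags):
    """Return True when more than one distinct person feature is present.

    Assumption: person features are the tags ending in '-person-verb'.
    """
    persons = {tag for tag in tags if tag.endswith("-person-verb")}
    return len(persons) > 1


duplicated = "present-tense/singular/2nd-person-verb/present-tense/singular/2nd-person-verb"
contradictory = "present-tense/singular/3rd-person-verb/present-tense/singular/2nd-person-verb/inversed"

print("/".join(dedupe_inflections(duplicated.split("/"))))
print(has_person_conflict(contradictory.split("/")))
```

A check like the second function could be used to flag readings for rejection, but whether rejection should happen at this level is exactly the open question in this issue.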

Frog requires sentence IDs

Not a priority, but for future reference: Frog can't deal with input FoLiA documents in which the sentences do not have an ID:

Workaround: I created a foliaid tool in FoLiA-tools that generates IDs; after running it, Frog works fine.

terminate called after throwing an instance of 'folia::ValueError'
  what():  Unable to generate an id from ID= 

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x2aaab2729700 (LWP 21871)]
0x00002aaaac93fc37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00002aaaac93fc37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00002aaaac943028 in __GI_abort () at abort.c:89
#2  0x00002aaaac231535 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00002aaaac22f6d6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00002aaaac22e799 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00002aaaac22f34a in __gxx_personality_v0 () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00002aaaac4e4fd3 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7  0x00002aaaac4e54f7 in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#8  0x00002aaaaadad70e in ~FoliaElement (this=0x2aab14007210, __in_chrg=<optimized out>) at /vol/customopt/lamachine.dev/include/libfolia/folia_impl.h:72
Python Exception <class 'IndexError'> list index out of range: 
#9  folia::EntitiesLayer::EntitiesLayer (this=this@entry=0x2aab14007210, a=std::map with 1 elements, d=d@entry=0x7fffffffd480, __in_chrg=<optimized out>, __vtt_parm=<optimized out>)
    at /vol/customopt/lamachine.dev/include/libfolia/folia_impl.h:2338
#10 0x00002aaaaadabb12 in addEntity (sent=sent@entry=0x25526e0, tagset="http://ilk.uvt.nl/folia/sets/frog-ner-nl", words=std::vector of length 1, capacity 1 = {...}, 
    confs=std::vector of length 1, capacity 1 = {...}, NER="per") at ner_tagger_mod.cxx:347
#11 0x00002aaaaadac0f8 in NERTagger::addNERTags (this=this@entry=0x2aaac00008c0, words=std::vector of length 10, capacity 16 = {...}, 
    tags=std::vector of length 10, capacity 16 = {...}, confs=std::vector of length 10, capacity 16 = {...}) at ner_tagger_mod.cxx:398
#12 0x00002aaaaadaceb1 in NERTagger::Classify (this=0x2aaac00008c0, swords=std::vector of length 10, capacity 16 = {...}) at ner_tagger_mod.cxx:504
#13 0x00002aaaaad49eac in FrogAPI::TestSentence () at FrogAPI.cxx:429
#14 0x00002aaaad8ae34a in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#15 0x00002aaaac6f3184 in start_thread (arg=0x2aaab2729700) at pthread_create.c:312
#16 0x00002aaaaca06ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
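
The workaround of generating missing sentence IDs before running Frog can be sketched in Python as follows. This is not the foliaid tool and does not use libfolia; it is a standard-library illustration only. The `prefix.s.N` naming scheme and the function name are assumptions, and a real tool would more likely derive IDs from the parent element.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def add_sentence_ids(root, prefix="doc"):
    """Give every <s> element without an xml:id a fresh, unused one.

    Returns the number of IDs that were added.
    Assumption: IDs of the form '<prefix>.s.<n>' are acceptable.
    """
    existing = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}
    added = 0
    counter = 0
    for sentence in root.iter(f"{{{FOLIA_NS}}}s"):
        if sentence.get(XML_ID):
            continue  # already has an ID, leave it untouched
        counter += 1
        while f"{prefix}.s.{counter}" in existing:
            counter += 1  # skip IDs that are already taken
        new_id = f"{prefix}.s.{counter}"
        sentence.set(XML_ID, new_id)
        existing.add(new_id)
        added += 1
    return added


sample = (
    '<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc">'
    '<text xml:id="doc.text">'
    '<s xml:id="doc.s.1"><t>Dit is een test.</t></s>'
    '<s><t>Nog een zin.</t></s>'
    '</text></FoLiA>'
)
root = ET.fromstring(sample)
print(add_sentence_ids(root))
```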

Improve MBMA rules and rule handling

There is a problem with the current MBMA rules:
The rules are mainly based on CELEX, but some information about nesting is lost.
Example:
In CELEX (extract from a longer rule):
uitkijkpost ( ( (uit)[P], (kijk)[N] )[N], (post)[N] )
Saying:
'uit' and 'kijk' should be combined to the Noun 'uitkijk' before combining to the Noun 'uitkijkpost'

In the training file for MBMA this is simplified to the rule:
uitkijkpost P 0 0 N 0 0 0 N 0 0 0/e
and all nesting is lost, leading to the unwanted result:
[ [uit] [kijk] [post] ] N

A possible solution is to use the 'affix' rule syntax:
uitkijkpost N_*V 0 0 V 0 0 0 N 0 0 0/e
This will indeed lead to the desired nesting:
[ [ [uit] [kijk] ] [post] ] N
BUT: All information about the P is lost, and of course this Is Not An Affix.

In fact there are thousands of examples of this problem, so we need a generic solution.
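
The loss of nesting can be illustrated with a small Python sketch. It does not reflect MBMA's internal data structures: representing an analysis as nested tuples of morphemes is an assumption made here purely to show the difference between the CELEX-style grouping and the flattened result.

```python
def bracket(node):
    """Render a morpheme (str) or a group of nodes (tuple) in bracket notation."""
    if isinstance(node, str):
        return f"[{node}]"
    return "[ " + " ".join(bracket(child) for child in node) + " ]"


def flatten(node):
    """Discard all grouping, keeping only the morphemes in order."""
    if isinstance(node, str):
        return (node,)
    flat = ()
    for child in node:
        flat += flatten(child)
    return flat


# CELEX-style grouping: 'uit' and 'kijk' combine first, then 'post' is added.
nested = (("uit", "kijk"), "post")

print(bracket(nested) + " N")           # the desired analysis
print(bracket(flatten(nested)) + " N")  # what remains once nesting is lost
```

Once the structure has been flattened, the original grouping cannot be recovered from the morpheme sequence alone, which is why the rule syntax itself needs a way to express it.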

Proposal:
Add a new symbol to the MBMA rule syntax to express:
"I am a node that glues to my neighbors and constructs a new node"

For instance like this:
uitkijkpost N_^PV 0 0 V 0 0 0 N 0 0 0/e

where N_^PV just says this:
The P 'uit' is glued to the V 'kijk' to construct the Noun 'uitkijk', and the properties of the P (mainly that it is a P) are preserved.

A lot of work has to be done:

  • is ^ a good symbol? or maybe ~?
  • write and test code in MBMA to handle the new symbol
  • edit all rules in the MBMA training data where the nesting is lost, compared to CELEX
  • test it in frog, especially on the consequences for compound detection

Comments welcome.

Performance and accuracy on tweets

I'm trying to process Dutch tweets in real time for domain-specific named entity recognition. I built a NER system, but I am struggling to find a suitable dependency parser and POS tagger. The core requirement is performance. Do you think Frog is suitable for this, or should I look further, at something like Alpino?

Input from list of files

I often need to run Frog over thousands of files, so I want to run several Frog instances in parallel. As it is, I have to distribute the input files over as many directories as there are instances. I would like to just split a list of files with full pathnames and start each of the n instances of Frog with a 1/n part of the list.
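
The splitting step itself is simple; a minimal Python sketch is given below. It only produces the n partial lists and writes them to disk. How a Frog instance would consume such a list is the feature being requested here, so nothing in the sketch calls Frog. The function names and the `filelist.<i>.txt` naming are hypothetical.

```python
from pathlib import Path


def split_round_robin(paths, n):
    """Split paths into n lists whose sizes differ by at most one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [paths[i::n] for i in range(n)]


def write_parts(parts, out_dir, stem="filelist"):
    """Write each part to '<out_dir>/<stem>.<i>.txt', one path per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, part in enumerate(parts):
        target = out_dir / f"{stem}.{i}.txt"
        target.write_text("".join(p + "\n" for p in part), encoding="utf-8")
        written.append(target)
    return written


all_files = ["/data/a/1.txt", "/data/a/2.txt", "/data/b/3.txt",
             "/data/b/4.txt", "/data/c/5.txt"]
parts = split_round_robin(all_files, 2)
```

Round-robin assignment is used instead of contiguous chunks so that a run of large files in one directory is spread over the instances; this is a design choice, not a requirement.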
