Comments (11)
(Somewhat related to LanguageMachines/ucto#24, as ucto can't deal with this yet either.)
from frog.
At the moment this is not possible in Frog.
None of the modules (ucto, mbma, mblem, etc.) is capable of handling splits.
Nor do we have any plans to accomplish this.
The fastest (but non-trivial) way to handle splits is to post-process the Frog output with PiCCL/TiCCL and then rerun Frog on the hopefully resolved tokens.
from frog.
Another fix would be to have a wrapper that takes any content that contains spaces, and iterates over the space-delimited tokens, processing them individually with MBLEM and/or MBMA. These analyses could then be collated.
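The wrapper idea above could look roughly like the following. This is only a sketch: `analyse_multiword`, `collate` and `toy_lemmatiser` are hypothetical names, and the toy analyser merely stands in for a real call into MBLEM or MBMA, whose actual interfaces are not shown here.

```python
from typing import Callable, List


def analyse_multiword(token: str, analyse: Callable[[str], str]) -> List[str]:
    """Run a per-word analyser on every space-delimited part of one token."""
    return [analyse(part) for part in token.split()]


def collate(analyses: List[str], sep: str = " ") -> str:
    """Join the per-part analyses back into one analysis for the whole token."""
    return sep.join(analyses)


def toy_lemmatiser(word: str) -> str:
    """Hypothetical stand-in for MBLEM/MBMA: it only lowercases the word."""
    return word.lower()


parts = analyse_multiword("Vol Daen", toy_lemmatiser)
print(collate(parts))
```

How the per-part analyses should be collated (joined with a space, concatenated, or kept as a list) is left open here; the separator is just a parameter.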
An even rougher fix would be to delete the spaces before giving the tokens as input to MBLEM or MBMA ("vol daen" -> "voldaen"). This solution is crude, but it makes sense, because the presence of a space within a tag directly indicates that the string in it is a single token in present-day Dutch.
from frog.
The rough fix will lead to a lot of problems, I guess:
How do you determine which spaces are splits and which are not?
You would need a lexicon then, wouldn't you?
from frog.
You would just treat "voldaen" as any word that you may or may not have seen before. You delete any whitespace.
The fact that there are one or more spaces within a tag signals that, in present-day Dutch, everything between the opening and closing tags should be one word, without spaces.
from frog.
This assumes that 'tags' or tokens have already been detected. But what do we do in running text?
'ik geloof niet dat dit werkt'
'ikgeloofnietdatditwerkt'
from frog.
I'm not in favour of the brute solution of deleting spaces and sidetracking the problem that way (= information loss). I think ideally Frog should be able to handle spaces in tokens just as any other character, i.e. Frog should be completely agnostic about it and just accept whatever the tokenizer delivers (it would still be one <w> after all). Am I right in thinking the source of this issue is that space is used as a delimiter in the underlying timbl modules, rather than tab?
from frog.
I'm not sure if that is the issue - the underlying delimiter may well be the comma, and the modules may work with spaces just the same. Alternatively the spaces could be written to another character ("_") and the whole process may just work fine. Perhaps first perform a test?
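A quick way to try the placeholder idea is a reversible mapping applied before and after the modules run. The sketch below assumes "_" as the placeholder, and the function names are made up for illustration; nothing here reflects how Frog or timbl actually treat delimiters.

```python
PLACEHOLDER = "_"  # assumed placeholder character; any unused character would do


def protect_spaces(token: str) -> str:
    """Replace spaces inside a token so space-delimited modules see one word."""
    if PLACEHOLDER in token:
        # A token that already contains the placeholder could not be restored
        # unambiguously, so refuse it rather than corrupt it silently.
        raise ValueError(f"token already contains {PLACEHOLDER!r}: {token!r}")
    return token.replace(" ", PLACEHOLDER)


def restore_spaces(token: str) -> str:
    """Undo protect_spaces on the way out."""
    return token.replace(PLACEHOLDER, " ")


protected = protect_spaces("vol daen")
print(protected)
print(restore_spaces(protected))
```

Unlike deleting the spaces, this keeps the information about where the split was, at the cost of feeding the modules a form ("vol_daen") they have presumably never seen in training.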
from frog.
I just ran into Frog indeed stumbling over spaces in tokens as foreseen :) First I thought it was another issue but it proves this issue, so I'll add this here:
frog-pos-tagger-:mismatch between number of <w> tags and the tagger result.
frog-pos-tagger-:words according to <w> tags:
frog-pos-tagger-:w[0]= ‘
frog-pos-tagger-:w[1]= AA
frog-pos-tagger-:w[2]= (
frog-pos-tagger-:w[3]= Floris
frog-pos-tagger-:w[4]= van
frog-pos-tagger-:w[5]= der
frog-pos-tagger-:w[6]= )
frog-pos-tagger-:w[7]= een
frog-pos-tagger-:w[8]= der
frog-pos-tagger-:w[9]= edelen
frog-pos-tagger-:w[10]= die
frog-pos-tagger-:w[11]= in
frog-pos-tagger-:w[12]= 1415
frog-pos-tagger-:w[13]= Jan
frog-pos-tagger-:w[14]= van
frog-pos-tagger-:w[15]= Arkel
frog-pos-tagger-:w[16]= gevankelijk
frog-pos-tagger-:w[17]= naar
frog-pos-tagger-:w[18]= 's
frog-pos-tagger-:w[19]= Hage
frog-pos-tagger-:w[20]= voerden
frog-pos-tagger-:w[21]= ,
frog-pos-tagger-:w[22]= waarvoor
frog-pos-tagger-:w[23]= zij
frog-pos-tagger-:w[24]= eene
frog-pos-tagger-:w[25]= goede
frog-pos-tagger-:w[26]= som
frog-pos-tagger-:w[27]= gelds
frog-pos-tagger-:w[28]= trokken
frog-pos-tagger-:w[29]= .
frog-pos-tagger-:w[30]= ’
frog-pos-tagger-:words according to POS tagger:
frog-pos-tagger-:word[0]='
frog-pos-tagger-:word[1]=AA
frog-pos-tagger-:word[2]=(
frog-pos-tagger-:word[3]=Floris
frog-pos-tagger-:word[4]=van
frog-pos-tagger-:word[5]=der
frog-pos-tagger-:word[6]=)
frog-pos-tagger-:word[7]=een
frog-pos-tagger-:word[8]=der
frog-pos-tagger-:word[9]=edelen
frog-pos-tagger-:word[10]=die
frog-pos-tagger-:word[11]=in
frog-pos-tagger-:word[12]=1415
frog-pos-tagger-:word[13]=Jan
frog-pos-tagger-:word[14]=steen <== THIS ONE IS MISSING FROM THE OTHER LIST
frog-pos-tagger-:word[15]=van
frog-pos-tagger-:word[16]=Arkel
frog-pos-tagger-:word[17]=gevankelijk
frog-pos-tagger-:word[18]=naar
frog-pos-tagger-:word[19]=de
frog-pos-tagger-:word[20]=Haag
frog-pos-tagger-:word[21]=voerden
frog-pos-tagger-:word[22]=,
frog-pos-tagger-:word[23]=waarvoor
frog-pos-tagger-:word[24]=zij
frog-pos-tagger-:word[25]=een
frog-pos-tagger-:word[26]=goede
frog-pos-tagger-:word[27]=som
frog-pos-tagger-:word[28]=geld
frog-pos-tagger-:word[29]=trokken
frog-pos-tagger-:word[30]=.
frog-pos-tagger-:word[31]='
frog-:problem frogging: aa__001biog01_01.tok.translated.folia.xml
frog-:POS tagger is confused IOB tagger is confused NER failed: '' AA ( Floris van der ) een der edelen die in 1415 Jan steen van Arkel gevankelijk naar de Haag voerden , waarvoor zij een goede som geld trokken . '' ==> ''//O AA//B-org (//O Floris//B-per van//I-per der//I-per )//I-per een//I-per der//I-per edelen//O die//O in//O 1415//O Jan//B-per steen//O van//O Arkel//B-loc gevankelijk//O naar//O de//O Haag//B-loc voerden//O ,//O waarvoor//O zij//O een//O goede//O som//O geld//O trokken//O .//O '//O '
The problem here is this word in the input document:
<w xml:id="aa__001biog01_01.TEI.2.text.body.div.p.152.s.1.w.14" class="WORD">
<t>Jan</t>
<t class="contemporary">Jan steen</t>
<lemma class="WNT:M028633.ADD.948" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/int_lemmaid_withcompounds.foliaset.ttl"/>
<lemma class="jan steen" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/int_lemmatext_withcompounds.foliaset.ttl"/>
<metric class="modernisationsource" value="inthistlexicon"/>
</w>
So the word "Jan" modernises to "Jan steen" (which is obviously odd and wrong, but not the actual issue here). Frog runs on the contemporary layer and breaks on the space (as we expected). I'll just have to disallow multiword tokens in the moderniser for now (or do an ugly patch with another delimiter like underscore), but this will at some point come back to haunt us if we build a specialised tagger/lemmatiser for Nederlab with proper multiword support and want to run Frog on its output.
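The count mismatch can be reproduced in isolation. The sketch below assumes that the word texts are joined with spaces into one sentence string and that the consumer splits on spaces again; that is an assumption about the mechanism, not a description of Frog's code.

```python
# Texts of five <w> elements; the third one contains an embedded space.
words = ["in", "1415", "Jan steen", "van", "Arkel"]

# Assumed hand-over: one space-joined sentence string ...
sentence = " ".join(words)

# ... which a space-delimited tagger splits back into tokens.
tagger_tokens = sentence.split(" ")

print(len(words), len(tagger_tokens))
print(tagger_tokens)
```

The single word "Jan steen" comes back as two tokens, so every index after it shifts by one, which matches the extra word[14]=steen entry in the log above.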
from frog.
There are a lot of issues at hand here.
First: in this case the 'sanity' check in Frog isn't aware of 'multiword' words.
I assume this can be fixed rather easily. (The error message is also confusing, because it used the wrong textclass.)
But that is just a small part of the multitude of problems at hand.
Reverting to the original question:
- Can MBLEM handle the word 'vol daen'?
  Not at present, but the modification is easy, yielding the lemma 'voldaen' (assuming someone magically comes up with training data).
- Can MBMA handle the word 'vol daen'?
  Not at present, but the modification is 'easy', I think. I suggest just analyzing the 'de-spaced' token.
- Can MBT handle 'vol daen'?
  NO, and it is very hard to do, I think, unless we just ignore that it was 1 token.
  How would MBT recognize the single token 'vol daen' in a sentence like:
  "Ik had een vol daen gevoel."?
  MBT is sentence-based and assumes space-delimited words/tokens.
- The same problem applies to all MBT-based modules in Frog (NER, Chunker).
So to get this to work, MBT needs a complete rework, to get a variant that accepts sequences of tokens instead of sentences of words (and training data too).
Then there is also the reverse problem of run-ons where 'voldaen' is to be split in 2 words.
The tagger has no clue, the lemmatizer will have no problems, after the modifications.
But how and when do we merge this knowledge?
QUICK HACK proposal for multiple words in 1 <w>:
For tagging, we could remove the spaces, to ensure that 1 FoLiA word leads to 1 tag.
Example:
<s id="s.1">
<w id="w.1">
<t>Een</t>
</w>
<w id="w.2">
<t>multi word</t>
</w>
<w id="w.3">
<t>test</t>
</w>
</s>
We tag this as if the second word was:
<w id="w.2">
<t>multiword</t>
</w>
To the adapted MBMA and MBLEM, we can provide the text including the space.
Result:
<s xml:id="s.1">
<w xml:id="w.1">
<t>Een</t>
<pos class="LID(onbep,stan,agr)" confidence="0.981771" head="LID">
<feat class="onbep" subset="lwtype"/>
<feat class="stan" subset="naamval"/>
<feat class="agr" subset="npagr"/>
</pos>
<morphology>
<morpheme>
<t>een</t>
</morpheme>
</morphology>
<lemma class="een"/>
</w>
<w xml:id="w.2">
<t>multi word</t>
<pos class="N(soort,ev,basis,onz,stan)" confidence="0.733484" head="N">
<feat class="soort" subset="ntype"/>
<feat class="ev" subset="getal"/>
<feat class="basis" subset="graad"/>
<feat class="onz" subset="genus"/>
<feat class="stan" subset="naamval"/>
</pos>
<lemma class="multi word"/>
<morphology>
<morpheme>
<t>multi word</t>
</morpheme>
</morphology>
</w>
<w xml:id="w.3">
<t>test</t>
<pos class="N(soort,ev,basis,zijd,stan)" confidence="0.789112" head="N">
<feat class="soort" subset="ntype"/>
<feat class="ev" subset="getal"/>
<feat class="basis" subset="graad"/>
<feat class="zijd" subset="genus"/>
<feat class="stan" subset="naamval"/>
</pos>
<lemma class="test"/>
<morphology>
<morpheme>
<t>test</t>
</morpheme>
</morphology>
</w>
</s>
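The quick hack can be summarised as: strip whitespace for the tagger only, check that one word still yields one tag, and keep the original text for everything else. A minimal sketch, in which `despace`, `tag_words` and `toy_tagger` are hypothetical names and the toy tagger is a placeholder for MBT:

```python
from typing import Callable, List, Tuple


def despace(text: str) -> str:
    """Remove all whitespace, so 'multi word' becomes 'multiword'."""
    return "".join(text.split())


def tag_words(
    word_texts: List[str], tagger: Callable[[List[str]], List[str]]
) -> List[Tuple[str, str]]:
    """Tag de-spaced forms, then pair each tag with the original word text."""
    tags = tagger([despace(text) for text in word_texts])
    if len(tags) != len(word_texts):
        # Same kind of sanity check as the one that failed in the log above.
        raise ValueError("mismatch between number of words and tagger result")
    return list(zip(word_texts, tags))


def toy_tagger(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in for MBT: one dummy tag per space-delimited token."""
    return ["TAG" for _ in " ".join(tokens).split(" ")]


print(tag_words(["Een", "multi word", "test"], toy_tagger))
```

Because the original text is what gets paired with the tag, the word still reads "multi word" in the output, as in the FoLiA result above.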
from frog.
So for now, this simple solution is implemented:
Frog now accepts FoLiA with embedded spaces. All spaces are removed for all taggers AND the parser, converting multi-word tokens into single words.
from frog.