

proycon commented on May 29, 2024

(Somewhat related to LanguageMachines/ucto#24, as ucto can't deal with this yet either.)

from frog.

kosloot commented on May 29, 2024

At the moment this is not possible in Frog.
None of the modules (ucto, mbma, mblem, etc.) is capable of handling splits.
Nor do we have any plans to accomplish this.
The fastest (but non-trivial) way to handle splits is to post-process the Frog output with PiCCL/TiCCL and then rerun Frog on the hopefully resolved tokens.


antalvdb commented on May 29, 2024

Another fix would be to have a wrapper that takes any content that contains spaces, and iterates over the space-delimited tokens, processing them individually with MBLEM and/or MBMA. These analyses could then be collated.

An even rougher fix would be to delete the spaces before giving them as input to MBLEM or MBMA ("vol daen" -> "voldaen"). This solution is brute, but makes sense, because the presence of a space within a tag indicates directly that the string is a single token in present-day Dutch.
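
A minimal sketch of the wrapper idea, under stated assumptions: `analyze` stands in for a call to MBLEM or MBMA (neither is invoked here), and `fake_lemmatizer` is a purely hypothetical placeholder. It only shows the split, analyse, and collate steps.

```python
from typing import Callable, List, Tuple


def analyze_parts(token: str, analyze: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Run a per-word analyzer on each space-delimited part of one token.

    Returns (part, analysis) pairs so the caller can collate them afterwards.
    """
    return [(part, analyze(part)) for part in token.split()]


# Hypothetical stand-in for a real lemmatizer; it only lowercases its input.
fake_lemmatizer = str.lower

collated = analyze_parts("Vol daen", fake_lemmatizer)
print(collated)
```

How the per-part analyses should be merged into one analysis for the whole token is left open here, as it is in the proposal itself.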


kosloot commented on May 29, 2024

The rough fix will lead to a lot of problems, I guess:
How do you determine which spaces are splits and which are not?
You would need a lexicon then, wouldn't you?


antalvdb commented on May 29, 2024

You would just treat "voldaen" as any word that you may or may not have seen before. You delete any whitespace.

The fact that there are one or more spaces within a tag signals that apparently in present-day Dutch everything between the opening and closing tag should be one word, without spaces.


kosloot commented on May 29, 2024

This assumes that 'tags' or tokens have already been detected. But what to do in running text?
'ik geloof niet dat dit werkt'
'ikgeloofnietdatditwerkt'


proycon commented on May 29, 2024

I'm not in favour of the brute solution of deleting spaces and sidestepping the problem that way (= information loss). I think ideally Frog should be able to handle spaces in tokens just like any other character, i.e. Frog should be completely agnostic about it and just accept whatever the tokenizer delivers (it would still be one <w> after all). Am I right in thinking the source of this issue is that space is used as a delimiter in the underlying timbl modules, rather than tab?


antalvdb commented on May 29, 2024

I'm not sure if that is the issue - the underlying delimiter may well be the comma, and the modules may work with spaces just the same. Alternatively the spaces could be rewritten to another character ("_") and the whole process may just work fine. Perhaps first perform a test?
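
The substitution idea could be tried out along these lines. This is only a sketch: the placeholder character and the helper names `encode_spaces` and `decode_spaces` are assumptions made for illustration, and no Frog or Timbl module is called.

```python
PLACEHOLDER = "_"  # assumption: this character does not occur in the input tokens


def encode_spaces(token: str) -> str:
    """Replace token-internal spaces so a space-delimited module sees one token."""
    if PLACEHOLDER in token:
        raise ValueError(f"token already contains {PLACEHOLDER!r}: {token!r}")
    return token.replace(" ", PLACEHOLDER)


def decode_spaces(token: str) -> str:
    """Restore the original spaces after processing."""
    return token.replace(PLACEHOLDER, " ")


tokens = ["Ik", "had", "een", "vol daen", "gevoel"]
line = " ".join(encode_spaces(t) for t in tokens)      # what the module would receive
restored = [decode_spaces(t) for t in line.split(" ")]  # what we map back afterwards
print(line)
print(restored)
```

The round trip only stays lossless as long as the placeholder never appears in real input, which is why the sketch refuses such tokens instead of silently corrupting them.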


proycon commented on May 29, 2024

I just ran into Frog indeed stumbling over spaces in tokens, as foreseen :) At first I thought it was another issue, but it turns out to be this one, so I'll add it here:

frog-pos-tagger-:mismatch between number of <w> tags and the tagger result.                 
frog-pos-tagger-:words according to <w> tags:                                               
frog-pos-tagger-:w[0]= ‘                      
frog-pos-tagger-:w[1]= AA                     
frog-pos-tagger-:w[2]= (                      
frog-pos-tagger-:w[3]= Floris                 
frog-pos-tagger-:w[4]= van                    
frog-pos-tagger-:w[5]= der                    
frog-pos-tagger-:w[6]= )                      
frog-pos-tagger-:w[7]= een                    
frog-pos-tagger-:w[8]= der                    
frog-pos-tagger-:w[9]= edelen                 
frog-pos-tagger-:w[10]= die                   
frog-pos-tagger-:w[11]= in                    
frog-pos-tagger-:w[12]= 1415                  
frog-pos-tagger-:w[13]= Jan                   
frog-pos-tagger-:w[14]= van                   
frog-pos-tagger-:w[15]= Arkel                 
frog-pos-tagger-:w[16]= gevankelijk           
frog-pos-tagger-:w[17]= naar                  
frog-pos-tagger-:w[18]= 's                    
frog-pos-tagger-:w[19]= Hage                  
frog-pos-tagger-:w[20]= voerden               
frog-pos-tagger-:w[21]= ,                     
frog-pos-tagger-:w[22]= waarvoor              
frog-pos-tagger-:w[23]= zij                   
frog-pos-tagger-:w[24]= eene                  
frog-pos-tagger-:w[25]= goede                 
frog-pos-tagger-:w[26]= som                   
frog-pos-tagger-:w[27]= gelds                 
frog-pos-tagger-:w[28]= trokken               
frog-pos-tagger-:w[29]= .                     
frog-pos-tagger-:w[30]= ’                     
frog-pos-tagger-:words according to POS tagger:                                             
frog-pos-tagger-:word[0]='                    
frog-pos-tagger-:word[1]=AA                   
frog-pos-tagger-:word[2]=(                    
frog-pos-tagger-:word[3]=Floris               
frog-pos-tagger-:word[4]=van                  
frog-pos-tagger-:word[5]=der                  
frog-pos-tagger-:word[6]=)                    
frog-pos-tagger-:word[7]=een                  
frog-pos-tagger-:word[8]=der                  
frog-pos-tagger-:word[9]=edelen               
frog-pos-tagger-:word[10]=die                 
frog-pos-tagger-:word[11]=in                  
frog-pos-tagger-:word[12]=1415                
frog-pos-tagger-:word[13]=Jan                 
frog-pos-tagger-:word[14]=steen        <== THIS ONE IS MISSING IN THE OTHER LIST
frog-pos-tagger-:word[15]=van                 
frog-pos-tagger-:word[16]=Arkel               
frog-pos-tagger-:word[17]=gevankelijk         
frog-pos-tagger-:word[18]=naar                
frog-pos-tagger-:word[19]=de                  
frog-pos-tagger-:word[20]=Haag                
frog-pos-tagger-:word[21]=voerden             
frog-pos-tagger-:word[22]=,                   
frog-pos-tagger-:word[23]=waarvoor            
frog-pos-tagger-:word[24]=zij                 
frog-pos-tagger-:word[25]=een                 
frog-pos-tagger-:word[26]=goede               
frog-pos-tagger-:word[27]=som                 
frog-pos-tagger-:word[28]=geld                
frog-pos-tagger-:word[29]=trokken             
frog-pos-tagger-:word[30]=.                   
frog-pos-tagger-:word[31]='                   
frog-:problem frogging: aa__001biog01_01.tok.translated.folia.xml                           
frog-:POS tagger is confused IOB tagger is confused NER failed: '' AA ( Floris van der ) een der edelen die in 1415 Jan steen van Arkel gevankelijk naar de Haag voerden , waarvoor zij een goede som geld trokken . '' ==> ''//O AA//B-org (//O Floris//B-per van//I-per der//I-per )//I-per een//I-per der//I-per edelen//O die//O in//O 1415//O Jan//B-per steen//O van//O Arkel//B-loc gevankelijk//O naar//O de//O Haag//B-loc voerden//O ,//O waarvoor//O zij//O een//O goede//O som//O geld//O trokken//O .//O '//O ' 

The problem here is this word in the input document:

<w xml:id="aa__001biog01_01.TEI.2.text.body.div.p.152.s.1.w.14" class="WORD">
   <t>Jan</t>                                                                                                                                                             
   <t class="contemporary">Jan steen</t>
   <lemma class="WNT:M028633.ADD.948" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/int_lemmaid_withcompounds.foliaset.ttl"/>
   <lemma class="jan steen" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/int_lemmatext_withcompounds.foliaset.ttl"/>
    <metric class="modernisationsource" value="inthistlexicon"/>
</w>

So the word "Jan" modernises to "Jan steen" (which is obviously odd and wrong, but not the actual issue here). Frog runs on the contemporary layer and breaks over the space (as we expected). I'll just have to disallow multiword tokens in the moderniser for now (or do an ugly patch with another delimiter like underscore), but this will at some point come back to haunt us if we build a specialised tagger/lemmatiser for Nederlab with proper multiword support and want to run Frog on its output.
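
To find such words before running Frog, something like the following sketch might help. It uses only Python's standard `xml.etree.ElementTree`, matches element names regardless of namespace, and treats a `<t>` without a `class` attribute as belonging to the text class "current" (my understanding of the FoLiA default). The function name `multiword_tokens` and the sample IDs are made up for illustration.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def multiword_tokens(root, textclass="current"):
    """Return (xml:id, text) for every <w> whose <t> in `textclass` contains whitespace."""
    hits = []
    for w in root.iter():
        if local(w.tag) != "w":
            continue
        for t in w:
            if local(t.tag) == "t" and t.get("class", "current") == textclass:
                text = (t.text or "").strip()
                if len(text.split()) > 1:
                    hits.append((w.get(XML_ID), text))
    return hits


# Shortened, hypothetical sample modelled on the word shown above.
sample = """<s>
  <w xml:id="example.w.14" class="WORD">
    <t>Jan</t>
    <t class="contemporary">Jan steen</t>
  </w>
  <w xml:id="example.w.15" class="WORD">
    <t>van</t>
    <t class="contemporary">van</t>
  </w>
</s>"""

found = multiword_tokens(ET.fromstring(sample), textclass="contemporary")
print(found)
```

Running the same check with the default text class returns nothing for this sample, which matches the observation that only the contemporary layer contains the space.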


kosloot commented on May 29, 2024

There are a lot of issues at hand here.
First: in this case the 'sanity' check in Frog isn't aware of 'multiword' words.
I assume this can be fixed rather easily. (The error message is also confusing, because it uses the wrong textclass.)
But that is just a small part of the multitude of problems at hand.
Returning to the original question:

  • Can MBLEM handle the word 'vol daen'?
    Not at present, but modification is easy, yielding the lemma 'voldaen' (assuming someone magically comes up with training data).
  • Can MBMA handle the word 'vol daen'?
    Not at present, but modification is 'easy' I think. I suggest just analyzing the 'de-spaced' token.
  • Can MBT handle 'vol daen'?
    NO, and it is very hard to do I think, unless we just ignore that it was 1 token.
    How would MBT recognize the single token 'vol daen' in a sentence like:
    "Ik had een vol daen gevoel."?
    MBT is sentence based, and assumes space-delimited words/tokens.
  • Same problem with all MBT-based modules in Frog (NER, Chunker).
    So to get this to work, MBT needs a complete rework, to get a variant that accepts sequences of tokens instead of sentences of words (and training data too).
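
The MBT problem can be illustrated with a small sketch. It does not call MBT; `tagger_view` merely simulates a module that receives a sentence as one space-joined line and splits it on whitespace again, which is the assumption described in the list above. `first_mismatch` is a hypothetical helper for locating where the two views diverge.

```python
def tagger_view(words):
    """Simulate a sentence-based tagger: join the words, then split on whitespace."""
    return " ".join(words).split()


def first_mismatch(a, b):
    """Index of the first position where two token lists differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


folia_words = ["Ik", "had", "een", "vol daen", "gevoel", "."]
tagged = tagger_view(folia_words)

print(len(folia_words), len(tagged))
print(first_mismatch(folia_words, tagged))
```

One FoLiA word with an embedded space becomes two tagger tokens, so the counts no longer agree, which is the kind of mismatch the sanity check reports.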

Then there is also the reverse problem of run-ons, where 'voldaen' is to be split into 2 words.
The tagger has no clue; the lemmatizer will have no problems after the modifications.
But how and when do we merge this knowledge?

QUICK HACK proposal for multiple words in 1 <w>:
For tagging, we could remove the spaces, to ensure that 1 FoLiA word leads to 1 tag.

Example:

<s id="s.1">
	<w id="w.1">
	  <t>Een</t>
	</w>
	<w id="w.2">
	  <t>multi word</t>
	</w>
	<w id="w.3">
	  <t>test</t>
	</w>
</s>

We tag this as if the second word was:

	<w id="w.2">
	  <t>multiword</t>
	</w>

To the adapted MBMA and MBLEM we can provide the text including the space.

Result:

      <s xml:id="s.1">
        <w xml:id="w.1">
          <t>Een</t>
          <pos class="LID(onbep,stan,agr)" confidence="0.981771" head="LID">
            <feat class="onbep" subset="lwtype"/>
            <feat class="stan" subset="naamval"/>
            <feat class="agr" subset="npagr"/>
          </pos>
          <morphology>
            <morpheme>
              <t>een</t>
            </morpheme>
          </morphology>
          <lemma class="een"/>
        </w>
        <w xml:id="w.2">
          <t>multi word</t>
          <pos class="N(soort,ev,basis,onz,stan)" confidence="0.733484" head="N">
            <feat class="soort" subset="ntype"/>
            <feat class="ev" subset="getal"/>
            <feat class="basis" subset="graad"/>
            <feat class="onz" subset="genus"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="multi word"/>
          <morphology>
            <morpheme>
              <t>multi word</t>
            </morpheme>
          </morphology>
        </w>
        <w xml:id="w.3">
          <t>test</t>
          <pos class="N(soort,ev,basis,zijd,stan)" confidence="0.789112" head="N">
            <feat class="soort" subset="ntype"/>
            <feat class="ev" subset="getal"/>
            <feat class="basis" subset="graad"/>
            <feat class="zijd" subset="genus"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="test"/>
          <morphology>
            <morpheme>
              <t>test</t>
            </morpheme>
          </morphology>
        </w>
      </s>
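
As a rough sketch of this quick hack (not Frog's actual code): the tagger gets a de-spaced copy of each word so that every <w> yields exactly one token, while the per-word modules keep the original text including the space. The names `despace` and `prepare_inputs` are hypothetical.

```python
from typing import Dict, List


def despace(text: str) -> str:
    """Remove all whitespace inside a single FoLiA word."""
    return "".join(text.split())


def prepare_inputs(words: List[str]) -> Dict[str, List[str]]:
    """Build the two views of a sentence used by the quick hack."""
    return {
        "tagger": [despace(w) for w in words],  # one space-free token per <w>
        "per_word": list(words),                # MBLEM/MBMA see the original text
    }


inputs = prepare_inputs(["Een", "multi word", "test"])
tagger_line = " ".join(inputs["tagger"])

print(tagger_line)
print(inputs["per_word"])
```

Because the tagger line splits back into as many tokens as there are words, each tag can be attached to its <w> by position.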


kosloot commented on May 29, 2024

So for now, this simple solution is implemented:
Frog now accepts FoLiA with embedded spaces. All spaces are removed for all taggers AND the parser, converting multi words into single words.

