
Comments (10)

luckytoilet commented on May 18, 2024

Thanks, I'll take a look. In the meantime, is there anything that can be done in this project, at least to fail more gracefully? I'm only looking to use it as a part-of-speech tagger and don't need to extract the case markings, but it fails to run at all. It might be better to ignore unrecognized morphological features rather than crash.
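
A minimal sketch of the "ignore rather than crash" idea, in plain Python. The names `SUPPORTED_FEATURES` and `filter_features` are hypothetical and exist only for illustration; they are not part of spaCy or spacy-udpipe, and the feature set is a toy stand-in.

```python
import warnings

# Toy stand-in for the feature names a tagger understands (assumption).
SUPPORTED_FEATURES = {"Number", "Tense", "VerbForm"}


def filter_features(features, supported=SUPPORTED_FEATURES):
    """Return only the supported features, warning about the rest."""
    kept = {}
    for name, value in features.items():
        if name in supported:
            kept[name] = value
        else:
            # Warn instead of raising, so tagging can continue.
            warnings.warn(f"Ignoring unsupported morphological feature: {name}")
    return kept


raw = {"Case": "Nom", "Number": "Sing"}
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demo output quiet
    cleaned = filter_features(raw)
print(cleaned)
```

The unsupported "Case" entry is dropped and the rest of the features pass through, which is roughly the behaviour asked for above.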

from spacy-udpipe.

asajatovic commented on May 18, 2024

I've enabled a quick fix in #11. After some discussion, I am fairly confident this should remain in a separate branch (as the underlying issue is in spaCy). For now, you can use
pip install git+https://github.com/TakeLab/spacy-udpipe.git@feature/soft-morph-fail
to install the quick-fix version.


rahonalab commented on May 18, 2024

Hello @asajatovic and hvala 🙏 for your quick response :-)
As far as I understand, the two Italian models as well as the Croatian one don't have the morphological features, right? The link you sent me explains how to add the tag map to an existing model, so I'd probably have to write the whole set of morphological features for Italian to get it to work. But I thought there was already a set of morphological features, since the KeyError contains something...


asajatovic commented on May 18, 2024

Thanks for reporting this. After some code digging, I am confident this happens because of the way the tag maps for Romanian and Polish are defined. For the code snippet you provided, the morphological feature "Case" is extracted from "Pw3--r", the XPOS (language-specific part-of-speech tag) of the word Ce. As "Case" is not among the supported FEATURES of the Morphology class, an exception occurs. The same problem occurs for the word Ce with the features "Person" and "PronType". Likewise, for the word faci, the XPOS value "Vmip2s" maps to "Person", which again is not in FEATURES. You can access the xpostag attribute if you process the text using the 'raw' UDPipe model (nlp.udpipe(text)).

Since this library is only a wrapper for the UDPipe models and as tag maps are specific to each language, to solve the issue(s), I suggest you update the tag maps for the problematic languages. A good start would be https://spacy.io/usage/adding-languages#tag-map and making sure the tag map features are compliant with the ones defined in spaCy. 😄
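
As a rough illustration of the compliance check described above, the following sketch audits a tag map against a set of supported feature names. It assumes a simplified format in which each tag maps to a UD-style feature string; real spaCy tag maps are structured differently. The helper names, the feature values, the "Ncfp" tag, and the supported set are all made up for the example.

```python
def parse_feats(feats):
    """Split a UD-style feature string such as 'Case=Nom|Number=Sing'."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def unsupported_features(tag_map, supported):
    """Map each tag to the sorted feature names missing from `supported`."""
    problems = {}
    for tag, feats in tag_map.items():
        missing = sorted(set(parse_feats(feats)) - set(supported))
        if missing:
            problems[tag] = missing
    return problems


# Hypothetical data: the tags echo the ones discussed above,
# but the feature values and the supported set are invented.
toy_tag_map = {
    "Pw3--r": "Case=Nom|Person=3|PronType=Int",
    "Vmip2s": "Number=Sing|Person=2",
    "Ncfp": "Number=Plur",
}
toy_supported = {"Number"}

report = unsupported_features(toy_tag_map, toy_supported)
for tag, missing in report.items():
    print(tag, "->", ", ".join(missing))
```

Running a check of this kind before loading a model would list every tag whose features need updating, rather than surfacing them one exception at a time.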


rahonalab commented on May 18, 2024

Hi!
I don't know whether this is related, but I cannot print out morphological features for Italian. I have tried both the standard isdt model and the vit model.

I have also tried tag_map:

>>> nlp = spacy_udpipe.load("it")
>>> for token in nlp("Il bello di questo mestiere è che ti fa crescere."): nlp.vocab.morphology.tag_map[token.tag_]
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'RD'

The function works with other languages, for instance English:

>>> nlp = spacy_udpipe.load("en")
>>> for token in nlp("Dogs are friendly."): nlp.vocab.morphology.tag_map[token.tag_]
... 
{74: 92, 'Number_plur': True}
{74: 100, 'Tense_pres': True, 'VerbForm_fin': True}
{74: 84, 'Degree_pos': True}
{74: 97, 'PunctType_peri': True}

but fails for others too, for instance, Croatian:

>>> nlp = spacy_udpipe.load("hr")
>>> for token in nlp("Magdalena već godinama radi u Državnom Restauratorskom Zavodu."): nlp.vocab.morphology.tag_map[token.tag_]
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Npfsn'

I am using the latest version of spacy (2.2.3) and spacy-udpipe (0.1.0), both with the soft-morph-fail fix and without.
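
For inspecting which tags are missing without the loop aborting on the first KeyError, a defensive lookup can help. This is only a sketch: `lookup_tag` is a hypothetical helper, and the dictionary below is a toy stand-in for `nlp.vocab.morphology.tag_map`, which is assumed to raise KeyError for unknown tags as in the tracebacks above.

```python
def lookup_tag(tag_map, tag):
    """Return the tag-map entry for `tag`, or None if the tag is absent."""
    try:
        return tag_map[tag]
    except KeyError:
        return None


# Toy stand-in for a real tag map (assumption); only "NNS" is defined.
toy_tag_map = {"NNS": {"pos": "NOUN", "Number_plur": True}}

for tag in ["NNS", "RD", "Npfsn"]:
    entry = lookup_tag(toy_tag_map, tag)
    print(tag, "->", entry if entry is not None else "<no tag-map entry>")
```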


asajatovic commented on May 18, 2024

@rahonalab Hi! It does not work because of the tag map for the Italian language.
Regarding the tag map for Croatian in spaCy, it doesn't yet exist.
Both are inherently related to spaCy and if you want to use morphological features, the tag map for a specific language should be updated in the spaCy repo. For more details see https://spacy.io/usage/adding-languages#tag-map.
All of this will be documented with some workarounds in a new spacy-udpipe release which is currently WIP. 😄
Edit: You can now install the latest package version (with the mentioned update ^) directly from the master branch!


asajatovic commented on May 18, 2024

@rahonalab You are welcome! :)
You are right, morphological features for Italian already exist. However, spaCy recently changed the (language-agnostic) values in the morphological FEATURES. The keys for TAG_MAP from tag_map.py should map exactly from and to the morphological FEATURES. Regarding Italian, you should ideally only need to update the TAG_MAP, whereas for Croatian it has to be written from scratch (there is no existing TAG_MAP).
Also, the TAG_MAP for a specific language is and should be independent of any model for the same language.


rahonalab commented on May 18, 2024

Thank you, now I start to understand something :-)
The Italian tag_map which is currently employed in the UD model has numbers in place of POS:XPOS

nlp.vocab.morphology.tag_map
{'AP__Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs': {74: 90},

whereas the Italian tag map in spaCy 2.2.4 has:

(/usr/local/lib/python3.7/site-packages/spacy/lang/it)

TAG_MAP = {
    "AP__Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs": {POS: DET},

I saw your workaround to stop importing the 'wrong' TAG_MAP:

nlp = spacy_udpipe.load("it", ignore_tag_map=True)

Why don't you include an option to automatically import the tag map from spaCy?


asajatovic commented on May 18, 2024

If available, a language-specific TAG_MAP is automatically loaded for every spacy-udpipe and spaCy language model. Keep in mind that the TAG_MAP is defined in spaCy, specifically for each language, and is loaded only from spaCy.

The workaround is simply there to enable proper POS tagging by ignoring morphological features if they are outdated (in other words, if the TAG_MAP values don't exactly match FEATURES values).

I hope this clears the confusion! :)

Edit: Regarding the numbers in place of XPOS:POS, that is fine as this also happens when you load a 'pure' spaCy model.
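
To make the integer form easier to read, the IDs can be translated back to names. The sketch below is illustrative only: the `{74: 90}` to `{POS: DET}` pairing comes from the outputs quoted in this thread, while the other IDs are inferred from the English example and are assumptions. In a real session, spaCy's own symbol table (the `spacy.symbols` module) is the authoritative source. `SYMBOL_NAMES` and `decode_entry` are hypothetical names.

```python
# 74 -> POS and 90 -> DET are taken from the outputs quoted in this thread;
# the remaining IDs are inferred from the English example (assumption).
SYMBOL_NAMES = {74: "POS", 84: "ADJ", 90: "DET", 92: "NOUN", 97: "PUNCT", 100: "VERB"}


def decode_entry(entry, names=SYMBOL_NAMES):
    """Replace integer keys and values with readable names where known."""
    decoded = {}
    for key, value in entry.items():
        new_key = names.get(key, key) if isinstance(key, int) else key
        # bool is a subclass of int, so leave True/False values untouched.
        if isinstance(value, int) and not isinstance(value, bool):
            value = names.get(value, value)
        decoded[new_key] = value
    return decoded


print(decode_entry({74: 90}))
print(decode_entry({74: 92, "Number_plur": True}))
```

Unknown IDs are left as they are, so the helper never hides information it cannot translate.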


asajatovic commented on May 18, 2024

Closing this issue as it is fixed in #12.

