Comments (10)
Thanks, I'll take a look. In the meantime, is there nothing that can be done in this project to at least fail more gracefully? I'm only looking to use it as a part-of-speech tagger and I don't need to extract the case markings, but it fails to run at all. Maybe it would be better to ignore unrecognized morphological features rather than crashing.
from spacy-udpipe.
I've enabled a quick fix in #11. After some discussion, I am fairly confident this should remain in a separate branch (as the underlying issue is in spaCy). For now, you can use
pip install git+https://github.com/TakeLab/spacy-udpipe.git@feature/soft-morph-fail
to install the quick-fix version.
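For readers wondering what a "soft" failure could look like, here is a minimal sketch. It is not the code from the feature/soft-morph-fail branch; the SUPPORTED_FEATURES set and the filter_features helper are hypothetical names used only for illustration. The idea is to keep the features the downstream library recognizes and warn about the rest instead of raising.

```python
import warnings

# Hypothetical set of accepted feature names (not copied from spaCy).
SUPPORTED_FEATURES = {"Number_plur", "Number_sing", "Tense_pres", "VerbForm_fin"}


def filter_features(features, supported=SUPPORTED_FEATURES):
    """Keep only supported features and warn about the ones that are dropped."""
    kept = {name: value for name, value in features.items() if name in supported}
    dropped = sorted(set(features) - set(kept))
    if dropped:
        warnings.warn(f"Ignoring unsupported morphological features: {dropped}")
    return kept


# "Case_nom" is a made-up unsupported feature for this example.
print(filter_features({"Number_plur": True, "Case_nom": True}))
```

With this approach the POS tags stay usable even when some morphological features cannot be mapped.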
Hello @asajatovic and hvala 🙏 for your quick response :-)
As far as I understand, the two Italian models as well as the Croatian one don't have the morphological features, right? The link you sent me explains how to add the tag map to an existing model, so I'd probably have to write the whole set of morphological features for Italian to get it to work. But I thought there was already a set of morphological features, since the KeyError contains something...
Thanks for reporting this. After some code digging, I am confident this happens because of the way the tag maps for Romanian and Polish are defined. For the code snippet you provided, a morphology feature "Case" is extracted from "Pw3--r", an XPOS (language-specific part-of-speech tag) of the word Ce. As "Case" is not in the supported FEATURES for the Morphology class (see this and this), an exception occurs. The same problem happens again for the word Ce with the features "Person" and "PronType". An equivalent thing occurs for the word faci with XPOS value "Vmip2s" mapping to "Person", which again is not in FEATURES (link). You can access the xpostag attribute if you process the text using the 'raw' UDPipe model (nlp.udpipe(text)).
Since this library is only a wrapper for the UDPipe models and as tag maps are specific to each language, to solve the issue(s), I suggest you update the tag maps for the problematic languages. A good start would be https://spacy.io/usage/adding-languages#tag-map and making sure the tag map features are compliant with the ones defined in spaCy. 😄
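Before editing a tag map, it may help to list which entries use feature names the target library does not accept. The following is only a sketch under stated assumptions: SUPPORTED and TAG_MAP are hypothetical stand-ins (not copied from spaCy or from the Romanian tag map), and find_unsupported is an illustrative helper, not part of spacy-udpipe.

```python
# Hypothetical data: neither the feature set nor the tag map comes from spaCy.
SUPPORTED = {"Number", "Tense"}
TAG_MAP = {
    "Pw3--r": {"Case": "Nom", "Person": "3", "PronType": "Int"},
    "Vmip2s": {"Person": "2", "Tense": "Pres"},
    "Ncms": {"Number": "Sing"},
}


def find_unsupported(tag_map, supported):
    """Map each tag to the sorted list of its feature names missing from `supported`."""
    report = {}
    for tag, feats in tag_map.items():
        missing = sorted(name for name in feats if name not in supported)
        if missing:
            report[tag] = missing
    return report


report = find_unsupported(TAG_MAP, SUPPORTED)
for tag, names in report.items():
    print(tag, names)
```

Entries that show up in the report are the ones that would need updating in the language's tag map.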
Hi!
I don't know whether this is related, but I cannot print out morphological features for Italian. I have tried both the standard isdt model and the vit model.
I have also tried tag_map:
>>> nlp = spacy_udpipe.load("it")
>>> for token in nlp("Il bello di questo mestiere è che ti fa crescere."): nlp.vocab.morphology.tag_map[token.tag_]
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'RD'
The function works with other languages, for instance English:
>>> nlp = spacy_udpipe.load("en")
>>> for token in nlp("Dogs are friendly."): nlp.vocab.morphology.tag_map[token.tag_]
...
{74: 92, 'Number_plur': True}
{74: 100, 'Tense_pres': True, 'VerbForm_fin': True}
{74: 84, 'Degree_pos': True}
{74: 97, 'PunctType_peri': True}
but fails for others too, for instance, Croatian:
>>> nlp = spacy_udpipe.load("hr")
>>> for token in nlp("Magdalena već godinama radi u Državnom Restauratorskom Zavodu."): nlp.vocab.morphology.tag_map[token.tag_]
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Npfsn'
I am using the latest version of spacy (2.2.3) and spacy-udpipe (0.1.0), both with the soft-morph-fail fix and without.
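As a side note, the KeyError itself can be avoided while exploring by looking tags up with a default. This is a minimal sketch assuming the tag map behaves like a plain dict, as the output above suggests; the tag_map contents and the features_for helper are hypothetical.

```python
# Hypothetical tag map standing in for nlp.vocab.morphology.tag_map.
tag_map = {"NNS": {"Number_plur": True}}


def features_for(tag, tag_map):
    """Return the entry for `tag`, or an empty dict if the tag is not in the map."""
    return tag_map.get(tag, {})


for tag in ["NNS", "RD", "Npfsn"]:
    print(tag, features_for(tag, tag_map))
```

This does not add the missing features; it only makes it easier to see which tags have no entry.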
@rahonalab Hi! The reason it does not work is because of the tag map for the Italian language (link).
Regarding the tag map for Croatian in spaCy, it doesn't yet exist.
Both are inherently related to spaCy and if you want to use morphological features, the tag map for a specific language should be updated in the spaCy repo. For more details see https://spacy.io/usage/adding-languages#tag-map.
All of this will be documented with some workarounds in a new spacy-udpipe release which is currently WIP. 😄
Edit: You can now install the latest package version (with the mentioned update ^) directly from the master branch!
@rahonalab You are welcome! :)
You are right, there already exist morphological features for Italian; however, spaCy recently changed the (language-agnostic) values in morphological FEATURES. The keys for TAG_MAP from tag_map.py should map exactly from and to morphological FEATURES. Regarding Italian, you should ideally only update the TAG_MAP, whereas for Croatian it can only be done from scratch (no existing TAG_MAP).
Also, the TAG_MAP for a specific language is and should be independent of any model for the same language.
Thank you, now I start to understand something :-)
The Italian tag_map which is currently employed in the UD model has numbers in place of POS:XPOS
nlp.vocab.morphology.tag_map
{'AP__Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs': {74: 90},
whereas the Italian tag map in spaCy 2.2.4 has:
(/usr/local/lib/python3.7/site-packages/spacy/lang/it)
TAG_MAP = {
"AP__Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs": {POS: DET},
I saw your workaround to stop importing the 'wrong' TAG_MAP:
nlp = spacy_udpipe.load("it", ignore_tag_map=True)
Why don't you include an option to automatically import the tag map from spaCy?
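For anyone inspecting these keys by hand, the one quoted above appears to follow the layout XPOS, a double underscore, then Feature=Value pairs separated by "|". That layout is inferred from the quoted key only, and split_tag_key below is a hypothetical helper, not part of spaCy or spacy-udpipe. A small sketch for splitting such a key:

```python
def split_tag_key(key):
    """Split 'XPOS__Feat=Val|Feat=Val' into (xpos, {feature: value})."""
    xpos, _, feats = key.partition("__")
    features = {}
    for pair in feats.split("|"):
        # Skip empty or placeholder segments that carry no Feature=Value pair.
        if "=" in pair:
            name, value = pair.split("=", 1)
            features[name] = value
    return xpos, features


xpos, features = split_tag_key("AP__Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs")
print(xpos)
print(features)
```

Comparing the parsed feature names against what the library supports makes it easier to spot which entries are outdated.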
If available, a language-specific TAG_MAP is automatically loaded for every spacy-udpipe and spaCy language model. Keep in mind that TAG_MAP is defined in spaCy, specifically for each language, and is loaded only from spaCy.
The workaround is simply there to enable proper POS tagging by ignoring morphological features if they are outdated (in other words, if the TAG_MAP values don't exactly match FEATURES values).
I hope this clears the confusion! :)
Edit: Regarding the numbers in place of XPOS:POS, that is fine as this also happens when you load a 'pure' spaCy model.
Closing this issue as it is fixed in #12.