Merged synsets are lost in translation (open issue in wn)

ekaf commented on July 30, 2024
Merged synsets are lost in translation

from wn.

Comments (8)

goodmami commented on July 30, 2024

@ekaf I think the CILI as a resource is better thought of as the inventory of identifiers and their definitions than as a collection of mappings. The mappings to synsets should be maintained by the respective wordnet projects, although in practice we keep some mapping files in the CILI repository. Those mapping files are used when creating the WN-LMF exports of the PWN.

Let me try to describe ILI support in Wn. WN-LMF lexicons can link synsets to individual ILIs like this (example from OEWN 2021):

    <Synset id="oewn-15307914-n" ili="i117563" members="oewn-speed-n oewn-velocity-n" partOfSpeech="n" dc:subject="noun.time">
                                 ~~~~~~~~~~~~~

These ILIs are stored in Wn's database linked to the synsets. When a second lexicon is loaded containing synsets with the same ILIs, such as this (from OMW 1.4's Spanish wordnet):

    <Synset id="omw-es-15282696-n" ili="i117563" partOfSpeech="n" members="omw-es-velocidad-15282696-n" />
                                   ~~~~~~~~~~~~~

... then Wn is able to use the shared ILI to link the synsets across lexicons for translation or expanded relation traversal. Another thing we see is synsets with the special ILI value "in", which indicates that this version of the lexicon is proposing the synset as a candidate for a new ILI. For example:

    <Synset id="oewn-90002921-n" ili="in" members="oewn-snow_day-n" partOfSpeech="n" dc:subject="noun.time" dc:source="Colloquial WordNet">
                                 ~~~~~~~~
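
As a rough illustration of the linking described above (not Wn's actual implementation), the following sketch parses simplified WN-LMF fragments and groups synsets by their `ili` attribute, skipping the special value "in". The helper name `index_by_ili` and the trimmed XML (no `dc:` attributes, no `members`) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Simplified WN-LMF fragments; dc: attributes are omitted so that no
# XML namespace declaration is needed.
LEXICONS = {
    "oewn": """<Lexicon id="oewn">
      <Synset id="oewn-15307914-n" ili="i117563" partOfSpeech="n" />
      <Synset id="oewn-90002921-n" ili="in" partOfSpeech="n" />
    </Lexicon>""",
    "omw-es": """<Lexicon id="omw-es">
      <Synset id="omw-es-15282696-n" ili="i117563" partOfSpeech="n" />
    </Lexicon>""",
}


def index_by_ili(lexicons):
    """Map each explicit ILI to {lexicon id: synset id} (hypothetical helper)."""
    index = defaultdict(dict)
    for lex_id, xml_text in lexicons.items():
        for synset in ET.fromstring(xml_text).iter("Synset"):
            ili = synset.get("ili")
            # "in" only proposes a new ILI, so it cannot link lexicons.
            if ili and ili != "in":
                index[ili][lex_id] = synset.get("id")
    return dict(index)


ili_index = index_by_ili(LEXICONS)
print(ili_index["i117563"])
```

Synsets that share an entry in `ili_index` are the ones that could be treated as translations of each other.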

These proposed ILIs are not used for translation or expanded relation traversals. In Wn, the ILIs are represented by a class with an id, a status, and a definition. For example (here, the cili project has not been loaded in Wn):

>>> import wn
>>> oewn = wn.Wordnet('oewn')
>>> oewn.synsets('velocity')[0].ili.id  # an explicit ID
'i117563'
>>> oewn.synsets('velocity')[0].ili.status
'presupposed'
>>> oewn.synsets('velocity')[0].ili.definition()
>>> oewn.synsets('snow day')[0].ili.id  # ili="in" is special and the ID is None in Wn
>>> oewn.synsets('snow day')[0].ili.status
'proposed'
>>> oewn.synsets('snow day')[0].ili.definition()
'a day on which school or other events are cancelled due to snow'

Note:

  • The status "presupposed" means that the synset has an explicit ILI but there is no authoritative source to say whether the ILI is valid or not. The status "proposed" means that the lexicon used the special ILI value "in".
  • Explicit ILIs do not have ILI definitions in the lexicon, but proposed ILIs do. Note that ILI definitions are separate from synset definitions.

When the cili resource has been loaded, the presupposed statuses can change and their definitions become available:

>>> wn.download('cili')
...
>>> oewn.synsets('velocity')[0].ili.status
'active'
>>> oewn.synsets('velocity')[0].ili.definition()
'distance travelled per unit time'

The cili resource that is added here contains only a list of ILIs and their definitions (and maybe statuses in a future version: globalwordnet/cili#8), and does not contain any mappings to PWN 3.0 or 3.1 synsets.
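
The status rules described above can be summarized in a small sketch. This is only a model of the behavior shown in the session, not Wn's code: the function `resolve_ili`, its parameters, and the plain dict standing in for the loaded CILI inventory are all assumptions, and treating every ILI found in the inventory as "active" is a simplification.

```python
def resolve_ili(ili_attr, lexicon_definition=None, inventory=None):
    """Model the id, status and definition of a synset's ILI.

    ili_attr: value of the ili attribute on a <Synset> element.
    lexicon_definition: ILI definition supplied by the lexicon (proposed ILIs).
    inventory: optional {ili id: definition} dict standing in for a loaded CILI.
    """
    inventory = inventory or {}
    if ili_attr == "in":
        # A proposed ILI has no ID yet, but the lexicon provides a definition.
        return {"id": None, "status": "proposed", "definition": lexicon_definition}
    if ili_attr in inventory:
        # Assumption: anything listed in the inventory counts as active.
        return {"id": ili_attr, "status": "active", "definition": inventory[ili_attr]}
    # Explicit ID, but nothing authoritative to check it against.
    return {"id": ili_attr, "status": "presupposed", "definition": None}


cili = {"i117563": "distance travelled per unit time"}
print(resolve_ili("i117563"))
print(resolve_ili("i117563", inventory=cili))
print(resolve_ili("in", lexicon_definition="a day on which school or other events are cancelled due to snow"))
```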

Does that help?

goodmami commented on July 30, 2024

Thanks, I think I see the problem, but let me make sure I got it right: there is a gap in ILI-based translation coverage when the target synset (and thus its ILI) has been merged into another. In this case, PWN 3.0 (and OMW lexicons expanded from it) have two synsets, but in PWN 3.1 and OEWN they are merged into a single synset.

"Due to the way Wn applies the ILI mappings"

There seems to be a mistaken assumption here. Wn does not use the ILI mappings that you are referring to. The only resource from https://github.com/globalwordnet/cili/ that it uses (and only if you've downloaded it) is the released CILI inventory which includes the ILI identifiers and definitions. Inter-lexicon relationships via shared ILIs are identified only by the ili attribute on <Synset> elements in WN-LMF lexicons. This attribute's value is limited to a single ILI, so there is a technical limitation that we cannot map multiple ILIs to a synset. This also follows the theoretical constraint that ILIs should be mapped to no more than one synset, and vice versa, within a lexicon.
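
To make the coverage gap concrete, here is a toy model of translation that relies only on the `ili` attribute. It is a sketch, not Wn's implementation: each lexicon is reduced to a dict from ILI to synset ID, the function name `translate_by_ili` is hypothetical, and the Finnish synset ID `omw-fi-00471613-n` is inferred from the sense IDs quoted elsewhere in this thread.

```python
# ILI -> synset ID, as each lexicon declares it on its <Synset> elements.
# The Finnish wordnet follows the PWN 3.0 structure (two synsets),
# while OEWN has only the merged synset.
OMW_FI = {"i37881": "omw-fi-00471613-n", "i37882": "omw-fi-00474568-n"}
OEWN = {"i37882": "oewn-00472688-n"}


def translate_by_ili(ili, target):
    """Return the target synsets that carry exactly the same ILI."""
    synset = target.get(ili)
    return [synset] if synset else []


print(translate_by_ili("i37882", OEWN))  # the ILI that survived the merge
print(translate_by_ili("i37881", OEWN))  # the ILI that was merged away
```

The second call returns an empty list because no synset in the target lexicon declares `i37881`, even though its content now lives under `i37882`.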

Therefore, I disagree that something is incorrect in Wn, but I do recognize how things could be improved. A satisfactory solution to this issue is thus not so much a bug fix as a new feature: to store (or identify) and subsequently use changes to synset-ILI mappings across versions. This sounds appealing, but I also feel it will be hard to do correctly in a transparent fashion (e.g., when calling Synset.translate()) rather than as a discrete mapping step across lexicons. For instance, what if you translate in the other direction, where the single ILI is "split" into two? Or what if the translation is between two other lexicons with different changes in mappings?

"At this moment, using the PWN sense keys for translation seems to be the only way to bypass the problem in Wn."

You mean to look for senses with the same sense keys across lexicons? That might work to build the merge-mapping yourself, but it wouldn't be a solution in general because senses link synsets to words and therefore non-English lexicons should have different sense keys (but more likely they do not have them at all).

Here's how you could build the mapping:

>>> import wn
>>> en30 = wn.Wordnet('omw-en')
>>> en31 = wn.Wordnet('omw-en31')
>>> en31_sensekey_ili_map = {
...     s.metadata()['identifier']: s.synset().ili
...     for s in en31.senses()
... }
>>> en30_31_ilis = {ss.ili.id: set() for ss in en30.synsets()}
>>> for s in en30.senses():
...     ili = en31_sensekey_ili_map.get(s.metadata()['identifier'])
...     if ili:
...         en30_31_ilis[s.synset().ili.id].add(ili.id)
... 
>>> en30_31_ilis['i37881']
{'i37882'}
>>> en30_31_ilis['i37882']
{'i37882'}

This mapping is unidirectional, PWN 3.0 to PWN 3.1, but maybe it is useful nonetheless.
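
The same join can be tried without any wordnet data installed. The sketch below reduces each lexicon to a list of (sense key, ILI) pairs; the sense keys are made-up placeholders, and the function name `map_ilis_by_sense_key` is hypothetical. It also shows one way to list the source ILIs that no longer map to themselves.

```python
# (sense key, ILI) pairs; the sense keys are placeholders, not real PWN keys.
EN30_SENSES = [("lemma_a%1:04:00::", "i37881"), ("lemma_b%1:04:00::", "i37882")]
EN31_SENSES = [("lemma_a%1:04:00::", "i37882"), ("lemma_b%1:04:00::", "i37882")]


def map_ilis_by_sense_key(source_senses, target_senses):
    """Map each source ILI to the set of target ILIs sharing a sense key."""
    key_to_target_ili = dict(target_senses)
    mapping = {ili: set() for _, ili in source_senses}
    for key, source_ili in source_senses:
        target_ili = key_to_target_ili.get(key)
        if target_ili is not None:
            mapping[source_ili].add(target_ili)
    return mapping


en30_31 = map_ilis_by_sense_key(EN30_SENSES, EN31_SENSES)
# Source ILIs whose senses all moved to a different ILI in the target.
moved = {src for src, targets in en30_31.items() if targets and src not in targets}
print(en30_31)
print(moved)
```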

ekaf commented on July 30, 2024

Thanks @goodmami, I have corrected the formulation, since I don't want to imply that something is wrong with Wn. On the other hand, there is a problem in Wn, due to the way that the CILI mappings are applied, but I realize that this happens in OMW-data, when building the LMF databases.
I want to look into this further, but I am missing a way to look up the CILI mappings from within Wn. The CILI project is installed, but I have not yet found out how to load and query it.

ekaf commented on July 30, 2024

Thanks @goodmami, yes your explanations help a lot indeed.
Concerning my specific problem, i.e. obtaining translations for synsets that would have one according to the CILI, but had none when querying Wn, the code you provided for mapping from en-30 ilis to en-31 ilis indeed solves the problem for en-31. With oewn, a detour is necessary, since it has sensekeys encoded as sense.id, but it works equally well:


import wn

#---------------------------------------------------------------------
# adapted from english-wordnet/scripts/wordnet_yaml.py, by @jmccrae:

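def unmap_sense_key(sk):
    # Turn an OEWN sense ID back into a PWN-style sense key.
    parts = sk.split("__")
    lemma = parts[0][5:]  # drop the 5-character "oewn-" prefix
    rest = "__".join(parts[1:])
    # Undo the escapes used in the ID, then rejoin lemma and key fields with "%".
    return (lemma.replace("-ap-", "'").replace("-sl-", "/").replace("-ex-", "!").replace("-cm-", ",").replace("-cl-", ":") +
        "%" + rest.replace(".", ":").replace("-sp-", "_"))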

#---------------------------------------------------------------------

def sense2key(sense, wnid="omw-en"):
    if wnid == 'oewn':
        return unmap_sense_key(sense.id)
    else:
        return sense.metadata()['identifier']

def map30(target):
    wnet = wn.Wordnet(target)
    wnid = wnet.lexicons()[0].id
    sk_ili = {sense2key(se, wnid): se.synset().ili for se in wnet.senses()}
    ilimap30 = {}
    for se in wn.Wordnet("omw-en").senses():
        ili = sk_ili.get(se.metadata()['identifier'])
        if ili and ili.status != "proposed":
            ilimap30[se.synset().ili.id] = ili.id
    return ilimap30

#---------------------------------------------------------------------

#target = "omw-en31"
target = "oewn"

ilimap = map30(target)

i1 = "i37881"
i2 = "i37882"

print(ilimap[i1])

i37882

print(ilimap[i2])

i37882

wnfi = wn.Wordnet("omw-fi")
wn2 = wn.Wordnet(target)

ss1 = wnfi.synsets(ili = i1)[0]
print(f"{ss1.ili.id}, {ss1.senses()},\n\
 {ss1.translate(target)}, {wn2.synsets(ili=ilimap[i1])[0]}")

Now, the mapping can provide a translation for this Finnish synset, which has none using Wn's translate() function.

i37881, [Sense('omw-fi-baseball-00471613-n'), Sense('omw-fi-baseball--peli-00471613-n')],
[], Synset('oewn-00472688-n')

So in Wn at present, we have to go through sense-key mappings in order to avoid this problem. I suppose there could be a more direct way to use the CILI mappings, without necessarily losing synsets in the translation, since CILI contains information about the merged synsets. But even then, it remains to be seen whether ILI mappings can match the performance of sense-key mappings.
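
One way to package this workaround is a lookup that tries the shared ILI first and only then falls back to the sense-key-derived map. The sketch below uses toy dicts instead of Wn objects; `translate_with_fallback` is a hypothetical helper, and `ILIMAP` stands in for the result of `map30()` above.

```python
# Toy stand-ins: ILI -> synset ID in the target lexicon, and the
# old-ILI -> new-ILI map that map30() would produce.
OEWN = {"i37882": "oewn-00472688-n"}
ILIMAP = {"i37881": "i37882", "i37882": "i37882"}


def translate_with_fallback(ili, target, ilimap):
    """Look up the ILI directly, then retry with its mapped replacement."""
    if ili in target:
        return [target[ili]]
    mapped = ilimap.get(ili)
    if mapped in target:
        return [target[mapped]]
    return []


print(translate_with_fallback("i37881", OEWN, ILIMAP))
print(translate_with_fallback("i99999", OEWN, ILIMAP))
```

Keeping the direct lookup first means the map is consulted only for ILIs that the target lexicon does not declare.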

ekaf commented on July 30, 2024

As @goodmami wrote:

"what if you translate in the other direction where the single ILI is "split" into two?"

Yes, the inverse problem is that currently, when translating in the opposite direction, Wn only returns one of the merged synsets:

i2 = "i37882"
print(wn.Wordnet("oewn").synsets(ili = i2)[0].translate("omw-fi"))

[Synset('omw-fi-00474568-n')]

In that case, the complete translation would be the union of the senses belonging to all the synsets obtained by reversing the ilimap from above:

def rev_dict(dic):
    rdic = {}
    for key,val in dic.items():
        if val not in rdic:
            rdic[val] = {key}
        else:
            rdic[val].add(key)
    return rdic

sources = rev_dict(ilimap)[i2]

print(f"{sources} --> {i2}")

{'i37881', 'i37882'} --> i37882

print([wn.Wordnet("omw-fi").synsets(ili = i)[0].senses() for i in sources])

[[Sense('omw-fi-baseball-00471613-n'), Sense('omw-fi-baseball--peli-00471613-n')], [Sense('omw-fi-baseball-00474568-n')]]
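
The reverse lookup can also be written with `collections.defaultdict` and wrapped in a single function that returns the union of matching synsets. This is a sketch on toy dicts rather than Wn objects; the names `invert` and `translate_split` are hypothetical, and the Finnish synset ID `omw-fi-00471613-n` is inferred from the sense IDs shown above.

```python
from collections import defaultdict

# Toy stand-ins: ILI -> synset ID in the older-structured lexicon, and the
# old-ILI -> new-ILI map built from sense keys.
OMW_FI = {"i37881": "omw-fi-00471613-n", "i37882": "omw-fi-00474568-n"}
ILIMAP = {"i37881": "i37882", "i37882": "i37882"}


def invert(ilimap):
    """Group the old ILIs by the new ILI they were mapped to."""
    sources = defaultdict(set)
    for old, new in ilimap.items():
        sources[new].add(old)
    return dict(sources)


def translate_split(ili, older_target, ilimap):
    """Return every synset in the older lexicon whose ILI maps to `ili`."""
    candidates = invert(ilimap).get(ili, {ili})
    return sorted(older_target[i] for i in candidates if i in older_target)


print(translate_split("i37882", OMW_FI, ILIMAP))
```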

goodmami commented on July 30, 2024

@ekaf Can you remind me what the expected fix is here? Currently I'm leaning toward saying this is a data challenge (best solved with documentation) and not a bug or missing feature in the code, but maybe you have something in mind that would be appropriate for this library.

ekaf commented on July 30, 2024

There are relatively few (around 30) merged synsets between each English Wordnet version, so losing 30 synsets in translation may not seem a huge problem. However, it is not solved with documentation alone, and a solution in the library appears more helpful.

Since version 3.6.6 (see nltk/#2889), NLTK's wordnet.py library produces a sense-key based mapping "on the fly", at load time, preventing this problem from ever occurring. A similar approach can work in Wn, using code like in the comment above.

An alternative could be if the ILI project also produces lists of merged synsets, with one (or more) synset(s) deprecated and linked to a target synset. This approach is less versatile, because each future English Wordnet needs a separate list of deprecations: you would have to wait for such lists to be produced, then rely on their adequacy, and still need additional code to interpret the deprecations in Wn.
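
For comparison, interpreting such a deprecation list could look roughly like the sketch below. No such list exists in the material quoted here, so the `superseded_by` dict format, the function name `resolve_deprecated`, and the example entries are all hypothetical.

```python
def resolve_deprecated(ili, superseded_by):
    """Follow deprecation links until reaching an ILI that is not deprecated.

    superseded_by: hypothetical {deprecated ILI: replacement ILI} dict.
    """
    seen = set()
    while ili in superseded_by:
        if ili in seen:
            # Guard against malformed lists that loop back on themselves.
            raise ValueError(f"cyclic deprecation chain at {ili}")
        seen.add(ili)
        ili = superseded_by[ili]
    return ili


# Made-up deprecation list: i37881 was merged into i37882.
DEPRECATIONS = {"i37881": "i37882"}
print(resolve_deprecated("i37881", DEPRECATIONS))
print(resolve_deprecated("i37882", DEPRECATIONS))
```

Following chains rather than single links would matter if an ILI were merged more than once across successive versions.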

goodmami commented on July 30, 2024

@ekaf thank you for explaining. I'm not entirely sold on this solution because it encodes lexicon-specific information (the sense keys and where they are stored), which is really only relevant for the English wordnets, and I strive as much as possible for Wn to not favor any particular wordnet or language (with the exception of the included Morphy lemmatizer).

That said, so many wordnets are based on the English structure that it might make sense for practicality to beat purity here. The ILI solution would be more "pure", but, as you describe, that approach has other issues.

@fcbond, I'd like to get your perspective. Should Wn codify English-specific workarounds for merged synsets across wordnet versions? Or maybe the problem is rare enough that some documentation of the problem with a recipe for getting around it would suffice?
