ski's Introduction

Sense Key Index (SKI) for inter-operability between WordNet-related projects

CC BY 4.0 2017-21 Eric Kafe

Ongoing work: This is a pre-release version; please see the CHANGES.md file.

1. Introduction

According to WordNet's "senseidx" manual page:

A sense_key is the best way to represent a sense in semantic tagging or other systems that refer to WordNet senses. sense_keys are independent of WordNet sense numbers and synset_offsets, which vary between versions of the database.

As a consequence, sense keys provide a stable basis for the inter-operation between semantic web applications that rely on different versions of WordNet.

Princeton WordNet (PWN) has included a sense key index (the index.sense file) since version 1.4 from 1993, but the sense key notation changed in 1995 with WordNet version 1.5, and has remained stable since then.

Thus, we can define the full PWN sense key index as the unique concatenation of all the stable index.sense files from the different versions of the original Princeton WordNet distribution (currently 1.5 up to 3.1.1), with an extra field indicating the WordNet version number.
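The definition above can be sketched in a few lines of Python. This is only an illustration, not the code used by wn2ski: it assumes the four whitespace-separated fields documented in senseidx(5WN) (sense_key, synset_offset, sense_number, tag_cnt), reads "unique" as "without duplicate entries", and uses hypothetical names (SenseEntry, parse_index_sense, build_index) and sample values chosen for the example.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class SenseEntry(NamedTuple):
    """One row of the combined index (hypothetical layout)."""
    sense_key: str
    version: str
    synset_offset: str


def parse_index_sense(lines: Iterable[str], version: str) -> Iterator[SenseEntry]:
    """Yield one entry per index.sense line, tagged with its WordNet version."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        yield SenseEntry(fields[0], version, fields[1])


def build_index(sources: Dict[str, Iterable[str]]) -> List[SenseEntry]:
    """Concatenate the entries of several versions, dropping exact duplicates."""
    seen = set()
    index = []
    for version, lines in sources.items():
        for entry in parse_index_sense(lines, version):
            if entry not in seen:
                seen.add(entry)
                index.append(entry)
    return index


# Illustrative sample lines in index.sense layout (values are for demonstration only).
sample = {
    "3.0": ["dog%1:05:00:: 02084071 1 42"],
    "3.1": ["dog%1:05:00:: 02086723 1 42"],
}
ski = build_index(sample)
```

In this sketch the same sense key appears once per version, each time with the synset offset it had in that version, which is what lets the key act as the stable join column.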

2. The Sense Key Index can be used to

  • Generate database components in various formats (text, tab, csv, prolog, rdf), to interface with any WordNet-related project: the GWA grid, OMW, WN-ontology, ILI, MCR, Freeling, etc.
  • Produce mappings between all WordNet versions
  • Map version-limited WordNet resources like the ILI or the MCR to other WordNet versions
  • Produce statistics about the permanence of sense keys or ILI identifiers across WordNet versions
  • and more forthcoming...

3. Included files

3.1 SKI databases

  • ski-pwn-sets.txt

sense key index for all the modern Princeton WordNet versions (currently 1.5, 1.6, 1.7, 1.7.1, 2.0, 2.1, 3.0, 3.1 and 3.1.1), derived from the corresponding Princeton WordNet index.sense files, retrieved from Wordnetcode

  • ski-violations.txt

mapping of the sense key violations found in Kafe, E. (2018): Persistent semantic identity in WordNet. This mapping handles a small number of accidental cases, where a sense key changed meaning during a WordNet update.

  • ski-mcr30-2016.txt

sense key index for MCR30-2016, derived by joining the inverse SKI (ski-pwn-si-flat.txt) with the latest MCR "variant" files, retrieved from the MCR

  • ski-ili30.txt

sense key index for ILI30, derived by joining the inverse SKI (ski-pwn-si-flat.txt) with the GWA-ILI ili-map-pwn30.tab file retrieved from GWA/ILI

3.2 SKI tools

  • wn2ski

Builds the Sense Key Index from the original WordNet files, retrieved from Wordnetcode. These files are expected to be found in local subdirectories named dict-WordnetVersion.Number, ranging from dict-1.5/index.sense up to dict-3.1.1/index.sense. The latest WordNet version (2012) was retrieved from WordNet 3.1.1 for SQL.

  • pwn2maps

Generates synset offset mappings between all the WordNet versions from ski-pwn-sets.txt

  • pwn2flat

Generates the flat text relation file ski-pwn-flat.txt between all synsets in all WordNet versions and their sense keys, together with the inverse relation (ski-pwn-si-flat.txt and ski-pwn-si-sets.txt). It also outputs this relation as tab-separated 4-tuples (ski-pwn-flat.tab), designed for compatibility with the MCR, and in Prolog format (ski-pwn-flat.pl), designed for compatibility with the Prolog version of PWN. Additionally, it produces a mapping from each sense key to its last known WordNet version (ski-pwn-last).

  • ili2map

(runs pwn2flat first, to generate the needed ski-pwn-flat.txt and ski-pwn-last.txt databases) Maps ILI-30 ids to all Princeton WordNet versions, and to their last known Princeton WordNet version.

  • mcr2free

Generates Freeling sense databases from MCR data
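The idea behind mapping synset offsets between WordNet versions, as pwn2maps does, is that two offsets correspond when they share a sense key. The sketch below illustrates that join under stated assumptions: each version is given as an in-memory dictionary from sense key to synset offset (not the actual ski-pwn-sets.txt layout), the function name map_offsets and the sample data are hypothetical, and the part of speech is ignored even though real synset offsets are only unique within one data file.

```python
from collections import defaultdict
from typing import Dict, Set


def map_offsets(src: Dict[str, str], dst: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each source synset offset to the target offsets sharing a sense key."""
    mapping: Dict[str, Set[str]] = defaultdict(set)
    for sense_key, src_offset in src.items():
        dst_offset = dst.get(sense_key)
        if dst_offset is not None:  # the sense key survives in the target version
            mapping[src_offset].add(dst_offset)
    return dict(mapping)


# Made-up sense keys and offsets, for illustration only.
older = {
    "dog%1:05:00::": "100",
    "domestic_dog%1:05:00::": "100",
    "cat%1:05:00::": "200",
}
newer = {
    "dog%1:05:00::": "150",
    "domestic_dog%1:05:00::": "150",
    "kitty%1:05:00::": "250",
}
offset_map = map_offsets(older, newer)
```

A source offset can map to several target offsets when a synset was split between versions, which is why the values are sets; offsets whose sense keys all disappeared (such as "200" here) get no entry.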

3.3 Output files

For your convenience, this release also includes all the output produced by running the SKI-tools, compressed with gzip:

  • ski-pwn-flat.txt.gz: the flat text version of ski-pwn-sets.txt
  • ski-pwn-si-flat.txt.gz: the same, inverted: map from synsets to sense keys
  • ski-pwn-si-sets.txt.gz: the previous, as sets: map from synsets to sets of sense keys
  • ski-pwn-flat.tab.gz: pwn-flat as tab-separated 4-tuples
  • ski-pwn-flat.pl.gz: Prolog version of the ski-flat relation, as triples
  • ski-pwn-last.txt.gz: mapping from sense keys to their last known synset offset
  • ski-ili.tar.gz: ILI mappings
  • ski-mappings-pwn1.tar.gz: mappings from PWN versions 1.x to all later versions
  • ski-mappings-pwn2.tar.gz: mappings from PWN versions 2.x and 3.x to all later versions
  • freeling_data-mcr30-2016.tar.gz: senses30.src databases for Freeling
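The relationship between the flat relation and its inverse "sets" form can be illustrated with a short sketch. It assumes the flat relation is available as (sense_key, version, synset_offset) triples, which is a hypothetical in-memory layout and not a description of the actual file formats; the function name invert_to_sets and the sample rows are likewise made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed layout of one flat row: (sense_key, version, synset_offset).
Triple = Tuple[str, str, str]


def invert_to_sets(flat: Iterable[Triple]) -> Dict[Tuple[str, str], List[str]]:
    """Group sense keys by the (version, synset_offset) pair they belong to."""
    inverse = defaultdict(set)
    for sense_key, version, offset in flat:
        inverse[(version, offset)].add(sense_key)
    # Sort each group so the output is deterministic.
    return {synset: sorted(keys) for synset, keys in inverse.items()}


flat_rows = [
    ("dog%1:05:00::", "3.0", "100"),
    ("domestic_dog%1:05:00::", "3.0", "100"),
    ("dog%1:05:00::", "3.1", "150"),
]
synset_sets = invert_to_sets(flat_rows)
```

Each key of the result identifies one synset in one WordNet version, and the value lists every sense key attached to it.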

4. References

  • Fellbaum, C.: WordNet, An Electronic Lexical Database. MIT Press, Cambridge, 1998.

  • Gonzalez-Agirre, A., Laparra, E., Rigau, G.: Multilingual central repository version 3.0: upgrading a very large lexical knowledge base. In: Proceedings of the Sixth International Global WordNet Conference (GWC2012). Matsue, Japan, 2012.

  • Kafe, E. (2018): Persistent semantic identity in WordNet. In: Cognitive Studies | Études cognitives, 2018(18).

  • WordNet-team: Senseidx(5wn). In: WordNet manual. Princeton University, 2010.

ski's Issues

Sense key identifiers should be valid XML IDs

I think it would be much better if the SKIs were also valid as XML IDs. This means they should fit the following pattern:

NameStartChar ::=   ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
                          [#xD8-#xF6] | [#xF8-#x2FF] |
                          [#x370-#x37D] | [#x37F-#x1FFF] |
                          [#x200C-#x200D] | [#x2070-#x218F] |
                          [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
                          [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
                          [#x10000-#xEFFFF]

NameChar     ::=      NameStartChar | "-" | "." | [0-9] | #xB7 |
                        [#x0300-#x036F] | [#x203F-#x2040]

Source: http://www.w3.org/TR/REC-xml/#NT-Name

This means characters like '#' can't be used and should be escaped. How about ':'?
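A quick way to test candidate identifiers against the Name production quoted above is a regular expression built from the same character ranges. The following is a minimal sketch, not part of the SKI tools; the names XML_NAME and is_xml_name are hypothetical. It checks only the Name production itself, under which ':' is accepted, while the '%' used in WordNet sense keys is not a NameChar.

```python
import re

# Character ranges copied from the NameStartChar production.
NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF"
    "\u200C-\u200D\u2070-\u218F"
    "\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, the middle dot and two more ranges.
NAME_CHAR = NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

XML_NAME = re.compile("[" + NAME_START + "][" + NAME_CHAR + "]*")


def is_xml_name(candidate: str) -> bool:
    """Return True if the whole string matches the XML Name production."""
    return XML_NAME.fullmatch(candidate) is not None


checks = {
    "dog%1:05:00::": is_xml_name("dog%1:05:00::"),  # contains '%'
    "dog:1:05:00::": is_xml_name("dog:1:05:00::"),  # only letters, digits, ':'
    "a#b": is_xml_name("a#b"),                      # contains '#'
}
```

With this check, any escaping scheme for sense keys can be validated by running its output through is_xml_name.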
