Noun Compound Senses Dataset

This repository contains the Noun Compound Senses (NCS) dataset, used to assess the representation of idiomaticity in vector space models.

The NCS dataset has data for 280 and 180 noun compounds (NCs) in English and Portuguese, respectively, with different degrees of idiomaticity. For each compound, it contains 3 naturalistic corpus sentences and a neutral context (e.g., This is a/an NC.).

Due to copyright restrictions, we do not release all the original (naturalistic) sentences. Instead, we include a script to obtain them from the ukWaC (Baroni et al., 2009) and brWaC (Wagner Filho et al., 2018) corpora (see below).

For all sentences in naturalistic and neutral contexts the dataset includes three variants (P1, P2, and P3) with the following characteristics:

  • P1: The original NC is replaced by a synonym (e.g., brain instead of gray matter).
  • P2: The original NC is replaced by its syntactic head and by its dependent, yielding two different sentences (e.g., matter and gray).
  • P3: Each component of the original NC is replaced by a synonym (e.g., alligator sobs instead of crocodile tears).
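
As a rough illustration of how the three variants relate to a source sentence, the sketch below builds them by plain string substitution. It is not the procedure used to create the dataset: `build_variants` and its parameters are hypothetical names, the example sentence is invented, the P3 synonyms `silvery` and `material` are placeholders rather than dataset entries, and the head-last word order assumed here only fits two-word English compounds.

```python
def build_variants(sentence, nc, nc_synonym, component_synonyms):
    """Return the P1, P2 and P3 variants of `sentence` for a two-word compound.

    Hypothetical helper for illustration only; not part of the repository.
    """
    # Assumption: English-style order, with the dependent first and the head last.
    dependent, head = nc.split()
    if nc not in sentence:
        raise ValueError(f"compound {nc!r} not found in sentence")
    return {
        # P1: the whole compound is replaced by a synonym.
        "P1": [sentence.replace(nc, nc_synonym)],
        # P2: two sentences, one with the head only and one with the dependent only.
        "P2": [sentence.replace(nc, head), sentence.replace(nc, dependent)],
        # P3: each component is replaced by its own synonym.
        "P3": [sentence.replace(nc, " ".join(component_synonyms))],
    }


variants = build_variants(
    "The gray matter was damaged.",   # invented example sentence
    "gray matter",
    "brain",                          # P1 synonym taken from the description above
    ("silvery", "material"),          # placeholder P3 synonyms, not from the dataset
)
for name, sentences in variants.items():
    print(name, sentences)
```

Simple substitution like this ignores articles and agreement (e.g., "a/an"), which a real substitution step would need to handle.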

The NCS dataset contains a total of 5,620 test items for English, and 3,600 for Portuguese, and it is based on the NC Compositionality dataset (Cordeiro et al., 2019; Reddy et al., 2011).

Obtaining the sentences

Requirements

  • Python 3
  • Pandas
  • ukWaC corpus in XML format (tagged). The 25 files (UKWAC-1.xml to UKWAC-25.xml) should be concatenated into a single one (e.g., cat UKWAC*xml > UKWAC_full.xml).
  • brWaC corpus in .conll format (single file brwac.conll)
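
The shell glob in the ukWaC requirement concatenates the files in lexicographic order (UKWAC-1, UKWAC-10, UKWAC-11, ...). Whether the order matters to get_sentences.py is not stated here, so the following is only an optional sketch for concatenating the files in numeric order and failing early if one is missing. The function name `concatenate_ukwac` and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def concatenate_ukwac(src_dir, out_path, n_files=25):
    """Concatenate UKWAC-1.xml .. UKWAC-<n_files>.xml into a single file.

    Hypothetical helper; file names follow the pattern given in the requirements.
    """
    parts = [Path(src_dir) / f"UKWAC-{i}.xml" for i in range(1, n_files + 1)]
    missing = [p.name for p in parts if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing corpus files: {', '.join(missing)}")
    # Copy in binary mode so the corpus encoding is left untouched.
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(out_path)


if __name__ == "__main__":
    # Assumes the 25 files sit in the current directory.
    concatenate_ukwac(".", "UKWAC_full.xml")
```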

Building the corpus

Use the script get_sentences.py to obtain the sentences from the WaC corpora:

python3 get_sentences.py --lang en --corpus UKWAC_full.xml

python3 get_sentences.py --lang pt --corpus brwac.conll

Each run should create an original_sents.csv file inside dataset/lang/naturalistic/, where lang is presumably the language code passed to --lang (en or pt).
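
A quick way to confirm that both runs produced their output is sketched below. It assumes that `lang` in the path above stands for the `--lang` value (`en` or `pt`), which this README does not state explicitly; `expected_outputs` and `missing_outputs` are hypothetical helper names, not part of the repository.

```python
from pathlib import Path

# Assumption: the directory name matches the value passed to --lang.
LANGS = ("en", "pt")


def expected_outputs(root="dataset"):
    """Map each language code to the path where its CSV is expected."""
    return {
        lang: Path(root) / lang / "naturalistic" / "original_sents.csv"
        for lang in LANGS
    }


def missing_outputs(root="dataset"):
    """Return the language codes whose original_sents.csv is not present."""
    return [lang for lang, path in expected_outputs(root).items() if not path.is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("No original_sents.csv found for:", ", ".join(missing))
    else:
        print("Found original_sents.csv for all languages.")
```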

Citation

If you use the Noun Compound Senses dataset, please cite the following paper:

  • Garcia, Marcos, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart and Aline Villavicencio. 2021. Probing for idiomaticity in vector space models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021). Association for Computational Linguistics.

References

Baroni, Marco, Silvia Bernardini, Adriano Ferraresi and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Cordeiro, Silvio, Aline Villavicencio, Marco Idiart and Carlos Ramisch. 2019. Unsupervised compositionality prediction of nominal compounds. Computational Linguistics, 45(1):1–57.

Reddy, Siva, Diana McCarthy and Suresh Manandhar. 2011. An empirical study on compositionality in compound nouns. In Fifth International Joint Conference on Natural Language Processing, IJCNLP 2011, Chiang Mai, Thailand, November 8-13, 2011, pages 210–218. The Association for Computer Linguistics.

Wagner Filho, Jorge Alberto, Rodrigo Wilkens, Marco Idiart and Aline Villavicencio. 2018. The brWaC Corpus: A New Open Resource for Brazilian Portuguese. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). ELRA.
