Giter Site home page Giter Site logo

ecbplus's Introduction

ecbPlus

The ECB+ corpus is an extension to the EventCorefBank (ECB, Bejan and Harabagiu, 2010). A newly added corpus component consists of 502 documents that belong to the 43 topics of the ECB but that describe different seminal events than those already captured in the ECB. All corpus texts were found through Google Search and were annotated with mentions of events and their times, locations, human and non-human participants as well as with within- and cross-document event and entity coreference information. The 2012 version of annotation of the ECB corpus (Lee et al., 2012) was used as a starting point for re-annotation of the ECB according to the ECB+ annotation guideline. The major differences with respect to the 2012 version of annotation of the ECB are:

(a) five event components are annotated in text:

actions (annotation tags starting with ACTION and NEG) times (annotation tags starting with TIME) locations (annotation tags starting with LOC) human participants (annotation tags starting with HUMAN) non-human participants (annotation tags starting with NON_HUMAN)

(b) specific action classes and entity subtypes are distinguished for each of the five main event components resulting in a total tagset of 30 annotation tags based on ACE annotation guidelines (LDC 2008), TimeML (Pustejovsky et al., 2003 and Sauri et al., 2005)

(c) intra- and cross-document coreference relations between mentions of the five event components were established: INTRA_DOC_COREF tag captures within document coreference chains that do not participate in cross-document relations; within document coreference was annotated by means of the CAT tool (Bartalesi et al., 2012) CROSS_DOC_COREF tag indicates cross-document coreference relations created in the CROMER tool (Girardi et al., 2014); all coreference branches refer by means of relation target IDs to the so called TAG_DESCRIPTORS, pointing to human friendly instance names (assigned by coders) and also to instance_id-s (d) events are annotated from an “event-centric” perspective, i.e. annotation tags are assigned depending on the role a mention plays in an event (for more information see ECB+ references).

CONTENTS

  • 982 ECB+ corpus texts in the XML format: ECB+.zip
  • ECB+ annotation guideline: NWR-2014-1.pdf
  • Index of sentences annotated with coreference: ECBplus_coreference_sentences.csv (see README.TXT)
  • LICENSEDATA.TXT
  • COPYING-CC.TXT
  • README.TXT.

The XML files with the annotated text are in the CROMER format (see references Gerardi et al 2014). This format distinguishes mentions from instances using identifiers. Coreference relations and other semantic relations are expressed between identifiers of entities and events. Instance identifiers refer back to mention in the text, which is represented as separate tokens. E.g.:

<ACTION_OCCURRENCE m_id="67"  >
  <token_anchor t_id="195"/>

represents a mention of an action anchored to the token with identifier "195", which is the word "plea" in the file 1_1ecbplus.xml. Cross-document coreference relations are expressed between mentions as between mention "67" and mention "89".

<CROSS_DOC_COREF r_id="21700" note="ACT15743050988266116" >
  <source m_id="67" />
  <target m_id="89" />
</CROSS_DOC_COREF>

The CROSS_DOC_COREF element has a nore attribute that functions as the instance identifier of the action. These instance identifiers establish cross-document coreferences.

When using this resource in publications please cite:

  • Agata Cybulska and Piek Vossen. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the 9th international conference on Language Resources and Evaluation (LREC2014) The ECB+ corpus is also described in:
  • Agata Cybulska and Piek Vossen. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of LREC 2014.
  • Agata Cybulska and Piek Vossen. Guidelines for ECB+ Annotation of Events and their Coreference. 2014. (http://www.newsreader-project.eu/files/2013/01/NWR-2014-1.pdf)

Other relevant references:

  • Linguistic Data Consortium. 2008. Ace (automatic content extraction) english annotation guidelines for entities, version 6.6 2008.06.13. Technical report, June. http://projects.ldc.upenn.edu/ace/docs/English-Entities-Guidelines v6.6.pdf.
  • Bartalesi Lenzi, Valentina, Moretti, Giovanni, and Sprugnoli, Rachele. 2012. CAT: the CELCT Annotation Tool. In Proceedings of LREC 2012.
  • Cosmin Bejan and Sanda Harabagiu. 2010. Unsupervised Event Coreference Resolution with Rich Linguistic Features. In Proceedings of ACL 2010, pages 1412–1422.
  • Christian Girardi, Manuela Speranza, Rachele Sprugnoli and Sara Tonelli, 2014. CROMER: a Tool for Cross-Document Event and Entity Coreference. In Proceedings of the International Conference on Language Resources and Evaluation LREC 2014
  • Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu and Dan Jurafsky. 2012. Joint Entity and Event Coreference Resolution across Documents. In Proceedings of EMNLP 2012
  • Pustejovsky, James, Castano, Jose, Ingria, Bob, Sauri, Roser, Gaizauskas, Rob, Setzer, Andrea, and Katz, Graham. 2003. Timeml: Robust specification of event and temporal expressions in text. In Proceedings of Computational Semantics Workshop (IWCS-5)
  • Saur´ı, Roser, Littman, Jessica, Knippen, Robert, Gaizauskas, Robert, Setzer, Andrea, and Pustejovsky, James. (2005). Timeml 1.2.1 annotation guidelines, October. http://timeml.org/site/publications/timeMLdocs/annguide 1.2.1.pdf.

ecbplus's People

Contributors

piekvossen avatar

Stargazers

Joseph Gatto avatar  avatar 慢半拍 avatar  avatar Stefan Schouten avatar Juan Diego Rodriguez avatar Rehan Ahmed avatar

Watchers

James Cloos avatar  avatar  avatar Minh Le avatar  avatar Marten Postma avatar Chantal van Son avatar Pia Sommerauer avatar  avatar

ecbplus's Issues

Coreference chain in xml files

Hi,

Thanks for sharing this great dataset!

I have a question about the tag in xml files. For example, in 1_1ecb.xml, for this relation:

<CROSS_DOC_COREF r_id="37538" note="ACT16235311629112331" >
  <source m_id="41" />
  <target m_id="43" />
</CROSS_DOC_COREF>

Is the source m_id="41" in this document? But what is the document id for the target mention? i.e., where can I find target m_id="43"?

Also, what does this tag mean:
<ACTION_OCCURRENCE m_id="50" RELATED_TO="" TAG_DESCRIPTOR="t1b_checking_in_promises" instance_id="ACT16236402809085484" />?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.