This package contains the data of the MLEE (Multi-Level Event Extraction)
corpus, version 1.0.2 (revision 1).

This README provides a brief overview of the package contents. See the
LICENSE file included in the package for the data license, the
manuscript referenced at the bottom of this file for an introduction
of the corpus, and the project homepage

    http://www.nactem.ac.uk/MLEE/

for data visualizations, supplementary data and more information.


CONTENTS


This package contains the following:

* README:       this file
* LICENSE:      licenses of the texts and annotations
* standoff:     corpus data in standoff format (all annotations)
* conll:        corpus data in CoNLL format (entity annotations only)

Both the standoff/ and conll/ directories contain the following
subdirectories:

* development:  development split of data, excluding test set
* test:         test split of the data, including all data
* full:         full corpus data

Each of the development/ and test/ directories further contains the
following:

* train:        training data for development/final test
* test:         test data for development/final test

The format and suggested use of the files contained in these
directories are explained below.
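
As an illustration of the layout described above, the following minimal
Python sketch builds the path to one part of the data. The function
names and the root directory name "MLEE" are assumptions made for this
example and are not part of the package.

```python
from pathlib import Path

# Directory names taken from the CONTENTS listing in this README.
FORMATS = ("standoff", "conll")
STAGES = ("development", "test")
PARTS = ("train", "test")


def split_dir(root, fmt, stage, part):
    """Return the directory for one part of a split, e.g. conll/development/train."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    if part not in PARTS:
        raise ValueError(f"unknown part: {part!r}")
    return Path(root) / fmt / stage / part


def full_dir(root, fmt):
    """Return the directory holding the full corpus in the given format."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return Path(root) / fmt / "full"


if __name__ == "__main__":
    # "MLEE" is a hypothetical location of the unpacked package.
    print(split_dir("MLEE", "conll", "development", "train"))
    print(full_dir("MLEE", "standoff"))
```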


FORMAT

The corpus data is provided in two formats: BioNLP Shared Task-style
standoff format, and CoNLL shared task-style BIO-format.


Standoff format

The data in the standoff/ directory are provided in the standoff
format used by the brat annotation tool (http://brat.nlplab.org/). For
details of the format, see the documentation page
http://brat.nlplab.org/standoff.html

For the full corpus data in standoff/full/, all standoff annotations
for a single text file are provided in a single file (.ann). For the
data in standoff/development/ and standoff/test/, the annotations are
split into entity annotations (.a1) and event annotations (.a2). This
is intended to facilitate event extraction experiments where entity
annotations are provided as part of the input.
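
As a rough sketch of how these files might be read, the following
Python code parses text-bound ("T") and event ("E") annotation lines as
described in the brat standoff documentation. It applies equally to
.ann, .a1 and .a2 files. The class and function names are assumptions
made for this example, other annotation types are skipped, and the
sample lines at the end are made up for illustration rather than taken
from the corpus.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str
    ann_type: str
    spans: list  # list of (start, end) character offsets
    text: str


class Event(NamedTuple):
    ann_id: str
    ann_type: str
    trigger: str  # ID of the text-bound annotation that triggers the event
    args: list    # list of (role, referenced annotation ID)


def parse_standoff(lines):
    """Parse standoff lines into dicts of text-bound and event annotations."""
    textbounds, events = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # Text-bound: ID <TAB> TYPE START END[;START END...] <TAB> TEXT
            ann_type, offsets = fields[1].split(" ", 1)
            spans = [tuple(int(x) for x in part.split())
                     for part in offsets.split(";")]
            textbounds[ann_id] = TextBound(ann_id, ann_type, spans, fields[2])
        elif ann_id.startswith("E"):
            # Event: ID <TAB> TYPE:TRIGGER [ROLE:ARG ...]
            parts = fields[1].split()
            ann_type, trigger = parts[0].split(":", 1)
            args = [tuple(p.split(":", 1)) for p in parts[1:]]
            events[ann_id] = Event(ann_id, ann_type, trigger, args)
        # Other annotation types (relations, attributes, notes, ...) are
        # ignored in this sketch.
    return textbounds, events


if __name__ == "__main__":
    # Made-up sample lines, not taken from the corpus.
    sample = [
        "T1\tCell 0 5\tcells\n",
        "T2\tGrowth 6 10\tgrow\n",
        "E1\tGrowth:T2 Theme:T1\n",
    ]
    tbs, evs = parse_standoff(sample)
    print(tbs["T1"])
    print(evs["E1"])
```

For the development/ and test/ data, the same function could be called
once on the .a1 file and once on the .a2 file of a document, merging the
resulting dictionaries.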


CoNLL format

The data in the conll/ directory is provided in the column-formatted
BIO representation used in many reference resources for mention
detection such as that of the CoNLL shared tasks (see
e.g. http://www.cnts.ua.ac.be/conll2002/ner/). 

Each line contains four TAB-separated columns: token text, start
offset, end offset, and tag. Each tag consists of one of the letters B,
I or O (for "begin", "in", and "out"), followed by the type of the
entity for the B and I tags. (The offsets into the source text are
provided for reference and can be ignored for most applications.)

The entity mention detection task is to learn to predict the tags
(last column) given the token texts (first column).
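
The following Python sketch shows one way to read the four-column
lines and to group B/I tags into entity mentions. It assumes, without
confirmation from this README, that sentences are separated by blank
lines and that tags have the form "B-Type" / "I-Type" with a hyphen.
The function names and the sample lines are made up for illustration.

```python
def read_conll(lines):
    """Read TAB-separated lines into sentences of (token, start, end, tag)."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Assumed convention: a blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, start, end, tag = line.split("\t")
        current.append((token, int(start), int(end), tag))
    if current:
        sentences.append(current)
    return sentences


def bio_to_mentions(sentence):
    """Group B/I tags into (type, start offset, end offset) mentions."""
    mentions = []
    current = None  # [type, start, end] of the mention being built
    for _token, start, end, tag in sentence:
        # Assumed tag form: "B-Type", "I-Type" or "O".
        label, _, ent_type = tag.partition("-")
        starts_new = label == "B" or (
            label == "I" and (current is None or current[0] != ent_type)
        )
        if starts_new:
            if current is not None:
                mentions.append(tuple(current))
            current = [ent_type, start, end]
        elif label == "I":
            current[2] = end  # extend the open mention
        else:
            if current is not None:
                mentions.append(tuple(current))
            current = None
    if current is not None:
        mentions.append(tuple(current))
    return mentions


if __name__ == "__main__":
    # Made-up sample lines, not taken from the corpus.
    sample = [
        "blood\t0\t5\tB-Tissue\n",
        "vessel\t6\t12\tI-Tissue\n",
        "growth\t13\t19\tO\n",
    ]
    for sentence in read_conll(sample):
        print(bio_to_mentions(sentence))
```

An I tag that does not continue a mention of the same type is treated
here as starting a new mention. This is a lenient design choice for
reading possibly inconsistent tagger output, not a property of the
corpus.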


EVALUATION

The corpus is intended to serve as an evaluation standard. The
proposed approach to method development and evaluation is to use the
test/ data only for final evaluation after completing method
development and parameter selection.

PLEASE NOTE: the data in the development/ and test/ directories are
not separate: the development/ data is itself a split of the
test/train/ data, so the two directories overlap.


CONTACT

For any queries relating to the corpus, please contact Sampo Pyysalo
<[email protected]>


CHANGELOG

* 1.0.2 (11.09.2012): first public release 


REFERENCES

The corpus is presented in the following manuscript.

* Sampo Pyysalo, Tomoko Ohta, Makoto Miwa, Han-Cheol Cho, Jun'ichi
  Tsujii and Sophia Ananiadou (2012). Event extraction across multiple
  levels of biological organization. Bioinformatics 28(18):i575-i581.

The project page is located at http://www.nactem.ac.uk/MLEE/
