Giter Site home page Giter Site logo

liqiandi / acter Goto Github PK

View Code? Open in Web Editor NEW

This project forked from aylart/acter

0.0 0.0 0.0 4.8 MB

ACTER is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains (corruption, dressage, heart failure, and wind energy).

acter's Introduction

ACTER Annotated Corpora for Term Extraction Research, version 1.4

ACTER is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains (corruption, dressage, heart failure, and wind energy).

Readme structure:

  1. General
  2. Abbreviations
  3. Data Structure
  4. Annotations
  5. Additional Information
  6. Updates
  7. Error Reporting
  8. License

1. General

2. Abbreviations

Languages and domains:

  • "en" = English
  • "fr" = French
  • "nl" = Dutch
  • "corp" = corruption
  • "equi" = equitation (dressage)
  • "htfl" = heart failure
  • "wind" = wind energy

Types of terms/annotations:

  • "Spec" or "Specific": Specific Terms
  • "Com" or "Common": Common Terms
  • "OOD": Out-of-Domain Terms
  • "NE(s)": Named Entities

3. Data Structure

File structure under each language folder ("en", "fr", and "en") is identical:

ACTER
│   README.md
│   sources.txt
│
└───en
│   └───corp
│   |   └───annotations
│   |   |   |   corp_en_terms.ann
│   |   |   |   corp_en_terms_nes.ann
│   |   | 
│   |   └───texts
|   |       └───annotated
│   |       |   corp_en_01.txt
│   |       |   corp_en_02.txt
│   |       |   ...
│   |       |
|   |       └───unannotated
│   |           |   corp_en_03.txt
│   |           |   ...
|   |
│   └───equi (equivalent to "corp")
|   |
│   └───htfl (equivalent to "corp")
|   |
│   └───wind (equivalent to "corp")
|
└───fr (equivalent to "en")
└───nl (equivalent to "en")

As can be seen, there are corpora in three languages and four domains. All domains are available in all languages and they are comparable across these languages, meaning that they are not only about the same domain, but also have a similar style and size. However, they are not parallel corpora, so they cannot be aligned (not even on document level). The file names always mention the subject, language, and a unique id (e.g. corp_en_01.txt).

For each part of the corpus, both the plain text files and the annotations are included. There are two annotation files: one with only the term annotations (Specific Terms, Common Terms, and OOD Terms), and one with both term and Named Entity annotations. The labels are mentioned for each annotation (see also section 4).

The plain text files are split into those which have been annotated and those that have not been annotated. This means that all annotations were found in the parts of the corpora labelled as "annotated" and that the "unannotated" parts of the corpora may contain many more terms which are not (yet) in the gold standard. Currently, around 50k words in each corpus (combination language/domain) have been manually annotated.

There is a single case where a text has been annotated only partially: wind_fr_06; therefore the text has been split and the unannotated part is called wind_fr_06bis.

In addition, there is this readme file and a txt-file with the sources of all corpora.

4. Annotations

4.1 Format

The annotations are provided in simple UTF-8 encoded plain text files, with one annotation per line.

The term annotation files include all the term annotations (Specific, Common, and OOD Terms have been combined). A separate file (terms_nes) includes both terms and NEs. Since version 1.2, the labels are added to each annotation.

This means that the "terms.amm" and "terms_nes.ann" files now contain two types of information per line: the annotation (lowercased, unlemmatised, see further), followed by a tab and the label of this annotation ("Specific_Term", "Common_Term", "OOD_Term", or "Named_Entity"). In cases where a single annotation received different labels depending on the context, the most frequently given label is provided.

4.2 Casing, POS-tagging, and Lemmatisation

True-casing, POS-tagging & lemmatisation are non-trivial tasks but not the focus of this edition of TermEval. Therefore, all data will be lower-cased, non-lemmatised, and with only one entry per term.

For example, the English corpus on dressage contains the term “bent” (verb – past tense of “to bend”), but also “Bent” (proper noun – person name). While both capitalisation and POS differ, and “bent” is not the lemmatised form, there will be only one entry: “bent” (lowercased) in the gold standard (other full forms of the verb “to bend” have separate entries, if they are present and annotated in the corpus).

5. Additional Information

Websites:

Publications:

  • Rigouts Terryn, A., Hoste, V., & Lefever, E. (2018). A Gold Standard for Multilingual Automatic Term Extraction from Comparable Corpora: Term Structure and Translation Equivalents. Proceedings of LREC 2018.
  • Rigouts Terryn, A., Hoste, V., & Lefever, E. (2019). In No Uncertain Terms: A Dataset for Monolingual and Multilingual Automatic Term Extraction from Comparable Corpora. Language Resources and Evaluation, 54(2), 385–418. https://doi.org/10.1007/s10579-019-09453-9
  • Rigouts Terryn, A., Hoste, V., Drouin, P., & Lefever, E. (2020). TermEval 2020: Shared Task on Automatic Term Extraction Using the Annotated Corpora for Term Extraction Research (ACTER) Dataset. Proceedings of the 6th International Workshop on Computational Terminology (COMPUTERM 2020), 85–94.

The dataset has been updated since the publication of the former two papers. These papers also discuss aspects of the data which have not been made available yet, such as cross-lingual annotations and information on the span of the annotations.

Number of annotations per corpus:

Domain Language # term annotations # term + Named Entity annotations # Specific Terms # Common Terms # OOD Terms # Named Entities:

  • corp en 927 1173 278 642 6 247
  • equi en 1155 1575 777 309 69 420
  • htfl en 2361 2585 1883 319 157 226
  • wind en 1091 1534 781 296 14 443
  • corp fr 979 1207 298 675 5 229
  • equi fr 961 1181 701 234 26 220
  • htfl fr 2228 2374 1684 487 57 146
  • wind fr 773 968 444 308 21 195
  • corp nl 1047 1295 310 730 6 249
  • equi nl 1393 1544 1022 330 41 151
  • htfl nl 2074 2254 1559 449 66 180
  • wind nl 940 1245 577 342 21 305

Normalisation:

The following normalisation procedures are applied to both the original text files and the annotations:

  • unicodedata.normalize("NFC", text)
  • normalising all dashes to "-", all single quotes to "'" and all double quotes to '"'

6. Updates

Changes version 1.0 > version 1.1

  • English corpora:
    • corruption
      • Removed 1 NE: 'com(2007) 805 final'
    • wind energy
      • Removed 2 terms: 'variable pitch blades', 'renewable sources'
      • Removed 1 NE: 'skuodas'
  • French corpora:
    • corruption:
      • Removed 2 terms: 'indélicat', 'loi relative à la corruption'
    • equitation-dressage
      • Removed 2 terms: 'canons', 'équilibration'
    • wind energy
      • Added 1 term: 'systèmes mutisources-multistockages'
      • Removed 4 terms: 'systèmes mutisources', 'quadrature', 'inductance directe', 'résistance statorique'
      • Removed 98 NEs: 'bar', 'esk', 'akh', 'tht', 'enbw', 'rich', 'kama', 'man', 'sab', 'mer', 'deg', 'mor', 'aba', 'abo', 'ana', 'azm', 'joo', 'jen', 'pri', 'han', 'ree', 'dav', 'cou', 'hol', 'sau', 'lal', 'lei', 'vet', 'pur', 'per', 'her', 'hau', 'ans', 'slo', 'win', 'thi', 'ela', 'stem', 'cer', 'lav', 'ack', 'e.on', 'cim', 'luo', 'wik', 'ds1103', 'fag', 'and', 'alm', 'pan', 'rap', 'ric', 'saa', 'reb', 'bor', 'kin', 'sem', 'ecr', 'fau', 'ukt', 'kun', 'creg', 'sal', 'bou', 'crap', 'mog', 'nget', 'stu', 'sei', 'lec', 'dir', 'nor', 'abb', 'doh', 'rwe', 'mul', 'oud', 'bea', '96/92/ce', 'gar', 'eri', 'cal', 'goi', 'ish', 'fra', 'cra', 'bna', 'ull', 'des', 'ips', 'dro', 'uct', 'mat', 'ds 1104', 'mar', 'svk', 'bla', 'buh'
  • Dutch corpora:
    • corruption
      • Added 1 term: 'anticorruptie-eenheid'
      • Removed 4 terms: 'verslagen corruptiebestrijding', 'auditdiensten', 'anticorruptie', 'wet betreffende de omkoping'
    • equitation-dressage
      • Removed 2 terms: 'promotie', 'stuw'
    • wind energy
      • Removed 2 terms: 'windturbines een horizontale as', 'power coefficient'

Changes version 1.1 > version 1.2

  • Included domain of heart failure (test domain for TermEval shared task)

Changes version 1.2 > version 1.3

  • corrected wrong sources in htfl_nl
  • changed heart failure abbreviation to "htfl" to be consistent with four-letter domain abbreviations
  • created Github repository for data + submitted it to CLARIN

Changes version 1.3 > version 1.4

  • applied limited normalisation on both texts and annotations:
    • unicodedata.normalize("NFC", text)
    • normalising all dashes to "-", all single quotes to "'" and all double quotes to '"'

7. Error Reporting

The ACTER dataset is an ongoing project, so we are always looking to improve the data. Any questions or issues regarding this dataset may be reported via the Github repository at: https://github.com/AylaRT/ACTER and will be addressed asap.

8. License

  • License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
  • Reference: Please cite the following Open Access paper if you use this dataset for your research (https://doi.org/10.1007/s10579-019-09453-9)
    • Authors: Ayla Rigouts Terryn, Véronique Hoste, Els Lefever
    • Title: In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora
    • Date of online publication: 26 March 2019
    • Date of print publication: 2020 (Volume 54, Issue 2, pages 385-418)
    • Journal: Language Resources and Evaluation (LRE)
    • Publisher: Springer

The data can be freely used and adapted for non-commercial purposes, provided the above mentioned paper is cited and any changes made to the data are clearly stated.

acter's People

Contributors

aylart avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.