ACTER Annotated Corpora for Term Extraction Research, version 1.4

ACTER is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains (corruption, dressage, heart failure, and wind energy).

Readme structure:

1. General
2. Abbreviations
3. Data Structure
4. Annotations
5. Additional Information
6. Updates
7. Error Reporting
8. License

1. General

Creator: Ayla Rigouts Terryn
Association: LT3 Language and Translation Technology Team, Ghent University
Date of creation version 1.0: 17/12/2019
Date of creation current version 1.4: 15/07/2020
Last updated: 13/05/2020
Contact: [email protected]
Context: Ayla Rigouts Terryn's PhD project + first TermEval shared task (CompuTerm2020)
Shared Task: see https://termeval.ugent.be; workshop proceedings with overview paper at https://lrec2020.lrec-conf.org/media/proceedings/Workshops/Books/COMPUTERM2020book.pdf
Annotation Guidelines: http://hdl.handle.net/1854/LU-8503113
Source: https://github.com/AylaRT/ACTER
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
Reference: Please cite the following Open Access paper if you use this dataset https://doi.org/10.1007/s10579-019-09453-9
- Authors: Ayla Rigouts Terryn, Véronique Hoste, Els Lefever
- Title: In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora
- Date of online publication: 26 March 2019
- Date of print publication: 2020 (Volume 54, Issue 2, pages 385-418)
- Journal: Language Resources and Evaluation (LRE)
- Publisher: Springer

2. Abbreviations

Languages and domains:

"en" = English
"fr" = French
"nl" = Dutch
"corp" = corruption
"equi" = equitation (dressage)
"htfl" = heart failure
"wind" = wind energy

Types of terms/annotations:

"Spec" or "Specific": Specific Terms
"Com" or "Common": Common Terms
"OOD": Out-of-Domain Terms
"NE(s)": Named Entities

3. Data Structure

The file structure under each language folder ("en", "fr", and "nl") is identical:

ACTER
│   README.md
│   sources.txt
│
├───en
│   ├───corp
│   │   ├───annotations
│   │   │       corp_en_terms.ann
│   │   │       corp_en_terms_nes.ann
│   │   │
│   │   └───texts
│   │       ├───annotated
│   │       │       corp_en_01.txt
│   │       │       corp_en_02.txt
│   │       │       ...
│   │       │
│   │       └───unannotated
│   │               corp_en_03.txt
│   │               ...
│   │
│   ├───equi (equivalent to "corp")
│   ├───htfl (equivalent to "corp")
│   └───wind (equivalent to "corp")
│
├───fr (equivalent to "en")
└───nl (equivalent to "en")

As can be seen, there are corpora in three languages and four domains. All domains are available in all languages and they are comparable across these languages, meaning that they are not only about the same domain, but also have a similar style and size. However, they are not parallel corpora, so they cannot be aligned (not even on document level). The file names always mention the subject, language, and a unique id (e.g. corp_en_01.txt).

For each part of the corpus, both the plain text files and the annotations are included. There are two annotation files: one with only the term annotations (Specific Terms, Common Terms, and OOD Terms), and one with both term and Named Entity annotations. The labels are mentioned for each annotation (see also section 4).
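
The folder layout suggests that both annotation files of a corpus can be located from the language and domain abbreviations alone. The following is a minimal sketch under that assumption; the function name `annotation_paths` and the `root` argument (wherever the repository was cloned) are hypothetical and not part of the dataset.

```python
from pathlib import Path

LANGUAGES = ("en", "fr", "nl")
DOMAINS = ("corp", "equi", "htfl", "wind")


def annotation_paths(root, lang, domain):
    """Return (terms_only, terms_and_nes) paths for one corpus.

    Assumes the folder layout shown in this readme; `root` is the local
    directory holding the ACTER data (hypothetical location).
    """
    if lang not in LANGUAGES or domain not in DOMAINS:
        raise ValueError(f"unknown corpus: {domain}/{lang}")
    ann_dir = Path(root) / lang / domain / "annotations"
    return (
        ann_dir / f"{domain}_{lang}_terms.ann",
        ann_dir / f"{domain}_{lang}_terms_nes.ann",
    )
```

For example, `annotation_paths("ACTER", "nl", "htfl")` would point at the Dutch heart failure annotation files, provided the local copy follows the same layout.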

The plain text files are split into those which have been annotated and those that have not been annotated. This means that all annotations were found in the parts of the corpora labelled as "annotated" and that the "unannotated" parts of the corpora may contain many more terms which are not (yet) in the gold standard. Currently, around 50k words in each corpus (combination language/domain) have been manually annotated.

There is a single case where a text has been annotated only partially: wind_fr_06; therefore the text has been split and the unannotated part is called wind_fr_06bis.
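
The naming convention (subject, language, unique id) can be read back from a file name with a small pattern. This is only a sketch: the pattern is inferred from the examples in this readme, it assumes the split file is stored as wind_fr_06bis.txt, and the function name `parse_text_filename` is hypothetical.

```python
import re

# Pattern inferred from the examples in this readme (corp_en_01.txt and the
# split text wind_fr_06bis); other id shapes are not covered.
FILENAME_RE = re.compile(
    r"^(?P<domain>corp|equi|htfl|wind)"
    r"_(?P<lang>en|fr|nl)"
    r"_(?P<id>\d+(?:bis)?)\.txt$"
)


def parse_text_filename(name):
    """Split a text file name into its domain, language, and id parts."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return match.groupdict()
```

Raising an error on unexpected names, rather than skipping them silently, makes it easier to notice files that do not follow the convention.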

In addition, there is this readme file and a txt-file with the sources of all corpora.

4. Annotations

4.1 Format

The annotations are provided in simple UTF-8 encoded plain text files, with one annotation per line.

The term annotation files include all the term annotations (Specific, Common, and OOD Terms have been combined). A separate file (terms_nes) includes both terms and NEs. Since version 1.2, the labels are added to each annotation.

This means that the "terms.ann" and "terms_nes.ann" files now contain two types of information per line: the annotation (lowercased, unlemmatised, see further), followed by a tab and the label of this annotation ("Specific_Term", "Common_Term", "OOD_Term", or "Named_Entity"). In cases where a single annotation received different labels depending on the context, the most frequently given label is provided.
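
A file in this format can be read with a few lines of Python. The sketch below assumes exactly one tab-separated label at the end of each line, as described above; the function name `read_annotations` is hypothetical.

```python
LABELS = {"Specific_Term", "Common_Term", "OOD_Term", "Named_Entity"}


def read_annotations(path):
    """Read one .ann file into a dict mapping annotation -> label."""
    annotations = {}
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue  # tolerate blank lines
            # Split on the last tab so the annotation text is kept whole.
            term, sep, label = line.rpartition("\t")
            if not sep or label not in LABELS:
                raise ValueError(
                    f"{path}:{line_number}: unexpected line {line!r}"
                )
            annotations[term] = label
    return annotations
```

Passing the values of the returned dict to `collections.Counter` gives the number of annotations per label for one corpus.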

4.2 Casing, POS-tagging, and Lemmatisation

True-casing, POS-tagging, and lemmatisation are non-trivial tasks, but they are not the focus of this edition of TermEval. Therefore, all annotations are lowercased and non-lemmatised, with only one entry per term.

For example, the English corpus on dressage contains the term “bent” (verb – past tense of “to bend”), but also “Bent” (proper noun – person name). While both capitalisation and POS differ, and “bent” is not the lemmatised form, there will be only one entry: “bent” (lowercased) in the gold standard (other full forms of the verb “to bend” have separate entries, if they are present and annotated in the corpus).
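
To compare extracted candidate terms against the gold standard, the candidates presumably need the same treatment. The sketch below is an illustration only: it assumes plain lowercasing matches the procedure used for the dataset, and the function name `to_gold_standard_form` is hypothetical.

```python
def to_gold_standard_form(candidates):
    """Lowercase candidate terms and keep one entry per resulting string.

    No lemmatisation is applied, so different full forms of the same lemma
    (e.g. "bent" and "bend") remain separate entries.
    """
    return sorted({candidate.lower() for candidate in candidates})
```

With this, `to_gold_standard_form(["bent", "Bent", "bend"])` yields two entries, "bend" and "bent", mirroring the example above.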

5. Additional Information

Websites:

For more information about the TermEval shared task, visit: https://termeval.ugent.be
For more information about the CompuTerm workshop, visit: https://sites.google.com/view/computerm2020/
For more information about the annotation guidelines, visit: http://hdl.handle.net/1854/LU-8503113

Publications:

Rigouts Terryn, A., Hoste, V., & Lefever, E. (2018). A Gold Standard for Multilingual Automatic Term Extraction from Comparable Corpora: Term Structure and Translation Equivalents. Proceedings of LREC 2018.
Rigouts Terryn, A., Hoste, V., & Lefever, E. (2019). In No Uncertain Terms: A Dataset for Monolingual and Multilingual Automatic Term Extraction from Comparable Corpora. Language Resources and Evaluation, 54(2), 385–418. https://doi.org/10.1007/s10579-019-09453-9
Rigouts Terryn, A., Hoste, V., Drouin, P., & Lefever, E. (2020). TermEval 2020: Shared Task on Automatic Term Extraction Using the Annotated Corpora for Term Extraction Research (ACTER) Dataset. Proceedings of the 6th International Workshop on Computational Terminology (COMPUTERM 2020), 85–94.

The dataset has been updated since the publication of the first two papers. These papers also discuss aspects of the data which have not been made available yet, such as cross-lingual annotations and information on the span of the annotations.

Number of annotations per corpus:

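| Domain | Language | # term annotations | # term + Named Entity annotations | # Specific Terms | # Common Terms | # OOD Terms | # Named Entities |
|--------|----------|--------------------|-----------------------------------|------------------|----------------|-------------|------------------|
| corp   | en       | 927                | 1173                              | 278              | 642            | 6           | 247              |
| equi   | en       | 1155               | 1575                              | 777              | 309            | 69          | 420              |
| htfl   | en       | 2361               | 2585                              | 1883             | 319            | 157         | 226              |
| wind   | en       | 1091               | 1534                              | 781              | 296            | 14          | 443              |
| corp   | fr       | 979                | 1207                              | 298              | 675            | 5           | 229              |
| equi   | fr       | 961                | 1181                              | 701              | 234            | 26          | 220              |
| htfl   | fr       | 2228               | 2374                              | 1684             | 487            | 57          | 146              |
| wind   | fr       | 773                | 968                               | 444              | 308            | 21          | 195              |
| corp   | nl       | 1047               | 1295                              | 310              | 730            | 6           | 249              |
| equi   | nl       | 1393               | 1544                              | 1022             | 330            | 41          | 151              |
| htfl   | nl       | 2074               | 2254                              | 1559             | 449            | 66          | 180              |
| wind   | nl       | 940                | 1245                              | 577              | 342            | 21          | 305              |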

Normalisation:

The following normalisation procedures are applied to both the original text files and the annotations:

unicodedata.normalize("NFC", text)
normalising all dashes to "-", all single quotes to "'" and all double quotes to '"'
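
The two steps above could be reproduced roughly as follows. This is a sketch, not the script used to build the dataset: the readme does not list which dash and quote characters were mapped, so the character sets below are assumptions, and the function name `normalise` is hypothetical.

```python
import unicodedata

# Assumption: illustrative character sets only; the exact characters
# normalised in ACTER are not listed in this readme.
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"
SINGLE_QUOTES = "\u2018\u2019\u201a\u201b"
DOUBLE_QUOTES = "\u201c\u201d\u201e\u201f"

TRANSLATION = str.maketrans(
    {
        **dict.fromkeys(DASHES, "-"),
        **dict.fromkeys(SINGLE_QUOTES, "'"),
        **dict.fromkeys(DOUBLE_QUOTES, '"'),
    }
)


def normalise(text):
    """Apply NFC normalisation, then map dashes and quotes to ASCII."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(TRANSLATION)
```

Applying the same function to system output before comparing it with the annotations avoids mismatches that are caused only by typographic variants.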

6. Updates

Changes version 1.0 > version 1.1

English corpora:
- corruption
  - Removed 1 NE: 'com(2007) 805 final'
- wind energy
  - Removed 2 terms: 'variable pitch blades', 'renewable sources'
  - Removed 1 NE: 'skuodas'
French corpora:
- corruption:
  - Removed 2 terms: 'indélicat', 'loi relative à la corruption'
- equitation-dressage
  - Removed 2 terms: 'canons', 'équilibration'
- wind energy
  - Added 1 term: 'systèmes mutisources-multistockages'
  - Removed 4 terms: 'systèmes mutisources', 'quadrature', 'inductance directe', 'résistance statorique'
  - Removed 98 NEs: 'bar', 'esk', 'akh', 'tht', 'enbw', 'rich', 'kama', 'man', 'sab', 'mer', 'deg', 'mor', 'aba', 'abo', 'ana', 'azm', 'joo', 'jen', 'pri', 'han', 'ree', 'dav', 'cou', 'hol', 'sau', 'lal', 'lei', 'vet', 'pur', 'per', 'her', 'hau', 'ans', 'slo', 'win', 'thi', 'ela', 'stem', 'cer', 'lav', 'ack', 'e.on', 'cim', 'luo', 'wik', 'ds1103', 'fag', 'and', 'alm', 'pan', 'rap', 'ric', 'saa', 'reb', 'bor', 'kin', 'sem', 'ecr', 'fau', 'ukt', 'kun', 'creg', 'sal', 'bou', 'crap', 'mog', 'nget', 'stu', 'sei', 'lec', 'dir', 'nor', 'abb', 'doh', 'rwe', 'mul', 'oud', 'bea', '96/92/ce', 'gar', 'eri', 'cal', 'goi', 'ish', 'fra', 'cra', 'bna', 'ull', 'des', 'ips', 'dro', 'uct', 'mat', 'ds 1104', 'mar', 'svk', 'bla', 'buh'
Dutch corpora:
- corruption
  - Added 1 term: 'anticorruptie-eenheid'
  - Removed 4 terms: 'verslagen corruptiebestrijding', 'auditdiensten', 'anticorruptie', 'wet betreffende de omkoping'
- equitation-dressage
  - Removed 2 terms: 'promotie', 'stuw'
- wind energy
  - Removed 2 terms: 'windturbines een horizontale as', 'power coefficient'

Changes version 1.1 > version 1.2

Included domain of heart failure (test domain for TermEval shared task)

Changes version 1.2 > version 1.3

corrected wrong sources in htfl_nl
changed heart failure abbreviation to "htfl" to be consistent with four-letter domain abbreviations
created GitHub repository for data + submitted it to CLARIN

Changes version 1.3 > version 1.4

applied limited normalisation on both texts and annotations:
- unicodedata.normalize("NFC", text)
- normalising all dashes to "-", all single quotes to "'" and all double quotes to '"'

7. Error Reporting

The ACTER dataset is an ongoing project, so we are always looking to improve the data. Questions or issues regarding this dataset can be reported via the GitHub repository at https://github.com/AylaRT/ACTER and will be addressed as soon as possible.

8. License

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
Reference: Please cite the following Open Access paper if you use this dataset for your research (https://doi.org/10.1007/s10579-019-09453-9)
- Authors: Ayla Rigouts Terryn, Véronique Hoste, Els Lefever
- Title: In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora
- Date of online publication: 26 March 2019
- Date of print publication: 2020 (Volume 54, Issue 2, pages 385-418)
- Journal: Language Resources and Evaluation (LRE)
- Publisher: Springer

The data can be freely used and adapted for non-commercial purposes, provided the above-mentioned paper is cited and any changes made to the data are clearly stated.
