
POPCORN is a collaborative research project that aims to bring information extraction technologies to maturity for civil and defense security use cases. The project is carried out in close collaboration between Emvista, Airbus Defense and Space and the LIG (Grenoble Computer Science Laboratory).

License: MIT License



POPCORN Dataset

This directory contains the POPCORN French dataset, divided into 400 training texts and 400 validation texts. The texts were written and annotated manually. They are short and factual, in the style of an information report. Annotation based on the ontology described below enables the training and evaluation of Information Extraction models (Named Entity Recognition, Coreference Resolution and Relation Extraction).

POPCORN Dataset Format

The annotated texts are stored in the corpus folder of this repository, split between the "train.json" and "test.json" files. Each file contains 400 texts, each stored as a dictionary with three keys:

{
 "text": "the raw text as a string",
 "entities": [{
                "id": "id of the textual entity",
                "mentions": [{
                              "value": "textual value of the mention",
                              "start": "offset of the begin of the mention in the text as integer",
                              "end": "offset of the end of the mention in the text as integer"
                             },
                             ...
                            ],
                "type": "entity type as a string",
                "value": "if the entity is an attribute with a formatted value"
               },
               ...
              ],
 "relations": [ [subject_id, predicate, object_id],
                ...
               ]
}
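As an illustration of how this format can be consumed, here is a minimal Python sketch. It rests on several assumptions that the description above does not confirm: that each file's top level is a list of such dictionaries, that `end` is an exclusive offset (so `text[start:end]` equals the mention value), and that entity ids are hashable. The function names, the sample document and its type and predicate labels are hypothetical and not taken from the dataset.

```python
import json
from pathlib import Path


def load_corpus(path):
    """Load one corpus file; assumes the top level is a list of documents."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def check_mentions(doc):
    """Return the mentions whose offsets do not reproduce their value.

    Assumes `end` is exclusive, i.e. text[start:end] == value.
    """
    bad = []
    for entity in doc["entities"]:
        for mention in entity["mentions"]:
            span = doc["text"][mention["start"]:mention["end"]]
            if span != mention["value"]:
                bad.append((entity["id"], mention["value"], span))
    return bad


def resolve_relations(doc):
    """Replace the ids in each relation triple with (type, first mention)."""
    by_id = {entity["id"]: entity for entity in doc["entities"]}

    def label(entity_id):
        entity = by_id[entity_id]
        return entity["type"], entity["mentions"][0]["value"]

    return [(label(s), p, label(o)) for s, p, o in doc["relations"]]


# Made-up document following the schema above (labels are hypothetical).
sample = {
    "text": "Marie Durand travaille chez Acme.",
    "entities": [
        {"id": "e1", "type": "PERSON",
         "mentions": [{"value": "Marie Durand", "start": 0, "end": 12}]},
        {"id": "e2", "type": "ORGANIZATION",
         "mentions": [{"value": "Acme", "start": 28, "end": 32}]},
    ],
    "relations": [["e1", "WORKS_FOR", "e2"]],
}
```

With a real file, something like `docs = load_corpus("corpus/train.json")` followed by `check_mentions(docs[0])` would be the natural entry point. If the offset check reports many mismatches, the exclusive-`end` assumption is probably wrong for this corpus.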

POPCORN Ontology

This section lists the different types of entities, attributes and relations used to annotate the dataset. Although the classes are displayed in such a way as to indicate a given taxonomy, only the fine-grained classes (the most indented in the table) are annotated in the dataset. The parent classes are therefore given for information only and can be re-organised according to the use-case.

[Figures: ontology tables listing the entity, attribute and relation types]

In the provided annotations, gender is annotated with one of two relations (Male or Female), using the same entity as both subject and object.
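Because gender is encoded as a relation from an entity to itself, it can be convenient to pull these triples out before training a relation extraction model. The sketch below assumes the predicate strings are literally "Male" and "Female", which should be checked against the corpus. `split_gender` is a hypothetical helper name.

```python
GENDER_PREDICATES = {"Male", "Female"}  # assumed label strings; verify in the corpus


def split_gender(relations, gender_predicates=GENDER_PREDICATES):
    """Separate self-referential gender relations from ordinary triples.

    Returns (genders, others): `genders` maps an entity id to its gender
    predicate, `others` keeps the remaining [subject, predicate, object] triples.
    """
    genders, others = {}, []
    for subject, predicate, obj in relations:
        if predicate in gender_predicates and subject == obj:
            genders[subject] = predicate
        else:
            others.append([subject, predicate, obj])
    return genders, others
```

Treating gender as an entity attribute rather than a relation is one possible design choice. Keeping the triples as they are is equally valid if the model handles self-relations.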

POPCORN Type Distribution

[Figures: distribution of entity and relation types]

As shown in the previous figures, this dataset is imbalanced for both entity and relation types. Users may choose to discard low-support classes.
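One way to discard low-support classes is to count each type over the training documents and filter below a threshold. The following is a sketch only: the threshold value is arbitrary, support is counted per entity rather than per mention, and `type_counts` and `drop_rare` are hypothetical names. It assumes documents in the dictionary format described earlier.

```python
from collections import Counter


def type_counts(docs):
    """Count entities per type and relations per predicate over all documents."""
    entity_counts = Counter(e["type"] for d in docs for e in d["entities"])
    relation_counts = Counter(r[1] for d in docs for r in d["relations"])
    return entity_counts, relation_counts


def drop_rare(docs, min_support):
    """Return copies of docs without classes seen fewer than min_support times.

    Relations are also dropped when their subject or object entity is removed.
    """
    entity_counts, relation_counts = type_counts(docs)
    filtered = []
    for d in docs:
        entities = [e for e in d["entities"]
                    if entity_counts[e["type"]] >= min_support]
        kept_ids = {e["id"] for e in entities}
        relations = [r for r in d["relations"]
                     if relation_counts[r[1]] >= min_support
                     and r[0] in kept_ids and r[2] in kept_ids]
        filtered.append({**d, "entities": entities, "relations": relations})
    return filtered
```

The counts should come from the training split only, and the same set of kept classes should then be applied to the evaluation split so that both splits share one label set.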

Benchmark

| Model | Event Extraction (Macro F1) | Entity Extraction (Macro F1) | Attribute Extraction (Macro F1) | Relation Extraction (Macro F1) | Coreference Resolution (F1: Avg MUC, B3, CEAF) |
|---|---|---|---|---|---|
| Unified Model | 45.19 ± 2.07 | 68.38 ± 0.34 | 60.39 ± 2.76 | 46.49 ± 0.55 | TBD |
| Boundary Smoothing Model | 44.20 ± 1.01 | 63.56 ± 0.82 | 60.67 ± 0.95 | N/A | N/A |

The table above lists the best results of two architectures on the POPCORN tasks.
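For readers unfamiliar with the metric, macro F1 averages the per-class F1 scores with equal weight, so rare classes count as much as frequent ones. The sketch below is a generic exact-match version and not necessarily the evaluation protocol behind the reported numbers, which this README does not describe. Items are assumed to be hashable tuples whose last element is the class label, for example `(start, end, type)`.

```python
from collections import defaultdict


def macro_f1(gold, predicted):
    """Macro-averaged F1 over class labels, using exact matching of items.

    `gold` and `predicted` are iterables of hashable tuples whose last
    element is the class label.
    """
    gold, predicted = set(gold), set(predicted)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for item in predicted:
        if item in gold:
            tp[item[-1]] += 1
        else:
            fp[item[-1]] += 1
    for item in gold - predicted:
        fn[item[-1]] += 1

    labels = set(tp) | set(fp) | set(fn)
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); the denominator is non-zero for any
        # label that appears in gold or predicted.
        scores.append(2 * tp[label] / (2 * tp[label] + fp[label] + fn[label]))
    return sum(scores) / len(scores) if scores else 0.0
```

Note that in this sketch a label that appears only in the predictions is still counted as a class with an F1 of zero. Restricting the average to the gold label set is another common convention.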

Citation

Bastien Giordano, Maxime Prieur, Nakanyseth Vuth, Sylvain Verdy, Kévin Cousot, Gilles Sérasset, Guillaume Gadek, Didier Schwab, Cédric Lopez (2024). POPCORN: Fictional and Synthetic Intelligence Reports for Named Entity Recognition and Relation Extraction Tasks. In Proceedings of the 28th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES'24), to appear, September 2024, Sevilla, Spain.

Contact

If you have any questions, please contact [email protected]

TODO

  • Upload the Unified Model implementation
  • Complete the benchmark section with coreference resolution results
