nicolay-r / arekit

Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML

Home Page: https://nicolay-r.github.io/arekit-page/

License: MIT License

Python 99.92% Shell 0.08%
sentiment-analysis relation-extraction neural-networks datasets frames bert language-models nlp pandas pandas-dataframe

arekit's Introduction

AREkit 0.25.0

AREkit (Attitude and Relation Extraction Toolkit) is a Python toolkit devoted to document-level Attitude and Relation Extraction between text objects in mass-media news.

Description

This toolkit aims at memory-efficient data processing in tasks related to Relation Extraction (RE).

Figure: AREkit pipelines design. For more, see the paper "ARElight: Context Sampling of Large Texts for Deep Learning Relation Extraction".

In particular, this framework provides the following features:

  • pipelines and iterators for serializing large-scale collections without out-of-memory issues,
  • 🔗 EL (entity-linking) API support for objects,
  • ➰ avoidance of cyclic connections,
  • 📏 distance consideration between relation participants (in terms or sentences),
  • 📑 relation annotation and filtering rules,
  • *️⃣ entity formatting or masking, and more.
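
As a rough illustration of the iterator-based style listed above, the sketch below streams sentence-level rows from a lazy document source into a CSV target without holding the whole collection in memory. It uses only the Python standard library; `iter_contexts`, `serialize`, and the row fields are hypothetical names chosen for this example and are not part of the AREkit API.

```python
import csv
import io
from typing import Iterable, Iterator, TextIO


def iter_contexts(documents: Iterable[str]) -> Iterator[dict]:
    """Lazily yield one row per sentence; nothing is accumulated in memory."""
    for doc_id, text in enumerate(documents):
        for sent_id, sentence in enumerate(text.split(". ")):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                yield {"doc_id": doc_id, "sent_id": sent_id, "text": sentence}


def serialize(rows: Iterable[dict], out: TextIO) -> int:
    """Stream rows into a CSV target one at a time and return the row count."""
    writer = csv.DictWriter(out, fieldnames=["doc_id", "sent_id", "text"])
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# The document source is a generator, so documents are consumed one by one.
docs = (d for d in ["Alice supports Bob. Bob criticizes Carol.", "Carol met Alice."])
buffer = io.StringIO()
written = serialize(iter_contexts(docs), buffer)
```

Because both stages are generators, peak memory depends on the size of a single document rather than on the size of the collection.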

The core functionality includes:

  • API for document presentation with EL (Entity Linking, i.e. Object Synonymy) support for preparing sentence-level relations (dubbed contexts);
  • API for context extraction;
  • Transfer of relations from the sentence level to the document level, and more.
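
One possible way to picture the transfer of relations from the sentence level to the document level is a vote over the labels of all contexts that mention the same pair. The sketch below assumes a simple majority vote, which is an assumption of this example and not necessarily the rule AREkit applies; `to_document_level` and the tuple layout are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]            # (source, target)
Context = Tuple[str, str, str]    # (source, target, label)


def to_document_level(contexts: Iterable[Context]) -> Dict[Pair, str]:
    """Collapse sentence-level labels into one label per (source, target) pair."""
    votes: Dict[Pair, List[str]] = defaultdict(list)
    for source, target, label in contexts:
        votes[(source, target)].append(label)
    # Assumption: the most frequent sentence-level label wins.
    return {pair: Counter(labels).most_common(1)[0][0]
            for pair, labels in votes.items()}


contexts = [
    ("USA", "Russia", "neg"),
    ("USA", "Russia", "neg"),
    ("USA", "Russia", "pos"),
    ("EU", "USA", "pos"),
]
document_relations = to_document_level(contexts)
```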

Installation

pip install git+https://github.com/nicolay-r/[email protected]

Usage

Please follow the tutorial section on the project Wiki for more details.

How to cite

Great research is also accompanied by a faithful reference. If you use or extend our work, please cite it as follows:

@inproceedings{rusnachenko2024arelight,
  title={ARElight: Context Sampling of Large Texts for Deep Learning Relation Extraction},
  author={Rusnachenko, Nicolay and Liang, Huizhi and Kolomeets, Maxim and Shi, Lei},
  booktitle={European Conference on Information Retrieval},
  year={2024},
  organization={Springer}
}

arekit's People

Contributors

nicolay-r, trellixvulnteam

arekit's Issues

Separate experiments data_io provider into serialization and training stages

The idea comes from the use of SynonymsCollection, which is (or might be) useless while training neural networks on already serialized data, yet is important at the input data serialization stage.
Therefore, the DataIO API in the experiment subfolder needs to be split into:

  1. SerializationData
  2. TrainingData

Also consider renaming DataIO -> BaseData.
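
A minimal sketch of the proposed split, assuming BaseData holds whatever both stages share. The class names come from the issue text; every field (`labels_count`, `synonym_groups`, `batch_size`) is a hypothetical placeholder and does not reflect the actual AREkit code.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class BaseData:
    """Settings shared by both stages (hypothetical field)."""
    labels_count: int


@dataclass(frozen=True)
class SerializationData(BaseData):
    """Serialization stage: still needs synonyms to prepare input samples."""
    synonym_groups: Tuple[FrozenSet[str], ...]


@dataclass(frozen=True)
class TrainingData(BaseData):
    """Training stage: works on already serialized data, so no synonyms."""
    batch_size: int


serialization = SerializationData(
    labels_count=3,
    synonym_groups=(frozenset({"USA", "United States"}),),
)
training = TrainingData(labels_count=3, batch_size=32)
```

With this layout, code that only trains a model cannot accidentally depend on the synonyms collection, because the training-stage object does not expose it.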

The source of synonyms for OpinionCollection (Trusted/Non trusted)

It is necessary to clarify whether the source of synonyms is trusted or not.
Here 'trusted' means that the synonyms collection has been obtained from the same corpus as the opinions, i.e. we guarantee the absence of duplicated synonymous opinions within a document.
Otherwise, we are able to skip (omit) duplicated opinions during OpinionCollection initialization.
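
The trusted and untrusted cases could be sketched as follows. Two opinions count as duplicates here when their participants fall into the same synonym groups. Raising an error for a duplicate in a trusted source is an assumption of this example; the issue only states that duplicates are skipped for untrusted sources. `init_opinions` and `group_of` are hypothetical names.

```python
from typing import Callable, Hashable, Iterable, List, Tuple

Opinion = Tuple[str, str, str]  # (source, target, label)


def init_opinions(opinions: Iterable[Opinion],
                  group_of: Callable[[str], Hashable],
                  trusted: bool) -> List[Opinion]:
    """Keep the first opinion per (source group, target group) pair."""
    seen = set()
    kept: List[Opinion] = []
    for source, target, label in opinions:
        key = (group_of(source), group_of(target))
        if key in seen:
            if trusted:
                # Assumption: a duplicate in a trusted source is a data error.
                raise ValueError(
                    f"duplicated synonymous opinion: {source} -> {target}")
            continue  # untrusted source: omit the duplicate
        seen.add(key)
        kept.append((source, target, label))
    return kept


groups = {"USA": 0, "United States": 0, "Russia": 1}
opinions = [("USA", "Russia", "neg"), ("United States", "Russia", "neg")]
kept = init_opinions(opinions, groups.get, trusted=False)
```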

SynonymsCollection considered in ReadOnly mode only

Because the evaluation process maps model results onto etalon (reference) results, Synonyms are needed during evaluation.
They are also used for etalon collection initialization.

In general, it is important to have a read-only synonyms collection that covers the entries of input examples from a variety of source types, such as train, test, and (dev), simultaneously.
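
One way to obtain a read-only lookup in plain Python is to wrap the index in `types.MappingProxyType`, which rejects item assignment. The sketch below builds a single index from groups gathered across all data sources; `build_read_only_synonyms` and the lowercasing step are assumptions of this example, not the AREkit implementation.

```python
from types import MappingProxyType
from typing import Dict, Iterable, Mapping


def build_read_only_synonyms(groups: Iterable[Iterable[str]]) -> Mapping[str, int]:
    """Map every synonym value to its group index and freeze the result."""
    index: Dict[str, int] = {}
    for group_id, group in enumerate(groups):
        for value in group:
            # Assumption: values are matched case-insensitively.
            index.setdefault(value.lower(), group_id)
    return MappingProxyType(index)


# Groups would be gathered once from train, test, and dev before any lookup.
synonyms = build_read_only_synonyms([
    ["USA", "United States"],
    ["Russia", "Russian Federation"],
])
```

Lookups such as `synonyms["usa"]` work as with a dictionary, while an attempt to add an entry raises `TypeError`.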

OpinionOperation -- remove SynonymsCollection property.

Instead, use a specific experiment method that we refer to during experiment data serialization.
This allows us to demarcate two types of SynonymsCollection:

  1. For results evaluation (OpinionOperations)
  2. For additional search of synonymous entities during the serialization stage (described above).

Entity -- provide GroupIndex property

Add a property to the Entity class that returns either an int or None.
Why this is needed:

  1. It lets us stop using the synonyms collection when synonyms were already applied while building the collection (the entities were already annotated) => GroupIndex will provide that index
  2. It yields correct synonym links, since lookups in the collection currently rely on lemmatization, which distorts the result
  3. It speeds up data loading and processing (by disabling the synonyms collection)
  4. It removes the reading of synonyms from RuAttitudes.
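
A minimal sketch of such a property, assuming the group index is assigned when the entity is created. Only the names Entity and GroupIndex come from the issue; the constructor, the `Value` property, and `are_synonymous` are hypothetical.

```python
from typing import Optional


class Entity:
    """Hypothetical entity that carries its synonym group index, if known."""

    def __init__(self, value: str, group_index: Optional[int] = None):
        self._value = value
        self._group_index = group_index

    @property
    def Value(self) -> str:
        return self._value

    @property
    def GroupIndex(self) -> Optional[int]:
        # None means the entity was not assigned to a synonym group.
        return self._group_index


def are_synonymous(a: Entity, b: Entity) -> bool:
    """Compare annotated group indices directly; no lemmatization involved."""
    if a.GroupIndex is None or b.GroupIndex is None:
        return False
    return a.GroupIndex == b.GroupIndex


usa = Entity("USA", group_index=0)
states = Entity("United States", group_index=0)
unknown = Entity("Moscow")
```

Comparing integers avoids both the lookup in a synonyms collection and the lemmatization step that the issue identifies as a source of distorted results.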
