nicolay-r / arekit

Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML

Home Page: https://nicolay-r.github.io/arekit-page/

License: MIT License

Python 99.92% Shell 0.08%

sentiment-analysis relation-extraction neural-networks datasets frames bert language-models nlp pandas pandas-dataframe

arekit's Introduction

AREkit 0.25.0

AREkit (Attitude and Relation Extraction Toolkit) is a Python toolkit for document-level Attitude and Relation Extraction between text objects in mass-media news.

Description

This toolkit aims at memory-efficient data processing in Relation Extraction (RE) related tasks.

Figure: AREkit pipeline design. For more details, see the paper "ARElight: Context Sampling of Large Texts for Deep Learning Relation Extraction".

In particular, this framework provides the following features:

➿ pipelines and iterators for serializing large-scale collections without out-of-memory issues,
🔗 EL (entity-linking) API support for objects,
➰ avoidance of cyclic connections,
📏 distance consideration between relation participants (in terms or sentences),
📑 relation annotation and filtering rules,
*️⃣ entity formatting or masking, and more.

The core functionality includes:

API for document presentation with EL (Entity Linking, i.e. Object Synonymy) support for sentence level relations preparation (dubbed as contexts);
API for contexts extraction;
Relations transferring from sentence-level onto document-level, and more.

Installation

pip install git+https://github.com/nicolay-r/[email protected]

Usage

Please follow the tutorial section on the project Wiki for more details.

How to cite

Great research is also accompanied by faithful referencing. If you use or extend our work, please cite as follows:

@inproceedings{rusnachenko2024arelight,
  title={ARElight: Context Sampling of Large Texts for Deep Learning Relation Extraction},
  author={Rusnachenko, Nicolay and Liang, Huizhi and Kolomeets, Maxim and Shi, Lei},
  booktitle={European Conference on Information Retrieval},
  year={2024},
  organization={Springer}
}

arekit's Issues

Vocabulary and Embedding serialization performed in a non-CV dependent file

We use vocab.npz and term_embedding.npz for this.
In the case of k-fold CV, these files are rewritten k times.

Output to Opinion Collections -- saved files must be consistent with documents used for comparison

Currently we keep all documents from the related output, while later on we use only those intended for comparison:
DocumentsOperations.iter_doc_ids_to_compare

Neural Networks Optimizer -- learning rate cannot be modified via the set_learning_rate property.

Problem: the optimizer was declared and initialized as a static variable of the DefaultNetworkConfig class:

AREkit/contrib/networks/context/configurations/base/base.py

Line 29 in da89b4c

__optimiser = tf.train.AdadeltaOptimizer(
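
The effect can be reproduced without TensorFlow. The following is only a sketch: `FakeOptimizer`, `StaticConfig` and `LazyConfig` are hypothetical stand-ins, not AREkit classes. It assumes the optimizer captures its learning rate at construction time, so an instance built in the class body never sees later changes.

```python
class FakeOptimizer:
    """Hypothetical stand-in for an optimizer that fixes its rate at construction."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate


class StaticConfig:
    # Built once, when the class body is executed, with the default rate.
    _learning_rate = 0.1
    _optimiser = FakeOptimizer(learning_rate=_learning_rate)

    def set_learning_rate(self, value):
        self._learning_rate = value  # the existing optimizer never sees this

    @property
    def Optimiser(self):
        return self._optimiser


class LazyConfig:
    def __init__(self):
        self._learning_rate = 0.1

    def set_learning_rate(self, value):
        self._learning_rate = value

    @property
    def Optimiser(self):
        # Built on access, so it reflects the current setting.
        return FakeOptimizer(learning_rate=self._learning_rate)


static_cfg = StaticConfig()
static_cfg.set_learning_rate(0.01)

lazy_cfg = LazyConfig()
lazy_cfg.set_learning_rate(0.01)
```

With the static variant the optimizer keeps the default rate; creating the optimizer lazily (or in `__init__`) is one possible way to make the setter take effect.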

Providing all the logging information into log_utils.py

contrib/miner -- is no longer supported

Support a variety of sources within a single experiment (doc ids mapping)

Currently we treat news_id as a parameter that is unique only within a particular collection. Using multiple collections may result in overlapping news_ids. Hence, an id mapping is needed.
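
One possible shape for such a mapping, as a sketch only: `DocIdMapper` and the collection names used as keys are hypothetical and not part of AREkit. Each (collection, news_id) pair receives an experiment-wide unique id, and the mapping can be reversed.

```python
from itertools import count


class DocIdMapper:
    """Hypothetical helper: maps (collection_name, news_id) to a unique global id."""

    def __init__(self):
        self._next_id = count()
        self._global_by_local = {}
        self._local_by_global = {}

    def to_global(self, collection_name, news_id):
        key = (collection_name, news_id)
        if key not in self._global_by_local:
            global_id = next(self._next_id)
            self._global_by_local[key] = global_id
            self._local_by_global[global_id] = key
        return self._global_by_local[key]

    def to_local(self, global_id):
        return self._local_by_global[global_id]


mapper = DocIdMapper()
a = mapper.to_global("rusentrel", 1)
b = mapper.to_global("ruattitudes", 1)  # same news_id, different collection
```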

SynonymsCollection -- considering stemmer as an optional parameter

OpinionContainingTextTermsMapper -- providing synonyms API instead of the related instance

Balancing during serialization -- the related flag is not working for neural networks.

RusentrelWithRuAttitudesExperiment -- neutral opinion annotation differs due to merged synonyms collection usage.

Solution: Use merged collection only during serialization stage, for entity values only (not for opinions)!

SynonymsCollection -- StemmerBasedSynonymsCollection provides incorrect results for RuAttitudes

Incorrect usage: we should avoid StemmerBasedSynonymsCollection for RuAttitudes. Instead, entity indices should be obtained from the related labeling in the RuAttitudes collection.

Neural Network Optimizer -- L2-layer-reg coefficient cannot be modified.

StemmerBasedSynonymsCollection -- results for RuSentRel experiments depend on the stemmer used

OpinionCollection -- opinions duplication error.

Provide --model-tag parameter to distinguish fine-tuned models by name

experiment evaluation -- documents might be skipped

Neural Network Input Samples -- Entities are not highlighted in pos_feature

Neural Network Input Sampler -- PosTagging reduces performance 3x

There is a need to serialize such tags.

Neural Networks -- Provide list of supported models

OpinionProvider -- balancing is useless as we already do balancing later

Separate experiments data_io provider into serialization and training stages

The idea comes from the use of SynonymsCollection, which is (or might be) useless when training neural networks on already serialized data, yet is important at the input data serialization stage.
Therefore, the DataIO API of the experiment subfolder needs to be split into:

SerializationData
TrainingData
Consider renaming DataIO -> BaseData.

Neural Network Input Serialization -- ParsedNewsCollection usage causes high memory consumption

There is no need to store all the parsed news, since sampling is an iterative process.
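
A minimal sketch of the iterative alternative, assuming parsing and sampling can be expressed as plain functions. `iter_parsed_news`, `iter_samples` and the toy `parse`/`sample` functions are hypothetical and only illustrate the generator pattern, not the AREkit API.

```python
def iter_parsed_news(doc_ids, parse_func):
    """Yield parsed documents one at a time instead of storing them all."""
    for doc_id in doc_ids:
        yield doc_id, parse_func(doc_id)


def iter_samples(doc_ids, parse_func, sample_func):
    for doc_id, parsed in iter_parsed_news(doc_ids, parse_func):
        # `parsed` can be garbage-collected once the next document is requested.
        for sample in sample_func(doc_id, parsed):
            yield sample


# Toy stand-ins for a parser and a sampler.
def parse(doc_id):
    return ["term"] * (doc_id + 1)


def sample(doc_id, parsed):
    return [(doc_id, i) for i in range(len(parsed))]


samples = list(iter_samples(range(3), parse, sample))
```

Only one parsed document is alive at any moment, so peak memory depends on the largest document rather than on the collection size.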

Results Evaluation -- Providing 'Classification' and 'Extraction' evaluation modes

There is a need to provide another evaluation mode for 2-scale output classification format.

The source of synonyms for OpinionCollection (Trusted/Non trusted)

It is necessary to clarify whether the source of synonyms is trusted or not.
Here 'trusted' means that the synonyms collection has been obtained from the same corpus as the opinions, i.e. we guarantee the absence of duplicated synonymous opinions within a document.
Otherwise, we are able to skip (omit) duplicated opinions during OpinionCollection initialization.
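
The two modes could be sketched as follows. `build_opinions` is a hypothetical function, opinions are reduced to plain tuples, and raising an error on a duplicate in trusted mode is an assumption made for illustration (the issue only states that duplicates are guaranteed to be absent in that case).

```python
def build_opinions(opinions, group_of, trusted):
    """opinions: iterable of (source, target, label) tuples.
    group_of: maps an entity value to its synonym group id.
    trusted: if True, a duplicate is treated as an error (assumption);
             otherwise the duplicate is skipped.
    """
    seen = set()
    result = []
    for source, target, label in opinions:
        key = (group_of(source), group_of(target))
        if key in seen:
            if trusted:
                raise ValueError(
                    "Duplicated synonymous opinion: %s -> %s" % (source, target))
            continue
        seen.add(key)
        result.append((source, target, label))
    return result


groups = {"USA": 0, "United States": 0, "Russia": 1}
raw = [("USA", "Russia", "neg"), ("United States", "Russia", "neg")]
kept = build_opinions(raw, groups.get, trusted=False)
```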

SynonymsCollection considered in ReadOnly mode only

Because the evaluation process maps model results onto etalon (gold) results, Synonyms must be used during evaluation.
They are also used for etalon collection initialization.

In general, it is important to have a read-only synonyms collection that covers the entries of input examples from a variety of source types, such as train, test, (dev), simultaneously.

SynonymsCollection -- Merging operation

Evaluation differences

0.19.3-lrec-rejected-rc:

AREkit/evaluation/evaluators/two_class.py

Line 28 in 6e97412

    
           self.__has_opinions_with_label(cmp_pair.TestOpinionCollection, self.__neg_label)

Current:

AREkit/common/evaluation/evaluators/two_class.py

Line 26 in da89b4c

    
           self.__has_opinions_with_label(cmp_pair.EtalonOpinionCollection, self.__neg_label)

tf model -- embedding could not be modified for loaded model

We provide embedding at nn compilation stage.

OpinionOperation -- remove SynonymsCollection property.

Use a specific experiment method that we refer to during experiment data serialization.
This allows us to distinguish two types of SynonymsCollection:

For results evaluation (OpinionOperations)
For the additional search of synonymous entities during the serialization stage (described above).

Network Input Sampler -- distance features have float64 type.

LSTM/BILSTM models are stalled at Filling bags collection stage

In the nn benchmark, such models get stuck at the first iteration.

Contexts Extraction -- Providing distance between terms within context

From WIMS-2020 paper.
This parameter may significantly affect the results.
Using \theta reduces the number of contexts to be extracted.
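
As an illustration only, the threshold can be viewed as a filter over candidate pairs. `filter_by_terms_distance` is a hypothetical function and `max_distance` plays the role of the threshold; the pairs here are just term positions within one context.

```python
def filter_by_terms_distance(pairs, max_distance):
    """pairs: iterable of (source_term_index, target_term_index) in one context.
    Keeps only the pairs whose participants are at most `max_distance` terms apart.
    """
    return [(s, t) for s, t in pairs if abs(s - t) <= max_distance]


candidates = [(0, 3), (2, 40), (10, 12), (5, 60)]
kept = filter_by_terms_distance(candidates, max_distance=10)
```

A smaller threshold keeps fewer pairs, which is why the number of extracted contexts drops.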

Provide unit tests for experiments evaluation

Provide evaluation of etalon results against etalon collection, in order to determine the maximal F1, P, R for a particular source (RuSentRel)

Neural Networks Input Samples -- Frames and Frame Roles are missing from samples

Such parameters were not passed:

AREkit/contrib/networks/core/data_handling/data.py

Line 199 in 51a4469

return InputSample.from_tsv_row(

Evaluation fails when the related samples collection is balanced.

Neural Network Input Samples -- Entities are not highlighted in term_type_feature

As all words have unicode type, we can only distinguish paddings (using the -1 placeholder) from the actual sample content (words).
Entities remain equal to 0 at this point.
Here is the method:

AREkit/contrib/networks/sample.py

Line 326 in c1b3984

def __create_term_types(terms):
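
A sketch of how entities might be highlighted if their positions were passed separately. The -1 padding placeholder and the 0 value for words come from the description above; the `ENTITY = 1` code, the `entity_inds` argument and this version of `create_term_types` are assumptions, not the actual AREkit implementation.

```python
PAD, WORD = -1, 0
ENTITY = 1  # assumption: any code distinct from PAD and WORD would do


def create_term_types(terms, entity_inds, size):
    """terms: list of strings; entity_inds: positions of entities in `terms`;
    size: fixed length of the resulting feature vector.
    """
    entity_inds = set(entity_inds)
    used = min(len(terms), size)
    types = [ENTITY if i in entity_inds else WORD for i in range(used)]
    # Pad the tail so every sample has the same length.
    return types + [PAD] * (size - used)


features = create_term_types(["E", "criticized", "E"], entity_inds=[0, 2], size=5)
```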

Input samples -- provide statistics [pos, neg, neutral] amount.

Opinion Collection -- using the init_as_custom call may affect results.

As we now declare all synonyms collections within the related experiment, there is no need to provide a custom SynonymsCollection initializer that supports the case of duplicated entries.

Neural Network Input Sampler -- keeping word-embedding vectors during samples generation process causes MLE

Here MLE stands for Memory Limit Exceeded.
The problem arises from the fact that NetworkSingleTextProvider caches embeddings for every term.

Neutral Annotation -- differs for the RuSentRel collection compared with the related output in 0.19.4-0.20.3

In https://github.com/nicolay-r/are-networks-experiments, 0.20.5-rc:

In prior versions:
0.19.4-lrec-rejected-rc

0.20.3-wims-rc:

Experiment Documents Operations -- missing check for data_type support in doc_ids iteration

Update with a custom Callback and ModelIO initializations in 0.20.5

In some cases, we may want to set ModelIO and Callback to None, especially when there is no need to perform evaluation or to use neural networks (ModelIO = None).
How to solve: merge the changes from the related update in version 0.20.4 (57fa08d) into version 0.20.5.

Neural Network Input Samples -- Provide entity inds in input serialization

Input Serialization -- LinkedTextOpinionCollection class is useless

Update model and related result dirs separation from 0.20.4

Port the latter (c7da1a0) to 0.20.5, in order to support an optional separation of the model source dir (which keeps input samples/opinions) from the related results. By default, the result dir is assumed to be the same as the model dir.

RuAttitudesExperiment -- neural network training requires loaded collection

Entity -- provide GroupIndex property

Add a property to the Entity class that returns an int or None.
Why this is needed:

It lets us avoid using the synonyms collection when synonyms were already applied while building the collection (the entities were already annotated) => GroupIndex will provide such an index
It yields the correct synonym linkage, since currently lemmatization is applied when querying the collection, which distorts the result
It speeds up data loading and processing (by disabling the synonyms collection)
It removes the reading of synonyms from RuAttitudes.
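
A minimal sketch of such a property, assuming the group index is supplied when the entity is created. This `Entity` class and the `same_group` helper are simplified, hypothetical illustrations rather than the AREkit implementation.

```python
class Entity:
    def __init__(self, value, group_index=None):
        self._value = value
        self._group_index = group_index

    @property
    def Value(self):
        return self._value

    @property
    def GroupIndex(self):
        """int when the collection already annotates synonym groups, else None."""
        return self._group_index


def same_group(a, b, fallback=None):
    """Compare by GroupIndex when both are annotated; otherwise use `fallback`
    (e.g. a synonyms lookup) or plain value equality.
    """
    if a.GroupIndex is not None and b.GroupIndex is not None:
        return a.GroupIndex == b.GroupIndex
    if fallback is not None:
        return fallback(a.Value, b.Value)
    return a.Value == b.Value


annotated = same_group(Entity("USA", 0), Entity("United States", 0))
unannotated = same_group(Entity("USA"), Entity("United States"))
```

When both entities carry a group index, no synonyms collection (and no lemmatization) is involved in the comparison.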