nicolay-r / arelight

Granular Viewer of Sentiments Between Entities in Massively Large Documents and Collections of Texts, powered by AREkit

Home Page: https://link.springer.com/chapter/10.1007/978-3-031-56069-9_23

License: MIT License

Python 50.89% Shell 0.03% Jupyter Notebook 49.08%
deep-learning machine-learning nlp tensorflow sentiment-analysis relation-extraction brat deeppavlov arekit attitudes

arelight's Introduction

ARElight 0.24.0

👉 DEMO 👈

ARElight is an application for a granular view of sentiments between named entities mentioned in texts.

Installation

pip install git+https://github.com/nicolay-r/arelight@v0.24.0

Usage: Inference

Infer sentiment attitudes from a text file in English:

python3 -m arelight.run.infer  \
    --sampling-framework "arekit" \
    --ner-framework "deeppavlov" \
    --ner-model-name "ner_ontonotes_bert" \
    --ner-types "ORG|PERSON|LOC|GPE" \
    --terms-per-context 50 \
    --sentence-parser "nltk:english" \
    --tokens-per-context 128 \
    --bert-framework "opennre" \
    --batch-size 10 \
    --pretrained-bert "bert-base-cased" \
    --bert-torch-checkpoint "ra4-rsr1_bert-base-cased_cls.pth.tar" \
    --backend "d3js_graphs" \
    --docs-limit 500 \
    -o "output" \
    --from-files "<PATH-TO-TEXT-FILE>"

NOTE: Applying ARElight to non-English texts

Parameters

The complete documentation is available via the -h flag:

python3 -m arelight.run.infer -h

Parameters:

  • sampling-framework -- only the arekit framework is considered by default.
    • from-files -- list of file paths to the related documents.
      • for .csv files, each row of the specified column is treated as a separate document.
        • csv-sep -- separator between columns.
        • csv-column -- name of the column in the CSV file.
    • collection-name -- name of the result files based on the sampled documents.
    • terms-per-context -- total number of words in a single sample.
    • sentence-parser -- parser used to split a document into sentences; list of the [supported parsers].
    • synonyms-filepath -- text file with synonymous entries, grouped by lines. [example].
    • stemmer -- for word lemmatization (optional); we support [PyMystem].
    • NER parameters:
      • ner-framework -- type of the framework.
      • ner-model-name -- model name within the chosen NER framework.
      • ner-types -- list of types to be considered for annotation, separated by |.
    • docs-limit -- the total limit of documents for sampling.
    • Translation-specific parameters:
      • translate-framework -- text translation backend (optional); we support [googletrans].
      • translate-entity -- (optional) source and target languages supported by the backend, separated by :.
      • translate-text -- (optional) source and target languages supported by the backend, separated by :.
  • bert-framework -- sample classification framework; we support [OpenNRE].
    • text-b-type -- (optional) NLI or None [supported].
    • pretrained-bert -- pretrained state name.
    • batch-size -- number of samples per single inference iteration.
    • tokens-per-context -- size of the input.
    • bert-torch-checkpoint -- fine-tuned state.
    • device-type -- cpu or gpu.
    • labels-fmt -- list of mappings from label to integer value; p:1,n:2,u:0 by default, where:
      • p -- positive label, which is mapped to 1.
      • n -- negative label, which is mapped to 2.
      • u -- undefined label (optional), which is mapped to 0.
  • backend -- type of the backend (d3js_graphs by default).
    • host -- port on which the localhost server is launched.
    • label-names -- default mapping is p:pos,n:neg,u:neu.
  • -o -- output folder for result collections and demo.
Framework parameters mentioned above, as well as their related setups, may be omitted.
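The labels-fmt and label-names values are both comma-separated lists of key:value pairs. The following is a minimal sketch of how such strings could be read; the helper name parse_pairs is hypothetical and is not part of ARElight, which may parse these options differently.

```python
def parse_pairs(text, value_type=str):
    """Parse "k1:v1,k2:v2" into a dict, converting values with value_type."""
    mapping = {}
    for pair in text.split(","):
        key, value = pair.split(":")
        mapping[key.strip()] = value_type(value.strip())
    return mapping


# Default values quoted in the parameter list above.
label_to_int = parse_pairs("p:1,n:2,u:0", value_type=int)
label_to_name = parse_pairs("p:pos,n:neg,u:neu")

# Combine both mappings: integer class id -> human-readable name.
int_to_name = {label_to_int[key]: label_to_name[key] for key in label_to_int}
print(int_to_name)
```

With the defaults, class 1 is shown as pos, class 2 as neg, and class 0 as neu.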

To launch the Graph Builder for D3JS and (optionally) start the DEMO server for collections in the output directory:

cd output && python -m http.server 8000

Finally, you can open the demo page at http://0.0.0.0:8000/
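If you prefer to start the demo server from Python rather than from the shell, the same standard-library module can be used programmatically. This is a sketch only: the function name make_server and its defaults are assumptions for illustration, and ARElight itself is not involved.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory="output", host="0.0.0.0", port=8000):
    """Create an HTTP server that serves static files from `directory`."""
    # The `directory` argument of SimpleHTTPRequestHandler replaces `cd output`.
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer((host, port), handler)
```

Calling make_server().serve_forever() then blocks and serves the output folder until interrupted.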

Layout of the files in output

output/
├── description/
│   └── ...         // graph descriptions in JSON.
├── force/
│   └── ...         // force graphs in JSON.
├── radial/
│   └── ...         // radial graphs in JSON.
└── index.html      // main HTML demo page.
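To check which collections are present in an output folder, the layout above can be walked with a few lines of Python. This is a sketch under assumptions: the helper names list_graphs and load_graph are hypothetical, and it assumes one <collection-name>.json file per collection in each subfolder. The internal JSON schema is not assumed; the file is only loaded.

```python
import json
from pathlib import Path

LAYOUTS = ("description", "force", "radial")


def list_graphs(output_dir):
    """Return {subfolder: sorted collection names} for the JSON files found."""
    found = {}
    for layout in LAYOUTS:
        folder = Path(output_dir) / layout
        if folder.is_dir():
            found[layout] = sorted(path.stem for path in folder.glob("*.json"))
        else:
            found[layout] = []
    return found


def load_graph(output_dir, layout, name):
    """Load one graph file, e.g. output/force/<name>.json, as Python objects."""
    path = Path(output_dir) / layout / (name + ".json")
    with open(path, encoding="utf-8") as stream:
        return json.load(stream)
```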

Usage: Graph Operations

For graph analysis, you can perform several graph operations with this script:

  1. Arguments mode:
python3 -m arelight.run.operations \
    --operation "<OPERATION-NAME>" \
    --graph_a_file output/force/boris.json \
    --graph_b_file output/force/rishi.json \
    --weights y \
    -o output \
    --description "[OPERATION] between Boris Johnson and Rishi Sunak on X/Twitter"
  2. Interactive mode:
python3 -m arelight.run.operations

arelight.run.operations lets you operate on ARElight's outputs as graphs: you can merge graphs and find their similarities or differences.

Parameters

  • --graph_a_file and --graph_b_file are used to specify the paths to the .json files for graphs A and B, which are used in the operations. These files should be located in the <your_output/force> folder.
  • --name -- name of the new graph.
  • --description -- description of the new graph.
  • --host -- the server port on which to host the results after the calculations.
  • -o -- option allows you to specify the path to the folder where you want to store the output. You can either create a new output folder or use an existing one that has been created by ARElight.

Parameter operation

Preparation

Consider that you used ARElight script for X/Twitter to infer relations from messages of UK politicians Boris Johnson and Rishi Sunak:

python3 -m arelight.run.infer ...other arguments... \
    -o output --collection-name "boris" --from-files "twitter_boris.txt"

python3 -m arelight.run.infer ...other arguments... \
    -o output --collection-name "rishi" --from-files "twitter_rishi.txt"

As described in the output layout section, you will have an output directory with two force-layout graph files:

output/
└── force/
    ├── rishi.json
    └── boris.json

List of Operations

You can use the following operations to combine several outputs, or to better understand the similarities and differences between them:

UNION $(G_1 \cup G_2)$ - combine multiple graphs together.

  • The result graph contains all the vertices and edges that are in $G_1$ and $G_2$. The edge weight is given by $W_e = W_{e1} + W_{e2}$, and the vertex weight is its weighted degree centrality: $W_v = \sum_{e \in E_v} W_e(e)$.
    python3 -m arelight.run.operations --operation UNION \
        --graph_a_file output/force/boris.json \
        --graph_b_file output/force/rishi.json \
        --weights y -o output --name boris_UNION_rishi \
        --description "UNION of Boris Johnson and Rishi Sunak Twits"
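The UNION formulas can be sketched in a few lines of Python. This is an illustration only, not ARElight's implementation: it assumes each graph is a plain dict mapping a (source, target) pair to its weight, ignores sentiment labels, and uses toy data; the function names are hypothetical.

```python
def graph_union(edges_a, edges_b):
    """W_e = W_e1 + W_e2; an edge missing from one graph contributes 0."""
    merged = dict(edges_a)
    for edge, weight in edges_b.items():
        merged[edge] = merged.get(edge, 0) + weight
    return merged


def vertex_weights(edges):
    """W_v = sum of the weights of all edges incident to v."""
    weights = {}
    for (source, target), weight in edges.items():
        weights[source] = weights.get(source, 0) + weight
        weights[target] = weights.get(target, 0) + weight
    return weights


# Toy graphs, not real ARElight output.
graph_1 = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}
graph_2 = {("A", "B"): 2, ("B", "C"): 5, ("C", "D"): 4}

union = graph_union(graph_1, graph_2)
print(union)
print(vertex_weights(union))
```

Here the shared edge ("A", "B") ends up with weight 3 + 2 = 5, while ("C", "D"), present only in the second graph, keeps its weight 4.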
    
INTERSECTION $(G_1 \cap G_2)$ - what is similar between 2 graphs?

  • The result graph contains only the vertices and edges common to $G_1$ and $G_2$. The edge weight is given by $W_e = \min(W_{e1},W_{e2})$, and the vertex weight is its weighted degree centrality: $W_v = \sum_{e \in E_v} W_e(e)$.
    python3 -m arelight.run.operations --operation INTERSECTION \
        --graph_a_file output/force/boris.json \
        --graph_b_file output/force/rishi.json \
        --weights y -o output --name boris_INTERSECTION_rishi \
        --description "INTERSECTION between Twits of Boris Johnson and Rishi Sunak"
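A minimal sketch of the INTERSECTION rule, under the same kind of assumptions: graphs are plain dicts from (source, target) pairs to weights, sentiment labels are ignored, the data is a toy example, and the function names are hypothetical rather than ARElight's own.

```python
def graph_intersection(edges_a, edges_b):
    """Keep only edges present in both graphs, with W_e = min(W_e1, W_e2)."""
    return {
        edge: min(weight, edges_b[edge])
        for edge, weight in edges_a.items()
        if edge in edges_b
    }


def vertex_weights(edges):
    """W_v = sum of the weights of all edges incident to v."""
    weights = {}
    for (source, target), weight in edges.items():
        weights[source] = weights.get(source, 0) + weight
        weights[target] = weights.get(target, 0) + weight
    return weights


# Toy graphs, not real ARElight output.
graph_1 = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}
graph_2 = {("A", "B"): 2, ("B", "C"): 5, ("C", "D"): 4}

common = graph_intersection(graph_1, graph_2)
print(common)
print(vertex_weights(common))
```

Only ("A", "B") and ("B", "C") occur in both toy graphs, so the other edges are dropped.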
    
DIFFERENCE $(G_1 - G_2)$ - what is unique in one graph, that another graph doesn't have?

  • NOTE: this operation is not commutative: $(G_1 - G_2) \neq (G_2 - G_1)$.
  • The result graph contains all the vertices from $G_1$ but only includes edges from $E_1$ that either don't appear in $E_2$ or have larger weights in $G_1$ compared to $G_2$. The edge weight is given by $W_e = W_{e1} - W_{e2}$ if $e \in E_1$, $e \in E_1 \cap E_2$ and $W_{e1}(e) > W_{e2}(e)$.
    python3 -m arelight.run.operations --operation DIFFERENCE \
        --graph_a_file output/force/boris.json \
        --graph_b_file output/force/rishi.json \
        --weights y -o output --name boris_DIFFERENCE_rishi \
        --description "Difference between Twits of Boris Johnson and Rishi Sunak"
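The DIFFERENCE rule, including its lack of commutativity, can be sketched as follows. Graphs are assumed to be plain dicts from (source, target) pairs to weights, and the function name is hypothetical. One further assumption is labeled in the code: an edge found only in the first graph keeps its original weight, which the description above does not state explicitly.

```python
def graph_difference(edges_a, edges_b):
    """Edges of A that are absent from B or heavier in A than in B."""
    result = {}
    for edge, weight in edges_a.items():
        if edge not in edges_b:
            # Assumption: an edge unique to A keeps its original weight.
            result[edge] = weight
        elif weight > edges_b[edge]:
            # Shared edge that is heavier in A: W_e = W_e1 - W_e2.
            result[edge] = weight - edges_b[edge]
    return result


# Toy graphs, not real ARElight output.
graph_1 = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}
graph_2 = {("A", "B"): 2, ("B", "C"): 5, ("C", "D"): 4}

a_minus_b = graph_difference(graph_1, graph_2)
b_minus_a = graph_difference(graph_2, graph_1)
print(a_minus_b)
print(b_minus_a)
```

Swapping the arguments gives a different edge set, which is why the order of --graph_a_file and --graph_b_file matters for this operation.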
    
Parameter weights

You have the option to specify whether to include edge weights in calculations or not. These weights represent the frequencies of discovered edges, indicating how often a relation between two instances was found in the text analyzed by ARElight.

  • --weights
    • y: the result will be based on the union, intersection, or difference of these frequencies.
    • n: all weights of input graphs will be set to 1. In this case, the result will reflect the union, intersection, or difference of the graph topologies, regardless of the frequencies. This can be useful when the existence of relations is more important to you, and the number of times they appear in the text is not a significant factor.

    Note that enabling or disabling the weights option may yield different topologies.
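To see why the two settings can produce different topologies, consider a difference computed once on frequencies and once on weights set to 1. This sketch assumes graphs are plain dicts from (source, target) pairs to weights; the function names are hypothetical, and the rule that an edge unique to the first graph keeps its weight is an assumption.

```python
def set_unit_weights(edges):
    """Mimic `--weights n`: every edge weight becomes 1."""
    return {edge: 1 for edge in edges}


def graph_difference(edges_a, edges_b):
    """Edges of A that are absent from B or heavier in A than in B."""
    result = {}
    for edge, weight in edges_a.items():
        if edge not in edges_b:
            result[edge] = weight  # assumption: unique edges keep their weight
        elif weight > edges_b[edge]:
            result[edge] = weight - edges_b[edge]
    return result


# Toy graphs, not real ARElight output.
graph_1 = {("A", "B"): 3, ("A", "C"): 1}
graph_2 = {("A", "B"): 2}

weighted = graph_difference(graph_1, graph_2)
unweighted = graph_difference(set_unit_weights(graph_1), set_unit_weights(graph_2))
print(weighted)
print(unweighted)
```

With frequencies, the edge ("A", "B") survives because it is more frequent in the first graph. With unit weights it disappears, because both graphs merely contain it.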

How to cite

Our shared goal, and my personal interest, is to help you better explore and analyze attitude and relation extraction tasks with ARElight. Great research is also accompanied by faithful references. If you use or extend our work, please cite it as follows:

@inproceedings{rusnachenko2024arelight,
  title={ARElight: Context Sampling of Large Texts for Deep Learning Relation Extraction},
  author={Rusnachenko, Nicolay and Liang, Huizhi and Kolomeets, Maxim and Shi, Lei},
  booktitle={European Conference on Information Retrieval},
  year={2024},
  organization={Springer}
}

arelight's People

Contributors

guardeec, nicolay-r, trellixvulnteam

arelight's Issues

Keep embedding locally

Since this project was previously a part of AREkit, the refactoring that has been performed is incomplete.

Download script could not be executed.

Docker version

Issues that we have encountered:

  • DeepPavlov resources might not be unzipped (zipp==3.6.0 issue)
  • For pyMystem3: nlpub/pymystem3#21
  • Download AREkit data

Feedback -- Serialization Generalization

Can you explain a little bit what kind of data and which kind of models need to be available in a language to apply your framework?

Present limitations:

  • Focused on the neural networks.
  • Frames annotation is hidden
  • Embedding -- only related to neural networks.

Clarification on language support

When coming to the readme, I'm presented with English text, though all screenshots of the tool showcase Cyrillic characters. A search in the readme for "languages" does not yield any results.

It would be very beneficial if you could clarify in the readme which languages are supported. I'm sure I can dig this up by reading through the research paper, but IMO that's unnecessarily complex for users coming to this repo.

Reference to AREkit constants

  • setup constants
  • provide batch-size as a parameter

data = {"text_a": [], "text_b": [], "row_ids": []}
for row_ind, row in samples:
    # Considering unique rows only.
    if row["id"] in used_row_ids:
        continue
    data["text_a"].append(row['text_a'])
    data["text_b"].append(row['text_b'])
    data["row_ids"].append(row_ind)
    used_row_ids.add(row["id"])
batch_size = 10
for i in range(0, len(data["text_a"]), 10):
    texts_a = data["text_a"][i:i + batch_size]
    texts_b = data["text_b"][i:i + batch_size]
    row_ids = data["row_ids"][i:i + batch_size]
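One possible way to address both points is to pass the batch size as an argument and use it for the stride as well as the slice. This is a sketch, not the fix that was actually merged: the function name iter_batches is hypothetical, and samples is assumed to be an iterable of (row_ind, row) pairs as in the snippet above.

```python
def iter_batches(samples, batch_size=10):
    """Yield (texts_a, texts_b, row_ids) batches built from unique rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")

    data = {"text_a": [], "text_b": [], "row_ids": []}
    used_row_ids = set()
    for row_ind, row in samples:
        # Considering unique rows only.
        if row["id"] in used_row_ids:
            continue
        data["text_a"].append(row["text_a"])
        data["text_b"].append(row["text_b"])
        data["row_ids"].append(row_ind)
        used_row_ids.add(row["id"])

    # The stride and the slice width both come from the same parameter.
    for start in range(0, len(data["text_a"]), batch_size):
        end = start + batch_size
        yield (data["text_a"][start:end],
               data["text_b"][start:end],
               data["row_ids"][start:end])
```

A caller could then iterate with `for texts_a, texts_b, row_ids in iter_batches(samples, batch_size=32)`.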

`infer_bert` -- raises "Attempt to free invalid pointer" on loading and inferring tensorflow model

When running

python infer_bert.py --from-files ../data/texts-inosmi-rus/e1.txt \
    --labels-count 3 \
    --terms-per-context 50 \
    --tokens-per-context 128 \
    --text-b-type nli_m \
    -o output/brat_inference_output

I get

...
INFO:tensorflow:Restoring parameters from /content/ARElight/data/models/ra-20-srubert-large-neut-nli-pretrained-3l-finetuned/ra-20-srubert-large-neut-nli-pretrained-3l
  0%|                                                                                       | 0/1253 [00:00<?, ?opins/s]src/tcmalloc.cc:283] Attempt to free invalid pointer 0x107e00000 

and the process freezes.

Google colab, Python 3.7, tensorflow 1.15.0, numpy 1.21.6, deeppavlov 0.11.0, arekit installed from git. Tried restarting the runtime, doesn't help.

Remove RuSentRel collection trainings

reason: this project is dedicated to the processing of a single file or a list of files. Hence there is a need to exclude collections.

Fix readme as well.

This is not functionality of 0.22.1

SynonymsCollection -- missed element results in inference script exception

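Китай все-таки намерен ввести санкционные меры против РФ и в дальнешем, Югославии.

(In English: "China nevertheless intends to impose sanctions against the Russian Federation and, later on, Yugoslavia.")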

Causes:

 File "/media/nicolay/96ed6537-b931-4f7e-8ac4-8407527ddbf9/proj/REmarker/venv/lib/python3.6/site-packages/arekit/common/news/entities_grouping.py", line 15, in apply_core
    group_index = self.__value_to_group_id_func(entity.Value)
  File "/media/nicolay/96ed6537-b931-4f7e-8ac4-8407527ddbf9/proj/REmarker/venv/lib/python3.6/site-packages/arekit/common/synonyms.py", line 57, in get_synonym_group_index
    return self.__get_group_index(value)
  File "/media/nicolay/96ed6537-b931-4f7e-8ac4-8407527ddbf9/proj/REmarker/venv/lib/python3.6/site-packages/arekit/common/synonyms.py", line 130, in __get_group_index
    return self.__by_sid[sid]
KeyError: 'китай'

Possible Solution: Consider Synonyms Collection Expansion!
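The expansion idea can be illustrated with a small stand-alone class: when a value is missing, it gets a new group of its own instead of raising KeyError. This is a hypothetical sketch and does not reproduce AREkit's SynonymsCollection API; the class name, method names, and lowercase normalization are all assumptions.

```python
class ExpandableSynonyms:
    """Toy synonyms collection that registers unknown values on lookup."""

    def __init__(self, groups=()):
        self._group_by_value = {}
        self._groups_count = 0
        for group in groups:
            self.add_group(group)

    @staticmethod
    def _normalize(value):
        # Assumption: values are matched case-insensitively.
        return value.strip().lower()

    def add_group(self, values):
        """Register a new synonym group and return its index."""
        index = self._groups_count
        self._groups_count += 1
        for value in values:
            self._group_by_value[self._normalize(value)] = index
        return index

    def get_group_index(self, value):
        """Return the group index, expanding the collection if needed."""
        key = self._normalize(value)
        if key not in self._group_by_value:
            return self.add_group([value])
        return self._group_by_value[key]


synonyms = ExpandableSynonyms([["РФ", "Россия"]])
print(synonyms.get_group_index("Китай"))
```

With this behaviour, the entity from the failing sentence receives a fresh group index on first lookup, and later lookups return the same index.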
