HeBERT: Pre-trained BERT for Polarity Analysis and Emotion Recognition

HeBERT is a Hebrew pre-trained language model based on Google's BERT architecture, using the BERT-Base configuration (12 layers, hidden size 768, 12 attention heads, ~110M parameters).

HeBERT was trained on three datasets:

  1. A Hebrew version of OSCAR: ~9.8 GB of data, including 1 billion words and over 20.8 million sentences.
  2. A Hebrew dump of Wikipedia: ~650 MB of data, including over 63 million words and 3.8 million sentences.
  3. Emotion User Generated Content (UGC) data that was collected for the purpose of this study (described below).

We evaluated the model on two downstream tasks: emotion recognition and sentiment analysis.

Emotion UGC Data Description

Our UGC data includes comments posted on news articles collected from three major Israeli news sites between January and August 2020. The total size of the data is ~150 MB, including over 7 million words and 350K sentences. ~4,000 sentences were annotated by crowd members (3-10 annotators per sentence) for overall sentiment (polarity) and eight emotions: anger, disgust, expectation, fear, joy, sadness, surprise, and trust.

For our robustness analyses, we also collected and annotated two additional datasets. The first contains a random set of comments drawn from our in-domain dataset (that is, comments posted on Covid-related news articles). The second is a random set of comments drawn from an out-of-domain dataset, containing comments posted in response to non-Covid-related articles from the same news sites. An additional explanation can be found in Section 5.1 of our article. The percentage of sentences in which each emotion appears is shown in the table below.

                                 anger  disgust  expectation  fear  happy  sadness  surprise  trust  sentiment
Main Dataset                      0.78     0.83         0.58  0.45   0.12     0.59      0.17   0.11       0.25
Random Comments from the Corpus   0.79     0.87         0.46  0.17   0.03     0.30      0.00   0.03       0.02
Out of Domain                     0.76     0.89         0.62  0.10   0.08     0.36      0.02   0.13       0.12

All the datasets can be found in "data.zip" in this repository (where each row stands for a different annotator of a sentence). The agreed score that we used to train and test our models can be found in the 'agreed score' column (where we found sufficient agreement). See our article for more details on the annotation process.
If you use our datasets, please cite us (citation below).
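
A minimal loading sketch with pandas (the file names inside the archive are not documented here, so the code inspects them first; only the 'agreed score' column name is taken from this README):

import zipfile
import pandas as pd

with zipfile.ZipFile("data.zip") as zf:
    print(zf.namelist())  # inspect the actual file names first
    # assume the first entry is a table of per-annotator labels (one row per annotator per sentence)
    with zf.open(zf.namelist()[0]) as f:
        df = pd.read_csv(f)

# keep only sentences where annotators reached sufficient agreement
labeled = df.dropna(subset=["agreed score"])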

Performance

Emotion Recognition

emotion      f1-score  precision  recall
anger            0.96       0.99    0.93
disgust          0.97       0.98    0.96
expectation      0.82       0.80    0.87
fear             0.79       0.88    0.72
happy            0.90       0.97    0.84
sadness          0.90       0.86    0.94
sentiment        0.88       0.90    0.87
surprise         0.40       0.44    0.37
trust            0.83       0.86    0.80

The metrics above are for the positive class (i.e., the emotion is reflected in the text) on the main dataset.
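
As a quick sanity check, each F1 score above is the harmonic mean of its precision and recall:

# F1 = 2 * P * R / (P + R); checking the 'anger' row from the table above
precision, recall = 0.99, 0.93
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.96, matching the reported value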

Sentiment (Polarity) Analysis

              precision  recall  f1-score
natural            0.83    0.56      0.67
positive           0.96    0.92      0.94
negative           0.97    0.99      0.98
accuracy                             0.97
macro avg          0.92    0.82      0.86
weighted avg       0.96    0.97      0.96

(The label 'natural' corresponds to the neutral class.)

How to use

For the Emotion Recognition Model

An online demo is available on Hugging Face Spaces and as a Colab notebook.

# !pip install pyplutchik==0.0.7   # used for the emotion-wheel plot (plot=True)
# !pip install transformers==4.14.1

!git clone https://github.com/avichaychriqui/HeBERT.git
from HeBERT.src.HebEMO import *
HebEMO_model = HebEMO()

HebEMO_model.hebemo(input_path='examples/text_example.txt')
# returns an analyzed pandas.DataFrame

hebEMO_df = HebEMO_model.hebemo(text='החיים יפים ומאושרים', plot=True)  # "life is beautiful and happy"

For the masked-LM model (can be fine-tuned to any downstream task)

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
model = AutoModel.from_pretrained("avichr/heBERT")

from transformers import pipeline
fill_mask = pipeline(
    "fill-mask",
    model="avichr/heBERT",
    tokenizer="avichr/heBERT"
)
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")

For the sentiment classification model (polarity ONLY):

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT_sentiment_analysis")  # same as the 'avichr/heBERT' tokenizer
model = AutoModelForSequenceClassification.from_pretrained("avichr/heBERT_sentiment_analysis")  # loads the classification head (plain AutoModel would drop it)

sentiment_analysis = pipeline(
    "sentiment-analysis",
    model="avichr/heBERT_sentiment_analysis",
    tokenizer="avichr/heBERT_sentiment_analysis",
    return_all_scores=True  # return scores for all three labels, not just the top one
)

sentiment_analysis('אני מתלבט מה לאכול לארוחת צהריים')  # "I'm debating what to eat for lunch"
>>>  [[{'label': 'natural', 'score': 0.9978172183036804},
>>>  {'label': 'positive', 'score': 0.0014792329166084528},
>>>  {'label': 'negative', 'score': 0.0007035882445052266}]]

sentiment_analysis('קפה זה טעים')  # "coffee is tasty"
>>>  [[{'label': 'natural', 'score': 0.00047328314394690096},
>>>  {'label': 'positive', 'score': 0.9994067549705505},
>>>  {'label': 'negative', 'score': 0.00011996887042187154}]]

sentiment_analysis('אני לא אוהב את העולם')  # "I don't like the world"
>>>  [[{'label': 'natural', 'score': 9.214012970915064e-05},
>>>  {'label': 'positive', 'score': 8.876807987689972e-05},
>>>  {'label': 'negative', 'score': 0.9998190999031067}]]
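
Because return_all_scores=True yields a score for every label, the predicted label is simply the highest-scoring entry:

result = sentiment_analysis('קפה זה טעים')  # "coffee is tasty"
best = max(result[0], key=lambda d: d['score'])
print(best['label'])  # positive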

Our model is also available on AWS! For more information, visit AWS's git repository.

Named-entity recognition (NER)

The model can also classify named entities in text, such as person names, organizations, and locations. It was tested on a labeled dataset from Ben Mordecai and M. Elhadad (2005) and evaluated with the F1-score. A Colab notebook is available.

How to use

from transformers import pipeline

NER = pipeline(
    "token-classification",
    model="avichr/heBERT_NER",
    tokenizer="avichr/heBERT_NER",
)
NER('דויד לומד באוניברסיטה העברית שבירושלים')  # "David studies at the Hebrew University in Jerusalem"
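
By default, the token-classification pipeline returns one prediction per sub-word token. Passing aggregation_strategy="simple" (a standard transformers option, not part of the original snippet) merges sub-words into whole entities:

from transformers import pipeline

NER = pipeline(
    "token-classification",
    model="avichr/heBERT_NER",
    tokenizer="avichr/heBERT_NER",
    aggregation_strategy="simple",  # merge sub-word tokens into complete entities
)
for ent in NER('דויד לומד באוניברסיטה העברית שבירושלים'):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))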

Contact us

Avichay Chriqui
Inbal Yahav
The Coller Semitic Languages AI Lab
Thank you, תודה, شكرا

If you use this model, please cite us as:

Chriqui, A., & Yahav, I. (2022). HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition. INFORMS Journal on Data Science, forthcoming.

@article{chriqui2021hebert,
  title={HeBERT \& HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition},
  author={Chriqui, Avihay and Yahav, Inbal},
  journal={INFORMS Journal on Data Science},
  year={2022}
}

Issues

Question about HeBERT

Can we use this model for text summarization?
I want to fine-tune it using AutoModelForSeq2SeqLM.
If so, how would you do it?

Thank you,

Nadi

Data and pretrained model

First, I want to say that your work is great!

I have two questions:

  1. When are you planning to publish the model for Emotion Recognition?
  2. Can you also publish the annotated UGC dataset (sentiment and emotion)?

How to use for emotion detection?

This work is very valuable!

I would like to ask:
How can the model be used for emotion detection?
Is your reported Emotion Recognition performance also from the pretrained HeBERT model, or only from the UGC model? Thank you!
