sentimentanalysis's Introduction

Sentiment Analysis

SentimentAnalysis performs a sentiment analysis of textual contents in R. This implementation utilizes various existing dictionaries, such as QDAP, Harvard IV or Loughran-McDonald. Furthermore, it can also create customized dictionaries. The latter uses LASSO regularization as a statistical approach to select relevant terms based on an exogenous response variable.

Overview

The most important functions in SentimentAnalysis are:

  • Compute sentiment scores from contents stored in different formats with analyzeSentiment().

  • If desired, convert the continuous scores to either binary sentiment classes (negative or positive) or tertiary directions (negative, neutral or positive). This conversion can be done with convertToBinaryResponse() or convertToDirection(), respectively.

  • Compare the calculated sentiment scores with a baseline (i.e. a gold standard). Here, compareToResponse() performs a statistical evaluation, while plotSentimentResponse() enables a visual comparison.

  • Generate customized dictionaries with the help of generateDictionary() as part of an advanced analysis. However, this requires a response variable (i.e. the baseline).

To see examples of these functions in use, check out the help pages, the demos and the vignette.

Usage

This section shows the basic functionality of how to perform a sentiment analysis. First, install the package from CRAN. Then load the corresponding package SentimentAnalysis.

# install.packages("SentimentAnalysis")

library(SentimentAnalysis)

Quick demonstration

This simple example shows how to perform a sentiment analysis of a single string. The result is a two-level factor with levels “positive” and “negative.”

# Analyze a single string to obtain a binary response (positive / negative)
sentiment <- analyzeSentiment("Yeah, this was a great soccer game of the German team!")
convertToBinaryResponse(sentiment)$SentimentGI
#> [1] positive
#> Levels: negative positive

Small example

The following demonstrates some of the functionality provided by SentimentAnalysis. It also shows its visualization and evaluation capabilities.

# Create a vector of strings
documents <- c("Wow, I really like the new light sabers!",
               "That book was excellent.",
               "R is a fantastic language.",
               "The service in this restaurant was miserable.",
               "This is neither positive or negative.",
               "The waiter forget about my a dessert -- what a poor service!")

# Analyze sentiment
sentiment <- analyzeSentiment(documents)

# Extract dictionary-based sentiment according to the QDAP dictionary
sentiment$SentimentQDAP
#> [1]  0.3333333  0.5000000  0.5000000 -0.3333333  0.0000000 -0.4000000

# View sentiment direction (i.e. positive, neutral and negative)
convertToDirection(sentiment$SentimentQDAP)
#> [1] positive positive positive negative neutral  negative
#> Levels: negative neutral positive

response <- c(+1, +1, +1, -1, 0, -1)

compareToResponse(sentiment, response)
#> Warning in cor(sentiment, response): the standard deviation is zero
#> Warning in cor(x, y): the standard deviation is zero

#> Warning in cor(x, y): the standard deviation is zero
#> Warning in cor(sentiment, response): the standard deviation is zero
#>                              WordCount  SentimentGI  NegativityGI PositivityGI
#> cor                        -0.18569534  0.990011498 -9.974890e-01  0.942954167
#> cor.t.statistic            -0.37796447 14.044046450 -2.816913e+01  5.664705543
#> cor.p.value                 0.72465864  0.000149157  9.449687e-06  0.004788521
#> lm.t.value                 -0.37796447 14.044046450 -2.816913e+01  5.664705543
#> r.squared                   0.03448276  0.980122766  9.949843e-01  0.889162562
#> RMSE                        3.82970843  0.450102869  1.186654e+00  0.713624032
#> MAE                         3.33333333  0.400000000  1.100000e+00  0.666666667
#> Accuracy                    0.66666667  1.000000000  6.666667e-01  0.666666667
#> Precision                          NaN  1.000000000           NaN          NaN
#> Sensitivity                 0.00000000  1.000000000  0.000000e+00  0.000000000
#> Specificity                 1.00000000  1.000000000  1.000000e+00  1.000000000
#> F1                                 NaN  1.000000000           NaN          NaN
#> BalancedAccuracy            0.50000000  1.000000000  5.000000e-01  0.500000000
#> avg.sentiment.pos.response  3.25000000  0.333333333  8.333333e-02  0.416666667
#> avg.sentiment.neg.response  4.00000000 -0.633333333  6.333333e-01  0.000000000
#>                            SentimentHE NegativityHE PositivityHE SentimentLM
#> cor                          0.4152274 -0.083045480    0.3315938   0.7370455
#> cor.t.statistic              0.9128709 -0.166666667    0.7029595   2.1811142
#> cor.p.value                  0.4129544  0.875718144    0.5208394   0.0946266
#> lm.t.value                   0.9128709 -0.166666667    0.7029595   2.1811142
#> r.squared                    0.1724138  0.006896552    0.1099545   0.5432361
#> RMSE                         0.8416254  0.922958207    0.8525561   0.7234178
#> MAE                          0.7500000  0.888888889    0.8055556   0.6333333
#> Accuracy                     0.6666667  0.666666667    0.6666667   0.8333333
#> Precision                          NaN          NaN          NaN   1.0000000
#> Sensitivity                  0.0000000  0.000000000    0.0000000   0.5000000
#> Specificity                  1.0000000  1.000000000    1.0000000   1.0000000
#> F1                                 NaN          NaN          NaN   0.6666667
#> BalancedAccuracy             0.5000000  0.500000000    0.5000000   0.7500000
#> avg.sentiment.pos.response   0.1250000  0.083333333    0.2083333   0.2500000
#> avg.sentiment.neg.response   0.0000000  0.000000000    0.0000000  -0.1000000
#>                            NegativityLM PositivityLM RatioUncertaintyLM
#> cor                         -0.40804713    0.6305283                 NA
#> cor.t.statistic             -0.89389841    1.6247248                 NA
#> cor.p.value                  0.42189973    0.1795458                 NA
#> lm.t.value                  -0.89389841    1.6247248                 NA
#> r.squared                    0.16650246    0.3975659                 NA
#> RMSE                         0.96186547    0.7757911          0.9128709
#> MAE                          0.92222222    0.7222222          0.8333333
#> Accuracy                     0.66666667    0.6666667          0.6666667
#> Precision                           NaN          NaN                NaN
#> Sensitivity                  0.00000000    0.0000000          0.0000000
#> Specificity                  1.00000000    1.0000000          1.0000000
#> F1                                  NaN          NaN                NaN
#> BalancedAccuracy             0.50000000    0.5000000          0.5000000
#> avg.sentiment.pos.response   0.08333333    0.3333333          0.0000000
#> avg.sentiment.neg.response   0.10000000    0.0000000          0.0000000
#>                            SentimentQDAP NegativityQDAP PositivityQDAP
#> cor                         0.9865356369   -0.944339551    0.942954167
#> cor.t.statistic            12.0642877257   -5.741148345    5.664705543
#> cor.p.value                 0.0002707131    0.004560908    0.004788521
#> lm.t.value                 12.0642877257   -5.741148345    5.664705543
#> r.squared                   0.9732525629    0.891777188    0.889162562
#> RMSE                        0.5398902495    1.068401367    0.713624032
#> MAE                         0.4888888889    1.011111111    0.666666667
#> Accuracy                    1.0000000000    0.666666667    0.666666667
#> Precision                   1.0000000000            NaN            NaN
#> Sensitivity                 1.0000000000    0.000000000    0.000000000
#> Specificity                 1.0000000000    1.000000000    1.000000000
#> F1                          1.0000000000            NaN            NaN
#> BalancedAccuracy            1.0000000000    0.500000000    0.500000000
#> avg.sentiment.pos.response  0.3333333333    0.083333333    0.416666667
#> avg.sentiment.neg.response -0.3666666667    0.366666667    0.000000000

# Optional visualization: plotSentimentResponse(sentiment$SentimentQDAP, response)

Dictionary generation

Research in finance and social sciences nowadays utilizes content analysis to understand human decisions in the face of textual materials. While content analysis has received great traction lately, the available tools are not yet living up to the needs of researchers. This package implements a novel approach named “dictionary generation” to study tone, sentiment and reception of textual materials.

The approach utilizes LASSO regularization to extract words from documents that statistically feature a positive or negative polarity. This has implications for practitioners, finance research and social sciences: researchers can use R to extract text components that are relevant for readers and test their hypotheses based on these.

  • Proellochs, Feuerriegel and Neumann (2018): Statistical inferences for polarity identification in natural language, PLOS ONE 13(12):e0209323. DOI: 10.1371/journal.pone.0209323
  • Proellochs, Feuerriegel and Neumann (2015): Generating Domain-Specific Dictionaries Using Bayesian Learning, Proceedings of the 23rd European Conference on Information Systems (ECIS 2015), Muenster, Germany. DOI: 10.2139/ssrn.2522884
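
As a minimal sketch of the approach described above, the following generates a dictionary from a handful of documents with generateDictionary(). The documents and response values are toy data chosen purely for illustration; a meaningful dictionary needs a much larger corpus and a real response variable, and the selected terms can vary between runs because the regularization strength is typically chosen by cross-validation.

```r
library(SentimentAnalysis)

# Toy corpus with a numeric response (illustrative values only)
documents <- c("This is a good thing!",
               "This is a very good thing!",
               "This is okay.",
               "This is a bad thing.",
               "This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)

# Fit a LASSO model that selects terms related to the response
dictionary <- generateDictionary(documents, response)

# Inspect the selected terms and their weights
print(dictionary)
summary(dictionary)

# Score the documents with the generated dictionary
sentiment <- predict(dictionary, documents)
print(sentiment)
```

The predicted scores can then be passed to compareToResponse() together with the response, in the same way as the dictionary-based scores in the small example.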

License

SentimentAnalysis is released under the MIT License

Copyright (c) 2023 Stefan Feuerriegel & Nicolas Pröllochs

sentimentanalysis's People

Contributors

nproellochs, sfeuerriegel

sentimentanalysis's Issues

Error: C stack usage [large number] is too close to the limit

Hello,
I am trying to use the SentimentAnalysis package to analyze a collection of TripAdvisor reviews for research. I have a dataset of 178 reviews in a single column, assigned to the variable SA. I then tried to run this command:

sentiment <- analyzeSentiment(SA)

Which prompted the error

Error: C stack usage 7977028 is too close to the limit

I tried using the direct example from the documentation
analyzeSentiment( SA, language = "english", aggregate = NULL, rules = defaultSentimentRules(), removeStopwords = TRUE, stemming = TRUE )

Which led to a similar error

Error: C stack usage 7976404 is too close to the limit

I have been using R Studio Cloud (aka Posit Cloud) for this research but I attempted the exact same code in the desktop RStudio version and encountered this error instead:

Error: node stack overflow
Error during wrapup: node stack overflow
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

Some initial Googling of the issue makes it seem like it is an issue with recursive code but I do not think that would be the case here. Another thing I cannot make sense of is that I tried this exact same code with nearly the exact same data perhaps 6 months ago and it worked perfectly. Since then I have added less than 10 new reviews to the data (going from 170 to 178). Could that minor increase really cause this issue? If so, what would be a workaround for analyzing all 178 reviews properly? I would appreciate any suggestions for fixing this problem.
Thank you

Italian Dictionary

I would like to use the SentimentAnalysis package with a dictionary for the Italian language.
How can I do this?

Thanks
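
One possible approach, sketched under assumptions: build a custom binary dictionary with SentimentDictionaryBinary() and pass it to analyzeSentiment() as a rule. The word lists, the example sentences and the rule name "ItalianSentiment" below are hypothetical placeholders, not an established Italian lexicon. Stemming is switched off here so that tokens are compared with the unstemmed entries; with stemming enabled, the dictionary entries would presumably need to be given as stems.

```r
library(SentimentAnalysis)

# Hypothetical toy word lists; a real analysis needs a full Italian lexicon
dictionaryItalian <- SentimentDictionaryBinary(c("ottimo", "eccellente"),
                                               c("pessimo", "terribile"))

documents <- c("Il servizio era ottimo",
               "Il cibo era pessimo")

# stemming = FALSE so that tokens are matched against the unstemmed entries above
sentiment <- analyzeSentiment(documents,
                              language = "italian",
                              stemming = FALSE,
                              rules = list("ItalianSentiment" = list(ruleSentiment,
                                                                     dictionaryItalian)))

# Convert the continuous scores to negative / neutral / positive
print(convertToDirection(sentiment$ItalianSentiment))
```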

German sentiment dictionary available?

Hi Stefan,

Thanks to issue #2 I know how to add a new dictionary if one is available. Do you happen to know of a German dictionary that can be used in this package? I found this one. It seems all right except for the lack of adjectives: the entries all seem to be nouns. Many thanks in advance!

Parallel run

Hello and thank you for the excellent package.

I would like to ask whether some functions can be run in parallel.
I am mainly interested in the dictionary generation function.
Regards,
Sotiris

Fix row naming convention

First, great package already! Very excited about where this can go, I love all of your teaching as well. I think there is an issue with your intended row names for the analyzeSentiment function

analyzeSentiment(crude)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  duplicate row.names: character(0)

The function itself works and returns results; it just appears that the row names aren't being assigned.

Test statistic being displayed instead of p-value

Hello,

There is a typo in analyzeSentiment.R that is causing compareToResponse to show an incorrect cor.p.value. Note, for example, that the example compareToResponse output in the Readme shows the same value for cor.t.statistic and cor.p.value, which is obviously incorrect:
#>                   WordCount SentimentGI NegativityGI
#> cor             -0.18569534   0.9900115  -0.99748901
#> cor.t.statistic -0.37796447  14.0440465 -28.16913204
#> cor.p.value     -0.37796447  14.0440465 -28.16913204

The problem is in line 321 of analyzeSentiment.R. Currently, the line is the following:
"cor.p.value"=unlist(lapply(colnames(sentiment), function(x) cor.test(sentiment[, x], response)$statistic)),

I believe that this should instead read

"cor.p.value"=unlist(lapply(colnames(sentiment), function(x) cor.test(sentiment[, x], response)$p.value)),

Best,
Adam
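
The difference between the two fields can be checked with base R alone. The sketch below uses made-up numbers (not the README data) to show that cor.test() stores the t statistic in $statistic and the p-value in $p.value, and that for the default Pearson test the two are linked through the t distribution with n - 2 degrees of freedom.

```r
# Toy data (hypothetical values, not taken from the README example)
x <- c(0.33, 0.50, 0.50, -0.33, 0.00, -0.40)
y <- c(1, 1, 1, -1, 0, -1)

ct <- cor.test(x, y)  # Pearson correlation test by default

t_stat <- unname(ct$statistic)  # t statistic, can be any real number
p_val  <- ct$p.value            # two-sided p-value, always within [0, 1]

# Recompute the p-value from the t statistic to show how the two relate
p_from_t <- 2 * pt(-abs(t_stat), df = length(x) - 2)

print(c(t.statistic = t_stat, p.value = p_val, p.from.t = p_from_t))
```

A reported "p-value" that is negative or greater than 1, as in the output quoted in this issue, is therefore a sign that the test statistic was returned instead.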
