
syuzhet's Introduction

Syuzhet

An R package for the extraction of sentiment and sentiment-based plot arcs from text.

The name "Syuzhet" comes from the Russian Formalists Victor Shklovsky and Vladimir Propp who divided narrative into two components, the "fabula" and the "syuzhet." Syuzhet refers to the "device" or technique of a narrative whereas fabula is the chronological order of events. Syuzhet, therefore, is concerned with the manner in which the elements of the story (fabula) are organized (syuzhet).

The Syuzhet package attempts to reveal the latent structure of narrative by means of sentiment analysis. Instead of detecting shifts in the topic or subject matter of the narrative (as Ben Schmidt has done), the Syuzhet package reveals the emotional shifts that serve as proxies for the narrative movement between conflict and conflict resolution. This idea was inspired by the late Kurt Vonnegut in an essay titled "Here's a Lesson in Creative Writing" in his collection A Man Without A Country (Random House, 2007). A lecture Vonnegut gave on this subject is available on YouTube.

Thanks to Lincoln Mullen for early feedback on this package (see http://rpubs.com/lmullen/58030).

Installation

This package is now available on CRAN (http://cran.r-project.org/web/packages/syuzhet/).

install.packages("syuzhet")

You can install the most current development version from GitHub using the devtools package:

# install.packages("devtools")
devtools::install_github("mjockers/syuzhet")

References

Syuzhet incorporates four sentiment lexicons:

The default "Syuzhet" lexicon was developed in the Nebraska Literary Lab under the direction of Matthew L. Jockers

The "afinn" lexicon was developed by Finn Arup Nielsen as the AFINN WORD DATABASE See: See http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010 The AFINN database of words is copyright protected and distributed under "Open Database License (ODbL) v1.0" http://www.opendatacommons.org/licenses/odbl/1.0/ or a similar copyleft license.

The "bing" lexicon was developed by Minqing Hu and Bing Liu as the OPINION LEXICON See: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

The "nrc" lexicon was developed by Mohammad, Saif M. and Turney, Peter D. as the NRC EMOTION LEXICON.
See: http://saifmohammad.com/WebPages/lexicons.html. The NRC EMOTION LEXICON is released under the following terms of use:

  1. This lexicon can be used freely for research purposes.
  2. The papers listed below provide details of the creation and use of the lexicon. If you use a lexicon, then please cite the associated papers.
  3. If interested in commercial use of the lexicon, send email to the contact.
  4. If you use the lexicon in a product or application, then please credit the authors and NRC appropriately. Also, if you send us an email, we will be thrilled to know about how you have used the lexicon.
  5. National Research Council Canada (NRC) disclaims any responsibility for the use of the lexicon and does not provide technical support. However, the contact listed above will be happy to respond to queries and clarifications.
  6. Rather than redistributing the data, please direct interested parties to this page: http://www.purl.com/net/lexicons

-- Crowdsourcing a Word-Emotion Association Lexicon, Saif Mohammad and Peter Turney, To Appear in Computational Intelligence, Wiley Blackwell Publishing Ltd.

-- Tracking Sentiment in Mail: How Genders Differ on Emotional Axes, Saif Mohammad and Tony Yang, In Proceedings of the ACL 2011 Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA), June 2011, Portland, OR.

-- From Once Upon a Time to Happily Ever After: Tracking Emotions in Novels and Fairy Tales, Saif Mohammad, In Proceedings of the ACL 2011 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), June 2011, Portland, OR.

-- Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon, Saif Mohammad and Peter Turney, In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, June 2010, LA, California.

Links to the papers are available here: http://www.purl.org/net/NRCemotionlexicon

CONTACT INFORMATION: Saif Mohammad, Research Officer, National Research Council Canada. Email: [email protected]. Phone: +1-613-993-0620.

syuzhet's People

Contributors

amrrs, chrismuir, hadley, lmullen, mjockers, pbulsink, tmmcguire, trinker


syuzhet's Issues

Rely on StanfordCoreNLP package?

Right now syuzhet calls StanfordCoreNLP as a system() command. In principle it is better to use an rJava wrapper. There is a StanfordCoreNLP package already: it's just not hosted on CRAN because the models are too big. You can find the package at DataCube, and it can be installed with the following command:

install.packages("StanfordCoreNLP", repos = "http://datacube.wu.ac.at/", type = "source")

See the documentation here: ?StanfordCoreNLP_Pipeline. It's not a particularly well designed API, but it wouldn't be too hard to write some accessor functions to clean things up. It gives access to the sentiment parser.

I'd suggest re-writing get_stanford_sentiment() to rely on this package. Unfortunately there is no good way to express dependencies on packages which are not on CRAN. I had this problem with the gender package since CRAN won't take data packages above 5MB. So I wrote a function which checks to see if the genderdata package is installed, and if not, prompts the user to install it. You could do the same thing to check if the StanfordCoreNLP package is installed. Here is the function from gender.
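A rough sketch of that check-and-prompt pattern adapted for StanfordCoreNLP (an illustration of the idea described above, not the gender package's actual function):

check_corenlp <- function() {
  # Check whether the StanfordCoreNLP package is available; if not, offer to install it.
  if (!requireNamespace("StanfordCoreNLP", quietly = TRUE)) {
    message("The StanfordCoreNLP package is required but is not installed.")
    if (interactive() &&
        utils::menu(c("Yes", "No"), title = "Install it now from DataCube?") == 1) {
      install.packages("StanfordCoreNLP",
                       repos = "http://datacube.wu.ac.at/", type = "source")
    } else {
      stop("StanfordCoreNLP is not installed.", call. = FALSE)
    }
  }
  invisible(TRUE)
}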

What is the distinction between `scale_vals` and `scale_range`?

@mjockers:

Can you explain the distinction between calling get_transformed_values() with scale_vals = TRUE, scale_range = TRUE, and both of those being false? Below are three plots of the same text using the different values. The shape of the curve is the same, as expected, but the zero moves up and down. I understand how the scales are calculated. But what difference does this make for (1) interpreting the plot of a single text and (2) for comparing among the texts in a corpus?

Unscaled: [plot omitted]

Scaled with scale_vals = TRUE: [plot omitted]

Scaled with scale_range = TRUE: [plot omitted]
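For concreteness, a runnable sketch of the three variants being compared (poa_v stands in for any raw sentiment vector; here it is a synthetic placeholder):

library(syuzhet)
poa_v <- sample(c(-1, 0, 1), 500, replace = TRUE)  # placeholder sentiment vector

unscaled <- get_transformed_values(poa_v, scale_vals = FALSE, scale_range = FALSE)
vals     <- get_transformed_values(poa_v, scale_vals = TRUE)
ranged   <- get_transformed_values(poa_v, scale_range = TRUE)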

Using Syuzhet in Spanish

I am using the following code for sentiment analysis.

library(stringr)  # for str_replace_all

alltweets$clean_text <- str_replace_all(alltweets$text, "@\\w+", "")  # strip @mentions
Sentiment <- get_nrc_sentiment(alltweets$clean_text)
alltweets_senti <- cbind(alltweets, Sentiment)

sentimentTotals <- data.frame(colSums(alltweets_senti[,c(11:18)]))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL

The full version can be found here: http://rpubs.com/cosmopolitanvan/r_isis_tweets_analytics

I have used this and modified it several times without a problem, but I am not sure how to use it with Spanish. The code from https://github.com/mjockers/syuzhet/blob/master/vignettes/syuzhet-vignette.Rmd does not work for me either. I would be grateful for any guidance, as I am not very experienced with coding.
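For what it's worth, get_nrc_sentiment accepts a language argument (the same Spanish call appears in the next issue below). A hedged sketch against the clean_text column built above:

# Ask the NRC method for the Spanish translation of the lexicon:
Sentiment_es <- get_nrc_sentiment(alltweets$clean_text, language = "spanish")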

Problem with translated dictionaries (spanish)

I realized that the NRC dictionary in Spanish has multiple appearances of the same word under the same sentiment. I looked into the structure of the original translated dictionaries at http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm and figured out that this happens when the same Spanish word is used to translate several English words.

For instance, the word "asesino" serves as the translation for "assassin", "cutthroat", "murderer", "murderous" and "slayer", which in turn, due to the data structure, returns incorrect counts when used with the package's functions.

The output for get_nrc_sentiment(char_v = c("mira, un asesino"), language = "spanish")
is anger = 5, disgust = 3, fear = 5, sadness = 4, surprise = 2 and negative = 5; while the output for get_nrc_sentiment(char_v = c("look, an assassin"), language = "english") is anger = 1, fear = 1, sadness = 1 and negative = 1.

I think it'd be great if you could fix this, as the package is quite useful. I would expect this same issue to appear in other translations.

get_percentage_values occasionally returns 101 results

On my architecture (MacBook Air, OS X 10.9; full system profile on request), the function get_percentage_values occasionally returns 101 results instead of 100: at lengths 201, 203, 205, etc. This makes it hard to directly compare results.

The distribution is actually regular in kind of a pretty way if you plot it: groups of 4, 3, 2, 1 progressively farther apart as n gets higher. Maybe a floating point issue of some sort? It seems that 100.00000 occasionally rounds up to 101.

In any case, using cut instead of seq_along and ceiling fixes it--I'm posting a patch.

library(syuzhet)
lengths <- sapply(200:1500, function(n) {
  dummy_sentiment_vector <- sample(c(-1, 0, 1), n, replace = TRUE)
  length(get_percentage_values(dummy_sentiment_vector))
})
plot(200:1500, lengths)

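A minimal sketch of the cut()-based binning the patch proposes (an illustration of the idea, not the package's actual implementation): cut() always produces exactly `bins` groups, so the off-by-one disappears.

get_percentage_values_cut <- function(raw_values, bins = 100) {
  # Assign each position to one of exactly `bins` groups, then average per group.
  groups <- cut(seq_along(raw_values), breaks = bins, labels = FALSE)
  vapply(split(raw_values, groups), mean, numeric(1))
}

length(get_percentage_values_cut(sample(c(-1, 0, 1), 201, replace = TRUE)))  # 100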

get_tokens function missing in 3.2.3

Hi,

When installing the syuzhet library for R 3.2.3, the get_tokens function is missing.


How do I update it? I tried to copy-paste the function and re-zip the file, but it does not work and gives an MD5 sums error.

Regards,
Ren.

Update to newest tidyr version 1.2.0

Hey, thank you for your package!

Unfortunately, it does not work with the newest version of tidyr anymore.

Here is the problem:

Warning message: `spread_()` was deprecated in tidyr 1.2.0. ℹ Please use `spread()` instead. ℹ The deprecated feature was likely used in the syuzhet package.

Kind regards

Incorrect values get_sentiment and get_nrc_sentiment with Swedish text

Hi! I am very new to R and GitHub and coding overall so I apologize for any following mistakes!

I am trying to do a sentiment analysis of a Swedish novel with the help of the syuzhet package, but noticed the get_sentiment and get_nrc_sentiment functions read the value of certain words incorrectly. I first noticed it with my custom lexicon, but then did a test with the nrc lexicon as well and saw that both give incorrect values for words containing the letters ö, ä and å. Most of the time these words get the value 0 (while they should be getting 1 or -1), but I've also seen a case where a word gets assigned a positive value (1) while it should be negative (-1).

I've changed RStudio's default encoding to UTF-8 and my system's locale to Swedish, but nothing has helped.
How could I solve this problem? This is the code I would use to get my results:

# For the nrc lexicon

library(syuzhet)
library(readr)  # for read_file

binas_historia <- read_file(file.choose())
bina_words <- get_tokens(binas_historia, pattern = "\\W")
sentiment_b_nrc <- get_nrc_sentiment(bina_words, language = "swedish")
overzichtje_nrc <- data.frame(bina_words, sentiment_b_nrc)

# For the Swedish (custom) lexicon

binas_historia <- read_file(file.choose())
bina_words <- get_tokens(binas_historia, pattern = "\\W")
sensaldo_lexicon <- read.table("HP/Thesis/sensaldo-fullform.txt",
                               header = FALSE,
                               col.names = c("word", "category", "value"),
                               colClasses = c("character", "character", "numeric"),
                               encoding = "UTF-8")
sentiment_b_s <- get_sentiment(bina_words, method = "custom", lexicon = sensaldo_lexicon)
overzichtje_sensaldo <- data.frame(bina_words, sentiment_b_s)
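Not a confirmed fix, but a hedged first check: make sure the novel is read as UTF-8 so that ö, ä and å survive into the tokens regardless of the system locale.

library(readr)
# Force UTF-8 decoding when reading the file, independent of RStudio/locale settings:
binas_historia <- read_file(file.choose(), locale = locale(encoding = "UTF-8"))
Encoding(binas_historia)  # should report "UTF-8"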

Update package description with language information

Would it be possible to update the CRAN description (and/or GitHub readme) to include the languages and/or scripts that Syuzhet supports? Unlike when #29 was posted, NRC has support for multiple non-Latin scripts, but there's a mismatch between what it supports and the list of languages that throw errors when you try to use them with NRC in Syuzhet. Having the information about which languages are compatible upfront would be really helpful for managing expectations. (Happy to suggest some prose, including mitigation strategies for improving results for inflected languages, if it'd help.)

How To Obtain a Complete Data Frame of Sentiment Values for Pride and Prejudice

I have just run the following code in RStudio:

install.packages("syuzhet")
install.packages("zoo")

rm(list = ls())
library(syuzhet)
library(zoo)

t1 <- get_text_as_string(path_to_file = "Pride.txt")

get_sentiment(t1)

t1_sentences <- get_sentences(t1)

t1_sentiments <- get_sentiment(t1_sentences)

str(t1_sentiments)
head(t1_sentiments)
#t1_sentences[c(550)]
#options(max.print=1000000)

par(mar=c(2,2,1,1))
simple_plot(t1_sentiments, title = "Pride Sentiments")

t1_window <- round(length(t1_sentiments)*.1)
t1_rolled <- rollmean(t1_sentiments, k = t1_window)
t1_scaled <- rescale_x_2(t1_rolled)

plot(t1_scaled$x, t1_scaled$z, type="l", col="blue", xlab="Narrative Time", ylab="Emotional Valence", main = "Pride and PortraitV1 with Rolling Means")

t1_rolled

In running this code, I get the following message right at the end:

[986] 0.7185185 0.7185185 0.7191919 0.7239899 0.7267677
[991] 0.7313131 0.7302189 0.7348485 0.7389731 0.7388889
[996] 0.7414983 0.7425084 0.7404040 0.7395623 0.7382997
[ reached getOption("max.print") -- omitted 4345 entries ]

Can anyone suggest a modification to the code so that the getOption does not omit any entries and the data frame includes every single entry, please?
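One hedged way to keep every value without fighting the console's print limit (a sketch building on the objects created above; the file name is arbitrary) is to put the rolling means into a data frame and write them to disk, or to raise max.print before printing:

# Capture every rolling-mean value instead of printing to the console:
t1_df <- data.frame(index = seq_along(t1_rolled), rolling_mean = t1_rolled)
write.csv(t1_df, "pride_rolling_means.csv", row.names = FALSE)  # arbitrary file name

# Or simply raise the console print limit before printing t1_rolled:
options(max.print = length(t1_rolled) + 100)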

Adding NRC kind of Dictionaries

Hi,

I've been using syuzhet and I have to say it's just awesome. What I would like to do is add my own dictionary, in an NRC kind of way, with my own sentiments and values.

I'm kind of new to R and I tried to replicate get_nrc_sentiment() and get_nrc_values in order to accomplish getting my own dictionary. I came up with something like this:

get_custom_sentiment <- function(char_v, cl = NULL, lowercase = TRUE, myDictionary){
  if (!is.character(char_v)) stop("Data must be a character vector.")
  if (!is.null(cl) && !inherits(cl, 'cluster')) stop("Invalid Cluster")
  # lexicon <- dplyr::filter_(nrc, ~lang == language) # I commented this since I wanted my own dictionary
  if (lowercase) {
    char_v <- tolower(char_v)
  }
  word_l <- strsplit(char_v, "[^A-Za-z']+")

  if (is.null(cl)) {
    nrc_data <- lapply(word_l, get_custom_values, lexicon = myDictionary)
  }
  else {
    nrc_data <- parallel::parLapply(cl = cl, word_l, lexicon = myDictionary, get_custom_values(myDictionary))
  }

  result_df <- as.data.frame(do.call(rbind, nrc_data), stringsAsFactors = FALSE)
  # reorder the columns
  my_col_order <- c(
    "anger",
    "anticipation",
    "disgust",
    "fear",
    "joy",
    "sadness",
    "surprise",
    "trust",
    "negative",
    "positive"
  )
  result_df[, my_col_order]
}

And for getting the values:

get_custom_values <- function(word_vector, lexicon = NULL){
  # if (is.null(lexicon)) {
  #   lexicon <- dplyr::filter_(nrc, ~lang == language)
  # } # Again, commented this since I want my own dictionary
  if (!all(c("word", "sentiment", "value") %in% names(lexicon)))
    stop("lexicon must have a 'word', a 'sentiment' and a 'value' field")

  data <- dplyr::filter_(myDictionary, ~word %in% word_vector)
  data <- dplyr::group_by_(data, ~sentiment)
  data <- dplyr::summarise_at(data, "value", sum)

  all_sent <- unique(myDictionary$sentiment)
  sent_present <- unique(data$sentiment)
  sent_absent <- setdiff(all_sent, sent_present)
  if (length(sent_absent) > 0) {
    missing_data <- dplyr::data_frame(sentiment = sent_absent, value = 0)
    data <- rbind(data, missing_data)
  }
  tidyr::spread_(data, "sentiment", "value")
}

If I pass the dictionary directly into the function code (not as a variable) the code works, but if I pass it as a variable I get this error:

Error in get_custom_sentiment(DB$text, feelings) :
Invalid Cluster

My guess is there is something in how I format the custom dictionary that makes the function fail the cluster check on this line:
if(!is.null(cl) && !inherits(cl, 'cluster')) stop("Invalid Cluster")

But if that were the case, why does it work when I put the dictionary directly inside the function?
I hope this makes sense and that you could point me in the right direction to work on this.
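A hedged guess at the cause, based only on the signature shown above: in get_custom_sentiment(DB$text, feelings), the dictionary is matched positionally to the second parameter, cl, which then fails the inherits(cl, 'cluster') check; that would also explain why everything works when the dictionary is hard-coded inside the function. Naming the argument would avoid the mismatch:

# feelings lands in `cl` when passed positionally; name the parameter instead:
result <- get_custom_sentiment(DB$text, myDictionary = feelings)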

spread_() deprecation warning from get_nrc_sentiment

I used the following command, from the section "Obtain sentiment scores" in this page: https://www.r-bloggers.com/2021/05/sentiment-analysis-in-r-3/

> apple <- read.csv("/Users/me/Downloads/Data1.csv", header = T)
> tweets <- iconv(apple$text)
> s <- get_nrc_sentiment(tweets)
Warning message:
`spread_()` was deprecated in tidyr 1.2.0.
ℹ Please use `spread()` instead.
ℹ The deprecated feature was likely used in the syuzhet package.
  Please report the issue to the authors.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated. 
> lifecycle::last_lifecycle_warnings()
[[1]]
<warning/lifecycle_warning_deprecated>
Warning:
`spread_()` was deprecated in tidyr 1.2.0.
ℹ Please use `spread()` instead.
ℹ The deprecated feature was likely used in the syuzhet package.
  Please report the issue to the authors.
---
Backtrace:
    ▆
 1. └─syuzhet::get_nrc_sentiment(tweets)
 2.   └─base::lapply(word_l, get_nrc_values, lexicon = lexicon)
 3.     └─syuzhet (local) FUN(X[[i]], ...)
 4.       └─tidyr::spread_(data, "sentiment", "value")
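For reference, a hedged sketch of the change the warning points to (assuming the internal data frame has the sentiment and value columns shown in the backtrace; this is not the package's actual patch):

# Deprecated call reported in the backtrace:
tidyr::spread_(data, "sentiment", "value")

# Modern equivalent:
tidyr::pivot_wider(data, names_from = "sentiment", values_from = "value")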

Error in get_nrc_sentiment for Hungarian language

I am trying to get the sentiment of Hungarian words. It works properly in English or Italian, but not in Hungarian.

packageVersion('syuzhet')
[1] ‘1.0.7’
syuzhet::get_nrc_sentiment(c('love', 'hate', 'apple'))

  anger anticipation disgust fear joy sadness surprise trust negative positive
1     0            0       0    0   1       0        0     0        0        1
2     1            0       1    1   0       1        0     0        1        0
3     0            0       0    0   0       0        0     0        0        0
syuzhet::get_nrc_sentiment(c('love', 'hate', 'apple'), language='english')
  anger anticipation disgust fear joy sadness surprise trust negative positive
1     0            0       0    0   1       0        0     0        0        1
2     1            0       1    1   0       1        0     0        1        0
3     0            0       0    0   0       0        0     0        0        0
syuzhet::get_nrc_sentiment(c('amore', 'odio', 'mela'), language='italian')
  anger anticipation disgust fear joy sadness surprise trust negative positive
1     0            0       0    0   0       0        0     0        0        0
2     2            0       2    2   0       2        0     0        2        0
3     0            0       0    0   0       0        0     0        0        0
syuzhet::get_nrc_sentiment(c('love', 'hate', 'apple'), language='hungarian')
#Error in value[[jvseq[[jjj]]]] : subscript out of bounds
syuzhet::get_nrc_sentiment(c('szeretet', 'utálat', 'alma'), language='hungarian')
#Error in value[[jvseq[[jjj]]]] : subscript out of bounds

Related stackoverflow question: https://stackoverflow.com/questions/77631640/r-package-syuzhet-does-not-work-in-hungarian
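Not an official API, but a hedged way to check which languages the bundled NRC data actually covers (this assumes the lexicon is still stored internally as an `nrc` data frame with a `lang` column, as the filter_(nrc, ~lang == language) call quoted elsewhere on this page suggests):

# List the language labels present in syuzhet's internal NRC table:
sort(unique(syuzhet:::nrc$lang))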

syuzhet R Package Needs updating to rlang 0.4.8

When downloading syuzhet 1.0.4 for a class I am teaching, it presents an error that it requires rlang 0.4.7, but installing it loads 0.4.6.

Requested Matthew Jockers to update syuzhet, if possible, to rlang 0.4.8.

Steve Kaisler

Example in `get_transformed_values()` returns only NA

Running example(get_transformed_values) returns only a vector of one hundred NAs. I suspect that the problem is that the example text has fewer than 100 sentences. One possible fix is to use Joyce's Portrait that is now in inst/extdata.

Deprecated code

Hi!

Thanks for producing and sharing a great package.

I had these warnings come up that you might consider in any new releases.

filter_() is deprecated as of dplyr 0.7.0.
Please use filter() instead.
See vignette('programming') for more help.
This warning is displayed once every 8 hours.
Call lifecycle::last_warnings() to see where this warning was generated.

group_by_() is deprecated as of dplyr 0.7.0.
Please use group_by() instead.
See vignette('programming') for more help.
This warning is displayed once every 8 hours.
Call lifecycle::last_warnings() to see where this warning was generated.

data_frame() is deprecated as of tibble 1.1.0.
Please use tibble() instead.
This warning is displayed once every 8 hours.
Call lifecycle::last_warnings() to see where this warning was generated.

Thanks again.
Johnny
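For reference, a hedged sketch of the modern equivalents of the three deprecated calls named in the warnings (mirroring the filter_/group_by_/data_frame usage visible in the custom-lexicon code earlier on this page, whose objects are reused here; not the package's actual fix):

# filter_(nrc, ~lang == language)        ->  filter(nrc, lang == language)
lexicon <- dplyr::filter(nrc, lang == language)

# group_by_(data, ~sentiment)            ->  group_by(data, sentiment)
data <- dplyr::group_by(data, sentiment)

# data_frame(sentiment = ..., value = 0) ->  tibble(sentiment = ..., value = 0)
missing_data <- tibble::tibble(sentiment = sent_absent, value = 0)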

Error in system(cmd, input = text_vector, intern = TRUE, ignore.stderr = TRUE) : 'cd' not found

Hi,
I am kind of new to this sentiment and R thing.
I am currently still figuring out which code will be best for my purpose.
Anyway, I am currently trying syuzhet, but as soon as I reach the point where I would like to try the Stanford method I get the above error. I have already installed stanford-corenlp-full-2015-12-09 and have set the tagger_path to the location of that package. What else do I need to do?

Code:

https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html

library(syuzhet)
my_example_text <- "I begin this story with a neutral statement.
Basically this is a very silly test.
You are testing the Syuzhet package using short, inane sentences.
I am actually very happy today.
I have finally finished writing this package.
Tomorrow I will be very sad.
I won't have anything left to do.
I might get angry and decide to do something horrible.
I might destroy the entire package and start from scratch.
Then again, I might find it satisfying to have completed my first R package.
Honestly this use of the Fourier transformation is really quite elegant.
You might even say it's beautiful!"
s_v <- get_sentences(my_example_text)

class(s_v)

[1] "character"

str(s_v)

chr [1:12] "I begin this story with a neutral statement." ...

head(s_v)

Methods

BING

sentiment_vector <- get_sentiment(s_v, method="bing")
sentiment_vector

AFINN

afinn_vector <- get_sentiment(s_v, method="afinn")
afinn_vector

NRC

nrc_vector <- get_sentiment(s_v, method="nrc")
nrc_vector

Stanford

Stanford Example: Requires installation of coreNLP and path to directory

tagger_path <- "C:/Users/Steffen/Documents/R/win-library/3.2/stanford-corenlp-full-2015-12-09"
stanford_vector <- get_sentiment(s_v, method="stanford", tagger_path)
stanford_vector

Thanks for a quick answer,
Steffen
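A hedged diagnostic rather than a confirmed fix: the error message shows that the command syuzhet passes to system() begins with cd, and on Windows system() launches programs directly rather than through cmd.exe, so shell built-ins such as cd are reported as not found. You can at least confirm that Java itself is reachable independently of syuzhet:

# Runs java directly (no shell built-ins involved); should print a version string.
system("java -version")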

Drop openNLP/rJava dependency

Hi @mjockers I currently import the syuzhet dictionary into my own lexicon package per this PR: #19

There is one downside: syuzhet utilizes openNLP to split sentences. I am writing to request an alternative sentence segmentation resource such as my own textshape package. My argument for dropping openNLP is fourfold: (A) it provides a significant setup hurdle for many users, (B) it's less accurate, (C) it's much slower than alternatives, and (D) openNLP strips the original element-to-sentence hierarchy.

First, openNLP presents a significant user setup hurdle. openNLP has an rJava dependency. rJava is a thorn in many users' sides, including experienced users (e.g.: trinker/qdap#232), and is difficult (or impossible) if you're trying to set up computing in a cloud service like Microsoft Azure. I have a network of packages that in turn rely on lexicon, all of which become beholden to rJava. Dropping openNLP removes a Java dependency, making syuzhet R-based and thus easier to set up.

Second, openNLP is less accurate than the textshape alternative I am proposing. Here we see similar use:

library(syuzhet)
library(textshape)
library(tidyverse)

my_example_text <- "I begin this story with a neutral statement.  
  Basically this is a very silly test.  
  You are testing the Syuzhet package using short, inane sentences.  
  I am actually very happy today. 
  I have finally finished writing this package.  
  Tomorrow I will be very sad. 
  I won't have anything left to do. 
  I might get angry and decide to do something horrible.  
  I might destroy the entire package and start from scratch.  
  Then again, I might find it satisfying to have completed my first R package. 
  Honestly this use of the Fourier transformation is really quite elegant.  
  You might even say it's beautiful!"

get_sentences(my_example_text)

textshape::split_sentence(my_example_text)

Now we amp it up with a subset of joyces_portrait: openNLP gets n = 727 vs. textshape's n = 758. That's ~30 fewer sentences detected by the openNLP algorithm. I have used several reputable text programs (script at end) for segmentation and compared their number of sentences: (A) coreNLP n = 756, (B) textblob* n = 757, (C) nltk* n = 757, (D) spacy* n = 757 and (E) pattern* n = 758. We see textshape is much closer to these other segmentation tools than openNLP.

*Analyzed using: http://textanalysisonline.com

Third, openNLP is slow. Let's demo this by taking the subset of joyces_portrait and multiplying it by 100. The code below shows that the textshape approach is 21 times faster at segmenting this text.

> ## subset of joyces_portrait
> x <- readLines('https://gist.githubusercontent.com/trinker/03a75e5fe935223d87085e50d01b981e/raw/f83d4170a11e4d60040a5f9d14ef9c3a0d7c22af/example_text')
> y <- paste(rep(x, 100), collapse = ' ')
> 
> gar <- gc(); start <- Sys.time()
> a <- get_sentences(y) 
> Sys.time() - start
Time difference of 52.11476 secs
> 
> gar <- gc(); start <- Sys.time()
> b <- textshape::split_sentence(y) 
> Sys.time() - start
Time difference of 2.43971 secs

Finally, get_sentences strips out the element ordering. In a book this may be less important, but even then one wants to keep chapters or acts straight, and that becomes difficult. The example below shows that get_sentences returns one vector of segmented sentences, while textshape returns a list of 3, one for each act in the play.

z <- tibble::tibble(

  act = 1:3,
  text = c("I begin this story with a neutral statement.  
    Basically this is a very silly test.  
    You are testing the Syuzhet package using short, inane sentences.  
    I am actually very happy today. 
    I have finally finished writing this package.",  
    "Tomorrow I will be very sad. 
    I won't have anything left to do. 
    I might get angry and decide to do something horrible.",   
    "I might destroy the entire package and start from scratch.  
    Then again, I might find it satisfying to have completed my first R package. 
    Honestly this use of the Fourier transformation is really quite elegant.  
    You might even say it's beautiful!"
  )
)

get_sentences(z$text)
textshape::split_sentence(z$text)

## > get_sentences(z$text)
##  [1] "I begin this story with a neutral statement."                                
##  [2] "Basically this is a very silly test."                                        
##  [3] "You are testing the Syuzhet package using short, inane sentences."           
##  [4] "I am actually very happy today."                                             
##  [5] "I have finally finished writing this package."                               
##  [6] "Tomorrow I will be very sad."                                                
##  [7] "I won't have anything left to do."                                           
##  [8] "I might get angry and decide to do something horrible."                      
##  [9] "I might destroy the entire package and start from scratch."                  
## [10] "Then again, I might find it satisfying to have completed my first R package."
## [11] "Honestly this use of the Fourier transformation is really quite elegant."    
## [12] "You might even say it's beautiful!"                                          
## > textshape::split_sentence(z$text)
## [[1]]
## [1] "I begin this story with a neutral statement."                     
## [2] "Basically this is a very silly test."                             
## [3] "You are testing the Syuzhet package using short, inane sentences."
## [4] "I am actually very happy today."                                  
## [5] "I have finally finished writing this package."                    
## 
## [[2]]
## [1] "Tomorrow I will be very sad."                          
## [2] "I won't have anything left to do."                     
## [3] "I might get angry and decide to do something horrible."
## 
## [[3]]
## [1] "I might destroy the entire package and start from scratch."                  
## [2] "Then again, I might find it satisfying to have completed my first R package."
## [3] "Honestly this use of the Fourier transformation is really quite elegant."    
## [4] "You might even say it's beautiful!" 

This means that get_sentences will not play nicely in a dplyr mutate statement, as the length returned is longer than the input, resulting in an error. textshape, on the other hand, returns a list column:

> z %>%
+     dplyr::mutate(sents = get_sentences(text))
Error in mutate_impl(.data, dots) : 
  wrong result size (12), expected 3 or 1
> 
> z %>%
+     dplyr::mutate(sents = textshape::split_sentence(text)) 
# A tibble: 3 × 3
    act
  <int>
1     1
2     2
3     3
# ... with 2 more variables: text <chr>, sents <list>

Proposed non-OpenNLP Sentence Segmentation Function

This would be a possible non-openNLP segmentation approach with textshape; by switching to this function, syuzhet could drop its openNLP dependency.

#' Sentence Tokenization
#' @description
#' Parses a string into a vector of sentences.
#' @param text_of_file A Text String
#' @param as_vector If \code{TRUE} the result is unlisted.  If \code{FALSE}
#' the result stays as a list of the original text string elements split into 
#' sentences.
#' @return A Character Vector of Sentences
#' @export
#' 
get_sentences <- function(text_of_file, as_vector = TRUE){
  if (!is.character(text_of_file)) stop("Data must be a character vector.")
  splits <- textshape::split_sentence(text_of_file)
  if (isTRUE(as_vector)) splits <- unlist(splits)
  splits
}

Thank you for your time and consideration for dropping the openNLP dependency from syuzhet.

additional code for comparing segmentation lengths of prominent text analysis software

## subset of joyces_portrait
x <- readLines('https://gist.githubusercontent.com/trinker/03a75e5fe935223d87085e50d01b981e/raw/f83d4170a11e4d60040a5f9d14ef9c3a0d7c22af/example_text')

## syuzhet via openNLP n = 727
get_sentences(x) %>%
    unlist() %>% 
    length()

## textshape n = 758
textshape::split_sentence(x) %>%
    unlist() %>% 
    length() 



## coreNLP n = 756
cmd <- "java -cp \"C:/stanford-corenlp-full-2016-10-31/*\" -mx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators \"tokenize,ssplit\""
results <- system(cmd, input = x, intern = TRUE, ignore.stderr = TRUE)
grep('^Sentence', results, value = TRUE) %>%
    length()


## http://textanalysisonline.com/
## textblob n = 757
readLines('https://gist.githubusercontent.com/trinker/3732befb8a7b0425ae9d7efddefab94e/raw/e98d6ec1a074af41df8aa6d447947d869b58a6c4/example_text_split') %>%
    stringi::stri_split_fixed('<br><br>') %>%
    unlist() %>%
    length()

## nltk n = 757
readLines('https://gist.githubusercontent.com/trinker/aed0942a326372df88884a8e61fe3122/raw/4c315adf27ff6a50c645fc7bada0c2eb9b43c4d0/example_text_nltk') %>%
    stringi::stri_split_fixed('<br><br>') %>%
    unlist() %>%
    length()

## spacy n = 757
readLines('https://gist.githubusercontent.com/trinker/8bfdfb4dd46e6787913ed542cb82e56e/raw/baf0c0a49cb0de53fac1cacff027ebe34cfda24a/example_text_spacy') %>%     
    stringi::stri_split_fixed('<br><br>') %>%
    unlist() %>%
    length()

## pattern n = 758
readLines('https://gist.githubusercontent.com/trinker/935b942aec8ff305d19de7e0c01129e2/raw/158831e219e386681b30f220765a79cf7d597ea6/example_text_pattern') %>%     
    stringi::stri_split_fixed('<br><br>') %>%
    unlist() %>%
    length()

score zero to any non-english sentence

I tried to use get_nrc_sentiment with lang = "portuguese" and the NRC method, but it always returns zero for all sentiments. I was wondering if I should do something more to use it.
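One hedged observation, based on the signature used in other issues on this page: the argument is language, not lang. my_portuguese_text below is a placeholder character vector:

# Request the Portuguese translation of the NRC lexicon explicitly:
get_nrc_sentiment(my_portuguese_text, language = "portuguese")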

portuguese

Hi there,

What's the best approach to using the Portuguese translation of the NRC Emotion Lexicon dictionary in functions like get_nrc_sentiment and get_nrc_values?

Thank you,
Daniel

Call get_sentiment() method for non-english data

I am calling the get_sentiment() method for Russian text, but for 99% of the data it returns a 0.00 score. Only for 2-3 posts, which contain some English words, did it return a positive or negative value.

I tested with all the dictionaries ("bing", "nrc", "afinn", "syuzhet") but the results are almost all zero.

Can you please tell me how to use this method for Russian or other-language content?
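A hedged pointer rather than a confirmed answer: the NRC lexicon does include Russian translations, but as the "Add support for non-Ascii languages" and "Support for Cyrillic Slavic languages?" issues below note, the default tokenization keeps only A-Z, a-z and apostrophes, so Cyrillic words are discarded before lookup, which would explain the zeros.

# Illustration: the default word-splitting regex leaves no Cyrillic tokens behind.
strsplit(tolower("Это хорошо"), "[^A-Za-z']+")  # no usable tokens survive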

issues with get_nrc_sentiment

Hi,
I am trying to perform sentiment analysis using the NRC lexicon on Twitter data; however, when I use get_nrc_sentiment it takes too long to compute. I do have a huge dataset.

How can I reduce the computation time?
Please advise. Also, I am new to R.
Thank you.
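One hedged option (a sketch, not a benchmark): get_nrc_sentiment accepts a cl argument for a parallel cluster, as the get_nrc_sentiment-style code elsewhere on this page shows. tweets below is a placeholder character vector of your text:

library(syuzhet)
library(parallel)

cl <- makeCluster(detectCores() - 1)   # leave one core free
nrc_scores <- get_nrc_sentiment(tweets, cl = cl)
stopCluster(cl)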

German language delivers wrong polarity

Analysing German text delivers wrong polarity

my_example_text <- "Frau am Bahnhof geschlagen und ausgeraubt Täter sind 7 Araber auch andere Fahrgäste sind wohl Opfer geworden"
get_nrc_sentiment(my_example_text, language = "german")

This German text is quite straightforward and should deliver negative emotions, yet somehow I get "0" everywhere.

But by using the following call, I somehow get a result that can be interpreted, which is "-3":
get_sentiment(my_example_text, method = "nrc", language = "german")

Can you help to fix it?

Considering negators in sentiment

Matt, thanks for putting this package together. I was testing some of the different functionalities and I noticed that negators (e.g., not, never) may not be accounted for. Maybe I'm not understanding the usage, or this is beyond the scope of the package, or not part of the algorithms you've implemented.

Here are some examples demonstrating what I'm talking about. I'd expect to see negative sentiment here.

print(get_sentiment("John is never successful at tennis."))
setNames(lapply(c("afinn", "bing", "nrc"), function(x) {
    try(get_sentiment("this is not good at all", method=x))
}), c("afinn", "bing", "nrc"))
$afinn
[1] 3

$bing
[1] 1

$nrc
[1] 1

Add support for non-Ascii languages

Currently, only languages with Latin alphabets are supported by the NRC and custom methods. I have made a fork which supports Unicode alphabets by altering the regex used, and by using the CRAN package ‘tokenizers’ for Unicode tokenisation.

The regex has been altered from "[^A-Za-z']+" to "\s+", splitting the strings only on whitespace (illustrated below). There are more complex alternatives using Perl-style regex that split only on characters outside Unicode alphabets, if this causes issues.
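For illustration (not the fork's actual code), the change amounts to something like the following, where char_v is the character vector being tokenized:

# Before: strip everything that is not A-Z, a-z or an apostrophe (drops non-Latin letters)
word_l <- strsplit(char_v, "[^A-Za-z']+")

# After: split on whitespace only, so Unicode letters survive
word_l <- strsplit(char_v, "\\s+")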

get_sentiment output is invisible

Just a minor issue: I was going through the vignette examples and noticed that the get_sentiment output is invisible and can only be assigned, so running the first command below returns nothing.

get_sentiment("Why is this function invisible?", method="bing")

x <- get_sentiment("Why do I need to create a new object to print", method="bing")
x
[1] -1

I think wrapping return(result) in an else{} statement causes this, since it's not evaluated.
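Two hedged workarounds in the meantime: force printing explicitly, or wrap the call in parentheses (which makes an invisible value visible):

print(get_sentiment("Why is this function invisible?", method = "bing"))
(get_sentiment("Why is this function invisible?", method = "bing"))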

Support for Cyrillic Slavic languages?

Would it be possible to adjust the tokenization to accommodate Cyrillic Slavic languages (e.g. Russian, Bulgarian, Serbian) that are supported by NRC? I understand the challenges with handling tokenization more broadly, and can't vouch for how it would work with some of the other non-Latin alphabets, but I think adding in the Cyrillic Unicode ranges to the current whitespace-oriented code should work okay. Thank you!

mixed message function error

Whenever I try to use the mixed_messages function on prose texts, the same error pops up as below (I used the same code from "Introduction to the Syuzhet Package"):

path_to_a_text_file <- "CMS_02.txt"
sample <- get_text_as_string(path_to_a_text_file)
sample_sents <- get_sentences(sample)
test <- lapply(sample_sents, mixed_messages)
Error in sign(get_sentiment(tokens)) :
non-numeric argument to mathematical function

Can anyone help me solve this issue?

Support for user defined sentiment lexicons

Right now the package only works with the built-in sentiment lexicons. It makes sense to add functionality that will allow users to work with their own lexicons or with lexicons in other languages.
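As a sketch of what such support could look like (hypothetical lexicon and values, following the custom-method call that appears in the Swedish issue above):

# A user-defined lexicon: a data frame with `word` and `value` columns.
my_lexicon <- data.frame(word  = c("awful", "lovely"),
                         value = c(-1, 1))

get_sentiment(c("what a lovely day", "an awful mess"),
              method = "custom", lexicon = my_lexicon)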
