juliasilge / tidytext

Text mining using tidy tools :sparkles::page_facing_up::sparkles:

Home Page: https://juliasilge.github.io/tidytext/

License: Other

Languages: R 96.60%, TeX 3.16%, Rez 0.24%

Topics: text-mining, r, tidyverse, tidy-data, natural-language-processing

tidytext's Introduction

tidytext: Text mining using tidy tools

Authors: Julia Silge, David Robinson
License: MIT


Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Much of the infrastructure needed for text mining with tidy data frames already exists in packages like dplyr, broom, tidyr, and ggplot2. In this package, we provide functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. Check out our book to learn more about text mining using tidy data principles.

Installation

You can install this package from CRAN:

install.packages("tidytext")

Or you can install the development version from GitHub with remotes:

library(remotes)
install_github("juliasilge/tidytext")

Tidy text mining example: the unnest_tokens function

The novels of Jane Austen can be so tidy! Let’s use the text of Jane Austen’s 6 completed, published novels from the janeaustenr package, and transform them to a tidy format. janeaustenr provides them as a one-row-per-line format:

library(janeaustenr)
library(dplyr)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(line = row_number()) %>%
  ungroup()

original_books
#> # A tibble: 73,422 × 3
#>    text                    book                 line
#>    <chr>                   <fct>               <int>
#>  1 "SENSE AND SENSIBILITY" Sense & Sensibility     1
#>  2 ""                      Sense & Sensibility     2
#>  3 "by Jane Austen"        Sense & Sensibility     3
#>  4 ""                      Sense & Sensibility     4
#>  5 "(1811)"                Sense & Sensibility     5
#>  6 ""                      Sense & Sensibility     6
#>  7 ""                      Sense & Sensibility     7
#>  8 ""                      Sense & Sensibility     8
#>  9 ""                      Sense & Sensibility     9
#> 10 "CHAPTER 1"             Sense & Sensibility    10
#> # ℹ 73,412 more rows

To work with this as a tidy dataset, we need to restructure it in a one-token-per-row format. The unnest_tokens() function converts a data frame with a text column to one token per row:

library(tidytext)
tidy_books <- original_books %>%
  unnest_tokens(word, text)

tidy_books
#> # A tibble: 725,055 × 3
#>    book                 line word       
#>    <fct>               <int> <chr>      
#>  1 Sense & Sensibility     1 sense      
#>  2 Sense & Sensibility     1 and        
#>  3 Sense & Sensibility     1 sensibility
#>  4 Sense & Sensibility     3 by         
#>  5 Sense & Sensibility     3 jane       
#>  6 Sense & Sensibility     3 austen     
#>  7 Sense & Sensibility     5 1811       
#>  8 Sense & Sensibility    10 chapter    
#>  9 Sense & Sensibility    10 1          
#> 10 Sense & Sensibility    13 the        
#> # ℹ 725,045 more rows

This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.
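
For instance, a minimal sketch (reusing original_books from above) of tokenizing into bigrams instead of single words:

original_books %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)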

Now that the data is in a one-word-per-row format, we can manipulate it with tidy tools like dplyr. We can remove stop words (available via the function get_stopwords()) with an anti_join().

tidy_books <- tidy_books %>%
  anti_join(get_stopwords())

We can also use count() to find the most common words in all the books as a whole.

tidy_books %>%
  count(word, sort = TRUE) 
#> # A tibble: 14,375 × 2
#>    word      n
#>    <chr> <int>
#>  1 mr     3015
#>  2 mrs    2446
#>  3 must   2071
#>  4 said   2041
#>  5 much   1935
#>  6 miss   1855
#>  7 one    1831
#>  8 well   1523
#>  9 every  1456
#> 10 think  1440
#> # ℹ 14,365 more rows

Sentiment analysis can be implemented as an inner join. Three sentiment lexicons are available via the get_sentiments() function. Let’s examine how sentiment changes across each novel. Let’s find a sentiment score for each word using the Bing lexicon, then count the number of positive and negative words in defined sections of each novel.

library(tidyr)
get_sentiments("bing")
#> # A tibble: 6,786 × 2
#>    word        sentiment
#>    <chr>       <chr>    
#>  1 2-faces     negative 
#>  2 abnormal    negative 
#>  3 abolish     negative 
#>  4 abominable  negative 
#>  5 abominably  negative 
#>  6 abominate   negative 
#>  7 abomination negative 
#>  8 abort       negative 
#>  9 aborted     negative 
#> 10 aborts      negative 
#> # ℹ 6,776 more rows

janeaustensentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") %>% 
  count(book, index = line %/% 80, sentiment) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

janeaustensentiment
#> # A tibble: 920 × 5
#>    book                index negative positive sentiment
#>    <fct>               <dbl>    <int>    <int>     <int>
#>  1 Sense & Sensibility     0       16       32        16
#>  2 Sense & Sensibility     1       19       53        34
#>  3 Sense & Sensibility     2       12       31        19
#>  4 Sense & Sensibility     3       15       31        16
#>  5 Sense & Sensibility     4       16       34        18
#>  6 Sense & Sensibility     5       16       51        35
#>  7 Sense & Sensibility     6       24       40        16
#>  8 Sense & Sensibility     7       23       51        28
#>  9 Sense & Sensibility     8       30       40        10
#> 10 Sense & Sensibility     9       15       19         4
#> # ℹ 910 more rows

Now we can plot these sentiment scores across the plot trajectory of each novel.

library(ggplot2)

ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(book), ncol = 2, scales = "free_x")

Sentiment scores across the trajectories of Jane Austen's six published novels

For more examples of text mining using tidy data frames, see the tidytext vignette.

Tidying document term matrices

Some existing text mining datasets are in the form of a DocumentTermMatrix class (from the tm package). For example, consider the corpus of 2246 Associated Press articles from the topicmodels dataset.

library(tm)
data("AssociatedPress", package = "topicmodels")
AssociatedPress
#> <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
#> Non-/sparse entries: 302031/23220327
#> Sparsity           : 99%
#> Maximal term length: 18
#> Weighting          : term frequency (tf)

If we want to analyze this with tidy tools, we first need to transform it into a one-row-per-document-per-term data frame with the tidy() function. (For more on the tidy verb, see the broom package.)

tidy(AssociatedPress)
#> # A tibble: 302,031 × 3
#>    document term       count
#>       <int> <chr>      <dbl>
#>  1        1 adding         1
#>  2        1 adult          2
#>  3        1 ago            1
#>  4        1 alcohol        1
#>  5        1 allegedly      1
#>  6        1 allen          1
#>  7        1 apparently     2
#>  8        1 appeared       1
#>  9        1 arrested       1
#> 10        1 assault        1
#> # ℹ 302,021 more rows

We could find the most negative documents:

ap_sentiments <- tidy(AssociatedPress) %>%
  inner_join(get_sentiments("bing"), by = c(term = "word")) %>%
  count(document, sentiment, wt = count) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  arrange(sentiment)

Or we can join the Austen and AP datasets and compare the frequencies of each word:

comparison <- tidy(AssociatedPress) %>%
  count(word = term) %>%
  rename(AP = n) %>%
  inner_join(count(tidy_books, word)) %>%
  rename(Austen = n) %>%
  mutate(AP = AP / sum(AP),
         Austen = Austen / sum(Austen))


comparison
#> # A tibble: 4,730 × 3
#>    word             AP     Austen
#>    <chr>         <dbl>      <dbl>
#>  1 abandoned 0.000170  0.00000493
#>  2 abide     0.0000291 0.0000197 
#>  3 abilities 0.0000291 0.000143  
#>  4 ability   0.000238  0.0000148 
#>  5 able      0.000664  0.00151   
#>  6 abroad    0.000194  0.000178  
#>  7 abrupt    0.0000291 0.0000247 
#>  8 absence   0.0000776 0.000547  
#>  9 absent    0.0000436 0.000247  
#> 10 absolute  0.0000533 0.000128  
#> # ℹ 4,720 more rows

library(scales)
ggplot(comparison, aes(AP, Austen)) +
  geom_point(alpha = 0.5) +
  geom_text(aes(label = word), check_overlap = TRUE,
            vjust = 1, hjust = 1) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

Scatterplot of word frequencies in Jane Austen vs. AP news articles. Some words like "cried" are only common in Jane Austen, some words like "national" are only common in AP articles, and some words like "time" are common in both.

For more examples of working with objects from other text mining packages using tidy data principles, see the vignette on converting to and from document term matrices.

Community Guidelines

This project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support here.

tidytext's People

Contributors

aedobbyn, ameliamn, arfon, colinfay, davechilders, dgrtwo, dlependorf, emilhvitfeldt, jefferickson, jennybc, jimhester, jkeirstead, jonathanvoelkle, jonmcalder, juliasilge, kanishkamisra, kbenoit, lepennec, lionel-, lmullen, luisdza, michaelchirico, olivroy, pursuitofdatascience, ramnathv, saberry, seankross, tmastny, vincentarelbundock


tidytext's Issues

Rename unnest_tokens to eliminate semantic overlap/ambiguity with `tidyr::unnest`

Currently the naming overlaps with tidyr::unnest(), which unwraps list columns. By contrast, tidytext::unnest_tokens() acts on a character vector column. For example, tokenise() would avoid this collision of names and semantics, though at the cost of losing the sense that multiple rows are being produced for each input text field. tokenise_to_rows or tokens_to_rows addresses that, but neither feels very elegant. I'm afraid I don't have the solution, but there must be a name that captures that sense without overlapping with tidyr::unnest()?

Upcoming tokenizers change

tokenizers is about to change in a couple of ways with the next release:

  1. Substantial changes to the skip-ngram tokenizer;
  2. The introduction of an n_min field (which currently causes all sorts of interesting warnings when unnest_tokens is used).

I suspect we should sync up with @lmullen on release dates and conditions to make sure that we push out the next tidytext version at roundabouts the same time as the next tokenizers version, to avoid wailing and gnashing of teeth. Thoughts?

Fix main vignette

The chapters don't line up with the text for the last bit of the main vignette. Text needs to be edited.

Moving pair_count (and future related functions) to "widyr" package

I've started developing the widyr package (the docs and examples there are still rather primitive) that handles the common pattern of

tidydata -acast->
  matrix -some operation->
  matrix -melt->
  tidydata

One notable example of this pattern (what got me onto the topic) is pair_count. I'd thus like to deprecate tidytext's pair_count and move it there, and in the new package name it pairwise_count, just to avoid any future name clash. I think there are a variety of applications, and it would be easier to experiment and develop them out of the way of text mining applications, especially since tidytext is on CRAN and has active users.

Does this sound reasonable?
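
For context, a rough sketch of what the moved function could look like from the user's side, assuming the widyr API described above (pairwise_count() counting word pairs that share a section; the sectioning here is illustrative):

library(dplyr)
library(janeaustenr)
library(tidytext)
library(widyr)

# count pairs of words appearing within the same 10-line section
austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(section = row_number() %/% 10) %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(), by = "word") %>%
  pairwise_count(word, section, sort = TRUE)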

Rename stopwords to stop_words?

If either tm or quanteda is loaded after tidytext, their stopwords function can replace our dataset. In any case it means the same name appears across three text-mining packages.

Should we rename the stopwords dataset to stop_words to avoid this?

Casting to matrix with term constraints

First of all I want to thank you and David Robinson. tidytext and Text Mining in R are absolutely brilliant! I have a feature suggestion, which may or may not be within the scope of the package.

I've been working with the apartment rental listing data from this Kaggle competition. It contains text tags for each apartment listing. I've been using tidytext to analyse, transform and cast to DTMs.

My goal is to predict the interest level of a listing given its feature tags. The cast_ functions have been tremendous for creating a set of training data for training a model, but I have a challenge. The training data produces an n x m DTM - so there are m terms in the set. I train a model with this matrix, but the test data won't necessarily have the same m terms. I could include both in the dataset for casting and then separate them, but then I have to recast every time I have a new listing to predict.

It would be great to constrain the set of m terms that get cast to the DTM (kind of like the opposite of stop words). Then I could do something like:

allowed_terms <- c("bright", "hardwood", "cosy")

dtm <- get_some_text() %>%
  unnest_tokens(term, text) %>%
  count(document, term) %>%
  cast_sparse(document, term, n, allowed_terms = allowed_terms)

This would mean any new observation with a set of features could be standardised as input into the model (for example if a new observation contained none of the allowed terms, then it would be cast to a 1 x m matrix of zeroes).

I would be happy to work on this feature and make a pull request if you thought it was a reasonable feature to be included. Thanks again, your work is amazing!

Cheers,
DJ

Documentation for tokenizing by regex

We need an example and something in a vignette somewhere on how to tokenize using regex. Specifically I was asked how to get out @-mentions and hashtags from twitter data.
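
As a starting point, here is a minimal sketch (the tweet text is made up for illustration) that splits on whitespace via token = "regex", so "@" and "#" survive tokenization, and then keeps only mentions and hashtags:

library(dplyr)
library(stringr)
library(tidytext)

tweets <- tibble(
  id = 1:2,
  text = c("Loving #rstats and the @juliasilge tidytext package",
           "More text mining with #tidytext cc @dgrtwo")
)

# split on whitespace instead of word boundaries, then keep @-mentions and hashtags
tweets %>%
  unnest_tokens(word, text, token = "regex", pattern = "\\s+") %>%
  filter(str_detect(word, "^[@#]"))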

unnest_tokens drops attributes of its input data.frame

Is this intended behaviour? I note that tidyr::unnest also destroys attributes, though tidyr::nest preserves them (#272 at tidyr). This makes it harder to use unnest_tokens with subclasses of data.frame, since you then have to save the old attributes before calling unnest_tokens and restore them afterwards.

Custom tokenizer in unnest_tokens

Hi, thanks for developing this helpful package!

unnest_tokens seems to use tokenizers from the "tokenizers" package, but how about making it possible to use a tokenizer of the user's choice?

I'm from Japan, and the tokenizer for Japanese is totally different, so I want to use a custom tokenizer.

Internationalise stop_words

From #22 :

Include multilanguage options for stopwords. This is a useful thing with or without other internationalisation changes. So this would consist of:

  1. Expanding the stopword sets stored in the package
  2. Adding a parameter to stop_langs to select languages
  3. Adding a new function, stop_languages() that returns a set of which languages are supported by stop_words.

I will probably work on this over the next few days.

Add a few more tests?

I added codecov today and things look pretty decent. If we wanted to get a bit higher on test coverage eventually, it would be good to write some more tests for sparse_casters.R and some (one?) for dictionary_tidiers.R.

Here's the link showing details:
https://codecov.io/gh/juliasilge/tidytext

Contributions

Hi Julia!

I totally love your package and the way it handles text data. I'm working with topic models and was thinking of trying to start a tidytopic package for topic models using the same format. Is tidytext stable enough to build upon, would you say? Do you have any suggestions on the API to enable additional packages?

You can reach me on mons.magnusson at gmail dot com if you want to discuss more in details.

Make unnest_tokens() an S3 generic

... with the existing implementation as the default method.
This would allow alternative implementations, dispatched based on an additional item in the class() vector, while preserving use of the tidytext grammar.
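
A minimal sketch of the proposed shape (illustrative only, not the package's actual internals):

# the generic dispatches on the class of the incoming table
unnest_tokens <- function(tbl, output, input, ...) {
  UseMethod("unnest_tokens")
}

# the current implementation becomes the default method
unnest_tokens.default <- function(tbl, output, input, ...) {
  # existing tokenizing logic would live here
}

# other packages could then register methods for their own subclasses,
# e.g. a hypothetical "my_corpus_df" class
unnest_tokens.my_corpus_df <- function(tbl, output, input, ...) {
  # alternative implementation
}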

bind_tf_idf() returns NA for TF column when id column is integer

bind_tf_idf() iterates over the id column when calculating TF. However, it gets confused if id is numeric. This is due to how the underlying code for bind_tf_idf_() performs iteration in the following piece of code:

[...]
tbl$tf <- tbl[[n_col]]/as.numeric(doc_totals[documents])
[...]

I solved it by simply doing as.character() on documents before passing it as a selector to doc_totals, like so:

tbl$tf <- tbl[[n_col]]/as.numeric(doc_totals[as.character(documents)])

But I'm sure there are more elegant solutions.

It should perhaps be mentioned that I discovered this by sampling from a bigger dataset where the id column was indeed numeric. This is a very common thing in the databases I'm working with (SQL based).

Minimal reprex:

library(dplyr)
library(tidytext)

# Corpus data has an id column where max(id) > nrow(my_corpus)
my_corpus <- data_frame(
  id = rep(c(2, 3), each = 3),
  word = c("an", "interesting", "text", "a", "boring", "text"),
  n = c(1, 1, 3, 1, 2, 1)
)

# This results in NA values in the TF column for the second document, with id == 3
my_corpus %>% bind_tf_idf(word, id, n)

Also, thanks for a great package! It has greatly reduced my workload for a number of text analysis tasks.

Should we change "method =" to "unit =" in unnest_tokens

In unnest_tokens, we have method = "words", "sentences", etc. But really this should describe the unit we are tokenizing on. Maybe unit =, or something similar, is a better argument name?

(Sorry this was posted early, keyboard mishap)

missing LOUGHRAN lexicon?

Hello,

First of all, many thanks for this wonderful package that really helps processing text data better and more efficiently!

It seems I am unable to load the loughran lexicon. Packages are up to date.
What is the problem here?


library(tidytext)
library(tidyverse)


test = data_frame('text' = c('this is a nice test yay'))


# WORKS
test %>% unnest_tokens(word, text) %>% 
  anti_join(stop_words, by = "word") %>%
  inner_join(get_sentiments('bing'), by = "word") 

# A tibble: 2 × 2
   word sentiment
  <chr>     <chr>
1  nice  positive
2   yay  positive

# FAILS
test %>% unnest_tokens(word, text) %>% 
  anti_join(stop_words, by = "word") %>%
  inner_join(get_sentiments('loughran'), by = "word") 

Error in match.arg(lexicon) : 
  'arg' should be one of “afinn”, “bing”, “nrc”

Suggestion to add BM25 Score

I suggest to add a function to bind BM25 score (which is based on a probabilistic term weighting model). It is useful in some cases as it gives control over:

  • Term frequency saturation
  • Document/Field length normalization

It is commonly used as a ranking function by search engines.

I implemented a function bind_bm25 in the forked repo HERE

# bind_bm25 is given bare names -------------------

bind_bm25 <- function(tbl, term_col, document_col, n_col, k = 1.2, b = 1) {
  bind_bm25_(tbl,
               col_name(substitute(term_col)),
               col_name(substitute(document_col)),
               col_name(substitute(n_col)),
               k = k,
               b = b)
}

# bind_bm25_ is given strings -------------------------

bind_bm25_ <- function(tbl, term_col, document_col, n_col, k = 1.2, b = 1) {
  terms <- tbl[[term_col]]
  documents <- tbl[[document_col]]
  n <- tbl[[n_col]]

  doc_totals <- tapply(n, documents, sum)
  avg_dl <- mean(doc_totals)

  idf <- log(length(doc_totals) / table(terms))

  tbl$tf_bm25 <- ((k+1)*n)/(n+(k*((1-b)+b*(as.numeric(doc_totals[documents])/avg_dl))))
  tbl$idf <- as.numeric(idf[terms])
  tbl$bm25 <- tbl$tf_bm25 * tbl$idf

  tbl
}

Column Selector doesn't work

I have 4 columns in a tibble.
1st is the Date, then the other 3 are <chr> type.
When I try to create ngrams by specifying the actual column of "Text" that I am interested in, it always selects a different column. I have tried rearranging the column positions but it still doesn't work.
I ended up having to select only the Date and the Text column in order to create ngrams.
Here is the final code:
consumer_ngrams <- consumer_small %>%
  select(Date, Text) %>%
  mutate(feedback_no = row_number()) %>%
  unnest_tokens(ngram, Text, token = "ngrams", n = 2)

A "cooccur" function

I realize a common pattern is to look at how frequently words cooccur within a document. Right now the best way to do that is to create a document-term matrix and then multiply it by its own transpose, and even that has some gotchas that can catch you.

I realized there is a dplyr-like verb that can describe this, such that you could do books %>% cooccur(chapter, word) and get out a tidy dataset.

Ideally I'd like to make it work in the group_by() syntax but haven't gotten that to work yet, so here's some preliminary work that I'll continue on later

#' Count pairs of items that cooccur within a group
cooccur_ <- function(d, group_col, item_col, sort = FALSE) {
  requireNamespace("Matrix")
  sparse_mat <- cast_sparse_(d, group_col, item_col, value_col = 1)
  cooccur_mat <- t(sparse_mat) %*% sparse_mat

  ret <- tidy(cooccur_mat) %>%
    rename(value1 = row, value2 = column, count = value) %>%
    tbl_df()

  if (sort) {
    ret <- arrange(ret, desc(count))
  }
  ret
}

mtcars %>% cooccur_("am", "carb")

Stemming?

Do we want to implement a tidy interface to a stemming algorithm? I am aware of the one in SnowballC, which we are already importing via dependency.
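
For reference, a minimal sketch of what stemming looks like today with SnowballC::wordStem() inside a dplyr pipeline (not a tidytext verb):

library(dplyr)
library(tidytext)
library(SnowballC)

tibble(text = "running runners ran easily") %>%
  unnest_tokens(word, text) %>%
  mutate(stem = wordStem(word))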

Tf-idf calculation

Tf-idf (term frequency-inverse document frequency) can be calculated for a one-term-per-document-per-row tidy dataset using a few group_by and mutate statements, but we should add this as one or more verbs since it's a common operation.

One way I could imagine it would be a function d %>% tf_idf(term, document, n) that adds three columns:

  • tf for term frequency (n / total in doc)
  • idf for inverse document frequency (log(n_distinct(document) / # of docs with term))
  • tf_idf for the product of the above

There are variations that we'd have to explore and add options for
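
A minimal sketch of that group_by()/mutate() approach, assuming a hypothetical word_counts data frame with word, document, and n columns:

library(dplyr)

total_docs <- n_distinct(word_counts$document)

word_counts %>%
  group_by(document) %>%
  mutate(tf = n / sum(n)) %>%                               # term frequency
  group_by(word) %>%
  mutate(idf = log(total_docs / n_distinct(document))) %>%  # inverse document frequency
  ungroup() %>%
  mutate(tf_idf = tf * idf)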

Nicer deprecation?

The deprecation here isn't really deprecation, since that's defined as:

While a deprecated software feature remains in the software, its use may raise warning messages recommending alternative practices; deprecated status may also indicate the feature will be removed in the future. Features are deprecated rather than immediately removed, to provide backward compatibility and give programmers time to bring affected code into compliance with the new standard.

pairwise_count just refuses to be used; it's in effect entirely removed. Could I suggest either using .Deprecated and allowing it to continue running, going for .Defunct instead, or straight-up removing it?
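
For illustration, the .Deprecated() pattern suggested above would look roughly like this, applied here to pair_count() as an example (a sketch, not the package's actual code):

pair_count <- function(...) {
  .Deprecated(msg = "pair_count() is deprecated; please use widyr::pairwise_count() instead")
  # ...the existing implementation would continue to run here...
}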

Text Tidying tools for R?

Have you thought about putting together a list of commonly performed operations to clean text for analysis within tidytext? I know this is much more of a data cleansing task, but I think having some resources laid out, such as packages like qdap, textclean, and stringi/stringr, can be helpful when prepping text for analysis purposes. The data below is wine reviews from here: http://snap.stanford.edu/data/cellartracker.txt.gz, which has almost 2 million wine reviews. It is relatively large, but might even be worth using as an example within your book. Consider this example of a wine review from a raw txt file, and the task of removing HTML tags, HTML unicode, and key-value pairs of data:

wine/name: 2001 Le Clos du Caillou Côtes du Rhône Villages Tres Vieilles Vignes Reserve
wine/wineId: 3688
wine/variant: Red Rhone Blend
wine/year: 2001
review/points: 93
review/time: 1096243200
review/userId: 1
review/userName: Eric
review/text: This wine continues to wow me. The nose is sweet and herbal with garrigue pouring from the glass along with a laser-focused core of brambly raspberry and some duskier scents of licorice and, from the right glass, some roasted meat. On the palate this is impeccably balanced, pure and powerful. The wine seems to have closed up a tiny bit from prior tastings, yet it juggles well between a core of sweet fruit and some nice elements of minerality. There is a lush, silky, enveloping texture that just washes over the palate in a very inviting way. The wine finishes powerfully with some hints of sour cherry and lovely, dusty, tooth-coating tannins. Delicious!

I also decided to try an experiment to see which Riedel glass format would work best for the wine. I took out four glasses and filled each glass roughly 1/6th of the way up. (My wife wondered who else was coming over for dinner...) Generally speaking, the most significant difference was on the nose, but I was surprised at how much some of the differences actually carried onto the palate. It seems like that initial sip is heavily flavored by what you smell as you tip the glass, far more than I ever imagined:

  • Riedel Vinum Chianti/Zin: Excellent all-purpose glass. The nose was nicely balanced with the raspberry really coming out and nice waves of garrigue.
  • Riedel Vinum Syrah: Hands down the best glass. It was like a bigger, more intense, more focused version of the Chianti glass. The garrigue really comes forward, but the big, tall bowl and narrow nose seems to amplify the bouquet quite a bit. Also, complex elements of smoke and roasted meat come to life. Wow, this is almost like a different wine.
  • Riedel Vinum Bordeaux: This one was puzzling. For some reason the nose on this glass was the least focused and most diffuse. I could smell all the same elements as in the prior two glasses, but the wine literally seemed to wander. It didn't know what it wanted to be. I had to work to smell the same things, and even then it was like I had to convince myself of what I was smelling. The one word summary is diffuse.
  • Riedel Vinum Burgundy: This is the one glass that brings forward a sharp, almost painful note of alcohol. I really did not enjoy the wine from this glass.

I actually found this experiment a bit disturbing. I like my Riedel's a lot, but I never really expected the marketing hype to be as true to reality as I saw tonight. I have a bad feeling that in the future I will have more glassware to wash, as frankly I will likely try the same wine from a few glasses. It is highly educational!

DAY 2: Still going strong. Smooth as velvet, rich, deep and integrated. Excellent!

As much as I would love to start utilizing your package to analyze reviews like these, there is definitely more than one way to parse out the html tags, and convert unicode characters to make this file readable. However, it has been quite a challenge to find resources on tidying data like this for analysis work with tidytext.
Maybe adding a chapter to the tidy text mining book regarding text cleaning would be the best approach? Even a character representation such as &#244; can mess up the encoding of the file you want to read in, which gives file encoding errors in R. It might be worthwhile to have a brief section regarding file encoding and how it impacts text data for analysis. Not to sound harsh, but this is simply my feedback. I look forward to working with your package for my analysis!
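
As one concrete starting point, a minimal sketch (assuming the raw file keeps the key: value layout shown above; entity decoding and HTML tag stripping are left out) of splitting each line into a key/value pair with tidyr:

library(dplyr)
library(tidyr)

raw_lines <- tibble(line = c(
  "wine/name: 2001 Le Clos du Caillou Cotes du Rhone",
  "review/points: 93",
  "review/text: This wine continues to wow me."
))

# extra = "merge" keeps any further colons inside the value intact
raw_lines %>%
  separate(line, into = c("key", "value"), sep = ": ", extra = "merge")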

Broken unit test

So the dictionary tidier is now broken. I'm not sure where the change happened (it's either in quanteda or broom; I suspect quanteda), but input values must be named. This means that, starting from the unit test's position of 'a list of named elements':

  1. If you throw in an (unnamed) list, it breaks
  2. If you throw in a (named) list, it works but then flubs up with the test expectations, because the output looks different
  3. If you throw in two named vectors, it works and tidy then rejects it.

@dgrtwo any idea? I've got this patch I'm reworking for Andy and this is messing with it some :/.

purrr >= 0.1.1 is required

(tidytext is an awesome package, thanks for that!)

When purrr 0.1.0 is installed, tidytext loads fine, but tidy.Corpus does not work, since purrr::transpose is not found. It was (see here) introduced only with version 0.1.1 of purrr -- so this version requirement should go into the DESCRIPTION of tidytext, I think.

Installation Error

install_github("juliasilge/tidytext")
Downloading GitHub repo juliasilge/tidytext@master
from URL https://api.github.com/repos/juliasilge/tidytext/zipball/master
Installing tidytext
--2016-11-15 12:43:46-- https://cran.rstudio.com/src/contrib/tokenizers_0.1.4.tar.gz
Resolving cran.rstudio.com... 52.84.129.209
Connecting to cran.rstudio.com|52.84.129.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50453 (49K) [application/x-gzip]
Saving to: “/tmp/RtmpbgvQJi/tokenizers_0.1.4.tar.gz”

 0K .......... .......... .......... .......... ......... 100% 1.23M=0.04s

2016-11-15 12:43:46 (1.23 MB/s) - “/tmp/RtmpbgvQJi/tokenizers_0.1.4.tar.gz” saved [50453/50453]

Installing tokenizers
'/usr/lib64/R/bin/R' --no-site-file --no-environ --no-save --no-restore
--quiet CMD INSTALL '/tmp/RtmpbgvQJi/devtoolsb60b281702c/tokenizers'
--library='/home/R/x86_64-redhat-linux-gnu-library/3.2'
--install-tests

  • installing source package ‘tokenizers’ ...
    ** package ‘tokenizers’ successfully unpacked and MD5 sums checked
    ** libs
    g++ -m64 -std=c++0x -I/usr/include/R -I/usr/local/include -I"/home/R/x86_64-redhat-linux-gnu-library/3.2/Rcpp/include" -fpic -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -c RcppExports.cpp -o RcppExports.o
    g++ -m64 -std=c++0x -I/usr/include/R -I/usr/local/include -I"/home/R/x86_64-redhat-linux-gnu-library/3.2/Rcpp/include" -fpic -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -c shingle_ngrams.cpp -o shingle_ngrams.o
    shingle_ngrams.cpp: In function ‘Rcpp::CharacterVector generate_ngrams_internal(Rcpp::CharacterVector, uint32_t, uint32_t, std::tr1::unordered_set<std::basic_string<char, std::char_traits, std::allocator >, std::tr1::hash<std::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::basic_string<char, std::char_traits, std::allocator > > >&, std::vector<std::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::basic_string<char, std::char_traits, std::allocator > > >&, std::string)’:
    shingle_ngrams.cpp:28: error: expected initializer before ‘:’ token
    shingle_ngrams.cpp:35: error: expected primary-expression before ‘ngram_out_len’
    shingle_ngrams.cpp:35: error: expected ‘)’ before ‘ngram_out_len’
    shingle_ngrams.cpp:35: error: ‘ngram_out_len’ was not declared in this scope
    shingle_ngrams.cpp:36: error: ‘ngram_out_len’ was not declared in this scope
    shingle_ngrams.cpp:44: error: ‘len’ was not declared in this scope
    shingle_ngrams.cpp: In function ‘Rcpp::ListOf<Rcpp::Vector<16, Rcpp::PreserveStorage> > generate_ngrams_batch(Rcpp::ListOf<const Rcpp::Vector<16, Rcpp::PreserveStorage> >, uint32_t, uint32_t, Rcpp::CharacterVector, Rcpp::String)’:
    shingle_ngrams.cpp:80: error: expected initializer before ‘:’ token
    shingle_ngrams.cpp:83: error: expected primary-expression before ‘for’
    shingle_ngrams.cpp:83: error: expected ‘;’ before ‘for’
    shingle_ngrams.cpp:83: error: expected primary-expression before ‘for’
    shingle_ngrams.cpp:83: error: expected ‘)’ before ‘for’
    make: *** [shingle_ngrams.o] Error 1
    ERROR: compilation failed for package ‘tokenizers’
  • removing ‘/home/R/x86_64-redhat-linux-gnu-library/3.2/tokenizers’
    Error: Command failed (1)

sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] devtools_1.12.0

loaded via a namespace (and not attached):
[1] httr_1.2.1 R6_2.2.0 tools_3.2.3 withr_1.0.2 curl_2.2
[6] memoise_1.0.0 knitr_1.11 git2r_0.15.0 digest_0.6.10

Thanks!

sentence tokenizer not consistent

The sentence tokenizer isn't consistent in breaking the text into sentences. From what I see, it sometimes fails to break the text at periods (.).

to_lower parameter in unnest_tokens insufficient for ngrams/skip_ngrams

A quick bug report:

In unnest_tokens, to_lower = FALSE is insufficient for token = 'ngrams' and 'skip_ngrams' without also specifying the accompanying parameter (lowercase = FALSE) for the function called from tokenizers. Reprex:

library(tidytext)

df <- data.frame(text = "Here's some sample text. Hello world!")

# all lowercase
unnest_tokens(df, ngram, text, token = 'ngrams', to_lower = FALSE)
#> # A tibble: 4 × 1
#>                ngram
#>                <chr>
#> 1 here's some sample
#> 2   some sample text
#> 3  sample text hello
#> 4   text hello world

# keeps case
unnest_tokens(df, ngram, text, token = 'ngrams', to_lower = FALSE, lowercase = FALSE)
#> # A tibble: 4 × 1
#>                ngram
#>                <chr>
#> 1 Here's some sample
#> 2   some sample text
#> 3  sample text Hello
#> 4   text Hello world

# keeps case
unnest_tokens(df, ngram, text, token = 'skip_ngrams', to_lower = FALSE, lowercase = FALSE)
#> # A tibble: 6 × 1
#>                 ngram
#>                 <chr>
#> 1 Here's sample Hello
#> 2     some text world
#> 3  Here's some sample
#> 4    some sample text
#> 5   sample text Hello
#> 6    text Hello world

Specifying lowercase = FALSE without to_lower = FALSE gives all-lowercase output as well.

The other options for token besides "ngram" and "skip_ngram" are ok (well, 'words' and 'characters' are mysteriously erroring out, but that's a different issue).

unnest_tokens doesn't work for data.tables

Repro:

library(data.table)
library(tidytext)

d <- data.table(a = c("hello world", "goodbye world"))
unnest_tokens(d, word, a)

Error:

Error in `[[<-.data.frame`(`*tmp*`, output_col, value = c("hello", "world",  : 
  replacement has 4 rows, data has 0

could not find function "get_sentiments"

require(tidytext)
ap_td <- tidy(AssociatedPress)
ap_td
ap_sentiments <- ap_td %>%
  inner_join(get_sentiments("afinn"), by = c(term = "word"))

I see this:

Error in is.data.frame(y) : could not find function "get_sentiments"

Adding support for latent semantic analysis

I think it would be fairly easy to add support for the lsa package to tidytext and broom. See example below.

# Put some docs in a vector
library("dplyr")
doc1 <- c("pets dog cat ferret")
doc2 <- c("sandwiches turkey ham")
doc3 <- c("cat ferret cat bird")
doc4 <- c("turkey beef sandwiches")
myvector <- c(doc1,doc2,doc3,doc4)
mydf <- data_frame(id = 1:4, text = myvector)

# Create a corpus
library("quanteda")
mycorpus <- corpus(mydf, text_field = "text")
mytokens <- tokens(mycorpus)
mydfm <- dfm(mytokens)

# Perform LSA
mytdm <- convert(mydfm, to = "lsa")
mytdm_weighted = lw_logtf(mytdm) * gw_idf(mytdm)
myLSAspace = lsa(mytdm_weighted, dims=2)

# Here's how broom::augment could add 
# factor scores back to the original data frame
factor_scores <- as_data_frame(myLSAspace$dk)
(augmented <- bind_cols(mydf, factor_scores))

# Here's how tidytext:tidy could tidy the factor loadings
library("tidyverse")
# as.data.frame is used to maintain row names until
# rownames_to_column can get them
loadings_tidy <- as.data.frame(myLSAspace$tk) %>%
  rownames_to_column() %>%
  rename(term = rowname) %>%
  gather(factor, loading, # The new variables.
         starts_with("V"), # These go into "loading".
         -term) %>%  # term is not "gathered".
  arrange(factor, desc(loading)) %>% # Sort
  select(factor, term, loading) # Change var order to enhance readablity.

print(loadings_tidy)

internationalization

Are non-English languages in scope for tidytext?

I just discovered tidytext (looks great!), and was trying to follow along with the code from the readme with my own dataset, which is in Dutch. stop_words currently only has English words, so I had to improvise with stopwords() from the tm package.

If you're interested in supporting other languages, would you want to add a language field to stop_words, or might it be a better idea to depend on packages like tm that already have stopword lists, and generate the data frame on the fly?
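
For what it's worth, a minimal sketch of multilingual stop word handling through get_stopwords() (which draws on the stopwords package), assuming Dutch is available from the Snowball source; dutch_words is a hypothetical one-word-per-row data frame:

library(dplyr)
library(tidytext)

# Dutch stop words from the Snowball source
get_stopwords(language = "nl", source = "snowball")

# remove them from a tidy Dutch-language data frame
dutch_words %>%
  anti_join(get_stopwords(language = "nl"), by = "word")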

tidy.corpus not robust

As quanteda will soon re-implement the innards of its corpus class object, the methods in tidy.corpus are likely to break soon. The code should be reimplemented using the stable accessor functions.

Finish topic modeling vignette

I'm just thinking through what needs to be done before the next release to CRAN, submitting to JOSS, etc and this is one of those things. It looks awesome so far.

Sentiment data frame

Before the new release, I've been concerned about something. I use code like this a lot in my analyses, often as the only setup needed before a long text-mining pipe:

AFINN <- sentiments %>%
  filter(lexicon == "AFINN") %>%
  select(word, score)

When we were originally creating the sentiments data frame, we wanted to combine them all in one tidy structure in case someone wanted to do a cross-lexicon analysis. That's still a plausible use case, but overwhelmingly often I choose just one lexicon to use in an analysis. I think we should support that case more easily.

I think we should either, in rough descending order of my preference:

  1. Have a function get_sentiments that supports get_sentiments("AFINN"), get_sentiments("bing"), etc., and pre-performs those steps
  2. Have functions get_sentiments_afinn(), get_sentiments_bing(), etc
  3. Have them stored as separate datasets (which does have package size implications if we also include the full dataset, and reverse compatibility implications if we don't)

I think it's important to get these out in this week's CRAN release, and not hard to write (I'm happy to do it)
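
A minimal sketch of option 1, filtering the combined sentiments data frame by lexicon (illustrative only; a real implementation would also handle the lexicons' differing columns):

library(dplyr)

get_sentiments <- function(lexicon = c("afinn", "bing", "nrc")) {
  chosen <- match.arg(lexicon)
  tidytext::sentiments %>%
    filter(tolower(lexicon) == chosen)
}

# usage: get_sentiments("afinn"), get_sentiments("bing"), ...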

get_sentiments("loughran")

I am working through your "Text Mining with R" and I am getting a 'bug' with the Loughran and McDonald sentiment lexicon in the "get_sentiments" function. Seems like the "loughran" lexicon is not recognized as an option in the get_sentiments function. I see you appear to be working on the matter, but I can't determine if you have 'solved' the issue. How can I incorporate the Loughran and McDonald lexicon in the get_sentiments function?

Thank you for writing such a cool tool and book! Just excellent!

Weighted Term Frequency

Thanks a lot for the package and the book, really enjoying them.

My recommendation here does not require any new features or coding; it is simply about demonstrating a powerful way to use unnest_tokens() in a special case, when each document in a corpus comes with a certain metric.

A simple example to illustrate my point. Assume we have a report for a set of keywords (the documents) with the impressions each generated:

library(tibble)
news_keywords <- data_frame(
  keyword = c("us news", "us elections", "china news", "china updates", "latest china news", "china newspaper"),
  impressions = c(500, 500, 100, 100, 100, 100)
)
news_keywords


#>  A tibble: 6 × 2
#>             keyword  impressions
#>                <chr>       <dbl>
#> 1           us news         500
#> 2      us elections         500
#> 3        china news         100
#> 4     china updates         100
#> 5 latest china news         100
#> 6   china newspaper         100

Now we separate the words with tidytext::unnest_tokens()

library(tidytext)
library(magrittr)
news_words <- news_keywords %>% 
  unnest_tokens(output = words, input = keyword) 

Counting the frequency

library(dplyr)  
news_words %>% count(words, sort = TRUE)
#>  A tibble: 7 × 2
#>       words     n
#>       <chr> <int>
#> 1     china     4
#> 2      news     3
#> 3        us     2
#> 4 elections     1
#> 5    latest     1
#> 6 newspaper     1
#> 7   updates     1

We can see that the most frequently used words are "china", "news", then "us".

Now let's take a look at the weighted frequency

news_words %>%   
  group_by(words) %>%   
  summarise(impressions = sum(impressions)) %>%   
  arrange(desc(impressions))

#>  A tibble: 7 × 2
#>        words impressions
#>        <chr>       <dbl>
#> 1        us        1000
#> 2      news         700
#> 3 elections         500
#> 4     china         400
#> 5    latest         100
#> 6 newspaper         100
#> 7   updates         100

Here, we see a completely different view, where "us" is the most frequent (by number of impressions) and "china" is in fourth position.
While "china" occurs more often in the keywords, people search more often for keywords that include "us", so "us" matters more in terms of searches.

Another simple example: let's say the most frequent word you use in your tweets is "dplyr", but when you take into consideration the number of impressions of your tweets, you might find that the tweets that generated the most impressions were the ones with "ggplot" in them.

Possible use cases:

Text                          Metrics
search engine keywords        impressions, clicks, conversions
tweet text                    impressions, engagements, retweets
web page titles               pageviews, avg. session duration
song* / movie / book title    downloads, sales, likes

If you think this is useful, I'd be more than happy to expand, and share more detailed examples.

'* might work with your post https://www.washingtonpost.com/news/wonk/wp/2016/10/01/the-states-that-americans-sing-about-most/?postshare=5471475343908445&tid=ss_tw

Difference in n-grams between tidytext and tm?

Hi, I was using tidytext for a project where I was trying to build a text prediction algorithm. This involves tokenizing text into n-grams, and I noticed a difference between the results using tidytext vs. using the tm package. Basically, it seems that if you have multiple lines of text, tm computes n-grams for each line separately, whereas tidytext computes n-grams for all the lines pasted together (so you get n-grams that span lines). I'm not sure if there is a 'correct' method. If your text is like the tidytext examples from gutenbergr, it makes sense, because each line of text appears in order in the book. But in my case, each line was a different article or tweet, for example, so an n-gram spanning lines wouldn't make sense (because those words never actually occur together in that order). But I thought it might be worth pointing out in a vignette or something, since people might assume you get the same results as with tm. I made a little example of this in the attached file.

tidytextVStm.pdf

Qualifiers dataset(s)

Would the addition of datasets of sentiment qualifiers, like negations and adverbs, be of interest?
e.g.

negations
-------
cannot
could not
did not
does not
had no
have no
may not
never
no
not
nothing
was no
was not
will not
would not
ain't
aint
can't
cant
couldn't
didn't
doesn't
don't
dont
hasn't
haven't
won't

Compiled from datasets available from Sentiment Composition Lexicons

Error using cast_dtm command

I used tidytext to parse support case subjects into words, then grouped by case number and word to count. I ended up with a tibble that looks like this sample:

A tibble: 38,923 x 3
  CaseNumber word         n
1   20695703 backup       1
2   20695703 catalogs     1

When I try to execute subject.freq %>% cast_dtm(CaseNumber, word, n) I get this error

Error in as.lazy_dots(.dots) :
argument ".dots" is missing, with no default

I downloaded the latest dev version of tidytext but that does not alleviate the issue. Any help is appreciated.

On Twitter as @NevilleLeoniers

tidytext::get_sentiments() fails if tidytext not loaded

To reproduce,

  1. Don't call library(tidytext).
  2. Call tidytext::get_sentiments().

The problem is that sentiments doesn't get loaded if tidytext isn't on the search path.

I think the fix is to add

data(list = "sentiments", package = "tidytext")

somewhere near the start of get_sentiments().

You might also need to specify the envir argument to data().

unnest_tokens when token = "characters" removes spaces

This is definitely not a bug, but I've run into a situation where I split up a document into characters using unnest_tokens, but now I need to know whether any given character is at the beginning of a word or not. The easiest way to find out would be if I were able to ask the tokenizer to preserve spaces between characters and then I could check if lag(token) is a space. Is that a possibility?
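
A minimal workaround sketch (not a built-in tidytext option): pass a custom tokenizer that keeps the spaces, then flag characters that follow a space:

library(dplyr)
library(tidytext)

d <- tibble(text = "hello world again")

d %>%
  unnest_tokens(char, text,
                token = function(x) strsplit(x, "", fixed = TRUE),
                to_lower = FALSE) %>%
  mutate(word_start = lag(char, default = " ") == " " & char != " ")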
