dgrtwo / tidy-text-mining Goto Github PK

View Code? Open in Web Editor NEW

1.3K 133.0 804.0 86.8 MB

Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson

Home Page: http://tidytextmining.com

License: Other

TeX 55.63% CSS 25.60% R 15.14% HTML 3.63%

book text-mining tidyverse bookdown r

tidy-text-mining's Introduction

This is the repo for the book Text Mining with R: A Tidy Approach, by Julia Silge and David Robinson.

Please note that this work is written under a Contributor Code of Conduct and released under a CC-BY-NC-SA license. By participating in this project (for example, by submitting a pull request with suggestions or edits) you agree to abide by its terms.

tidy-text-mining's People

Contributors

Stargazers

Watchers

Forkers

wavelets sibusiso16 fhoces ellzud dsdesrosiers getfetcher stat-r prasants ekstroem drsimonj shrektan alanponce rcirci jiunnguo ronaldyu barkleybg yzharold hendrik147 mathias3 antibioticbook snowdj waqasm86 merico34 ainilaha lgatto rlugojr libardo1 elmargb nabie nodp53 ivanchan0928 denotepython shazhe y44k0v poldham hbcbh1999 cuixueqin jirlong wenjinsun danklotz atharkharal fototo huangrh georggr hieuqtran anthiskon tizemrane richardpilbery esuene malleeblue juampatronics jlopezsi fyjgreatlion ruialv epsimatic88 calchad laventura dafineinc itsjeffreyy bradleyboehmke sdusa maiwennian irichgreen surajitkundu19 worldmovers sakebite jonmcalder datalulu struggle-yd archicxd wanjinchang morizen tpopenfoose eyangs opinonfischer sanjogar francescoalb milkycx jpalbino wmhdamon jonathanjowen ahmedmahdy95 mitchellakeba ybj2004 uraboer gezgoing liuxiaodong20090512 vanly233 chojay katrina03691 peter-john88 jeun0241 isusquiza rian39 klperez adamkski ttorpin zhaoxiaohe stepthom amitas13

tidy-text-mining's Issues

tweets_dave.csv

tweets_dave.csv is in wrong format

NASA data.json has changed

The data.json made available by NASA has changed its schema so we likely want to update the analysis at some point.

Broken Code in Section 5.3.1

The code scraping in section 5.3.1 no longer works as most of the code in the package tm.plugin.webmining is not up-to-date.

I tried switching the GoogleFinanceSource to YahooFinanceSource but that did not work either.

I am sure there are alternatives, but I figured it is best reported here first.

Code issue?

I was just browsing the online version of the book and noticed this block of code throughout the book at the top of each chapter- just wanted to make sure you were aware in case it was a weird Travis artifact that didn't show up locally:

Zipf's law

Add a discussion of Zipf's law in Chapter 4 (or maybe 2)

Travis + gutenbergr

The latest commit I just pushed failed because of our old friend, a 403 HTTP error from Project Gutenberg. Do we want to do the quasi-cheater thing we did with the vignettes in tidytext to get this to build on Travis? Store these things as data objects in data/ so that we don't have to actually call Project Gutenberg from Travis?

Broken code in 9.1

This code in section 9.1 cannot work:

raw_text <- tibble(folder = dir(training_folder, full.names = TRUE)) %>%
  unnest(map(folder, read_folder)) %>%
  transmute(newsgroup = basename(folder), id, text)

unnest expects a data frame for its data argument (source), and map returns a list (source).

Anyone using the code will get this error: Error: map(.$folder, read_folder) must evaluate to column positions or names, not a list.

Also, unnest requires specification of the cols argument, and that is not specified in the code.

Figure 4.2 caption: replace "followed" by "preceded"

The current caption for Figure 4.2 is:

Figure 4.2: The 20 words followed by ‘not’ that had the greatest contribution to sentiment scores, in either a positive or negative direction

In fact, 'not' is in the word1 column of not_words, whereas the words listed in Figure 4.2 are in the word2 column. Therefore the caption should read:

Figure 4.2: The 20 words preceded by ‘not’ that had the greatest contribution to sentiment scores, in either a positive or negative direction

Clarify "confusion matrix" in Chapter 7

As pointed out in #8, the discussion of which words are assigned to which books by the topic modeling is not clear enough around the "confusion matrix". Let's add a more developed discussion so that readers understand what we're getting at here.

suggestion: stem before some analyses

For some of the analyses in the book, it's better to stem the words first. For example, in the analysis of inauguration speeches in chapter 6, it makes more sense to group together words like job/jobs, union/unions, constitution/constitutions, etc. before tf-idf calculation and frequency time series plot.

I understand that stemming is not integrated in the tidytext package for a good reason (juliasilge/tidytext#17). Perhaps that's why you try to avoid stemming in the book?

Ch. 2 Looking at Units Beyond Just Words

In the last part of the code (line 301 to 316) I am not able to get the correct result stated in the book, a tibble: 6 x 5. My result is a tibble: 0 x 5. I have copied the code exactly from github to R studio and can't seem to find the issue.

Mismatch between plot (Einstein) and text ("Gif")

Under a plot in Chapter 4, text says:

One thing we see here is “gif” in the Einstein text?!

However, plot above does not have "gif" listed as a word.

tidy() error: cannot coerce class to a data.frame

Loving this tutorial. Everything works fine until 6.1.2: When I run
ap_documents <- tidy(ap_lda, matrix = "gamma")
I get:

Error in as.data.frame.default(x) : 
  cannot coerce class "structure("LDA_VEM", package = "topicmodels")" to a data.frame
In addition: Warning message:
In tidy.default(ap_lda, matrix = "gamma") :
  No method for tidying an S3 object of class LDA_VEM , using as.data.frame

Full code:

library(topicmodels)
library(tidytext)
library(tm)
library(ggplot2)
library(tidyr)
library(dplyr)
library(broom)

data("AssociatedPress")
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))
ap_documents <- tidy(ap_lda, matrix = "gamma")

Ch. 6 reorder confusion in top terms plots

I spent some time today confused about the top terms plots in chapter 6 - Topic modeling (Fig 6.2 and Fig 6.4).

I assumed that mutate(term = reorder(term, beta)) would result in the bar plots, for each topic, being plotted in descending order of height, so that the viewer could quickly see the order of the most important words for each topic, but it does not. "new" in Fig 6.2 and "pip" in Fig 6.4 are in the "wrong" place.

I think this might be the intent of the code, although it appears @dgrtwo is aware of this behavior and has previously devised a solution for ordering factors this way. If this is not the intent, it is unclear to me why one would reorder the terms this way.

Using reorder_within would likely be too complex of a solution to address this minor issue, but maybe it would make sense to remove this line from the code? It either claims to order the terms in a way that that isn't reflected in the plot, or, if you are already familiar with how ggplot2 reorders factors in facetted plots, it reorders the terms for no obvious reason. With the line removed, the terms would be ordered alphabetically, which makes more sense than the way they are presented now.

small typo, fixed in repo, but not in published web version

In section 2.2 the web page reads "this index (using integer division) counts up sections of 100 lines of text", but in the github repo this has already been corrected to "this index (using integer division) counts up sections of 80 lines of text.". Might be a sign they are out of sync.

Chapter 4: "promise already under evaluation" error

Firstly, thanks for a great resource. I'm new to text mining, and am finding the text clear and enjoyable to work through.

When I try to run this code from chapter 4:

word_cors %>%
    filter(correlation > .15) %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
    geom_node_point(color = "lightblue", size = 5) +
    geom_node_text(aes(label = name), repel = TRUE) +
    theme_void()

I get the following error

Error in data.frame(x = xlim, y = ylim) :
promise already under evaluation: recursive default argument reference or earlier problems?

It seems to come from this line:

geom_node_text(aes(label = name), repel = TRUE)

but more than that I can't tell (I'm fairly new to R).

For what it's worth, I get the same error when building the HTML version of the book locally.

If it's out of your hands, please let me know and I'll take it up with whichever package maintainer you feel is responsible.

In case it helps, here are the details of what I have installed:

> devtools::session_info()
Session info ---------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.3 (2017-03-06)
 system   x86_64, mingw32             
 ui       RStudio (1.0.136)           
 language (EN)                        
 collate  English_United Kingdom.1252 
 tz       Australia/Sydney            
 date     2017-04-16                  

Packages -------------------------------------------------------------------------------------------------------------------------
 package     * version    date       source                                  
 assertthat    0.2.0      2017-04-11 CRAN (R 3.3.3)                          
 broom         0.4.2      2017-02-13 CRAN (R 3.3.3)                          
 colorspace    1.3-2      2016-12-14 CRAN (R 3.3.3)                          
 DBI           0.6-1      2017-04-01 CRAN (R 3.3.3)                          
 devtools      1.12.0     2016-06-24 CRAN (R 3.3.3)                          
 digest        0.6.12     2017-01-27 CRAN (R 3.3.3)                          
 dplyr       * 0.5.0      2016-06-24 CRAN (R 3.3.3)                          
 foreign       0.8-67     2016-09-13 CRAN (R 3.3.3)                          
 ggforce       0.1.1      2016-11-28 CRAN (R 3.3.3)                          
 ggplot2     * 2.2.1      2016-12-30 CRAN (R 3.3.3)                          
 ggraph      * 1.0.0      2017-04-16 Github (thomasp85/ggraph@0d099f3)       
 ggrepel       0.6.9      2017-03-24 Github (slowkow/ggrepel@d21b468)        
 gridExtra     2.2.1      2016-02-29 CRAN (R 3.3.3)                          
 gtable        0.2.0      2016-02-26 CRAN (R 3.3.3)                          
 gutenbergr  * 0.1.2.9000 2017-04-14 Github (ropenscilabs/gutenbergr@26f0639)
 hms           0.3        2016-11-22 CRAN (R 3.3.3)                          
 igraph      * 1.0.1      2015-06-26 CRAN (R 3.3.0)                          
 janeaustenr * 0.1.4      2016-10-26 CRAN (R 3.3.3)                          
 labeling      0.3        2014-08-23 CRAN (R 3.3.2)                          
 lattice       0.20-35    2017-03-25 CRAN (R 3.3.3)                          
 lazyeval      0.2.0      2016-06-12 CRAN (R 3.3.3)                          
 magrittr      1.5        2014-11-22 CRAN (R 3.3.3)                          
 MASS          7.3-45     2016-04-21 CRAN (R 3.3.3)                          
 Matrix        1.2-8      2017-01-20 CRAN (R 3.3.3)                          
 memoise       1.0.0      2016-01-29 CRAN (R 3.3.3)                          
 mnormt        1.5-5      2016-10-15 CRAN (R 3.3.2)                          
 munsell       0.4.3      2016-02-13 CRAN (R 3.3.3)                          
 nlme          3.1-131    2017-02-06 CRAN (R 3.3.3)                          
 plyr          1.8.4      2016-06-08 CRAN (R 3.3.3)                          
 psych         1.7.3.21   2017-03-22 CRAN (R 3.3.3)                          
 purrr         0.2.2      2016-06-18 CRAN (R 3.3.3)                          
 R6            2.2.0      2016-10-05 CRAN (R 3.3.3)                          
 Rcpp          0.12.10    2017-03-19 CRAN (R 3.3.3)                          
 readr         1.1.0      2017-03-22 CRAN (R 3.3.3)                          
 reshape2      1.4.2      2016-10-22 CRAN (R 3.3.3)                          
 scales        0.4.1      2016-11-09 CRAN (R 3.3.3)                          
 SnowballC     0.5.1      2014-08-09 CRAN (R 3.3.2)                          
 stringi       1.1.5      2017-04-07 CRAN (R 3.3.3)                          
 stringr     * 1.2.0      2017-02-18 CRAN (R 3.3.3)                          
 tibble        1.3.0      2017-04-01 CRAN (R 3.3.3)                          
 tidyr       * 0.6.1      2017-01-10 CRAN (R 3.3.3)                          
 tidytext    * 0.1.2.900  2017-04-14 Github (juliasilge/tidytext@d4e3702)    
 tokenizers    0.1.4      2016-08-29 CRAN (R 3.3.3)                          
 tweenr        0.1.5      2016-10-10 CRAN (R 3.3.3)                          
 udunits2      0.13       2016-11-17 CRAN (R 3.3.2)                          
 units         0.4-3      2017-03-25 CRAN (R 3.3.3)                          
 viridis       0.4.0      2017-03-27 CRAN (R 3.3.3)                          
 viridisLite   0.2.0      2017-03-24 CRAN (R 3.3.3)                          
 widyr       * 0.0.0.9000 2017-04-13 Github (dgrtwo/widyr@58a1d2d)           
 withr         1.0.2      2016-06-20 CRAN (R 3.3.3)

Some questions regarding Zipf 's Law

I tried the code from http://tidytextmining.com/tfidf.html.

My question is: How can I rewrite the code to produce the reverse relationship between the log of term frequency and the log of rank?
The following is the term-document matrix. Any comments are highly appreciated.
I plot the graph and it does look like this one in this thread.
https://i.stack.imgur.com/j2CTf.jpg
Thank you

 # Zipf 's law

 freq_rk < -DTM_words %>%
  group_by(document) %>% 
  mutate(rank=row_number(),
  'term_frequency'=count/total)

   freq_rk %>%
   ggplot(aes(rank,term_frequency,color=document)) +
   geom_line(size=1.2,alpha=0.8)


   DTM_words
   # A tibble: 4,530 x 5
     document       term count     n total
      <chr>      <chr> <dbl> <int> <dbl>
1        1      activ     1     1   109
2        1 agencydebt     1     1   109
3        1     assess     1     1   109
4        1      avail     1     1   109
5        1     balanc     2     1   109
# ... with 4,520 more rows

Ch. 5 Typo: Fig 5-2 Caption

The caption of figure 5-2 states it's computed as the product of the word's AFINN sentiment score and its frequency when it's actually using the Bing sentiment and counting the number of negative or positive sentiments a word has contributed. It does not compute the magnitude of those sentiments * frequency for each word.

Need load scales library for Figure 6.6.

Hi, I think we need to load library(scales) for plotting Figure 6.6 (confusion matrix).

When I ran the current code, the error is

Error in check_breaks_labels(breaks, labels) : object 'percent_format' not found

Since percent_format is a function inside the scales library,
scale_fill_gradient2(high='red', labels = percent_format())
should be corrected as
scale_fill_gradient2(high='red', labels = scale::percent_format())
or
library(sclaes)
scale_fill_gradient2(high='red', labels = percent_format())

Ch 7: some code seems inconsistent with text

Some code seems inconsistent with its description:

word_ratios <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  count(word, person) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  spread(person, n, fill = 0) %>%
  mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
  # should it be: mutate_if(is.numeric, funs((. + 1) / (sum(.) + 1)))
  mutate(logratio = log(David / Julia)) %>%
  arrange(desc(logratio))

words_by_time <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  mutate(time_floor = floor_date(timestamp, unit = "1 month")) %>%
  count(time_floor, person, word) %>%
  group_by(person, time_floor) %>%
  mutate(time_total = sum(n)) %>%
  group_by(word) %>%
  # should it be: group_by(person, word)
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 30)

totals <- tidy_tweets %>%
  group_by(person, id) %>%
  summarise(rts = sum(retweets)) %>%
  # should it be: summarise(rts = first(retweets))
  group_by(person) %>%
  summarise(total_rts = sum(rts))

totals <- tidy_tweets %>%
  group_by(person, id) %>%
  summarise(favs = sum(favorites)) %>%
  # should it be: summarise(favs = first(favorites))
  group_by(person) %>%
  summarise(total_favs = sum(favs))

Besides, while I just ran the same code, the output I got is different from the one on the website.

Thank you for any help you can provide.

tidy(mallet_model) gives jobjRef error

Thank you very much for this useful book and examples. I have been applying the code to my own set of data but each time I try to obtain the data from the mallet topic.model it gives an error as follows:

Error in as.data.frame.default(x) :
cannot coerce class "structure("jobjRef", package = "rJava")" to a data.frame
In addition: Warning message:
In tidy.default(topic.model) :
No method for tidying an S3 object of class jobjRef , using as.data.frame

Would you have any suggestions on how to fix this issue?
Thanks

Change in cast_sparse and dfm?

When I am building the book locally, I see a difference in how Chapter 6 (casting/tidying) builds between my desktop, which has the development version of tidytext, and my laptop, which has the CRAN version. Travis is currently building with the CRAN version, but we are planning on doing a release ASAP, obviously.

The chapter errors when it tries to cast on the dfm example:

# cast into quanteda's dfm
ap_td %>%
  cast_dfm(term, document, count)

Here's the error:

Quitting from lines 137-144 (06-document-term-matrices.Rmd) 
Error in UseMethod("ndoc") : 
  no applicable method for 'ndoc' applied to an object of class "dfmSparse"
Calls: local ... <Anonymous> -> print -> print -> .local -> cat -> format -> ndoc
Execution halted

The very annoying thing is that I cannot reproduce the error when I run this all interactively; I only get the error when building the book. If I have the file open and say "Run All Chunks Above", then step through the chunk that contains this code, no problems.

Chapter 2 - bing _and_nrc

Hi,
Working from the web version of your book, and looking forward to getting a 'real' copy when it is out.

In chapter 2, when you are doing sentiment analysis on pride and prejudice, you create a joint bing_and_nrc vector, but the code seems to go awry at the end of the line :

mutate(method="Bing et al."),
because of the ',' on the end (screenshot of code also attached).

I have managed to get a working version by separating the two variables, but can see your version is more elegant. How should this code look?

Thanks,

Add discussion of tokens/tokenizing

As pointed out on Twitter, it's probably a good idea to be more explicit in Chapter 2 about what a token is and what we mean by tokenizing.

Sentiment lexicons have changed

In 02-sentiment-analysis.Rmd, the get_sentiments(...) function barks at you to use textdata. Once you have that, you must work through some interactive prompts the first time you use each lexicon. That is not reflected in the text.

line / linenumber column name consistency in examples

In the README for tidytext the column name for line numbers is called linenumbers:

https://github.com/juliasilge/tidytext

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup()

In a number of examples in the book the column is instead called line:

https://www.tidytextmining.com/tidytext.html

library(dplyr)
text_df <- data_frame(line = 1:4, text = text)

text_df

This adds a tiny bit of friction to copying and pasting examples between the two resources. If there's interest I'll happily do the grunt work to make a pull request against the two to make them all consistent.

Hopefully you don't think this is a nitpick! tidytext is awesome and I really enjoy the book! I've observed some new R users get stuck when working with the text because of this issue, is all 🙂

chapter 3 tf-idf

Dear Julia,

Nice to meet you! I am a newer to R world. I am Wu Yusheng. Firstly thank you and David very much for the excellent book Tidy Text Mining with R. Very excellent! Amazing!
I have one question. When I typing by myself for your code in chapter 3 tf-idf, I found the code error in RStudio: Error in log(x, base) : non-numeric argument to mathematical function
as below showing:
freq_by_rank %>%
ggplot(aes(rank, 'term frequency', color = book)) +
geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10()
Error in log(x, base) : non-numeric argument to mathematical function

I did not know why it was. When I search in github for your article, I make a try to copy your code from github and paste into my RStudio, then it works! But the codes from you looks exactly the same to mine.
Finally I find the reason: your term frequency is different with my 'term frequency'.

Could you please find deeper reason for me? I think some R packages can't read ' ' but

. I mean the single quotation mark.

Again thank you very much for the excellent book!

Best Regards
Wu Yusheng

Incorrect order in Figure 5.3 due to duplicate terms

In Figure 5.3, the order of the term "-" is incorrect. I think this is because both document 1961-Kennedy and 2009-Obama have term "-", when calling reorder(term, tf_idf), the function calculate the mean of the two tf-idf's.

By the way, I got some questions when reading the book:

The caption of Figure 5.4 says "[...] for four selected terms", but there are six terms in the figure.
In the first paragraph of Chapter 5.3.1, I guess the rightmost ) in "For instance, performing WebCorpus(GoogleFinanceSource("NASDAQ:MSFT"))) allows us [...]" is a typo?
The caption of Figure 4.4 and Figure 4.5 says "Common bigrams in Pride and Prejudice," but we doesn't filter the book Pride and Prejudice beforehand. I think these bigrams come from all of Austen's novels.

Thank you so much. I really like the book 😃

explain `index = linenumber %/% 100` in sentiment analysis chapter

I realize this may be more of a dplyr thing and outside the scope of your book, so please feel free to close and ignore.

I'm not sure I grok what's happening with the count(book, index = linenumber %/% 100, sentiment) and later the count(method, index = linenumber %/% 80, sentiment) %>% lines in the sentiment analysis chapter. A sentence or so explanation of what this is doing would be welcome!

tidy_tweets - invalid argument type

First, congratulations, I'm loving this book! Great work.

Now, when I run the chunk that will unnest_tokens with the regex on tweets to get tidy_tweets
(in 07-tweet-archives.Rmd ), I have an "Error: invalid argument type".

Any idea why?

Thanks!

Evaluation Error in 06. topic models

Hi Julia. I'm getting the following error when trying to run your code in lines 73-88. Actually, lines 73 to 80 run fine. But when I run 81-88 the following error appears (with no plot):

Error in mutate_impl(.data, dots) :
Evaluation error: could not find function "reorder_within".

Any help will be greatly appreciated.
And thanks for all your hard work and sharing.

Typo in chapter 2

Agnes Grey has gutenberg_id 767, not 766. (Chapter 2)

add the pdf/ebook to the repo/website

Hi, this might seems like a trivial request that I could easily do by myself (it is even commented in the code for the pdf part and this function would do the ebook).
However, I believe that it might be useful to have it for people to download on the index of the website as many people like to read the book "offline" in the subway or else.
PS: not really an issue but an enhancement suggestion
PS2: I really like the book. If I had another suggestion to say it would be maybe to include some more advanced methods or at least literature/package to go forward (I am a big fan of the qdap package, also sentimentr::highlight is a neat tool that could come handy for the book at some point).

comparison.cloud

The use of comparison.cloud at the end of chapter 3 is a bit of misleading. The size of a word is in proportion to the relative frequency in the word's corresponding group, positive or negative. The graph does show the most common positive and negative words in Jane Austen’s works, but it easily misleads the viewers to think that the size of a word is relative to the whole positive & negative word count so that the visualization can be used to infer the average sentiment of Austen's full works.

In my opinion, using different colors for positive and negative words on all Austen's work would be better. (In the Austen's case, the size change may not be obvious but in other case it could be significant.)

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  mutate(color = (sentiment == "positive") + 1) %>%
  with(wordcloud(word, n, max.words = 100, colors = color, ordered.colors=T))

Chapter 3 idf formula formatting issue

The formula for calculating idf is not rendering properly on the tidytextmining site:

I ran the 03-tf-idf.Rmd file locally and had no issues, so I am not sure what is going on.

Set up auto-compiling on Travis to GitHub Pages

I'd like to get the current HTML up on GitHub pages; it would be a great way to share work in progress and to review what most needs to be done. We could also have Travis build it automatically upon pushing.

There's a guide to GitHub + Travis here: I started doing this on the gh-pages branch but honestly got a little turned around. Putting Issue here as a suggestion to either of us when time and interest arises.

(Right now it is technically visible at the awful URL of http://varianceexplained.org/tidy-text-mining/_book/intro.html with zero CSS, which- yuck!)

Exploring removed stop words

Have you considered incorporation exploration into the words that gets removed when you remove stop words?

It is similar to looking at the words in the stop words list (which you always should) but a more limited and reasonable approach since you are only looked at the affected words.

library(tidyverse)
library(tidytext)
library(janeaustenr)

data <- tibble(text = emma) %>%
  unnest_tokens(word, text)

## This step would be added

right_join(data, stop_words, by = "word") %>%
  count(word, sort = TRUE)
#> # A tibble: 728 x 2
#>    word      n
#>    <chr> <int>
#>  1 to    15717
#>  2 the   15603
#>  3 and   14688
#>  4 of    12873
#>  5 i      9531
#>  6 a      9387
#>  7 it     7584
#>  8 her    7386
#>  9 was    7194
#> 10 she    7020
#> # ... with 718 more rows

anti_join(data, stop_words, by = "word")
#> # A tibble: 46,775 x 1
#>    word     
#>    <chr>    
#>  1 emma     
#>  2 jane     
#>  3 austen   
#>  4 volume   
#>  5 chapter  
#>  6 emma     
#>  7 woodhouse
#>  8 handsome 
#>  9 clever   
#> 10 rich     
#> # ... with 46,765 more rows

^{Created on 2018-09-26 by the reprex package (v0.2.1)}

Issue for style questions/consistency

I'm hoping that we can use this issue for questions about consistency in style for the whole book.

My first one: should we show or hide code to make plots in the book? I show the code on my blog because I know some people are interested specifically in that and always ask questions, but I lean toward hiding in the book, since it is not about plotting at all. Thoughts?

ch3 sentiment clarity

Please clarify why the index is created using the quotient of linenumber / 80 in the middle of the 3rd Chapter

library(tidyr)

janeaustensentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

Twitter archives no longer contain CSVs

tidy-text-mining/07-tweet-archives.Rmd

Line 15 in 54b8cac

    
           An individual can download their own Twitter archive by following [directions available on Twitter's website](https://support.twitter.com/articles/20170160). We each downloaded ours and will now open them up. Let's use the lubridate package to convert the string timestamps to date-time objects and initially take a look at our tweeting patterns overall (Figure \@ref(fig:setup)).

When you request your own Twitter archive, they now only contain JSON, not any CSV files. Probably want to adjust the wording a tiny bit to make it sound less like people can follow along with this analysis in a super straightforward way.

How to filter certain words from retweet count

I am trying to filter certain words from retweet count since i am working with mix of English and non-English tweets. Some of the words typed in English do not make sense. I used the following code, but its not working

word_by_rts %>%
filter(str_detect(word_by_rts == "hrs") %>%
arrange(desc(retweetCount))

The above code is written after

word_by_rts %>%
filter(uses >= 5) %>%
arrange(desc(retweetCount))

Could you please help>

tidytext 0.2.1 get_sentiments now needs interactive lexicon download and also removed NRC lexicon

After updating tidytext package to version 0.2.1, changes in get_sentiments behaviour has broken code in 02-sentiment-analysis.Rmd (lines 38-40):

now the get_sentiments() function expect to remotely download lexicon databases using the package textdata. Interactive prompt needed to be handled manually, before running the markdown code at first time
NRC lexicon was removed (https://cran.r-project.org/web/packages/tidytext/news/news.html), so we have the following error:

get_sentiments("nrc")
Error in match.arg(lexicon) :
'arg' should be one of “afinn”, “bing”, “loughran”

Switch out Project Gutenberg ID for Einstein

Chapter 3 currently uses Einstein's Relativity with gutenberg_id of 5001, which no longer works. As pointed out in ropensci/gutenbergr#22 we can switch out for the id of 30155.

Mining financial articles - Error in mutate_impl(.data, dots) : 1: failed to load HTTP resource

Hi,

I get the error above in section 5.3.1 when executing the code to download articles for technology stocks. I've both typed it in myself and copied and pasted the text from here with the same result. I'm guessing this is something in the wider set up of my pc or the world, but would be grateful for any thoughts you have on how I might fix it.

Thanks,

Ian

count() stopped working w dplyr?

i've used your example in the past, and was using it again today, and this stopped working
tidy_text <-tidy_text %>%
count(word)

this is what i went with
tidy_text <- tidy_text %>%
group_by(word) %>%
summarize(n=n()) %>%
arrange(desc(n))

Chapter 1 missing introduction on getting self-generated texts into R

Hi David and Julia

Thank you for this nice resource. This year I use it with students for the first time. I encountered one blind spot in chapter 1 on importing self-generated text data into R. I know this is trivial if you know how to think in code and R. For the book's audience this might not be a valid assumption. Therefore, I am missing a section on organizing and reading self-generated textdata into the working environment, in addition to working with Jane Austin's books and the Gutenberg Dataset.

I suggest to my students to organize their texts (e.g. from interviews) into separate text files in a sub-directory before loading them into R. So I would really like to find something around the following boilerplate in the book.

textDirectory <- "my_own_texts"

list.files(textDirectory, "\\.txt$") %>%
    tibble(textfile = . ) %>%
    mutate(textid = rownumber()) %>%
    group_by(textfile) %>%
    mutate(
        text = str_c(textDirectory, textfile, sep = "/") %>% read_file()
    ) %>%
    ungroup() -> text_df

I think that a brief section on this little topic would make a great addition to chapter 1. It would offer readers with beginner's knowledge of working with self-generated unstructured data in R a nice way to put the concepts into practice.

appending screenshot of results of each commands line would great

appending screenshot of results of each commands line may very nice to some people do not want read word by word or people already finished the book and looking for something they forget. Because with the screenshots they don't need try it again.

Suggestion: gutenberg_download" cannot download the free book

In chapter 1.5, the function " gutenberg_download" cannot download the free book by hgwells <- gutenberg_download(c(35, 36, 5230, 159)) the defaults mirror appears to be down. It is needed to update the mirror address hgwells <- gutenberg_download(c(35, 36, 5230, 159), mirror = "http://mirrors.xmission.com/gutenberg/") The more infromation can be seen from https://github.com/ropenscilabs/gutenbergr/issues/8

Change title to "with R"?

I've come around on changing to "with". My general reasoning is that most readers will think of R as a tool to accomplish something ("I did this; I did it with R") rather than something notable in itself ("I work in R; here's what I did").

Your thoughts?

Unable to install janeaustenr

Unable to install package=janeaustenr with R 3.4.1