Comments (4)
I see that your question got deleted on Stack Overflow before I could get to it; sorry about that. 😕
To use row_number() to get rank, you need to make sure that your data frame is ordered by n, the number of times a word is used in a document. Let's look at an example. It sounds like you are starting with a document-term matrix that you are tidying? (I'm going to use some example data that is similar to a DTM from quanteda.)
library(tidyverse)
library(tidytext)
data("data_corpus_inaugural", package = "quanteda")
inaug_dfm <- quanteda::dfm(data_corpus_inaugural, verbose = FALSE)
ap_td <- tidy(inaug_dfm)
ap_td
#> # A tibble: 44,725 x 3
#> document term count
#> <chr> <chr> <dbl>
#> 1 1789-Washington fellow 3
#> 2 1793-Washington fellow 1
#> 3 1797-Adams fellow 3
#> 4 1801-Jefferson fellow 7
#> 5 1805-Jefferson fellow 8
#> 6 1809-Madison fellow 1
#> 7 1813-Madison fellow 1
#> 8 1817-Monroe fellow 6
#> 9 1821-Monroe fellow 10
#> 10 1825-Adams fellow 3
#> # ... with 44,715 more rows
Notice that here, you have a tidy data frame with one word per row, but it is not ordered by count, the number of times that each word was used in each document. If we used row_number() here to try to assign rank, the result wouldn't be meaningful because the words are in no particular order.
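To make that concrete, here is a minimal sketch on a small made-up data frame (the `toy` tibble and its values are hypothetical, not taken from the inaugural corpus). With no arguments, row_number() just numbers the rows in their current order within each group, so the least frequent word can end up with rank 1.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(
  document = c("a", "a", "a", "b", "b"),
  term     = c("fellow", "the", "of", "citizens", "the"),
  count    = c(3, 10, 7, 2, 8)
)

# row_number() with no arguments follows the current row order,
# so "rank" here has nothing to do with count
unsorted <- toy %>%
  group_by(document) %>%
  mutate(rank = row_number()) %>%
  ungroup()

unsorted
```

In document "a", the word "fellow" is numbered first even though it has the lowest count of the three words.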
Instead, we can arrange this by descending count.
ap_td <- tidy(inaug_dfm) %>%
  group_by(document) %>%
  arrange(desc(count))
ap_td
#> # A tibble: 44,725 x 3
#> # Groups: document [58]
#> document term count
#> <chr> <chr> <dbl>
#> 1 1841-Harrison the 829
#> 2 1841-Harrison of 604
#> 3 1909-Taft the 486
#> 4 1841-Harrison , 407
#> 5 1845-Polk the 397
#> 6 1821-Monroe the 360
#> 7 1889-Harrison the 360
#> 8 1897-McKinley the 345
#> 9 1841-Harrison to 318
#> 10 1881-Garfield the 317
#> # ... with 44,715 more rows
Now we can use row_number() to get rank, because the data frame is actually ranked/arranged/ordered/sorted/however you want to say it.
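ap_td <- tidy(inaug_dfm) %>%
  group_by(document) %>%
  # arrange() ignores the grouping by default, so rows are sorted by count overall
  arrange(desc(count)) %>%
  # mutate() respects the grouping, so rank and total are computed per document
  mutate(rank = row_number(),
         total = sum(count),
         `term frequency` = count / total)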
ap_td
#> # A tibble: 44,725 x 6
#> # Groups: document [58]
#> document term count rank total `term frequency`
#> <chr> <chr> <dbl> <int> <dbl> <dbl>
#> 1 1841-Harrison the 829 1 9178 0.09032469
#> 2 1841-Harrison of 604 2 9178 0.06580954
#> 3 1909-Taft the 486 1 5844 0.08316222
#> 4 1841-Harrison , 407 3 9178 0.04434517
#> 5 1845-Polk the 397 1 5211 0.07618499
#> 6 1821-Monroe the 360 1 4898 0.07349939
#> 7 1889-Harrison the 360 1 4744 0.07588533
#> 8 1897-McKinley the 345 1 4383 0.07871321
#> 9 1841-Harrison to 318 4 9178 0.03464807
#> 10 1881-Garfield the 317 1 3240 0.09783951
#> # ... with 44,715 more rows
ap_td %>%
  ggplot(aes(rank, `term frequency`, color = document)) +
  geom_line(alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()
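As an aside, the ranking step can also be written so that it does not depend on row order at all. This is only a sketch of an alternative, not the approach used above: dplyr's row_number() accepts a vector to rank by, so row_number(desc(count)) gives rank 1 to the largest count in each group. The `toy` tibble is hypothetical example data.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(
  document = c("a", "a", "a", "b", "b"),
  term     = c("fellow", "the", "of", "citizens", "the"),
  count    = c(3, 10, 7, 2, 8)
)

# Rank by descending count within each document, without sorting first;
# ties (none in this toy data) are broken by order of appearance
ranked <- toy %>%
  group_by(document) %>%
  mutate(rank = row_number(desc(count)),
         total = sum(count),
         `term frequency` = count / total) %>%
  ungroup()

ranked
```

If tied counts should share a rank instead, min_rank(desc(count)) is the usual dplyr choice.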
from tidy-text-mining.
Ah, I just saw that! I'll post the answer there too so people can see it.
Thank you very much! I am studying your answer and trying to replicate Zipf's law.