I tried the code from ht

Some questions regarding Zipf's Law (tidy-text-mining, closed)

dgrtwo commented on July 21, 2024

Some questions regarding Zipf's Law

from tidy-text-mining.

Comments (4)

juliasilge commented on July 21, 2024

I see that your question got deleted on Stack Overflow before I could get to it; sorry about that. 😕

To use row_number() to get rank, you need to make sure that your data frame is ordered by n, the number of times a word is used in a document (in the example below, that column is called count). Let's look at an example. It sounds like you are starting with a document-term matrix that you are tidying? (I'm going to use some example data from quanteda that is similar to a DTM.)

library(tidyverse)
library(tidytext)

data("data_corpus_inaugural", package = "quanteda")
inaug_dfm <- quanteda::dfm(data_corpus_inaugural, verbose = FALSE)

ap_td <- tidy(inaug_dfm)
ap_td
#> # A tibble: 44,725 x 3
#>           document   term count
#>              <chr>  <chr> <dbl>
#>  1 1789-Washington fellow     3
#>  2 1793-Washington fellow     1
#>  3      1797-Adams fellow     3
#>  4  1801-Jefferson fellow     7
#>  5  1805-Jefferson fellow     8
#>  6    1809-Madison fellow     1
#>  7    1813-Madison fellow     1
#>  8     1817-Monroe fellow     6
#>  9     1821-Monroe fellow    10
#> 10      1825-Adams fellow     3
#> # ... with 44,715 more rows

Notice that here, you have a tidy data frame with one word per row, but it is not ordered by count, the number of times that each word was used in each document. If we used row_number() here to try to assign rank, the result wouldn't be meaningful, because the rows are in no particular order with respect to count.
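To make the problem concrete, here is a minimal sketch on a made-up three-word tibble (the `toy` data and its values are illustrative assumptions, not taken from the inaugural corpus). Called with no argument, row_number() simply numbers rows in their current order.

```r
library(dplyr)

# Hypothetical toy data, deliberately NOT sorted by count
toy <- tibble(
  term  = c("fellow", "the", "of"),
  count = c(3, 829, 604)
)

# row_number() with no argument numbers rows in their current order,
# so "fellow" ends up with rank 1 even though it is the least frequent word
unsorted_rank <- toy %>%
  mutate(rank = row_number())

unsorted_rank
```

Here the rank column only reflects where each row happens to sit, not how often the word occurs.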

Instead, we can arrange this by descending count.

ap_td <- tidy(inaug_dfm) %>%
  group_by(document) %>%
  arrange(desc(count)) 

ap_td
#> # A tibble: 44,725 x 3
#> # Groups:   document [58]
#>         document  term count
#>            <chr> <chr> <dbl>
#>  1 1841-Harrison   the   829
#>  2 1841-Harrison    of   604
#>  3     1909-Taft   the   486
#>  4 1841-Harrison     ,   407
#>  5     1845-Polk   the   397
#>  6   1821-Monroe   the   360
#>  7 1889-Harrison   the   360
#>  8 1897-McKinley   the   345
#>  9 1841-Harrison    to   318
#> 10 1881-Garfield   the   317
#> # ... with 44,715 more rows

Now we can use row_number() to get rank, because the data frame is actually ranked/arranged/ordered/sorted/however you want to say it.

ap_td <- tidy(inaug_dfm) %>%
  group_by(document) %>%
  arrange(desc(count)) %>%
  mutate(rank = row_number(),
         total = sum(count),
         `term frequency` = count / total)

ap_td
#> # A tibble: 44,725 x 6
#> # Groups:   document [58]
#>         document  term count  rank total `term frequency`
#>            <chr> <chr> <dbl> <int> <dbl>            <dbl>
#>  1 1841-Harrison   the   829     1  9178       0.09032469
#>  2 1841-Harrison    of   604     2  9178       0.06580954
#>  3     1909-Taft   the   486     1  5844       0.08316222
#>  4 1841-Harrison     ,   407     3  9178       0.04434517
#>  5     1845-Polk   the   397     1  5211       0.07618499
#>  6   1821-Monroe   the   360     1  4898       0.07349939
#>  7 1889-Harrison   the   360     1  4744       0.07588533
#>  8 1897-McKinley   the   345     1  4383       0.07871321
#>  9 1841-Harrison    to   318     4  9178       0.03464807
#> 10 1881-Garfield   the   317     1  3240       0.09783951
#> # ... with 44,715 more rows
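If depending on row order feels fragile, one alternative (a sketch, not what the code above does) is to hand the ordering variable to row_number() directly. dplyr's row_number(x) ranks by x and breaks ties by first appearance, so no prior arrange() is needed. The `toy` tibble here is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with two documents, in no particular order
toy <- tibble(
  document = c("A", "A", "A", "B", "B"),
  term     = c("fellow", "the", "of", "citizens", "the"),
  count    = c(3, 829, 604, 2, 486)
)

# Rank within each document by descending count, regardless of row order
ranked <- toy %>%
  group_by(document) %>%
  mutate(rank = row_number(desc(count)),
         total = sum(count),
         `term frequency` = count / total) %>%
  ungroup()

ranked
```

The rows stay in their original order, but each one carries the rank it would have had after sorting within its document.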

ap_td %>%
  ggplot(aes(rank, `term frequency`, color = document)) +
  geom_line(alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10()
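Zipf's law says term frequency is roughly proportional to 1 / rank, which would show up as a straight line with slope near -1 on these log-log axes. One way to check that, sketched here on synthetic data rather than the inaugural corpus (the `zipf_toy` tibble and its 0.1 constant are assumptions for illustration), is to fit a linear model to the log-transformed values.

```r
library(dplyr)

# Synthetic, perfectly Zipfian data: frequency proportional to 1 / rank
zipf_toy <- tibble(
  rank = 1:500,
  `term frequency` = 0.1 / rank
)

# Regress log10(frequency) on log10(rank); the slope estimates the exponent
fit <- lm(log10(`term frequency`) ~ log10(rank), data = zipf_toy)

slope     <- unname(coef(fit)[2])  # -1 for this idealized data
intercept <- unname(coef(fit)[1])  # log10(0.1) = -1 for this idealized data
```

The same lm() call should work with data = ap_td. On real text the fitted slope usually deviates from -1, particularly at the highest and lowest ranks, so restricting the fit to a middle range of ranks may be more informative.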


TheOne000 commented on July 21, 2024

Dear Julia,

I am not sure whether this email will reach you or not. I would like to say thank you very much for your reply. I deleted the post on the forum and reposted it yesterday.

Thank you,
Bun

2017-08-05 20:51 GMT+01:00 Julia Silge <[email protected]>:

…



juliasilge commented on July 21, 2024

Ah, I just saw that! I'll post the answer there too so people can see it.


TheOne000 commented on July 21, 2024

Thank you very much! I am studying your answer and trying to replicate Zipf's law.

