vocabular2

The goal of vocabular2 is to compare vocabularies on a set of metrics. There’s currently no clear development path for the package. It may become usable in the future, but for now I don’t advise using the code in your projects. I haven’t spent enough time thinking about the meaningfulness of the metrics to recommend them. They were simply intuitive to me at 4am on some exam-stressed winter night. It’s also quite possible that they already exist in the literature under different names. :)

Installation

You can install the development version with:

devtools::install_github("ludvigolsen/vocabular2")

Main functions

compare_vocabs()
get_doc_metrics()
stack_doc_metrics()

Simple Example

Note: By default, negative values are set to 0 for most of the metrics (not TF-IDF and TF-IRF).

See the metric formulas below the example.

Attach packages

library(vocabular2)
library(tm)
library(tidyverse)
library(knitr)

Load the included ‘hamlet’ dataset

# The included dataset with Hamlet lines
# Extracted from https://www.opensourceshakespeare.org/
hamlet %>% head(5)
#> # A tibble: 5 x 2
#>   Line                                              Character
#>   <chr>                                             <chr>    
#> 1 Though yet of Hamlet our dear brother's death     Claudius 
#> 2 The memory be green, and that it us befitted...   Claudius 
#> 3 We doubt it nothing. Heartily farewell.           Claudius 
#> 4 Have you your father's leave? What says Polonius? Claudius 
#> 5 Take thy fair hour, Laertes. Time be thine,       Claudius

# Collect the lines for each character
data <- hamlet %>% 
  dplyr::group_by(Character) %>% 
  dplyr::summarise(txt = paste0(Line, collapse = " "))

data
#> # A tibble: 5 x 2
#>   Character txt                                                                 
#>   <chr>     <chr>                                                               
#> 1 Claudius  Though yet of Hamlet our dear brother's death The memory be green, …
#> 2 Gertrude  Good Hamlet, cast thy nighted colour off, And let thine eye look li…
#> 3 Hamlet    Not so, my lord. I am too much i' th' sun. Ay, madam, it is common.…
#> 4 Horatio   Friends to this ground. A piece of him. Tush, tush, 'twill not appe…
#> 5 Ophelia   Do you doubt that? No more but so? I shall th' effect of this good …

# Assign each text to a variable
# This could be done in a loop if we had a lot of texts
claudius <- data[1, "txt"][[1]]
gertrude <- data[2, "txt"][[1]]
hamlet <- data[3, "txt"][[1]] # note: overwrites the dataset
horatio <- data[4, "txt"][[1]]
ophelia <- data[5, "txt"][[1]]

Count the terms

# Create a term-count tibble for each document

count_terms <- function(t){
  docs <- Corpus(VectorSource(t))
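  # Preprocessing: remove English stopwords and punctuation
  # (add lemmatization etc. here if needed).
  # Note: removeWords() is case-sensitive, so capitalized stopwords
  # such as "And" are kept and show up lowercased in the output below.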
  docs <- tm_map(docs, removeWords, stopwords("english"))
  docs <- tm_map(docs, removePunctuation, preserve_intra_word_dashes = TRUE)
  dtm <- TermDocumentMatrix(docs)
  m <- as.matrix(dtm)
  v <- sort(rowSums(m), decreasing=TRUE)
  d <- tibble::tibble(Word = names(v), Count=v)
  d
}

claudius_tc <- count_terms(claudius)
gertrude_tc <- count_terms(gertrude)
hamlet_tc <- count_terms(hamlet)
horatio_tc <- count_terms(horatio)
ophelia_tc <- count_terms(ophelia)
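
As the comment above notes, the per-character steps could be done in a loop when there are many texts. Here is a minimal base-R sketch of that idea. The toy data frame and the count_terms_simple() helper are hypothetical stand-ins so the block runs without tm; in the real example you would apply count_terms() to the collapsed data tibble instead.

```r
# Toy stand-in for the collapsed `data` tibble (one row per character)
data <- data.frame(
  Character = c("Claudius", "Gertrude"),
  txt = c("take thy fair hour take", "good hamlet cast thy colour"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for count_terms(): split on whitespace and count
count_terms_simple <- function(t) {
  counts <- sort(table(strsplit(tolower(t), "\\s+")[[1]]), decreasing = TRUE)
  data.frame(Word = names(counts), Count = as.vector(counts),
             stringsAsFactors = FALSE)
}

# One term-count table per character, named by the lowercased character name
tc_dfs <- lapply(setNames(data$txt, tolower(data$Character)), count_terms_simple)

names(tc_dfs)
```

The result is a named list with Word and Count columns per document, which is the shape passed as tc_dfs to compare_vocabs(). Whether compare_vocabs() accepts plain data frames rather than tibbles is not confirmed here.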

Compare the vocabularies

This is where the metrics are calculated. The result has one row per word and one list column per document; each cell holds a nested tibble with the metrics for that word in that document.

scores <- compare_vocabs(tc_dfs = list("claudius" = claudius_tc,
                                       "gertrude" = gertrude_tc,
                                       "hamlet" = hamlet_tc,
                                       "horatio" = horatio_tc,
                                       "ophelia" = ophelia_tc))
scores
#> # A tibble: 887 x 7
#>    Word     `In Docs` claudius     gertrude     hamlet     horatio    ophelia   
#>    <chr>        <dbl> <list>       <list>       <list>     <list>     <list>    
#>  1 ability          1 <tibble [1 … <tibble [1 … <tibble [… <tibble [… <tibble […
#>  2 aboard           1 <tibble [1 … <tibble [1 … <tibble [… <tibble [… <tibble […
#>  3 acquitt…         1 <tibble [1 … <tibble [1 … <tibble [… <tibble [… <tibble […
#>  4 act              1 <tibble [1 … <tibble [1 … <tibble [… <tibble [… <tibble […
#>  5 admirat…         1 <tibble [1 … <tibble [1 … <tibble [… <tibble [… <tibble […
#>  6 affecti…         1 <tibble [1 … <tibble [1 … <tibble [… <tibble [… <tibble […
#>  7 affecti…         1 <tibble [1 … <tibble [1 … <tibble [… <tibble [… <tibble […
#>  8 affrigh…         1 <tibble [1 … <tibble [1 … <tibble [… <tibble [… <tibble […
#>  9 aha              1 <tibble [1 … <tibble [1 … <tibble [… <tibble [… <tibble […
#> 10 air              2 <tibble [1 … <tibble [1 … <tibble [… <tibble [… <tibble […
#> # … with 877 more rows

Extract the metrics for Claudius

get_doc_metrics(scores, "claudius") %>% 
  arrange(desc(REL_TF_NRTF)) %>% 
  head(10) %>% 
  kable()

Doc	Word	In Docs	Count	TF	IRF	RTF	NRTF	MRTF	TF_IDF	TF_IRF	TF_RTF	TF_NRTF	TF_MRTF	REL_TF_NRTF	REL_TF_MRTF	RANK_ENS
claudius	give	2	7	0.0132075	0.6931472	0.0022472	0.0005618	0.0022472	0.0067468	0.0091548	0.0109604	0.0126457	0.0109604	0.1314895	0.0490369	886.0
claudius	gertrude	1	5	0.0094340	1.3862944	0.0000000	0.0000000	0.0000000	0.0086443	0.0130782	0.0094340	0.0094340	0.0094340	0.1255340	0.1255340	885.0
claudius	laertes	2	9	0.0169811	0.6931472	0.0059880	0.0014970	0.0059880	0.0086744	0.0117704	0.0109931	0.0154841	0.0109931	0.1193114	0.0279667	887.0
claudius	leave	1	3	0.0056604	1.3862944	0.0000000	0.0000000	0.0000000	0.0051866	0.0078469	0.0056604	0.0056604	0.0056604	0.0451922	0.0451922	883.5
claudius	polonius	1	3	0.0056604	1.3862944	0.0000000	0.0000000	0.0000000	0.0051866	0.0078469	0.0056604	0.0056604	0.0056604	0.0451922	0.0451922	883.5
claudius	hamlet	3	15	0.0283019	0.2876821	0.0428010	0.0107002	0.0359281	0.0063154	0.0081419	0.0000000	0.0176016	0.0000000	0.0439106	0.0000000	673.0
claudius	time	2	4	0.0075472	0.6931472	0.0022472	0.0005618	0.0022472	0.0038553	0.0052313	0.0053000	0.0069854	0.0053000	0.0415048	0.0135499	882.0
claudius	father	4	6	0.0113208	0.0000000	0.0081824	0.0020456	0.0029940	0.0000000	0.0000000	0.0031384	0.0092752	0.0083267	0.0381682	0.0255019	697.0
claudius	thine	2	4	0.0075472	0.6931472	0.0029940	0.0007485	0.0029940	0.0038553	0.0052313	0.0045532	0.0067987	0.0045532	0.0352249	0.0092965	881.0
claudius	must	3	6	0.0113208	0.2876821	0.0091200	0.0022800	0.0068729	0.0025262	0.0032568	0.0022007	0.0090407	0.0044479	0.0342901	0.0066663	880.0

Extract the metrics for Gertrude

get_doc_metrics(scores, "gertrude") %>% 
  arrange(desc(REL_TF_NRTF)) %>% 
  head(10) %>% 
  kable()

Doc	Word	In Docs	Count	TF	IRF	RTF	NRTF	MRTF	TF_IDF	TF_IRF	TF_RTF	TF_NRTF	TF_MRTF	REL_TF_NRTF	REL_TF_MRTF	RANK_ENS
gertrude	drownd	1	3	0.0089820	1.3862944	0.0000000	0.0000000	0.0000000	0.0082302	0.0124517	0.0089820	0.0089820	0.0089820	0.1296075	0.1296075	887.0
gertrude	hamlet	3	12	0.0359281	0.2876821	0.0351747	0.0087937	0.0283019	0.0080171	0.0103359	0.0007534	0.0271345	0.0076263	0.1040184	0.0096092	875.0
gertrude	thou	4	8	0.0239521	0.0000000	0.0219195	0.0054799	0.0117647	0.0000000	0.0000000	0.0020326	0.0184722	0.0121874	0.0727235	0.0237111	730.0
gertrude	hast	2	3	0.0089820	0.6931472	0.0018868	0.0004717	0.0018868	0.0045883	0.0062259	0.0070952	0.0085103	0.0070952	0.0698872	0.0254277	885.5
gertrude	ophelia	2	3	0.0089820	0.6931472	0.0018868	0.0004717	0.0018868	0.0045883	0.0062259	0.0070952	0.0085103	0.0070952	0.0698872	0.0254277	885.5
gertrude	thy	3	6	0.0179641	0.2876821	0.0135679	0.0033920	0.0113208	0.0040086	0.0051679	0.0043961	0.0145721	0.0066433	0.0653355	0.0100518	883.0
gertrude	this	2	3	0.0089820	0.6931472	0.0022472	0.0005618	0.0022472	0.0045883	0.0062259	0.0067348	0.0084202	0.0067348	0.0638903	0.0211089	884.0
gertrude	alack	1	2	0.0059880	1.3862944	0.0000000	0.0000000	0.0000000	0.0054868	0.0083012	0.0059880	0.0059880	0.0059880	0.0576034	0.0576034	881.0
gertrude	forgot	1	2	0.0059880	1.3862944	0.0000000	0.0000000	0.0000000	0.0054868	0.0083012	0.0059880	0.0059880	0.0059880	0.0576034	0.0576034	881.0
gertrude	noise	1	2	0.0059880	1.3862944	0.0000000	0.0000000	0.0000000	0.0054868	0.0083012	0.0059880	0.0059880	0.0059880	0.0576034	0.0576034	881.0

Extract the metrics for Hamlet

get_doc_metrics(scores, "hamlet") %>% 
  arrange(desc(REL_TF_NRTF)) %>% 
  head(10) %>% 
  kable()

Doc	Word	In Docs	Count	TF	IRF	RTF	NRTF	MRTF	TF_IDF	TF_IRF	TF_RTF	TF_NRTF	TF_MRTF	REL_TF_NRTF	REL_TF_MRTF	RANK_ENS
hamlet	hold	1	4	0.0117647	1.3862944	0.0000000	0.0000000	0.0000000	0.0107799	0.0163093	0.0117647	0.0117647	0.0117647	0.2215225	0.2215225	887
hamlet	horatio	2	5	0.0147059	0.6931472	0.0018868	0.0004717	0.0018868	0.0075121	0.0101933	0.0128191	0.0142342	0.0128191	0.1909742	0.0751466	886
hamlet	horrible	1	3	0.0088235	1.3862944	0.0000000	0.0000000	0.0000000	0.0080849	0.0122320	0.0088235	0.0088235	0.0088235	0.1246064	0.1246064	885
hamlet	boy	1	2	0.0058824	1.3862944	0.0000000	0.0000000	0.0000000	0.0053899	0.0081547	0.0058824	0.0058824	0.0058824	0.0553806	0.0553806	882
hamlet	earth	1	2	0.0058824	1.3862944	0.0000000	0.0000000	0.0000000	0.0053899	0.0081547	0.0058824	0.0058824	0.0058824	0.0553806	0.0553806	882
hamlet	fellow	1	2	0.0058824	1.3862944	0.0000000	0.0000000	0.0000000	0.0053899	0.0081547	0.0058824	0.0058824	0.0058824	0.0553806	0.0553806	882
hamlet	hell	1	2	0.0058824	1.3862944	0.0000000	0.0000000	0.0000000	0.0053899	0.0081547	0.0058824	0.0058824	0.0058824	0.0553806	0.0553806	882
hamlet	thrift	1	2	0.0058824	1.3862944	0.0000000	0.0000000	0.0000000	0.0053899	0.0081547	0.0058824	0.0058824	0.0058824	0.0553806	0.0553806	882
hamlet	make	2	3	0.0088235	0.6931472	0.0034364	0.0008591	0.0034364	0.0045073	0.0061160	0.0053871	0.0079644	0.0053871	0.0473864	0.0117273	878
hamlet	sword	2	3	0.0088235	0.6931472	0.0034364	0.0008591	0.0034364	0.0045073	0.0061160	0.0053871	0.0079644	0.0053871	0.0473864	0.0117273	878

Extract the metrics for Horatio

get_doc_metrics(scores, "horatio") %>% 
  arrange(desc(REL_TF_NRTF)) %>% 
  head(10) %>% 
  kable()

Doc	Word	In Docs	Count	TF	IRF	RTF	NRTF	MRTF	TF_IDF	TF_IRF	TF_RTF	TF_NRTF	TF_MRTF	REL_TF_NRTF	REL_TF_MRTF	RANK_ENS
horatio	lord	5	37	0.0831461	-0.2231436	0.1203392	0.0300848	0.1065292	-0.0151593	-0.0185535	0.0000000	0.0530613	0.0000000	0.1456518	0.0000000	629
horatio	might	1	4	0.0089888	1.3862944	0.0000000	0.0000000	0.0000000	0.0082363	0.0124611	0.0089888	0.0089888	0.0089888	0.1208332	0.1208332	887
horatio	heard	2	3	0.0067416	0.6931472	0.0018868	0.0004717	0.0018868	0.0034438	0.0046729	0.0048548	0.0062699	0.0048548	0.0370797	0.0128226	886
horatio	aught	1	2	0.0044944	1.3862944	0.0000000	0.0000000	0.0000000	0.0041182	0.0062305	0.0044944	0.0044944	0.0044944	0.0302083	0.0302083	879
horatio	bernardo	1	2	0.0044944	1.3862944	0.0000000	0.0000000	0.0000000	0.0041182	0.0062305	0.0044944	0.0044944	0.0044944	0.0302083	0.0302083	879
horatio	consider	1	2	0.0044944	1.3862944	0.0000000	0.0000000	0.0000000	0.0041182	0.0062305	0.0044944	0.0044944	0.0044944	0.0302083	0.0302083	879
horatio	custom	1	2	0.0044944	1.3862944	0.0000000	0.0000000	0.0000000	0.0041182	0.0062305	0.0044944	0.0044944	0.0044944	0.0302083	0.0302083	879
horatio	een	1	2	0.0044944	1.3862944	0.0000000	0.0000000	0.0000000	0.0041182	0.0062305	0.0044944	0.0044944	0.0044944	0.0302083	0.0302083	879
horatio	issue	1	2	0.0044944	1.3862944	0.0000000	0.0000000	0.0000000	0.0041182	0.0062305	0.0044944	0.0044944	0.0044944	0.0302083	0.0302083	879
horatio	most	1	2	0.0044944	1.3862944	0.0000000	0.0000000	0.0000000	0.0041182	0.0062305	0.0044944	0.0044944	0.0044944	0.0302083	0.0302083	879

Extract the metrics for Ophelia

get_doc_metrics(scores, "ophelia") %>% 
  arrange(desc(REL_TF_NRTF)) %>% 
  head(10) %>% 
  kable()

Doc	Word	In Docs	Count	TF	IRF	RTF	NRTF	MRTF	TF_IDF	TF_IRF	TF_RTF	TF_NRTF	TF_MRTF	REL_TF_NRTF	REL_TF_MRTF	RANK_ENS
ophelia	lord	5	31	0.1065292	-0.2231436	0.0969561	0.0242390	0.0831461	-0.0194226	-0.0237713	0.0095731	0.0822902	0.0233831	0.3571989	0.0309710	742
ophelia	mark	1	3	0.0103093	1.3862944	0.0000000	0.0000000	0.0000000	0.0094463	0.0142917	0.0103093	0.0103093	0.0103093	0.1753109	0.1753109	887
ophelia	know	4	6	0.0206186	0.0000000	0.0146223	0.0036556	0.0094340	0.0000000	0.0000000	0.0059962	0.0169630	0.0111846	0.0822374	0.0230834	762
ophelia	better	1	2	0.0068729	1.3862944	0.0000000	0.0000000	0.0000000	0.0062975	0.0095278	0.0068729	0.0068729	0.0068729	0.0779159	0.0779159	883
ophelia	keen	1	2	0.0068729	1.3862944	0.0000000	0.0000000	0.0000000	0.0062975	0.0095278	0.0068729	0.0068729	0.0068729	0.0779159	0.0779159	883
ophelia	keep	1	2	0.0068729	1.3862944	0.0000000	0.0000000	0.0000000	0.0062975	0.0095278	0.0068729	0.0068729	0.0068729	0.0779159	0.0779159	883
ophelia	many	1	2	0.0068729	1.3862944	0.0000000	0.0000000	0.0000000	0.0062975	0.0095278	0.0068729	0.0068729	0.0068729	0.0779159	0.0779159	883
ophelia	naught	1	2	0.0068729	1.3862944	0.0000000	0.0000000	0.0000000	0.0062975	0.0095278	0.0068729	0.0068729	0.0068729	0.0779159	0.0779159	883
ophelia	show	1	2	0.0068729	1.3862944	0.0000000	0.0000000	0.0000000	0.0062975	0.0095278	0.0068729	0.0068729	0.0068729	0.0779159	0.0779159	883
ophelia	sings	1	2	0.0068729	1.3862944	0.0000000	0.0000000	0.0000000	0.0062975	0.0095278	0.0068729	0.0068729	0.0068729	0.0779159	0.0779159	883

Extract and stack metrics for all documents

stack_doc_metrics(scores)
#> # A tibble: 1,294 x 17
#>    Doc   Word  `In Docs` Count      TF    IRF     RTF    NRTF    MRTF   TF_IDF
#>    <chr> <chr>     <dbl> <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
#>  1 clau… aboa…         1     1 0.00189  1.39  0       0.      0        1.73e-3
#>  2 clau… acqu…         1     1 0.00189  1.39  0       0.      0        1.73e-3
#>  3 clau… affe…         1     1 0.00189  1.39  0       0.      0        1.73e-3
#>  4 clau… alas          3     2 0.00377  0.288 0.0149  3.73e-3 0.0120   8.42e-4
#>  5 clau… alone         2     1 0.00189  0.693 0.00294 7.35e-4 0.00294  9.64e-4
#>  6 clau… and           5     7 0.0132  -0.223 0.0663  1.66e-2 0.0180  -2.41e-3
#>  7 clau… answ…         3     1 0.00189  0.288 0.00524 1.31e-3 0.00299  4.21e-4
#>  8 clau… apart         2     1 0.00189  0.693 0.00299 7.49e-4 0.00299  9.64e-4
#>  9 clau… argu…         2     1 0.00189  0.693 0.00344 8.59e-4 0.00344  9.64e-4
#> 10 clau… arm           2     1 0.00189  0.693 0.00344 8.59e-4 0.00344  9.64e-4
#> # … with 1,284 more rows, and 7 more variables: TF_IRF <dbl>, TF_RTF <dbl>,
#> #   TF_NRTF <dbl>, TF_MRTF <dbl>, REL_TF_NRTF <dbl>, REL_TF_MRTF <dbl>,
#> #   RANK_ENS <dbl>

Metrics

TF-IDF and TF-IRF (Term Frequency - Inverse Rest Frequency)

These are highly correlated (>0.999).
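
The values printed in the example tables are consistent with the forms sketched below. These forms are inferred from the output, not taken from the package source, so treat them as an assumption. The sketch recomputes the "give" row of the Claudius table; the document total of 530 is implied by Count / TF in that table.

```r
# Values from the "give" row of the Claudius table
n_docs  <- 5         # documents in the corpus
in_docs <- 2         # documents containing "give" (including Claudius)
tf      <- 7 / 530   # count / total term count in the document

# Assumed forms, inferred from the printed values
idf <- log(n_docs / (1 + in_docs))     # +1-smoothed inverse document frequency

rest_docs    <- n_docs - 1             # the other documents
rest_in_docs <- in_docs - 1            # other documents containing the term
irf <- log(rest_docs / (1 + rest_in_docs))

tf_idf <- tf * idf
tf_irf <- tf * irf
round(c(IRF = irf, TF_IDF = tf_idf, TF_IRF = tf_irf), 7)
```

Under these assumptions, a term found in all five documents gets a negative IDF and IRF, which matches the negative values for "lord" in the Horatio and Ophelia tables.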

TF-RTF (Term Frequency - Rest Term Frequency)

TF-RTF is positive when the term frequency is higher in the current document than the sum of the term frequencies in the rest of the corpus.
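
A minimal sketch of this definition for a single term. The term frequencies are hypothetical, and the clipping of negative values follows the default mentioned in the note before the example.

```r
# Hypothetical term frequencies of one term in five documents
tf <- c(doc1 = 0.030, doc2 = 0.010, doc3 = 0.005, doc4 = 0.000, doc5 = 0.002)

# Rest term frequency: sum of the term's frequencies in all other documents
rtf <- sum(tf) - tf

# TF-RTF, with negative values set to 0
tf_rtf <- pmax(tf - rtf, 0)
tf_rtf
```

Because a positive score requires a frequency larger than the sum of all the others, at most one document can have a positive TF-RTF for a given term.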

TF-NRTF (Term Frequency - Normalized Rest Term Frequency)

As our selected TF function ensures that frequencies add up to 1 document-wise, the NRTF (Normalized Rest Term Frequency) is simply the average term frequency in the other documents, instead of the sum as in RTF.

TF-NRTF is positive when the term frequency is higher in the current document than the average term frequency in the rest of the corpus.

TF-MRTF (Term Frequency - Maximum Rest Term Frequency)

Instead of the normalized (average) rest term frequency, we use the maximum rest term frequency.

TF-MRTF is positive when the term frequency is higher in the current document than the maximum term frequency in the rest of the corpus.
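
The average and maximum variants can be sketched for a whole vocabulary at once. The term-frequency matrix below is hypothetical, and the code only illustrates the definitions in the text; it is not the package's implementation.

```r
# Hypothetical term-frequency matrix: rows are terms, columns are documents.
# Each column sums to 1, as the TF function described above requires.
tf <- cbind(
  doc1 = c(lord = 0.5, sword = 0.3, hell = 0.2),
  doc2 = c(lord = 0.6, sword = 0.4, hell = 0.0),
  doc3 = c(lord = 0.7, sword = 0.0, hell = 0.3)
)
n_rest <- ncol(tf) - 1

rtf  <- rowSums(tf) - tf   # sum over the other documents
nrtf <- rtf / n_rest       # average over the other documents

# Maximum over the other documents, one column at a time
mrtf <- sapply(seq_len(ncol(tf)), function(j) {
  apply(tf[, -j, drop = FALSE], 1, max)
})
dimnames(mrtf) <- dimnames(tf)

# Negative differences are set to 0
tf_nrtf <- pmax(tf - nrtf, 0)
tf_mrtf <- pmax(tf - mrtf, 0)
```

Since the maximum over the other documents is never smaller than their average, TF-MRTF is the stricter of the two: it can never exceed TF-NRTF.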

Relative TF-NRTF (Relative Term Frequency - Normalized Rest Term Frequency)

Where TF-NRTF tends to be dominated by highly frequent words, the Relative TF-NRTF instead uses the relative distance to the NRTF. As that would likely be dominated by very infrequent words, we multiply it by the term frequency.

Epsilon (ε) is added to avoid zero-division. It is calculated to resemble +1 smoothing in the rest population.

The beta (β) exponent lets us control the influence of the term frequency. Setting it to 0 removes the term-frequency weighting, leaving just the relative difference (log scaled).

Relative TF-MRTF (Relative Term Frequency - Maximum Rest Term Frequency)

Similar to Relative TF-NRTF but for MRTF instead.
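
Both relative metrics can be sketched with one helper. The functional form, the name relative_metric() and the choice of ε below are all assumptions based on the description above, not the package source. The inputs come from the "give" row of the Claudius table, and the document totals are implied by Count / TF in the other tables.

```r
# Assumed form: TF^beta * max(TF - rest, 0) / (rest + epsilon),
# where `rest` is either NRTF or MRTF. This is a hypothetical helper.
relative_metric <- function(tf, rest, epsilon, beta = 1) {
  tf^beta * pmax(tf - rest, 0) / (rest + epsilon)
}

# "give" row of the Claudius table
tf   <- 7 / 530
nrtf <- 0.0005618
mrtf <- 0.0022472

# Assumption: epsilon is one count spread over the rest population,
# i.e. 1 / (total term count of the other four documents)
epsilon <- 1 / (334 + 340 + 445 + 291)

rel_tf_nrtf <- relative_metric(tf, nrtf, epsilon)
rel_tf_mrtf <- relative_metric(tf, mrtf, epsilon)
c(REL_TF_NRTF = rel_tf_nrtf, REL_TF_MRTF = rel_tf_mrtf)
```

With these inputs the results land within about 0.2% of the printed REL_TF_NRTF and REL_TF_MRTF values, so the package's exact ε presumably differs slightly from this guess. The sketch applies no log scaling. Calling relative_metric() with beta = 0 drops the term-frequency weighting.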
