Optimal topic identification from a pool of Latent Dirichlet Allocation models

OpTop: detect the optimal number of topics from a pool of LDA models

Overview

OpTop is an R package that implements the testing approach described in the paper A Statistical Approach for Optimal Topic Model Identification by Lewis and Grossetti (2019).

Latent Dirichlet Allocation (LDA) was developed by Blei, Ng, and Jordan in 2003 [Blei et al. (2003)] and is based on the idea that a corpus can be represented by a set of topics. LDA has been used extensively in computational linguistics, is replicable, and is automated, so it cannot be influenced by researcher prejudice. LDA uses a likelihood approach to discover clusters of text, namely topics, that frequently appear in a corpus.

One of the open challenges in topic modeling is to rigorously determine the optimal number of topics for a corpus. Extant research relies on heuristic approaches such as iterative trial-and-error procedures to select the number of topics. For example, a standard approach is to determine which specification is the least perplexed by the test sets. Perplexity is based on the intuition that a high degree of similarity, identified as a low level of perplexity, can be used to determine the appropriate number of topics [Blei et al., (2003); Hornik and Grün, (2011)].
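As a point of reference for the heuristic above, held-out perplexity is the exponentiated negative average per-word log-likelihood [Blei et al. (2003)]. The following is a minimal base-R sketch, assuming the per-document log-likelihoods and token counts are already available. The helper name perplexity_from_loglik and the toy numbers are hypothetical and are not part of OpTop.

```r
# Hypothetical helper, not part of OpTop: held-out perplexity computed from
# per-document log-likelihoods and per-document token counts.
perplexity_from_loglik <- function(loglik, n_tokens) {
  stopifnot(length(loglik) == length(n_tokens), all(n_tokens > 0))
  exp(-sum(loglik) / sum(n_tokens))
}

# Toy values: two candidate models evaluated on the same three test documents.
n_tokens   <- c(120, 80, 200)
loglik_k5  <- c(-840, -575, -1390)
loglik_k10 <- c(-815, -560, -1350)

perp <- c(k5  = perplexity_from_loglik(loglik_k5, n_tokens),
          k10 = perplexity_from_loglik(loglik_k10, n_tokens))

# The heuristic keeps the specification with the lowest perplexity.
best <- names(which.min(perp))
```

With these made-up inputs the lower perplexity belongs to the second specification. The tests in OpTop are meant to replace this kind of selection rule, not to reproduce it.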

OpTop introduces a set of parametric tests to identify the optimal number of topics from a collection of LDA models. OpTop also includes several tests to explore topic stability and redundancy.

Installation

The package is not on CRAN yet. You can install the development version as follows:

# Install the development version from Github:
devtools::install_github("contefranz/OpTop")

Functions

All the procedures described in the paper will be implemented in this package. The package is in beta stage and contains the following functions. Most of their internals are written in C++ and C to improve performance.

get_topic_models(): handy function to immediately get the list of topic models the user wants to process from a specified environment.
optimal_topic(): implements Test 1 of optimality from the methodological paper [Lewis and Grossetti (2019)].
topic_stability(): implements Test 2 of topic stability from the methodological paper [Lewis and Grossetti (2019)].
agg_topic_stability(): implements Test 3 of aggregate topic stability from the methodological paper [Lewis and Grossetti (2019)].
agg_document_stability(): implements Test 4 of overall topic stability and Test 5 of relative topic importance from the methodological paper [Lewis and Grossetti (2019)].
sim_dfm(): convenient function to simulate a quanteda dfm object from a given LDA model of class LDA_VEM from topicmodels.
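
The idea behind get_topic_models(), collecting fitted models from an environment, can be illustrated in base R. This is only a sketch of the general technique, not the package's implementation. The helper collect_by_class and the stand-in objects are hypothetical.

```r
# Hypothetical illustration, not OpTop's code: gather every object of a given
# class from an environment into a named list.
collect_by_class <- function(class_name, envir = globalenv()) {
  objs <- mget(ls(envir = envir), envir = envir)
  Filter(function(x) inherits(x, class_name), objs)
}

# Stand-in objects: two fake "models" tagged with class LDA_VEM and one
# unrelated object. Real models would come from topicmodels::LDA().
env <- new.env()
assign("lda_05", structure(list(k = 5),  class = "LDA_VEM"), envir = env)
assign("lda_10", structure(list(k = 10), class = "LDA_VEM"), envir = env)
assign("notes", "not a model", envir = env)

models <- collect_by_class("LDA_VEM", envir = env)
names(models)
```

The resulting named list contains only the two stand-in models, which is the shape of input a pool-based test would iterate over.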

Bug Reporting

Bugs and issues can be reported at https://github.com/contefranz/OpTop/issues.

Authors

Francesco Grossetti

Assistant Professor of Data Science and Accounting Information Systems
Bocconi Institute for Data Science and Analytics (BIDSA)
Accounting Department, Bocconi University.
Contact Francesco at: [email protected].

Craig M. Lewis

Madison S. Wigginton Professor of Finance
Owen Business School, Vanderbilt University.
Contact Craig at: [email protected].

Bibliography

Lewis, C. and Grossetti, F. (2022). A Statistical Approach for Optimal Topic Model Identification. Forthcoming in the Journal of Machine Learning Research.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., and Matsuo, A. (2018). quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774. doi: 10.21105/joss.00774. URL: https://quanteda.io

optop's Issues

Check compilation issues under macOS

Using devtools::load_all(".") raises the following error:

##> Error: Could not find tools necessary to compile a package
##> Call `pkgbuild::check_build_tools(debug = TRUE)` to diagnose the problem.

I solved this by invoking options(buildtools.check = function(action) TRUE) before loading the functions or building the package from scratch.
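
Spelled out as it might appear at the top of a development session, the workaround looks roughly like this. It is a sketch only: the devtools::load_all() call is left commented out so the snippet runs without the package sources, and saving and restoring the previous option value is an addition for tidiness, not something reported in this issue.

```r
# Workaround from this issue: make the build-tools check always succeed.
# options() returns the previous values, which are kept for later restoration.
old_opts <- options(buildtools.check = function(action) TRUE)

# devtools::load_all(".")   # run from the package root once the option is set

# Restore the previous setting when done, if desired:
# options(old_opts)
```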

To do: check whether this is a one-time issue or whether it recurs at each startup.

R session used:

sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.27     assertthat_0.2.1  magrittr_2.0.1    reprex_1.0.0     
#>  [5] evaluate_0.14     highr_0.8         stringi_1.5.3     rlang_0.4.10     
#>  [9] cli_2.3.1         rstudioapi_0.13   fs_1.5.0          rmarkdown_2.7    
#> [13] tools_4.0.4       stringr_1.4.0     glue_1.4.2        xfun_0.22        
#> [17] yaml_2.2.1        compiler_4.0.4    htmltools_0.5.1.1 knitr_1.31

Created on 2021-03-27 by the reprex package (v1.0.0)

UPDATE

It appears to be a bug in RStudio, but folks there do not seem to be investing a lot of effort. This is a good issue to monitor here.

A temporary workaround that seems to work most of the time, for me at least, is to run devtools::load_all(".") first to trigger the compilation and load the functions. After that, the standard RStudio shortcuts for building and checking seem to work.

Improve optimal_topic() C++ efficiency

Even though optimal_topic() has been converted to a C++ function, it still presents issues.
Find what they are and make the code more efficient.

Loop over remaining documents in optimal_topic() goes out of bounds

When optimal_topic() finds no perfect match between weighted_dfm and lda_models because the LDA was not able to estimate the model over those documents, the function removes the corresponding entries from weighted_dfm. We thought we were updating the same object, but apparently we are not. For this reason, when optimal_topic_core() loops over the documents in a given element of lda_models, it goes out of bounds.

Could it be that we do not carry the information of the removed documents from R to C++?
This needs to be solved because it prevents the release of v1.0.0. I am labeling this as a bug then.
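
One possible explanation, offered as a guess rather than a diagnosis of the package code: R functions work on copies of their arguments, so rows dropped inside a helper are not dropped in the caller's object unless the result is returned and reassigned. The base-R sketch below uses hypothetical names throughout (drop_unmatched, and a plain matrix standing in for weighted_dfm).

```r
# Hypothetical stand-ins: a plain matrix plays the role of weighted_dfm and a
# character vector lists the documents the LDA model actually covers.
weighted <- matrix(1:12, nrow = 4,
                   dimnames = list(paste0("doc", 1:4), c("a", "b", "c")))
model_docs <- c("doc1", "doc2", "doc4")   # doc3 could not be estimated

# Subsetting inside a function modifies a local copy only.
drop_unmatched <- function(x, keep) {
  x <- x[rownames(x) %in% keep, , drop = FALSE]
  x
}

invisible(drop_unmatched(weighted, model_docs))  # result discarded
n_before <- nrow(weighted)                       # caller's object is unchanged

# Reassigning the returned value keeps both sides aligned, and the loop bound
# handed to compiled code can then be taken from the aligned object.
aligned <- drop_unmatched(weighted, model_docs)
n_docs  <- nrow(aligned)
stopifnot(identical(rownames(aligned), model_docs))
```

If the compiled routine receives the original object together with a document count taken from the model, or the reverse, the two sizes disagree, which would be consistent with the out-of-bounds loop described above.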

Final C++ conversions

Port topic_stability(), agg_topic_stability() and agg_document_stability() inner computations to C++

R CMD check for CRAN submission

Different machines present different WARNINGS and NOTES after running R CMD check --as-cran.

@mattia- Can you check if you have missing packages that could conflict with the check?
This is the R session I am using for you to compare.

sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.27     assertthat_0.2.1  magrittr_2.0.1    reprex_1.0.0     
#>  [5] evaluate_0.14     highr_0.8         stringi_1.5.3     rlang_0.4.10     
#>  [9] cli_2.3.1         rstudioapi_0.13   fs_1.5.0          rmarkdown_2.7    
#> [13] tools_4.0.4       stringr_1.4.0     glue_1.4.2        xfun_0.22        
#> [17] yaml_2.2.1        compiler_4.0.4    htmltools_0.5.1.1 knitr_1.31
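
To make the comparison between machines less manual, the installed package versions from each machine could be diffed. The following is a base-R sketch. The helper name diff_packages is hypothetical and the second machine's versions are made up for illustration.

```r
# Hypothetical helper: compare two named character vectors of package versions
# (names = package, value = version), e.g. built on each machine with
#   pk <- installed.packages()[, "Version"]
diff_packages <- function(mine, theirs) {
  common <- intersect(names(mine), names(theirs))
  list(
    missing_theirs  = setdiff(names(mine), names(theirs)),
    missing_mine    = setdiff(names(theirs), names(mine)),
    version_differs = common[mine[common] != theirs[common]]
  )
}

mine   <- c(digest = "0.6.27", rlang = "0.4.10", knitr = "1.31")
theirs <- c(digest = "0.6.27", rlang = "0.4.9")   # made-up second machine

res <- diff_packages(mine, theirs)
res
```

Packages listed under missing_theirs or version_differs are natural first suspects when R CMD check --as-cran reports different WARNINGS and NOTES on the two machines.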