polmine / biglda
Tools for fast LDA topic modelling for big corpora
In another project context, we have developed a streamlined workflow for a good topic model with simplified, more readable code. It should be integrated here.
When fitting topic models on a corpus of press releases, we generate bigrams joined by an underscore, but the bigrams do not appear in the output. One potential cause is the regex used (implicitly) by MALLET, see https://mimno.github.io/Mallet/import.html:
--token-regex. MALLET divides documents into tokens using a regular expression. As of version 2.0.8, the default token expression is '\p{L}[\p{L}\p{P}]+\p{L}', which is valid for all Unicode letters, and supports typical English non-letter patterns such as hyphens, apostrophes, and acronyms with . characters. Note that this expression also implicitly drops one- and two-letter words.
Other options include:
For non-English text, a good choice is --token-regex '[\p{L}\p{M}]+', which means Unicode letters and marks (required for Indic scripts). MALLET currently does not support word segmentation for languages that require it, such as Chinese, Japanese, Korean, or Thai.
To include short words, use \p{L}+ (letters only) or '\p{L}[\p{L}\p{P}]*\p{L}|\p{L}' (letters possibly including punctuation).
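Whether the default pattern really drops the underscore bigrams can be checked directly in R, since PCRE understands the same \p{...} classes (the example tokens are made up):

```r
# Default MALLET token regex (version >= 2.0.8), anchored to test one token.
default_regex <- "^\\p{L}[\\p{L}\\p{P}]+\\p{L}$"

# The underscore is connector punctuation (\p{Pc}), which is part of \p{P},
# so an underscore bigram does match the default pattern ...
grepl(default_regex, "klima_wandel", perl = TRUE)   # TRUE

# ... while one- and two-letter words are dropped, as documented.
grepl(default_regex, "ab", perl = TRUE)             # FALSE
```

If the bigrams survive this check, the cause is more likely somewhere else in the pipeline (e.g. a pruning or stoplist step).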
It may be that the values need to be normalized? I just ran into problems with SVD on the beta matrix generated using load_word_weights().
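A workaround worth trying before the SVD: normalise each row of the matrix into a probability distribution. A sketch with a toy matrix standing in for the output of load_word_weights():

```r
# Toy stand-in for the beta matrix: rows are topics, columns are words,
# values are unnormalised word weights.
beta <- matrix(c(5, 1, 0.5, 2, 8, 1), nrow = 2, byrow = TRUE)

# Normalise rows so every topic becomes a probability distribution ...
beta_norm <- beta / rowSums(beta)

# ... then run the decomposition on the normalised matrix.
decomposition <- svd(beta_norm)
```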
At present, memory allocation is hard-coded. The memory available should be determined from the system instead.
This works on Linux:
grep MemTotal /proc/meminfo
On macOS:
system_profiler SPHardwareDataType | grep " Memory:"
sysctl hw.memsize # alternative
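The two commands above could be wrapped in a small R helper; a sketch (the function name and the NA fallback are my own choices):

```r
# Return total system memory in bytes, or NA if it cannot be determined.
memory_total <- function() {
  sysname <- Sys.info()[["sysname"]]
  if (sysname == "Linux") {
    # /proc/meminfo reports MemTotal in kB
    line <- grep("^MemTotal", readLines("/proc/meminfo"), value = TRUE)
    return(as.numeric(gsub("\\D", "", line)) * 1024)
  }
  if (sysname == "Darwin") {
    return(as.numeric(system("sysctl -n hw.memsize", intern = TRUE)))
  }
  NA_real_
}
```

The result could then be used to set java.parameters before rJava is loaded, instead of the hard-coded value.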
Transferring the data from R to the JVM is slow. Might the idx package offer a faster solution? It has been archived, but maybe its functionality can be extracted...
Saving and reading the data can be slow - this is what I found for multi-threaded readRDS():
devtools::install_github("traversc/trqwe")
Too confusing.
This is code removed from the polmineR.misc package that had been used for saving mallet objects. This (biglda) package has chosen another approach for loading and saving Java objects. Is this code still useful? Don't know yet.
#' @exportMethod store
setGeneric("store", function(object, ...) standardGeneric("store"))
#' store mallet object
#'
#' @rdname as.mallet-method
#' @importClassesFrom rJava jobjRef
setMethod("store", "jobjRef", function(object, filename = NULL){
  if (!requireNamespace("rJava", quietly = TRUE))
    stop("rJava package not available")
  if (is.null(filename)) filename <- tempfile()
  fileOutputStream <- new(rJava::J("java/io/FileOutputStream"), filename)
  objectStream <- new(rJava::J("java/io/ObjectOutputStream"), fileOutputStream)
  objectStream$writeObject(object)
  objectStream$close()
  filename
})
I fitted a mallet LDA topic model and stored it as a topicmodels LDA using
topicmodels_lda <- biglda::as_LDA(mallet_lda)
and then saveRDS() to create an rds file.
When I try to load this rds file, it seems to be required that biglda is installed and correctly set up, otherwise this error message occurs:
Error in .requirePackage(package) :
unable to find required package ‘biglda’
My first impression is that this might be a bit of a nuisance in cases in which I just want to use the topicmodels or the topicanalysis package. Is this behavior deliberate? If so, I would be interested in the underlying rationale.
Edit, just to reiterate: If biglda is installed, it seems like the model can be loaded and used like any other model fitted with the topicmodels package (first impressions, I just used terms() and topics() for now). But it is still complaining (although the terms object is created):
library(topicmodels)
tm250k <- readRDS("topic_250k.rds")
xx <- terms(tm250k, 50)
Loading required package: biglda
No mallet installation found. Use mallet_install() for installation!
The other metrics (Arun, Deveaud) have the argument beta, so the argument X is confusing and inconsistent.
The save_topic_documents() function uses printDocumentTopics() - but the aforementioned method is much, much more efficient!
Sys.setenv(MALLET_DIR="/opt/mallet/Mallet-202108")
library(biglda)
library(polmineR)
use("polmineR")
speeches <- polmineR::as.speeches("GERMAPARLMINI", s_attribute_name = "speaker", s_attribute_date = "date")
instance_list <- as.instance_list(speeches)
BTM <- BigTopicModel(n_topics = 25L, alpha_sum = 5.1, beta = 0.1)
BTM$addInstances(instance_list)
BTM$estimate()
file <- rJava::.jnew("java/io/File", path.expand("~/Lab/tmp/dense.tsv"))
file_writer <- rJava::.jnew("java/io/FileWriter", file)
print_writer <- new(rJava::J("java/io/PrintWriter"), file_writer)
BTM$printDenseDocumentTopics(print_writer)
print_writer$close()
file <- rJava::.jnew("java/io/File", path.expand("~/Lab/tmp/notdense.tsv"))
file_writer <- rJava::.jnew("java/io/FileWriter", file)
print_writer <- new(rJava::J("java/io/PrintWriter"), file_writer)
BTM$printDocumentTopics(print_writer)
print_writer$close()
a <- data.table::fread("~/Lab/tmp/dense.tsv")
b <- data.table::fread("~/Lab/tmp/notdense.tsv")
The as.instance_list() function provides a nice way to pass a partition_bundle object (from polmineR) to the workflow as shown in the vignette here.
What is missing, as far as I can see at least, is the possibility to reduce the vocabulary of the token streams which are passed to the mallet instance list (i.e. removing stopwords, punctuation, etc.).
In addition, sometimes it could be useful to remove very short documents before fitting the topic model. Of course, this kind of filtering could be done before passing the partition_bundle to as.instance_list(). However, if you want to remove stopwords first and then filter out short documents (which might be short now because of the removal of stopwords), it could be nice to do it within the function.
Within as.instance_list(), the token streams of the partitions in the partition_bundle are retrieved using the get_token_stream() method of polmineR. See the code below:
Line 75 in bd7a884
Now I thought that subsetting these token streams should be possible by utilizing the full potential of the get_token_stream() method of polmineR. As documented there (?get_token_stream), there is a subset argument which can be used to pass expressions to the function which allow for some - also quite elaborate - subsetting.
As a next step, I tried to add this to the original function. Instead of line 75 quoted above, I tried to create a slightly modified version which includes the subset argument:
token_stream_list <- get_token_stream(
  x,
  p_attribute = p_attribute,
  subset = {!get(p_attribute) %in% terms_to_drop},
  progress = TRUE
)
Here, I think get() is needed to find the correct column in the data.table containing the token stream. terms_to_drop would be an additional argument for as.instance_list() which - in this first draft - would simply be a character vector of terms that should be dropped from the column indicated by the p_attribute argument. I assume that if terms_to_drop defaulted to NULL, each term would be kept, but I did not yet check this.
This kind of subset works when you run each line of the function step by step. If you want to use this modified function as a whole, however, you get the error that the object terms_to_drop cannot be found.
I could be mistaken here, but I assume the following: The subset is not evaluated in the same environment, i.e. get_token_stream() looks for an object called terms_to_drop in the global environment, where it does not find it (unless a character vector with that name happens to exist there). An easy way to make this work would be to assign the terms_to_drop variable to the global environment before building the token_stream_list, but I do not think it is a good idea for a function to implicitly create objects there. So, I am not entirely sure how to solve this robustly.
The code suggested above also limits the possibilities of the subset argument, given that it also could be used to subset the token stream by more than one p-attribute. But for now, I would assume that the removal of specific terms would be a useful addition, at least as an option.
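One way around the scoping problem, without writing to the global environment, might be to splice the value of terms_to_drop into the expression before handing it on, e.g. with bquote(). A sketch against a plain data.frame rather than the real get_token_stream() internals:

```r
p_attribute <- "word"
terms_to_drop <- c("the", "of")
df <- data.frame(word = c("the", "rise", "of", "biglda"))

# .(terms_to_drop) splices the *value* into the expression, so later
# evaluation no longer needs to find a variable of that name; the
# resulting call is: !get(p_attribute) %in% c("the", "of")
expr <- bquote(!get(p_attribute) %in% .(terms_to_drop))

# Evaluate with the data as environment, falling back to the local frame.
kept <- df[eval(expr, envir = df, enclos = environment()), , drop = FALSE]
kept$word   # "rise" "biglda"
```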
Concerning the removal of short documents, things might be easier. Introducing some kind of "min_length" argument and iterating through each element of token_stream_list, evaluating its length, seems to work. At the end, all empty token streams must be removed from the list, however, otherwise adding them to the instance_list won't work.
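The length filtering just described might look like this on a plain list of token streams (min_length is a hypothetical argument name):

```r
token_stream_list <- list(
  a = c("one", "two", "three"),
  b = c("one"),
  c = character()   # empty after stopword removal
)

min_length <- 2L

# Keep only documents that still reach the threshold; this also removes
# the empty streams that would break building the instance list.
token_stream_list <- Filter(
  function(ts) length(ts) >= min_length,
  token_stream_list
)

names(token_stream_list)   # "a"
```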
To test the performance of biglda, I used this code ...
options(java.parameters = "-Xmx4g")
library(polmineR)
library(biglda)
if (!mallet_is_installed()) mallet_install()
library(polmineR)
use("GermaParl")
speeches <- polmineR::as.speeches("GERMAPARL", s_attribute_name = "speaker")
id_list <- get_token_stream(speeches, p_attribute = "word", decode = FALSE) # 43 secs
instance_list <- as.instance_list(id_list, corpus = "GERMAPARL") # 7min
lda <- ParallelTopicModel(n_topics = 250L, alpha_sum = 5.1, beta = 0.1)
lda$addInstances(instance_list) # 1min
lda$setNumThreads(7L)
lda$setTopicDisplay(50L, 10L)
lda$setNumIterations(2000L)
started <- Sys.time()
lda$estimate()
Sys.time() - started
The result: Time difference of 3.593737 hours
But there is no filtering of stopwords or of short documents, so this is not the end of the story!
... and an Rcpp implementation seems feasible.
At present, mallet is assumed to be present within the biglda package. It should be possible to have it at a system-wide storage location.
It may make sense to have access to the topic model diagnostics of mallet. This is a minimal example how it might work.
First we start by generating a ParallelTopicModel ...
options(java.parameters = "-Xmx4g")
library(biglda)
library(polmineR)
speeches <- polmineR::as.speeches("GERMAPARLMINI", s_attribute_name = "speaker")
instance_list <- as.instance_list(speeches)
lda <- rJava::.jnew("cc/mallet/topics/ParallelTopicModel", 25L, 5.1, 0.1)
# lda <- ParallelTopicModel(n_topics = 25L, alpha_sum = 5.1, beta = 0.1)
lda$addInstances(instance_list)
lda$setNumThreads(1L)
lda$setTopicDisplay(50L, 10L)
lda$setNumIterations(150L)
lda$estimate()
Then we instantiate a TopicModelDiagnostics class object and extract information on topics.
library(xml2)
library(data.table)
library(magrittr)
x <- rJava::.jnew("cc/mallet/topics/TopicModelDiagnostics", lda, 100L)$toXML() %>%
  xml2::read_xml() %>%
  xml_find_all("/model/topic") %>%
  lapply(function(x) as.data.table(t(as.data.frame(xml_attrs(x))))) %>%
  rbindlist()
for (col in colnames(x)) x[, (col) := as.numeric(x[[col]])]
Note that there is further information on words included in the XML which remains unused in this minimal example. So this is just a proof of concept. We might turn it into a function later on.
BUT: At this stage, I could not make the constructor for the TopicModelDiagnostics class accept the RTopicModel class as input; it seems to require the ParallelTopicModel. Not nice, because there is an advantage to the RTopicModel.
This works, but it is quite slow:
x <- lda$getData()
sizes <- sapply(0L:(x$size() - 1L), function(i) x$get(i)$instance$getData()$getLength())
With a large topic model, I get a vector space exhausted error when trying to use get_terms(). Internally, there is some unnecessary copying of the data.
This snippet does the job - maybe turn it into a function.
foo <- apply(lda@beta, 1, function(row) lda@terms[tail(order(row), 10)])
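Wrapped into a function (the name get_top_terms is my own; it only generalises the apply() idiom above):

```r
# Return the n highest-weighted terms per topic; beta is the topics x terms
# matrix (slot @beta of the as_LDA() result), terms the term vector (@terms).
get_top_terms <- function(beta, terms, n = 10L) {
  apply(beta, 1L, function(row) terms[tail(order(row), n)])
}
```

Each column of the returned matrix holds one topic's terms, with the highest-weighted term last.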