tommyjones / textminer

An aid for text mining in R, with a syntax that should be familiar to experienced R users. Provides a wrapper for several topic models that take similarly-formatted input and give similarly-formatted output. Includes additional functionality for analysis and diagnostics of topic models.

License: Other

R 82.53% C++ 17.47%

textminer's Introduction

textmineR


Functions for Text Mining and Topic Modeling

Copyright 2021 by Thomas W. Jones

An aid for text mining in R, with a syntax that is more familiar to experienced R users. Also implements various functions related to topic modeling, making it a good topic modeling workbench.

textmineR was created with three principles in mind:

  1. Maximize interoperability within R's ecosystem
  2. Scalable in terms of object storage and computation time
  3. Syntax that is idiomatic to R

Please see the vignettes for more information on how to get started.
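
A minimal sketch of a typical workflow (function and argument names as they appear in the issues further down this page; defaults may differ between versions, so see the vignettes for authoritative usage):

library(textmineR)

# bundled sample of NIH grant abstracts
data(nih_sample)

# document-term matrix of unigrams and bigrams
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 ngram_window = c(1, 2))

# fit an LDA topic model
model <- FitLdaModel(dtm = dtm, k = 10, iterations = 200,
                     alpha = 0.1, beta = 0.05)

# top 5 terms per topic
GetTopTerms(phi = model$phi, M = 5)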

Note: there's a lot going on with textmineR at the moment, including adding functionality based on original research.


textminer's Issues

Make unit tests for Dtm2Docs

  1. Put them in tests/testthat/test-corpus_functions.R
  2. Try to write a test for every argument or contingency the code anticipates
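
A sketch of what such a test could look like (the expectations below are hypothetical, not the actual test file):

library(testthat)
library(textmineR)

test_that("Dtm2Docs turns a dtm back into documents", {
  data(nih_sample)
  dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT[1:5],
                   doc_names = nih_sample$APPLICATION_ID[1:5])
  docs <- Dtm2Docs(dtm = dtm)
  expect_true(is.character(docs))   # hypothetical expectation: one string per document
  expect_length(docs, nrow(dtm))
})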

Correct description on coherence in vignette.

At one point, you say:
"For each pair of words {a,b} in the top M words in a topic, probabilistic coherence calculates P(b|a)−P(b), where {a} is more probable than {b} in the topic."

Then in the numbered list below, you have:
"For example, suppose the top 4 words in a topic are {a,b,c,d}. Then, we calculate
P(a|b)−P(b), P(a|c)−P(c), P(a|d)−P(d)
P(b|c)−P(c), P(b|d)−P(d)
P(c|d)−P(d)"

These conditionals are flipped relative to the prose above: to condition on the more probable word, the example should read P(b|a)−P(b), P(c|a)−P(c), P(d|a)−P(d), P(c|b)−P(c), P(d|b)−P(d), and P(d|c)−P(d).

write tests for CalcTopicModelR2

  1. Put them in tests/testthat/test-evaluation_metrics.R
  2. Try to write a test for every argument or contingency the code anticipates

suppress verbose in textmineR::CreateDtm()

Hi,

When using the function CreateDtm() in textmineR, the progress output cannot be suppressed, regardless of suppressWarnings(), suppressMessages(), invisible(), try(silent = TRUE), etc.

In a strictly non-verbose environment this is very undesirable. Could this be modified in a way that leaves verbosity as an option for the user?

stackoverflow question here
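
For reference, a sketch of the behaviour and a possible stop-gap (docs below is a placeholder character vector; this assumes the progress output is written to stdout via cat()/print(), so capture.output() may not catch output coming from C++ code or parallel workers):

# none of these silence the progress output
suppressMessages(suppressWarnings(
  dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))
))

# possible stop-gap: swallow anything printed to stdout
invisible(capture.output(
  dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))
))

A later issue on this page passes verbose = TRUE to CreateDtm(), so newer versions appear to expose a verbose argument; CreateDtm(..., verbose = FALSE) would be the cleaner fix where available.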

Make unit tests for CreateDtm

  1. Put them in tests/testthat/test-corpus_functions.R
  2. Try to write a test for every argument or contingency the code anticipates

write tests for FitCtmModel

  1. Put them in tests/testthat/test-topic_modeling_core.R
  2. Try to write a test for every argument or contingency the code anticipates

ProbCoherence - one mean() too much?

Hi Tommy,
with reference to #35 I have another question concerning the calculation of coherence. To check my LDA models I want to use different coherence measures, and I just realized that the results from my implementation of the difference measure differ from yours. Maybe I have misunderstood something; in that case I would appreciate an explanation for better understanding. In the lines highlighted below you use mean() in each step of sapply() when making calculations on combinations of wi/wj. Shouldn't the mean() be applied only once, to the full set of probability differences? I have edited your implementation and added comments to highlight what I mean.

result <- sapply(1:(ncol(count.mat) - 1), function(x) {
  # mean(   # this is the mean that I think is too much
    p.mat[x, (x + 1):ncol(p.mat)] / p.mat[x, x] -
      Matrix::diag(p.mat)[(x + 1):ncol(p.mat)]
  # , na.rm = TRUE)
})

mean(unlist(result), na.rm = TRUE)  # instead I would use the mean on all unlisted results
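
To illustrate the point with a hypothetical 3 x 3 probability matrix (made-up numbers, not package data): a mean of per-row means weights each row equally, while a single mean over all pairwise differences weights each pair equally, so the two generally disagree.

p.mat <- matrix(c(0.30, 0.10, 0.05,
                  0.10, 0.20, 0.08,
                  0.05, 0.08, 0.25), nrow = 3)

# mean of per-row means, as in the current implementation
per_row <- sapply(1:(ncol(p.mat) - 1), function(x) {
  mean(p.mat[x, (x + 1):ncol(p.mat)] / p.mat[x, x] -
         diag(p.mat)[(x + 1):ncol(p.mat)], na.rm = TRUE)
})
mean(per_row, na.rm = TRUE)

# a single mean over all pairwise differences, as suggested above
all_pairs <- unlist(lapply(1:(ncol(p.mat) - 1), function(x) {
  p.mat[x, (x + 1):ncol(p.mat)] / p.mat[x, x] -
    diag(p.mat)[(x + 1):ncol(p.mat)]
}))
mean(all_pairs, na.rm = TRUE)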

Write tests for CalcJSDivergence

  1. Put them in tests/testthat/test-distance_functions.R
  2. Try to write a test for every argument or contingency the code anticipates

Parallel execution fails on Windows when TmParallelApply called from inside a function

I think it has something to do with the environment that TmParallelApply is looking in. If I have something in my workspace with the right name, the error does not happen. Examples below.

Fails, because I don't have anything in my workspace named stopword_vec

rm(list=ls())
data(nih_sample)
 dtm <- CreateDtm(nih_sample$ABSTRACT_TEXT, 
                  doc_names = nih_sample$APPLICATION_ID, 
                  ngram_window = c(1, 2))

Fails for the same reason

rm(list=ls())
data(nih_sample)
 dtm <- CreateDtm(nih_sample$ABSTRACT_TEXT, 
                  stopword_vec = c("blah"),
                  doc_names = nih_sample$APPLICATION_ID, 
                  ngram_window = c(1, 2))

Does not fail, even though this is not the stopword_vec passed to the function

rm(list=ls())
data(nih_sample)
stopword_vec <- "blah"
 dtm <- CreateDtm(nih_sample$ABSTRACT_TEXT, 
                  doc_names = nih_sample$APPLICATION_ID, 
                  ngram_window = c(1, 2))

It looks like the source might be parallel::clusterExport
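
A minimal sketch of the suspected cause (my own illustration, not textmineR's actual code): parallel::clusterExport() looks up variables in .GlobalEnv by default, so a variable that only exists inside the calling function is not found unless an object with the same name happens to sit in the user's workspace. Exporting from the calling frame avoids this.

library(parallel)

broken <- function(stopword_vec) {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  clusterExport(cl, "stopword_vec")                          # default envir = .GlobalEnv
  parSapply(cl, 1:2, function(i) length(stopword_vec))
}

fixed <- function(stopword_vec) {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  clusterExport(cl, "stopword_vec", envir = environment())   # export from this function's frame
  parSapply(cl, 1:2, function(i) length(stopword_vec))
}

try(broken(c("blah")))   # errors: object 'stopword_vec' not found (unless it is in the workspace)
fixed(c("blah"))         # works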

`...` argument conflicts in `FitLdaModel`

Dtm2Docs does not receive ... at all (though it calls parallelization)
... is allocated to both lda::lda.collapsed.gibbs.sampler and TmParallelApply

The result is that passing args through ... will crash either TmParallelApply or lda::lda.collapsed.gibbs.sampler
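
A minimal, self-contained illustration of the conflict (hypothetical helper functions, not the package's code):

sampler        <- function(x, burnin = 0) x + burnin   # stands in for lda::lda.collapsed.gibbs.sampler
parallel_apply <- function(x, cpus = 1) x * cpus       # stands in for TmParallelApply

fit <- function(x, ...) {
  y <- sampler(x, ...)     # fine as long as ... only holds sampler arguments
  parallel_apply(y, ...)   # errors if ... also holds e.g. burnin
}

fit(1, burnin = 10)        # Error in parallel_apply(y, ...): unused argument (burnin = 10)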

write tests for FitLsaModel

  1. Put them in tests/testthat/test-topic_modeling_core.R
  2. Try to write a test for every argument or contingency the code anticipates

Wordclouds of terms from each topic in model (Lda, Ctm, Lsa)

Hello textmineR-Community

Maybe this is the wrong place to ask about such a task ... but I can't get it to work.

The goal is to visualize the GetTopTerms (M=5) for individual topics (k=4) from FitLdaModel, FitLsaModel and FitCtmModel.

Therefore I took the TopicWordCloud function (https://github.com/TommyJones/textmineR/blob/master/extra_functions/TopicWordCloud.R).

I extracted the frequencies for the individual clouds from phi with:
freq <- apply(model$phi, 1, function(x){ x[ order(x, decreasing=TRUE) ][ 1:M ] })
This is a copy of the GetTopTerms function, without the names.

So ... am I correct in assuming (for LDA and CTM) that freq contains the probabilities of the terms within the related topic? My outputs are always below 0.1 for each topic-term group. It seems I am missing some important facts about what this metric means. Additionally ... what exactly does the phi of FitLsaModel describe?

The second problem I can't solve concerns the LSA context. From time to time (when fitting models) there are negative values in a topic-term group. The TopicWordCloud function above can't handle most of these phi outputs. My naive assumption was to apply abs() to model$phi before handing it over to TopicWordCloud, but then the results differ between GetTopTerms on the LSA model and the resulting wordcloud (see attached example, sorry for the quirky German terms :D).

Below the output of my unsuccessful trials ...
TopicDetection_Report_20111108.htm.zip (a single HTML file with base64 inlined wordclouds).

So ... what is the right way to plot wordclouds for each topic using textmineR's phi from each Fit*Model function?

Hope to get help ... although it's not an issue.
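
Not an authoritative answer, but a minimal sketch of one approach (assumes the wordcloud package and a fitted model with a topic-by-term phi matrix; plot_topic_cloud is a hypothetical helper, and negative LSA weights are simply clipped at zero here):

library(wordcloud)

plot_topic_cloud <- function(phi, topic, M = 5) {
  p <- sort(phi[topic, ], decreasing = TRUE)[1:M]   # same idea as GetTopTerms, keeping the values
  p <- pmax(p, 0)                                   # LSA "phi" can contain negative weights
  wordcloud(words = names(p), freq = p, min.freq = 0, scale = c(4, 0.5))
}

# one cloud per topic, for any Fit*Model output with a phi matrix
for (k in seq_len(nrow(model$phi))) plot_topic_cloud(model$phi, k, M = 5)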

write tests for CalcGamma

  1. Put them in tests/testthat/test-topic_modeling_core.R
  2. Try to write a test for every argument or contingency the code anticipates

Make unit tests for TermDocFreq

  1. Put them in tests/testthat/test-corpus_functions.R
  2. Try to write a test for every argument or contingency the code anticipates

write tests for CalcLikelihood

  1. Put them in tests/testthat/test-evaluation_metrics.R
  2. Try to write a test for every argument or contingency the code anticipates

write tests for Dtm2Lexicon

  1. Put them in tests/testthat/test-topic_modeling_core.R
  2. Try to write a test for every argument or contingency the code anticipates

Error when fitting lda with trigrams

Hi,

I have a fairly large corpus of transcribed phone calls that I am trying to extract topics from. I have tried fitting unigram and bigram LDA models to achieve this, but so far I have not obtained great results. Thus, I wanted to see if I could achieve better results using trigrams. However, when I try this, the FitLdaModel function fails with the following error:

2 nodes produced errors; first error: SpMat::init(): requested size is too large

My code is the following:
prosCallTranscripts <- completeCalls %>%
  filter(Speaker != 'company') %>%
  group_by(Call_Name) %>%
  summarize('Call_Text' = ChampTextclean(paste(Text, '', collapse = ' '), stops))

prosCallTranscripts$Call_Text <- lemmatize_strings(prosCallTranscripts$Call_Text)
prosCallTranscripts$Call_Text <- ChampTextclean(prosCallTranscripts$Call_Text, stops)

PreCall <- grepl('Pre-Call', prosCallTranscripts$Call_Name)
prosCallTranscripts <- prosCallTranscripts[!PreCall, ]

dtm <- CreateDtm(doc_vec = prosCallTranscripts$Call_Text,
                 doc_names = prosCallTranscripts$Call_Name,
                 ngram_window = c(3, 3),
                 verbose = TRUE)

latenAlloc <- FitLdaModel(dtm = dtm, k = 5, iterations = 100, alpha = 0.1, beta = 0.05, cpus = 3)

I am using a Windows x64 computer in RStudio and I have the following packages loaded in my R session:

library(tm)
library(RWeka)
library(qdap)
library(dplyr)
library(tidyr)
library(imputeMissings)
library(textmineR)
library(textstem)
library(ggplot2)
library(ggsignif)
library(radarchart)

Please let me know if this is a bug; it looks like an Rcpp issue, but it may be something that I am doing wrong.
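
Not a confirmed diagnosis, but the SpMat error comes from Armadillo's sparse matrix code, and trigram vocabularies grow very quickly, so one thing worth trying is pruning rare trigrams before fitting (a sketch, assuming the dtm returned by CreateDtm is a sparse dgCMatrix):

# drop trigrams that appear fewer than 5 times (threshold is arbitrary)
dtm <- dtm[, Matrix::colSums(dtm) >= 5]

latenAlloc <- FitLdaModel(dtm = dtm, k = 5, iterations = 100,
                          alpha = 0.1, beta = 0.05, cpus = 3)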

tm, rweka dependencies

Hi! I also often use the lda package, so textmineR is interesting to me. tm and RWeka are very inefficient packages. You may be interested in my text2vec package, which I just submitted to CRAN. It is much faster, doesn't require rJava, and can create corpora in the lda_c format used by the lda package:

library(text2vec)
library(magrittr)
data("movie_review")
it <- itoken(movie_review$review, preprocess_function = tolower, 
             tokenizer = word_tokenizer)

vocab <- vocabulary(it, ngram = c(1L, 1L)) %>% 
  prune_vocabulary(term_count_min = 5)

it <- itoken(movie_review$review, preprocess_function = tolower, 
             tokenizer = word_tokenizer)

lda_c_dtm <- create_vocab_corpus(it, vocab) %>% 
  get_dtm('lda_c')
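
As a hedged follow-on (assuming the lda package's collapsed Gibbs sampler; vocab_terms stands in for the character vector of vocabulary terms, whose accessor differs across text2vec versions), the lda_c corpus can then be fed to the sampler directly:

library(lda)

fit <- lda.collapsed.gibbs.sampler(documents = lda_c_dtm,
                                   K = 10,
                                   vocab = vocab_terms,
                                   num.iterations = 100,
                                   alpha = 0.1,
                                   eta = 0.05)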

Remove deprecated functions

These functions have been deprecated for over a year and (at least in the case of Vec2Dtm) may not work correctly.

HellDist
JSD
Vec2Dtm

Question: Which coherence measure used? Useful for topic model validation?

Hi Tommy Jones,

I am approaching a topic modelling project based on scientific abstracts and have a question regarding the coherence measure you have thankfully implemented. Since I am not a computer scientist, I thought I'd ask before spending hours trying to figure it out myself. I guess that you use the "UMass measure" proposed by Mimno et al.; is this correct? I did not fully understand lines 72-74 of the pcoh function in CalcProbCoherence.R, i.e.:

result <- sapply(1:(ncol(count.mat) - 1), function(x) {
  mean(p.mat[x, (x + 1):ncol(p.mat)] / p.mat[x, x] -
         Matrix::diag(p.mat)[(x + 1):ncol(p.mat)], na.rm = TRUE)
})

Could you give me a hint how these lines work / what they mean?

I hope I am not bothering you too much with this non-expert question. I would be glad if you could help me to improve my understanding.

Thanks in advance
Manuel Bickel
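
My own annotation of the quoted lines (not the package source), assuming p.mat[i, j] holds the probability that words i and j co-occur in a document and the diagonal holds each word's marginal probability:

coh_terms <- unlist(lapply(1:(ncol(p.mat) - 1), function(x) {
  j <- (x + 1):ncol(p.mat)
  p.mat[x, j] / p.mat[x, x] -   # P(word_j | word_x) ...
    Matrix::diag(p.mat)[j]      # ... minus the baseline P(word_j)
}))
mean(coh_terms, na.rm = TRUE)   # average "lift" over all word pairs in the topic

Under that reading it is a document co-occurrence lift measure rather than the log-ratio UMass coherence of Mimno et al.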

Make unit tests for Dtm2Tcm

  1. Put them in tests/testthat/test-corpus_functions.R
  2. Try to write a test for every argument or contingency the code anticipates

There is no version 2.0.4 on CRAN

At the moment, installing via CRAN gives text2vec (0.4.0) and textmineR (2.0.3), and with this configuration textmineR's tokenizing behavior struggles: for example, GetTopTerms delivers chunks of sentences instead of single terms.

Thanks for your great work!

write tests for TmParallelApply

  1. Put them in tests/testthat/test-other_utilities.R
  2. Try to write a test for every argument or contingency the code anticipates

Make unit tests for CreateTcm

  1. Put them in tests/testthat/test-corpus_functions.R
  2. Try to write a test for every argument or contingency the code anticipates

write tests for FitLdaModel

  1. Put them in tests/testthat/test-topic_modeling_core.R
  2. Try to write a test for every argument or contingency the code anticipates
