tommyjones / textminer

An aid for text mining in R, with a syntax that should be familiar to experienced R users. Provides a wrapper for several topic models that take similarly-formatted input and give similarly-formatted output. Includes additional functionality for analysis and diagnostics of topic models.

License: Other

R 82.53% C++ 17.47%

textminer's Introduction

textmineR


Functions for Text Mining and Topic Modeling

Copyright 2021 by Thomas W. Jones

An aid for text mining in R, with a syntax that is more familiar to experienced R users. Also implements various functions related to topic modeling, making it a good topic modeling workbench.

textmineR was created with three principles in mind:

  1. Maximize interoperability within R's ecosystem
  2. Scalable in terms of object storage and computation time
  3. Syntax that is idiomatic to R

Please see the vignettes for more information on how to get started.
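
A minimal sketch of a typical workflow (function and argument names as they appear in the issues further down this page; defaults may differ between versions, so see the vignettes for authoritative usage):

library(textmineR)

# bundled sample of NIH grant abstracts
data(nih_sample)

# document-term matrix of unigrams and bigrams
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 ngram_window = c(1, 2))

# fit an LDA topic model
model <- FitLdaModel(dtm = dtm, k = 10, iterations = 200,
                     alpha = 0.1, beta = 0.05)

# top 5 terms per topic
GetTopTerms(phi = model$phi, M = 5)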

Note: there's a lot going on with textmineR at the moment, including adding functionality based on original research.


textminer's Issues

Make unit tests for Dtm2Docs

  1. Put them in tests/testthat/test-corpus_functions.R
  2. Try to write a test for every argument or contingency the code anticipates
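
A sketch of what such a test could look like (the expectations below are hypothetical, not the actual test file):

library(testthat)
library(textmineR)

test_that("Dtm2Docs turns a dtm back into documents", {
  data(nih_sample)
  dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT[1:5],
                   doc_names = nih_sample$APPLICATION_ID[1:5])
  docs <- Dtm2Docs(dtm = dtm)
  expect_true(is.character(docs))   # hypothetical expectation: one string per document
  expect_length(docs, nrow(dtm))
})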

Correct description on coherence in vignette.

At one point, you say:
"For each pair of words {a,b} in the top M words in a topic, probabilistic coherence calculates P(b|a)−P(b), where {a} is more probable than {b} in the topic."

Then in the numbered list below, you have:
"For example, suppose the top 4 words in a topic are {a,b,c,d}. Then, we calculate
P(a|b)−P(b), P(a|c)−P(c), P(a|d)−P(d)
P(b|c)−P(c), P(b|d)−P(d)
P(c|d)−P(d)"

These conditionals are flipped relative to the prose above: to condition on the more probable word, the example should read P(b|a)−P(b), P(c|a)−P(c), P(d|a)−P(d), P(c|b)−P(c), P(d|b)−P(d), and P(d|c)−P(d).

write tests for CalcTopicModelR2

  1. Put them in tests/testthat/test-evaluation_metrics.R
  2. Try to write a test for every argument or contingency the code anticipates

suppress verbose in textmineR::CreateDtm()

Hi,

When using the function CreateDtm() in textmineR, the progress output cannot be suppressed, regardless of suppressWarnings(), suppressMessages(), invisible(), try(silent = TRUE), etc.

In a strictly non-verbose environment this is very undesirable. Could this be modified in a way that leaves verbosity as an option for the user?

stackoverflow question here
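
For reference, a sketch of the behaviour and a possible stop-gap (docs below is a placeholder character vector; this assumes the progress output is written to stdout via cat()/print(), so capture.output() may not catch output coming from C++ code or parallel workers):

# none of these silence the progress output
suppressMessages(suppressWarnings(
  dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))
))

# possible stop-gap: swallow anything printed to stdout
invisible(capture.output(
  dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))
))

A later issue on this page passes verbose = TRUE to CreateDtm(), so newer versions appear to expose a verbose argument; CreateDtm(..., verbose = FALSE) would be the cleaner fix where available.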

Make unit tests for CreateDtm

  1. Put them in tests/testthat/test-corpus_functions.R
  2. Try to write a test for every argument or contingency the code anticipates

write tests for FitCtmModel

  1. Put them in tests/testthat/test-topic_modeling_core.R
  2. Try to write a test for every argument or contingency the code anticipates

ProbCoherence - one mean() too much?

Hi Tommy,
with reference to #35 I have another question concerning the calculation of coherence. To check my LDA models I want to use different coherence measures, and I just realized that the results from my implementation of the difference measure differ from yours. Maybe I have misunderstood something; in that case I would appreciate an explanation for better understanding. In the lines highlighted below you use mean() in each step of sapply() when making calculations on combinations of wi/wj. Shouldn't the mean() be applied only once, to the full set of probability differences? I have edited your implementation and added comments to highlight what I mean.

result <- sapply(1:(ncol(count.mat) - 1), function(x) {
  # mean(   # this is the mean that I think is too much
    p.mat[x, (x + 1):ncol(p.mat)] / p.mat[x, x] -
      Matrix::diag(p.mat)[(x + 1):ncol(p.mat)]
  # , na.rm = TRUE)
})

mean(unlist(result), na.rm = TRUE)  # instead I would use the mean on all unlisted results
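
To illustrate the point with a hypothetical 3 x 3 probability matrix (made-up numbers, not package data): a mean of per-row means weights each row equally, while a single mean over all pairwise differences weights each pair equally, so the two generally disagree.

p.mat <- matrix(c(0.30, 0.10, 0.05,
                  0.10, 0.20, 0.08,
                  0.05, 0.08, 0.25), nrow = 3)

# mean of per-row means, as in the current implementation
per_row <- sapply(1:(ncol(p.mat) - 1), function(x) {
  mean(p.mat[x, (x + 1):ncol(p.mat)] / p.mat[x, x] -
         diag(p.mat)[(x + 1):ncol(p.mat)], na.rm = TRUE)
})
mean(per_row, na.rm = TRUE)

# a single mean over all pairwise differences, as suggested above
all_pairs <- unlist(lapply(1:(ncol(p.mat) - 1), function(x) {
  p.mat[x, (x + 1):ncol(p.mat)] / p.mat[x, x] -
    diag(p.mat)[(x + 1):ncol(p.mat)]
}))
mean(all_pairs, na.rm = TRUE)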

Write tests for CalcJSDivergence

  1. Put them in tests/testthat/test-distance_functions.R
  2. Try to write a test for every argument or contingency the code anticipates

Parallel execution fails on Windows when TmParallelApply called from inside a function

I think it has something to do with the environment that TmParallelApply is looking in. If I have something in my workspace with the right name, the error does not happen. Examples below.

Fails, because I don't have anything in my workspace named stopword_vec

rm(list=ls())
data(nih_sample)
 dtm <- CreateDtm(nih_sample$ABSTRACT_TEXT, 
                  doc_names = nih_sample$APPLICATION_ID, 
                  ngram_window = c(1, 2))

Fails for the same reason

rm(list=ls())
data(nih_sample)
 dtm <- CreateDtm(nih_sample$ABSTRACT_TEXT, 
                  stopword_vec = c("blah"),
                  doc_names = nih_sample$APPLICATION_ID, 
                  ngram_window = c(1, 2))

Does not fail, even though this is not the stopword_vec passed to the function

rm(list=ls())
data(nih_sample)
stopword_vec <- "blah"
 dtm <- CreateDtm(nih_sample$ABSTRACT_TEXT, 
                  doc_names = nih_sample$APPLICATION_ID, 
                  ngram_window = c(1, 2))

It looks like the source might be parallel::clusterExport
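
A minimal sketch of the suspected cause (my own illustration, not textmineR's actual code): parallel::clusterExport() looks up variables in .GlobalEnv by default, so a variable that only exists inside the calling function is not found unless an object with the same name happens to sit in the user's workspace. Exporting from the calling frame avoids this.

library(parallel)

broken <- function(stopword_vec) {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  clusterExport(cl, "stopword_vec")                          # default envir = .GlobalEnv
  parSapply(cl, 1:2, function(i) length(stopword_vec))
}

fixed <- function(stopword_vec) {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  clusterExport(cl, "stopword_vec", envir = environment())   # export from this function's frame
  parSapply(cl, 1:2, function(i) length(stopword_vec))
}

try(broken(c("blah")))   # errors: object 'stopword_vec' not found (unless it is in the workspace)
fixed(c("blah"))         # works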

`...` argument conflicts in `FitLdaModel`

Dtm2Docs does not receive ... at all (though it calls parallelization)
... is allocated to both lda::lda.collapsed.gibbs.sampler and TmParallelApply

The result is that passing args through ... will crash either TmParallelApply or lda::lda.collapsed.gibbs.sampler
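
A minimal, self-contained illustration of the conflict (hypothetical helper functions, not the package's code):

sampler        <- function(x, burnin = 0) x + burnin   # stands in for lda::lda.collapsed.gibbs.sampler
parallel_apply <- function(x, cpus = 1) x * cpus       # stands in for TmParallelApply

fit <- function(x, ...) {
  y <- sampler(x, ...)     # fine as long as ... only holds sampler arguments
  parallel_apply(y, ...)   # errors if ... also holds e.g. burnin
}

fit(1, burnin = 10)        # Error in parallel_apply(y, ...): unused argument (burnin = 10)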

write tests for FitLsaModel

  1. Put them in tests/testthat/test-topic_modeling_core.R
  2. Try to write a test for every argument or contingency the code anticipates

Wordclouds of terms from each topic in model (Lda, Ctm, Lsa)

Hello textmineR-Community

Maybe this is the wrong place to ask about such a task ... but I can't get it to work.

The goal is to visualize the GetTopTerms (M=5) for individual topics (k=4) from FitLdaModel, FitLsaModel and FitCtmModel.

Therefore I took the TopicWordCloud function (https://github.com/TommyJones/textmineR/blob/master/extra_functions/TopicWordCloud.R).

I extracted the frequencies for the individual clouds from phi with:
freq <- apply(model$phi, 1, function(x){ x[ order(x, decreasing=TRUE) ][ 1:M ] })
This is a copy of the GetTopTerms function, without the names.

So ... am I correct in assuming (for LDA and CTM) that freq contains the probabilities of the terms within the related topic? My outputs are always below 0.1 for each topic-term group. It seems I am missing some important facts about what this metric means. Additionally ... what exactly does the phi of FitLsaModel describe?

The second problem I can't solve concerns the LSA context. From time to time (when fitting models) there are negative values in a topic-term group. The TopicWordCloud function above can't handle most of these phi outputs. My naive assumption was to apply abs() to model$phi before handing it over to TopicWordCloud, but then the results differ between GetTopTerms on the LSA model and the resulting wordcloud (see attached example, sorry for the quirky German terms :D).

Below the output of my unsuccessful trials ...
TopicDetection_Report_20111108.htm.zip (a single HTML file with base64 inlined wordclouds).

So ... what is the right way to plot wordclouds for each topic using textmineR's phi from each Fit*Model function?

Hope to get help ... although it's not an issue.
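
Not an authoritative answer, but a minimal sketch of one approach (assumes the wordcloud package and a fitted model with a topic-by-term phi matrix; plot_topic_cloud is a hypothetical helper, and negative LSA weights are simply clipped at zero here):

library(wordcloud)

plot_topic_cloud <- function(phi, topic, M = 5) {
  p <- sort(phi[topic, ], decreasing = TRUE)[1:M]   # same idea as GetTopTerms, keeping the values
  p <- pmax(p, 0)                                   # LSA "phi" can contain negative weights
  wordcloud(words = names(p), freq = p, min.freq = 0, scale = c(4, 0.5))
}

# one cloud per topic, for any Fit*Model output with a phi matrix
for (k in seq_len(nrow(model$phi))) plot_topic_cloud(model$phi, k, M = 5)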

write tests for CalcGamma

  1. Put them in tests/testthat/test-topic_modeling_core.R
  2. Try to write a test for every argument or contingency the code anticipates

Make unit tests for TermDocFreq

  1. Put them in tests/testthat/test-corpus_functions.R
  2. Try to write a test for every argument or contingency the code anticipates

write tests for CalcLikelihood

  1. Put them in tests/testthat/test-evaluation_metrics.R
  2. Try to write a test for every argument or contingency the code anticipates

write tests for Dtm2Lexicon

  1. Put them in tests/testthat/test-topic_modeling_core.R
  2. Try to write a test for every argument or contingency the code anticipates

Error when fitting lda with trigrams

Hi,

I have a fairly large corpus of transcribed phone calls that I am trying to extract topics from. I have tried fitting unigram and bigram LDA models to achieve this, but so far I have not obtained great results. Thus, I wanted to see if I could achieve better results using trigrams. However, when I try this, the FitLdaModel function fails with the following error:

2 nodes produced errors; first error: SpMat::init(): requested size is too large

My code is the following:
prosCallTranscripts <- completeCalls %>%
  filter(Speaker != 'company') %>%
  group_by(Call_Name) %>%
  summarize('Call_Text' = ChampTextclean(paste(Text, '', collapse = ' '), stops))

prosCallTranscripts$Call_Text <- lemmatize_strings(prosCallTranscripts$Call_Text)
prosCallTranscripts$Call_Text <- ChampTextclean(prosCallTranscripts$Call_Text, stops)

PreCall <- grepl('Pre-Call', prosCallTranscripts$Call_Name)
prosCallTranscripts <- prosCallTranscripts[!PreCall, ]

dtm <- CreateDtm(doc_vec = prosCallTranscripts$Call_Text,
                 doc_names = prosCallTranscripts$Call_Name,
                 ngram_window = c(3, 3),
                 verbose = TRUE)

latenAlloc <- FitLdaModel(dtm = dtm, k = 5, iterations = 100, alpha = 0.1, beta = 0.05, cpus = 3)

I am using a Windows x64 computer in RStudio and I have the following packages loaded in my R session:

library(tm)
library(RWeka)
library(qdap)
library(dplyr)
library(tidyr)
library(imputeMissings)
library(textmineR)
library(textstem)
library(ggplot2)
library(ggsignif)
library(radarchart)

Please let me know if this is a bug; it looks like an Rcpp issue, but it may be something that I am doing wrong.
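
Not a confirmed diagnosis, but the SpMat error comes from Armadillo's sparse matrix code, and trigram vocabularies grow very quickly, so one thing worth trying is pruning rare trigrams before fitting (a sketch, assuming the dtm returned by CreateDtm is a sparse dgCMatrix):

# drop trigrams that appear fewer than 5 times (threshold is arbitrary)
dtm <- dtm[, Matrix::colSums(dtm) >= 5]

latenAlloc <- FitLdaModel(dtm = dtm, k = 5, iterations = 100,
                          alpha = 0.1, beta = 0.05, cpus = 3)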

tm, rweka dependencies

Hi! I also often use the lda package, so textmineR is interesting to me. tm and RWeka are very inefficient packages. You may be interested in my text2vec package, which I just submitted to CRAN. It is much faster, doesn't require rJava, and can create corpora in the lda_c format used by the lda package:

library(text2vec)
library(magrittr)
data("movie_review")
it <- itoken(movie_review$review, preprocess_function = tolower, 
             tokenizer = word_tokenizer)

vocab <- vocabulary(it, ngram = c(1L, 1L)) %>% 
  prune_vocabulary(term_count_min = 5)

it <- itoken(movie_review$review, preprocess_function = tolower, 
             tokenizer = word_tokenizer)

lda_c_dtm <- create_vocab_corpus(it, vocab) %>% 
  get_dtm('lda_c')
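
As a hedged follow-on (assuming the lda package's collapsed Gibbs sampler; vocab_terms stands in for the character vector of vocabulary terms, whose accessor differs across text2vec versions), the lda_c corpus can then be fed to the sampler directly:

library(lda)

fit <- lda.collapsed.gibbs.sampler(documents = lda_c_dtm,
                                   K = 10,
                                   vocab = vocab_terms,
                                   num.iterations = 100,
                                   alpha = 0.1,
                                   eta = 0.05)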

Remove deprecated functions

These functions have been deprecated for over a year and (at least in the case of Vec2Dtm) may not work correctly.

HellDist
JSD
Vec2Dtm

Question: Which coherence measure used? Useful for topic model validation?

Hi Tommy Jones,

I am approaching a topic modelling project based on scientific abstracts and have a question regarding the coherence measure you have thankfully implemented. Since I am not a computer scientist, I thought I'd ask before spending hours trying to figure it out myself. I guess that you use the "UMass measure" proposed by Mimno et al.; is this correct? I did not fully understand lines 72-74 of the pcoh function in CalcProbCoherence.R, i.e.:

result <- sapply(1:(ncol(count.mat) - 1), function(x) {
  mean(p.mat[x, (x + 1):ncol(p.mat)] / p.mat[x, x] -
         Matrix::diag(p.mat)[(x + 1):ncol(p.mat)], na.rm = TRUE)
})

Could you give me a hint how these lines work / what they mean?

I hope I am not bothering you too much with this non-expert question. I would be glad if you could help me to improve my understanding.

Thanks in advance
Manuel Bickel
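
My own annotation of the quoted lines (not the package source), assuming p.mat[i, j] holds the probability that words i and j co-occur in a document and the diagonal holds each word's marginal probability:

coh_terms <- unlist(lapply(1:(ncol(p.mat) - 1), function(x) {
  j <- (x + 1):ncol(p.mat)
  p.mat[x, j] / p.mat[x, x] -   # P(word_j | word_x) ...
    Matrix::diag(p.mat)[j]      # ... minus the baseline P(word_j)
}))
mean(coh_terms, na.rm = TRUE)   # average "lift" over all word pairs in the topic

Under that reading it is a document co-occurrence lift measure rather than the log-ratio UMass coherence of Mimno et al.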

Make unit tests for Dtm2Tcm

  1. Put them in tests/testthat/test-corpus_functions.R
  2. Try to write a test for every argument or contingency the code anticipates

There is no version 2.0.4 on CRAN

At the moment, installing via CRAN gives text2vec (0.4.0) and textmineR (2.0.3), and with this configuration textmineR's tokenizing behavior struggles: for example, GetTopTerms delivers chunks of sentences instead of single terms.

Thanks for your great work!

write tests for TmParallelApply

  1. Put them in tests/testthat/test-other_utilities.R
  2. Try to write a test for every argument or contingency the code anticipates

Make unit tests for CreateTcm

  1. Put them in tests/testthat/test-corpus_functions.R
  2. Try to write a test for every argument or contingency the code anticipates

write tests for FitLdaModel

  1. Put them in tests/testthat/test-topic_modeling_core.R
  2. Try to write a test for every argument or contingency the code anticipates
