Comments (5)
Are you thinking about adding a correspondence analysis (CA) option as well? Arguably, CA could be tapping into underlying linguistic properties a bit better than PCA.
from textminer.
I hadn't thought about it. From this paper (http://www.aclweb.org/anthology/W08-2007), it seems that this would work on a term co-occurrence matrix, not a document-term matrix, right? I have no problem implementing CA, though it depends on two things. First, I'll have to wait for text2vec version 0.3 to be released (coming soon) to get the term co-occurrence matrix. Second, I'd have to look into whether implementations of some of the intermediate methods exist for sparse matrices. (If not, I may be able to write them myself.)
If you want, you can open up an issue for me to look into this. I'll do my best.
On Mon, Mar 21, 2016 at 2:03 PM smikhaylov wrote:
> Are you thinking about adding a correspondence analysis (CA) option as well? Arguably, CA could be tapping into underlying linguistic properties a bit better than PCA.
from textminer.
You can run it on the DFM (document-feature matrix) directly. That's how it's implemented in quanteda's textmodel_ca function, which calls the ca package. Another option is the vegan package, which is widely used in ecology and has more functionality.
Btw, quanteda is another higher-level framework implementation.
from textminer.
I will look at quanteda as well. I'm going to benchmark the SVD implementations from irlba, RSpectra, and quanteda, and I'll implement whichever version seems fastest and most scalable. At the end of the day, LSA, PCA, and CA all rely on SVD. So, it's just a matter of which one works best.
It seems that textmineR, text2vec, and quanteda all use the same data type. I am in the process of reworking textmineR to be a higher-level package, built on text2vec. @dselivanov has done an amazing job at creating a framework that is faster and more scalable than any other I've seen (in any language), at least on a single machine. Perhaps the quanteda maintainers would want to do the same?
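To make the "all rely on SVD" point concrete: the methods differ mainly in the matrix that is handed to the SVD routine. The sketch below is only an illustration in plain Python (the packages discussed in this thread are R packages), and the function names are hypothetical, not part of textmineR, text2vec, or quanteda. It builds the standardized-residual matrix used by CA and the column-centered matrix used by PCA from a small dense document-term matrix; the SVD step itself is left to whichever library is chosen. Both transformations turn a sparse matrix into a dense one, which is one reason sparse-aware implementations matter here.

```python
from math import sqrt


def ca_standardized_residuals(counts):
    """Return the matrix that correspondence analysis decomposes with SVD.

    counts: list of rows (documents) of non-negative term counts.
    Assumes every row and every column has a positive total.
    """
    total = sum(sum(row) for row in counts)
    row_mass = [sum(row) / total for row in counts]
    col_mass = [sum(col) / total for col in zip(*counts)]
    # (observed proportion - expected under independence) / sqrt(expected)
    return [
        [(x / total - r * c) / sqrt(r * c) for x, c in zip(row, col_mass)]
        for row, r in zip(counts, row_mass)
    ]


def pca_centered(counts):
    """Return the column-centered matrix that PCA decomposes with SVD."""
    n_docs = len(counts)
    col_mean = [sum(col) / n_docs for col in zip(*counts)]
    return [[x - m for x, m in zip(row, col_mean)] for row in counts]


def total_inertia(residuals):
    """Sum of squared residuals; for CA this equals chi-squared / N."""
    return sum(v * v for row in residuals for v in row)


# Toy document-term matrix: 3 documents, 3 terms (made-up counts).
dtm = [[2, 0, 1], [0, 3, 1], [1, 1, 4]]
print(total_inertia(ca_standardized_residuals(dtm)))
```

LSA, by contrast, would pass the (possibly TF-IDF weighted) matrix to SVD without centering, which is why it can stay sparse.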
My current plan (not written anywhere on GitHub) is to create wrappers for...
- CTM and LDA based on EM from the topicmodels library (LDA based on Gibbs sampling is already imported from the lda library)
- STM from the stm library
- LSA/PCA/CA (testing irlba, RSpectra, quanteda libraries)
- GloVe from text2vec
- Represent document clustering as a topic model where each document only contains a single topic
- Others as they become available/I have time to understand them enough to build wrappers to put them in similar format.
The goal is to have a library that uses similar syntax and returns similar objects to get a wide range of topic models so users don't have to hunt them all down. My personal PhD research focuses on evaluation metrics for topic models. So, textmineR has that functionality as well.
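The clustering item in the list above can be sketched as follows. This is a minimal illustration in plain Python, not textmineR code: the function names are hypothetical, and calling the outputs theta (document-by-topic) and phi (topic-by-term) is an assumption about how the result would be formatted. Each document gets probability 1 for its own cluster, and each cluster's term distribution is the normalized sum of its documents' term counts.

```python
def clusters_to_theta(labels):
    """Build a document-by-topic matrix from hard cluster labels.

    Each row has a single 1.0 in the column of the document's cluster.
    """
    topics = sorted(set(labels))
    index = {t: k for k, t in enumerate(topics)}
    theta = [[0.0] * len(topics) for _ in labels]
    for d, label in enumerate(labels):
        theta[d][index[label]] = 1.0
    return topics, theta


def clusters_to_phi(labels, dtm):
    """Build a topic-by-term matrix by pooling term counts per cluster.

    Assumes every cluster contains at least one non-zero term count,
    otherwise the row normalization would divide by zero.
    """
    topics = sorted(set(labels))
    index = {t: k for k, t in enumerate(topics)}
    n_terms = len(dtm[0])
    counts = [[0.0] * n_terms for _ in topics]
    for label, row in zip(labels, dtm):
        k = index[label]
        for j, x in enumerate(row):
            counts[k][j] += x
    return [[x / sum(row) for x in row] for row in counts]


# Toy example: three documents, two clusters, two terms (made-up data).
labels = ["a", "b", "a"]
dtm = [[1, 0], [0, 2], [3, 0]]
topics, theta = clusters_to_theta(labels)
phi = clusters_to_phi(labels, dtm)
```

With the output in this shape, the same evaluation metrics could in principle be applied to a clustering as to any other topic model.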
from textminer.
I think that sounds really good. And the combination with text2vec is great.
Looking forward to seeing the development.
from textminer.