byu-aml-lab / ankura
Anchor-based topic modeling
License: GNU General Public License v3.0
Inside the tandem_anchors function, we still call hmean(Q[anchor, :] + epsilon, axis=0) for anchors that contain only one word. The net effect is that epsilon gets added to every single-word anchor. Although the default epsilon is 1e-10, it has been shown to make a difference in classification (in some cases making the accuracy go up). This has been noticed when all the anchors are just one word (for instance, when we call tandem_anchors directly from gs_anchors in TBUIE).
At some point, this should be addressed: either fixed or agreed that this is as desired. For now, I will just use a smaller epsilon in TBUIE.
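To make the effect concrete, here is a minimal sketch of one possible fix, assuming Q is a dense NumPy array and an anchor is a list of row indices. The names harmonic_mean and combine_anchor are hypothetical and are not ankura's actual functions; harmonic_mean is a hand-written stand-in for the hmean call quoted above.

```python
import numpy as np


def harmonic_mean(rows):
    """Column-wise harmonic mean of a 2-D array (stand-in for hmean)."""
    rows = np.asarray(rows, dtype=float)
    return rows.shape[0] / np.sum(1.0 / rows, axis=0)


def combine_anchor(Q, anchor, epsilon=1e-10):
    """Hypothetical helper: build one anchor vector from rows of Q.

    Single-word anchors are returned unchanged, so epsilon only
    affects anchors that really combine several words.
    """
    if len(anchor) == 1:
        return Q[anchor[0], :].copy()
    return harmonic_mean(Q[anchor, :] + epsilon)


Q = np.array([[0.0, 0.2],
              [0.4, 0.1]])

# Current behaviour: the harmonic mean of a single row is just row + epsilon,
# so an exact zero in Q becomes a small positive value.
shifted = harmonic_mean(Q[[0], :] + 1e-10)

# Sketched fix: the single-word anchor keeps its exact zero.
unshifted = combine_anchor(Q, [0])
```

Whether skipping epsilon for single-word anchors is the desired behaviour is exactly the open question in this issue, so treat the branch above as one option rather than the agreed fix.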
This repo will be more coherent if it contains only code related to vanilla anchor words. The split-off repo(s) will import ankura as a dependency and will hold the pipeline functions that add feature columns to Q for supervised anchors, as well as the sampling-based supervised topic models and active learning.
Given a word-topic matrix, we need a way to quantitatively evaluate the quality of those topics. Potential evaluations include:
Add functions allowing users to modify a list of anchor words. These functions will not likely be automatically exported by ankura, but will be available through a sub-package. These will mostly be used for interactive exploration inside of ipython. Some of the potential operators include:
Many of these operators might be useful when composed. For example we might remove a word from an anchor, and create a new anchor from that same word.
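The composition described above can be sketched as follows, assuming anchors are represented as a list of lists of word strings (ankura may well use token indices instead). The function names remove_word, add_anchor, and split_word are hypothetical and only illustrate how small operators compose.

```python
def remove_word(anchors, index, word):
    """Return new anchors with `word` removed from anchor `index`.

    An anchor that becomes empty is dropped. The input is not mutated.
    """
    updated = [list(anchor) for anchor in anchors]
    updated[index].remove(word)
    return [anchor for anchor in updated if anchor]


def add_anchor(anchors, words):
    """Return new anchors with one extra anchor made of `words`."""
    return [list(anchor) for anchor in anchors] + [list(words)]


def split_word(anchors, index, word):
    """Compose the two operators: move `word` into its own anchor."""
    return add_anchor(remove_word(anchors, index, word), [word])


anchors = [["court", "judge"], ["game"]]
result = split_word(anchors, 0, "judge")
```

Returning new lists instead of mutating in place keeps earlier anchor sets available, which is convenient for undo during interactive exploration in ipython.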
It's really time to get with the times and move on to Python 3. Seriously. It's time.
Since switching to Python 3, things have been a lot slower. Time to spend some time with the profiler to figure out why!
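As a starting point for that profiling session, here is a small sketch using the standard library's cProfile and pstats. The helper profile_call is hypothetical, and the profiled call (sum over a range) is only a placeholder for whichever ankura function turns out to be slow.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


# Placeholder workload; swap in the slow call being investigated.
result, report = profile_call(sum, range(10))
```

Running the same helper under Python 2 and Python 3 and comparing the two reports should show which calls account for the slowdown.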
In the interest of eventually facilitating a web-based user study of interactive anchors, it would be useful to have a simple web server that serves some useful endpoints. Some potential queries include:
For now, we will only worry about coming up with these endpoints. The easiest way to get started is probably with a micro-framework like Flask. Eventually, it would be best to move as much as possible client side, but for now, we'll do inference server side. We leave the work of actually coming up with the client views for later.
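A minimal Flask sketch of what one such endpoint might look like is shown below. The route /topics, the JSON shape, and the recover_topic_summary helper are all assumptions made for illustration; the real server would call into ankura for inference instead of echoing the anchors back.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def recover_topic_summary(anchors):
    """Hypothetical stand-in for server-side topic inference.

    It simply echoes each anchor; a real implementation would recover
    topics from the anchors and return their top words.
    """
    return [{"anchor": anchor, "top_words": list(anchor)} for anchor in anchors]


@app.route("/topics", methods=["POST"])
def topics():
    """Accept {"anchors": [[word, ...], ...]} and return topic summaries."""
    payload = request.get_json(silent=True) or {}
    anchors = payload.get("anchors")
    if not isinstance(anchors, list):
        return jsonify(error="expected a JSON body with an 'anchors' list"), 400
    return jsonify(topics=recover_topic_summary(anchors))
```

Keeping inference behind a plain function such as recover_topic_summary makes it easier to move work client side later, since the route itself only parses and serializes JSON.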
Each call to ankura.topic.recover_topic copies Q and row-normalizes the copy. This work can be precomputed (so it isn't counted as part of interactive topic updates). Store this precomputed row-normalized Q on the Dataset object, and point out where Nozomu can add hooks to tweak individual rows of Q and row-normalized Q.
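One way this caching could look is sketched below, assuming Q is a dense NumPy array. CachedDataset, Q_rownorm, and update_row are hypothetical names, not the existing Dataset API; update_row marks the kind of hook where individual rows could be tweaked.

```python
import numpy as np


class CachedDataset:
    """Hypothetical sketch: hold Q plus a lazily computed row-normalized copy."""

    def __init__(self, Q):
        # Copy so later row updates never touch the caller's array.
        self.Q = np.array(Q, dtype=float)
        self._Q_rownorm = None

    @staticmethod
    def _row_normalize(Q):
        sums = Q.sum(axis=1, keepdims=True)
        sums[sums == 0] = 1.0  # leave all-zero rows as zeros
        return Q / sums

    @property
    def Q_rownorm(self):
        """Row-normalized Q, computed once and then reused."""
        if self._Q_rownorm is None:
            self._Q_rownorm = self._row_normalize(self.Q)
        return self._Q_rownorm

    def update_row(self, index, new_row):
        """Hook: replace one row of Q and refresh only that cached row."""
        self.Q[index] = new_row
        if self._Q_rownorm is not None:
            total = self.Q[index].sum()
            self._Q_rownorm[index] = self.Q[index] / (total if total else 1.0)


dataset = CachedDataset([[1, 3], [2, 2]])
normalized = dataset.Q_rownorm
```

With this layout, topic recovery could read dataset.Q_rownorm directly instead of copying and normalizing Q on every interactive update.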
When starting up the server, I get the following warning:
/usr/lib64/python3.4/site-packages/scipy/sparse/compressed.py:698: SparseEfficiencyWarning: Changing the sparsity structure of a csc_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
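This warning is typically raised when individual entries are assigned into a csc_matrix. Assuming that is what happens during start-up (the issue does not say where the warning originates), the pattern the warning recommends is to fill a lil_matrix and convert it once at the end. The function build_matrix and its input format are hypothetical.

```python
import warnings

import numpy as np
from scipy.sparse import SparseEfficiencyWarning, lil_matrix


def build_matrix(entries, shape):
    """Accumulate (row, col, value) triples in a lil_matrix, then convert.

    lil_matrix supports cheap incremental assignment; the single tocsc()
    call at the end yields the format suited to column-oriented arithmetic.
    """
    mat = lil_matrix(shape, dtype=np.float64)
    for row, col, value in entries:
        mat[row, col] = mat[row, col] + value
    return mat.tocsc()


entries = [(0, 1, 2.0), (2, 0, 1.0), (0, 1, 3.0)]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    counts = build_matrix(entries, (3, 2))

efficiency_warnings = [
    w for w in caught if issubclass(w.category, SparseEfficiencyWarning)
]
```

If the assignments cannot be moved, the other option is to confirm that the cost is acceptable at start-up and filter the warning explicitly.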
We need to figure out why the LAPACK headers aren't getting found by the C compiler when they aren't in the same directory as the C code.