wehlutyk / brainscopypaste

Analysis of mutation of quotes on the web

License: GNU General Public License v3.0

Python 0.20% Shell 0.01% Jupyter Notebook 99.80%

brainscopypaste's Introduction

Brains Copy Paste

Software developed for our "Brains Copy Paste" paper, which analyzes how quotes mutate as they propagate through the blog- and news-spaces. The aim is to learn how the brain copies, pastes, and alters quotes in the process.

The LaTeX source (along with the compiled PDF) for the paper is at wehlutyk/brainscopypaste-paper.

See the documentation if you want to install, run, or understand any of this. A substantial effort has gone into making this code readable and well-documented, so please feel free to use or review it if you need to or want to!

If you find a bug, a mistake, or anything you think needs changing, please file an issue.

License

GNU/GPLv3.

brainscopypaste's Issues

Do timebag evolution based on product from substitution mining

Does the distribution of word features in timebag 1 predict the duration of a cluster?

Document root scripts

Document `analyze`

IPythonize

Graphing/visualizing (not mining) can be moved into IPython notebooks with nice views and interfaces.

All the figures from the paper should be included in this.

Update the sphinx doc with the notebooks.

PCA on features

Useful?
As an answer to what?

Rework dependencies (seaborn, pandas, update all)

And rebuild the whole dependency list. See also #19.

Licensing

Treetaggerwrapper is under the GPL; can I release my code under any license?
Choose license with http://wiki.civiccommons.org/Choosing_a_License

Document `visualize`

Sphinx toctree warning

The error is:

WARNING: toctree contains reference to nonexisting document u'reference/analyze.args.GroupAnalysisArgs.title'

Also decide on the options for automodule (etc.) to get all members of all classes, but not inherited members.
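For the automodule question, one possible setup is sketched below. It assumes Sphinx 1.8 or later, where `autodoc_default_options` is available in `conf.py`; the exact set of options is an assumption, not the project's actual configuration. Inherited members are left out simply by not enabling `inherited-members`.

```python
# conf.py fragment: a sketch, assuming Sphinx >= 1.8 with sphinx.ext.autodoc.
extensions = ["sphinx.ext.autodoc"]

# Defaults applied to every automodule/autoclass directive.
autodoc_default_options = {
    "members": True,           # document all public members of each class
    "undoc-members": True,     # include members that have no docstring
    "show-inheritance": True,  # list base classes under each class signature
    # "inherited-members" is deliberately absent, so members defined only
    # on base classes are not repeated in each subclass.
}
```

Individual directives can still override these defaults where a class needs different treatment.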

Roll back to directed unweighted FA network

Set susceptibility to NaN if never substituted

Rename all the online projects to brains-copy-paste

Check all `from __future__ import division`
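A check like this could be automated. The sketch below uses the standard `ast` module to flag source text that uses the `/` operator without importing `division` from `__future__`. The function names are hypothetical, and it assumes the source parses under the Python version running the check (Python 2-only syntax such as print statements would not parse on Python 3).

```python
import ast


def has_division_import(source):
    """Return True if the source imports division from __future__."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.ImportFrom) and node.module == "__future__":
            if any(alias.name == "division" for alias in node.names):
                return True
    return False


def uses_slash_division(source):
    """Return True if the source uses the / operator (not //)."""
    tree = ast.parse(source)
    return any(isinstance(node, ast.Div) for node in ast.walk(tree))


def needs_division_import(source):
    """Flag sources that divide with / but lack the __future__ import."""
    return uses_slash_division(source) and not has_division_import(source)
```

Running `needs_division_import` over the text of each module would give a list of files to review by hand.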

Plan documentation

Check what's documented in the code and what isn't. Make a list of things that still need documenting.

Recompute and reinterpret results

After #21 and #22 are done.

Harmonise ids

For Clusters they're ints; for QtStrings in substitutions they seem to be floats; and for Quotes in Clusters they're strings!
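One way to harmonise could be a single conversion helper applied wherever an id enters the code. This is a minimal sketch that assumes every id is a whole number at heart and picks `int` as the canonical type; `normalize_id` is a hypothetical name, not an existing function in the project.

```python
def normalize_id(raw):
    """Convert an id given as int, float, or string to a plain int.

    Assumption: ids are whole numbers, so 12, 12.0 and "12" all mean 12.
    """
    if isinstance(raw, bool):
        # bool is a subclass of int, but is never a valid id.
        raise TypeError("id must not be a bool")
    if isinstance(raw, int):
        return raw
    if isinstance(raw, float):
        if not raw.is_integer():
            raise ValueError("id is not a whole number: %r" % raw)
        return int(raw)
    if isinstance(raw, str):
        return int(raw.strip())
    raise TypeError("unsupported id type: %s" % type(raw).__name__)
```

Rejecting non-integer floats, rather than truncating them, keeps a corrupted id from silently matching the wrong object.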

On Python 2.7, argparse can't be found on PyPI (because it's included in the stdlib?)

So `pip install -r requirements.txt` fails.

Great refactoring

I have discovered many new tools since this started: MongoDB, SQLAlchemy, joblib, logging, pandas, and seaborn, to name a few. So a great refactoring is in sight. Basically, there will be three levels:

Database

Models in SQLAlchemy. This will also solve #33. Basically it eases everything.

Flow

Mining operations are to be generalized. You can mine for substitutions, for evolution of timebags, or for other things, all through one command with subcommands: `brains mine {substitutions | evolution | ...}`. Each operation has prerequisites that go through joblib and are run first if needed (with confirmation), a kind of parallelized make. This includes #32.

Creating a new mining operation must be straightforward, because that's the way to go when a new question appears. Maybe also allow for quick immediate tweaking of an analysis by adding variation-ids to analyses?
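The command layout and the make-like prerequisite resolution could look roughly like the sketch below. It is an illustration under assumptions: the registry, the `operation` decorator, and the `load` step are hypothetical, the operations are placeholders that only record their own name, and the joblib caching, parallelism, and confirmation prompt mentioned above are left out.

```python
import argparse

# Hypothetical registry: operation name -> (prerequisite names, function).
OPERATIONS = {}


def operation(name, requires=()):
    """Register a mining operation together with its prerequisites."""
    def register(func):
        OPERATIONS[name] = (tuple(requires), func)
        return func
    return register


@operation("load")
def load(done):
    done.append("load")  # placeholder for loading the dataset


@operation("substitutions", requires=("load",))
def mine_substitutions(done):
    done.append("substitutions")  # placeholder for substitution mining


@operation("evolution", requires=("substitutions",))
def mine_evolution(done):
    done.append("evolution")  # placeholder for timebag evolution


def run(name, done=None):
    """Run an operation after its prerequisites, each at most once."""
    done = [] if done is None else done
    if name in done:
        return done
    requires, func = OPERATIONS[name]
    for prerequisite in requires:
        run(prerequisite, done)
    func(done)
    return done


def main(argv=None):
    parser = argparse.ArgumentParser(prog="brains")
    commands = parser.add_subparsers(dest="command")
    mine = commands.add_parser("mine")
    mine.add_argument("operation", choices=sorted(OPERATIONS))
    args = parser.parse_args(argv)
    return run(args.operation) if args.command == "mine" else []
```

With this layout, adding a new mining operation is one decorated function, which fits the goal of making new operations straightforward to create.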

Viz

In notebooks, and nowhere else. Arrange them into an ordered story too. Graphs read mined data and state their prerequisites at the beginning.

Do timebag evolution over [0, n-1]

Harmonise lemmatizing policy

Weird spike

In http://nbviewer.ipython.org/github/wehlutyk/brainscopypaste/blob/master/features_timebags_evolution_recursive_shifting.ipynb , in the MNSyllables plot, there's a growing spike at 4 syllables. What's that? It could also be related to a similar spike in the AoA plots.