ariddell / tatom
Quantitative Text Analysis for the Digital Humanities (digitale Geisteswissenschaften)
Home Page: https://de.dariah.eu/tatom/
https://de.dariah.eu/tatom/feature_selection.html
Determine values for hyperparameters:
Let us consider μ₀ and σ₀² first.
In keeping with this observation we will set μ₀ to be 3 and γ₀² to be 1.5²
In [79]: mu0 = 3
In [80]: tau20 = 1.5**2
These are inconsistent: the prose refers to σ₀² and γ₀², but the code uses tau20. The first two sentences should refer to τ₀², as in the code.
Hi Allen,
sorry to bother you again, I have a dumb question:
https://de.dariah.eu/tatom/topic_model_mallet.html
Here you wrote "Because these are lengthy texts, the novels are split up into smaller sections—a preprocessing step which improves results considerably."
My question is: are there any rules about the length (or size) of the smaller sections? One paragraph per section? One chapter of the novel? Or maybe the length of the smaller sections is not important, since we will combine the results of the topic modelling at the end after all.
I've noticed that almost all your data files are about 6 or 7 kB. I assume this is roughly the right size?
Thanks a lot!
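For reference, the tutorial's preprocessing chapter splits purely by word count (roughly 1,000 words per chunk) rather than by paragraph or chapter, and the ~6–7 kB files are consistent with that. A minimal sketch of that kind of splitting (the function name and sizes here are just illustrative):

```python
def split_into_chunks(words, n_words):
    """Split a list of words into chunks of at most n_words words each."""
    return [words[i:i + n_words] for i in range(0, len(words), n_words)]

# toy input for illustration; the tutorial uses chunks of roughly 1,000 words
words = "the quick brown fox jumps over the lazy dog".split()
chunks = split_into_chunks(words, 4)
# chunks == [['the', 'quick', 'brown', 'fox'],
#            ['jumps', 'over', 'the', 'lazy'],
#            ['dog']]
```

The exact chunk size matters less than keeping the chunks roughly uniform, since each chunk is treated as a document by the topic model.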
Hi Allen,
I refer to this material in my courses, but unfortunately the links are broken, e.g. https://de.dariah.eu/tatom/visualizing_trends.html
I emailed the DARIAH team about this several times, and their response was that the material is available at https://de.dariah.eu/tatom/
However, that is just a redirect to the GitHub repo, not a rendered HTML version of the tutorial.
So my suggestion: could you host this tutorial on GitHub Pages or your own website? I could host the material myself, but I'd like to be able to refer to an official/canonical version.
Cheers
https://de.dariah.eu/tatom/getting_started.html
"For R an Octave/Matlab users"
I think it should be "and" between R and Octave: "For R and Octave/Matlab users".
I am working through the code to play with MALLET output, and I get a syntax error here:
docnum, docname, *values = line.rstrip().split('\t')
Python does not seem to like the *. How can I resolve this issue?
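For context on errors like this: the starred target in `docnum, docname, *values = ...` is extended iterable unpacking, which only exists in Python 3, so Python 2 reports a SyntaxError at the `*`. Running the script with `python3` is the straightforward fix; if Python 2 must be supported, slicing gives the same result (a sketch with a made-up sample line):

```python
# a made-up doc-topics line, for illustration only
line = "0\tdoc1.txt\t3\t0.5\t1\t0.25"
fields = line.rstrip().split('\t')

# Python 3 only:  docnum, docname, *values = fields
# version-agnostic equivalent:
docnum, docname, values = fields[0], fields[1], fields[2:]
# docnum == '0', docname == 'doc1.txt', values == ['3', '0.5', '1', '0.25']
```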
pandas does make many operations much easier. We need to find sensible ways of integrating mentions of its uses. In principle, I think the tutorials should only require familiarity with the "basic" numpy/scipy stack.
I've been having problems with the installation of scikit-learn in Ubuntu 14.04. This seems to be a known issue; see here: http://askubuntu.com/questions/449326/installation-error-in-sklearn-for-python3.
Quote:
"There's an issue with the pre-compiled Cython C files (compatibility with Python 3.4.0) with the pypi version.
To properly install scikit-learn, use the git repo instead (tested ok on 14.04):
sudo pip3 install git+https://github.com/scikit-learn/scikit-learn.git"
This is what I tried, and despite a lot of warnings, sklearn did seem to get installed in the end.
Maybe this could be added to the "Installing Python packages" section.
Hello,
When visiting https://de.dariah.eu/tatom/topic_model_python.html, I encounter a display problem on my machine.
I am running elementary OS Freya (64-bit). The problem occurs in Firefox and Chromium, and also in Firefox in a virtual Windows 7 machine.
Thank you,
rtd_theme is great. If I knew my way around Sass better, I don't think it would be that difficult.
Hello there,
Once again, thanks for the tutorial. I am following the one on topic modeling available here. Nonetheless, when I launch the Python script (copy-pasted and adapted into a .py file), I encounter this error:
simon@simon-thinkpad:~/dariah/data$ python reorder-for-matrix.py
File "reorder-for-matrix.py", line 18
docnum, docname, *values = line.rstrip().split('\t')
^
SyntaxError: invalid syntax
with the "^" pointing to the star symbol. I'm no Python expert, so I don't really know what to change.
Here's what I launched; I basically copy-pasted your instructions, removed the step numbering, and changed the path to my doc-topics file:
import numpy as np
import itertools
import operator
import os

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

doctopic_triples = []
mallet_docnames = []

with open("doc-topics-hugo.txt") as f:
    f.readline()  # read one line in order to skip the header
    for line in f:
        docnum, docname, *values = line.rstrip().split('\t')  # this is where the error happens
        mallet_docnames.append(docname)
        for topic, share in grouper(2, values):
            triple = (docname, int(topic), float(share))
            doctopic_triples.append(triple)

# sort the triples
# triple is (docname, topicnum, share) so sort(key=operator.itemgetter(0,1))
# sorts on (docname, topicnum) which is what we want
doctopic_triples = sorted(doctopic_triples, key=operator.itemgetter(0, 1))
# sort the document names rather than relying on MALLET's ordering
mallet_docnames = sorted(mallet_docnames)
# collect into a document-topic matrix
num_docs = len(mallet_docnames)
num_topics = len(doctopic_triples) // len(mallet_docnames)
# the following works because we know that the triples are in sequential order
doctopic = np.zeros((num_docs, num_topics))
for triple in doctopic_triples:
    docname, topic, share = triple
    row_num = mallet_docnames.index(docname)
    doctopic[row_num, topic] = share
Thanks in advance.
Hi Allen,
https://de.dariah.eu/tatom/preprocessing.html#every-1-000-words
def split_text(filename, n_words):
    """Split a text into chunks approximately n_words
    words in length."""
    input = open(filename, 'r')
    words = input.read().split(' ')
    input.close()
At the place of "input = open(filename, 'r')":
I don't know if using "input = open(filename, 'r', encoding='utf-8')" would be better.
Otherwise you may get an error message like: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 10: character maps to <undefined>".
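Agreed that an explicit encoding is safer than relying on the platform default (Windows often defaults to cp1252, which is where the "charmap" error comes from). A sketch, assuming the corpus files are UTF-8:

```python
def read_words(filename):
    # pass the encoding explicitly so reading does not depend on the
    # platform's default codec (assumes the corpus files are UTF-8)
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read().split(' ')
```

The `with` block also closes the file automatically, so the explicit `input.close()` from the tutorial snippet is no longer needed.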
The docs should build flawlessly on Ubuntu 14.04 and a development version of the docs should be available at github.io.
Die Präsentation der Bibliotheken und die Installationsunterstützung ist ja
eher für Einsteiger, auf der anderen Seite gibt es für Einsteiger im Rest
des Tutorials zu viel implizites Wissen, was das Lesen auch für mich
manchmal schwierig gemacht hat. Manche Erläuterungen zu methodischen so wie
Python spezifische Begrifflichkeiten würden dem Lesefluss gut tun. Auch
wenn eher Fortgeschrittene angesprochen werden sollen könnte im Sinne der
Lesbarkeit der Text einfach noch etwas expressiver sein.
In the chapter on preprocessing, NLTK's PunktWordTokenizer is used directly (input 11). This no longer seems to work in NLTK version 3.0.3. In fact, this word tokenizer was not supposed to be used in the first place. Maybe it should be removed from the tutorial?
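For anyone hitting this before the tutorial is updated: `nltk.word_tokenize` is NLTK's recommended general-purpose tokenizer. If you just need the rough word/punctuation split the tutorial used the class for, a dependency-free regex sketch (only an approximation, not equivalent to the NLTK tokenizers) also works:

```python
import re

def rough_tokenize(text):
    # match word runs (keeping internal apostrophes) or single punctuation
    # marks; a crude approximation, not a replacement for NLTK's tokenizers
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

rough_tokenize("Don't panic, it's fine.")
# ["Don't", 'panic', ',', "it's", 'fine', '.']
```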
In the section on “Using the package manager”, you list the relevant IPython package as python3-ipython; in recent versions, it's actually called ipython3.
Hello there,
I am trying to use the code from the tutorial on topic modeling and I am facing a problem I cannot solve on my own:
Traceback (most recent call last):
File "/.../mallet_python.py", line 47, in
doctopic[row_num, topic] = share
IndexError: index 14 is out of bounds for axis 1 with size 6
I have copied the code, stored it in a .py file, and adjusted the path to the doc-topics file:
import os
import numpy as np
import itertools
import operator

def grouper(n, iterable, fillvalue=None):
    # Collect data into fixed-length chunks or blocks
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

doctopic_triples = []
mallet_docnames = []

with open("doc-topics.txt") as f:
    f.readline()  # read one line in order to skip the header
    for line in f:
        docnum, docname, *values = line.rstrip().split('\t')
        mallet_docnames.append(docname)
        for topic, share in grouper(2, values):
            triple = (docname, int(topic), float(share))
            doctopic_triples.append(triple)

# sort the triples
# triple is (docname, topicnum, share) so sort(key=operator.itemgetter(0,1))
# sorts on (docname, topicnum) which is what we want
doctopic_triples = sorted(doctopic_triples, key=operator.itemgetter(0, 1))
# sort the document names rather than relying on MALLET's ordering
mallet_docnames = sorted(mallet_docnames)
# collect into a document-topic matrix
num_docs = len(mallet_docnames)
num_topics = len(doctopic_triples) // len(mallet_docnames)
# the following works because we know that the triples are in sequential order
doctopic = np.zeros((num_docs, num_topics))
for triple in doctopic_triples:
    docname, topic, share = triple
    row_num = mallet_docnames.index(docname)
    doctopic[row_num, topic] = share  # error
My doc-topics file has the following structure:
doc_number filename topic share ....
There are altogether 6 columns with topics and 6 with shares, which makes 12 in total; plus doc_number and filename, that makes 14. I guess that is what the error is about, but I don't know what I am doing wrong.
Thanks in advance!
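Without seeing the file, one plausible cause: `num_topics = len(doctopic_triples) // len(mallet_docnames)` assumes every line contributes the same number of topic/share pairs and that topic ids are dense starting at 0. If that assumption fails, the matrix comes out too narrow (size 6 here) while a topic id of 14 still appears in the triples. Sizing the topic axis from the largest topic id actually seen is more defensive; a sketch with invented triples standing in for the parsed rows:

```python
# invented triples (docname, topic, share), for illustration only
doctopic_triples = [('a.txt', 0, 0.6), ('a.txt', 14, 0.4),
                    ('b.txt', 0, 0.1), ('b.txt', 14, 0.9)]

mallet_docnames = sorted({docname for docname, _, _ in doctopic_triples})
num_docs = len(mallet_docnames)
# size the topic axis from the data itself, not from a triples-per-doc ratio
num_topics = max(topic for _, topic, _ in doctopic_triples) + 1  # 15 here

doctopic = [[0.0] * num_topics for _ in range(num_docs)]
for docname, topic, share in doctopic_triples:
    doctopic[mallet_docnames.index(docname)][topic] = share
```

The same `num_topics` line drops straight into the tutorial script in place of the integer division, with `np.zeros((num_docs, num_topics))` unchanged.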
Done. You would then still need to give the img tags of the images to be displayed the CSS class 'fancybox', the type 'image', and, if needed, a caption. How exactly that works is described here in the wiki:
https://dev2.dariah.eu/wiki/display/DARIAHDE/Gestaltungs-,+Pflegehinweise
(subsection "Fancybox im Web Content Portlet").