ariddell / tatom Goto Github PK


47.0 47.0 18.0 43.36 MB

Quantitative Text Analysis for the digitale Geisteswissenschaften

Home Page: https://de.dariah.eu/tatom/

Makefile 15.37% Python 73.49% TeX 11.14%


tatom's Issues

Bayesian Group Comparison Typos

https://de.dariah.eu/tatom/feature_selection.html

Determine values for hyperparameters:

Let us consider μ₀ and σ₀² first.
In keeping with this observation we will set μ₀ to be 3 and γ₀² to be 1.5²
In [79]: mu0 = 3
In [80]: tau20 = 1.5**2

These are inconsistent: the prose names the variance hyperparameter σ₀² in one sentence and γ₀² in the next, while the code calls it tau20. Both sentences should refer to τ₀², as in the code.

Topic modeling_Split the novels

Hi Allen,

Sorry to bother you again; I have a dumb question:

https://de.dariah.eu/tatom/topic_model_mallet.html

Here you wrote "Because these are lengthy texts, the novels are split up into smaller sections—a preprocessing step which improves results considerably."

My question is: are there any rules about the length (or size) of the smaller sections? One paragraph per section? One chapter of the novel? Or maybe the length of the smaller sections is not important, since we combine the topic modeling results in the end anyway.

I've noticed that almost all of your data files are about 6 or 7 kB. Is that perhaps the right size?

Thanks a lot!

Host this material?

Hi Allen,

I refer to this material in my courses, but unfortunately the links are broken, e.g.: https://de.dariah.eu/tatom/visualizing_trends.html
I mailed the DARIAH team about this several times, and their response was that the material is available at https://de.dariah.eu/tatom/
However, this is just a redirect to the github repo, not a rendered HTML version of the tutorial.
So my suggestion: could you host this tutorial on github pages or your own website? I could host the material myself, but I'd like to be able to refer to an official/canonical version.

Cheers

Is this a typo?

https://de.dariah.eu/tatom/getting_started.html

For R an Octave/Matlab users

I think it should be "and" between R and Octave.

syntax error

I am working through the code to play with MALLET output, and I get a syntax error here:

docnum, docname, *values = line.rstrip().split('\t')

Python does not seem to like the *. How can I resolve this issue?
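
Starred assignment targets such as `*values` were introduced in Python 3 (PEP 3132), so a likely cause is that the script is running under Python 2. The sketch below shows an equivalent that avoids the star by slicing the list of fields. The function name `parse_doctopic_line` and the sample line are hypothetical and only serve the illustration.

```python
def parse_doctopic_line(line):
    """Split one tab-separated line into docnum, docname and the remaining values."""
    fields = line.rstrip().split('\t')
    # Slicing does the same job as `docnum, docname, *values = fields`
    # and works in both Python 2 and Python 3.
    docnum, docname, values = fields[0], fields[1], fields[2:]
    return docnum, docname, values


# Hypothetical sample line: document number, name, then topic/share pairs.
sample = "0\tfile:/texts/Hugo_0001.txt\t3\t0.5\t1\t0.25\n"
docnum, docname, values = parse_doctopic_line(sample)
```

Running the tutorial code under a Python 3 interpreter is the simpler route where that is possible, since other parts of the tutorial may also assume Python 3.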

mention nltk's MALLET wrapper

Add discussion of pandas (in lieu of numpy?)

Pandas does make many operations much easier. Need to find sensible ways of integrating mentions of its uses. In principle, I think the tutorials should only require familiarity with the "basic" numpy/scipy stack.

Page not found https://de.dariah.eu/tatom/

https://de.dariah.eu/tatom/topic_model_python.html

Installation of sklearn for Python3 on Linux

I've been having problems with the installation of scikit-learn in Ubuntu 14.04. This seems to be a known issue; see here: http://askubuntu.com/questions/449326/installation-error-in-sklearn-for-python3.

Quote:
"There's an issue with the pre-compiled Cython C files (compatibility with Python 3.4.0) with the pypi version.
To properly install scikit-learn, use the git repo instead (tested ok on 14.04):
sudo pip3 install git+https://github.com/scikit-learn/scikit-learn.git"

This is what I've tried and despite a lot of warnings, in the end sklearn seemed to have been installed.

Maybe this could be added to the "Installing Python packages" section.

Webpage does not display correctly in Linux (chromium-browser and firefox)

Hello,

When visiting https://de.dariah.eu/tatom/topic_model_python.html, I encounter a display problem on my machine.

I am running Elementary OS Freya (64-bit). The problem occurs in Firefox and chromium-browser, and also in Firefox on a virtual Windows 7 machine.

Here is a screenshot:

Thanking you,

build in some sort of sidebar

rtd_theme is great. If I knew my way around sass better I don't think it would be that difficult.

SyntaxError: invalid syntax on Topic Modeling script

Hello there,

Once again, thanks for the tutorial. I am following the one on Topic Modeling available here. However, when I launch the Python script (copy-pasted and adapted into a .py file), I encounter this error:

simon@simon-thinkpad:~/dariah/data$ python reorder-for-matrix.py 
File "reorder-for-matrix.py", line 18
    docnum, docname, *values = line.rstrip().split('\t')
                     ^
SyntaxError: invalid syntax

with the "^" pointing to the star symbol. I'm no Python expert, so I don't really know what to change.

Here's what I launched. I basically copy-pasted your instructions, removed the step numbering and changed the path to my doc-topics file:

import numpy as np
import itertools
import operator
import os

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

doctopic_triples = []

mallet_docnames = []

with open("doc-topics-hugo.txt") as f:
    f.readline()  # read one line in order to skip the header
    for line in f:
        docnum, docname, *values = line.rstrip().split('\t') # this is where the error happens
        mallet_docnames.append(docname)
        for topic, share in grouper(2, values):
            triple = (docname, int(topic), float(share))
            doctopic_triples.append(triple)

# sort the triples
# triple is (docname, topicnum, share) so sort(key=operator.itemgetter(0,1))
# sorts on (docname, topicnum) which is what we want
doctopic_triples = sorted(doctopic_triples, key=operator.itemgetter(0,1))

# sort the document names rather than relying on MALLET's ordering
mallet_docnames = sorted(mallet_docnames)

# collect into a document-topic matrix
num_docs = len(mallet_docnames)

num_topics = len(doctopic_triples) // len(mallet_docnames)

# the following works because we know that the triples are in sequential order
doctopic = np.zeros((num_docs, num_topics))

for triple in doctopic_triples:
    docname, topic, share = triple
    row_num = mallet_docnames.index(docname)
    doctopic[row_num, topic] = share

Thanks in advance.
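
The shell prompt above runs the script with `python`, which on some systems may still point to Python 2. Both the starred assignment and `itertools.zip_longest` are Python 3 features, so a small guard at the top of the script can make the cause obvious. This is only a sketch; the helper name `has_extended_unpacking` is hypothetical.

```python
import sys


def has_extended_unpacking(version_info=sys.version_info):
    """Return True if the interpreter accepts `a, *rest = seq` (Python 3 and later)."""
    return tuple(version_info[:2]) >= (3, 0)


if not has_extended_unpacking():
    # Fail early with a readable message when run under Python 2.
    sys.exit("This script needs Python 3; try running it with `python3`.")
```

Note that a guard like this only helps if the starred assignment lives in a separate module, because a SyntaxError is raised when the file is compiled, before any of its code runs.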

UnicodeDecodeError

Hi Allen,

https://de.dariah.eu/tatom/preprocessing.html#every-1-000-words

def split_text(filename, n_words):
    """Split a text into chunks approximately n_words words in length."""
    input = open(filename, 'r')
    words = input.read().split(' ')
    input.close()

The line in question is "input = open(filename, 'r')".

I wonder whether "input = open(filename, 'r', encoding='UTF-8')" would be better.

Otherwise you may get the error message: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 10: character maps to <undefined>".
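
A minimal sketch of the suggested change follows. It passes an explicit encoding to `open` so that the result does not depend on the platform's default codec. Only the first lines of the tutorial's function are quoted above, so the chunking step here is an assumption and not the tutorial's own implementation. The `encoding` parameter is also an addition for illustration.

```python
def split_text(filename, n_words, encoding='utf-8'):
    """Split a text into chunks of at most n_words words each."""
    # An explicit encoding avoids falling back to a locale codec such as
    # 'charmap' (cp1252), which cannot decode every byte of a UTF-8 file.
    with open(filename, 'r', encoding=encoding) as f:
        words = f.read().split(' ')
    # Assumed chunking step: group consecutive words into fixed-size slices.
    return [' '.join(words[i:i + n_words])
            for i in range(0, len(words), n_words)]
```

The `with` statement also closes the file if reading fails, which the explicit `close()` call does not guarantee.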

Build TAToM on Ubuntu 14.04 with Python 3.4

The docs should build flawlessly on Ubuntu 14.04 and a development version of the docs should be available at github.io.

Clarify target audience / insert clarifications of underlying concepts

The presentation of the libraries and the installation help is aimed more at
beginners. On the other hand, the rest of the tutorial assumes too much
implicit knowledge for beginners, which at times made it hard to read even
for me. Some explanations of methodological as well as Python-specific
terms would help the reading flow. Even if the intended audience is more
advanced readers, the text could simply be somewhat more explicit for the
sake of readability.

Using PunktWordTokenizer

In the chapter on preprocessing, NLTK's PunktWordTokenizer is used directly (input 11). This no longer seems to work in NLTK version 3.0.3. In fact, this word tokenizer was not supposed to be used in the first place. Maybe it should be removed from the tutorial?
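
If the tokenizer is dropped, one stand-in that needs no NLTK data files is a plain regular expression that separates runs of word characters from runs of punctuation. The sketch below uses only the standard library and is a swapped-in technique, not the tutorial's method. The function name `simple_word_tokenize` is hypothetical.

```python
import re

# Either a run of word characters or a run of non-space punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_word_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_word_tokenize("Don't stop.")
```

This splits contractions at the apostrophe, so whether it suits a given corpus depends on how such tokens should be counted.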

Installing Python packages on Debian-based systems

In the section on “Using the package manager”, you list the relevant ipython package as python3-ipython; in recent versions, it's actually called ipython3.

IndexError - Topic Modeling with Mallet Tutorial

Hello there,

I am trying to use the code from the tutorial on topic modeling and I am facing a problem I cannot solve on my own:

Traceback (most recent call last):
  File "/.../mallet_python.py", line 47, in <module>
    doctopic[row_num, topic] = share
IndexError: index 14 is out of bounds for axis 1 with size 6

I have copied the code, stored it in a .py file and adjusted the path to the doc-topic-file:

import os
import numpy as np
import itertools
import operator

def grouper(n, iterable, fillvalue=None):
    #Collect data into fixed-length chunks or blocks
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

doctopic_triples = []

mallet_docnames = []

with open("doc-topics.txt") as f:
    f.readline()			# read one line in order to skip the header
    for line in f:
        docnum, docname, *values = line.rstrip().split('\t')
        mallet_docnames.append(docname)
        for topic, share in grouper(2, values):
            triple = (docname, int(topic), float(share))
            doctopic_triples.append(triple)

#sort the triples
#triple is (docname, topicnum, share) so sort(key=operator.itemgetter(0,1))
#sorts on (docname, topicnum) which is what we want
doctopic_triples = sorted(doctopic_triples, key=operator.itemgetter(0,1))

#sort the document names rather than relying on MALLET's ordering
mallet_docnames = sorted(mallet_docnames)

#collect into a document-topic matrix
num_docs = len(mallet_docnames)

num_topics = len(doctopic_triples) // len(mallet_docnames)

#the following works because we know that the triples are in sequential order
doctopic = np.zeros((num_docs, num_topics))

for triple in doctopic_triples:
    docname, topic, share = triple
    row_num = mallet_docnames.index(docname)
    doctopic[row_num, topic] = share			# error

My doc-topic file has the following structure:

doc_number filename topic share ....

Altogether there are 6 columns with topics and 6 with shares, which makes 12 in total. Adding doc_number and filename makes 14, and I guess that is what the error is about. But I don't know what I am doing wrong.

Thanks in advance!
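
One possible explanation, offered as an assumption because the doc-topics file itself is not shown: the file lists only six topic/share pairs per document, while the topic numbers in those pairs run up to at least 14. The script derives `num_topics` from the number of pairs per document, so the matrix gets 6 columns and topic 14 does not fit. The sketch below sizes the matrix from the largest topic number instead. The helper name `infer_num_topics` and the sample triples are hypothetical.

```python
def infer_num_topics(doctopic_triples):
    """Return the smallest column count that fits every topic index."""
    return max(topic for _, topic, _ in doctopic_triples) + 1


# Hypothetical triples (docname, topic, share): two documents with two
# listed topics each, but with topic numbers as high as 14.
triples = [
    ("a.txt", 2, 0.3), ("a.txt", 14, 0.5),
    ("b.txt", 7, 0.6), ("b.txt", 14, 0.2),
]
num_docs = 2

pairs_per_doc = len(triples) // num_docs   # what the script currently computes
num_topics = infer_num_topics(triples)     # enough columns for every topic
```

With this value, `np.zeros((num_docs, num_topics))` has a column for every topic index in the file, and topics that are not listed for a document simply keep a share of zero.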

try getting jquery fancybox working for images

Done. You would then still need to give the img tags of the images that
are to be displayed the CSS class 'fancybox', the type 'image' and, if
needed, a caption. How exactly this works is described here in the wiki:
https://dev2.dariah.eu/wiki/display/DARIAHDE/Gestaltungs-,+Pflegehinweise
(sub-item Fancybox in the Web Content Portlet).
