tatom's Issues

Topic modeling_Split the novels

Hi Allen,

Sorry to bother you again, I have a dumb question:

https://de.dariah.eu/tatom/topic_model_mallet.html

Here you wrote "Because these are lengthy texts, the novels are split up into smaller sections—a preprocessing step which improves results considerably."

My question is: are there any rules about the length (or size) of the smaller sections? One paragraph per section? One chapter of the novel? Or maybe the length of the smaller sections is not important, since we combine the topic modeling results in the end anyway.

I've noticed that almost all your data files are about 6 or 7 kB. I assume this is the right size?

Thanks a lot!
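
For reference, a minimal sketch of fixed-size splitting, assuming whitespace-separated words. The chunk size of 1,000 words is an assumption to tune, not a rule stated in the tutorial, and `split_into_chunks` is a hypothetical helper name:

```python
def split_into_chunks(text, n_words=1000):
    """Split text into consecutive chunks of at most n_words words."""
    words = text.split()
    return [" ".join(words[i:i + n_words]) for i in range(0, len(words), n_words)]

# Made-up input: 2,500 identical words.
sample = "word " * 2500
chunks = split_into_chunks(sample, n_words=1000)
print([len(c.split()) for c in chunks])  # [1000, 1000, 500]
```

The last chunk is simply shorter than the others; whether to keep or merge it is a judgment call.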

Host this material?

Hi Allen,

I refer to this material in my courses, but unfortunately the links are broken, e.g.: https://de.dariah.eu/tatom/visualizing_trends.html
I mailed the DARIAH team about this several times, and their response was that the material is available at https://de.dariah.eu/tatom/
However, this is just a redirect to the github repo, not a rendered HTML version of the tutorial.
So my suggestion: could you host this tutorial on github pages or your own website? I could host the material myself, but I'd like to be able to refer to an official/canonical version.

Cheers

syntax error

I am working through the code to play with MALLET output, and I get a syntax error here:

docnum, docname, *values = line.rstrip().split('\t')

Python does not seem to like the *. How can I resolve this issue?
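
The starred target is extended iterable unpacking, which exists only in Python 3 (PEP 3132); Python 2 rejects it as a syntax error. If switching interpreters is not an option, a minimal sketch of an equivalent using slicing looks like this (the sample line is made up for illustration):

```python
# Hypothetical doc-topics line: docnum, docname, then topic/share pairs.
line = "0\tfile:/novels/Austen_Emma0000.txt\t3\t0.52\t7\t0.31\n"

fields = line.rstrip().split('\t')
docnum, docname = fields[0], fields[1]  # first two columns
values = fields[2:]                     # everything else, like *values
print(values)  # ['3', '0.52', '7', '0.31']
```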

Add discussion of pandas (in lieu of numpy?)

Pandas does make many operations much easier. Need to find sensible ways of integrating mentions of its uses. In principle, I think the tutorials should only require familiarity with the "basic" numpy/scipy stack.

Installation of sklearn for Python3 on Linux

I've been having problems with the installation of scikit-learn in Ubuntu 14.04. This seems to be a known issue; see here: http://askubuntu.com/questions/449326/installation-error-in-sklearn-for-python3.

Quote:
"There's an issue with the pre-compiled Cython C files (compatibility with Python 3.4.0) with the pypi version.
To properly install scikit-learn, use the git repo instead (tested ok on 14.04):
sudo pip3 install git+https://github.com/scikit-learn/scikit-learn.git"

This is what I've tried and despite a lot of warnings, in the end sklearn seemed to have been installed.

Maybe this could be added to the "Installing Python packages" section.

SyntaxError: invalid syntax on Topic Modeling script

Hello there,

Once again thanks for the tutorial. I am following the one on Topic Modeling available here. Nonetheless when I launch the python script (copy pasted and adapted in a .py file), I encounter this error:

simon@simon-thinkpad:~/dariah/data$ python reorder-for-matrix.py 
File "reorder-for-matrix.py", line 18
    docnum, docname, *values = line.rstrip().split('\t')
                     ^
SyntaxError: invalid syntax

with the "^" pointing to the star symbol. I'm no python expert so I don't really know what to change.

Here's what I have launched, I basically copy-pasted your instructions, removed the steps numbering and changed the path to my doc-topics file:

import numpy as np
import itertools
import operator
import os

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

doctopic_triples = []

mallet_docnames = []

with open("doc-topics-hugo.txt") as f:
    f.readline()  # read one line in order to skip the header
    for line in f:
        docnum, docname, *values = line.rstrip().split('\t') # this is where the error happens
        mallet_docnames.append(docname)
        for topic, share in grouper(2, values):
            triple = (docname, int(topic), float(share))
            doctopic_triples.append(triple)

# sort the triples
# triple is (docname, topicnum, share) so sort(key=operator.itemgetter(0,1))
# sorts on (docname, topicnum) which is what we want
doctopic_triples = sorted(doctopic_triples, key=operator.itemgetter(0,1))

# sort the document names rather than relying on MALLET's ordering
mallet_docnames = sorted(mallet_docnames)

# collect into a document-term matrix
num_docs = len(mallet_docnames)

num_topics = len(doctopic_triples) // len(mallet_docnames)

# the following works because we know that the triples are in sequential order
doctopic = np.zeros((num_docs, num_topics))

for triple in doctopic_triples:
    docname, topic, share = triple
    row_num = mallet_docnames.index(docname)
    doctopic[row_num, topic] = share

Thanks in advance.

UnicodeDecodeError

Hi Allen,

https://de.dariah.eu/tatom/preprocessing.html#every-1-000-words

def split_text(filename, n_words):
    """Split a text into chunks approximately n_words words in length."""
    input = open(filename, 'r')
    words = input.read().split(' ')
    input.close()

My comment concerns the line "input = open(filename, 'r')".

I wonder whether "input = open(filename, 'r', encoding='UTF-8')" would be better.

Otherwise you may get an error message: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 10: character maps to <undefined>".
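
Without an `encoding` argument, Python 3's `open` falls back to the platform's preferred encoding, which on Windows is often a "charmap" code page rather than UTF-8. A minimal sketch of reading with an explicit encoding, assuming the corpus files are UTF-8 (`read_words` is a hypothetical helper, not the tutorial's function):

```python
import os
import tempfile

def read_words(filename, encoding="utf-8"):
    """Read a text file with an explicit encoding and split it on whitespace."""
    with open(filename, "r", encoding=encoding) as f:
        return f.read().split()

# Round-trip a small file containing a non-ASCII character.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Les Misérables de Victor Hugo")
    words = read_words(path)

print(words)  # ['Les', 'Misérables', 'de', 'Victor', 'Hugo']
```

The `with` block also closes the file even if reading fails, which the explicit `input.close()` call does not guarantee.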

Clarify target audience / insert clarifications of underlying concepts

The presentation of the libraries and the installation support is aimed
more at beginners; on the other hand, the rest of the tutorial assumes too
much implicit knowledge for beginners, which at times made it hard to read
even for me. Some explanations of methodological as well as Python-specific
terms would help the reading flow. Even if the intended audience is more
advanced readers, the text could simply be somewhat more expressive for the
sake of readability.

IndexError - Topic Modeling with Mallet Tutorial

Hello there,

I am trying to use the code from the tutorial on topic modeling and I am facing a problem I cannot solve on my own:

Traceback (most recent call last):
  File "/.../mallet_python.py", line 47, in <module>
    doctopic[row_num, topic] = share
IndexError: index 14 is out of bounds for axis 1 with size 6

I have copied the code, stored it in a .py file and adjusted the path to the doc-topic-file:

import os
import numpy as np
import itertools
import operator

def grouper(n, iterable, fillvalue=None):
    #Collect data into fixed-length chunks or blocks
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

doctopic_triples = []

mallet_docnames = []

with open("doc-topics.txt") as f:
    f.readline()			# read one line in order to skip the header
    for line in f:
        docnum, docname, *values = line.rstrip().split('\t')
        mallet_docnames.append(docname)
        for topic, share in grouper(2, values):
            triple = (docname, int(topic), float(share))
            doctopic_triples.append(triple)

#sort the triples
#triple is (docname, topicnum, share) so sort(key=operator.itemgetter(0,1))
#sorts on (docname, topicnum) which is what we want
doctopic_triples = sorted(doctopic_triples, key=operator.itemgetter(0,1))

#sort the document names rather than relying on MALLET's ordering
mallet_docnames = sorted(mallet_docnames)

#collect into a document-term matrix
num_docs = len(mallet_docnames)

num_topics = len(doctopic_triples) // len(mallet_docnames)

#the following works because we know that the triples are in sequential order
doctopic = np.zeros((num_docs, num_topics))

for triple in doctopic_triples:
    docname, topic, share = triple
    row_num = mallet_docnames.index(docname)
    doctopic[row_num, topic] = share			# error

My doc-topic file has the following structure:

doc_number filename topic share ....

There are altogether 6 columns with topics and 6 with shares, which makes 12 in total; adding doc_number and filename makes 14. I guess that is what the error is about, but I don't know what I am doing wrong.

Thanks in advance!
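
One possible reading of the traceback: the index 14 is a topic number, not a column count. The script sets `num_topics` to the number of (topic, share) pairs per line, which equals the number of topics only if every line lists every topic. If the model has more topics than the 6 listed per document, a topic id such as 14 falls outside a 6-column matrix. A minimal sketch that sizes the table by the largest topic id instead; the input lines are made up, `build_doctopic` is a hypothetical helper, and plain lists are used so the sketch has no dependencies (the same sizing applies to `np.zeros((num_docs, num_topics))`):

```python
def build_doctopic(lines):
    """Build a docs x topics table of shares from MALLET-style lines.

    Each line is: docnum <TAB> docname <TAB> topic <TAB> share <TAB> ...
    """
    triples = []
    for line in lines:
        fields = line.rstrip().split('\t')
        docname, values = fields[1], fields[2:]
        # Pair up alternating topic and share columns.
        for topic, share in zip(values[0::2], values[1::2]):
            triples.append((docname, int(topic), float(share)))

    docnames = sorted({docname for docname, _, _ in triples})
    # Size by the largest topic id seen, not by the pairs per line.
    num_topics = max(topic for _, topic, _ in triples) + 1
    row_of = {name: i for i, name in enumerate(docnames)}

    doctopic = [[0.0] * num_topics for _ in docnames]
    for docname, topic, share in triples:
        doctopic[row_of[docname]][topic] = share
    return docnames, doctopic

# Hypothetical input: topic ids up to 14, only two topics listed per document.
lines = [
    "0\tdoc_a.txt\t14\t0.6\t3\t0.2\n",
    "1\tdoc_b.txt\t0\t0.5\t14\t0.3\n",
]
docnames, doctopic = build_doctopic(lines)
print(len(doctopic), len(doctopic[0]))  # 2 15
```

If the highest-numbered topic never appears in any document's listed topics, this still undercounts, so passing the number of topics used for training is the safer choice when it is known.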
