
textacy's Introduction

textacy: NLP, before and after spaCy

textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library. With the fundamentals --- tokenization, part-of-speech tagging, dependency parsing, etc. --- delegated to another library, textacy focuses primarily on the tasks that come before and follow after.


features

  • Access and extend spaCy's core functionality for working with one or many documents through convenient methods and custom extensions
  • Load prepared datasets with both text content and metadata, from Congressional speeches to historical literature to Reddit comments
  • Clean, normalize, and explore raw text before processing it with spaCy
  • Extract structured information from processed documents, including n-grams, entities, acronyms, keyterms, and SVO triples
  • Compare strings and sequences using a variety of similarity metrics
  • Tokenize and vectorize documents then train, interpret, and visualize topic models
  • Compute text readability and lexical diversity statistics, including Flesch-Kincaid grade level, multilingual Flesch Reading Ease, and Type-Token Ratio

... and much more!
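For orientation, here is a minimal usage sketch stitched together from the 0.3.x-era API calls that appear in the issues below; treat it as illustrative rather than a guaranteed-current example:

import textacy

# load a prepared dataset of Congressional speeches (0.3.x-era API)
cw = textacy.corpora.CapitolWords()
records = cw.records(speaker_name={'Hillary Clinton', 'Barack Obama'})
texts, metadatas = textacy.fileio.split_record_fields(records, 'text')

# parse the texts with spaCy and wrap them in a textacy Corpus
corpus = textacy.Corpus('en', texts=texts, metadatas=metadatas)

# extract terms and build a tf-idf-weighted document-term matrix
doc_term_matrix, id2term = textacy.vsm.doc_term_matrix(
    (doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True) for doc in corpus),
    weighting='tfidf', normalize=True, smooth_idf=True, min_df=2, max_df=0.95)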

maintainer

Howdy, y'all. 👋

textacy's People

Contributors

abevieiramota, asifm, austinjp, bdewilde, ckot, covuworie, digest0r, dixiekong, gregbowyer, gregory-howard, harryhoch, hironsan, hugoabonizio, jonwiggins, kevinbackhouse, kjoshi-jisc, marius-mather, mdlynch37, minketeer, mirkolenz, oroszgy, pavlin99th, rmax, rolandcolored, rolando, rtbs-dev, sammous, sandyrogers, sllvn, timgates42


textacy's Issues

Unable to reproduce example in README

Attempting to reproduce the code example in the README led to a very different result after the 2nd code block. Instead of obtaining a doc_term_matrix with over 11k terms, mine has only 9 terms.

Steps to Reproduce (for bugs)

import textacy

cw = textacy.corpora.CapitolWords()
docs = cw.records(speaker_name={'Hillary Clinton', 'Barack Obama'})

content_stream, metadata_stream = textacy.fileio.split_record_fields(docs, 'text')
corpus = textacy.Corpus('en', texts=content_stream, metadatas=metadata_stream)

corpus  
# Corpus(1241 docs; 859235 tokens)
doc_term_matrix, id2term = textacy.vsm.doc_term_matrix(
  (doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True) for doc in corpus),
  weighting='tfidf', normalize=True, smooth_idf=True, min_df=2, max_df=0.95
)

print(repr(doc_term_matrix))  
# <1238x9 sparse matrix of type '<class 'numpy.float64'>'
#	with 540 stored elements in Compressed Sparse Row format>

Your Environment

  • textacy and spacy versions: textacy=0.3.2, spacy=1.3.0
  • Python version: 3.5
  • Operating system and version: Windows 7

preprocess_text Error: object of type 'float' has no len()

I'm using textacy to preprocess text from a (UTF-8?) dataset of questions (the Quora Kaggle challenge dataset).
[Data: http://qim.ec.quoracdn.net/quora_duplicate_questions.tsv ]

I'm getting errors when using the preprocessing.
The column in question is a list of strings, e.g.:
df['question2'].values[0:2]

['What is the step by step guide to invest in share market?',
'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?']

Expected Behavior

df["clean_question2"] = df['question2'].apply(lambda x: textacy.preprocess.preprocess_text(x, fix_unicode=True, lowercase=False, transliterate=True, no_urls=True, no_emails=True, no_phone_numbers=True, no_numbers=True, no_currency_symbols=True, no_punct=True, no_contractions=True))

(Note, this worked on another text column).

Current Behavior

df["clean_question2"] = df['question2'].apply(lambda x: textacy.preprocess.preprocess_text(x, fix_unicode=True, lowercase=False, transliterate=True, no_urls=True, no_emails=True, no_phone_numbers=True, no_numbers=True, no_currency_symbols=True, no_punct=True, no_contractions=True))

/Users/danofer/anaconda/lib/python3.6/site-packages/textacy/preprocess.py in preprocess_text(text, fix_unicode, lowercase, transliterate, no_urls, no_emails, no_phone_numbers, no_numbers, no_currency_symbols, no_punct, no_contractions, no_accents)
    197     """
    198     if fix_unicode is True:
--> 199         text = fix_bad_unicode(text, normalization='NFC')
    200     if transliterate is True:
    201         text = transliterate_unicode(text)

/Users/danofer/anaconda/lib/python3.6/site-packages/textacy/preprocess.py in fix_bad_unicode(text, normalization)
     38         str
     39     """
---> 40     return fix_text(text, normalization=normalization)
     41 
     42 

/Users/danofer/anaconda/lib/python3.6/site-packages/ftfy/__init__.py in fix_text(text, fix_entities, remove_terminal_escapes, fix_encoding, fix_latin_ligatures, fix_character_width, uncurl_quotes, fix_line_breaks, fix_surrogates, remove_control_chars, remove_bom, normalization, max_decode_length)
    152     out = []
    153     pos = 0
--> 154     while pos < len(text):
    155         textbreak = text.find('\n', pos) + 1
    156         fix_encoding_this_time = fix_encoding

TypeError: object of type 'float' has no len()
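The traceback suggests a non-string value (most likely a NaN float from a missing cell) reached preprocess_text. A minimal guard, assuming a pandas DataFrame named df as above, could look like this:

import textacy

# skip missing values before preprocessing (assumption: NaN floats cause the error)
mask = df['question2'].notnull()
df.loc[mask, "clean_question2"] = df.loc[mask, 'question2'].astype(str).apply(
    lambda x: textacy.preprocess.preprocess_text(x, fix_unicode=True, lowercase=False))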

Your Environment

  • textacy and spacy versions: latest versions from pip
  • Python version: 3.6 , anaconda
  • Operating system and version: OSX Sierra 10.12.3

inconsistency in handling lemmas/raw-text between corpus.word_doc_freqs and sgrank

There appears to be an inconsistency in how the IDF dict returned by corpus.word_doc_freqs handles lemmatization/stopwords and how SGRank handles it.

Expected Behavior

calls to textacy.keyterms.sgrank(doc, idf=my_idf) should not raise KeyError

Current Behavior

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/204434/.pyenv/versions/3.5.1/lib/python3.5/site-packages/textacy/keyterms.py", line 80, in sgrank
    for term, count in term_counts.items()}
  File "/Users/204434/.pyenv/versions/3.5.1/lib/python3.5/site-packages/textacy/keyterms.py", line 80, in <dictcomp>
    for term, count in term_counts.items()}
KeyError: 'be'

Possible Solution

I don't have quite enough understanding of how the textacy ecosystem should work, or how SGRank should work, to really know what the solution might be, but I think the problem has something to do with lemmatization or stopwords.

I have only seen that KeyError appear when one of the normalized tokens (with normalized_str, I think) in doc is either a very common word ("be", "go") or a hard-to-tokenize string like PERCENT"- or ET.

My suspicion, having just started playing with textacy (which seems awesome and full of promise, so thank you!), is that the word_doc_freqs method eliminates stop words, garbage, acronyms, or something like that.

The problematic tokens vary based on whether I generate my IDF dict with lemmatize=True or lemmatize=False. It seems almost as if sgrank sometimes expects the keys in the IDF dict to be lemmatized and other times not.
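One possible workaround, purely a sketch and assuming the KeyError is just a missing key rather than a deeper normalization mismatch, is to give the IDF dict a neutral default for unseen terms (reusing corpus and doc from the steps below):

from collections import defaultdict

import textacy

# fall back to an IDF weight of 1.0 for terms sgrank looks up but
# word_doc_freqs never produced (hypothetical workaround, not a real fix)
my_idf = corpus.word_doc_freqs(lemmatize=True, weighting='idf',
                               lowercase=True, as_strings=True)
idf_with_default = defaultdict(lambda: 1.0, my_idf)
keyterms = textacy.keyterms.sgrank(doc, idf=idf_with_default)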

Steps to Reproduce (for bugs)

import textacy
import spacy
import csv
nlp = spacy.load('en')
tweets = []
with open('some_tweets.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        tweets.append(row['body'])
corpus = textacy.corpus.Corpus('en', tweets)
doc = corpus[0]
my_idf = corpus.word_doc_freqs(lemmatize=True, weighting='idf', lowercase=True, as_strings=True)
textacy.keyterms.sgrank(doc, idf=my_idf)

At this point, depending on the content of doc, you should see a KeyError on line 80 of keyterms.py, complaining that some token from doc isn't in my_idf.

Context

Just trying to minimally test out SGRank, along with a few other key-term extraction algorithms.

Your Environment

spacy (1.3.0)
textacy (0.3.2)
Python 3.5.1

Cannot load Wikipedia files on the disk

I am trying to read Wikipedia data from my local directory. It is giving an encoding error.

Current Behavior

Sample code:

import textacy
from textacy.corpora.wiki_reader import WikiReader

wr = WikiReader('D:\\corpus\\wiki_dump\\enwiki-latest-pages-articles.xml.bz2')

for text in wr.texts(limit=2):
    print(text)

I am getting the following error:

C:\Anaconda3\lib\xml\etree\ElementTree.py in __next__(self)
   1295                 raise StopIteration
   1296             # load event buffer
-> 1297             data = self._file.read(16 * 1024)
   1298             if data:
   1299                 self._parser.feed(data)

C:\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3216: character maps to <undefined>

Your Environment

textacy and spacy versions: textacy (0.3.2) spacy (1.4.0)
python: Anaconda python 3.5.1
OS : windows 7

Readability stats use wrong word count due to stop list usage

Great work -- stumbled across this while writing my own Python script for readability stats. Looking forward to Topic Modeling :)

Between our work on readability scores, I noticed a discrepancy in word count, with far less words counted in your implementation.

Turns out that for calculating the readability stats in textacy.text_stats, you use the following line:

 words = doc.words(filter_punct=True)

which probably should be:

 words = doc.words(filter_punct=True, filter_stops=False)

By setting the default for filtering stop words to filter_stops=True in textacy.extract.words (which is a rather significant change to any text, so maybe the default should be False?), the number of words considered for the readability scores is reduced significantly, rendering them incorrect.

Accessing the set of TextDocs based on metadata in doc_term_matrix?

If we have different corpora and we create a TextCorpus from them, specifying them by metadata like genre, domain, or author, how can we access their representation in vector space (doc_term_matrix)?

Is the row order of doc_term_matrix the same as the TextCorpus order, so that get_docs with a lambda helps us to catch them?
Do you have any implemented similarity measure to compare these corpora? If not, I would be happy to contribute and open a pull request with some new functionality!
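Not an authoritative answer, but here is a sketch of one way this could work, assuming the row order of doc_term_matrix follows the order in which docs were iterated when it was built (the 'genre' metadata key is hypothetical):

# indices of docs whose metadata matches some condition
match_idxs = [i for i, doc in enumerate(corpus)
              if doc.metadata.get('genre') == 'news']

# slice the corresponding rows out of the sparse document-term matrix
sub_matrix = doc_term_matrix[match_idxs, :]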

Returning empty list using Textacy sgrank automated keyterm retrieval

I am trying to extract a key term list from a document using the sgrank function in textacy, but am getting an empty list returned.

Expected Behavior

Returns list of key terms.

Current Behavior

Empty list returned.

Steps to Reproduce (for bugs)

import textacy

doc = textacy.Doc("If you are looking for a ransom, I can tell you I don't have any money!")

sgrank = textacy.keyterms.sgrank(doc, normalize=u'lemma', window_width=1, n_keyterms=10, idf=None)

print(sgrank)

Your Environment

  • textacy and spacy versions: latest
  • Python version: 3.6.1
  • Operating system and version: MacOS Sierra 10.12.3

Acronyms_and_definitions returns only 1st character of definition when passing in known defs

When passing in known definitions to the acronyms_and_definitions function, the returned dict contains only the 1st character of the corresponding definition.

Expected Behavior

Return the full definition for the acronym

Current Behavior

Returns only the 1st character of the definition

Possible Solution

The problem looks to stem from around line 452 in extract.py. Doing defs[0] instead returns the confidence and an empty definition for acronyms without definitions, and just the definition without confidence for found expansions.

Steps to Reproduce (for bugs)

  1. extract.acronyms_and_definitions(doc1, known_acro_defs= {'OSH':'OUTSIDE HOSPITAL', "IBS":"IRRITABLE BOWEL SYNDROME"})

Context

There are a lot of acronyms in the medical space and this feature will be very helpful.

Your Environment

  • textacy and spacy versions: 0.3.3 & 1.6.0
  • Python version: 3.5 64bit
  • Operating system and version: Windows 7 64bit

fuzzywuzzy is GPL

I noticed that fuzzywuzzy, which textacy depends on, is licensed under GPL. Any chance textacy can migrate off of this? It seems this would be an issue for anyone who wants to use textacy but also wants to avoid using GPL.

SGRank issue

Hi,

The SGRank API seems to have some issues when I tried it. I believe there was a recent bug fix; when will this get updated?

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

  • Version used:
  • Environment name and version (e.g. PHP 5.4 on nginx 1.9.1):
  • Server type and version:
  • Operating System and version:
  • Link to your project:

probably user error maybe readability_stats() error

Hi - There may be an issue with the readability_stats() function; it is returning:
object of type 'generator' has no len() from line ---> 35 n_words = len(words)

I originally thought it might be the text that I am using, since it has many number format variations, e.g. ...binary "10"... 66-bit transmission character... However, I then copy-pasted your example and received the same error.

Below is the code copied from a Jupyter notebook - let me know if you want me to try anything out.

import textacy

textacy numbers bug (maybe)

text = '''Hell, it's about time someone told about my friend EPICAC. After all, he cost the taxpayers $776,434,927.54. They have a right to know about him, picking up a check like that. EPICAC got a big send off in the papers when Dr. Ormand von Kleigstadt designed him for the Government people. Since then, there hasn't been a peep about him -- not a peep. It isn't any military secret about what happened to EPICAC, although the Brass has been acting as though it were. The story is embarrassing, that's all. After all that money, EPICAC didn't work out the way he was supposed to.
And that's another thing: I want to vindicate EPICAC. Maybe he didn't do what the Brass wanted him to, but that doesn't mean he wasn't noble and great and brilliant. He was all of those things. The best friend I ever had, God rest his soul.
You can call him a machine if you want to. He looked like a machine, but he was a whole lot less like a machine than plenty of people I could name. That's why he fizzled as far as the Brass was concerned..'''

text_model = textacy.texts.TextDoc(text.strip(), lang='en')

Number of words

text_model.n_words
203

Error

len(text_model.words)
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-6-247336a43fbc> in <module>()
----> 1 len(text_model.words)


TypeError: object of type 'method' has no len()
text_model.readability_stats()
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-44-e8fbedb4cd27> in <module>()
----> 1 text_model.readability_stats()


C:\Anaconda3\lib\site-packages\textacy-0.2.0-py3.5.egg\textacy\texts.py in readability_stats(self)
    546     @property
    547     def readability_stats(self):
--> 548         return text_stats.readability_stats(self)
    549 
    550 


C:\Anaconda3\lib\site-packages\textacy-0.2.0-py3.5.egg\textacy\text_stats.py in readability_stats(doc)
     33 
     34     words = doc.words(filter_punct=True, filter_stops=False, filter_nums=False)
---> 35     n_words = len(words)
     36     n_unique_words = len({word.lower for word in words})
     37     n_chars = sum(len(word) for word in words)


TypeError: object of type 'generator' has no len()
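Two separate things seem to be going on: len(text_model.words) fails because words is a method, not a list, and the traceback shows len() being called on a generator inside text_stats.readability_stats. A likely fix for the latter (an assumption based on the traceback, not the shipped code, and the function name here is hypothetical) is simply to materialize the generator first:

def readability_word_counts(doc):
    # sketch of the fix: exhaust the generator once, then reuse the list
    words = list(doc.words(filter_punct=True, filter_stops=False, filter_nums=False))
    n_words = len(words)
    n_unique_words = len({word.lower for word in words})
    n_chars = sum(len(word) for word in words)
    return n_words, n_unique_words, n_chars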

btw: textacy is very fun to use (and useful) -- pandas for NLP.

textacy.lexicon_methods.emotional_valence doesn't always work

textacy.lexicon_methods.emotional_valence requires that you specify the download directory for DepecheMood; otherwise it fails and throws misleading error messages.

Expected Behavior

nlp = spacy.load("en")
doc = nlp(document)
print(textacy.lexicon_methods.emotional_valence(doc))

# result - defaultdict(<class 'float'>, {'DONT_CARE': 0.1297936930869565, 'AFRAID': 0.09674656373913043, 'SAD': 0.1282719223913044, 'ANNOYED': 0.12152050326086958, 'ANGRY': 0.1012500840869565, 'HAPPY': 0.12262621060869566, 'INSPIRED': 0.1640221918695652, 'AMUSED': 0.13576883073913046})

Current Behavior

>>> import textacy
>>> import spacy
>>> nlp = spacy.load("en")
>>> doc = nlp("hello there friends, how are you today?")
>>> textacy.lexicon_methods.emotional_valence(doc)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/textacy/data.py", line 140, in load_depechemood
    with io.open(fname, mode='rt') as csvfile:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.5/dist-packages/textacy/resources/DepecheMood_V1.0/DepecheMood_normfreq.txt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/code.py", line 91, in runcode
    exec(code, self.locals)
  File "<console>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/textacy/lexicon_methods.py", line 37, in emotional_valence
    dm = data.load_depechemood(data_dir=dm_data_dir, weighting=dm_weighting)
  File "/usr/local/lib/python3.5/dist-packages/cachetools/__init__.py", line 50, in wrapper
    v = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/textacy/data.py", line 145, in load_depechemood
    _download_depechemood(os.path.split(data_dir)[0])
  File "/usr/local/lib/python3.5/dist-packages/textacy/data.py", line 182, in _download_depechemood
    f.extractall(data_dir, members=members)
  File "/usr/lib/python3.5/zipfile.py", line 1347, in extractall
    self.extract(zipinfo, path, pwd)
  File "/usr/lib/python3.5/zipfile.py", line 1335, in extract
    return self._extract_member(member, path, pwd)
  File "/usr/lib/python3.5/zipfile.py", line 1390, in _extract_member
    os.makedirs(upperdirs)
  File "/usr/lib/python3.5/os.py", line 231, in makedirs
    makedirs(head, mode, exist_ok)
  File "/usr/lib/python3.5/os.py", line 241, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/textacy/resources'

Possible Solution

Make the download directory for DepecheMood a mandatory parameter, and make downloading DepecheMood part of installing textacy, or provide a method that installs it somewhere. This method would be similar to

`python -m spacy.en.download`

Steps to Reproduce (for bugs)

run:

import spacy
import textacy

def emotional_valence(document):
    nlp = spacy.load("en")
    doc = nlp(document)
    return textacy.lexicon_methods.emotional_valence(
        doc, dm_data_dir="~/Documents/DepecheMood_V1.0")

emotional_valence("hello there friends, how are you today?")


Context

This issue is easily fixed but hard to debug.

Your Environment

  • Version used:
  • Environment name and version: Python 3.5.2
  • Server type and version: N/a
  • Operating System and version: Ubuntu 16.04
  • Link to your project:

Installation section of documentation needs dependency on spacy data added

Current Behavior

The method textacy.keyterms.sgrank silently fails by returning an empty list and the method textacy.keyterms.textrank crashes with an index out of range exception on line 70 of network.py when the spacy English data is not installed. I haven't checked any other methods.

Expected Behavior

Exceptions should be thrown for methods relying on part-of-speech (POS) tagging when the spacy English data is not installed and the code certainly should not crash.

Possible Solution

The installation section of the documentation needs to explain that the command python -m spacy.en.download all needs to be run to download the spacy English data in order for any methods using part-of-speech (POS) tagging to work. I just fell foul of the Empty lemma and zero POS issue.

Exceptions should be thrown by these methods, although I am not sure how difficult it is for you to detect the existence of the spacy English data installation. If it's too difficult to detect, then I can tell you that I prefer the crash to the silent failure! It was the crash that enabled me to figure out what was going on just by looking at the code.
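A rough detection sketch (an assumption on my part; spaCy 1.x exposed an is_tagged flag on parsed Doc objects) might look like:

def require_pos_tags(spacy_doc):
    # hypothetical guard used before keyterm extraction; raise instead of silently returning []
    if not spacy_doc.is_tagged:
        raise ValueError(
            "spaCy doc has no POS tags; install the language data, "
            "e.g. `python -m spacy.en.download all`")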

Your Environment

  • textacy 0.3.3 and spacy 1.6.0 versions:
  • Python version: 3.6.0.1
  • Operating system and version: Windows 7

ValueError: token is not POS-tagged

Hi,
I followed all the steps in the example section but the textacy.keyterms.textrank(doc, n_keyterms=10) function returns the following error:

ValueError: token is not POS-tagged

The error is raised in the spacy_utils.preserve_case function. I suppose that you have to run another function first to POS-tag the document, am I right?

more, better example corpora

textacy currently has one small example corpus -- the "Bernie and Hillary" corpus containing 3000 speeches and basic metadata from the Congressional Record -- and readers for two very large corpora: streams of Wikipedia pages and Reddit comments from standardized, publicly available database dumps. We want more options.

potential datasets / options

  • thousands of (mostly old) books are available at Project Gutenberg
  • U.S. Supreme Court decisions (see here for an example of how to get these documents)
  • larger, more composable collection of Congressional speeches from the Sunlight Foundation Capitol Words API that would enable subsetting by speaker to get, say, the equivalent of Bernie and Hillary
  • net neutrality comments on the FCC website, also via Sunlight Foundation here
  • thousands of descriptions of websites in a variety of categories at JC-Bingo (note: I can't find any information on TOS)
  • a streaming reader for the Enron email corpus (downloadable here)
  • a streaming reader for the Ontonotes 5 corpus
  • better filtering options for the Wikipedia and Reddit corpus readers; say, by Wikipedia category or subreddit

There are lots of other options! The only requirement is that the license / terms of service don't prohibit free, public distribution of the data.

implementation in textacy

  • stream one document at a time from disk
  • filter or group by some parameters or metadata
  • metadata in addition to text (preferred)
  • variety of content compared to other available corpora (preferred)
  • what else...?

backports.lzma dependency in textacy

Hi @bdewilde - Once again I am having trouble installing textacy, and once again it is a Windows issue.
Failed building wheel for backports.lzma

It generates a build error; the full compiler output was attached as a screenshot (omitted here).

Any suggestions?

FYI: the issue installing textacy with cld2-cffi was resolved about 2 weeks ago (fix made by @GregBowyer).

SGRank method tests observed and expected values make no sense

Expected Behavior

Most of the expected vs. observed values in the keyterms.sgrank tests should match, even though some may differ due to randomness in the results; keyterms.textrank and keyterms.singlerank at least seem to behave that way.

Current Behavior

Almost none of the expected and observed values for sgrank tests match. Take test_sgrank as an example where the test says:

expected = [
            'new york times', 'york times jerusalem bureau chief', 'friedman',
            'president george h. w.', 'george polk award', 'pulitzer prize',
            'u.s. national book award', 'international reporting', 'beirut',
            'washington post']

However debugging this test I see the following values:

observed = ['year', 'receive', 'win', 'coverage', 'reporting', 'write', 'end', 'be transfer', 'east', 'also win']

Steps to Reproduce (for bugs)

Just debug the unit tests.

Context

I was trying to make a code change to the sgrank method and then submit a pull request. I noticed this during the course of writing some tests and then had to back off as I don't understand what is going on. Once this issue is resolved I can get back to the code change.

Your Environment

  • textacy 0.3.3 and spacy 1.6.0 versions:
  • Python version: 3.6.0.1
  • Operating system and version: Windows 7

TopicModel: infer topics for a new doc?

Textacy is proving to be really useful for some stuff I'm building. I had a quick question, though, about using the TopicModels...

Is there any in-built function, or a fairly straightforward way to infer a topics set for a new document that was not part of the original model fit? That is, given a model built from 20,000 docs, I have a brand new doc and I just want to know what the top topics/weights might be for that new doc. Is there any simple way already in Textacy to do this that I'm just missing in the docs?
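There may well be a built-in way, but as a rough sketch (assuming the fitted TopicModel wraps a scikit-learn-style estimator with a transform method, and that the id2term mapping from vectorizing is still available), one could vectorize the new doc against the existing vocabulary and call transform:

import numpy as np
from scipy.sparse import csr_matrix

# map terms back to the column ids used when the model was fit
term2id = {idx: term for term, idx in ((t, i) for i, t in id2term.items())}
term2id = {term: idx for idx, term in id2term.items()}

new_doc = corpus[0]  # stand-in for a genuinely new textacy Doc
counts = np.zeros(len(term2id))
for term in new_doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True):
    if term in term2id:
        counts[term2id[term]] += 1

# doc-topic distribution for the new doc (assumes a sklearn-style transform)
doc_topic = model.transform(csr_matrix(counts))
print(doc_topic[0].argsort()[::-1][:5])  # top 5 topic ids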

Phrase models vs n-grams in pre-processing

Expected Behavior

It would be nice if some of the features of the gensim.models.Phrases() tool could get implemented into the doc.to_terms_list() method, or even elsewhere (especially in the textacy.preprocess.preprocess_text() method).

Current Behavior

Currently, we can use the ngrams=(1, 2, 3) kwarg in doc.to_terms_list() to get up to 3rd-order n-grams included in the "bag of words" frequency representation later on. I don't currently see a way in textacy of modeling phrases to combine tokens that should really be together to begin with (like ice_cream).

Context

For example, here's a great instance of using a phrase model as opposed to just including all n-grams in a Bag of Words. The phrase model tokenizes the n-grams together before any vectorization.
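As a point of reference, here is a rough sketch of what gensim-style phrase modeling looks like upstream of textacy (assuming gensim is installed; the parameter values are illustrative, not recommendations):

from gensim.models.phrases import Phrases, Phraser

# token streams, e.g. lowercased tokens from already-parsed textacy docs
token_streams = [[tok.lower_ for tok in doc] for doc in corpus]

# learn collocations such as "ice cream" -> "ice_cream"
phrases = Phrases(token_streams, min_count=5, threshold=10.0)
bigram = Phraser(phrases)

phrased_streams = [bigram[tokens] for tokens in token_streams]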

Issue while creating and loading large corpus

I have 1.7*10^8 sentences, each of length 5, and 8 GB of RAM.
Creating the corpus consumes a lot of RAM.
Also, if I take only around 850,000 sentences, the corpus saves successfully, but while loading it back I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/textacy/corpus.py", line 268, in load
    for spacy_doc, metadata in zip(spacy_docs, metadata_stream):
  File "/usr/local/lib/python2.7/dist-packages/textacy/fileio/read.py", line 173, in read_spacy_docs
    yield SpacyDoc(spacy_vocab).from_bytes(bytes_string)
  File "spacy/tokens/doc.pyx", line 615, in spacy.tokens.doc.Doc.from_bytes (spacy/tokens/doc.cpp:12347)
  File "spacy/serialize/packer.pyx", line 132, in spacy.serialize.packer.Packer.unpack_into (spacy/serialize/packer.cpp:6349)
  File "spacy/serialize/huffman.pyx", line 107, in spacy.serialize.huffman.HuffmanCodec.decode (spacy/serialize/huffman.cpp:3548)
Exception: Buffer exhausted at 4/6 symbols read.

Steps to Reproduce (for bugs)

I am working with the text8 corpus. I have to do Named Entity Recognition.

  1. First I broke the corpus into sentences of window_width = 5.
  2. Then I am taking each sentence as a doc and creating a corpus.

Your Environment

  • textacy and spacy versions: 0.3.3 and 1.6.0
  • Python version: 2.7.12
  • Operating system and version: Ubuntu 16.04; 4.4.0-64-generic

more, better, and interactive(?) data viz

textacy currently has two visualizations: draw_semantic_network() for visualizing documents as networks of terms with edges given by, say, term co-occurrence; and draw_termite_plot() for visualizing the relationship between topics and terms in a topic model. Both of these could be improved!

There are also tons of other visualizations that textacy users could benefit from:

  • pyldavis for visualizing various aspects of topic models interactively
  • word clouds to show word (or, generically, term) counts
  • word trees to show word sequences
  • parallel tag clouds to show differences in key terms over time or between groups
  • stream graph for showing trends over time in, say, topic prevalence or word usage
  • dependency parsing viz a la displacy
  • compareclouds for visualizing media frames

I should stop listing these out and just point people to this site, which contains tons of possibilities.

implementation in textacy

  • Python-only, without a bunch of extra dependencies (preferred)
  • easy interoperability with relevant classes / functions
  • what else...?

SGRank method crashes at "low" values of term co-occurence window size

Current Behavior

Using the keyterms.sgrank method with the default window size of 1500 is intractable on a corpus of several thousand documents (and even hundreds) due to performance issues. I wish the authors of the SGRank paper had mentioned this.

Reducing the window_size to smaller values (e.g. 1/10 of the default or lower) drastically speeds up the computation. However, at these "low" values of window_size, a divide-by-zero crash occurs with the following stack trace:

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-56-be9b3fa36a41> in <module>()
    113 
    114 doc = Doc(text)
--> 115 keyphrases = keyterms.sgrank(doc, normalize=u'lemma', window_width=15, n_keyterms=20, idf=None)
    116 print(keyphrases)

C:\WinPython\WinPython-64bit-3.6.0.1Qt5\python-3.6.0.amd64\lib\site-packages\textacy\keyterms.py in sgrank(doc, normalize, window_width, n_keyterms, idf)
    147         sum_edge_weights = sum(t2s.values())
    148         norm_edge_weights.extend((t1, t2, {'weight': weight / sum_edge_weights})
--> 149                                  for t2, weight in t2s.items())
    150 
    151     # build the weighted directed graph from edges, rank nodes by pagerank

C:\WinPython\WinPython-64bit-3.6.0.1Qt5\python-3.6.0.amd64\lib\site-packages\textacy\keyterms.py in <genexpr>(.0)
    147         sum_edge_weights = sum(t2s.values())
    148         norm_edge_weights.extend((t1, t2, {'weight': weight / sum_edge_weights})
--> 149                                  for t2, weight in t2s.items())
    150 
    151     # build the weighted directed graph from edges, rank nodes by pagerank

ZeroDivisionError: float division by zero

Note that I am also able to set a negative window_size and the algorithm silently succeeds due to the code on line 60 of keyterms.py. I'm guessing window_size must be >=2.

Possible Solution

I read the SGRank paper a few days ago and thought I understood most of it, but obviously not enough to know what is actually wrong with the code! :) (sum_logdists or one or more term weights are zero, but I don't know why). For the second issue it should be easy enough to raise a ValueError when window_size < 2.
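A guard along the lines suggested above (a sketch of the proposed change, not the library's actual code) could be as simple as:

def _validate_window_width(window_width):
    # hypothetical validation for the top of keyterms.sgrank()
    if window_width < 2:
        raise ValueError(
            "window_width={} is invalid; it must be >= 2".format(window_width))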

Steps to Reproduce

The following code crashes with a divide by zero error for all values of window_size <= 15:

from textacy import Doc, keyterms

# The Rime of The Ancient Mariner 
text = """
Hear the rime of the ancient mariner
See his eye as he stops one of three
Mesmerizes one of the wedding guests
Stay here and listen to the nightmares of the sea.

And the music plays on, as the bride passes by
Caught by his spell and the mariner tells his tale.

Driven south to the land of the snow and ice
To a place where nobody's been
Through the snow fog flies on the albatross
Hailed in God's name, hoping good luck it brings.

And the ship sails on, back to the North
Through the fog and ice and the albatross follows on.

The mariner kills the bird of good omen
His shipmates cry against what he's done
But when the fog clears, they justify him
And make themselves a part of the crime.

Sailing on and on and north across the sea
Sailing on and on and north 'til all is calm.

The albatross begins with its vengeance
A terrible curse a thirst has begun
His shipmates blame bad luck on the mariner
About his neck, the dead bird is hung.

And the curse goes on and on at sea
And the curse goes on and on for them and me.

"Day after day, day after day,
we stuck nor breath nor motion
as idle as a painted ship upon a painted ocean
Water, water everywhere and
all the boards did shrink
Water, water everywhere nor any drop to drink."

There calls the mariner
There comes a ship over the line
BUt how can she sail with no wind in her sails and no tide.

See...onward she comes
Onward she nears out of the sun
See, she has no crew
She has no life, wait but here's two.

Death and she Life in Death,
They throw their dice for the crew
She wins the mariner and he belongs to her now.
Then, crew one by one
they drop down dead, two hundred men
She, she, Life in Death.
She lets him live, her chosen one.

"One after one by the star dogged moon,
too quick for groan or sigh
each turned his face with a ghastly pang
and cursed me with his eye
four times fifty living men
(and I heard nor sigh nor groan)
with heavy thump, a lifeless lump,
they dropped down one by one."

The curse it lives on in their eyes
The mariner wished he'd die
Along with the sea creatures
But they lived on, so did he.

and by the light of the moon
He prays for their beauty not doom
With heart he blesses them
God's creatures all of them too.

Then the spell starts to break
The albatross falls from his neck
Sinks down like lead into the sea
Then down in falls comes the rain.

Hear the groans of the long dead seamen
See them stir and they start to rise
Bodies lifted by good spirits
None of them speak and they're lifeless in their eyes

And revenge is still sought, penance starts again
Cast into a trance and the nightmare carries on.

Now the curse is finally lifted
And the mariner sights his home
spirits go fromhe long dead bodies
Form their own light and the mariner's left alone.

And then a boat came sailing towards him
It was a joy he could not believe
The pilot's boat, his son and the hermit,
Penance of life will fall onto him.

And the ship sinks like lead into the sea
And the hermit shrives the mariner of his sins.

The mariner's bound to tell of his story
To tell this tale wherever he goes
To teach God's word by his own example
That we must love all things that God made.

And the wedding guest's a sad and wiser man
And the tale goes on and on and on.
"""

doc = Doc(text)
keyphrases = keyterms.sgrank(doc, normalize=u'lemma', window_width=15, n_keyterms=20, idf=None)

Your Environment

  • textacy 0.3.3 and spacy 1.6.0 versions:
  • Python version: 3.6.0
  • Operating system and version: Windows 7

Corpus.load throws exception when using an absolute path?

This could be totally something I'm doing wrong, but I'm stumped. I have a class that creates a topic model using textacy and then saves all the relevant corpus and model files out to disk for use later. When that class is instantiated, it first checks to see if the model files already exist on disk:

# /Users/mepatterson/stufftopic/modeler/stuff_topic_model.py
MOD_PATH = os.path.dirname(__file__)

if os.path.exists(os.path.join(MOD_PATH, 'stuff.corpus_spacy_docs.bin')):
    logging.info("Found stuff.corpus. Loading...")
    corpus = textacy.Corpus.load(MOD_PATH, name='stuff.corpus', compression='gzip')

When I run this file from within the modeler subdir itself, everything works great:
python analyzer.py
(analyzer imports the stuff_topic_model class from the same directory)
ALL OK!

But when I run this from the parent subdir, it blows up:
python modeler/analyzer.py

Traceback (most recent call last):
  File "modeler/analyzer.py", line 19, in <module>
    stm = StuffTopicModel()
  File " /Users/mepatterson/stufftopic/modeler/stuff_topic_model.py", line 35, in __init__
    self.load_corpus()
  File " /Users/mepatterson/stufftopic/modeler/stuff_topic_model.py", line 44, in load_corpus
    corpus = textacy.Corpus.load(MOD_PATH, name='stuff.corpus', compression='gzip')
  File "/Users/mepatterson/anaconda/envs/ironman/lib/python2.7/site-packages/textacy/corpus.py", line 268, in load
    for spacy_doc, metadata in zip(spacy_docs, metadata_stream):
  File "/Users/mepatterson/anaconda/envs/ironman/lib/python2.7/site-packages/textacy/fileio/read.py", line 173, in read_spacy_docs
    yield SpacyDoc(spacy_vocab).from_bytes(bytes_string)
  File "spacy/tokens/doc.pyx", line 615, in spacy.tokens.doc.Doc.from_bytes (spacy/tokens/doc.cpp:12347)
  File "spacy/serialize/packer.pyx", line 129, in spacy.serialize.packer.Packer.unpack_into (spacy/serialize/packer.cpp:6219)
  File "spacy/serialize/packer.pyx", line 184, in spacy.serialize.packer.Packer._char_decode (spacy/serialize/packer.cpp:7598)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 5: invalid start byte

Any ideas? Is it something I'm doing wrong in relation to loading files from different directories? Or is this a bug in spacy itself?

Issues while loading corpus from disk.

I am working with scraped data from Twitter. I am treating each individual tweet as a document, and I am creating a new corpus by adding documents one by one. To avoid creating the corpus every time, I saved it to disk and am trying to load it back, but I am facing issues. In fact, when I tried to load the corpus in the same program where it was saved, it failed with this error:

Error:

Corpus Created*******************
Corpus(27472 docs; 535575 tokens)
Traceback (most recent call last):
  File "C:\Users\workspace\textacy\src\corpus.py", line 74, in <module>
    main()
  File "C:\Users\workspace\textacy\src\corpus.py", line 56, in main
    tweet_corpus = textacy.Corpus.load('c:\\corpus',name='zika',compression='gzip')
  File "C:\Anaconda2\lib\site-packages\textacy-0.3.0-py2.7.egg\textacy\corpus.py", line 263, in load
    for spacy_doc, metadata in zip(spacy_docs, metadata_stream):
  File "C:\Anaconda2\lib\site-packages\textacy-0.3.0-py2.7.egg\textacy\fileio\read.py", line 173, in read_spacy_docs
    yield SpacyDoc(spacy_vocab).from_bytes(bytes_string)
  File "spacy/tokens/doc.pyx", line 423, in spacy.tokens.doc.Doc.from_bytes (spacy/tokens/doc.cpp:10859)
    self.vocab.serializer.unpack_into(data[4:], self)
  File "spacy/serialize/packer.pyx", line 125, in spacy.serialize.packer.Packer.unpack_into (spacy/serialize/packer.cpp:6144)
    self._char_decode(bits, -length, doc)
  File "spacy/serialize/packer.pyx", line 191, in spacy.serialize.packer.Packer._char_decode (spacy/serialize/packer.cpp:7677)
    doc.push_back(lex, is_spacy)
  File "spacy/tokens/doc.pyx", line 286, in spacy.tokens.doc.Doc.push_back (spacy/tokens/doc.cpp:8624)
    assert t.lex.orth != 0
AssertionError

The code that I am running is:

def main():

    dirname = sys.argv[1]
    filename = sys.argv[2]

    print dirname
    print filename
    filepath = os.path.join(dirname,filename)

    mytext = textacy.fileio.read_csv(filepath,encoding=u"utf-8")
    tweet_corpus = textacy.Corpus(u'en',metadatas=None)
    #meta = {}
    #meta[u"name"] = u"tanveer"
    count = 0

    for item in mytext:   ## read.csv will return a generator object. Extracting each row
        try:
            text = u""
            text = unicode(item[51])
            meta = {}
            time = unicode(item[61])
            meta["time"] = time
            text = textacy.preprocess_text(text,lowercase=True)
            text = re.sub(u"@[A-Za-z0-9]+",u"",text)
            text = textacy.preprocess.replace_urls(text, replace_with=u"")
            doc = textacy.Doc(text,lang=u'en',metadata=meta) #creating textacy object. It expects data in unicode.
            tweet_corpus.add_doc(doc,metadata=meta) ## Adding one document/tweet object to corpus
        except IndexError:
            count = count + 1
            meta = {}
            time = unicode(item[61])
            meta["time"] = time
            doc = textacy.Doc(unicode("unknown"),lang=u"en",metadata=meta)
            tweet_corpus.add_doc(doc,metadata=meta) 
            print "******Bad Record Number : ",count



    print "Corpus Created*******************"
    print tweet_corpus
    tweet_corpus.save("c:\\corpus", name='zika', compression='gzip')
    tweet_corpus = textacy.Corpus.load('c:\\corpus',name='zika',compression='gzip')
    print tweet_corpus


if __name__ == "__main__":
    main()

Context
Want to create a corpus once and want to use it later for analysis.

Environment

  • Version used: 0.3
  • Environment name and version : Python 2,7 Anaconda
  • Operating System and version: Windows 7

Question: NER example

Hello, I want to perform NER (and other IE) and display the result. I tried

>>> example=""" Opportunities New Brunswick chair Roxanne Fairweather was on hand for the announcement this morning at the Clean Harbours office on Whitebone Way in the McAllister Industrial Park."""
>>> print example
 Opportunities New Brunswick chair Roxanne Fairweather was on hand for the announcement this morning at the Clean Harbours office on Whitebone Way in the McAllister Industrial Park.

>>> ner=textacy.extract.named_entities(content, good_ne_types=None, bad_ne_types=None, min_freq=1, drop_determiners=True)
>>> print(i for i in ner)
<generator object <genexpr> at 0x080D37B0>

>>> print(list(i for i in ner))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <genexpr>
  File "textacy\extract.py", line 158, in named_entities
    nes = doc.ents
AttributeError: 'str' object has no attribute 'ents'

>>> [i for i in ner]
[]

May I see an actual usage example please? (I barely know how to use Python)
Thanks. Adrian.
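The traceback suggests a plain string was passed where a parsed document is expected. A minimal sketch of the intended usage (assuming the 0.3.x-era API shown above) would be:

import textacy

example = ("Opportunities New Brunswick chair Roxanne Fairweather was on hand for the "
           "announcement this morning at the Clean Harbours office on Whitebone Way "
           "in the McAllister Industrial Park.")

# parse the text first; named_entities() works on a Doc, not on a raw str
doc = textacy.Doc(example, lang='en')
print(list(textacy.extract.named_entities(doc, drop_determiners=True)))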

Deleting TextDocs from a TextCorpus throws a TypeError

Expected Behavior

Suppose we have a TextCorpus such as (from the README):

>>> docs = textacy.corpora.fetch_bernie_and_hillary()
>>> content_stream, metadata_stream = textacy.fileio.split_content_and_metadata(
...     docs, 'text', itemwise=False)
>>> corpus = textacy.TextCorpus.from_texts(
...     'en', content_stream, metadata_stream, n_threads=2)

I would expect

>>> corpus.remove_doc(0)
>>> corpus.remove_docs(lambda doc: doc.metadata['speaker'] == 'Bernard Sanders')

to work and not throw an error.

Current Behavior

>>> corpus.remove_doc(0)

~/.virtualenvs/spacy/lib/python3.4/site-packages/textacy/texts.py in remove_doc(self, index)
    979         for doc in self[index + 1:]:
    980             doc.corpus_index -= 1
--> 981         del self[index]
    982 
    983     def remove_docs(self, match_condition, limit=None):

TypeError: 'TextCorpus' object doesn't support item deletion

>>> corpus.remove_docs(lambda doc: doc.metadata['speaker'] == 'Bernard Sanders')

~/.virtualenvs/spacy/lib/python3.4/site-packages/textacy/texts.py in remove_docs(self, match_condition, limit)
    996                           for doc in self.get_docs(match_condition, limit=limit)]
    997         for index in remove_indexes:
--> 998             del self[index]
    999         # now let's re-set the `corpus_index` attribute for all docs at once
   1000         for i, doc in enumerate(self):

TypeError: 'TextCorpus' object doesn't support item deletion

Possible Solution

In textacy/texts.py, TextCorpus.remove_doc and TextCorpus.remove_docs have the following line of code:

del self[index]

This should be:

del self.docs[index]

Alternatively, you could implement the following dunder method:

def __delitem__(self, index):
    del self.docs[index]

Unexpected behavior of to_bag_of_terms method

I've been playing around with the latest version of this great library Textacy! However, the to_bag_of_terms method is not behaving as I expected. I'm not sure if this is a bug or if I am not using the API as I should be.

Expected Behavior

I see that the documentation indicates that the parameters ('lowercase' and 'lemmatize') have been deprecated and the parameter ('normalize') seems to have replaced these. I expected to receive a deprecation warning message when setting the 'lowercase' or 'lemmatize' parameters. When setting the 'normalize' parameter to the following values I expected:

  • 'lemma' => lemmatization only
  • 'lower' => lowercasing only
  • callable => 'to_bag_of_terms' method to return the output from the callable

Current Behavior

When setting the deprecated parameters ('lowercase' and 'lemmatize') I do not receive any deprecation warning message. However, when setting the 'normalize' parameter to the following values I get:

  • 'lemma' => both lowercasing and lemmatization
  • 'lower' => both lowercasing and lemmatization
  • callable => both lowercasing and lemmatization

Possible Solution

  • A deprecation warning message when 'lowercase' or 'lemmatize' parameters are set.
  • Honoring the behavior of the callable when 'normalize' is set to it, or alternatively providing an explanation in the documentation of what the callable is meant to do.
  • If the code is correct then maybe instead explaining how I can accomplish the task explained in the Context section would suffice.

Steps to Reproduce (for bugs)

I think this should be fairly easy for you to reproduce, but I have code I can share with you if necessary. However, maybe an explanation is all that is required.

Context

I was trying to provide my own implementation to filter specific key phrases (terms) in a doc/corpus.
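For illustration, this is roughly the kind of call that was expected to honor the callable (a sketch against the 0.3.x-era API; the callable shown is hypothetical):

# expected: terms normalized only by the supplied callable, with no implicit
# lowercasing or lemmatization applied on top of it
bot = doc.to_bag_of_terms(
    ngrams=1, named_entities=True, as_strings=True,
    normalize=lambda term: term.text)  # e.g. keep surface forms untouched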

Your Environment

  • textacy 0.3.2 and spacy 1.4.0 versions:
  • Python version: 3.5.2
  • Operating system and version: Windows 7

cachetools 2.0.0 breaks textacy

Great tool! I noticed that cachetools==2.0.0 introduced breaking changes and textacy won't work anymore.

import textacy
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/textacy/__init__.py", line 16, in <module>
    from textacy import data, preprocess, text_utils
  File "/usr/local/lib/python2.7/site-packages/textacy/data.py", line 20, in <module>
    from cachetools import cached, Cache, hashkey
ImportError: cannot import name hashkey

As a quick solution, consider pinning the cachetools version in requirements.txt (e.g. cachetools<2.0.0).

Thanks,

Empty string key term can be returned by textrank

Due to an issue in spaCy, textrank can return the empty string as a key term for a document. This is due to spaCy sometimes lemmatizing the string 's' into the empty string, which is not filtered out by the textrank implementation.

Expected Behavior and Possible Solution

Textacy should filter out any empty strings as key terms.

Steps to Reproduce (for bugs)

>>> from spacy.en import English
>>> import textacy
>>> 
>>> nlp = English()
>>> 
>>> s = """hobby maker show s...
...        figures: sakurai twins s..."""
>>> 
>>> doc = nlp(s)
>>> textacy.keyterms.textrank(doc, n_keyterms=3)
[('maker', 0.19939773459751794), ('sakurai', 0.19939773459751794), ('', 0.19085869422866814)]
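Until that is filtered upstream, a trivial user-side workaround (just a sketch, reusing the doc parsed above) is to drop empty keys from the result:

keyterms = [(term, score) for term, score
            in textacy.keyterms.textrank(doc, n_keyterms=3)
            if term]  # drop the empty-string key term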

Environment

  • textacy and spacy versions: Textacy 0.3.2 with Spacy 1.5
  • Python version: 3.6
  • Operating system and version: Mac OS X Version 10.10.5

lang_or_pipeline argument of textacy.TextCorpus.from_texts() does not accept 'en' or {'en', 'de'} per docstring

I had trouble building a corpus using the textacy.TextCorpus.from_texts() function and could not get it to work using 'en' or {'en', 'de'} as the first argument (specified in documentation shown below).

Docstring:
Convenience function for creating a :class:`TextCorpus <textacy.texts.TextCorpus>`
from an iterable of text strings.

Args:
    lang_or_pipeline ({'en', 'de'} or :class:`spacy.<lang>.<Language>`)
    texts (iterable(str))
    metadata (iterable(dict), optional)
    n_threads (int, optional)
    batch_size (int, optional)

When running either of the commands
corpus = textacy.TextCorpus.from_texts('en', iterdocs)
or
corpus = textacy.TextCorpus.from_texts({'en', 'de'}, iterdocs)

as suggested by the documentation, I get the errors:

AttributeError: 'str' object has no attribute 'lang'
or
AttributeError: 'set' object has no attribute 'lang'

I was able to build my corpus using a spacy language class instead with:
corpus = textacy.TextCorpus.from_texts(spacy.en.English(), iterdocs)

Was the use of 'en' or {'en', 'de'} a misinterpretation of the documentation on my part?

I also tried en (without quotes) just in case.

Cannot load Wikipedia files on the disk

I am trying to stream Wikipedia pages from files on disk, and an error occurred.
Here is my code, basically the same as the sample code.

import textacy
from textacy.corpora.wiki_reader import WikiReader
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

wr = WikiReader('enwiki-20161020-pages-articles1.xml-p000000010p000030302.bz2')

for record in wr.records(min_len=100, limit=100):
    pass

and I got an error (screenshot omitted).

Possible Solution

I guess that in some Wikipedia pages a 'ref' tag can be found inside a section but cannot be removed, because it is not a section itself; one such page always raises the error (screenshots omitted).

Your Environment

  • textacy and spacy versions: textacy (0.3.1) spacy (1.1.2)
  • Python version: 2.7
  • Operating system and version: mac

Keyterms Error: Input terms must be strings or spacy Tokens, not <type 'unicode'>.

I am trying to get key terms from semantic network, following this guide:
https://media.readthedocs.org/pdf/textacy/latest/textacy.pdf#page=28

But I am getting the error "Input terms must be strings or spacy Tokens, not <type 'unicode'>." with the following algorithms:
textacy.keyterms.key_terms_from_semantic_network
(ranking algo, edge weighting):
  • bestcoverage, cooc_freq
  • bestcoverage, binary
  • pagerank, cooc_freq
  • pagerank, binary
  • divrank, cooc_freq
  • divrank, binary

doc.key_terms failed with same error for 'textrank', 'singlerank' algos, but sgrank worked.

In keyterms.py, spacy_utils.normalized_str(word) results in unicode strings.

Error trace:
Traceback (most recent call last):
File "/home/ljha/PycharmProjects/nlp/Analyze.py", line 105, in
print 'Key Terms ' + algo + ' ' + ew, textacy.keyterms.key_terms_from_semantic_network(doc, window_width=3, edge_weighting=ew, ranking_algo=algo, join_key_words=False, n_keyterms=10)
File "/usr/local/lib/python2.7/dist-packages/textacy/keyterms.py", line 248, in key_terms_from_semantic_network
good_word_list, window_width=window_width, edge_weighting=edge_weighting)
File "/usr/local/lib/python2.7/dist-packages/textacy/transform.py", line 64, in terms_to_semantic_network
raise TypeError(msg)
TypeError: Input terms must be strings or spacy Tokens, not <type 'unicode'>.

Problem on to_terms_list method on readme

I tried to follow the example in the readme, but I got an incorrect result.

import textacy

cw = textacy.corpora.CapitolWords()
docs = cw.records(speaker_name={'Hillary Clinton', 'Barack Obama'})
content_stream, metadata_stream = textacy.fileio.split_record_fields(
        docs, "text")
corpus = textacy.Corpus(u'en', texts=content_stream, metadatas=metadata_stream)
doc = corpus[-1]
terms = doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True)
print(list(terms))

The result that I got is a list of empty strings.

I played around with the arguments of to_terms_list, and I got it right if I set 'normalize' to False.

terms = doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, normalize=False)

Perhaps you need to update the readme.

Your Environment

  • textacy version: 0.3.3
  • Python version: 2.7.10
  • Operating system and version: Mac OSX 10.11.6

Reddit corpus reader?

I've thought for a while about how to give people a small reddit corpus reader. I don't want to start a spacy.corpora package, but maintaining a reddit_corpus package is sort of annoying. Maybe this is a good place for it?

I usually just do something like this:

import bz2
import re

import ujson

def iter_comments(loc, limit=-1):
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            yield ujson.loads(line)['body']
            if i == limit:
                break

url_re = re.compile(r'\[([^]]+)\]\(%%URL\)')
link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
space_re = re.compile(r'\s+')
def strip_meta(text):
    text = link_re.sub(r'\1', text)
    text = space_re.sub(' ', text)
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = text.replace('`', '').replace('*', '').replace('~', '')
    return text.strip()
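For context, feeding those helpers into textacy might look roughly like this (the dump filename is a placeholder, not a real path):

import textacy

# 'RC_2015-01.bz2' is a placeholder for a reddit comments dump on disk
texts = (strip_meta(comment) for comment in iter_comments('RC_2015-01.bz2', limit=1000))
corpus = textacy.Corpus('en', texts=texts)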

cld2-cffi dependency

Hi - I am in the process of pip install -U textacy. There are 2 errors (I believe). Both arise from the mix between my environment and the cld2-cffi package.

Problems and What I tried

  • textacy fails to install with the cld2-cffi dependency.

So I tried to do a pip install cld2-cffi before making another attempt.

  • pip does not install the latest package. It installs version 0.1.1. This seems to be a release error on the cld2-cffi side.

So then I tried pip install --upgrade https://github.com//GregBowyer/cld2-cffi/zipball/master, since I saw some documentation that cld2-cffi was updated 20 days ago, tested with Python 3.5, and released as 0.1.2.

  • cld2-cffi again failed to install. Same type of error as the 0.1.1 version.

The issue seems to be C-related, as best I can tell.

Questions

  • How dependent is textacy on cld2-cffi?
  • How hard would it be for me to do a pull and remove the dependency (for myself)?

My Environment

  • Window 7 64
  • '3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]'

Seems like a really nice package (textacy) :)

thanks

Textacy import error on Python Jupyter notebooks

Hi, I am running a Python 2.7 Jupyter notebook on OSX in a Docker container, and after installing all the textacy libraries I still get this error. What can it be? Thanks!

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-7-b8a2c2357440> in <module>()
import spacy
# Set up spaCy
from spacy.en import English
...
---> 28 import textacy

/usr/local/lib/python2.7/site-packages/textacy/__init__.py in <module>()
      8 
      9 # subpackages
---> 10 from textacy import corpora
     11 from textacy import fileio
     12 from textacy import representations

/usr/local/lib/python2.7/site-packages/textacy/corpora/__init__.py in <module>()
----> 1 from . import wikipedia
      2 from .bernie_and_hillary import fetch_bernie_and_hillary

ImportError: cannot import name wikipedia

How to add stopwords?

import spacy
import textacy
nlp = spacy.load("en")
nlp.vocab["the"].is_stop = False
nlp.vocab["definitelynotastopword"].is_stop = True
sentence = nlp("the word is definitelynotastopword")
sentence[0].is_stop
False

doc = textacy.Doc("the word is definitelynotastopword", nlp)
doc[0].is_stop
True

This works for spacy but not for textacy.

fatal error C1083: Cannot open include file: 'stdint.h': No such file or directory

I get this error on both install methods, on Windows 8.1, Python 2.7
pip install -U textacy
and
python setup.py install

In detail, for pip install -U textacy it shows at the end

encoding_lut.cc
cld2\public\compact_lang_det.h(65) : fatal error C1083: Cannot open include file: 'stdint.h': No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "c:\users\acer\appdata\local\temp\pip-build-7v3c2_\cld2-cffi\setup.py", line 189, in <module>
. . . 
      File "c:\users\acer\appdata\local\temp\pip-build-7v3c2_\cld2-cffi\.eggs\cffi-0.9.2-py2.7-win32.egg\cffi\ffiplatform.py", line 54, in _build
        raise VerificationError('%s: %s' % (e.__class__.__name__, e))
    cffi.ffiplatform.VerificationError: CompileError: command 'C:\\Users\\Acer\\AppData\\Local\\Programs\\Common\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\cl.exe' failed with exit status 2

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in c:\users\acer\appdata\local\temp\pip-build-7v3c2_\cld2-cffi\

With the second method, python setup.py install, the listing ends in

encoding_lut.cc
cld2\public\compact_lang_det.h(65) : fatal error C1083: Cannot open include file: 'stdint.h': No such file or directory
error: [Error 5] **Access is denied**: 'c:\\users\\acer\\appdata\\local\\temp\\easy_install-2gzufx\\cld2-cffi-0.1.1\\.eggs\\cffi-0.9.2-py2.7-win32.egg\\_cffi_backend.pyd'

Any suggestion would be much appreciated. Thank you, Adrian.

New version to PyPI

Thank you for this nice package. I've seen that you changed some code in keyterms.sgrank and friends. What about baking and uploading a new version to PyPI? Thank you!

Doc.save() writes to "metadatas.json" and "spacy_docs.bin"

Expected Behavior

I was expecting it to write to "metadata.json" and "spacy_doc.bin".

Current Behavior

Instead it seems to be writing to files that have an extra "s" on the end: "metadatas.json" and "spacy_docs.bin".

Possible Solution

I can't figure out why; looking at the code certainly suggests that the string should be "metadata.json".

Steps to Reproduce (for bugs)

from gutenberg.query import get_metadata  # assumes the `gutenberg` package
import textacy

def getGutenbergMetadata(textno):
    meta = {
        'title': list(get_metadata('title', textno))[0],
        'author': list(get_metadata('author', textno)),
        'rights': list(get_metadata('rights', textno)),
        'subject': list(get_metadata('subject', textno)),
        'language': list(get_metadata('language', textno))[0],
        'guten_no': textno}
    return meta

def getGutenberg(filenumber):
    return open("./data/corpora/gutenberg/strip/{}.txt".format(filenumber), mode="r", encoding="utf_8").read()

actdocs, actmeta, count = [], [], 0
for i in acttext:  # acttext: an iterable of Project Gutenberg text numbers
        count += 1
        #if count % 10 is 0:
        print(".", end="")
        am = getGutenbergMetadata(i)
        ad = textacy.Doc(getGutenberg(i), None, "en")
        actdocs.append(ad)        
        actmeta.append(am)
        print("m", end="")
current_corpus = textacy.corpus.Corpus('en', docs=actdocs, metadatas=actmeta)
current_corpus.save("./data")
current_corpus = textacy.Doc.load("./data")

(Renaming the files lets it find it again, of course, but then I get this (possibly separate) error:)

Traceback (most recent call last):

  File "<ipython-input-5-5605bd27e87b>", line 1, in <module>
    loadCorpus()

  File "./excalibur/action_catalog.py", line 125, in loadCorpus
    current_corpus = textacy.Doc.load("./data")

  File "C:\tools\Anaconda3\envs\genmoenv\lib\site-packages\textacy\doc.py", line 219, in load
    metadata = list(fileio.read_json(meta_fname))[0]

  File "C:\tools\Anaconda3\envs\genmoenv\lib\site-packages\textacy\fileio\read.py", line 69, in read_json
    yield json.load(f)

  File "C:\tools\Anaconda3\envs\genmoenv\lib\json\__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

  File "C:\tools\Anaconda3\envs\genmoenv\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)

  File "C:\tools\Anaconda3\envs\genmoenv\lib\json\decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)

JSONDecodeError: Extra data
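One thing that stands out in the reproduction above (an observation, not a confirmed diagnosis): the data is saved with Corpus.save but re-loaded with Doc.load, and Corpus writes the pluralized file names. The symmetric call would be:

# load with the same class that saved the data (sketch)
current_corpus.save("./data")
current_corpus = textacy.corpus.Corpus.load("./data")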

Context

Your Environment

  • textacy and spacy versions: textacy: 0.3.2, spacy: 1.2.0
  • Python version: 3.5
  • Operating system and version: Windows 7 x64

CLD2 outreach

Sorry for taking so long to get back to everyone about this, I have been working recently on windows and OSX builds.

Long and short: Windows lacks the C header stdint.h, which breaks a lot of compile steps. This should be fixed now if you want to give it another go.

Right now 1.1.3 should be a good build; if it is not, give me a shout and I will see what I can do.

For 1.2 I will be altering the CFFI binding creation and attempting to build wheels for most platforms, which should solve these problems.

I will update this bug when I have a solid 1.2 wheel build ready for you.
