
TextAnalysis.jl's Issues

Error 'using' TextAnalysis package

After adding the TextAnalysis package in Julia with

julia> Pkg.add("TextAnalysis")

I am getting the following error when using the package:

using TextAnalysis
ERROR: Stats not found
 in require at loading.jl:39
 in include at boot.jl:238
 in include_from_node1 at loading.jl:114
 in reload_path at loading.jl:140
 in _require at loading.jl:58
 in require at loading.jl:46
 in include at boot.jl:238
 in include_from_node1 at loading.jl:114
 in reload_path at loading.jl:140
 in _require at loading.jl:58
 in require at loading.jl:43
at /Users/user/.julia/DataFrames/src/DataFrames.jl:5
at /Users/user/.julia/TextAnalysis/src/TextAnalysis.jl:1

(On Mac OS X 10.9.1)

grobid

Do you plan to add a GROBID interface?

Are you aware of a Julia package that works with GROBID?

stemming issue for certain words e.g. providing -> provid

Some words are not stemmed properly. This is probably a libstemmer issue, but that repo doesn't seem to be active, so I'm posting here :-)

julia> sm = TextAnalysis.stemmer_for_document(StringDocument("hello"))
Stemmer algorithm:english encoding:UTF_8

julia> stem(sm, "coming")
"come"

julia> stem(sm, "coding")
"code"

julia> stem(sm, "providing")
"provid"

julia> stem(sm, "improvising")
"improvis"

julia> stem(sm, "pursuing")
"pursu"

remove_corrupt_utf8() not working

The function remove_corrupt_utf8() does not work under Julia v0.4.6.
The problem is the line zeros(Char, endof(s)+1), where it complains that
zero is not defined for type Char. Using UInt8 instead, I could make it
run without error, but please check whether this does what it is supposed to do.

function remove_corrupt_utf8(s::AbstractString)
    r = zeros(UInt8, endof(s)+1)                                                                                          
    i = 0
    for chr in s
        i += 1
        r[i] = (chr != 0xfffd) ? chr : ' '
    end
    return utf8(r)
end

Note that in the return statement I got rid of the CharString() call too.

If this is ok I can make another pull request.

Cheers,
Andre

Unable to train new pattern for models

Hi,

I am new to the GROBID tool. I need to train some new patterns for the citation model. The build is successful, but while trying to train it throws an exception:

java.lang.UnsatisfiedLinkError. I couldn't resolve this issue. I am using win-64.

Stemmer implementation

The stem! method is currently a stub. It would be very useful to have a stemmer integrated.

What is the direction to be taken to add a stemmer? Would wrapping the snowball stemmer from http://snowball.tartarus.org/ be a good approach? What are the other good alternatives?

Documentation to reflect deprecations

The documentation does not reflect the deprecation of the remove_xxx!() functions. These functions also use a nonstandard deprecation mechanism, which makes it hard for users to understand how to transition to the new API. Consider using @deprecate.
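The standard mechanism suggested here would look roughly like the sketch below; the replacement call shown (prepare! with a strip_punctuation flag) is an assumption about the new API, not a confirmed mapping:

```julia
# Hypothetical sketch: forward the old name to the assumed new API and
# emit Julia's standard deprecation warning on first use.
@deprecate remove_punctuation!(d) prepare!(d, strip_punctuation)
```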

Hi,

When I tried adding this package to Julia, I got a connection error:

2015-09-18 11:54:25-- http://snowball.tartarus.org/dist/snowball_code.tgz Resolving snowball.tartarus.org (snowball.tartarus.org)... 80.252.125.10 Connecting to snowball.tartarus.org (snowball.tartarus.org)|80.252.125.10|:80... connected. HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.

So I downloaded the .tar.gz file, extracted its contents, and placed them in .julia/v0.3/TextAnalysis/deps/download. Then I tried to build using Pkg.build(), but I am still getting the same error. Please help.

Error on 0.5 Julia

After the update I have the follow error:

ERROR: LoadError: MethodError: no method matching start(::TextAnalysis.StringDocument)
Closest candidates are:
  start(!Matched::SimpleVector) at essentials.jl:170
  start(!Matched::Base.MethodList) at reflection.jl:258
  start(!Matched::IntSet) at intset.jl:184
  ...
 in _append!(::Array{String,1}, ::Base.HasLength, ::TextAnalysis.StringDocument) at ./collections.jl:26
 in readData(::String) at /home/pmargreff/github/am/naive_bayes/naive_bayes.jl:35
 in include_from_node1(::String) at ./loading.jl:488
 in process_options(::Base.JLOptions) at ./client.jl:262
 in _start() at ./client.jl:318

Using prepare

I can't use prepare!, and other functions are deprecated; could someone update the documentation?

error on LDA Julia 0.4

Hi, I was trying to use the lsa function, but it gives an error because svd doesn't seem to support sparse matrices:

corpus = Corpus(docs)
update_lexicon!(corpus)
dtmat = DocumentTermMatrix(corpus)
lda(dtmat)

ERROR: svdfact has no method matching svdfact(::SparseMatrixCSC{Float64,Int64})

Extra empty token in lexicon

If I create a text file which consists of the string "1 2" (not including the quotes), my lexicon always contains the "empty token" (""). Even if I try to remove the word "", it is still there. Is this the intended behaviour? It causes problems for me because in my LDA implementation I get the word "" in my topics.

Example:

$ cat docs/Sample1.txt
1 2
$ cat text_analysis_etoken.jl
using TextAnalysis
crps = DirectoryCorpus("docs")
standardize!(crps, StringDocument)
remove_words!(crps, [""])
update_lexicon!(crps)
println("Dictionary is: " * string(lexicon(crps)))
println("Corpus contains: " * string(length(crps)) * " documents")

$ julia text_analysis_etoken.jl

Corpus's lexicon contains 3 tokens
Corpus's index contains 3 tokens
Dictionary is: ["1"=>1,"2"=>1,""=>2]
Corpus contains: 1 documents

Best Regards
-Leif

broadcast behavior of Corpus

I would argue that the following is unintuitive; I would like broadcasted functions over a corpus to return an array with the result of the function applied to each document in the corpus. If this is deemed a good idea, I would be willing to submit a PR.

julia> length(crps)
12

julia> length(crps[1])
160427

[PackageEvaluator.jl] Your package TextAnalysis may have a testing issue.

This issue is being filed by a script, but if you reply, I will see it.

PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their tests (if available) on both the stable version of Julia (0.2) and the nightly build of the unstable version (0.3).

The results of this script are used to generate a package listing enhanced with testing results.

The status of this package, TextAnalysis, on...

  • Julia 0.2 is 'Tests fail, but package loads.' PackageEvaluator.jl
  • Julia 0.3 is 'Tests fail, but package loads.' PackageEvaluator.jl

'No tests, but package loads.' can be due to there being no tests (you should write some if you can!) but can also be due to PackageEvaluator not being able to find your tests. Consider adding a test/runtests.jl file.

'Package doesn't load.' is the worst-case scenario. Sometimes this arises because your package doesn't have BinDeps support, or needs something that can't be installed with BinDeps. If this is the case for your package, please file an issue and an exception can be made so your package will not be tested.

This automatically filed issue is a one-off message. Starting soon, issues will only be filed when the testing status of your package changes in a negative direction (gets worse). If you'd like to opt-out of these status-change messages, reply to this message.

[PkgEval] TextAnalysis may have a testing issue on Julia 0.3 (2014-07-14)

PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their tests (if available) on both the stable version of Julia (0.2) and the nightly build of the unstable version (0.3). The results of this script are used to generate a package listing enhanced with testing results.

On Julia 0.3

  • On 2014-07-12 the testing status was Tests fail, but package loads.
  • On 2014-07-14 the testing status changed to Package doesn't load.

Tests fail, but package loads. means that PackageEvaluator found the tests for your package, executed them, and they didn't pass. However, trying to load your package with using worked.

Package doesn't load. means that PackageEvaluator did not find tests for your package. Additionally, trying to load your package with using failed.

This issue was filed because your testing status became worse. No additional issues will be filed if your package remains in this state, and no issue will be filed if it improves. If you'd like to opt-out of these status-change messages, reply to this message saying you'd like to and @IainNZ will add an exception. If you'd like to discuss PackageEvaluator.jl please file an issue at the repository. For example, your package may be untestable on the test machine due to a dependency - an exception can be added.

Test log:

INFO: Installing ArrayViews v0.4.6
INFO: Installing DataArrays v0.1.12
INFO: Installing DataFrames v0.5.6
INFO: Installing GZip v0.2.13
INFO: Installing Languages v0.0.1
INFO: Installing Reexport v0.0.1
INFO: Installing SortingAlgorithms v0.0.1
INFO: Installing StatsBase v0.5.3
INFO: Installing TextAnalysis v0.0.1
INFO: Package database updated
Warning: could not import Sort.sortby into DataFrames
Warning: could not import Sort.sortby! into DataFrames
ERROR: repl_show not defined
 in include at ./boot.jl:245
 in include_from_node1 at ./loading.jl:128
 in include at ./boot.jl:245
 in include_from_node1 at ./loading.jl:128
 in reload_path at loading.jl:152
 in _require at loading.jl:67
 in require at loading.jl:54
 in include at ./boot.jl:245
 in include_from_node1 at ./loading.jl:128
 in reload_path at loading.jl:152
 in _require at loading.jl:67
 in require at loading.jl:51
 in include at ./boot.jl:245
 in include_from_node1 at loading.jl:128
 in process_options at ./client.jl:285
 in _start at ./client.jl:354
while loading /home/idunning/pkgtest/.julia/v0.3/DataFrames/src/dataframe/reshape.jl, in expression starting on line 163
while loading /home/idunning/pkgtest/.julia/v0.3/DataFrames/src/DataFrames.jl, in expression starting on line 110
while loading /home/idunning/pkgtest/.julia/v0.3/TextAnalysis/src/TextAnalysis.jl, in expression starting on line 1
while loading /home/idunning/pkgtest/.julia/v0.3/TextAnalysis/testusing.jl, in expression starting on line 1
INFO: Package database updated

Note this is possibly due to removal of deprecated functions in Julia 0.3-rc1: JuliaLang/julia#7609

Incorporate AhoCorasick algorithm?

I've written a Julia package containing the Aho-Corasick algorithm for fast multi-string searching:

https://github.com/gilesc/AhoCorasick.jl

I haven't yet submitted it as an independent package to Julia's METADATA, though. I was thinking that it might be better suited as a submodule in this TextAnalysis package than as its own package. If you're amenable, I'd be happy to write up a PR.

Preprocessing function issue

Hi all ! Thanks for this nice package :)

The preprocessing function prepare!(::TextAnalysis.Corpus, ::UInt32) does not work (on my machine) because of the stem! function.
I get the following error:

ERROR: error compiling #prepare!#10: error compiling stem!: error compiling release: could not load library "/Users/EmileMathieu/.julia/v0.6/TextAnalysis/src/../deps/usr/lib/libstemmer.dylib"
dlopen(/Users/EmileMathieu/.julia/v0.6/TextAnalysis/src/../deps/usr/lib/libstemmer.dylib, 1): image not found
Stacktrace:
 [1] prepare!(::TextAnalysis.Corpus, ::UInt32) at /Users/EmileMathieu/.julia/v0.6/TextAnalysis/src/preprocessing.jl:268

Best,
Emile

`getindex` has no method matching getindex(::Corpus, ::Int64)

Hi,

I am getting this error when attempting to create a DocumentTermMatrix from a corpus. It can be reproduced with the snippet below.

docs = {}
push!(docs, StringDocument("abc def xyz"))
push!(docs, StringDocument("xyz 123"))
crps = Corpus(docs)
DocumentTermMatrix(crps)

I'm using the latest v0.3; is there anything I'm missing here?
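A possible workaround (untested on 0.3; the point is the concrete element type) is to avoid the untyped {} literal when building the document array:

```julia
using TextAnalysis

# Sketch: a concretely typed Vector avoids the untyped `{}` (Vector{Any}) literal.
docs = StringDocument[]
push!(docs, StringDocument("abc def xyz"))
push!(docs, StringDocument("xyz 123"))
crps = Corpus(docs)
DocumentTermMatrix(crps)
```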

Ngram dictionary constructor in Readme does not work

The n-gram document constructor that takes a dictionary of n-grams does not work correctly in Julia 0.5. This appears to be due to the change in Dict construction syntax rather than anything wrong with the classes themselves.
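Since the README example itself is not quoted in the issue, here is only a guess at the updated call under the newer Dict syntax; the terms and counts are placeholders:

```julia
using TextAnalysis

# Hypothetical n-gram counts; the point is the Dict{String,Int}(...) syntax,
# which replaces the old [... => ...] dictionary literal.
ngd = NGramDocument(Dict{String,Int}("to" => 1, "be" => 2, "or" => 1, "not" => 1))
```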

LDA does not return P(z|d) - distribution over topics given document

First, thanks for producing a Julia implementation of LDA.

The implementation currently returns P(w|z), the distribution over words w for each topic z. However, it currently does not return the other distribution, P(z|d), the distribution over topics z for each document d.

It would be great if this could be included.

Thanks.

Need to update deps/build.jl with the correct hash code for sentiment.tar.gz

From travis-ci:

Info: Calculated hash 5dcb031eccf01bb0b2d074281140679683f73603a54caa79941a1df1c8a6d70d for file /home/travis/.julia/v0.6/TextAnalysis/deps/usr/downloads/sentiment.tar.gz
============================[ ERROR: TextAnalysis ]=============================
LoadError: Hash Mismatch!
  Expected sha256:   f237378f3f866c7e697ed893b4208878a6c5dd111eddcebc84ac55dab3885004
  Calculated sha256: 5dcb031eccf01bb0b2d074281140679683f73603a54caa79941a1df1c8a6d70d
while loading /home/travis/.julia/v0.6/TextAnalysis/deps/build.jl, in expression starting on line 44
================================================================================

Missing package versions for DataFrames during installation

From a fresh installation of julia on Ubuntu 14.04.5 LTS and TextAnalysis package, I'm receiving the following error with Pkg.add("TextAnalysis")

$ sudo apt-get install julia
$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.1 (2014-02-11 06:30 UTC)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |  x86_64-linux-gnu

julia> Pkg.add("TextAnalysis")
INFO: Initializing package repository /home/ltan/.julia/v0.2
INFO: Cloning METADATA from git://github.com/JuliaLang/METADATA.jl
INFO: Cloning cache of BinDeps from git://github.com/JuliaLang/BinDeps.jl.git
INFO: Cloning cache of Blocks from git://github.com/JuliaParallel/Blocks.jl.git
INFO: Cloning cache of Compat from git://github.com/JuliaLang/Compat.jl.git
INFO: Cloning cache of DataArrays from git://github.com/JuliaStats/DataArrays.jl.git
INFO: Cloning cache of DataFrames from git://github.com/JuliaStats/DataFrames.jl.git
INFO: Updating cache of DataFrames...
INFO: Cloning cache of GZip from git://github.com/JuliaIO/GZip.jl.git
INFO: Cloning cache of Languages from git://github.com/johnmyleswhite/Languages.jl.git
INFO: Cloning cache of SHA from git://github.com/staticfloat/SHA.jl.git
INFO: Cloning cache of SortingAlgorithms from git://github.com/JuliaLang/SortingAlgorithms.jl.git
INFO: Cloning cache of StatsBase from git://github.com/JuliaStats/StatsBase.jl.git
INFO: Cloning cache of TextAnalysis from git://github.com/johnmyleswhite/TextAnalysis.jl.git
INFO: Cloning cache of URIParser from git://github.com/JuliaWeb/URIParser.jl.git
ERROR: Missing package versions (possible metadata misconfiguration):  DataFrames v(nothing,v"0.4.3") [84523937447b336dfcbf1747b123e9d1da1308c6[1:10]]

 in error at error.jl:21
 in resolve at pkg/entry.jl:350
 in resolve at pkg/entry.jl:316
 in edit at pkg/entry.jl:24
 in add at pkg/entry.jl:44
 in add at pkg/entry.jl:48
 in anonymous at pkg/dir.jl:28
 in cd at file.jl:22
 in cd at pkg/dir.jl:28
 in add at pkg.jl:19

error creating document matrix

this code (on julia v0.3 on Ubuntu 14.04):

using TextAnalysis

sd=StringDocument("dit is een test of dit, of dat kan werken met Nederlandse documenten.")
sd2=StringDocument("mooie badkamer met bad, douche en WC")

cp=Corpus({sd,sd2})

remove_punctuation!(cp)
remove_case!(cp)
remove_numbers!(cp)
remove_words!(cp,["de","het","een","en","of","geen","niet"])

update_lexicon!(cp)
dm=DocumentTermMatrix(cp)

gives the following error:

ERROR: getindex has no method matching getindex(::Corpus, ::Int64)
in DocumentTermMatrix at /home/steven/.julia/v0.3/TextAnalysis/src/dtm.jl:31
in include at ./boot.jl:245
in include_from_node1 at loading.jl:128
in process_options at ./client.jl:285
in _start at ./client.jl:354
in _start_3B_1716 at /usr/bin/../lib/x86_64-linux-gnu/julia/sys.so
while loading /media/sf_Documents/julia_workspace/VMRecommender/src/testTextAnalysis.jl, in expression starting on line 17

remove_corrupt_utf8 seems to want code from an old version of Julia

The first time I executed it I got the following error:

julia> TextAnalysis.prepare!(corpus, prepare_flags)
ERROR: MethodError: no method matching zero(::Type{Char})
Closest candidates are:
  zero(::Type{Base.LibGit2.GitHash}) at libgit2/oid.jl:106
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuildItem}) at pkg/resolve/versionweight.jl:82
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuild}) at pkg/resolve/versionweight.jl:124
  ...
Stacktrace:
 [1] remove_corrupt_utf8(::String) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:46
 [2] remove_corrupt_utf8!(::TextAnalysis.StringDocument) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:58
 [3] remove_corrupt_utf8!(::TextAnalysis.Corpus) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:83
 [4] #prepare!#10(::Set{AbstractString}, ::Set{AbstractString}, ::Function, ::TextAnalysis.Corpus, ::UInt32) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:271
 [5] prepare!(::TextAnalysis.Corpus, ::UInt32) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:268

I'm using Julia 0.6 and I don't have any definition of zero(::Type{Char}). I'm not sure whether Char used to be a Number and got zero from there, or whether I'm missing an import.

In any case, if I define it

julia> import Base.zero

julia> zero(Char) = ' '
zero (generic function with 17 methods)

I then get

julia> TextAnalysis.prepare!(corpus, prepare_flags)
ERROR: UndefVarError: CharString not defined
Stacktrace:
 [1] remove_corrupt_utf8(::String) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:52
 [2] remove_corrupt_utf8!(::TextAnalysis.StringDocument) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:58
 [3] remove_corrupt_utf8!(::TextAnalysis.Corpus) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:83
 [4] #prepare!#10(::Set{AbstractString}, ::Set{AbstractString}, ::Function, ::TextAnalysis.Corpus, ::UInt32) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:271
 [5] prepare!(::TextAnalysis.Corpus, ::UInt32) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:268

and this suggests that it has been deprecated since Julia 0.3.

Am I doing something wrong or is remove_corrupt_utf8 broken on recent versions of Julia?
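For reference, a minimal version-independent sketch (assuming Julia 1.x semantics, where iterating an invalid String yields Chars with isvalid(c) == false, rather than throwing):

```julia
# Sketch only: rebuild the string, mapping every invalid character to a space.
remove_corrupt_utf8_sketch(s::AbstractString) =
    String(map(c -> isvalid(c) ? c : ' ', collect(s)))
```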

remove_words! fails for long terms & terms with punctuation

Because remove_words! uses regex matching even for string input, it fails on actually-present terms if those terms are longer than the maximum pattern size accepted by PCRE. Actually-present terms also fail if they contain regex-like punctuation. This produces an error message that doesn't specify the failing pattern, and furthermore aborts remove_words! entirely.

The same problem occurs in remove_sparse_terms! and remove_frequent_terms!, since these also boil down to a call to remove_pattern.

Would it be possible to force only string-literal substitution in the case where an array of type String is passed (and only use regex if the items passed are actually typed as regular expressions)?
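As a stopgap, literal (non-regex) substitution can be done with plain String patterns, which replace treats verbatim and which therefore sidestep both the PCRE pattern-size limit and metacharacter escaping. This ad-hoc sketch ignores token boundaries, unlike remove_words!:

```julia
# Sketch: replace each term literally with a space; String patterns are
# matched verbatim, so regex punctuation in the terms is harmless.
remove_terms_literal(s::AbstractString, terms) =
    foldl((acc, t) -> replace(acc, t => " "), terms; init = s)
```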

Error with stem!(sd)

I am trying functions like remove_punctuation!(sd), remove_numbers!(sd), remove_pronouns!(sd), stem!(sd), etc., on version 0.3, which generates the following error:

julia> stem!(sd)
ERROR: error compiling Stemmer: error compiling Stemmer: could not load module /home/abhijith/.julia/TextAnalysis/deps/usr/lib/libstemmer.so: /home/abhijith/.julia/TextAnalysis/deps/usr/lib/libstemmer.so: cannot open shared object file: No such file or directory
 in stemmer_for_document at /home/abhijith/.julia/TextAnalysis.jl/src/stemmer.jl:76
 in stem! at /home/abhijith/.julia/TextAnalysis.jl/src/stemmer.jl:80

unknown words lead to error in sentiment analysis

using TextAnalysis
model = TextAnalysis.SentimentAnalyzer()
model(TextAnalysis.Document("some sense and some nonSense"))

leads to
ERROR: KeyError: key "nonSense" not found
I am not familiar with sentiment analysis, but would it be possible to simply ignore words that do not have a value assigned?

Better sentiment analysis model

The current sentiment analysis model isn't very good and needs to be changed (as discussed with @aviks). Also, following the discussion in #83, it'd be better to warn the user before skipping words not in the vocabulary.

Error while using the package for the first time

Hi,
when using the package I receive these error messages and dozens of warnings:
using TextAnalysis
....................................
ERROR: LoadError: LoadError: UndefVarError: SimpleVector not defined.
ERROR: LoadError: LoadError: Failed to precompile BSON [fbb218c0-5317-5bc6-957e-2ee96dd4b1f0] to C:\Users\emirzayev\.julia\compiled\v0.7\BSON\3tVCZ.ji.
ERROR: Failed to precompile TextAnalysis [a2db99b7-8b79-58f8-94bf-bbc811eef33d] to C:\Users\emirzayev\.julia\compiled\v0.7\TextAnalysis\5Mwet.ji.

Julia version:
Version 0.7.0 (2018-08-08 06:46 UTC)
Install method:
Downloading exe file

Replacement function for list of stuff.

We need some kind of facility where you input a dictionary or list and have words get replaced. This is important for standardizing synonyms, dates, and contractions. I think code like this would make it easier to tokenize text, especially since you eliminate apostrophes.

In Python I had code that replaced contractions using regexes:

replacement_patterns = [
    (r"(?i)won't", "will not"),
    (r"((?i)can't|(?i)can not)", "cannot"),
    (r"(?i)i'm", "i am"),
    (r"(?i)ain't", "is not"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+t)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
    (r"(\w+)'d", r"\g<1> would"),
    (r"'cause", "because"),
]
I am not sure whether the same is possible in Julia.
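It is possible; a rough Julia equivalent uses Regex => SubstitutionString pairs with replace. This is a sketch abridged from the Python list above, not a TextAnalysis feature:

```julia
# Each pair maps a contraction pattern to its expansion; s"\1" carries the
# captured word through, mirroring Python's \g<1>.
replacement_patterns = [
    r"(?i)won't"    => "will not",
    r"(?i)can't"    => "cannot",
    r"(?i)(\w+)'ll" => s"\1 will",
    r"(?i)(\w+)n't" => s"\1 not",
    r"(?i)(\w+)'ve" => s"\1 have",
    r"(?i)(\w+)'re" => s"\1 are",
]

# Apply the patterns in order, feeding each result into the next replacement.
expand_contractions(s::AbstractString) =
    foldl((acc, p) -> replace(acc, p), replacement_patterns; init = s)
```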

MethodError(getindex Error with corpus

Hello,

I am fairly new to, but very enthusiastic about, text mining in Julia. I have managed to compute some of the basic metrics from the introductory documentation; however, I am running into lots of error messages now that I am trying to work with corpora. Every time I try to standardize or run other commands I get this message:

MethodError(getindex,(A Corpus,1))

It is probably something straightforward, but the documentation is scarce. Any hint would be really appreciated.

updated CharString semantics for Julia 0.3

It seems likely that the CharString constructor will be changing in Julia 0.3 (JuliaLang/julia#7016), but it looks like there is an easy fix to make your code compatible with both Julia 0.2 and Julia 0.3. Just change:

return utf8(CharString(r[1:i]))

to use utf8(r[1:i]). The intermediate construction of a CharString is not necessary even in Julia 0.2.

Problem creating a Corpus

I am still a bit new to Julia so perhaps I am missing something basic here. When I try to create a Corpus following the example in the documentation (after correcting for the change in syntax) using:

Corpus([StringDocument("Document 1"), StringDocument("Document 2")])

I get the error:

MethodError: Cannot convert an object of type Array{TextAnalysis.StringDocument,1} to an object of type TextAnalysis.Corpus

I am using Julia 0.5 and the latest tag version of TextAnalysis

remove_corrupt_utf8! giving "no method matching zero" error

This seems to be a strange error. I looked at the preprocessing.jl file but could not find any reference to a zero function; the code clearly uses zeros. Is it some kind of compatibility/installation issue?

https://github.com/JuliaText/TextAnalysis.jl/blob/v0.2.1/src/preprocessing.jl#L46

How to reproduce:

julia> sd = StringDocument("With the market providing strong extensions over the last several months, we had to review our shorter-term targets for the bull market which began in 2009. During this past week, almost every bounce we saw was corrective in nature, which mainly kept me viewing the market as being in a weak posture, signaling that we will likely test the 2796 level on the S&P 500(^GSPC). While it seems I may have wrongly given the bull market the benefit of the doubt last weekend, as the pullback we expected has now broken below the 2800 region support we cited last weekend, this break of support makes the much stronger immediate bullish expectation much less likely.")
A TextAnalysis.StringDocument

julia> remove_corrupt_utf8!(sd)
ERROR: MethodError: no method matching zero(::Type{Char})
Closest candidates are:
  zero(::Type{Base.LibGit2.GitHash}) at libgit2/oid.jl:106
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuildItem}) at pkg/resolve/versionweight.jl:82
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuild}) at pkg/resolve/versionweight.jl:124
  ...
Stacktrace:
 [1] remove_corrupt_utf8(::String) at /home/ec2-user/.julia/v0.6/TextAnalysis/src/preprocessing.jl:46
 [2] remove_corrupt_utf8!(::TextAnalysis.StringDocument) at /home/ec2-user/.julia/v0.6/TextAnalysis/src/preprocessing.jl:58

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

[PkgEval] TextAnalysis may have a testing issue on Julia 0.3 (2014-08-01)

PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their tests (if available) on both the stable version of Julia (0.2) and the nightly build of the unstable version (0.3). The results of this script are used to generate a package listing enhanced with testing results.

On Julia 0.3

  • On 2014-07-30 the testing status was Tests fail, but package loads.
  • On 2014-08-01 the testing status changed to Package doesn't load.

Tests fail, but package loads. means that PackageEvaluator found the tests for your package, executed them, and they didn't pass. However, trying to load your package with using worked.

Package doesn't load. means that PackageEvaluator did not find tests for your package. Additionally, trying to load your package with using failed.

This issue was filed because your testing status became worse. No additional issues will be filed if your package remains in this state, and no issue will be filed if it improves. If you'd like to opt-out of these status-change messages, reply to this message saying you'd like to and @IainNZ will add an exception. If you'd like to discuss PackageEvaluator.jl please file an issue at the repository. For example, your package may be untestable on the test machine due to a dependency - an exception can be added.

Test log:

INFO: Installing Languages v0.0.1
INFO: Installing TextAnalysis v0.0.1
INFO: Package database updated

signal (11): Segmentation fault
??? at ???:1982132950
??? at ???:1982132515
??? at ???:1982135282
??? at ???:1982138981
??? at ???:1982054748
??? at ???:1981703818
??? at ???:1981705293
??? at ???:1981708965
??? at ???:1981710070
??? at ???:1981712156
??? at ???:1981793071
typeinf at ./inference.jl:1240
??? at ???:1947850708
??? at ???:1981766906
abstract_call_gf at ./inference.jl:724
..truncated..
??? at ???:1981766906
??? at ???:1982047992
??? at ???:1982044265
??? at ???:1982110975
??? at ???:1981799125
reload_path at loading.jl:152
_require at loading.jl:67
??? at ???:1981767045
require at loading.jl:51
??? at ???:1954597347
??? at ???:1981767045
??? at ???:1982106925
??? at ???:1982111579
??? at ???:1982113648
??? at ???:1982114080
include at ./boot.jl:245
??? at ???:1981766906
include_from_node1 at loading.jl:128
??? at ???:1981767045
process_options at ./client.jl:285
_start at ./client.jl:354
??? at ???:1949235993
??? at ???:1981766906
??? at ???:4200466
??? at ???:1982069575
??? at ???:4199453
??? at ???:1962160128
??? at ???:4199507
??? at ???:0
timeout: the monitored command dumped core
INFO: Package database updated

Snowball Stemmer outdated

Hi,

yesterday I tried to install the package and it failed because the website snowball.tartarus.org was down. The admin there kindly fixed the issue, but pointed out that the version there is outdated and that the follow-up project at http://snowballstem.org should be used instead.

Cheers.

Restricting dtm/tf_idf creation to only the top N features from the lexicon

I am working with a corpus of 100k+ documents, so the number of features in the lexicon is extremely high, and I'm running into memory issues. I know that in scikit-learn's TfidfVectorizer and similar approaches you can limit the number of features so that you only deal with the top N features when working with the dtm and tf_idf matrices. Is there some way to do that with this package, or are there plans to add such a feature?

Thanks!
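There is no built-in option for this that the issue confirms, but the corpus can be pruned before building the matrices. A rough sketch, where keep_top_terms! is an ad-hoc helper, not package API:

```julia
using TextAnalysis

# Ad-hoc sketch: drop every lexicon term outside the N most frequent,
# then rebuild the lexicon so the DTM only sees the survivors.
function keep_top_terms!(crps::Corpus, N::Int)
    update_lexicon!(crps)
    lex = sort(collect(lexicon(crps)); by = last, rev = true)  # (term => count) pairs
    drop = [first(p) for p in lex[min(N, length(lex))+1:end]]
    remove_words!(crps, drop)
    update_lexicon!(crps)
    return crps
end
```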

word2vec via gensim?

Would be nice to provide word2vec support within Julia. IMO its natural home would be here.

One way, and hopefully both the easiest and fastest, would be to interface with gensim via PyCall rather than wrapping the original C library.

docs as cols or rows in dtm

Is there a reason why documents are represented as rows in the dtm, i.e. the dtm is an (nDocs x nFeatures) matrix?
Since Julia arrays are column-major, having documents as columns would be computationally more efficient, right?

tf-idf buggy

ERROR: / has no method matching /(::Int64, ::Array{Int64,2})
in tf_idf at /home/steven/.julia/v0.3/TextAnalysis/src/tf_idf.jl:52
in tf_idf at /home/steven/.julia/v0.3/TextAnalysis/src/tf_idf.jl:74

here is a corrected version:

function tf_idf{T <: Real}(dtm::Matrix{T})
    n, p = size(dtm)

    # TF tells us what proportion of a document is defined by a term
    tf = Array(Float64, n, p)
    # IDF tells us how rare a term is in the corpus
    idf = Array(Float64, p)
    for i in 1:n
        words_in_document = 0
        for j in 1:p
            words_in_document += dtm[i, j]
            idf[j] = log(n / sum(vec(dtm[:, j])))
        end
        tf[i, :] = dtm[i, :] / words_in_document
    end

    # TF-IDF is the product of TF and IDF
    # We store it in the TF matrix to save space
    for i in 1:n
        for j in 1:p
            tf[i, j] = tf[i, j] * idf[j]
        end
    end

    return tf
end

# TODO: Rewrite this definition to return a sparse matrix
function tf_idf{T <: Real}(dtm::SparseMatrixCSC{T})
    n, p = size(dtm)

    # TF tells us what proportion of a document is defined by a term
    tf = Array(Float64, n, p)
    # IDF tells us how rare a term is in the corpus
    idf = Array(Float64, p)
    for i in 1:n
        words_in_document = 0
        for j in 1:p
            words_in_document += dtm[i, j]
            idf[j] = log(n / sum(vec(dtm[:, j])))
        end
        tf[i, :] = dtm[i, :] / words_in_document
    end

    # TF-IDF is the product of TF and IDF
    # We store it in the TF matrix to save space
    for i in 1:n
        for j in 1:p
            tf[i, j] = tf[i, j] * idf[j]
        end
    end

    return tf
end

function tf_idf!{T <: Real}(dtm::Matrix{T})
    error("not yet implemented")
end

function tf_idf!{T <: Real}(dtm::SparseMatrixCSC{T})
    error("not yet implemented")
end

function tf_idf(dtm::DocumentTermMatrix)
    tf_idf(dtm.dtm)
end

function tf_idf!(dtm::DocumentTermMatrix)
    tf_idf!(dtm.dtm)
end
