
aritter / twitter_nlp


Twitter NLP Tools

License: GNU General Public License v3.0

Shell 0.04% Perl 0.04% Python 0.57% C 0.83% CSS 0.01% Java 12.26% C++ 0.05% XSLT 0.01% Makefile 0.05% Batchfile 0.01% HTML 85.89% Roff 0.23%

twitter_nlp's Issues

Randomness in output classification

When I run the named entity tagger twice with --classify, the outputs are quite different: out of 250 sentences, 50 are tagged differently. Is there something wrong with my build, or is this normal behaviour?

This makes it very hard to use the tagger as a preprocessing step and to reproduce results.
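One way to make the report concrete is to diff the two runs mechanically. This is a sketch, not part of twitter_nlp; the file names are hypothetical, and it assumes each output file holds one tagged sentence per line.

```python
# Sketch (not part of twitter_nlp): count how many output lines differ
# between two runs of the tagger. Assumes one tagged sentence per line.
def disagreement(path1, path2):
    diff = total = 0
    with open(path1) as f1, open(path2) as f2:
        for line1, line2 in zip(f1, f2):
            total += 1
            if line1.strip() != line2.strip():
                diff += 1
    return diff, total
```

With the reporter's numbers, a call like `disagreement("run1.txt", "run2.txt")` would return (50, 250).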

BIO encoding

Where can I find the list of BIO tags available in twitter_nlp? And has anyone tried porting the code from Python to other platforms such as Java or C++?
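As far as I can tell, the tag set is BIO over whatever entity types the model was trained with; the --classify output shown in other issues here includes tags like B-person and B-geo-loc. A minimal sketch of how BIO tags decode into entity spans (pure illustration, not the repo's code):

```python
# Sketch of decoding the tagger's BIO output (word/TAG pairs). In BIO
# encoding, B-<type> begins an entity, I-<type> continues it, and O marks
# tokens outside any entity. The available <type> labels come from the
# trained model (e.g. person, geo-loc), not from this snippet.
def extract_entities(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs -> list of (phrase, type)."""
    entities, words, etype = [], [], None
    for word, tag in tagged_tokens:
        if tag.startswith("B-"):
            if words:                      # close the previous entity
                entities.append((" ".join(words), etype))
            words, etype = [word], tag[2:]
        elif tag.startswith("I-") and words:
            words.append(word)             # continue the open entity
        else:                              # O (or a stray I-): close any open entity
            if words:
                entities.append((" ".join(words), etype))
            words, etype = [], None
    if words:                              # flush an entity ending the sentence
        entities.append((" ".join(words), etype))
    return entities
```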

./getopt.h:131:12: error: conflicting types for 'getopt'

Hi

I was trying to run the command below:

Dinakar$ cat test.1k.txt | python python/ner/extractEntities2.py

and got the following error:

No Java runtime present, requesting install.

So I have installed it.

But after it finished installing, I got this error:

/bin/sh: .//python/cap/cap_classify: No such file or directory
Traceback (most recent call last):
  File "python/ner/extractEntities2.py", line 131, in <module>
    goodCap = capClassifier.Classify(words) > 0.9
  File ".//python/cap/cap_classifier.py", line 33, in Classify
    self.capClassifier.stdin.write("%s\n" % self.fe.Extract(' '.join(words)))
IOError: [Errno 32] Broken pipe

Referring to your earlier post, I did try running bash.sh but got an error:

http://stackoverflow.com/questions/29491692/install-twitternlp-by-allen-ritter-on-mac-os

Can you please help me with this?

Got error when running "cat test.1k.txt | python python/ner/extractEntities2.py"

Hi,

I was trying out twitter_nlp, following the readme, and got the following error. Can you show me how to work around it?

.//python/cap/cap_classify: 1: .//python/cap/cap_classify: Syntax error: ")" unexpected
Traceback (most recent call last):
  File "python/ner/extractEntities2.py", line 131, in <module>
    goodCap = capClassifier.Classify(words) > 0.9
  File ".//python/cap/cap_classifier.py", line 33, in Classify
    self.capClassifier.stdin.write("%s\n" % self.fe.Extract(' '.join(words)))
IOError: [Errno 32] Broken pipe

Fixed the UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 15: ordinal not in range(128) error. Please update the code

I got encoding issues with the tweets I had, so I changed line 160 in python/ner/extractEntities.py to:

        #line = tweet.encode('utf-8', "ignore")
        line = tweet.decode('iso8859-15', 'ignore')

And my problem was fixed.

The original error was:
[jalal@goku twitter_nlp]$ python python/ner/extractEntities.py mytweets.txt -o my_out_tweets.txt
Starting with the following configuration

Input file: mytweets.txt
Text Position: 0
Output file: my_out_tweets.txt
Chunk: False
POS: False
Event: False
Classify: False
Mallet Memory: 256m

Finished loading all models. Now reading from mytweets.txt and writing to my_out_tweets.txt
Traceback (most recent call last):
  File "python/ner/extractEntities.py", line 158, in <module>
    line = tweet.encode('utf-8', "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 15: ordinal not in range(128)
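For anyone hitting the same thing: the original line assumed the tweet was already unicode and called .encode(), which makes Python 2 decode the raw bytes with the implicit ASCII codec first and fail on 0xef. A hedged sketch of the idea behind the fix, decoding the bytes explicitly (the fallback encoding and the 'ignore' policy are choices, not something the repo prescribes):

```python
# Sketch: decode raw tweet bytes explicitly instead of relying on the
# implicit ASCII codec. The iso8859-15 fallback and 'ignore' error policy
# are assumptions mirroring the poster's workaround, not the repo's code.
def to_unicode(raw_bytes):
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("iso8859-15", "ignore")
```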

No module named twokenize

After cloning the repo and running the build, I ran the command

export TWITTER_NLP=./ & cat 'data.jsonl' | python2 python/ner/extractEntities2_json.py --pos --chunk > 'data_tagging.jsonl'

and got the following error:

Traceback (most recent call last):
  File "python/ner/extractEntities2_json.py", line 26, in <module>
    import twokenize
ImportError: No module named twokenize
cat: write error: Broken pipe

(I am using python2, of course.)

Does anyone know how to fix this?
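One thing to check before debugging the import itself: in the command above, the single '&' runs the export as a background job, so TWITTER_NLP is never set for the pipeline that follows; if the scripts rely on TWITTER_NLP to locate the directory containing twokenize, the import then fails. A sketch of the corrected invocation (paths assume a standard checkout):

```shell
# '&&' chains the export into the same shell as the pipeline; a single '&'
# backgrounds the export and leaves TWITTER_NLP unset for what follows.
export TWITTER_NLP=$(pwd) && \
  cat data.jsonl | python2 python/ner/extractEntities2_json.py --pos --chunk > data_tagging.jsonl
```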

Erroneous classification

Here is my input file, test101:
usgs reports a m0.46 #earthquake 13km nw of jodhpur city, rajasthan on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
usgs reports a m0.46 #earthquake 13km nw of virgnia city, nevada on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
usgs reports a m0.46 #earthquake 13km nw of jodhpur, rajasthan on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
usgs reports a m0.46 #earthquake 13km nw of virgnia, nevada on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
I live in Jodhpur.

Virgnia is classified as JJ (adjective) or B-person, and similarly Jodhpur, whereas both should be classified as B-geo-loc.

$ cat test101 | python python/ner/extractEntities2.py --classify --pos --event

usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O jodhpur/O/JJ/O city/O/NN/O ,/O/,/O rajasthan/O/VBN/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O

usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O virgnia/O/JJ/O city/O/NN/O ,/O/,/O nevada/B-geo-loc/NNP/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O

usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O jodhpur/O/NN/O ,/O/,/O rajasthan/O/VBN/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O

usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O virgnia/B-person/NNP/O ,/O/,/O nevada/B-geo-loc/NNP/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O

I/O/PRP/O live/O/VBP/B-EVENT in/O/IN/O Jodhpur/B-person/NNP/O ./O/./O

Windows

How can I use your module on Windows as part of my research work?

Incorrect parameter ordering at python/pos_tagger_stdin.py:37?

Hello all,

I noticed that in line 37 of python/pos_tagger_stdin.py, the current POSFeatureExtractor constructor call is

self.fe = features.POSFeatureExtractor(_TOKEN2POS_MAPS, _BIGRAM, _TOKEN_MAPS, _CLUSTERS)

while the constructor signature (line 14 of python/pos_tag/features.py) is

def __init__(self, token2pos_dir, token_dir, bigram_dir=None, cluster_fp=None):

Is the positioning of _BIGRAM and _TOKEN_MAPS incorrectly reversed? Thank you.

Best wishes,
Yiye Ruan
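If the signature quoted above is accurate, passing the directories by keyword would sidestep the ordering question entirely. A toy reconstruction of the mismatch (POSFeatureExtractor here is a stand-in stub, not the real class; the string values are placeholders):

```python
# Stub mirroring the signature reported from python/pos_tag/features.py.
class POSFeatureExtractor(object):
    def __init__(self, token2pos_dir, token_dir, bigram_dir=None, cluster_fp=None):
        self.token_dir = token_dir
        self.bigram_dir = bigram_dir

# Positional call as in pos_tagger_stdin.py: _BIGRAM lands in token_dir.
bad = POSFeatureExtractor("t2p_maps", "bigrams", "token_maps", "clusters")

# Keyword arguments make the intent explicit and immune to reordering.
good = POSFeatureExtractor("t2p_maps", token_dir="token_maps",
                           bigram_dir="bigrams", cluster_fp="clusters")
```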

Sequence of Punctuation POS

Hi,

I noticed that sequences of punctuation get no POS tag when using this command:

cat myTweets.txt | python python/ner/extractEntities2.py --classify --pos

For example: zombies/O/NNS ..../O/ smh/O/UH
I took a quick look inside the code, but I am not a Python expert; maybe you guarded against this somewhere.

Thanks,

cannot find -lm

When I execute './build.sh' on CentOS, the final message is:
/usr/bin/ld: cannot find -lm
collect2: ld returned 1 exit status

My gcc version:
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
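On CentOS/RHEL, "cannot find -lm" during a statically linked build usually means the static C library package is missing rather than anything wrong in the repo. A possible fix, assuming yum and that these package names apply to your release (verify them first):

```shell
# Package names are an assumption for CentOS/RHEL; check your release.
sudo yum install -y glibc-static libstdc++-static
```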

Use sys.stdin.readline in place of codecs.getreader().readline in extractEntities2.py

In extractEntites2.py
We have the following

reader = codecs.getreader("utf-8")(sys.stdin)
line = reader.readline().strip()

Can we change that to

  tweet = sys.stdin.readline().strip()
  line = tweet.encode('utf-8')

  • The reason is that the codecs module's readline blocks on input lines shorter than 72 characters.
  • When we use a subprocess to connect to extractEntities2.py and send a line shorter than 72 characters over stdin, the input blocks.
  • The steps to reproduce are given below.

Reproduce the issue

import subprocess

def GetExtracter():
    # Launch extractEntities2.py as a child process, talking over pipes.
    return subprocess.Popen("python extractEntities2.py --classify",
                            close_fds=True,
                            shell=True,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)

wrapper = GetExtracter()

wrapper.stdin.write('This is a test\n')
#wrapper.stdin.write(<give tweet more than 72 characters>)  # -> this will work
wrapper.stdin.flush()
nlp_string = wrapper.stdout.readline().rstrip('\n').strip(' ')
print nlp_string

Issue Noticed

The code above blocks permanently on the readline call.
However, sending a tweet longer than 72 characters does trigger the NLP wrapper.

MacOS

Hello, is it possible to use this tool on macOS, or does it run only on Linux?
I read that Linux is a requirement, but I tried to run it anyway and got an error at compile time:

./getopt.h:131:12: error: conflicting types for 'getopt'
extern int getopt ();
           ^
/usr/include/unistd.h:503:6: note: previous declaration is here
int      getopt(int, char * const [], const char *) __DARWIN_ALIAS(getopt);
         ^
param.cpp:217:17: warning: conversion from string literal to 'char *' is
      deprecated [-Wc++11-compat-deprecated-writable-strings]
    char *tmp = "TinySVM::Param::set";
                ^
1 warning and 1 error generated.
make[2]: *** [param.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive-am] Error 2
ld: library not found for -lcrt0.o
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Is this related to macOS?

Python 3 compatible?

Hello,

I've tried running extractEntities.py with Python 3 and I get the following error:

Traceback (most recent call last):
  File "twitter_nlp/python/ner/extractEntities.py", line 28, in <module>
    import Features
  File "twitter_nlp/python/ner/Features.py", line 12, in <module>
    if os.environ.has_key('TWITTER_NLP'):
AttributeError: '_Environ' object has no attribute 'has_key'

Is the package compatible with Python 3?
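For what it's worth, dict.has_key() was removed in Python 3; the `in` operator does the same test and works in both Python 2 and 3. A minimal sketch of a portable rewrite of the check in Features.py (the helper name and the '.' fallback are illustrative, not taken from the repo):

```python
import os

def get_base_dir(environ=os.environ):
    # 'in' replaces the removed has_key(); the '.' fallback is illustrative.
    if 'TWITTER_NLP' in environ:
        return environ['TWITTER_NLP']
    return '.'
```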

twokenize_wrapper.py - typo

Hi, in twokenize_wrapper.py there is a little typo that prevents I'll, you'll, etc. from being split.

Line 37 reads new_tok = token[:-3], but the variable used elsewhere is new_tk (no 'o'), so new_tok is never used again.
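A toy version of the contraction split the typo breaks: the truncated token is assigned to a name the surrounding code never reads, so renaming it makes the split take effect. This is a stand-in illustration, not the actual twokenize_wrapper code:

```python
# Stand-in for the affected logic: split "'ll" contractions into two tokens.
def split_contraction(token):
    if token.lower().endswith("'ll"):
        new_tk = token[:-3]          # was: new_tok = token[:-3] (never used)
        return [new_tk, token[-3:]]
    return [token]
```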

Error when i use classify switch

Whenever I try to use the classify switch, i.e.
cat test.1k.txt | python python/ner/extractEntities2.py --classify --pos --chunk
it throws the following error:
Traceback (most recent call last):
  File "python/ner/extractEntities2.py", line 109, in <module>
    dictionaries = Dictionaries('%s/data/LabeledLDA_dictionaries3' % (BASE_DIR), dict2index)
  File "/home/jawad/Documents/twitter_nlp/hbc/python/Dictionaries.py", line 28, in __init__
    self.dictionaries.sort(lambda a,b: cmp(dict2index[a], dict2index[b]))
  File "/home/jawad/Documents/twitter_nlp/hbc/python/Dictionaries.py", line 28, in <lambda>
    self.dictionaries.sort(lambda a,b: cmp(dict2index[a], dict2index[b]))
KeyError: '.DS_Store'

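The KeyError suggests macOS dropped a .DS_Store file into the LabeledLDA_dictionaries3 directory, and Dictionaries.py treats every directory entry as a dictionary name. Deleting the stray file should be enough; a defensive sketch of the alternative, filtering hidden files when listing the directory (list_dictionaries is a hypothetical helper, not the repo's code):

```python
import os

def list_dictionaries(dict_dir):
    # Skip macOS metadata like .DS_Store and other hidden entries.
    return [name for name in os.listdir(dict_dir)
            if not name.startswith('.')]
```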

IOError: [Errno 32] Broken pipe

The OS is macOS, on a MacBook Air.

How can I solve it?

$ cat test.1k.txt | python python/ner/extractEntities2.py
/bin/sh: .//python/cap/cap_classify: cannot execute binary file
Traceback (most recent call last):
  File "python/ner/extractEntities2.py", line 131, in <module>
    goodCap = capClassifier.Classify(words) > 0.9
  File ".//python/cap/cap_classifier.py", line 33, in Classify
    self.capClassifier.stdin.write("%s\n" % self.fe.Extract(' '.join(words)))
IOError: [Errno 32] Broken pipe

hashtags are not considered as Named Entities

I assume mentions should be considered named entities:
@realDonaldTrump/O Thank/O you/O for/O saying/O you/O won't/O use/O vulger/O language/O anymore/O ./O Talk/O about/O Sanders/B-ENTITY &/O Clinton/B-ENTITY ./O Take/O Cruz/O as/O VP/O ./O Mexican/O votes/O !!!/O

but as you can see in the results, mentions are tagged as O. Is there a way to change that within your code?
Additionally, do you know why Cruz is not recognized as a named entity in the example above?

hard code shebang in `twokenize_wrapper`

The shebang line in the file twokenize_wrapper.py is hard-coded to a path that won't work on any machine other than Alan Ritter's. Please replace it with:

#!/usr/bin/env python

Not being able to build on a Mac

I am trying to build on a Mac, but it fails with the following:

c++ -DHAVE_CONFIG_H -I. -I. -I.. -Wall -O9 -funroll-all-loops -finline -ffast-math -mieee-fp -c param.cpp  -fno-common -DPIC -o .libs/param.lo
In file included from common.h:76,
                 from param.cpp:28:
./getopt.h:131: error: declaration of C function 'int getopt()' conflicts with
/usr/include/unistd.h:548: error: previous declaration 'int getopt(int, char* const*, const char*)' here

Any ideas on how to solve it?
