
aritter / twitter_nlp


Twitter NLP Tools

License: GNU General Public License v3.0

Shell 0.04% Perl 0.04% Python 0.57% C 0.83% CSS 0.01% Java 12.26% C++ 0.05% XSLT 0.01% Makefile 0.05% Batchfile 0.01% HTML 85.89% Roff 0.23%

twitter_nlp's Issues

Randomness in output classification

When I run the named entity tagger twice with --classify, the outputs are quite different: out of 250 sentences, 50 are tagged differently. Is there something wrong with my build, or is this normal behaviour?

This makes it very hard to use the tagger as a preprocessing step and to reproduce results.
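One way to make the report concrete is to diff the two runs mechanically. This is a sketch, not part of twitter_nlp; the file names are hypothetical, and it assumes each output file holds one tagged sentence per line.

```python
# Sketch (not part of twitter_nlp): count how many output lines differ
# between two runs of the tagger. Assumes one tagged sentence per line.
def disagreement(path1, path2):
    diff = total = 0
    with open(path1) as f1, open(path2) as f2:
        for line1, line2 in zip(f1, f2):
            total += 1
            if line1.strip() != line2.strip():
                diff += 1
    return diff, total
```

With the reporter's numbers, a call like `disagreement("run1.txt", "run2.txt")` would return (50, 250).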

BIO encoding

Where can I find the list of BIO tags available in twitter_nlp? And has anyone tried porting the code from Python to other platforms such as Java or C++?
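As far as I can tell, the tag set is BIO over whatever entity types the model was trained with; the --classify output shown in other issues here includes tags like B-person and B-geo-loc. A minimal sketch of how BIO tags decode into entity spans (pure illustration, not the repo's code):

```python
# Sketch of decoding the tagger's BIO output (word/TAG pairs). In BIO
# encoding, B-<type> begins an entity, I-<type> continues it, and O marks
# tokens outside any entity. The available <type> labels come from the
# trained model (e.g. person, geo-loc), not from this snippet.
def extract_entities(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs -> list of (phrase, type)."""
    entities, words, etype = [], [], None
    for word, tag in tagged_tokens:
        if tag.startswith("B-"):
            if words:                      # close the previous entity
                entities.append((" ".join(words), etype))
            words, etype = [word], tag[2:]
        elif tag.startswith("I-") and words:
            words.append(word)             # continue the open entity
        else:                              # O (or a stray I-): close any open entity
            if words:
                entities.append((" ".join(words), etype))
            words, etype = [], None
    if words:                              # flush an entity ending the sentence
        entities.append((" ".join(words), etype))
    return entities
```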

./getopt.h:131:12: error: conflicting types for 'getopt'

Hi

I was trying to run the command below:

Dinakar$ cat test.1k.txt | python python/ner/extractEntities2.py

and got the following error:

No Java runtime present, requesting install.

So I have installed it.

But after it finished installing, I got this error:

/bin/sh: .//python/cap/cap_classify: No such file or directory
Traceback (most recent call last):
  File "python/ner/extractEntities2.py", line 131, in <module>
    goodCap = capClassifier.Classify(words) > 0.9
  File ".//python/cap/cap_classifier.py", line 33, in Classify
    self.capClassifier.stdin.write("%s\n" % self.fe.Extract(' '.join(words)))
IOError: [Errno 32] Broken pipe

Referring to your earlier post, I did try running bash.sh but got an error:

http://stackoverflow.com/questions/29491692/install-twitternlp-by-allen-ritter-on-mac-os

Can you please help me with this?

Got error when running "cat test.1k.txt | python python/ner/extractEntities2.py"

Hi,

I was trying out twitter_nlp, following the readme, and got the following error. Can you show me how to work around it?

.//python/cap/cap_classify: 1: .//python/cap/cap_classify: Syntax error: ")" unexpected
Traceback (most recent call last):
  File "python/ner/extractEntities2.py", line 131, in <module>
    goodCap = capClassifier.Classify(words) > 0.9
  File ".//python/cap/cap_classifier.py", line 33, in Classify
    self.capClassifier.stdin.write("%s\n" % self.fe.Extract(' '.join(words)))
IOError: [Errno 32] Broken pipe

Fixed the UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 15: ordinal not in range(128) error. Please update the code

I got encoding issues with the tweets I had, so I changed line 160 in python/ner/extractEntities.py to:

        #line = tweet.encode('utf-8', "ignore")
        line = tweet.decode('iso8859-15', 'ignore')

And my problem was fixed.

The original error was:
[jalal@goku twitter_nlp]$ python python/ner/extractEntities.py mytweets.txt -o my_out_tweets.txt
Starting with the following configuration

Input file: mytweets.txt
Text Position: 0
Output file: my_out_tweets.txt
Chunk: False
POS: False
Event: False
Classify: False
Mallet Memory: 256m

Finished loading all models. Now reading from mytweets.txt and writing to my_out_tweets.txt
Traceback (most recent call last):
  File "python/ner/extractEntities.py", line 158, in <module>
    line = tweet.encode('utf-8', "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 15: ordinal not in range(128)
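For anyone hitting the same thing: the original line assumed the tweet was already unicode and called .encode(), which makes Python 2 decode the raw bytes with the implicit ASCII codec first and fail on 0xef. A hedged sketch of the idea behind the fix, decoding the bytes explicitly (the fallback encoding and the 'ignore' policy are choices, not something the repo prescribes):

```python
# Sketch: decode raw tweet bytes explicitly instead of relying on the
# implicit ASCII codec. The iso8859-15 fallback and 'ignore' error policy
# are assumptions mirroring the poster's workaround, not the repo's code.
def to_unicode(raw_bytes):
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("iso8859-15", "ignore")
```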

No module named twokenize

After cloning the repo and running the build, I ran the command

export TWITTER_NLP=./ & cat 'data.jsonl' | python2 python/ner/extractEntities2_json.py --pos --chunk > 'data_tagging.jsonl'

and got the following error:

Traceback (most recent call last):
  File "python/ner/extractEntities2_json.py", line 26, in <module>
    import twokenize
ImportError: No module named twokenize
cat: write error: Broken pipe

(I am using python2, of course.)

Does anyone know how to fix this?
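One thing to check before debugging the import itself: in the command above, the single '&' runs the export as a background job, so TWITTER_NLP is never set for the pipeline that follows; if the scripts rely on TWITTER_NLP to locate the directory containing twokenize, the import then fails. A sketch of the corrected invocation (paths assume a standard checkout):

```shell
# '&&' chains the export into the same shell as the pipeline; a single '&'
# backgrounds the export and leaves TWITTER_NLP unset for what follows.
export TWITTER_NLP=$(pwd) && \
  cat data.jsonl | python2 python/ner/extractEntities2_json.py --pos --chunk > data_tagging.jsonl
```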

Erroneous classification

Here is my input file, test101:
usgs reports a m0.46 #earthquake 13km nw of jodhpur city, rajasthan on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
usgs reports a m0.46 #earthquake 13km nw of virgnia city, nevada on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
usgs reports a m0.46 #earthquake 13km nw of jodhpur, rajasthan on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
usgs reports a m0.46 #earthquake 13km nw of virgnia, nevada on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
I live in Jodhpur.

Virgnia is classified as JJ (adjective) or B-person, and similarly Jodhpur, whereas both should be classified as B-geo-loc.

$ cat test101 | python python/ner/extractEntities2.py --classify --pos --event

usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O jodhpur/O/JJ/O city/O/NN/O ,/O/,/O rajasthan/O/VBN/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O

usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O virgnia/O/JJ/O city/O/NN/O ,/O/,/O nevada/B-geo-loc/NNP/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O

usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O jodhpur/O/NN/O ,/O/,/O rajasthan/O/VBN/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O

usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O virgnia/B-person/NNP/O ,/O/,/O nevada/B-geo-loc/NNP/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O

I/O/PRP/O live/O/VBP/B-EVENT in/O/IN/O Jodhpur/B-person/NNP/O ./O/./O

Windows

How can I use your module on Windows as part of my research work?

Incorrect parameter ordering at python/pos_tagger_stdin.py:37?

Hello all,

I noticed that in line 37 of python/pos_tagger_stdin.py, the current POSFeatureExtractor constructor call is

self.fe = features.POSFeatureExtractor(_TOKEN2POS_MAPS, _BIGRAM, _TOKEN_MAPS, _CLUSTERS)

while the constructor signature (line 14 of python/pos_tag/features.py) is

def __init__(self, token2pos_dir, token_dir, bigram_dir=None, cluster_fp=None):

Is the positioning of _BIGRAM and _TOKEN_MAPS incorrectly reversed? Thank you.

Best wishes,
Yiye Ruan
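If the signature quoted above is accurate, passing the directories by keyword would sidestep the ordering question entirely. A toy reconstruction of the mismatch (POSFeatureExtractor here is a stand-in stub, not the real class; the string values are placeholders):

```python
# Stub mirroring the signature reported from python/pos_tag/features.py.
class POSFeatureExtractor(object):
    def __init__(self, token2pos_dir, token_dir, bigram_dir=None, cluster_fp=None):
        self.token_dir = token_dir
        self.bigram_dir = bigram_dir

# Positional call as in pos_tagger_stdin.py: _BIGRAM lands in token_dir.
bad = POSFeatureExtractor("t2p_maps", "bigrams", "token_maps", "clusters")

# Keyword arguments make the intent explicit and immune to reordering.
good = POSFeatureExtractor("t2p_maps", token_dir="token_maps",
                           bigram_dir="bigrams", cluster_fp="clusters")
```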

Sequence of Punctuation POS

Hi,

I noticed that sequences of punctuation get no POS tag when using this command:

cat myTweets.txt | python python/ner/extractEntities2.py --classify --pos

For example: zombies/O/NNS ..../O/ smh/O/UH
I took a quick look inside the code, but I am not a Python expert; maybe you guarded against this somewhere.

Thanks,

cannot find -lm

When I execute './build.sh' on CentOS, the final message is:
/usr/bin/ld: cannot find -lm
collect2: ld returned 1 exit status

My gcc version:
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
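On CentOS/RHEL, "cannot find -lm" during a statically linked build usually means the static C library package is missing rather than anything wrong in the repo. A possible fix, assuming yum and that these package names apply to your release (verify them first):

```shell
# Package names are an assumption for CentOS/RHEL; check your release.
sudo yum install -y glibc-static libstdc++-static
```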

Use sys.stdin.readline in place of codecs.getreader().readline in extractEntities2.py

In extractEntites2.py
We have the following

reader = codecs.getreader("utf-8")(sys.stdin)
line = reader.readline().strip()

Can we change that to

  tweet = sys.stdin.readline().strip()
  line = tweet.encode('utf-8')

  • The reason is that the codecs module's readline blocks on input lines shorter than 72 characters.
  • When we use a subprocess to connect to extractEntities2.py and send a line shorter than 72 characters over stdin, the input blocks.
  • The steps to reproduce are given below.

Reproduce the issue

import subprocess

def GetExtracter():
    # Launch extractEntities2.py as a child process, talking over pipes.
    return subprocess.Popen("python extractEntities2.py --classify",
                            close_fds=True,
                            shell=True,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)

wrapper = GetExtracter()

wrapper.stdin.write('This is a test\n')
#wrapper.stdin.write(<give tweet more than 72 characters>)  # -> this will work
wrapper.stdin.flush()
nlp_string = wrapper.stdout.readline().rstrip('\n').strip(' ')
print nlp_string

Issue Noticed

The code above blocks permanently on the readline call.
However, sending a tweet longer than 72 characters does trigger the NLP wrapper.

MacOS

Hello, is it possible to use this tool on macOS, or does it run only on Linux?
I read that Linux is a requirement, but I tried to run it anyway and got an error at compile time:

./getopt.h:131:12: error: conflicting types for 'getopt'
extern int getopt ();
           ^
/usr/include/unistd.h:503:6: note: previous declaration is here
int      getopt(int, char * const [], const char *) __DARWIN_ALIAS(getopt);
         ^
param.cpp:217:17: warning: conversion from string literal to 'char *' is
      deprecated [-Wc++11-compat-deprecated-writable-strings]
    char *tmp = "TinySVM::Param::set";
                ^
1 warning and 1 error generated.
make[2]: *** [param.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive-am] Error 2
ld: library not found for -lcrt0.o
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Is this related to macOS?

Python 3 compatible?

Hello,

I've tried running extractEntities.py with Python 3 and I get the following error:

Traceback (most recent call last):
  File "twitter_nlp/python/ner/extractEntities.py", line 28, in <module>
    import Features
  File "twitter_nlp/python/ner/Features.py", line 12, in <module>
    if os.environ.has_key('TWITTER_NLP'):
AttributeError: '_Environ' object has no attribute 'has_key'

Is the package compatible with Python 3?
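For what it's worth, dict.has_key() was removed in Python 3; the `in` operator does the same test and works in both Python 2 and 3. A minimal sketch of a portable rewrite of the check in Features.py (the helper name and the '.' fallback are illustrative, not taken from the repo):

```python
import os

def get_base_dir(environ=os.environ):
    # 'in' replaces the removed has_key(); the '.' fallback is illustrative.
    if 'TWITTER_NLP' in environ:
        return environ['TWITTER_NLP']
    return '.'
```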

twokenize_wrapper.py - typo

Hi, in twokenize_wrapper.py there is a little typo that prevents I'll, you'll, etc. from being split.

Line 37 reads new_tok = token[:-3], but the variable used elsewhere is new_tk (no 'o'), so new_tok is never used again.
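A toy version of the contraction split the typo breaks: the truncated token is assigned to a name the surrounding code never reads, so renaming it makes the split take effect. This is a stand-in illustration, not the actual twokenize_wrapper code:

```python
# Stand-in for the affected logic: split "'ll" contractions into two tokens.
def split_contraction(token):
    if token.lower().endswith("'ll"):
        new_tk = token[:-3]          # was: new_tok = token[:-3] (never used)
        return [new_tk, token[-3:]]
    return [token]
```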

Error when i use classify switch

Whenever I try to use the classify switch, i.e.
cat test.1k.txt | python python/ner/extractEntities2.py --classify --pos --chunk
it throws the following error:
Traceback (most recent call last):
  File "python/ner/extractEntities2.py", line 109, in <module>
    dictionaries = Dictionaries('%s/data/LabeledLDA_dictionaries3' % (BASE_DIR), dict2index)
  File "/home/jawad/Documents/twitter_nlp/hbc/python/Dictionaries.py", line 28, in __init__
    self.dictionaries.sort(lambda a,b: cmp(dict2index[a], dict2index[b]))
  File "/home/jawad/Documents/twitter_nlp/hbc/python/Dictionaries.py", line 28, in <lambda>
    self.dictionaries.sort(lambda a,b: cmp(dict2index[a], dict2index[b]))
KeyError: '.DS_Store'

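The KeyError suggests macOS dropped a .DS_Store file into the LabeledLDA_dictionaries3 directory, and Dictionaries.py treats every directory entry as a dictionary name. Deleting the stray file should be enough; a defensive sketch of the alternative, filtering hidden files when listing the directory (list_dictionaries is a hypothetical helper, not the repo's code):

```python
import os

def list_dictionaries(dict_dir):
    # Skip macOS metadata like .DS_Store and other hidden entries.
    return [name for name in os.listdir(dict_dir)
            if not name.startswith('.')]
```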

IOError: [Errno 32] Broken pipe

The OS is macOS, on a MacBook Air.

How can I solve it?

$ cat test.1k.txt | python python/ner/extractEntities2.py
/bin/sh: .//python/cap/cap_classify: cannot execute binary file
Traceback (most recent call last):
  File "python/ner/extractEntities2.py", line 131, in <module>
    goodCap = capClassifier.Classify(words) > 0.9
  File ".//python/cap/cap_classifier.py", line 33, in Classify
    self.capClassifier.stdin.write("%s\n" % self.fe.Extract(' '.join(words)))
IOError: [Errno 32] Broken pipe

hashtags are not considered as Named Entities

I assume mentions should be considered named entities:
@realDonaldTrump/O Thank/O you/O for/O saying/O you/O won't/O use/O vulger/O language/O anymore/O ./O Talk/O about/O Sanders/B-ENTITY &/O Clinton/B-ENTITY ./O Take/O Cruz/O as/O VP/O ./O Mexican/O votes/O !!!/O

but as you can see in the results, mentions are tagged as O. Is there a way to change that within your code?
Additionally, do you know why Cruz is not recognized as a named entity in the example above?

hard code shebang in `twokenize_wrapper`

The shebang line in the file twokenize_wrapper.py is hard-coded to a path that won't work on any machine other than Alan Ritter's. Please replace it with:

#!/usr/bin/env python

Not being able to build on a Mac

I am trying to build on a Mac, but it fails with the following:

c++ -DHAVE_CONFIG_H -I. -I. -I.. -Wall -O9 -funroll-all-loops -finline -ffast-math -mieee-fp -c param.cpp  -fno-common -DPIC -o .libs/param.lo
In file included from common.h:76,
                 from param.cpp:28:
./getopt.h:131: error: declaration of C function 'int getopt()' conflicts with
/usr/include/unistd.h:548: error: previous declaration 'int getopt(int, char* const*, const char*)' here

Any ideas on how to solve it?
