aritter / twitter_nlp
Twitter NLP Tools
License: GNU General Public License v3.0
Hi, in twokenize_wrapper.py there is a little typo that prevents I'll, you'll, etc. from being split.
Line 37 reads: new_tok = token[:-3]
The variable used elsewhere is 'new_tk' (no 'o'), so new_tok is never used again.
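To make the report concrete, here is a hedged reconstruction of what line 37 presumably intends; the function name and surrounding logic are illustrative, not the repo's exact code. The point is that the truncated token must be read back under the same name it was assigned to.

```python
# Hypothetical sketch of the contraction split around line 37 of
# twokenize_wrapper.py. The reported bug: the code assigns to new_tok
# but later reads new_tk, so the truncation never takes effect.
# Using one name consistently restores the split.
def split_ll_contraction(token):
    if token.lower().endswith("'ll"):
        new_tok = token[:-3]           # "I'll" -> "I"
        return [new_tok, token[-3:]]   # -> ["I", "'ll"]
    return [token]
```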
After cloning, running the build, and then running the command
export TWITTER_NLP=./ & cat 'data.jsonl' | python2 python/ner/extractEntities2_json.py --pos --chunk > 'data_tagging.jsonl'
I get the error:
Traceback (most recent call last):
File "python/ner/extractEntities2_json.py", line 26, in <module>
import twokenize
ImportError: No module named twokenize
cat: write error: Broken pipe
With python2, of course.
Does anyone know how to fix this?
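One plausible workaround, assuming twokenize.py lives under the repo's python/ directory (an assumption this sketch does not verify): make sure TWITTER_NLP points at the repo root and that the tokenizer's directory is on sys.path before the import runs.

```python
import os
import sys

# Prepend the tokenizer's directory to sys.path so `import twokenize`
# can resolve. TWITTER_NLP should be the repo root; the python/ subpath
# is an assumption about where twokenize.py lives.
base = os.environ.get('TWITTER_NLP', '.')
sys.path.insert(0, os.path.join(base, 'python'))
```

Also note that `export TWITTER_NLP=./ &` backgrounds the export in a subshell, so the variable may never reach the pipeline's environment; running the export on its own line first is safer.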
When I run the named entity tagger twice with --classify, the outputs are quite different: out of 250 sentences, 50 are tagged differently. Is something wrong with my build, or is this normal behaviour?
This makes it very hard to use the tagger as a preprocessing step and to reproduce results.
Hello, is it possible to use this tool on macOS, or does it run only on Linux?
I read that Linux is a requirement, but I tried anyway and got an error at compile time:
./getopt.h:131:12: error: conflicting types for 'getopt'
extern int getopt ();
^
/usr/include/unistd.h:503:6: note: previous declaration is here
int getopt(int, char * const [], const char *) __DARWIN_ALIAS(getopt);
^
param.cpp:217:17: warning: conversion from string literal to 'char *' is
deprecated [-Wc++11-compat-deprecated-writable-strings]
char *tmp = "TinySVM::Param::set";
^
1 warning and 1 error generated.
make[2]: *** [param.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive-am] Error 2
ld: library not found for -lcrt0.o
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Is this related to macOS?
I assume mentions should be considered named entities:
@realDonaldTrump/O Thank/O you/O for/O saying/O you/O won't/O use/O vulger/O language/O anymore/O ./O Talk/O about/O Sanders/B-ENTITY &/O Clinton/B-ENTITY ./O Take/O Cruz/O as/O VP/O ./O Mexican/O votes/O !!!/O
but, as you can see in the results, mentions are tagged O. Is there a way to change that within your code?
Additionally, do you know why Cruz is not considered a named entity in the above example?
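Absent a change inside the tagger itself, one option is a post-processing pass that forces @-mentions into an entity class. This is a sketch over the token/TAG output format shown above; relabel_mentions is an illustrative name, and B-ENTITY is just the label appearing in that example.

```python
# Relabel @-mentions that the tagger left as O. Input is one tagged line
# of token/TAG pairs separated by spaces; rsplit keeps tokens containing
# slashes (e.g. URLs) intact.
def relabel_mentions(tagged_line):
    out = []
    for pair in tagged_line.split():
        token, tag = pair.rsplit('/', 1)
        if token.startswith('@') and tag == 'O':
            tag = 'B-ENTITY'
        out.append('%s/%s' % (token, tag))
    return ' '.join(out)
```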
I am trying to build on a Mac but it's failing with the following:
c++ -DHAVE_CONFIG_H -I. -I. -I.. -Wall -O9 -funroll-all-loops -finline -ffast-math -mieee-fp -c param.cpp -fno-common -DPIC -o .libs/param.lo
In file included from common.h:76,
from param.cpp:28:
./getopt.h:131: error: declaration of C function 'int getopt()' conflicts with
/usr/include/unistd.h:548: error: previous declaration 'int getopt(int, char* const*, const char*)' here
Any ideas on how to solve it?
DL-Cotrain needs labeled and unlabeled files to be prepared. How should these files be prepared? (Or where is the script that prepares them?)
Hi,
I was trying out twitter_nlp, following the readme, and got the following error. Can you show me how to work around it?
.//python/cap/cap_classify: 1: .//python/cap/cap_classify: Syntax error: ")" unexpected
Traceback (most recent call last):
File "python/ner/extractEntities2.py", line 131, in
goodCap = capClassifier.Classify(words) > 0.9
File ".//python/cap/cap_classifier.py", line 33, in Classify
self.capClassifier.stdin.write("%s\n" % self.fe.Extract(' '.join(words)))
IOError: [Errno 32] Broken pipe
The OS is macOS, on a MacBook Air.
$ cat test.1k.txt | python python/ner/extractEntities2.py
/bin/sh: .//python/cap/cap_classify: cannot execute binary file
Traceback (most recent call last):
File "python/ner/extractEntities2.py", line 131, in
goodCap = capClassifier.Classify(words) > 0.9
File ".//python/cap/cap_classifier.py", line 33, in Classify
self.capClassifier.stdin.write("%s\n" % self.fe.Extract(' '.join(words)))
IOError: [Errno 32] Broken pipe
I had encoding issues with my tweets, so I changed line 160 in python/ner/extractEntities.py from
line = tweet.encode('utf-8', "ignore")
to
line = tweet.decode('iso8859-15', 'ignore')
and my problem was fixed.
Finished loading all models. Now reading from mytweets.txt and writing to my_out_tweets.txt
Traceback (most recent call last):
File "python/ner/extractEntities.py", line 158, in <module>
line = tweet.encode('utf-8', "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 15: ordinal not in range(128)
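The traceback is consistent with calling .encode() on a byte string in Python 2: that forces an implicit ASCII decode first, which is what fails at byte 0xef. A Python-3-style sketch of the explicit-decode fix the poster describes (the sample bytes are illustrative, not from their data):

```python
# A raw line read from a file is bytes; decoding it explicitly with a
# known or forgiving codec avoids the implicit ASCII decode that raised
# UnicodeDecodeError. The sample bytes are UTF-8 for "café".
raw = b"caf\xc3\xa9 #earthquake"
line = raw.decode("utf-8", "ignore")
```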
Whenever I try to use the --classify switch, i.e.
cat test.1k.txt | python python/ner/extractEntities2.py --classify --pos --chunk
it throws the following error:
Traceback (most recent call last):
File "python/ner/extractEntities2.py", line 109, in
dictionaries = Dictionaries('%s/data/LabeledLDA_dictionaries3' % (BASE_DIR), dict2index)
File "/home/jawad/Documents/twitter_nlp/hbc/python/Dictionaries.py", line 28, in init
self.dictionaries.sort(lambda a,b: cmp(dict2index[a], dict2index[b]))
File "/home/jawad/Documents/twitter_nlp/hbc/python/Dictionaries.py", line 28, in
self.dictionaries.sort(lambda a,b: cmp(dict2index[a], dict2index[b]))
KeyError: '.DS_Store'
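The KeyError suggests Dictionaries.py builds its list from a directory listing, which on macOS picks up the Finder's .DS_Store file; deleting that file from data/LabeledLDA_dictionaries3 is the quick fix. A hedged sketch of a code-side fix (list_dictionaries is an illustrative name, not the repo's):

```python
import os

# Skip hidden files such as .DS_Store so every remaining name has an
# entry in dict2index before the sort runs.
def list_dictionaries(dict_dir):
    return sorted(d for d in os.listdir(dict_dir) if not d.startswith('.'))
```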
I learned about your tool in a paper and wanted to give your demo a shot, but it is down.
The paper: http://drops.dagstuhl.de/opus/volltexte/2016/6008/pdf/OASIcs-SLATE-2016-3.pdf
I am cleaning up the code so that it can read text from a file, including a tab-separated file with the text in a given column.
Writing to a file will be supported as well.
I will send a pull request for this issue.
How can I use your module on Windows as part of my research work?
Hello,
I've tried running extractEntities.py
with Python 3 and I get the following error:
Traceback (most recent call last):
File "twitter_nlp/python/ner/extractEntities.py", line 28, in <module>
import Features
File "twitter_nlp/python/ner/Features.py", line 12, in <module>
if os.environ.has_key('TWITTER_NLP'):
AttributeError: '_Environ' object has no attribute 'has_key'
Is the package compatible with Python 3?
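The traceback points at os.environ.has_key(), which was removed in Python 3; the `in` operator works under both 2 and 3. A minimal sketch of that one change to Features.py (other Python-2-only idioms may well remain elsewhere in the package):

```python
import os

# Python 2/3 compatible membership test; has_key() no longer exists in 3.
if 'TWITTER_NLP' in os.environ:
    base_dir = os.environ['TWITTER_NLP']
else:
    base_dir = '.'
```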
Hi,
I noticed that sequences of punctuation get no POS tag when using this command:
cat myTweets.txt | python python/ner/extractEntities2.py --classify --pos
For example: zombies/O/NNS ..../O/ smh/O/UH
I took a quick look inside the code, but I am not a Python expert; maybe you guard against this somehow.
Thanks,
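As a stopgap, the missing field can be patched after the fact. This sketch assumes the token/NER/POS triple format shown above and uses ':' (the Penn Treebank tag for ellipses and similar punctuation) as the default; patch_empty_pos is an illustrative name, not part of the repo.

```python
# Fill in an empty trailing POS field, e.g. "..../O/" -> "..../O/:".
# Triples with a POS tag already present are left untouched.
def patch_empty_pos(tagged_line, default=':'):
    out = []
    for triple in tagged_line.split():
        if triple.endswith('/'):
            triple += default
        out.append(triple)
    return ' '.join(out)
```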
Hello all,
I noticed that in line 37 of python/pos_tagger_stdin.py, the current POSFeatureExtractor constructor call is
self.fe = features.POSFeatureExtractor(_TOKEN2POS_MAPS, _BIGRAM, _TOKEN_MAPS, _CLUSTERS)
while the constructor signature (line 14 of python/pos_tag/features.py) is
def __init__(self, token2pos_dir, token_dir, bigram_dir=None, cluster_fp=None):
Is the positioning of _BIGRAM and _TOKEN_MAPS incorrectly reversed? Thank you.
Best wishes,
Yiye Ruan
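If the signature is as quoted, the call does look swapped. Pending an upstream fix, passing the two optional arguments by keyword sidesteps the question entirely; the class below is a stub mirroring the quoted signature, not the real extractor.

```python
# Stub with the same signature as POSFeatureExtractor.__init__ quoted above.
class POSFeatureExtractor(object):
    def __init__(self, token2pos_dir, token_dir, bigram_dir=None, cluster_fp=None):
        self.token2pos_dir = token2pos_dir
        self.token_dir = token_dir
        self.bigram_dir = bigram_dir
        self.cluster_fp = cluster_fp

# Keyword arguments pin each value to the intended parameter, so a
# positional swap at the call site cannot silently misroute them.
fe = POSFeatureExtractor('token2pos_maps', 'token_maps',
                         bigram_dir='bigram', cluster_fp='clusters')
```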
When I execute the command './build.sh' on CentOS, the final message is:
/usr/bin/ld: cannot find -lm
collect2: ld return 1
My gcc version:
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
In extractEntities2.py
We have the following
reader = codecs.getreader("utf-8")(sys.stdin)
line = reader.readline().strip()
Can we change that to
tweet = sys.stdin.readline().strip()
line = tweet.encode('utf-8')
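Probably not as written: in Python 2, calling .encode('utf-8') on a raw byte line triggers an implicit ASCII decode and fails on non-ASCII tweets, which is exactly what the codecs reader avoids by decoding explicitly. A small sketch of what the original two lines do (io.BytesIO stands in for sys.stdin):

```python
import codecs
import io

# BytesIO stands in for sys.stdin here; the codecs reader wraps the byte
# stream and yields decoded text, so downstream code always sees unicode.
fake_stdin = io.BytesIO(b"caf\xc3\xa9 tweet\n")
reader = codecs.getreader("utf-8")(fake_stdin)
line = reader.readline().strip()
```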
import subprocess

def GetExtracter():
return subprocess.Popen("python extractEntities2.py --classify",
close_fds=True,
shell=True,
stdin=subprocess.PIPE,
stdout=subprocess.PIPE)
wrapper=GetExtracter()
wrapper.stdin.write('This is a test\n')
#wrapper.stdin.write(<give tweet more than 72 characters>) ->This will work
wrapper.stdin.flush()
nlp_string=wrapper.stdout.readline().rstrip('\n').strip(' ')
print nlp_string
The above piece of code blocks on readline() permanently.
However, giving it a bigger tweet (more than 72 characters) triggers the NLP wrapper.
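This hang pattern (short writes block, longer ones get through) is consistent with stdio block buffering in the child process: the parent's readline() waits until the child flushes its stdout. Running the child Python unbuffered (`python -u`) plus flushing after every write is a common fix. In this sketch a tiny echo child stands in for "python extractEntities2.py --classify", which is an assumption about where the buffering happens rather than a verified diagnosis.

```python
import subprocess
import sys

# Echo child standing in for the entity extractor; -u disables stdio
# buffering so each line the child writes reaches the pipe immediately,
# and the parent's readline() returns without waiting for a full buffer.
child = "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line)"
proc = subprocess.Popen([sys.executable, "-u", "-c", child],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
proc.stdin.write(b"This is a test\n")
proc.stdin.flush()
reply = proc.stdout.readline()
proc.stdin.close()
proc.wait()
```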
Where can I find the list of BIO tags available in twitter_nlp? And has anyone tried porting the code from Python to other platforms such as Java or C++?
Hi
I was trying to run the command below:
Dinakar$ cat test.1k.txt | python python/ner/extractEntities2.py
and got this error:
No Java runtime present, requesting install.
So I installed it.
But after it finished installing, I got the error below:
/bin/sh: .//python/cap/cap_classify: No such file or directory
Traceback (most recent call last):
File "python/ner/extractEntities2.py", line 131, in
goodCap = capClassifier.Classify(words) > 0.9
File ".//python/cap/cap_classifier.py", line 33, in Classify
self.capClassifier.stdin.write("%s\n" % self.fe.Extract(' '.join(words)))
IOError: [Errno 32] Broken pipe
Referring to your earlier post, I did try running bash.sh but got an error:
http://stackoverflow.com/questions/29491692/install-twitternlp-by-allen-ritter-on-mac-os
Can you please help me as soon as possible?
The shebang line in twokenize_wrapper.py
is hard-coded and won't work on any machine other than Alan Ritter's. Please replace it with:
#!/usr/bin/env python
Are there any resources/directions for training this model on different data?
Here is my input file test101:
usgs reports a m0.46 #earthquake 13km nw of jodhpur city, rajasthan on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
usgs reports a m0.46 #earthquake 13km nw of virgnia city, nevada on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
usgs reports a m0.46 #earthquake 13km nw of jodhpur, rajasthan on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
usgs reports a m0.46 #earthquake 13km nw of virgnia, nevada on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
I live in Jodhpur.
Virgnia is classified as JJ (adjective) or B-person, and similarly Jodhpur, whereas both should be classified as B-geo-loc.
$ cat test101 | python python/ner/extractEntities2.py --classify --pos --event
usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O jodhpur/O/JJ/O city/O/NN/O ,/O/,/O rajasthan/O/VBN/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O
usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O virgnia/O/JJ/O city/O/NN/O ,/O/,/O nevada/B-geo-loc/NNP/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O
usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O jodhpur/O/NN/O ,/O/,/O rajasthan/O/VBN/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O
usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O virgnia/B-person/NNP/O ,/O/,/O nevada/B-geo-loc/NNP/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O
I/O/PRP/O live/O/VBP/B-EVENT in/O/IN/O Jodhpur/B-person/NNP/O ./O/./O
Dear Prof. Ritter,
Hi, I was wondering where we can get the additional 425 tweets which acted as additional dev data in W-NUT 2016, according to the result report.
Thank you very much.
Best regards,
Bill