aritter / twitter_nlp
Twitter NLP Tools
License: GNU General Public License v3.0
Hi, in twokenize_wrapper.py there is a little typo that prevents I'll, you'll, etc. from being split.
Line 37 reads: new_tok = token[:-3]
The variable used elsewhere is 'new_tk' (no 'o'), so new_tok is never used again.
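To make the report concrete, here is a hedged reconstruction of what line 37 presumably intends; the function name and surrounding logic are illustrative, not the repo's exact code. The point is that the truncated token must be read back under the same name it was assigned to.

```python
# Hypothetical sketch of the contraction split around line 37 of
# twokenize_wrapper.py. The reported bug: the code assigns to new_tok
# but later reads new_tk, so the truncation never takes effect.
# Using one name consistently restores the split.
def split_ll_contraction(token):
    if token.lower().endswith("'ll"):
        new_tok = token[:-3]           # "I'll" -> "I"
        return [new_tok, token[-3:]]   # -> ["I", "'ll"]
    return [token]
```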
After cloning, running the build, and then running the command
export TWITTER_NLP=./ & cat 'data.jsonl' | python2 python/ner/extractEntities2_json.py --pos --chunk > 'data_tagging.jsonl'
I get the error:
Traceback (most recent call last):
File "python/ner/extractEntities2_json.py", line 26, in <module>
import twokenize
ImportError: No module named twokenize
cat: write error: Broken pipe
With python2, of course.
Does anyone know how to fix this?
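One plausible workaround, assuming twokenize.py lives under the repo's python/ directory (an assumption this sketch does not verify): make sure TWITTER_NLP points at the repo root and that the tokenizer's directory is on sys.path before the import runs.

```python
import os
import sys

# Prepend the tokenizer's directory to sys.path so `import twokenize`
# can resolve. TWITTER_NLP should be the repo root; the python/ subpath
# is an assumption about where twokenize.py lives.
base = os.environ.get('TWITTER_NLP', '.')
sys.path.insert(0, os.path.join(base, 'python'))
```

Also note that `export TWITTER_NLP=./ &` backgrounds the export in a subshell, so the variable may never reach the pipeline's environment; running the export on its own line first is safer.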
When I run the named entity tagger twice with --classify, the outputs are quite different: out of 250 sentences, 50 are tagged differently. Is something wrong with my build, or is this normal behaviour?
This makes it very hard to use the tagger as a preprocessing step and to reproduce results.
Hello, is it possible to use this tool on macOS, or does it run only on Linux?
I read that Linux is a requirement, but I tried anyway and got an error at compile time:
./getopt.h:131:12: error: conflicting types for 'getopt'
extern int getopt ();
^
/usr/include/unistd.h:503:6: note: previous declaration is here
int getopt(int, char * const [], const char *) __DARWIN_ALIAS(getopt);
^
param.cpp:217:17: warning: conversion from string literal to 'char *' is
deprecated [-Wc++11-compat-deprecated-writable-strings]
char *tmp = "TinySVM::Param::set";
^
1 warning and 1 error generated.
make[2]: *** [param.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive-am] Error 2
ld: library not found for -lcrt0.o
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Is this related to macOS?
I assume mentions should be considered named entities:
@realDonaldTrump/O Thank/O you/O for/O saying/O you/O won't/O use/O vulger/O language/O anymore/O ./O Talk/O about/O Sanders/B-ENTITY &/O Clinton/B-ENTITY ./O Take/O Cruz/O as/O VP/O ./O Mexican/O votes/O !!!/O
but, as you can see in the results, mentions are tagged O. Is there a way to change that within your code?
Additionally, do you know why Cruz is not considered a named entity in the above example?
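Absent a change inside the tagger itself, one option is a post-processing pass that forces @-mentions into an entity class. This is a sketch over the token/TAG output format shown above; relabel_mentions is an illustrative name, and B-ENTITY is just the label appearing in that example.

```python
# Relabel @-mentions that the tagger left as O. Input is one tagged line
# of token/TAG pairs separated by spaces; rsplit keeps tokens containing
# slashes (e.g. URLs) intact.
def relabel_mentions(tagged_line):
    out = []
    for pair in tagged_line.split():
        token, tag = pair.rsplit('/', 1)
        if token.startswith('@') and tag == 'O':
            tag = 'B-ENTITY'
        out.append('%s/%s' % (token, tag))
    return ' '.join(out)
```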
I am trying to build on a Mac but it's failing with the following:
c++ -DHAVE_CONFIG_H -I. -I. -I.. -Wall -O9 -funroll-all-loops -finline -ffast-math -mieee-fp -c param.cpp -fno-common -DPIC -o .libs/param.lo
In file included from common.h:76,
from param.cpp:28:
./getopt.h:131: error: declaration of C function 'int getopt()' conflicts with
/usr/include/unistd.h:548: error: previous declaration 'int getopt(int, char* const*, const char*)' here
Any ideas on how to solve it?
DL-Cotrain needs labeled and unlabeled files to be prepared. How should these files be prepared? (Or where is the script that prepares them?)
Hi,
I was trying out twitter_nlp, following the readme, and got the following error. Can you show me how to work around it?
.//python/cap/cap_classify: 1: .//python/cap/cap_classify: Syntax error: ")" unexpected
Traceback (most recent call last):
File "python/ner/extractEntities2.py", line 131, in
goodCap = capClassifier.Classify(words) > 0.9
File ".//python/cap/cap_classifier.py", line 33, in Classify
self.capClassifier.stdin.write("%s\n" % self.fe.Extract(' '.join(words)))
IOError: [Errno 32] Broken pipe
The OS is macOS, on a MacBook Air.
$ cat test.1k.txt | python python/ner/extractEntities2.py
/bin/sh: .//python/cap/cap_classify: cannot execute binary file
Traceback (most recent call last):
File "python/ner/extractEntities2.py", line 131, in
goodCap = capClassifier.Classify(words) > 0.9
File ".//python/cap/cap_classifier.py", line 33, in Classify
self.capClassifier.stdin.write("%s\n" % self.fe.Extract(' '.join(words)))
IOError: [Errno 32] Broken pipe
I had encoding issues with my tweets, so I changed line 160 in python/ner/extractEntities.py from
line = tweet.encode('utf-8', "ignore")
to
line = tweet.decode('iso8859-15', 'ignore')
and my problem was fixed.
Finished loading all models. Now reading from mytweets.txt and writing to my_out_tweets.txt
Traceback (most recent call last):
File "python/ner/extractEntities.py", line 158, in <module>
line = tweet.encode('utf-8', "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 15: ordinal not in range(128)
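The traceback is consistent with calling .encode() on a byte string in Python 2: that forces an implicit ASCII decode first, which is what fails at byte 0xef. A Python-3-style sketch of the explicit-decode fix the poster describes (the sample bytes are illustrative, not from their data):

```python
# A raw line read from a file is bytes; decoding it explicitly with a
# known or forgiving codec avoids the implicit ASCII decode that raised
# UnicodeDecodeError. The sample bytes are UTF-8 for "café".
raw = b"caf\xc3\xa9 #earthquake"
line = raw.decode("utf-8", "ignore")
```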
Whenever I try to use the --classify switch, i.e.
cat test.1k.txt | python python/ner/extractEntities2.py --classify --pos --chunk
it throws the following error:
Traceback (most recent call last):
File "python/ner/extractEntities2.py", line 109, in
dictionaries = Dictionaries('%s/data/LabeledLDA_dictionaries3' % (BASE_DIR), dict2index)
File "/home/jawad/Documents/twitter_nlp/hbc/python/Dictionaries.py", line 28, in init
self.dictionaries.sort(lambda a,b: cmp(dict2index[a], dict2index[b]))
File "/home/jawad/Documents/twitter_nlp/hbc/python/Dictionaries.py", line 28, in
self.dictionaries.sort(lambda a,b: cmp(dict2index[a], dict2index[b]))
KeyError: '.DS_Store'
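The KeyError suggests Dictionaries.py builds its list from a directory listing, which on macOS picks up the Finder's .DS_Store file; deleting that file from data/LabeledLDA_dictionaries3 is the quick fix. A hedged sketch of a code-side fix (list_dictionaries is an illustrative name, not the repo's):

```python
import os

# Skip hidden files such as .DS_Store so every remaining name has an
# entry in dict2index before the sort runs.
def list_dictionaries(dict_dir):
    return sorted(d for d in os.listdir(dict_dir) if not d.startswith('.'))
```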
I learned about your tool in a paper and wanted to give your demo a shot, but it is down.
The paper: http://drops.dagstuhl.de/opus/volltexte/2016/6008/pdf/OASIcs-SLATE-2016-3.pdf
I am cleaning up the code so that it can read text from a file, including a tab-separated file with the text in a given column.
Writing to a file will be supported as well.
I will send a pull request for this issue.
How can I use your module on Windows as part of my research work?
Hello,
I've tried running extractEntities.py
with Python 3 and I get the following error:
Traceback (most recent call last):
File "twitter_nlp/python/ner/extractEntities.py", line 28, in <module>
import Features
File "twitter_nlp/python/ner/Features.py", line 12, in <module>
if os.environ.has_key('TWITTER_NLP'):
AttributeError: '_Environ' object has no attribute 'has_key'
Is the package compatible with Python 3?
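The traceback points at os.environ.has_key(), which was removed in Python 3; the `in` operator works under both 2 and 3. A minimal sketch of that one change to Features.py (other Python-2-only idioms may well remain elsewhere in the package):

```python
import os

# Python 2/3 compatible membership test; has_key() no longer exists in 3.
if 'TWITTER_NLP' in os.environ:
    base_dir = os.environ['TWITTER_NLP']
else:
    base_dir = '.'
```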
Hi,
I noticed that sequences of punctuation get no POS tag when using this command:
cat myTweets.txt | python python/ner/extractEntities2.py --classify --pos
For example: zombies/O/NNS ..../O/ smh/O/UH
I took a quick look inside the code, but I am not a Python expert; maybe you guard against this somehow.
Thanks,
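As a stopgap, the missing field can be patched after the fact. This sketch assumes the token/NER/POS triple format shown above and uses ':' (the Penn Treebank tag for ellipses and similar punctuation) as the default; patch_empty_pos is an illustrative name, not part of the repo.

```python
# Fill in an empty trailing POS field, e.g. "..../O/" -> "..../O/:".
# Triples with a POS tag already present are left untouched.
def patch_empty_pos(tagged_line, default=':'):
    out = []
    for triple in tagged_line.split():
        if triple.endswith('/'):
            triple += default
        out.append(triple)
    return ' '.join(out)
```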
Hello all,
I noticed that in line 37 of python/pos_tagger_stdin.py, the current POSFeatureExtractor constructor call is
self.fe = features.POSFeatureExtractor(_TOKEN2POS_MAPS, _BIGRAM, _TOKEN_MAPS, _CLUSTERS)
while the constructor signature (line 14 of python/pos_tag/features.py) is
def __init__(self, token2pos_dir, token_dir, bigram_dir=None, cluster_fp=None):
Is the positioning of _BIGRAM and _TOKEN_MAPS incorrectly reversed? Thank you.
Best wishes,
Yiye Ruan
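If the signature is as quoted, the call does look swapped. Pending an upstream fix, passing the two optional arguments by keyword sidesteps the question entirely; the class below is a stub mirroring the quoted signature, not the real extractor.

```python
# Stub with the same signature as POSFeatureExtractor.__init__ quoted above.
class POSFeatureExtractor(object):
    def __init__(self, token2pos_dir, token_dir, bigram_dir=None, cluster_fp=None):
        self.token2pos_dir = token2pos_dir
        self.token_dir = token_dir
        self.bigram_dir = bigram_dir
        self.cluster_fp = cluster_fp

# Keyword arguments pin each value to the intended parameter, so a
# positional swap at the call site cannot silently misroute them.
fe = POSFeatureExtractor('token2pos_maps', 'token_maps',
                         bigram_dir='bigram', cluster_fp='clusters')
```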
When I execute the command './build.sh' on CentOS, the final message is:
/usr/bin/ld: cannot find -lm
collect2: ld return 1
My gcc version:
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
In extractEntities2.py
We have the following
reader = codecs.getreader("utf-8")(sys.stdin)
line = reader.readline().strip()
Can we change that to
tweet = sys.stdin.readline().strip()
line = tweet.encode('utf-8')
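Probably not as written: in Python 2, calling .encode('utf-8') on a raw byte line triggers an implicit ASCII decode and fails on non-ASCII tweets, which is exactly what the codecs reader avoids by decoding explicitly. A small sketch of what the original two lines do (io.BytesIO stands in for sys.stdin):

```python
import codecs
import io

# BytesIO stands in for sys.stdin here; the codecs reader wraps the byte
# stream and yields decoded text, so downstream code always sees unicode.
fake_stdin = io.BytesIO(b"caf\xc3\xa9 tweet\n")
reader = codecs.getreader("utf-8")(fake_stdin)
line = reader.readline().strip()
```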
import subprocess

def GetExtracter():
return subprocess.Popen("python extractEntities2.py --classify",
close_fds=True,
shell=True,
stdin=subprocess.PIPE,
stdout=subprocess.PIPE)
wrapper=GetExtracter()
wrapper.stdin.write('This is a test\n')
#wrapper.stdin.write(<give tweet more than 72 characters>) ->This will work
wrapper.stdin.flush()
nlp_string=wrapper.stdout.readline().rstrip('\n').strip(' ')
print nlp_string
The above piece of code blocks on readline() permanently.
However, giving it a bigger tweet (more than 72 characters) triggers the NLP wrapper.
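This hang pattern (short writes block, longer ones get through) is consistent with stdio block buffering in the child process: the parent's readline() waits until the child flushes its stdout. Running the child Python unbuffered (`python -u`) plus flushing after every write is a common fix. In this sketch a tiny echo child stands in for "python extractEntities2.py --classify", which is an assumption about where the buffering happens rather than a verified diagnosis.

```python
import subprocess
import sys

# Echo child standing in for the entity extractor; -u disables stdio
# buffering so each line the child writes reaches the pipe immediately,
# and the parent's readline() returns without waiting for a full buffer.
child = "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line)"
proc = subprocess.Popen([sys.executable, "-u", "-c", child],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
proc.stdin.write(b"This is a test\n")
proc.stdin.flush()
reply = proc.stdout.readline()
proc.stdin.close()
proc.wait()
```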
Where can I find the list of BIO tags available in twitter_nlp? And has anyone tried porting the code from Python to other platforms such as Java or C++?
Hi
I was trying to run the command below:
Dinakar$ cat test.1k.txt | python python/ner/extractEntities2.py
and got this error:
No Java runtime present, requesting install.
So I installed it.
But after it finished installing, I got the error below:
/bin/sh: .//python/cap/cap_classify: No such file or directory
Traceback (most recent call last):
File "python/ner/extractEntities2.py", line 131, in
goodCap = capClassifier.Classify(words) > 0.9
File ".//python/cap/cap_classifier.py", line 33, in Classify
self.capClassifier.stdin.write("%s\n" % self.fe.Extract(' '.join(words)))
IOError: [Errno 32] Broken pipe
Referring to your earlier post, I did try running bash.sh but got an error:
http://stackoverflow.com/questions/29491692/install-twitternlp-by-allen-ritter-on-mac-os
Can you please help me as soon as possible?
The shebang line in twokenize_wrapper.py
is hard-coded and won't work on any machine other than Alan Ritter's. Please replace it with:
#!/usr/bin/env python
Are there any resources/directions for training this model on different data?
Here is my input file test101:
usgs reports a m0.46 #earthquake 13km nw of jodhpur city, rajasthan on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
usgs reports a m0.46 #earthquake 13km nw of virgnia city, nevada on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
usgs reports a m0.46 #earthquake 13km nw of jodhpur, rajasthan on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
usgs reports a m0.46 #earthquake 13km nw of virgnia, nevada on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake
I live in Jodhpur.
Virgnia is classified as JJ (adjective) or B-person, and similarly Jodhpur, whereas both should be classified as B-geo-loc.
$ cat test101 | python python/ner/extractEntities2.py --classify --pos --event
usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O jodhpur/O/JJ/O city/O/NN/O ,/O/,/O rajasthan/O/VBN/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O
usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O virgnia/O/JJ/O city/O/NN/O ,/O/,/O nevada/B-geo-loc/NNP/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O
usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O jodhpur/O/NN/O ,/O/,/O rajasthan/O/VBN/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O
usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O virgnia/B-person/NNP/O ,/O/,/O nevada/B-geo-loc/NNP/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O
I/O/PRP/O live/O/VBP/B-EVENT in/O/IN/O Jodhpur/B-person/NNP/O ./O/./O
Dear Prof. Ritter,
Hi, I was wondering where we can get the additional 425 tweets which acted as additional dev data in W-NUT 2016, according to the result report.
Thank you very much.
Best regards,
Bill