noahcarnahan / plagcomps
When testing features that don't use POS tags, don't initialize the tags, since tagging takes forever. I believe @zachwooddoughty is revamping featureextraction.py to just store a single dictionary anyway, so that should fix this issue.
When running average_word_length, it still requires the instance variable self.average_word_length_initialized, referenced here:
https://github.com/NoahCarnahan/plagcomps/blob/master/intrinsic/featureextraction.py#L265
I'm putting the instance variable back in for the moment (until we talk more about how to use avg(word), or however the function name will work). Does this make sense, @zachwooddoughty, or am I missing a silly fix?
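For reference, a minimal sketch of the lazy-init guard I mean (the class and attribute names other than average_word_length_initialized are hypothetical stand-ins, not the real FeatureExtractor):

```python
class FeatureExtractorSketch:
    """Hypothetical stand-in showing only the init guard for the feature."""

    def __init__(self, words):
        self.words = words
        self.average_word_length_initialized = False

    def average_word_length(self):
        # Compute once, cache the result, and reuse it on later calls.
        if not self.average_word_length_initialized:
            self._avg = sum(len(w) for w in self.words) / float(len(self.words))
            self.average_word_length_initialized = True
        return self._avg
```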
I was working on the frontend and the indices of plagiarized spans appeared to be slightly off (by a few characters). It looks like we're reading a '\xef\xbb\xbf' at the start of the files in our corpus, shifting every index by 3 characters. Looking at our new favorite plagiarism in intrinsic (i.e. part1/suspicious-document01078.txt) and its .xml file, we can see that its first case of plagiarism starts at 1396 and runs for 272 characters. However, when reading that file:
>>> f = file('suspicious-document01078.txt', 'r')
>>> text = f.read()
>>> f.close()
>>> text[1396: 1396 + 272]
't. Soon after\nthe last livre was spent, De la Salle had occasion to make a journey\nin connection with his work. He went on foot, as needs he must, and\nbegged his way. An old woman gave him a piece of black bread; he\nate it with joy, feeling that now he was indeed a poor m'
>>> text[1396 + 3: 1396 + 272 + 3]
'Soon after\nthe last livre was spent, De la Salle had occasion to make a journey\nin connection with his work. He went on foot, as needs he must, and\nbegged his way. An old woman gave him a piece of black bread; he\nate it with joy, feeling that now he was indeed a poor man.'
I haven't checked many yet (only a few), but this looks like a problem across the whole corpus. It appears to be an encoding issue (a UTF-8 byte order mark):
http://stackoverflow.com/questions/12561063/python-extract-data-from-file/12561163#12561163
http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
but it's something we need to deal with!
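One possible fix, sketched below (the function name is mine, not something already in the repo): read the file as bytes and drop the BOM before decoding, so all span indices line up with the .xml offsets.

```python
import codecs


def read_corpus_file(path):
    """Read a corpus file, dropping a leading UTF-8 BOM if present."""
    with open(path, 'rb') as f:
        raw = f.read()
    # codecs.BOM_UTF8 is the three-byte sequence b'\xef\xbb\xbf'
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode('utf-8')
```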
Two kinds of edge cases here, but _get_anchor_fingerprint doesn't catch anchors that include any punctuation. For example:
# self.anchors is ['ul', 'ay', 'oo', 'yo', 'si', 'ca', 'am', 'ie', 'mo', 'rt']
# self.anchors includes the string 'oo', so 'good' should be part of a fingerprint
# self.anchors includes the string 'ul', so should 'should' be part of a fingerprint?
# right now it's not since we're putting the anchor in the middle of the n-gram
extractor = FingerprintExtractor()
extractor.get_fingerprint('good, should be caught', 3, "anchor")
# returns []
https://github.com/NoahCarnahan/plagcomps/blob/master/extrinsic/fingerprint_extraction.py#L108
Perhaps we should just parse the input document into words, then check for each anchor in each word (rather than using one big regex)? Then we wouldn't miss the punctuation (i.e. 'good,' would be used as part of an anchor), and it might be clearer.
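A minimal sketch of that word-by-word alternative (the function name is hypothetical, not from fingerprint_extraction.py): split on whitespace so punctuation stays attached to the word, then test each anchor as a substring.

```python
def words_containing_anchors(text, anchors):
    """Return words (punctuation attached) that contain any anchor substring."""
    matches = []
    for word in text.split():
        for anchor in anchors:
            if anchor in word:
                matches.append(word)
                break  # one match per word is enough
    return matches
```

With anchors ['oo', 'ul'] this would pick up both 'good,' (via 'oo') and 'should' (via 'ul'), which the current regex misses.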
Using the first file in the training set:
part1/suspicious-document00536.txt
we get paragraphs like "to be legal under the terms of the laws against monopoly; the bill amending" etc., since we're defining a "paragraph" as anything ending in a newline; see:
https://github.com/NoahCarnahan/plagcomps/blob/master/feature_extractor.py#L47
which doesn't really work with our corpus. We should look into this more, but defining paragraph boundaries by empty lines might be a good start; the training file mentioned seems to delimit paragraphs that way.
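The empty-line version could be as simple as the sketch below (function name is mine, not from feature_extractor.py): split on runs of blank lines instead of every newline.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs delimited by one or more blank lines."""
    return [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
```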
If you run the evaluate_n_documents(['get_avg_word_frequency_class'], "kmeans", 2, "paragraph", 5) function from test_framework.py, sklearn crashes. If you run the same function on 6 documents instead of 5, everything works fine. I haven't investigated this issue yet.
In my efforts to restructure things, I was looking into how the getPosPercentageVector feature works. I don't think it's doing quite what we think it is.
The problem is that getPosPercentageVector uses the first_word_index and last_word_index arguments (which are indices into self.word_spans) as indices into self.pos_frequency_count_table (line 591). However, self.pos_frequency_count_table is not the same length as self.word_spans: when self.pos_frequency_count_table is built, punctuation marks are treated as their own words, but this is not the case when self.word_spans is created.
If you run feature_extractor.py in the new pos_percentage_problem branch I just created, you can see how this problem manifests itself: self.word_spans is shorter than self.pos_frequency_count_table, and consequently, when getPosPercentageVector is called on the last sentence of the document, it never reaches the last items in self.pos_frequency_count_table.
I haven't thought about the best way to fix this. Suggestions?
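A toy illustration of the drift (made-up token lists, not the real tables): the POS table counts punctuation as separate tokens, while the word spans keep punctuation attached to the preceding word, so the same index points at different words.

```python
sentence = "He went on foot, as needs he must."

# Word-span-style tokens: punctuation stays attached ('foot,' / 'must.')
word_tokens = sentence.split()  # 8 items

# POS-table-style tokens: punctuation split off as its own token
pos_tokens = ["He", "went", "on", "foot", ",",
              "as", "needs", "he", "must", "."]  # 10 items

# Index 7 is 'must.' in word_tokens but only 'he' in pos_tokens,
# so indexing pos_tokens with word-span indices drifts off the end.
```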
We use NLTK's word_tokenize instead of our own tokenization.tokenize before POS tagging. Here:
https://github.com/NoahCarnahan/plagcomps/blob/master/intrinsic/featureextraction.py#L67
Is it kosher for me to import scipy here? Do we all have it?
I'm using it to run the t-tests for the independent feature evaluation, but I wasn't sure if it was going to break things for you. If you don't have it and don't want it, I could just move the import statement into my function so it's only executed if you need it.
When running the controller on the first file in our training set, we're missing punctuation. The file:
/copyCats/pan-plagiarism-corpus-2009/intrinsic-detection-corpus/suspicious-documents/part1/suspicious-document00536.txt
specifically misses the ";" in the following:
"have already been acted upon by the House of Representatives; the bill"
and a number of others.
Clustering gets confused when given a vector of all zeros (e.g. when a document has no POS trigrams of VB,NN,VB). I'm not sure what the most logical fix is for this.
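One simple option, sketched below (hypothetical helper, not in the repo): drop the all-zero vectors before clustering and remember their positions so the cluster labels can be mapped back afterwards.

```python
def filter_zero_vectors(vectors):
    """Drop all-zero feature vectors, remembering the surviving indices."""
    kept, kept_indices = [], []
    for i, v in enumerate(vectors):
        if any(v):  # at least one nonzero component
            kept.append(v)
            kept_indices.append(i)
    return kept, kept_indices
```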
I've put in some (mostly skeleton) code for setting up the analyzer to run intrinsic followed by extrinsic. Right now, the problem is that Noah's run_intrinsic code returns a list of tuples containing the spans of (intrinsically) suspicious passages. As far as I can tell, there's no easy way to turn to the extrinsic world and ask for the extrinsic suspicion of a particular passage. It also isn't clear that the best way to ask the database for extrinsic suspicion (the confidence we have that a passage is plagiarized from given source documents) is through the span index values.
Thoughts?
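One hypothetical way to bridge the two worlds by span index values anyway (all names here are mine, not the actual database API): treat stored extrinsic passages as (start, end) records and query by character-span overlap.

```python
def overlapping_passages(span, extrinsic_passages):
    """Return stored extrinsic passages whose character spans overlap `span`.

    `extrinsic_passages` is assumed to be a list of dicts with
    'start', 'end', and 'confidence' keys (a stand-in for DB rows).
    """
    start, end = span
    return [p for p in extrinsic_passages
            if p['start'] < end and start < p['end']]
```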
Changing
https://github.com/NoahCarnahan/plagcomps/blob/master/intrinsic/featureextraction.py#L527
to:
f = FeatureExtractor("The brown fox ate. I go to the school? \n\n Believe it. I go.")
causes an IndexError. I first came across this problem when trying to run syntactic_complexity on test1.txt in our sample_corpus directory, but the text above similarly gives an index-out-of-range error.
We talked about this today, but populating the evolved features seems to take a disproportionate amount of time, like an hour for 50 docs.
evolved_feature_three (as an example) has an _init method that calls self.get_feature_vectors() on the necessary features and then saves that information to self.features["evolved_feature_three"]. Each query of evolved_feature_three grabs the necessary values from the self.features hash and does a simple addition/subtraction combination of them. The code is here:
https://github.com/NoahCarnahan/plagcomps/blob/master/intrinsic/featureextraction.py#L614
Why does this take so long?
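Before guessing, it might be worth profiling the _init call. A small stdlib-only helper (the function name is mine) that wraps any call in cProfile and returns the top cumulative-time entries:

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile; return (result, top-5 cumulative-time report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(5)
    return result, buf.getvalue()
```

Wrapping the evolved-feature _init in this should show whether the time is in get_feature_vectors itself or in the combination step.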