noahcarnahan / plagcomps
When testing features that don't use POS tags, don't initialize the tags, since tagging takes forever. I believe @zachwooddoughty is revamping featureextraction.py to just store a single dictionary anyway, so that should fix this issue.
When running average_word_length, it still requires the instance variable self.average_word_length_initialized, referenced here:
https://github.com/NoahCarnahan/plagcomps/blob/master/intrinsic/featureextraction.py#L265
I'm putting the instance variable back in for the moment (until we talk more about how to use avg(word), or however the function name will work). Does this make sense, @zachwooddoughty, or am I missing a silly fix?
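For reference, a minimal sketch of the lazy-init guard I mean (the class and attribute names other than average_word_length_initialized are hypothetical stand-ins, not the real FeatureExtractor):

```python
class FeatureExtractorSketch:
    """Hypothetical stand-in showing only the init guard for the feature."""

    def __init__(self, words):
        self.words = words
        self.average_word_length_initialized = False

    def average_word_length(self):
        # Compute once, cache the result, and reuse it on later calls.
        if not self.average_word_length_initialized:
            self._avg = sum(len(w) for w in self.words) / float(len(self.words))
            self.average_word_length_initialized = True
        return self._avg
```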
I was working on the frontend and the indices of plagiarized spans appeared to be slightly off (by a few characters). It looks like we're reading a '\xef\xbb\xbf' at the start of the files in our corpus, shifting every index by 3 characters. Looking at our new favorite plagiarism in intrinsic (i.e. part1/suspicious-document01078.txt) and its .xml file, we can see that its first case of plagiarism starts at 1396 and runs for 272 characters. However, when reading that file:
>>> f = file('suspicious-document01078.txt', 'r')
>>> text = f.read()
>>> f.close()
>>> text[1396: 1396 + 272]
't. Soon after\nthe last livre was spent, De la Salle had occasion to make a journey\nin connection with his work. He went on foot, as needs he must, and\nbegged his way. An old woman gave him a piece of black bread; he\nate it with joy, feeling that now he was indeed a poor m'
>>> text[1396 + 3: 1396 + 272 + 3]
'Soon after\nthe last livre was spent, De la Salle had occasion to make a journey\nin connection with his work. He went on foot, as needs he must, and\nbegged his way. An old woman gave him a piece of black bread; he\nate it with joy, feeling that now he was indeed a poor man.'
I haven't checked many yet (only a few), but this looks like a problem across the whole corpus. It appears to be an encoding issue (a UTF-8 byte order mark):
http://stackoverflow.com/questions/12561063/python-extract-data-from-file/12561163#12561163
http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
but it's something we need to deal with!
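One possible fix, sketched below (the function name is mine, not something already in the repo): read the file as bytes and drop the BOM before decoding, so all span indices line up with the .xml offsets.

```python
import codecs


def read_corpus_file(path):
    """Read a corpus file, dropping a leading UTF-8 BOM if present."""
    with open(path, 'rb') as f:
        raw = f.read()
    # codecs.BOM_UTF8 is the three-byte sequence b'\xef\xbb\xbf'
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode('utf-8')
```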
Two kinds of edge cases here, but _get_anchor_fingerprint doesn't catch anchors that include any punctuation. For example:
# self.anchors is ['ul', 'ay', 'oo', 'yo', 'si', 'ca', 'am', 'ie', 'mo', 'rt']
# self.anchors includes the string 'oo', so 'good' should be part of a fingerprint
# self.anchors includes the string 'ul', so should 'should' be part of a fingerprint?
# right now it's not since we're putting the anchor in the middle of the n-gram
extractor = FingerprintExtractor()
extractor.get_fingerprint('good, should be caught', 3, "anchor")
# returns []
https://github.com/NoahCarnahan/plagcomps/blob/master/extrinsic/fingerprint_extraction.py#L108
Perhaps we should just parse the input document into words, then check for each anchor in each word (rather than using one big regex)? Then we wouldn't miss the punctuation (i.e. 'good,' would be used as part of an anchor), and it might be clearer.
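A minimal sketch of that word-by-word alternative (the function name is hypothetical, not from fingerprint_extraction.py): split on whitespace so punctuation stays attached to the word, then test each anchor as a substring.

```python
def words_containing_anchors(text, anchors):
    """Return words (punctuation attached) that contain any anchor substring."""
    matches = []
    for word in text.split():
        for anchor in anchors:
            if anchor in word:
                matches.append(word)
                break  # one match per word is enough
    return matches
```

With anchors ['oo', 'ul'] this would pick up both 'good,' (via 'oo') and 'should' (via 'ul'), which the current regex misses.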
Using the first file in the training set:
part1/suspicious-document00536.txt
we get paragraphs like "to be legal under the terms of the laws against monopoly; the bill amending" etc., since we're defining a "paragraph" as anything ending in a newline; see:
https://github.com/NoahCarnahan/plagcomps/blob/master/feature_extractor.py#L47
which doesn't really work with our corpus. We should look into this more, but defining paragraph boundaries by empty lines might be a good start; the training file mentioned seems to delimit paragraphs that way.
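The empty-line version could be as simple as the sketch below (function name is mine, not from feature_extractor.py): split on runs of blank lines instead of every newline.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs delimited by one or more blank lines."""
    return [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
```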
If you run the evaluate_n_documents(['get_avg_word_frequency_class'], "kmeans", 2, "paragraph", 5) function from test_framework.py, sklearn crashes. If you run the same function on 6 documents instead of 5, everything works fine. I haven't investigated this issue yet.
In my efforts to restructure things, I was looking into how the getPosPercentageVector feature works. I don't think it's doing quite what we think it is.
The problem is that getPosPercentageVector uses the first_word_index and last_word_index arguments (which are indices into self.word_spans) as indices into self.pos_frequency_count_table (line 591). However, self.pos_frequency_count_table is not the same length as self.word_spans: when self.pos_frequency_count_table is built, punctuation marks are treated as their own words, but this is not the case when self.word_spans is created.
If you run feature_extractor.py in the new pos_percentage_problem branch I just created, you can see how this problem manifests itself: self.word_spans is shorter than self.pos_frequency_count_table, and consequently, when getPosPercentageVector is called on the last sentence of the document, it never reaches the last items in self.pos_frequency_count_table.
I haven't thought about the best way to fix this. Suggestions?
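A toy illustration of the drift (made-up token lists, not the real tables): the POS table counts punctuation as separate tokens, while the word spans keep punctuation attached to the preceding word, so the same index points at different words.

```python
sentence = "He went on foot, as needs he must."

# Word-span-style tokens: punctuation stays attached ('foot,' / 'must.')
word_tokens = sentence.split()  # 8 items

# POS-table-style tokens: punctuation split off as its own token
pos_tokens = ["He", "went", "on", "foot", ",",
              "as", "needs", "he", "must", "."]  # 10 items

# Index 7 is 'must.' in word_tokens but only 'he' in pos_tokens,
# so indexing pos_tokens with word-span indices drifts off the end.
```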
We use NLTK's word_tokenize instead of our own tokenization.tokenize before POS tagging. Here:
https://github.com/NoahCarnahan/plagcomps/blob/master/intrinsic/featureextraction.py#L67
Is it kosher for me to import scipy here? Do we all have it?
I'm using it to run the t-tests for the independent feature evaluation, but I wasn't sure if it was going to break things for you. If you don't have it and don't want it, I could just move the import statement into my function so it's only executed if you need it.
When running the controller on the first file in our training set, we're missing punctuation. The file:
/copyCats/pan-plagiarism-corpus-2009/intrinsic-detection-corpus/suspicious-documents/part1/suspicious-document00536.txt
specifically misses the ";" in the following:
"have already been acted upon by the House of Representatives; the bill"
and a number of others.
Clustering gets confused when given a vector of all zeros (e.g. when a document has no POS trigrams of VB,NN,VB). I'm not sure what the most logical fix is for this.
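One simple option, sketched below (hypothetical helper, not in the repo): drop the all-zero vectors before clustering and remember their positions so the cluster labels can be mapped back afterwards.

```python
def filter_zero_vectors(vectors):
    """Drop all-zero feature vectors, remembering the surviving indices."""
    kept, kept_indices = [], []
    for i, v in enumerate(vectors):
        if any(v):  # at least one nonzero component
            kept.append(v)
            kept_indices.append(i)
    return kept, kept_indices
```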
I've put in some (mostly skeleton) code for setting up the analyzer to run intrinsic followed by extrinsic. Right now, the problem is that Noah's run_intrinsic code returns a list of tuples containing the spans of (intrinsically) suspicious passages. As far as I can tell, there's no easy way to turn to the extrinsic world and ask for the extrinsic suspicion of a particular passage. It also isn't clear that the best way to ask the database for extrinsic suspicion (the confidence we have that a passage is plagiarized from given source documents) is through the span index values.
Thoughts?
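One hypothetical way to bridge the two worlds by span index values anyway (all names here are mine, not the actual database API): treat stored extrinsic passages as (start, end) records and query by character-span overlap.

```python
def overlapping_passages(span, extrinsic_passages):
    """Return stored extrinsic passages whose character spans overlap `span`.

    `extrinsic_passages` is assumed to be a list of dicts with
    'start', 'end', and 'confidence' keys (a stand-in for DB rows).
    """
    start, end = span
    return [p for p in extrinsic_passages
            if p['start'] < end and start < p['end']]
```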
Changing
https://github.com/NoahCarnahan/plagcomps/blob/master/intrinsic/featureextraction.py#L527
to:
f = FeatureExtractor("The brown fox ate. I go to the school? \n\n Believe it. I go.")
causes an IndexError. I first came across this problem when trying to run syntactic_complexity on test1.txt in our sample_corpus directory, but the text above similarly gives an index-out-of-range error.
We talked about this today, but populating the evolved features seems to take a disproportionate amount of time, like an hour for 50 docs.
evolved_feature_three (as an example) has an _init method that calls self.get_feature_vectors() on the necessary features and then saves that information to self.features["evolved_feature_three"]. Each query of evolved_feature_three grabs the necessary values from the self.features hash and does a simple addition/subtraction combination of them. The code is here:
https://github.com/NoahCarnahan/plagcomps/blob/master/intrinsic/featureextraction.py#L614
Why does this take so long?
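Before guessing, it might be worth profiling the _init call. A small stdlib-only helper (the function name is mine) that wraps any call in cProfile and returns the top cumulative-time entries:

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile; return (result, top-5 cumulative-time report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(5)
    return result, buf.getvalue()
```

Wrapping the evolved-feature _init in this should show whether the time is in get_feature_vectors itself or in the combination step.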