
plagcomps's People

Contributors

huderlem, noahcarnahan, nrjones8, zachwooddoughty

Forkers

nmeuschke

plagcomps's Issues

don't initialize pos tags!

When testing features that don't use POS tags, don't initialize the tags, since computing them takes forever. I believe @zachwooddoughty is revamping featureextraction.py to just store a single dictionary anyway, so that should fix this issue.
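A lazy-initialization sketch of what I mean (class and attribute names here are made up for illustration):

import nltk

class FeatureExtractor(object):
    def __init__(self, text):
        self.text = text
        self._pos_tags = None  # deferred until a POS feature actually needs them

    def get_pos_tags(self):
        # Only pay the tagging cost on first use
        if self._pos_tags is None:
            self._pos_tags = nltk.pos_tag(nltk.word_tokenize(self.text))
        return self._pos_tags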

reading corpus files/indexing problems

I was working on the frontend and the indices of plagiarized spans appeared to be a little bit off (by a few characters). It looks like we're reading in a '\xef\xbb\xbf' at the start of the files in our corpus, which throws the indexing off by 3 characters. Looking at our new favorite plagiarism in intrinsic (i.e. part1/suspicious-document01078.txt) and its .xml file, we can see that its first case of plagiarism starts at offset 1396 and runs for 272 characters. However, when reading that file:

>>> f = file('suspicious-document01078.txt', 'r')
>>> text = f.read()
>>> f.close()
>>> text[1396: 1396 + 272]
't. Soon after\nthe last livre was spent, De la Salle had occasion to make a journey\nin connection with his work. He went on foot, as needs he must, and\nbegged his way. An old woman gave him a piece of black bread; he\nate it with joy, feeling that now he was indeed a poor m'
>>> text[1396 + 3: 1396 + 272 + 3]
'Soon after\nthe last livre was spent, De la Salle had occasion to make a journey\nin connection with his work. He went on foot, as needs he must, and\nbegged his way. An old woman gave him a piece of black bread; he\nate it with joy, feeling that now he was indeed a poor man.'

I've only checked a few files so far, but this looks like a problem across the whole corpus. It appears to be an encoding issue (a UTF-8 byte order mark at the start of each file):

http://stackoverflow.com/questions/12561063/python-extract-data-from-file/12561163#12561163
http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

but it's something we need to deal with!
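One way to handle it (a sketch; 'utf-8-sig' is the stdlib codec that consumes a leading UTF-8 BOM):

import codecs

# 'utf-8-sig' strips the BOM (\xef\xbb\xbf) if present, so the span
# offsets from the .xml files line up with the text again
f = codecs.open('suspicious-document01078.txt', 'r', encoding='utf-8-sig')
text = f.read()
f.close()

print(text[1396:1396 + 272])  # now starts at 'Soon after...'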

anchoring edge cases

Two kinds of edge cases here, but the main one is that _get_anchor_fingerprint doesn't catch anchors in words that have punctuation attached. For example:

# self.anchors is ['ul', 'ay', 'oo', 'yo', 'si', 'ca', 'am', 'ie', 'mo', 'rt']
# self.anchors includes the string 'oo', so 'good' should be part of a fingerprint
# self.anchors includes the string 'ul', so should 'should' be part of a fingerprint? 
# right now it's not since we're putting the anchor in the middle of the n-gram
extractor = FingerprintExtractor()
extractor.get_fingerprint('good, should be caught', 3, "anchor")
# returns []

https://github.com/NoahCarnahan/plagcomps/blob/master/extrinsic/fingerprint_extraction.py#L108

Perhaps we should just parse the input document by words, then check for each anchor in each word (rather than using one big regex)? Then we wouldn't miss words touching punctuation (i.e. we would use 'good,' as part of an anchor), and it might be clearer.
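A rough sketch of the word-based approach (function name made up; whether the anchored word sits at the start or the middle of the n-gram is a separate decision):

def get_anchor_ngrams(text, anchors, n=3):
    # Walk word by word instead of running one big regex, so tokens
    # with punctuation attached (e.g. 'good,') still match their anchors
    words = text.split()
    fingerprint = []
    for i, word in enumerate(words):
        if any(anchor in word.lower() for anchor in anchors):
            fingerprint.append(' '.join(words[i:i + n]))
    return fingerprint

# get_anchor_ngrams('good, should be caught', ['oo', 'ul'])
# -> ['good, should be', 'should be caught']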

paragraph atom type is broken (and shorter than sentences!)

Using the first file in the training set:
part1/suspicious-document00536.txt

we get "paragraphs" like "to be legal under the terms of the laws against monopoly; the bill amending", etc.,
because we're treating every single newline as a paragraph boundary; see:
https://github.com/NoahCarnahan/plagcomps/blob/master/feature_extractor.py#L47
which doesn't really work with our corpus. We should look into this more, but defining paragraph boundaries by empty lines might be a good start; the training file mentioned above seems to delimit paragraphs that way. A sketch of that approach is below.
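A sketch of the empty-line approach (returning character spans, on the assumption that the rest of the pipeline works with spans):

import re

def get_paragraph_spans(text):
    # Treat runs of blank lines as paragraph boundaries instead of
    # splitting on every single '\n'
    spans = []
    start = 0
    for sep in re.finditer(r'\n\s*\n', text):
        if sep.start() > start:
            spans.append((start, sep.start()))
        start = sep.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans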

evaluate bug

If you run evaluate_n_documents(['get_avg_word_frequency_class'], "kmeans", 2, "paragraph", 5) from test_framework.py, sklearn crashes. Running the same call on 6 documents instead of 5 works fine. I haven't investigated this issue yet.

getPosPercentageVector doesn't do what we think it does...

In my efforts to restructure things, I was looking into how the getPosPercentageVector feature works. I think it's not quite doing what we think it is.

The problem is that getPosPercentageVector uses the first_word_index and last_word_index arguments (which are indices into self.word_spans) as indices into self.pos_frequency_count_table (line 591). But self.pos_frequency_count_table is not the same length as self.word_spans: when self.pos_frequency_count_table is built, punctuation marks are treated as their own words, whereas they are not when self.word_spans is created.

If you run feature_extractor.py in the new pos_percentage_problem branch I just created, you can sort of see how this problem manifests itself. self.word_spans is shorter than self.pos_frequency_count_table, and consequently, when getPosPercentageVector is called on the last sentence of the document, it is not looking at the last item in self.pos_frequency_count_table.
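To make the mismatch concrete, a toy comparison (using nltk's tokenizer as a stand-in for however the POS table is actually tokenized):

import nltk

sentence = "He went on foot, as needs he must."

# POS tagging tokenizes punctuation as separate tokens...
print(nltk.word_tokenize(sentence))
# ['He', 'went', 'on', 'foot', ',', 'as', 'needs', 'he', 'must', '.'] -> 10 tokens

# ...while a whitespace-style word-span pass does not
print(sentence.split())
# ['He', 'went', 'on', 'foot,', 'as', 'needs', 'he', 'must.'] -> 8 words

# Indexing a 10-entry table with indices meant for 8 word spans drifts,
# so the last sentences never reach the end of the table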

I haven't thought about the best way to fix this. Suggestions?

Using Scipy

Is it kosher for me to import scipy here? Do we all have it?

I'm using it to run the t-tests for the independent feature evaluation, but I wasn't sure whether it would break things for you. If you don't have it and don't want it, I could just move the import statement into my function so it's only executed when you actually call it.
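i.e. something like this (function name made up):

def feature_ttest(sample_a, sample_b):
    # Local import: scipy is only required if you actually run the t-tests
    from scipy import stats
    t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
    return t_stat, p_value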

punctuation % is broken

When running the controller on the first file in our training set, we're missing punctuation. In the file:

/copyCats/pan-plagiarism-corpus-2009/intrinsic-detection-corpus/suspicious-documents/part1/suspicious-document00536.txt

the feature specifically misses the ";" in the following:
"have already been acted upon by the House of Representatives; the bill"

and a number of others.
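For comparison, a naive count that does catch the ";" (just a sanity-check sketch, not a drop-in replacement for the feature):

import string

def punctuation_percentage(text):
    # string.punctuation includes ';' along with ',', '.', etc.
    punct = sum(1 for ch in text if ch in string.punctuation)
    return punct / float(len(text))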

trigram/unigram features all zeros

Clustering is getting confused when given a vector of all zeros (i.e. when a document has no POS trigrams of VB,NN,VB). Not sure what the most logical fix is for this; one option is sketched below.
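If the problem is a feature column that is zero for every passage in a document, one option (a sketch) is to drop such columns before handing the matrix to the clusterer:

import numpy as np

def drop_all_zero_features(feature_matrix):
    # Remove features that are zero everywhere; they carry no
    # information and can confuse the clustering step
    matrix = np.asarray(feature_matrix, dtype=float)
    keep = np.any(matrix != 0, axis=0)
    return matrix[:, keep]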

Connecting analyzer to database

I've put in some (mostly skeleton) code for setting up the analyzer to run intrinsic detection followed by extrinsic. Right now, the problem is that Noah's run_intrinsic code returns a list of tuples holding the spans of (intrinsically) suspicious passages. As far as I can tell, there's no easy way to then turn to the extrinsic side and ask for the extrinsic suspicion of a particular passage. It also isn't clear that the best way to ask the database for extrinsic suspicion (the confidence we have that a passage is plagiarized from given source documents) is through span index values.

Thoughts?

Populating evolved features is slow

We talked about this today, but the population of evolved features seems to take a disproportionate amount of time -- like an hour for 50 docs.

evolved_feature_three (as an example) has an _init method that calls self.get_feature_vectors() on the necessary features and then saves that information to self.features["evolved_feature_three"]. Each query of evolved_feature_three grabs the necessary values from the self.features hash and does a simple addition/subtraction combination of them. The code is here:
https://github.com/NoahCarnahan/plagcomps/blob/master/intrinsic/featureextraction.py#L614

Why does this take so long?
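For reference, the pattern as I understand it (paraphrased; the component feature names and exact combination are made up):

def _init_evolved_feature_three(self):
    # One-time pass: compute the component feature vectors up front
    vectors = self.get_feature_vectors(['get_avg_word_frequency_class',
                                        'get_punctuation_percentage'])
    self.features['evolved_feature_three'] = vectors

def evolved_feature_three(self, i):
    # Each query is just hash lookups plus addition/subtraction, so
    # nearly all the time should be spent in _init, which suggests the
    # component extraction itself is what's slow
    a, b = self.features['evolved_feature_three'][i]
    return a - b  # stand-in for the real combination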
