
lynxrose / research-paper-plagiarism-detector


Tool research paper reviewers could use to detect a single researcher claiming multiple authors’ work

Python 100.00%
plagerism pdf naive-bayes-classifier pca pca-analysis arxiv research-paper vpn

research-paper-plagiarism-detector's Introduction

Predicting Author Count

Presentation!

Project One Pager

Does the writing quality, length, or style of groups of researchers differ from how individual researchers write? That was the question I asked myself before scraping PDF files from arxiv.org. I hope to help researchers gain more insight into their papers before they click the submit button. Armed with 41 thousand links and a VPN, my computer made calls to arxiv.org in 2-hour increments (switching IPs between them) for three days. I then converted the PDF files into text with a ~50% success rate, leaving me with 14,066 papers after cleaning, which consisted of cutting off the acknowledgements at the bottom, removing escape words, and vectorizing the text with TfidfVectorizer. My models attempted to determine whether a paper was written by one person or by more than one.
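The cleaning step could look roughly like the sketch below. It is only an illustration: the heading pattern, the helper names `strip_acknowledgements` and `normalize`, and the sample text are all assumptions, not the code used in this project.

```python
import re

# Assumed heading pattern: a line starting with "Acknowledg(e)ment(s)".
ACK_HEADING = re.compile(r"^\s*acknowledge?ments?\b", re.IGNORECASE | re.MULTILINE)


def strip_acknowledgements(text: str) -> str:
    """Drop everything from the last acknowledgements heading onward."""
    matches = list(ACK_HEADING.finditer(text))
    if not matches:
        return text
    return text[: matches[-1].start()].rstrip()


def normalize(text: str) -> str:
    """Lowercase the text and keep alphabetic tokens only."""
    return " ".join(re.findall(r"[a-z]+", text.lower()))


# Made-up sample standing in for text extracted from a PDF.
paper = "Deep Nets\nWe study CNNs.\nAcknowledgements\nWe thank our funders."
cleaned = normalize(strip_acknowledgements(paper))
print(cleaned)  # deep nets we study cnns
```

Cutting at the last matching heading also drops anything that follows it, such as a reference list, which may or may not be what you want for your own corpus.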

The baseline model I created chose the most prevalent class, more than one researcher, every time. The best-performing model I was able to create used naive Bayes with max_features set to 2000 and oversampling.
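As a rough sketch of that setup, the snippet below compares a majority-class baseline with a naive Bayes model trained on an oversampled set. The toy corpus, the choice of `MultinomialNB`, and the simple random oversampling are assumptions for illustration; the write-up does not say which naive Bayes variant or oversampling method was used.

```python
import random
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in corpus; labels: 0 = one author, 1 = more than one author.
texts = [
    "we propose a network and we evaluate our images",
    "our team trains a convolutional network on images",
    "we prove the theorem and our proof uses a graph",
    "our corpus of sentences and our embeddings",
    "we evaluate our detection layers on images",
    "i prove a theorem about a graph",
    "i study sentence embeddings in a corpus",
]
labels = [1, 1, 1, 1, 1, 0, 0]

# Baseline: always predict the most prevalent class.
majority = Counter(labels).most_common(1)[0][0]
baseline_acc = sum(y == majority for y in labels) / len(labels)

# Random oversampling: duplicate minority samples until classes are balanced.
rng = random.Random(0)
minority = [(t, y) for t, y in zip(texts, labels) if y != majority]
n_major = sum(y == majority for y in labels)
extra = [rng.choice(minority) for _ in range(n_major - len(minority))]
train_texts, train_labels = zip(*(list(zip(texts, labels)) + extra))

# TF-IDF features capped at 2000 terms, then naive Bayes.
vectorizer = TfidfVectorizer(max_features=2000)
X_train = vectorizer.fit_transform(train_texts)
model = MultinomialNB().fit(X_train, train_labels)
preds = model.predict(vectorizer.transform(texts))

print("baseline accuracy:", round(baseline_acc, 3))
print("class counts after oversampling:", Counter(train_labels))
```

On real data, the oversampling should be applied to the training split only, with accuracy measured on a held-out split; the toy example skips that to stay short.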

I used PCA to gain intuition into how the words correlate with each other. This graph shows that the most information gain occurred when k-means created 3 clusters.
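One common way to pick the number of clusters is to compare k-means inertia across values of k and look for an "elbow". The sketch below assumes that approach and uses synthetic data in place of the real TF-IDF matrix; it is not necessarily how the graph above was produced.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in data: three separated blobs in 20 dimensions.
centers = rng.normal(scale=5.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(50, 20)) for c in centers])

# Project onto the first two principal components for plotting.
coords = PCA(n_components=2).fit_transform(X)

# Inertia (within-cluster sum of squared distances) for k = 1..6.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The k after which inertia stops dropping sharply is the usual candidate; a silhouette score would be a reasonable alternative check.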

In my PCA analysis, words relating to specific academic fields were more prevalent. The following are the top 10 word outliers at the tips of the PCA 'triangle':

TOP LEFT (yellow: NLP): word, words, sentence, language, et, al, corpus, embeddings, sentences, and variables
BOTTOM LEFT (blue: Structures and Algorithms): algorithm, xi, variables, theorem, let, function, graph, problem, probability, and proof
BOTTOM RIGHT (red: Image Recognition): image, images, cnn, segmentation, object, network, detection, layer, convolutional, and layers
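A list like the one above could be produced by ranking the words in each cluster by their distance from the center of the 2-D PCA plot. The function below is a sketch under that assumption; `extreme_words` and the coordinates are made up for illustration and are not taken from the project.

```python
import numpy as np


def extreme_words(coords, labels, vocab, top_n=10):
    """For each cluster, return the words farthest from the overall centroid."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(coords - coords.mean(axis=0), axis=1)
    result = {}
    for cluster in np.unique(labels):
        idx = np.flatnonzero(labels == cluster)
        # Sort this cluster's words by distance, farthest first.
        ranked = idx[np.argsort(dist[idx])[::-1]][:top_n]
        result[int(cluster)] = [vocab[i] for i in ranked]
    return result


# Made-up 2-D coordinates and cluster labels, for illustration only.
vocab = ["image", "cnn", "theorem", "proof", "corpus", "sentence", "the"]
coords = [[9, -4], [6, -3], [-5, -6], [-3, -4], [-6, 7], [-4, 5], [0, 0]]
labels = [0, 0, 1, 1, 2, 2, 0]
print(extreme_words(coords, labels, vocab, top_n=2))
```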

The following are the papers most confidently predicted by my algorithm in their respective classes: 1 and More than 1
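Given class probabilities from a classifier (for example the output of scikit-learn's `predict_proba`), the most confident predictions per class can be found by sorting each column. The helper name `most_confident` and the probabilities below are hypothetical.

```python
import numpy as np


def most_confident(proba, class_index, top_n=3):
    """Indices of the samples with the highest probability for one class."""
    proba = np.asarray(proba, dtype=float)
    order = np.argsort(proba[:, class_index])[::-1]
    return order[:top_n].tolist()


# Made-up probabilities: column 0 = "1 author", column 1 = "more than 1".
proba = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.05, 0.95]]
print(most_confident(proba, class_index=0, top_n=2))  # [0, 2]
print(most_confident(proba, class_index=1, top_n=2))  # [3, 1]
```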

The model I have created and the PCA clustering analysis show a general trend of individual researchers working near the borders of specific fields rather than working across multiple fields in tandem. This makes sense in general, as most individuals are not experts in multiple fields of research. Through text analysis I have determined that there are very few indications that multiple researchers write text differently than individuals do.

SOURCES: Thank you Neel Shah for providing me with 30k PDF links for scraping and Andrew Mourcos for a wonderful PCA tutorial.
https://www.kaggle.com/neelshah18/arxivdataset by Neel Shah
https://andrewmourcos.github.io/blog/2019/06/06/PCA.html by Andrew Mourcos

Tools: Python, Matplotlib, Pandas, NLTK, VPN (for web scraping), and scikit-learn

GitHub Project by Lynx Rose
