rpms's Introduction

RPMS - Reviewer-Paper Matching System

A text-based system that matches papers with relevant reviewers, loosely based on the Toronto Paper Matching System.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

Using virtualenv is recommended for clean dependency management. The commands below create and activate a virtual environment, then install the project's dependencies into it.

virtualenv -p python2.7 venv
source venv/bin/activate
pip install -r requirements.txt

Install the unix package pdftotext - needed to convert pdf files to txt files

brew cask install pdftotext

Install spaCy's English model - needed for tokenization

python -m spacy download en

Download papers collection and dataset from AISG's team drive and extract under the project root folder.

Drive location: AI Technology/TramAnh Handover

Drive URL: https://drive.google.com/open?id=10Qo7JuXZ4mv6YPKtDBSBuklUHKp9c1O8

Project Structure

These are the important folders and files that should help you get started

File - Explanation
main_download_papers.py - Downloads the papers belonging to each researcher in a given list (txt file, line-delimited). It mainly uses methods from the paper_crawling/ folder. Downloaded papers are stored under papers/.
main_build_corpus.py - Processes pdf papers and parses them to text. Also stems words and removes stopwords and strange characters.
main_match_paper.py - Builds the corpus and word vectors. Cosine similarity is used to match a paper to researchers.
notebooks/Author-Topic_Model_2.ipynb - Topic modelling of researchers' papers based on the Author-Topic LDA Model.

Every time a .py file is run, a log file is also created and stored under logs/. Each entry records the timestamp, file details and log level, so that runs are easy to monitor.
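The sketch below shows one way such a per-run log file could be set up with Python's standard logging module. The helper name make_logger, the file naming scheme and the exact format string are assumptions made for illustration; they are not taken from the project's code.

```python
import logging
import os
import time


def make_logger(script_name, log_dir="logs"):
    """Return (logger, path) where path is logs/<script>_<timestamp>.log.

    Hypothetical helper: the project's own logging setup may differ.
    """
    if not os.path.isdir(log_dir):
        os.makedirs(log_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    path = os.path.join(log_dir, "%s_%s.log" % (script_name, stamp))

    logger = logging.getLogger(script_name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    # Each record carries a timestamp, the source file and line, and the level.
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(filename)s:%(lineno)d %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger, path
```

A script would call make_logger once at start-up and then use logger.info(...) or logger.warning(...) as usual.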

To run

Download Papers

TODO:

  • Get a Google API Key by following instructions from here
  • Copy the keys.ini into venv/

To download papers from a list of researchers

python main_download_papers.py --researchers researchers.txt

The process crawls dblp for each researcher's list of papers and stores the (researcher, papers) information in a pickled file, researchers_to_papers.p.

Note:

main_download_papers.py first tries to find each paper on arxiv. If the paper is not found there, it queries Google using the Google Search API. The free version allows only 100 queries/day, so a pickled file is used to keep track of what is left to query the next day. The counter resets at 3pm Singapore time every day.
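The idea of carrying unfinished queries over to the next day can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the flat list of pending titles and the search callable are all assumptions, and the real pickled file stores (researcher, papers) information rather than a flat list.

```python
import pickle

DAILY_LIMIT = 100  # free-tier allowance quoted in the note above


def run_daily_batch(pending, search, limit=DAILY_LIMIT):
    """Query at most `limit` titles and return the titles still left."""
    todo, rest = pending[:limit], pending[limit:]
    for title in todo:
        search(title)  # hypothetical callable wrapping the search API
    return rest


def save_pending(pending, path):
    # Persist the leftover work so the next day's run can resume from it.
    with open(path, "wb") as f:
        pickle.dump(pending, f)


def load_pending(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

Each day's run would load the pending list, process one batch, and save whatever remains.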

To download papers from pickled file (for subsequent downloads, once a pickled file has been created)

python main_download_papers.py --pickled researchers_to_papers.p

Build Corpus

To run

python main_build_corpus.py

This process will do the following:

  1. Parse pdf papers to text
  2. Preprocess text (stem, remove stopwords, remove non-English words and strange characters)

Output is written under bow/, with one *.bow file per pdf. Each author folder also gets one master output called _.json, for example bow/Bryan_Low/_.json. A master output covering ALL the researchers is saved as papers.json in the main directory.
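As a rough illustration of the preprocessing step, the sketch below turns raw text into a bag of words using only the standard library. The stopword list is a tiny placeholder and crude_stem is a deliberately naive suffix stripper standing in for a real stemmer; neither reflects the project's actual settings, which live in config.ini.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "are"}


def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_bag_of_words(text):
    # Keep alphabetic tokens only, which drops digits and stray symbols.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOPWORDS)
```

For example, to_bag_of_words("The models are learning; the model learned 42 topics!") counts "model" twice, "learn" twice and "topic" once.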

Most preprocessing parameters are stored in config.ini. See comments in that file for more details to tune.

Match Papers

To run

python main_match_paper.py -d <Link_to_pdf_location>

This step is straightforward. The program parses the pdf, preprocesses it and builds a bag-of-words for it. A tf-idf vector is then created and compared with the existing researchers' tf-idf vectors using cosine similarity, which yields a ranking of relevant researchers.
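The ranking step can be sketched in plain Python as below. This is an illustration under assumptions rather than the project's code: the function names are hypothetical, the idf uses a common smoothed form, and the incoming paper is included when document frequencies are computed, which the project may or may not do.

```python
import math
from collections import Counter


def tfidf_vectors(bags):
    """bags: {name: Counter of term counts}. Returns {name: {term: weight}}."""
    n = len(bags)
    df = Counter()
    for bag in bags.values():
        df.update(bag.keys())
    # Smoothed idf so a term present in every document keeps a nonzero weight.
    idf = {t: math.log((1.0 + n) / (1.0 + d)) + 1.0 for t, d in df.items()}
    return {name: {t: c * idf[t] for t, c in bag.items()}
            for name, bag in bags.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_researchers(paper_bag, researcher_bags):
    """Return [(researcher, score), ...] sorted from most to least similar."""
    bags = dict(researcher_bags)
    bags["__paper__"] = paper_bag  # reserved key, assumed not to be a real name
    vecs = tfidf_vectors(bags)
    paper_vec = vecs.pop("__paper__")
    scores = {name: cosine(paper_vec, vec) for name, vec in vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A researcher whose papers share no terms with the submitted paper gets a score of 0.0 and ends up at the bottom of the ranking.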

Topic-Model

All the code for this is in the notebook notebooks/Author-Topic_Model_2.ipynb. The code is based on the Author-Topic Model from [1]. Intermediate representations of the links between papers and authors are stored under dataset/.

The model needs more tuning as more authors and papers are downloaded. LDA is currently set to 20 topics. Each topic has a user-defined name for easy reference. However, as more papers and authors are added, the topic distribution will change and the topic names will need to be changed accordingly.

References

http://papermatching.cs.toronto.edu/

[1] The Author-Topic Model for Authors and Documents

rpms's People

Contributors

tramanh06, lcharlin

rpms's Issues

Unify paper author's folder name

Different formats of an author's name, such as "Nguyen Tram Anh" and "Tram Anh Nguyen", result in 2 different folders under /papers/
IDEA: take official name from dblp
WHERE TO LOOK: file:///main_download_papers.py -> download_papers()
