This project forked from ritwiktakkar/jcp-stack

An efficient way to scrape results from the ACM, Springer, and IEEE Xplore digital libraries

License: MIT License

JCP-Stack

An efficient way to collect results from the ACM, Springer, and IEEE Xplore digital libraries
View Demo

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Status
  5. Issues
  6. Contributing
  7. License
  8. Contact

About The Project

This project helps researchers find and sort papers from the ACM, Springer, and IEEE Xplore online databases efficiently. I have compiled a list of 291 journals and conferences with their CCF, Core and Qualis rankings in SelectedJournalsAndConferences.csv. The web scraper computes the similarity (Levenshtein ratio) between each search result's journal/conference title and the titles listed in SelectedJournalsAndConferences.csv. If the similarity is greater than or equal to a user-specified percentage, the result is written to a CSV file whose path and name are also chosen by the user. Once the scraper has traversed every page generated by the user's search term, analyzed the results on each page, and stored the ones that meet the criteria, it reports its status to the user before restarting.
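The threshold check described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`levenshtein_distance`, `similarity`, `keep_result`) are hypothetical, and the similarity here is a plain edit distance normalized by the longer string's length, which may differ from the ratio computed by the `levenshtein` package listed in the dependencies.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance where insertion, deletion, and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1], ignoring case and surrounding whitespace.

    Assumption: normalized by the longer string's length; the real
    scraper's ratio may be normalized differently.
    """
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def keep_result(venue: str, known_names: list, threshold_percent: float) -> bool:
    """True if the venue is at least threshold_percent similar to any known name."""
    if not known_names:
        return False
    best = max(similarity(venue, name) for name in known_names)
    return best * 100 >= threshold_percent
```

With a threshold of 90, a venue title that differs from a listed name only by case or a stray character would be kept, while an unrelated title would be dropped.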

Onomatology

"JCP-Stack" is short for "Journal/Conference Paper Stack" given that executing this program (ideally) outputs a CSV file that contains information about a stack of journal/conference papers related to a given keyword.

Dependencies

  • appdirs==1.4.4
  • beautifulsoup4==4.9.3
  • black==21.6b0
  • certifi==2021.5.30
  • charset-normalizer==2.0.3
  • click==8.0.1
  • colorama==0.4.4
  • configparser==5.0.2
  • crayons==0.4.0
  • idna==3.2
  • levenshtein==0.12.0
  • mypy-extensions==0.4.3
  • pathspec==0.9.0
  • regex==2021.4.4
  • requests==2.26.0
  • selenium==3.141.0
  • soupsieve==2.2.1
  • toml==0.10.2
  • urllib3==1.26.5
  • webdriver-manager==3.4.2

Getting Started

To get this project running on your local machine, follow these simple steps:

Steps

  1. Clone the repo
    git clone https://github.com/ritwiktakkar/rdb-scraper.git
  2. Make sure you're running Python 3 (I wrote and tested this project with Python 3.9.6 64-bit)
    python -V
  3. Install all the packages specified in the configuration file (requirements.txt)
    pip install -r requirements.txt
  4. You will need the latest version of Google Chrome installed on your machine
  5. Create a file called config.py inside this repo and add the following:
    from common_functions import platform
    
    if platform == "win32":
        path_to_search_results = "C:/<PATH TO SEARCH RESULTS>"
    else:
        path_to_search_results = "/Users/<PATH TO SEARCH RESULTS>"
  6. View the "Name" column inside SelectedJournalsAndConferences.csv: this is the list of names whose similarity (Levenshtein ratio) will be checked against each search result's journal/conference name. Feel free to modify this column on your local machine to add journal names that interest you or remove ones that do not.
  7. Execute get_all_results.py using Python
    PATH_TO_PYTHON_INTERPRETER PATH_TO_get_all_results.py
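To check which names the scraper will compare against (step 6), the "Name" column can be read with Python's standard `csv` module. The sketch below is illustrative only: `load_names` is a hypothetical helper, and the sample rows and the extra `Rank` column are made up for the example rather than taken from SelectedJournalsAndConferences.csv.

```python
import csv
import io


def load_names(csv_file, column="Name"):
    """Return the non-empty, stripped values of one column from an open CSV file."""
    reader = csv.DictReader(csv_file)
    names = []
    for row in reader:
        name = (row.get(column) or "").strip()
        if name:
            names.append(name)
    return names


# Made-up sample data standing in for the real CSV file.
sample = io.StringIO(
    "Name,Rank\n"
    "Example Journal A,x\n"
    "Example Conference B,y\n"
)
print(load_names(sample))
```

For the real file, the same helper could be called on `open("SelectedJournalsAndConferences.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded and its first row is a header containing "Name".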

Usage

Here is a video demo.

Status

Given that the layouts of online research databases are updated occasionally, the scraper may also need to be updated accordingly to successfully retrieve the necessary information therein. The table below provides the current status of the scraper's ability to retrieve information from different online research databases. As of 12/30/21...

Database Scraper Status
ACM
Springer
IEEE Xplore

Issues

On Windows only: Selenium's quit() method alone fails to kill chromedriver processes, which leads to a kind of memory leak. To counter this, I added a batch file (kill_chromedriver.bat) that kills all chrome.exe processes. As a result, ANY Chrome process unrelated to this program will ALSO be killed by this rather brute-force approach.
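A narrower alternative, sketched here as an untested idea rather than something this project does, is to target only chromedriver.exe and its child processes with the Windows `taskkill` command. The helper names are hypothetical, and the approach assumes that the Chrome windows opened by the scraper run as children of chromedriver.exe, which may not hold in every setup.

```python
import subprocess
import sys


def build_kill_command(image_name="chromedriver.exe"):
    """Build a taskkill command line for the given process image name.

    /F forces termination, /IM selects processes by image name,
    /T also terminates child processes started by the matched process.
    """
    return ["taskkill", "/F", "/IM", image_name, "/T"]


def kill_chromedriver():
    """Run taskkill on Windows; do nothing on other platforms.

    Returns True only if taskkill ran and reported success.
    """
    if sys.platform != "win32":
        return False
    result = subprocess.run(
        build_kill_command(), capture_output=True, text=True, check=False
    )
    return result.returncode == 0
```

Because the command matches chromedriver.exe rather than chrome.exe, Chrome windows opened by the user independently should be left alone, under the assumption stated above.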

Contributing

Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch
  3. Commit your Changes
  4. Push to the Branch
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

📧 rt398 [at] cornell [dot] edu

🏠 ritwiktakkar.com

Project Link: https://github.com/ritwiktakkar/JCP-Stack
