
ARXIV CRAWLER

This code is a crawler for arxiv.org, written in Python 3 and based on Scrapy. It retrieves all results of an advanced arXiv search and saves them in .csv (or .json) format.

DISCLAIMER

This code is written for practice only, to learn how to use Scrapy to recursively parse an advanced arXiv search. It should not be used: crawling the site is discouraged by the arxiv.org guidelines, and running it will likely result in your IP being blocked by arXiv. To download arXiv metadata, refer instead to the arXiv API.
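For reference, the arXiv API exposes the same metadata through a simple HTTP endpoint. A minimal sketch in Python (the endpoint and parameter names follow the arXiv API documentation; the query itself is only an illustration):

import urllib.request

# Query the arXiv API for ten "machine learning" results.
# The response is an Atom XML feed with one <entry> per paper.
url = ("http://export.arxiv.org/api/query"
       "?search_query=all:machine+learning"
       "&start=0&max_results=10")
with urllib.request.urlopen(url) as response:
    atom_feed = response.read().decode("utf-8")
print(atom_feed[:500])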

Description

This code fetches all arXiv metadata available via the arXiv advanced search. See the section "Output fields" for a more detailed description of the available fields. An example output can be found in the /output folder, together with a Jupyter notebook to visualise the data. Here’s an example:

Example output

Note that Scrapy launches asynchronous requests, so the results arrive out of order and should then be sorted manually, as done in the notebook.
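For instance, a saved .csv can be re-sorted with pandas; this is only a sketch, assuming the "date" column (see "Output fields" below) parses cleanly as a date:

import pandas as pd

# Load the crawler output and restore chronological order.
df = pd.read_csv("output/my_search.csv")
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")
df.to_csv("output/my_search_sorted.csv", index=False)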

Prerequisites

Python 3 and Scrapy:

pip install scrapy

Running

Clone the repo, go to the main directory and run:

scrapy crawl arxiv -a search_query=<arxiv_search_query> -a field=<field> -a date_from=<from_date> -a date_to=<to_date> -o <output_file>
    • -a specifies an input parameter
    • -o specifies the output file if you want to save your data (e.g. my_query.csv)

Example of valid parameters and format:

  • <arxiv_search_query>: machine\ learning
  • <field>: the field to search in (e.g. 'all', 'ti', 'abs')
  • <from_date>/<to_date>: 2018-09-13
  • <output_file>: output/my_search.csv

Mandatory parameters:

  • search_query: must be a valid string. See the arXiv documentation for the syntax of advanced queries, booleans, phrases, etc.

Other:

  • field: if not specified, it is automatically set to 'all'
  • date_from / date_to: if not provided, the search is automatically set to 'all_dates'; if used, both must be given, in the format described above

For example, the image above and the example output in /output are the result of the following query:

scrapy crawl arxiv -a search_query=machine\ learning -a date_from=2018-09-01 -a date_to=2018-09-13 -o output/ML_all.csv
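Assuming the spider forwards search_query to arXiv unchanged, the boolean operators of the arXiv query syntax should also work; for example (an illustrative query, not one from the repo):

scrapy crawl arxiv -a search_query=deep\ learning\ AND\ reinforcement -a field=ti -o output/drl.csv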

Output fields

The output fields (names and order) are specified in items.py; a sketch of such an item definition is shown after the field list below.

The code fetches the following fields (the names should be quite self-explanatory):

  • "ID"
  • "date"
  • "title"
  • "author"
  • "link"
  • "journal"
  • "comments"
  • "primary_cat"
  • "all_cat"
  • "abstract"

To do

  • As of now, the query runs over all arXiv fields. The input can easily be adapted to search specific subfields (e.g. phys, cs.AI, …); see the sketch below.
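
One possible route, assuming the spider forwards the query unchanged: the arXiv query syntax offers a cat: prefix for categories, so a category-restricted search might look like this (untested sketch):

scrapy crawl arxiv -a search_query=cat:cs.AI\ AND\ machine\ learning -o output/ml_csai.csv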
