journal-scrapers

Journal scraper definitions for the ContentMine framework.

This repo is a collection of ScraperJSON definitions targeting academic journals. They can be used to extract and download data from the URLs of journal articles, such as:

  • Title, author list, date
  • Figures and their captions
  • Fulltext PDF, HTML, XML, RDF
  • Supplementary materials
  • Reference lists

ScraperJSON definitions

Scrapers are defined in JSON, using a schema called ScraperJSON. The schema is still evolving.

The current schema is described below.

There can be two keys in the root object:

  • url - a string-form regular expression specifying which URL(s) this scraper targets
  • elements - a dictionary of elements to scrape
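
As a rough illustration of how the url key could be used to decide whether a scraper applies to a given article, the sketch below tests a URL against the pattern with Python's re module. The helper name scraper_matches, the placeholder URLs, and the choice of an unanchored search are assumptions made for this example; how ContentMine tools actually apply the pattern is not specified here.

```python
import re


def scraper_matches(definition: dict, url: str) -> bool:
    """Return True if the definition's `url` pattern is found in `url`.

    Assumption: the pattern is applied as an unanchored search, so it may
    match anywhere in the URL.
    """
    return re.search(definition["url"], url) is not None


# The pattern as it looks after JSON decoding ("plos.*\\.org" in the file).
definition = {"url": r"plos.*\.org", "elements": {}}

# Placeholder URLs, used only to exercise the pattern.
print(scraper_matches(definition, "http://www.plosone.org/article/example"))
print(scraper_matches(definition, "https://example.com/article/1"))
```

Note that the backslash is doubled in the JSON file only because JSON strings require it to be escaped; the regular expression itself contains a single escaped dot.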

Elements are defined as key-value pairs, where the key is a description of the element, and the value is a dictionary of specifiers defining the element and its processing. Allowed keys in the specifier dictionary are:

  • selector - an XPath selector targeting the element to be selected.
  • attribute - a string specifying the attribute to extract from the selected element. Optional (omitting this key is equivalent to giving it a value of text). In addition to HTML attributes, two special attributes are allowed:
    • text - extracts any plaintext inside the selected element
    • html - extracts the inner HTML of the selected element
  • download - a boolean flag: true if the element is a URL to a resource that must be downloaded. Optional (omitting this key is equivalent to giving it a value of false).
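
The defaults described above can be made explicit before a definition is used. The following is a minimal sketch, not part of any ContentMine tool: the function name normalise_element is hypothetical, and treating selector as required and unknown keys as errors are assumptions inferred from the list of allowed keys.

```python
ALLOWED_KEYS = {"selector", "attribute", "download"}


def normalise_element(spec: dict) -> dict:
    """Return a copy of an element specifier with the optional keys filled in."""
    unknown = set(spec) - ALLOWED_KEYS
    if unknown:
        # Assumption: keys outside the documented set are treated as errors.
        raise ValueError(f"unknown specifier keys: {sorted(unknown)}")
    if "selector" not in spec:
        # Assumption: selector is required, since it is not marked optional.
        raise ValueError("an element specifier needs a selector")
    return {
        "selector": spec["selector"],
        # Omitting attribute is equivalent to giving it the value "text".
        "attribute": spec.get("attribute", "text"),
        # Omitting download is equivalent to giving it the value false.
        "download": spec.get("download", False),
    }


print(normalise_element({"selector": "//h1"}))
```

Filling in the defaults once, up front, means later processing steps never need to check whether a key is present.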

Example:

{
  "url": "plos.*\\.org",
  "elements": {
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "download": true
    },
    "title": {
      "selector": "//meta[@name='citation_title']",
      "attribute": "content"
    }
  }
}
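
To show how the selector, attribute and download keys interact, here is a small sketch that applies two element specifiers to a hand-written page using only the Python standard library. It is an illustration under several assumptions, not how quickscrape works internally: the page content, the example.org URL, the heading element and the extract helper are all hypothetical, and xml.etree.ElementTree supports only a subset of XPath and requires well-formed markup, so a real scraper would need a full HTML parser and XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page standing in for a real article.
PAGE = """<html>
  <head>
    <meta name="citation_pdf_url" content="http://example.org/article.pdf"/>
  </head>
  <body><h1>An <em>example</em> article</h1></body>
</html>"""

ELEMENTS = {
    "fulltext_pdf": {
        "selector": "//meta[@name='citation_pdf_url']",
        "attribute": "content",
        "download": True,
    },
    # Hypothetical element, added to exercise the default `text` attribute.
    "heading": {"selector": "//h1"},
}


def extract(root, spec):
    """Return the extracted value for every node matched by the selector."""
    # ElementTree rejects absolute paths on an element, so make it relative.
    path = "." + spec["selector"]
    attribute = spec.get("attribute", "text")
    results = []
    for node in root.findall(path):
        if attribute == "text":
            # Plain text inside the element, including nested elements.
            results.append("".join(node.itertext()).strip())
        elif attribute == "html":
            # Inner HTML: leading text plus each serialised child.
            children = (ET.tostring(c, encoding="unicode") for c in node)
            results.append((node.text or "") + "".join(children))
        else:
            # An ordinary attribute; None if the node lacks it.
            results.append(node.get(attribute))
    return results


root = ET.fromstring(PAGE)
scraped = {name: extract(root, spec) for name, spec in ELEMENTS.items()}

# Values of elements flagged with download would then be fetched.
to_download = [
    url
    for name, spec in ELEMENTS.items()
    if spec.get("download", False)
    for url in scraped[name]
]

print(scraped)
print(to_download)
```

The sketch only collects the URLs flagged for download; fetching them is left out so that the example runs without network access.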

Usage

Currently these definitions can be used with the quickscrape tool.

Contributors

blahah, ianthe
