
paperetl

ETL processes for medical and scientific papers



paperetl is an ETL library for processing medical and scientific papers. It supports the following sources:

  • CORD-19
  • PDF articles

paperetl supports the following databases for storing articles:

  • SQLite
  • Elasticsearch
  • JSON files
  • YAML files

Installation

The easiest way to install is via pip and PyPI.

pip install paperetl

You can also install paperetl directly from GitHub. Using a Python Virtual Environment is recommended.

pip install git+https://github.com/neuml/paperetl

Python 3.6+ is supported.

Additional dependencies

Study design detection uses scispacy; the required model can be installed via:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_md-0.2.5.tar.gz

PDF parsing requires a running GROBID instance. It is assumed that GROBID is running locally on the ETL server. GROBID is not necessary for the CORD-19 dataset.
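Before running a PDF extraction, it can help to verify that the GROBID service is reachable. Below is a minimal sketch using only the Python standard library, assuming GROBID's default port of 8070 and its isalive health-check endpoint:

```python
from urllib.request import urlopen
from urllib.error import URLError

def grobid_alive(url="http://localhost:8070"):
    # GROBID exposes a lightweight health check at /api/isalive;
    # 8070 is its default port, adjust if your instance differs.
    try:
        with urlopen(f"{url}/api/isalive", timeout=2) as response:
            return response.status == 200
    except (URLError, OSError):
        return False

print(grobid_alive())
```

If this prints False, start GROBID before running any PDF ETL jobs.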

Docker

A Dockerfile with commands to install paperetl, all dependencies and scripts is available in this repository.

Clone this git repository and run the following to build and run the Docker image.

docker build -t paperetl -f docker/Dockerfile .
docker run --name paperetl --rm -it paperetl

This brings up a paperetl command shell. Standard Docker commands can be used to copy files into the container, or commands can be run directly in the shell to retrieve input content. All scripts in the following examples are available in this environment.

Examples

Notebooks

Notebook | Description
--- | ---
CORD-19 Article Entry Dates | Generates the CORD-19 entry-dates.csv file
CORD-19 ETL | Builds an articles.sqlite database for CORD-19 data

Load CORD-19 into SQLite

The following example shows how to use paperetl to load the CORD-19 dataset into a SQLite database.

  1. Download and extract the dataset from Allen Institute for AI CORD-19 Release Page.

    scripts/getcord19.sh cord19/data

    The script above retrieves and unpacks the latest copy of CORD-19 into a directory named cord19/data. An optional second argument sets a specific date of the dataset in the format YYYY-MM-DD (ex. 2021-01-01) which defaults to the latest date.

  2. Download study design model

    scripts/getstudy.sh cord19/models

    The script above retrieves and unpacks a copy of the study model into a directory named cord19/models.

    The study design model with training data is also available on Kaggle.

  3. Generate entry-dates.csv for current version of the dataset

    python -m paperetl.cord19.entry cord19/data

An optional second argument sets a specific date of the dataset in the format YYYY-MM-DD (ex. 2021-01-01), which defaults to the latest date. This should match the date used in Step 1.

    A version of entry-dates.csv is also available on Kaggle.

  4. Build database

    python -m paperetl.cord19 cord19/data cord19/models cord19/models

Once complete, there will be an articles.sqlite file in cord19/models.
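The resulting database can be inspected with Python's built-in sqlite3 module. The snippet below is a stand-in sketch: it builds a tiny throwaway in-memory database just to show the query pattern, since the real schema is whatever the ETL run produced. Point the connection at cord19/models/articles.sqlite to inspect actual output.

```python
import sqlite3

# Stand-in sketch: a throwaway database demonstrating the query pattern.
# Replace ":memory:" with "cord19/models/articles.sqlite" for real output.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (id TEXT, title TEXT)")
db.execute("INSERT INTO articles VALUES ('1', 'Example paper')")

# List the tables present, then count the stored articles
tables = [row[0] for row in
          db.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
count = db.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(tables, count)
```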

Load PDF Articles into SQLite

The following example shows how to use paperetl to load a set of medical/scientific PDF articles into a SQLite database.

  1. Download the desired medical/scientific articles in a local directory. For this example, it is assumed the articles are in a directory named paperetl/data

  2. Download study design model

    scripts/getstudy.sh paperetl/models

    The study design model with training data can also be found on Kaggle.

  3. Build the database

    python -m paperetl.file paperetl/data paperetl/models paperetl/models

Once complete, there will be an articles.sqlite file in paperetl/models.

Load into Elasticsearch

Both of the examples above also support storing data in Elasticsearch with the following changes. These examples assume Elasticsearch is running locally; change the URL to point at a remote server as appropriate.

CORD-19:

python -m paperetl.cord19 cord19/data http://localhost:9200

PDF Articles:

python -m paperetl.file paperetl/data http://localhost:9200 paperetl/models

Once complete, there will be an articles index in Elasticsearch with the metadata and full text stored.
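The index can then be searched with a standard Elasticsearch match query. A minimal sketch using only the standard library follows; the "title" field name is an assumption, so adjust it to match the metadata fields stored by your ETL run:

```python
import json
from urllib import request

def build_query(text):
    # Standard Elasticsearch match query; "title" is an assumed field name.
    return {"query": {"match": {"title": text}}}

def search(text, url="http://localhost:9200"):
    # POST the query to the articles index created by the examples above.
    body = json.dumps(build_query(text)).encode("utf-8")
    req = request.Request(f"{url}/articles/_search", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=5) as response:
        return json.load(response)

print(build_query("antiviral"))
```

Calling search("antiviral") returns the usual Elasticsearch response envelope, with matching documents under hits.hits.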

Convert PDF articles to JSON/YAML

paperetl can also be used to convert PDF articles into JSON or YAML files. This is useful if the data is to be fed into another system or for manual inspection/debugging of a single file.

JSON:

python -m paperetl.file paperetl/data json://paperetl/json paperetl/models

YAML:

python -m paperetl.file paperetl/data yaml://paperetl/yaml paperetl/models

Converted files will be stored in paperetl/(json|yaml).
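Converted JSON files can be read back with the standard library for inspection. The sketch below writes a stand-in file first so it is self-contained; the field names ("title", "sections") are assumptions, so inspect a real file in paperetl/json for the exact schema.

```python
import json
import os
import tempfile

# Illustrative stand-in for a file written to paperetl/json by the
# conversion step; field names here are assumptions, not the real schema.
sample = {"title": "Example paper",
          "sections": [{"name": "ABSTRACT", "text": "..."}]}

path = os.path.join(tempfile.mkdtemp(), "example.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

# Load a converted article and inspect its top-level fields
with open(path, encoding="utf-8") as f:
    article = json.load(f)

print(article["title"], len(article["sections"]))
```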

Contributors

  • davidmezzetti
  • nialov