
Pubmed Pipeline Python Library

Overview

This library allows for the easy creation of a machine learning pipeline that uses PubMed as its data source. Two kinds of pipelines can be created:

Setup Pipeline: Downloads all the papers from PubMed matching a specific search query, applies a machine learning classifier to them, and saves the output in Parquet format.

Update Pipeline: Handles the logic of downloading all the new and updated papers since the setup pipeline or the last update pipeline ran. The new and updated papers retrieved from PubMed are added to the main dataframe created by the setup pipeline and are also stored in a separate dataframe, which is written in Parquet format.

Requirements

python3+

pip

git

parallel

xmlstarlet

wget

curl
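
Before installing, you may want to confirm that the command-line tools listed above are available on your PATH. The snippet below is a small sketch, not part of the library; it only checks for the tools named in this list:

import shutil

# Quick check that the command-line tools this library relies on are on PATH.
for tool in ["git", "parallel", "xmlstarlet", "wget", "curl"]:
    status = "found" if shutil.which(tool) else "MISSING"
    print(f"{tool}: {status}")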

Installation

Make sure you have Python and pip installed.

If you do not have git installed, follow these instructions to install it.

  1. Clone this repository (or alternatively download it directly from the GitHub page):
git clone https://github.com/nicford/Pubmed-Pipeline.git
  2. In your terminal, navigate into the cloned/downloaded folder and run the following command to install the Pubmed Pipeline library (a quick import check to verify the install follows this list):
pip install pubmed_pipeline
  3. Install the other required dependencies:

    Follow these instructions to install parallel.

    Follow these instructions to install xmlstarlet.

    Follow these instructions to install wget.

    Follow these instructions to install curl.
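
Once the dependencies are installed, a quick way to confirm the library itself is available is to import it from Python. This is a minimal sketch; it only assumes the package installs under the name pubmed_pipeline used elsewhere in this README:

# Minimal installation check: confirms the package can be imported.
# Assumes the name from the "pip install pubmed_pipeline" step above.
try:
    import pubmed_pipeline  # noqa: F401
    print("pubmed_pipeline imported successfully")
except ImportError as err:
    print(f"pubmed_pipeline is not installed correctly: {err}")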

Usage

Requirements

1. Spark Session

To create a pipeline object, you need to pass in a Spark session, so you must configure your Spark session beforehand. If you are unfamiliar with Spark sessions, you can get started here. Note: if you are using Databricks, a Spark session called "spark" is created automatically.
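
A minimal local Spark session, mirroring the configuration used in the examples later in this README, can be created as follows (the app name is a placeholder to adapt to your environment):

from pyspark.sql import SparkSession

# Minimal local Spark session; adjust master, appName and any config options as needed.
sparkSession = SparkSession.builder \
    .master("local") \
    .appName("pubmed-pipeline") \
    .getOrCreate()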

2. API KEY (optional, for XML downloads)

If you do not have your own PubMed XML data and you wish to download XML paper data from PubMed, you need a PubMed API key. This API key can be obtained by doing the following:

"Users can obtain an API key now from the Settings page of their NCBI account. To create an account, visit http://www.ncbi.nlm.nih.gov/account/."

Setup Pipeline

The setup pipeline class allows you to create a new pipeline from scratch.

The example below shows how to use it.

from pubmed_pipeline import PubmedPipelineSetup
from pyspark.sql import SparkSession

XMLFilesDirectory = ""     # path to save downloaded XML content from PubMed, or path to XML data if you already have some
numSlices = 100            # number of partitions the data will be parallelized into (integer; example value)
searchQueries = [""]       # list of query strings to search PubMed for
apiKey = ""                # PubMed API key to allow an increased request rate and avoid HTTP 429 errors (see the E-utilities website for how to get a key)
lastRunDatePath = ""       # path to store a pickle object of the date when the setup is run (pass the same path to the update job)
classifierPath = ""        # path to the classifier used to classify papers
dataframeOutputPath = ""   # path to store the final dataframe to in Parquet form

# your Spark session configuration
sparkSession = SparkSession.builder \
                       .master("local") \
                       .appName("") \
                       .config("spark.some.config.option", "some-value") \
                       .getOrCreate()

# create the setup pipeline 
setupJob = PubmedPipelineSetup(sparkSession, XMLFilesDirectory, classifierPath, dataframeOutputPath, numSlices, lastRunDatePath)

# This downloads all the required papers from PubMed under the searchQueries
setupJob.downloadXmlFromPubmed(searchQueries, apiKey)

# This runs the pipeline and saves the classified papers in dataframeOutputPath
setupJob.runPipeline()
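
Once the setup job has finished, the classified papers in dataframeOutputPath are an ordinary Parquet dataset, so they can be loaded back with plain Spark. A minimal sketch, continuing from the example above (not a library call):

# Load and inspect the dataframe written by the setup pipeline.
papersDf = sparkSession.read.parquet(dataframeOutputPath)
print(papersDf.count())   # number of classified papers
papersDf.show(5)          # preview the first few rows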

Update Pipeline

The update pipeline class updates your database of papers with everything new or changed since the setup pipeline, or the last update, was run.

The example below shows how to use it.

from pubmed_pipeline import PubmedPipelineUpdate
from pyspark.sql import SparkSession

XMLFilesDirectory = ""          # path to save downloaded XML content from PubMed
numSlices = 100                 # number of partitions the data will be parallelized into (integer; example value)
searchQueries = [""]            # list of query strings to search PubMed for
apiKey = ""                     # PubMed API key to allow an increased request rate and avoid HTTP 429 errors (see the E-utilities website for how to get a key)
lastRunDatePath = ""            # path containing a pickle object of the last run date (running the setup job creates one)
classifierPath = ""             # path to the classifier used to classify papers
dataframeOutputPath = ""        # path to store the final dataframe to in Parquet form
newAndUpdatedPapersPath = ""    # path to store the dataframe containing the new and updated papers

# your Spark session configuration
sparkSession = SparkSession.builder \
                       .master("local") \
                       .appName("") \
                       .config("spark.some.config.option", "some-value") \
                       .getOrCreate()

# create the update pipeline 
updateJob = PubmedPipelineUpdate(sparkSession, XMLFilesDirectory, classifierPath, dataframeOutputPath, numSlices, lastRunDatePath, newAndUpdatedPapersPath)

# This downloads all the required papers from PubMed under the searchQueries
updateJob.downloadXmlFromPubmed(searchQueries, apiKey)

# This runs the pipeline and saves the new and updated classified papers in newAndUpdatedPapersPath
# The pipeline also handles the logic to add new papers and remove any papers from the main dataframe which are no longer relevant
updateJob.runPipeline()
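
As with the setup job, the dataframes written to newAndUpdatedPapersPath and dataframeOutputPath are ordinary Parquet datasets and can be inspected with plain Spark. A minimal sketch, continuing from the example above (not a library call):

# Inspect the delta produced by the update run and the refreshed main dataframe.
newPapersDf = sparkSession.read.parquet(newAndUpdatedPapersPath)
mainDf = sparkSession.read.parquet(dataframeOutputPath)
print(f"new/updated papers: {newPapersDf.count()}, total papers: {mainDf.count()}")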

Customisation of library

If you wish to customise the library to meet your own needs, please fork the repository and do the following:

To customise the pipeline processes, change or add functions in pubmedPipeline.py.

To customise the downloading of XML metadata, change setupPipeline.sh and updatePipeline.sh.

Core Developers

Nicolas Ford

Yalman Ahadi

Paul Lorthongpaisarn

Dependencies

We would like to acknowledge the following projects and libraries:
