Pubmed-Batch-Download

Batch download articles based on PMID (PubMed ID). This project is no longer being updated, as I no longer have access to paywalled journals. If someone would like to take over support in full, go ahead and fork. Otherwise, contributions will have to be made by others, and I can merge in PRs.

Version 3.0.0 Last update: 9/15/2020.

Required Packages

As of version 3.0.0, the program is written for Python 3.7. It uses the following packages, which are not part of the standard library:

requests
requests3
beautifulsoup4
lxml

Optionally, instead of installing these yourself, you can use the included "pubmed-batch-downloader-py3.yml" file with Anaconda to create an environment containing versions of Python and the packages that are known to work with this program. On Linux, create it via

conda env create -f pubmed-batch-downloader-py3.yml

or on Windows via

conda env create -f pubmed-batch-downloader-py3-windows.yml

Then, activate the environment with

conda activate pubmed-batch-downloader-py3

If you use the Windows environment, you will then need to run the following commands to install the remaining packages, as I cannot get the yml to work when they are included.

conda install requests beautifulsoup4 lxml
conda install requests3

Program Usage

Each run downloads the listed articles to a folder, by default named "fetched_pdfs" inside the application directory, with each PDF named after the PMID corresponding to the article. Articles already in the PDF folder are not downloaded again.
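The skip-if-present behaviour can be pictured with a minimal sketch. This is not the code from fetch_pdfs.py; the function name pending_downloads is hypothetical, and it assumes every file is stored as "<name>.pdf" directly inside the output folder.

```python
from pathlib import Path


def pending_downloads(names, out_dir="fetched_pdfs"):
    """Return the names that do not yet have a PDF in out_dir.

    Hypothetical helper for illustration: it assumes each article is
    saved as "<name>.pdf" directly inside out_dir.
    """
    out = Path(out_dir)
    return [name for name in names if not (out / f"{name}.pdf").exists()]
```

For example, if the folder already holds 123.pdf, calling pending_downloads(["123", "124"]) would leave only "124" to fetch.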

Use the program via

python fetch_pdfs.py [-pmids or -pmf] [optional arguments]

Arguments: The program has the following arguments. It must be run with either -pmids or -pmf, not both. The help page can be displayed by running the program with -h, or with no arguments.

-pmids: A comma separated list of pmids to download
-pmf: A file with 1 or 2 columns of pmids and file names to download.  See below for example
-out: The output folder to store the downloaded pdfs.  By default, this is ./fetched_pdfs
-errors: File path to write all un-downloaded PMIDs during program run.  By default, this is ./unfetched_pmids.tsv.  This file is overwritten each run.
-maxRetries: Maximum number of times to retry downloading a pdf after a Connection Error (specifically, an ECONNRESET, code 104).
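For readers who want to see how such an interface fits together, here is a rough sketch using the standard argparse module. It is an illustration under assumptions, not the parser in fetch_pdfs.py: the function names build_parser and parse_cli are hypothetical, and the default of 3 for -maxRetries is a placeholder, since the actual default is not stated here.

```python
import argparse


def build_parser():
    """Build a parser mirroring the arguments listed above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Batch download articles by PMID."
    )
    # Exactly one of -pmids / -pmf must be supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-pmids", help="Comma separated list of pmids to download")
    source.add_argument("-pmf", help="File with 1 or 2 columns of pmids and file names")
    parser.add_argument("-out", default="./fetched_pdfs",
                        help="Output folder for the downloaded pdfs")
    parser.add_argument("-errors", default="./unfetched_pmids.tsv",
                        help="File that receives the un-downloaded PMIDs")
    # The default of 3 is an assumption made for this sketch.
    parser.add_argument("-maxRetries", type=int, default=3,
                        help="Maximum number of retries on a connection error")
    return parser


def parse_cli(argv):
    """Show the help page when no arguments are given, otherwise parse them."""
    parser = build_parser()
    if not argv:
        parser.print_help()
        raise SystemExit(0)
    return parser.parse_args(argv)
```

With this layout, passing both -pmids and -pmf makes argparse exit with a usage error, which matches the "not both" rule above.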

PMF File Format: The -pmf file allows the user to input a file with a list of pmids, one per line, to download, instead of listing them in the command line with a comma separated list. This structure would be as follows

PMID1
PMID2
PMID3
...

Optionally, this file can have a second column, which is what to name the files when you download them. For example, if I wanted to download the article with pmid 123 and name it "Article_1.pdf" and pmid 4456 with name "Some_Other_Article.pdf", I would use the following pmf file (note, the columns are tab separated)

123	Article_1
4456	Some_Other_Article

When the program cannot download files, the non-downloaded PMIDs are stored in a PMF format file. This can then be directly used at a later date with the program. PMIDs and names are both stored within this file.
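The PMF format described above can be sketched as a pair of small helpers. These are hypothetical functions written for illustration, not taken from fetch_pdfs.py. They assume tab separated columns, and that a line with only one column uses the PMID itself as the file name.

```python
def parse_pmf(text):
    """Parse PMF text into a list of (pmid, name) pairs.

    Assumptions for this sketch: columns are tab separated, blank lines
    are ignored, and a missing second column defaults to the PMID.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split("\t")
        pmid = fields[0].strip()
        name = fields[1].strip() if len(fields) > 1 and fields[1].strip() else pmid
        entries.append((pmid, name))
    return entries


def format_pmf(entries):
    """Serialise (pmid, name) pairs back into two-column PMF text."""
    return "".join(f"{pmid}\t{name}\n" for pmid, name in entries)
```

Because format_pmf always writes both columns, its output can be fed straight back into parse_pmf, which is the same round trip the unfetched-PMIDs file is meant to support.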

Example script usage:

python fetch_pdfs.py -pmids 123,124,125,23923,111

will place the files 123.pdf, 124.pdf, 125.pdf, 23923.pdf, and 111.pdf inside the PDF folder, assuming all were found.

Known download issues

The requests package cannot execute JavaScript, so pages that require JavaScript to load the link to the pdf or to the journal cannot be fetched with this program. As of now, this affects the Wolters Kluwer journals.
