ncbiutils

Making retrieval of records from National Center for Biotechnology Information (NCBI) E-Utilities simpler.

Installation

Set up a virtual environment. Here, we use miniconda to create an environment named testenv:

$ conda create --name testenv python=3.8
$ conda activate testenv

Then install the package in the testenv environment:

$ pip install ncbiutils

Usage

The ncbiutils module exposes a PubMedFetch class, an easy-to-configure wrapper for the EFetch E-Utility. By default, PubMedFetch retrieves PubMed article records, each identified by its PubMed identifier (PMID).

from ncbiutils.ncbiutils import PubMedFetch
import json

# Initialize a list of PubMed identifiers for the records we wish to retrieve
uids = ['16186693', '29083299']

# Create an instance, optionally provide an E-Utility API key
pubmed_fetch = PubMedFetch()

# Retrieve the records
# Returns a generator that yields results for a chunk of the input PMIDs (see Options)
chunks = pubmed_fetch.get_citations(uids)

# Iterate over the results
for chunk in chunks:
    # A Chunk is a namedtuple with 3 fields:
    #   - error: Includes network errors as well as HTTP status >=400
    #   - citations: article records, each wrapped as a Citation
    #   - ids: input ids for chunk
    error, citations, ids = chunk

    # Citation class can be represented as a dict
    print(json.dumps(citations[0].dict()))

# Output as JSON
{
   "pmid":"16186693",
   "pmc":"None",
   "doi":"10.1159/000087186",
   "title":"Searching the MEDLINE literature database through PubMed: a short guide.",
   "abstract":"The Medline database from the National Library of Medicine (NLM) contains more than 12 million bibliographic citations from over 4,600 international biomedical journals...",
   "author_list":[
      {
         "fore_name":"Edith",
         "last_name":"Motschall",
         "initials":"E",
         "collective_name":"None",
         "orcid":"None",
         "affiliations":[
            "Institut für Medizinische Biometrie und Medizinische Informatik, Universität Freiburg, Germany. [email protected]"
         ],
         "emails":[
            "motschall@..."
         ]
      },
      ...
   ],
   "journal":{
      "title":"Onkologie",
      "issn":[
         "0378-584X"
      ],
      "volume":"28",
      "issue":"10",
      "pub_year":"2005",
      "pub_month":"Oct",
      "pub_day":"None"
   },
   "publication_type_list":[
      "D016428",
      "D016454"
   ],
   "correspondence":[],
   "mesh_list":[
      {
         "descriptor_name":{
            "ui":"D003628",
            "value":"Database Management Systems"
         }
      },
      {
         "descriptor_name":{
            "ui":"D016206",
            "value":"Databases, Bibliographic"
         }
      },
      {
         "descriptor_name":{
            "ui":"D016247",
            "value":"Information Storage and Retrieval"
         },
         "qualifier_name":[
            {
               "ui":"Q000379",
               "value":"methods"
            }
         ]
      },
     ...
   ]
}
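
When a chunk fails, its error field is set and its ids field records which PMIDs were affected. The following is a minimal sketch of separating successes from failures. The Chunk namedtuple below is a local stand-in that mirrors the three fields described above (it is not imported from the package), split_results is a hypothetical helper, and the sketch assumes error is None when a chunk succeeds.

```python
from collections import namedtuple

# Local stand-in mirroring the three documented fields (assumption: the real
# Chunk comes from ncbiutils and sets error to None on success).
Chunk = namedtuple("Chunk", ["error", "citations", "ids"])


def split_results(chunks):
    """Collect citations from successful chunks and ids from failed ones."""
    citations, failed_ids = [], []
    for error, chunk_citations, ids in chunks:
        if error is not None:
            # Keep the ids so the caller can retry or report them
            failed_ids.extend(ids)
            continue
        citations.extend(chunk_citations)
    return citations, failed_ids


# Placeholder data instead of a live request
demo = [
    Chunk(error=None, citations=["citation-a", "citation-b"], ids=["1", "2"]),
    Chunk(error=RuntimeError("HTTP 429"), citations=None, ids=["3"]),
]
ok, failed = split_results(demo)
```

In real use, the generator returned by get_citations would be passed in place of demo.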

Options

Configure the PubMedFetch instance through its constructor:

  • db: DbEnum
    • Set the database to process either <!DOCTYPE pmc-articleset ...> or <!DOCTYPE PubmedArticleSet ...> (default)
  • retmax : int
    • Maximum number of records to return in a chunk (default/max 10000)
  • api_key : str
    • API key for NCBI E-Utilities
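
To illustrate what retmax controls, the sketch below splits a list of ids into consecutive chunks of at most retmax items. It only shows the effect of the option; chunk_ids is a hypothetical helper, not the package's internal implementation, and the ids are placeholders.

```python
from typing import Iterator, List, Sequence


def chunk_ids(uids: Sequence[str], retmax: int = 10000) -> Iterator[List[str]]:
    """Yield consecutive slices of uids, each holding at most retmax ids."""
    if retmax < 1:
        raise ValueError("retmax must be a positive integer")
    for start in range(0, len(uids), retmax):
        yield list(uids[start:start + retmax])


# Five placeholder ids with retmax=2 give chunks of size 2, 2 and 1
batches = list(chunk_ids(["1", "2", "3", "4", "5"], retmax=2))
```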

Also available is:

Testing

As this project was built with poetry, you'll need to install poetry to get this project's development dependencies.

Once installed, clone this GitHub remote:

$ git clone https://github.com/PathwayCommons/ncbiutils
$ cd ncbiutils

Install the project:

$ poetry install

Run the test script:

$ ./test.sh

Under the hood, the tests are run with pytest. The test script also does a lint check with flake8 and type check with mypy.

Publishing a release

A GitHub workflow will automatically version and release this package to PyPI following a push directly to main or when a pull request is merged into main. A push/merge to main will automatically bump up the patch version.

We use Python Semantic Release (PSR) to manage versioning. PSR scans commit messages that follow a well-defined structure and bumps the version according to semver.

For a patch bump:

$ git commit -m "fix(ncbiutils): some comment for this patch version"

For a minor bump:

$ git commit -m "feat(ncbiutils): some comment for this minor version bump"

For a major bump (breaking change):

$ git commit -m "feat(mod_plotting): some comment for this release" -m "BREAKING CHANGE: other footer text."
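
The mapping from commit message to version bump can be sketched as follows. This is a simplified illustration of the convention shown in the commands above, not PSR's actual parser; bump_for and the HEADER pattern are hypothetical names, and only the feat and fix types are handled.

```python
import re
from typing import Optional

# Commit header: type, optional (scope), optional "!", then ": subject"
HEADER = re.compile(r"^(?P<type>\w+)(\((?P<scope>[^)]*)\))?(?P<breaking>!)?: .+")


def bump_for(message: str) -> Optional[str]:
    """Return 'major', 'minor', 'patch', or None for a commit message."""
    lines = message.splitlines()
    header = HEADER.match(lines[0]) if lines else None
    if header is None:
        return None
    # A breaking change takes precedence over the commit type
    if header.group("breaking") or any(
        line.startswith("BREAKING CHANGE:") for line in lines[1:]
    ):
        return "major"
    return {"feat": "minor", "fix": "patch"}.get(header.group("type"))


patch = bump_for("fix(ncbiutils): some comment for this patch version")
minor = bump_for("feat(ncbiutils): some comment for this minor version bump")
major = bump_for(
    "feat(mod_plotting): some comment for this release\n\n"
    "BREAKING CHANGE: other footer text."
)
```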

ncbiutils's Issues

Medline format - no structure for relations

Importantly, the Medline output from the Bio package simply aggregates all fields that share a key (e.g. all authors, all affiliations), so the relations between them are lost. Beyond unique fields (text, abstract), we can't use this.

Pydantic XML data binding is available

I did the XML mapping by hand, which is fine for our simple needs.

Could leverage the pydantic ORM if needs get a little more complicated, or just as a learning case. Probably more robust and easier to debug?

Output named tuple

The output from get_record_chunks is a tuple. It would be easier if it output a named tuple.

Load test

Load test with a large number of uids (>> 10000).

Make sure that a key is used, and look for requirements for user agent etc.

Support retrieval from db=pmc

Update the PubMedFetch class so that it can handle <!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd"> as well as pubmed.
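
One way to tell the two response types apart is to inspect the root element, whose name matches the DOCTYPE. The sketch below assumes that approach; detect_db and ROOT_TO_DB are hypothetical names and do not describe how PubMedFetch is implemented.

```python
import xml.etree.ElementTree as ET

# Root element names taken from the two DOCTYPE declarations
ROOT_TO_DB = {"PubmedArticleSet": "pubmed", "pmc-articleset": "pmc"}


def detect_db(xml_text: str) -> str:
    """Return 'pubmed' or 'pmc' based on the root element of a response."""
    root = ET.fromstring(xml_text)
    try:
        return ROOT_TO_DB[root.tag]
    except KeyError:
        raise ValueError(f"Unrecognized root element: {root.tag!r}") from None


# Minimal hand-made documents stand in for real EFetch responses
pubmed_kind = detect_db("<PubmedArticleSet></PubmedArticleSet>")
pmc_kind = detect_db("<pmc-articleset></pmc-articleset>")
```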

Extract the publication type

This attribute can be helpful in discerning what is a comment, review, retraction. Probably not as helpful to determine what IS a primary research article however.
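
As a rough sketch of how the extracted identifiers might be used, the snippet below translates MeSH publication-type UIs into names and flags reviews. The two identifiers are the ones that appear in the sample output in the Usage section; the mapping is deliberately incomplete, and the function names are hypothetical.

```python
# MeSH publication-type identifiers; extend this mapping from MeSH as needed
PUBLICATION_TYPES = {
    "D016428": "Journal Article",
    "D016454": "Review",
}


def publication_type_names(ui_list):
    """Translate MeSH UIs to names, leaving unknown UIs unchanged."""
    return [PUBLICATION_TYPES.get(ui, ui) for ui in ui_list]


def is_review(ui_list):
    """True when the record is tagged with the Review publication type."""
    return "D016454" in ui_list


names = publication_type_names(["D016428", "D016454"])
review = is_review(["D016428", "D016454"])
```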

Cannot parse XML for some records

Check the following list against _get_iso_abbreviation:

['PMC9590371', 'PMC9184814', 'PMC9109680', 'PMC9184355', 'PMC9512138', 'PMC9250136', 'PMC9282842', 'PMC9188984', 'PMC9349400', 'PMC9350338', 'PMC9201715', 'PMC9188682', 'PMC9854242', 'PMC9217129', 'PMC9852190', 'PMC7614080', 'PMC9852135', 'PMC9669093', 'PMC9214689', 'PMC9216575', 'PMC9208297', 'PMC9207713', 'PMC9484647', 'PMC9633344', 'PMC9780404', 'PMC9849529', 'PMC9789203', 'PMC9772360']

Downloading PubMed daily updates

This should be a simple wrapper to:

  • download a file https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/pubmed*.xml.gz
  • unzip
  • generate a list of Citations
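
The three steps above could look roughly like the sketch below, which uses only the standard library. It stops at extracting PMIDs; building Citation objects would be left to the package's own parser. The function names are hypothetical, the caller supplies the file name, and the demonstration runs on a hand-made document rather than a real download.

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/"


def download(filename: str) -> bytes:
    """Fetch one compressed update file; filename is supplied by the caller."""
    with urllib.request.urlopen(BASE_URL + filename) as response:
        return response.read()


def extract_pmids(gz_bytes: bytes) -> list:
    """Decompress an update file and return the PMID of each article record."""
    root = ET.fromstring(gzip.decompress(gz_bytes))
    pmids = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        if pmid is not None:
            pmids.append(pmid)
    return pmids


# Offline demonstration: compress a tiny document and parse it back
sample = (
    b"<PubmedArticleSet><PubmedArticle><MedlineCitation>"
    b"<PMID>16186693</PMID>"
    b"</MedlineCitation></PubmedArticle></PubmedArticleSet>"
)
pmids = extract_pmids(gzip.compress(sample))
```

Real update files are large, so a streaming parser such as xml.etree.ElementTree.iterparse may be a better fit than loading the whole document at once.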

pmc-articleset: Extracting author emails

There is a case for extracting author emails from the PMC articleset, although the DTD is pretty ambiguous.

Good:

<pmc-articleset>
  <article ...>
    <front>
       <article-meta>
          <contrib-group>
            <contrib>
              <address>
                <email>[email protected]</email>
...

Bad:

<pmc-articleset>
  <article ...>
    <front>
       <article-meta>
          <contrib-group>
            <contrib>
              <name>
                <surname>Zhou</surname>
                <given-names>Yunli</given-names>
              </name>
              <xref rid="fn001" ref-type="author-notes">
                <sup>*</sup>
              </xref>
...
          <author-notes>
            <corresp id="fn001">*Correspondence: Yunli Zhou, 
              <email xlink:href="mailto:[email protected]">[email protected]</email>
            </corresp>
...

Ugly:

<pmc-articleset>
  <article ...>
    <front>
       <article-meta>
          <author-notes>
              <corresp id="cor001">* E-mail:
                    <email>[email protected]</email>
                    (YZ);
                    <email>[email protected]</email>
                    (XJY)</corresp>
...
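
A minimal sketch of one way to handle the three layouts with the standard library follows. The function name, the (source, email) output shape, and the example.org addresses are all assumptions made for illustration; in the "ugly" case the only available hint is the text trailing each email.

```python
import xml.etree.ElementTree as ET


def extract_emails(article):
    """Return (source, email) pairs for one <article> element."""
    pairs = []
    linked_ids = set()
    for contrib in article.iter("contrib"):
        surname = contrib.findtext("name/surname", default="")
        # "Good" case: email nested directly inside the contributor
        for email in contrib.iter("email"):
            pairs.append((surname, email.text))
        # "Bad" case: contributor points at a <corresp> through an <xref>
        for xref in contrib.iter("xref"):
            if xref.get("ref-type") != "author-notes":
                continue
            for corresp in article.iter("corresp"):
                if corresp.get("id") == xref.get("rid"):
                    linked_ids.add(corresp.get("id"))
                    for email in corresp.iter("email"):
                        pairs.append((surname, email.text))
    # "Ugly" case: emails in a <corresp> that no contributor references;
    # trailing text such as "(YZ);" is kept as a hint
    for corresp in article.iter("corresp"):
        if corresp.get("id") in linked_ids:
            continue
        for email in corresp.iter("email"):
            hint = (email.tail or "").strip(" \n\t;")
            pairs.append((hint, email.text))
    return pairs


sample = ET.fromstring("""
<article>
  <front><article-meta>
    <contrib-group>
      <contrib><name><surname>Zhou</surname></name>
        <xref rid="fn001" ref-type="author-notes"><sup>*</sup></xref>
      </contrib>
    </contrib-group>
    <author-notes>
      <corresp id="fn001">*Correspondence: <email>zhou@example.org</email></corresp>
      <corresp id="cor001">* E-mail: <email>a@example.org</email> (YZ);
        <email>b@example.org</email> (XJY)</corresp>
    </author-notes>
  </article-meta></front>
</article>
""")
pairs = extract_emails(sample)
```

Matching the initials in the hint back to a contributor is left out, since the markup gives no reliable key for it.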
