jakelever / biotext

Get a nicely-chunked local copy of the biomedical literature (to use for other projects)!

License: MIT License

Python 95.39% Shell 4.61%
text-mining pubmed pubmed-central bioc pubtator snakemake

biotext's Introduction

BioText (with added PubTator)

Sometimes you need an easily updated local copy of PubMed and PubMed Central, and sometimes (but not always) you want PubTator entity annotations on those articles. This project can help with that. It manages the download of PubMed and PubMed Central and converts them into the BioC XML format while keeping important metadata. As a separate step, it can load PubTator Central annotations and align them to the documents. It also handles updates without redoing all the previous downloading and computation.

Advantages

  • Deals with format conversion
  • Chunks PubMed Central (which is normally ~2,000,000 files) into larger files that are easier to parallelise
  • Uses Snakemake, so can be deployed on a cluster
  • Can add PubTator Central annotations (of chemicals, genes, diseases, etc) to the text

Details

PubMed is released as a series of XML files with a baseline of files and updates released daily. Each file has tens of thousands of titles and abstracts along with metadata. Each update file may contain new documents or updates to previous documents. These files follow the PubMed XML standard. This project converts each file into the BioC format.

PubMed Central offers full-text articles of documents in a different XML format. A portion of PubMed Central is released for text mining as the non-commercial and commercial licensed PubMed Central Open Access subset and the Author Manuscript Collection. PubMed Central is released as about 15 archives of XML files. Each archive has a very large number of files, which makes it somewhat unwieldy. Each new version of these archives contains a mix of new files and old files which need to be distinguished. This project identifies unprocessed files, groups them into chunks (of 2000 documents by default) and converts them to BioC XML.
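
The grouping step can be pictured with a minimal sketch. This is not the project's code: the `chunked` helper and the file names are hypothetical, and only the default size of 2000 comes from the description above.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(paths: Iterable[str], size: int = 2000) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` paths."""
    iterator = iter(paths)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical file names standing in for unprocessed PMC articles
unprocessed = [f"PMC{i}.nxml" for i in range(4500)]
chunk_sizes = [len(chunk) for chunk in chunked(unprocessed)]
print(chunk_sizes)  # [2000, 2000, 500]
```

Each chunk can then be converted independently, which is what makes the conversion step easy to parallelise.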

Things To Be Aware Of

There are a few details that you should keep in the back of your mind when using this project.

  • This project does not deduplicate documents, whether they are repeated across PubMed update files or appear in both PubMed Central and PubMed. Any text mining of these documents should do a final pass to identify the latest version of each document, i.e. going through new-to-old PubMed Central files before new-to-old PubMed files.
  • PubMed Central files contain a lot of Unicode characters while PubMed generally does not. An abstract for an article that is in both resources may be processed differently in the PubMed Central file due to Unicode characters.
  • Yearly releases of PubMed mean that a yearly cleanup is required. More details are in the Yearly Baseline Releases section below, and BioText will throw an error to warn you about a new release.
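
A minimal sketch of the final deduplication pass mentioned in the first point, assuming each document is available as an (identifier, text) pair and that the stream is ordered newest first, with PubMed Central before PubMed. The function name, the identifiers and the sample data are illustrative and not part of BioText.

```python
from typing import Dict, Iterable, Tuple


def latest_versions(docs_new_to_old: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep the first version seen of each document identifier."""
    kept: Dict[str, str] = {}
    for doc_id, text in docs_new_to_old:
        # Earlier items are newer, so never overwrite an existing entry
        kept.setdefault(doc_id, text)
    return kept


# Hypothetical stream: PMC versions first, then PubMed updates, then the baseline
stream = [
    ("12345", "full text from PubMed Central"),
    ("12345", "abstract from a PubMed update file"),
    ("67890", "abstract from a PubMed update file"),
    ("67890", "abstract from the PubMed baseline"),
]
deduplicated = latest_versions(stream)
```

For a full corpus the dictionary would more likely hold file locations than whole texts, but the ordering logic stays the same.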

PubTator Annotation

As an optional extra, you can get PubTator Central annotations added to the documents. This uses the method outlined in Lever et al, PSB 2020. It downloads the latest version of the PubTator Central annotation alignments and identifies their locations in each document. This doubles the disk space requirement.

Usage

There are two core steps, shown below with single-core Snakemake calls for downloading and conversion. Suggestions for using a cluster are further below.

# 1. Downloading and grouping PubMed Central (which is a single thread)
snakemake --cores 1 downloaded.flag

# 2. Converting PubMed files and PubMed Central groups of files (which can be parallelised).
snakemake --cores 1 converted.flag

Those steps will download PubMed Central to a pmc_archives directory and create a biocxml directory with the converted files.

Those calls to snakemake can then be augmented to use a cluster (or whatever local setup you have), e.g.

# Run a hundred jobs at a time on a SLURM cluster using sbatch
snakemake -j 100 --cluster ' sbatch' --latency-wait 60 converted.flag

The commands for running the PubTator alignments are below. Please add appropriate cluster flags.

# Download the PubTator file
snakemake --cores 1 pubtator_downloaded.flag

# Run the conversions on all the files in biocxml/
snakemake --cores 1 pubtator.flag

Dependencies

This project requires Python 3 with dependencies that can be installed with pip.

pip install -U snakemake bioc ftputil

For testing, it also uses biopython.

pip install -U biopython

Yearly Baseline Releases

Every year (typically in Nov/Dec), PubMed gets a new baseline release, and the subsequent daily updates build on it. BioText will throw an error (below) if it sees any old baseline/update files in the biocxml/ directory, which happens when a new baseline is released. You can see the year of the release from the first number in the filename. For example, pubmed_updatefiles_20n1478.bioc.xml is from the 2020 release.

When this happens, it's time for a yearly clean-out. You should delete the old PubMed files (which will likely be all PubMed files in biocxml). You will also need to delete any downstream files based upon these files to make sure that other projects don't end up with duplicate files.

AssertionError in line 66 of /projects/jlever/github/biotext/Snakefile:
Found unexpected PubMed files (e.g. biocxml/pubmed_baseline_20n0001.bioc.xml) in biocxml directory. Likely due to a new PubMed baseline release. These should be manually deleted as well as downstream files. Check the project README for more details under section Yearly Baseline Releases.
  File "/projects/jlever/github/biotext/Snakefile", line 66, in <module>
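
To find which files belong to an old release, the year can be read from the filename. The sketch below is an illustration only: `release_year` is a hypothetical helper, the pattern is inferred from the two filenames shown in this section, and it assumes the two-digit prefix always refers to a year in the 2000s.

```python
import re
from typing import Optional

# Pattern inferred from names such as pubmed_updatefiles_20n1478.bioc.xml
PUBMED_NAME = re.compile(r"pubmed_(?:baseline|updatefiles)_(\d{2})n\d+\.bioc\.xml$")


def release_year(filename: str) -> Optional[int]:
    """Return the PubMed release year encoded in a filename, or None if it does not match."""
    match = PUBMED_NAME.search(filename)
    if match is None:
        return None
    # Assumption: the two digits are the last two digits of a year in the 2000s
    return 2000 + int(match.group(1))


print(release_year("pubmed_updatefiles_20n1478.bioc.xml"))  # 2020
```

Files whose year is older than the current baseline are the ones to delete during the clean-out.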

Contributing

Contributions are very welcome.

License

Distributed under the terms of the MIT license, "BioText" is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.

biotext's People

Contributors

creisle, jakelever

biotext's Issues

Proposed Parsing Updates

Since I've been going through these in such detail, I've noticed a few cases where the output doesn't look like what I would expect, but I want to clear them with you @jakelever before I make the appropriate changes. I've listed them in the table below.

Input XML | Proposed Output | Current Output
incubator containing 5% CO<sub>2</sub> | incubator containing 5% CO2 | incubator containing 5% CO 2
10<sup>4</sup> | 10^4 | 10 4
especially in <italic>CBL</italic>-W802* cells | especially in CBL-W802* cells | especially in CBL -W802* cells
influenced by the presence of allelic variants&#x2014;GSTP1 Ile<sub>105</sub>Val (rs1695) and <italic>GSTP1</italic> Ala<sub>114</sub>Val (rs1138272), with homozygote | influenced by the presence of allelic variants--GSTP1 Ile105Val (rs1695) and GSTP1 Ala114Val (rs1138272), with homozygote | influenced by the presence of allelic variants—GSTP1 Ile 105 Val (rs1695) and GSTP1 Ala 114 Val (rs1138272), with homozygote
breast cancer, clear cell renal carcinoma, and colon cancer<xref ref-type="bibr" rid="b6">6</xref><xref ref-type="bibr" rid="b7">7</xref> <xref ref-type="bibr" rid="b8">8</xref> <xref ref-type="bibr" rid="b9">9</xref> <xref ref-type="bibr" rid="b10">10</xref> have successfully identified | breast cancer, clear cell renal carcinoma, and colon cancer have successfully identified | breast cancer, clear cell renal carcinoma, and colon cancerhave successfully identified
, and in the transgenic\nGATA-1,\n<sup>low</sup> mouse | , and in the transgenic GATA-1, low mouse | , and in the transgenicGATA-1, low mouse
we selected an allele (designated <italic>cic</italic><sup><italic>4</italic></sup>) that removes | we selected an allele (designated cic^4) that removes | we selected an allele (designated cic 4) that removes
regulation of the Wnt-&#x3B2;-catenin pathway | regulation of the Wnt-beta-catenin pathway | regulation of the Wnt-β-catenin pathway
the specific HPV<sup>+</sup> gene expression | the specific HPV+ gene expression | the specific HPV + gene expression
known to be resistant to 1<sup>st</sup> and 2<sup>nd</sup> generation EGFR-TKIS, osimertinib | known to be resistant to 1st and 2nd generation EGFR-TKIS, osimertinib | known to be resistant to 1 st and 2 nd generation EGFR-TKIS, osimertinib
at 37&#xB0;C in a humidified 5% CO<sub>2</sub> incubator | at 37 deg C in a humidified 5% CO2 incubator | at 37°C in a humidified 5% CO 2 incubator
seeded at concentrations below 1 &#xD7; 10<sup>6</sup>/ml, selected | seeded at concentrations below 1 x 10^6/ml, selected | seeded at concentrations below 1 × 10 6 /ml, selected
9 patients with a <italic>BRAF</italic>-mutant tumour | 9 patients with a BRAF-mutant tumour | 9 patients with a BRAF -mutant tumour
patients with <italic>BRAF</italic><sup>WT</sup> tumours | patients with BRAF-WT tumours | patients with BRAF WT tumours
MSI<sup>hi</sup> tumours | MSI-hi tumours | MSI hi tumours
upper limit of normal, creatinine clearance &#x2A7E;30&#x2009;ml&#x2009;min<sup>&#x2212;1</sup>, | upper limit of normal, creatinine clearance ⩾30 ml min^-1, | upper limit of normal, creatinine clearance ⩾30 ml min −1,
the oncometabolite R(&#x2013;)-2-hydroxyglutarate at the | the oncometabolite R(-)-2-hydroxyglutarate at the | the oncometabolite R-2-hydroxyglutarate at the
[<sup>3</sup>H]-Thymidine | [3H]-Thymidine | [ 3 H]-Thymidine

Disabling tables as default

Converted table text isn't well represented in the BioC format. Currently the code pulls the table text into BioC passages in a tab-delimited format, but this causes issues downstream when attempting to detect sentences within these tables. For now, tables are disabled by default and can be re-enabled as needed.

Troublesome PMC file stalls conversion

Hey Cara, a very large single PMC file seems to be stalling the strip_annotation_markers function during conversion. I've left it running for a few hours and it never finishes.

The problem article is PMC4829797 which seems to be a book. The file is very big but I don't think it should stall the conversion completely. The file is: PMC4829797.xml.gz

For the moment, I'm basically skipping strip_annotation_markers so that I can do a full run for Cancermine, etc.

The conversion command was:

python src/convert.py --i PMC4829797.xml --iFormat pmcxml --o test.bioc.xml --oFormat biocxml

Keep Citation Information as Annotations

As discussed offline, it would be useful to be able to keep the in-text citation information as annotations in bioc format. I've had a crack at this as an offshoot of my tables PR #2 since it lays some groundwork that helps. I made this ticket for discussing the particulars.

Better table header handling

I've been using the linearized tables, but one thing I've noticed is that for something complex like a multi-level header, simply linearizing means the number of header cells doesn't always match the number of body cells. Take a header like this:

[image of a table with a multi-level header, not preserved]

example article used: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2873663/

Currently gets turned into

p53 MUTATION FUNCTIONALa STATUS IARC DATABASEb FEATURESc SOMATIC GERMLINE FAMILIES TOTAL BREAST

And we lose a lot of meaning, not to mention that it becomes impossible to match these up properly to the cell text from the body of the table (see below).

p53 MUTATION FUNCTIONALa STATUS IARC DATABASEb FEATURESc SOMATIC GERMLINE FAMILIES TOTAL BREAST
T125R ALTERED 2 1 0

So I'd like to try something more complex, where we simplify the header into a single row before we linearize. It would require making the text differ slightly from the original by repeating some words, which I'm not sure about. The end result would look like this:

p53 MUTATION FUNCTIONALa STATUS IARC DATABASEb SOMATIC TOTAL IARC DATABASEb SOMATIC BREAST IARC DATABASEb GERMLINE FAMILIES FEATURESc
T125R ALTERED 2 1 0

@jakelever what do you think? I've already been implementing this for my own purposes but would be happy to put up a PR if you like the idea
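
One way to picture the proposed header simplification is the sketch below. It is an illustration under assumptions, not the implementation from any PR: header cells are modelled as hypothetical (text, colspan, rowspan) tuples, and the three-row layout is a guess at the original table, chosen so that the result matches the proposed single-row header above.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (text, colspan, rowspan)


def flatten_header(rows: List[List[Cell]]) -> List[str]:
    """Expand spans into a grid, then join each column's labels top to bottom."""
    grid: List[Dict[int, str]] = [dict() for _ in rows]
    for r, row in enumerate(rows):
        col = 0
        for text, colspan, rowspan in row:
            # Skip columns already occupied by a rowspan from an earlier row
            while col in grid[r]:
                col += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    if r + dr < len(rows):
                        grid[r + dr][col + dc] = text
            col += colspan
    n_cols = max(max(g) for g in grid if g) + 1
    headers = []
    for c in range(n_cols):
        parts: List[str] = []
        for g in grid:
            text = g.get(c, "")
            # A cell spanning several rows should contribute its text only once
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        headers.append(" ".join(parts))
    return headers


# Assumed layout of the multi-level header
header_rows = [
    [("p53 MUTATION", 1, 3), ("FUNCTIONALa STATUS", 1, 3),
     ("IARC DATABASEb", 3, 1), ("FEATURESc", 1, 3)],
    [("SOMATIC", 2, 1), ("GERMLINE", 1, 1)],
    [("TOTAL", 1, 1), ("BREAST", 1, 1), ("FAMILIES", 1, 1)],
]
flat = flatten_header(header_rows)
```

With this layout every body row has one header label per column, at the cost of repeating the parent labels.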

Dealing with broken PMC files without xlink namespace

A small number of PMC files use the xlink namespace without defining it first. For example, the documents include "xlink:href" where "xlink" hasn't been defined. This breaks the XML parser and gives errors like the one below.

Traceback (most recent call last):
  File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 390, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 274, in process_pmc_file
    for event, elem in etree.iterparse(source, events=("start", "end", "start-ns", "end-ns")):
  File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1222, in iterator
    yield from pullparser.read_events()
  File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1297, in read_events
    raise event
  File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1269, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: unbound prefix: line 12, column 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/convertPMC.py", line 56, in <module>
    for bioc_doc in pmcxml2bioc(io.StringIO(data)):
  File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 450, in pmcxml2bioc
    raise RuntimeError("Parsing error in PMC xml file: %s" % source)
RuntimeError: Parsing error in PMC xml file: <_io.StringIO object at 0x7f04d1099c18>

An initial hacky fix was implemented in 63663fe and e30c3e9. It tried to fix href-specific cases. This needs to be explored further, as a new file with a problem unrelated to href has appeared.
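
For illustration, the failure can be reproduced with a small document, and one possible workaround is to declare the prefix on the root element before parsing. This is a sketch of that idea only; it is not the fix in the commits mentioned above, `declare_xlink` is a hypothetical helper, and the sample document is made up.

```python
import re
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"


def declare_xlink(xml_text: str) -> str:
    """Add an xmlns:xlink declaration to the root element if none exists."""
    if "xmlns:xlink" in xml_text:
        return xml_text
    # The first tag that starts with a name character is the root element;
    # the XML declaration, comments and DOCTYPE start with "<?" or "<!".
    return re.sub(
        r"<([A-Za-z_][\w.\-]*)",
        rf'<\1 xmlns:xlink="{XLINK_NS}"',
        xml_text,
        count=1,
    )


broken = '<article><ext-link xlink:href="https://example.org">link</ext-link></article>'

try:
    ET.fromstring(broken)
    parse_failed = False
except ET.ParseError:
    # The parser rejects the undeclared "xlink" prefix
    parse_failed = True

root = ET.fromstring(declare_xlink(broken))
href = root.find("ext-link").get(f"{{{XLINK_NS}}}href")
```

Editing the raw text is crude, and it only covers the xlink prefix; other undeclared prefixes would need the same treatment or a more general approach.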

Special Case: Table with multiple separated header lines

Examples from here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3704624/

Mutations detected in samples.
BRAFV600 (Sanger) | Melanoma (n=45) | Associated nevus (n=46) | Control nevus (n=25) | Matched melanoma (n=28)
V600E | 51.1% (n=23) | 63.0% (n=29) | 52.0% (n=13) | 39.3% (n=11)
V600K | 0 | 0 | 4.0% (n=1) | 0
Wildtype | 48.9% (n=22) | 37.0% (n=17) | 44.0% (n=11) | 60.7% (n=17)
BRAFV600E (Sanger + VE1 IHC) | Melanoma (n=46) | Associated nevus (n=46) | Control nevus (n=25) | Matched melanoma (n=29)
V600E | 63.0% (n=29) | 65.2% (n=30) | 54.2% (n=13) | 41.4% (n=12)
Wildtype | 37.0% (n=17) | 34.8% (n=16) | 48.0% (n=12) | 58.6% (n=17)
NRAS Exon 2 (Sanger) | Melanoma (n=42) | Associated nevus (n=44) | Control nevus (n=21) | Matched melanoma (n=26)
Silent mutations | 2.4% (n=1; A66A) | 2.3% (n=1; L52L) | 0 | 0
Q61K | 4.8% (n=2) | 4.5% (n=2) | 14.3% (n=3) | 0
Q61L | 2.4% (n=1) | 2.3% (n=1) | 0 | 0
Q61R | 2.4% (n=1) | 9.1% (n=4) | 0 | 7.7% (n=2)
Wildtype | 88.1% (n=37) | 81.8% (n=36) | 85.7% (n=18) | 92.3% (n=24)

This should really be 3 separate tables, so I'm not sure how we can parse this properly. Right now the parser assumes anything in the tbody isn't a header.

Special Case: Pubmed Abstract Headers dropped in bioc output

For the following articles, the content of the section headers "Aim", "Conclusion", etc. is dropped in the final bioc output which means we lose some context

https://pubmed.ncbi.nlm.nih.gov/26161928/

Input XML | Proposed Parse | Current Parse
<AbstractText Label="AIM" NlmCategory="OBJECTIVE">To investigate the impact of KRAS mutation variants on the activity of regorafenib in SW48 colorectal cancer cells.</AbstractText> | AIM: To investigate the impact of KRAS mutation variants on the activity of regorafenib in SW48 colorectal cancer cells. | To investigate the impact of KRAS mutation variants on the activity of regorafenib in SW48 colorectal cancer cells.
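
The proposed parse could be produced along these lines. This is a minimal sketch using the standard library parser; `abstract_text_with_label` is a hypothetical name and not a function in the project.

```python
import xml.etree.ElementTree as ET


def abstract_text_with_label(elem: ET.Element) -> str:
    """Return the section text, prefixed with its Label attribute when present."""
    text = "".join(elem.itertext()).strip()
    label = elem.get("Label")
    return f"{label}: {text}" if label else text


xml = (
    '<AbstractText Label="AIM" NlmCategory="OBJECTIVE">To investigate the impact '
    "of KRAS mutation variants on the activity of regorafenib in SW48 colorectal "
    "cancer cells.</AbstractText>"
)
parsed = abstract_text_with_label(ET.fromstring(xml))
print(parsed)
```

Sections without a Label attribute are returned unchanged, so unstructured abstracts are not affected.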

Citation annotation offset outside passage

Hey @creisle, I've come across a citation annotation that is outside the associated passage. One of my scripts checks some things on BioC files and this got flagged. It doesn't seem right to me. What do you think?

Below is an example where the passage offset is 56733 but the zero-length citation is at offset 56732 which is just before the passage starts.

    <passage>
      <infon key="section">floating</infon>
      <infon key="subsection">None</infon>
      <infon key="xml_path">floats-group/table-wrap/table/thead</infon>
      <offset>56733</offset>
      <text> (µg/mL)    S-2366 K M (mM) V MAX (mAU/min)</text>
      <annotation id="ANN_c6f8f533-0764-4484-8663-c18655ca06f3">
        <infon key="citation_text">1</infon>
        <infon key="type">citation</infon>
        <location offset="56732" length="0"/>
        <text/>
      </annotation>
    </passage>

To reproduce, I've included the source PMC XML file: PMC8466798.xml.gz and I converted it with the line below.

python src/convert.py --i PMC008xxxxxx/PMC8466798.xml --iFormat pmcxml --o test.bioc.xml --oFormat biocxml
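
A check of this kind can be sketched as follows. It is not the script mentioned above, just an assumed equivalent: `annotations_outside_passage` is a hypothetical helper that flags any annotation whose location starts before the passage offset or ends after the passage text. The sample is a trimmed copy of the passage shown earlier.

```python
import xml.etree.ElementTree as ET
from typing import List


def annotations_outside_passage(passage: ET.Element) -> List[str]:
    """Return ids of annotations whose location falls outside the passage span."""
    start = int(passage.findtext("offset"))
    end = start + len(passage.findtext("text") or "")
    flagged = []
    for annotation in passage.findall("annotation"):
        for location in annotation.findall("location"):
            ann_start = int(location.get("offset"))
            ann_end = ann_start + int(location.get("length"))
            if ann_start < start or ann_end > end:
                flagged.append(annotation.get("id"))
    return flagged


sample = """
<passage>
  <offset>56733</offset>
  <text> (µg/mL)    S-2366 K M (mM) V MAX (mAU/min)</text>
  <annotation id="ANN_c6f8f533-0764-4484-8663-c18655ca06f3">
    <infon key="type">citation</infon>
    <location offset="56732" length="0"/>
    <text/>
  </annotation>
</passage>
"""
flagged = annotations_outside_passage(ET.fromstring(sample))
print(flagged)
```

Here the annotation is flagged because 56732 is one character before the passage offset of 56733.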

Testing re-write-parser branch

Hey, I'm just running some tests on the re-write-parser branch as we discussed. I tried to do a full-run and ran into an error below. I narrowed it down to a file (I think) and had to fix an error there too.

Full Run Issue

# Commands for a full run
snakemake --cores 1 downloaded.flag
snakemake --cores 8 converted.flag
Traceback (most recent call last):
  File "src/convertPMC.py", line 48, in <module>
    for bioc_doc in pmcxml2bioc(io.StringIO(data)):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 372, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 324, in process_pmc_file
    article_elem, tag_handlers=tag_handlers
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 134, in extract_article_content
    article_elem.findall("./body"), tag_handlers=tag_handlers, annotations_map=annotations_map
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 368, in extract_text_chunks
    raw_text_chunks.extend(tag_handler(elem, tag_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  [Previous line repeated 5 more times]
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 225, in tag_handler
    for child in merge_adjacent_xref_siblings(elem):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 175, in merge_adjacent_xref_siblings
    prev_tail = siblings[-1].tail.strip()
AttributeError: 'NoneType' object has no attribute 'strip'
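
The AttributeError arises because an element's tail is None when no text follows it, for example between two adjacent xref elements. A minimal sketch of a guard is below; `tail_text` is a hypothetical helper and the sample paragraph is made up, so this shows the general pattern rather than the change made on the branch.

```python
import xml.etree.ElementTree as ET


def tail_text(elem: ET.Element) -> str:
    """Return the element's tail with whitespace stripped, or '' if there is no tail."""
    return (elem.tail or "").strip()


para = ET.fromstring(
    '<p>colon cancer<xref rid="b6">6</xref><xref rid="b7">7</xref> have shown</p>'
)
first, second = list(para)

print(first.tail)          # None, because nothing sits between the two xref elements
print(tail_text(first))    # empty string instead of an AttributeError
print(tail_text(second))   # have shown
```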

Small File Test Case Error

I got it to dump out which file it was processing when it crashed and it seems to be the file Molecules/PMC6259225.nxml from the comm_use.I-N.xml.tar.gz archive. I have attached it.

PMC6259225.nxml.gz

I got a different error (due to my hacky fix for the invalid PMC XML files)

(mypy3) [jlever@munin biotext]$ python src/convert.py --i Molecules/PMC6259225.nxml --iFormat pmcxml --o test.bioc.xml --oFormat biocxml
Converting 1 files to test.bioc.xml
Traceback (most recent call last):
  File "src/convert.py", line 26, in <module>
    convert(inFiles,inFormat,args.o,outFormat)
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/main.py", line 49, in convert
    for bioc_doc in docs2bioc(in_file, in_format, **kwargs):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 372, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 252, in process_pmc_file
    content = source.read()
AttributeError: 'str' object has no attribute 'read'
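
This second error comes from passing a file path where a file-like object is expected. One way to accept both is sketched below; `read_source` is a hypothetical helper, not the project's function, and the UTF-8 encoding is an assumption.

```python
import io
import os
import tempfile
from typing import IO, Union


def read_source(source: Union[str, IO[str]]) -> str:
    """Return XML content from either a file path or an open text stream."""
    if hasattr(source, "read"):
        return source.read()
    # Assumption: files on disk are UTF-8 encoded
    with open(source, encoding="utf-8") as handle:
        return handle.read()


# Works with an in-memory stream, as used by the full-run script
from_stream = read_source(io.StringIO("<article/>"))

# Works with a path, as used by the single-file conversion command
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.nxml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("<article/>")
    from_path = read_source(path)
```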

New PMC data format (baseline + increments)

As noted in #5, there is a new PMC bulk download format described at https://www.ncbi.nlm.nih.gov/pmc/about/new-in-pmc/#2021-09-21. We'll need to make a few adjustments to how PMC data is dealt with. Our code actually does its own baseline + increments system, so it'll mostly be cutting out code that isn't needed anymore.

PubMed Central (PMC) has made significant improvements to the bulk retrieval of two of the PMC Article Datasets from our FTP service. The improvements were made to bulk packages which include metadata and full text files of articles in XML or plain text formats for the PMC Open Access (OA) Subset and the Author Manuscript Dataset, which combined encompass more than half of the 7 million articles in PMC. To improve the usability of these two datasets, PMC has redesigned the bulk download directory structure and file packages on our FTP service. The new structure includes:

  • baseline packages that contain all articles available in PMC as of the baseline date for each respective dataset or grouping; and
  • daily incremental packages for each respective dataset or grouping that only contain articles that are new to the dataset or that have been updated since the baseline or previous incremental file was created.

Retain supplementary table/figure mentions

I ran across an example where, in our latest version, supplementary table/figure mentions are still being stripped out in some cases. For example:

The NTRK3 G623R mutation conferred even greater loss of sensitivity to the other tested Trk inhibitors, TSR-011 (Tesaro) and LOXO-101 (LOXO), eliciting IC50 proliferation values of >1000 nM (supplementary Figure S4C, available at Annals of Oncology online).

Relevant PMC article: 4843186

Table normalization cases

Case Example Article License table index Tests
header hierarchical colspans PMC5029658 CC-BY 1,2,3
body hierarchical rowspans PMC5029658 CC-BY 0
header hierarchical colspans PMC2873663 author version redundant
body full colspans PMC7461630 author version redundant
body full colspans PMC4919728 CC BY-NC-ND 0
body partial colspans PMC4049792 CC-BY NC 0
paragraphs inside table cells PMC6580637 CC-BY 2
empty cells that should be repeated PMC4816447 CC-BY 0
