jakelever / biotext

Get a nicely-chunked local copy of the biomedical literature (to use for other projects)!

License: MIT License

text-mining pubmed pubmed-central bioc pubtator snakemake

biotext's Introduction

BioText (with added PubTator)

Sometimes you need an easily-updated local copy of PubMed and PubMed Central, and sometimes (but not always) you want annotations of entities from PubTator on those articles. This project can help with that. It manages the download of PubMed and PubMed Central and converts them into the nice BioC XML format while keeping important metadata. As a separate step, it can load PubTator Central annotations and align them to the documents. It also handles the update process without redoing all the previous downloading and computation.

Advantages

Deals with format conversion
Chunks PubMed Central (which is normally ~2,000,000 files) into larger files that are easier to parallelise
Uses Snakemake, so can be deployed on a cluster
Can add PubTator Central annotations (of chemicals, genes, diseases, etc) to the text

Details

PubMed is released as a series of XML files with a baseline of files and updates released daily. Each file has tens of thousands of titles and abstracts along with metadata. Each update file may contain new documents or updates to previous documents. These files follow the PubMed XML standard. This project converts each file into the BioC format.

PubMed Central offers full-text articles of documents in a different XML format. A portion of PubMed Central is released for text mining as the non-commercial and commercial licensed PubMed Central Open Access subset and the Author Manuscript Collection. PubMed Central is released as about 15 archives of XML files. Each archive has a very large number of files, which makes it somewhat unwieldy. Each new version of these archives contains a mix of new files and old files, which need to be distinguished. This project identifies unprocessed files, groups them into chunks (of 2000 documents by default) and converts them to BioC XML.

Things To Be Aware Of

There are a few details that you should keep in the back of your mind when using this project.

This project does not deal with duplicate documents, either within the PubMed update files or between PubMed Central and PubMed. Any text mining of these documents should do a final pass to identify the latest version of a document, i.e. going through new-to-old PubMed Central files before new-to-old PubMed files.
PubMed Central files contain a lot of Unicode characters while PubMed generally does not. An abstract for an article that is in both resources may be processed differently in the PubMed Central file due to Unicode characters.
Yearly releases of PubMed mean that a yearly cleanup is required. More details are in the Yearly Baseline Releases section below, and BioText will throw an error to warn you about a new release.
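The final deduplication pass mentioned in the first point could look roughly like the sketch below. It is not part of BioText. It assumes the caller has already ordered the documents newest first (PubMed Central files new-to-old, then PubMed files new-to-old) and that each document carries a shared identifier such as a PMID; the name `latest_versions` is hypothetical.

```python
def latest_versions(documents):
    """Keep only the first (newest) occurrence of each document ID.

    documents: iterable of (doc_id, document) pairs, ordered newest first.
    """
    seen = set()
    for doc_id, document in documents:
        if doc_id in seen:
            continue  # an older version of a document we already kept
        seen.add(doc_id)
        yield doc_id, document


# Example with placeholder payloads, newest first
stream = [("101", "PMC full text"), ("102", "updated abstract"),
          ("101", "abstract"), ("102", "original abstract")]
print(list(latest_versions(stream)))
```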

PubTator Annotation

As an optional extra, you can get PubTator Central annotations added to the documents. This uses the method outlined in Lever et al, PSB 2020. It downloads the latest version of the PubTator Central annotation alignments and identifies their locations in each document. This doubles the disk space requirement.

Usage

There are two core steps, shown below as single-core Snakemake calls for downloading and conversion. Suggestions for using a cluster are further below.

# 1. Downloading and grouping PubMed Central (which is a single thread)
snakemake --cores 1 downloaded.flag

# 2. Converting PubMed files and PubMed Central groups of files (which can be parallelised).
snakemake --cores 1 converted.flag

Those steps will download PubMed Central to a pmc_archives directory and create a biocxml directory with the converted files.

Those calls to snakemake can then be augmented to use a cluster (or whatever local setup you have), e.g.

# Run a hundred jobs at a time on a SLURM cluster using sbatch
snakemake -j 100 --cluster ' sbatch' --latency-wait 60 converted.flag

The commands for running the PubTator alignments are below. Please add appropriate cluster flags.

# Download the PubTator file
snakemake --cores 1 pubtator_downloaded.flag

# Run the conversions on all the files in biocxml/
snakemake --cores 1 pubtator.flag

Dependencies

This project requires Python 3 with dependencies that can be installed with pip.

pip install -U snakemake bioc ftputil

For testing, it also uses biopython.

pip install -U biopython

Yearly Baseline Releases

Every year (typically in Nov/Dec), PubMed is given a new baseline release, and subsequent daily updates are based on it. BioText will throw an error (below) if it sees any old baseline/update files in the biocxml/ directory. This will happen when a new baseline is released. The first number in the filename gives the year of the release. For example, pubmed_updatefiles_20n1478.bioc.xml is from the 2020 release.

When this happens, it's time for a yearly clean-out. You should delete the old PubMed files (which will likely be all PubMed files in biocxml). You will also need to delete any downstream files based upon these files to make sure that other projects don't end up with duplicate files.

AssertionError in line 66 of /projects/jlever/github/biotext/Snakefile:
Found unexpected PubMed files (e.g. biocxml/pubmed_baseline_20n0001.bioc.xml) in biocxml directory. Likely due to a new PubMed baseline release. These should be manually deleted as well as downstream files. Check the project README for more details under section Yearly Baseline Releases.
  File "/projects/jlever/github/biotext/Snakefile", line 66, in <module>
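To find which files need deleting, the release year can be read from the filename. The sketch below is not the check used in the Snakefile. It assumes that filenames follow the two patterns shown in this README and that the two-digit number maps to a year in the 2000s; the function names are hypothetical.

```python
import re

# Matches e.g. pubmed_baseline_20n0001.bioc.xml or pubmed_updatefiles_20n1478.bioc.xml
PUBMED_FILE = re.compile(r"pubmed_(?:baseline|updatefiles)_(\d{2})n\d+\.bioc\.xml$")


def release_year(filename):
    """Return the release year encoded in a PubMed BioC filename, or None."""
    match = PUBMED_FILE.search(filename)
    if match is None:
        return None  # not a PubMed file (e.g. a PubMed Central chunk)
    return 2000 + int(match.group(1))  # assumption: two-digit year in the 2000s


def stale_files(filenames, current_year):
    """List PubMed files that belong to a release other than current_year."""
    return [name for name in filenames
            if release_year(name) not in (None, current_year)]


print(release_year("pubmed_updatefiles_20n1478.bioc.xml"))
```

Remember that any downstream files built from the stale files need to be deleted as well.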

Contributing

Contributions are very welcome.

License

Distributed under the terms of the MIT license, "BioText" is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.

biotext's Issues

Proposed Parsing Updates

Since I've been going through these in such detail, I've noticed a few cases where the output doesn't look like what I would expect, but I want to clear them with you @jakelever before I make the appropriate changes. I've listed them in the table below.

Input XML	Proposed Output	Current Output
`incubator containing 5% CO<sub>2</sub>`	incubator containing 5% CO2	incubator containing 5% CO 2
`10<sup>4</sup>`	10^4	10 4
`especially in <italic>CBL</italic>-W802* cells`	especially in CBL-W802* cells	especially in CBL -W802* cells
`influenced by the presence of allelic variants—GSTP1 Ile<sub>105</sub>Val (rs1695) and <italic>GSTP1</italic> Ala<sub>114</sub>Val (rs1138272), with homozygote`	influenced by the presence of allelic variants--GSTP1 Ile105Val (rs1695) and GSTP1 Ala114Val (rs1138272), with homozygote	influenced by the presence of allelic variants—GSTP1 Ile 105 Val (rs1695) and GSTP1 Ala 114 Val (rs1138272), with homozygote
`breast cancer, clear cell renal carcinoma, and colon cancer<xref ref-type="bibr" rid="b6">6</xref><xref ref-type="bibr" rid="b7">7</xref> <xref ref-type="bibr" rid="b8">8</xref> <xref ref-type="bibr" rid="b9">9</xref> <xref ref-type="bibr" rid="b10">10</xref> have successfully identified`	breast cancer, clear cell renal carcinoma, and colon cancer have successfully identified	breast cancer, clear cell renal carcinoma, and colon cancerhave successfully identified
`, and in the transgenic\nGATA-1,\n<sup>low</sup> mouse`	, and in the transgenic GATA-1, low mouse	, and in the transgenicGATA-1, low mouse
`we selected an allele (designated <italic>cic</italic><sup><italic>4</italic></sup>) that removes`	we selected an allele (designated cic^4) that removes	we selected an allele (designated cic 4) that removes
`regulation of the Wnt-β-catenin pathway`	regulation of the Wnt-beta-catenin pathway	regulation of the Wnt-β-catenin pathway
`the specific HPV<sup>+</sup> gene expression`	the specific HPV+ gene expression	the specific HPV + gene expression
`known to be resistant to 1<sup>st</sup> and 2<sup>nd</sup> generation EGFR-TKIS, osimertinib`	known to be resistant to 1st and 2nd generation EGFR-TKIS, osimertinib	known to be resistant to 1 st and 2 nd generation EGFR-TKIS, osimertinib
`at 37°C in a humidified 5% CO<sub>2</sub> incubator`	at 37 deg C in a humidified 5% CO2 incubator	at 37°C in a humidified 5% CO 2 incubator
`seeded at concentrations below 1 × 10<sup>6</sup>/ml, selected`	seeded at concentrations below 1 x 10^6/ml, selected	seeded at concentrations below 1 × 10 6 /ml, selected
`9 patients with a <italic>BRAF</italic>-mutant tumour`	9 patients with a BRAF-mutant tumour	9 patients with a BRAF -mutant tumour
`patients with <italic>BRAF</italic><sup>WT</sup> tumours`	patients with BRAF-WT tumours	patients with BRAF WT tumours
`MSI<sup>hi</sup> tumours`	MSI-hi tumours	MSI hi tumours
`upper limit of normal, creatinine clearance ⩾30 ml min<sup>−1</sup>,`	upper limit of normal, creatinine clearance ⩾30 ml min^-1,	upper limit of normal, creatinine clearance ⩾30 ml min −1,
`the oncometabolite R(–)-2-hydroxyglutarate at the`	the oncometabolite R(-)-2-hydroxyglutarate at the	the oncometabolite R-2-hydroxyglutarate at the
`[<sup>3</sup>H]-Thymidine`	[3H]-Thymidine	[ 3 H]-Thymidine
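A rough sketch of how some of these rules could be implemented with the standard library is below. It is not the converter's actual code and only covers a subset of the table: inline tags are joined to their neighbours without extra spaces, and a purely numeric superscript that follows a letter or digit gets a caret. Other rows (the BRAF-WT style hyphen, citation removal, Unicode replacements, whitespace handling) are deliberately left out. The function names and the numeric-superscript rule are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# A superscript made only of digits, optionally with an ASCII or Unicode minus
NUMERIC = re.compile(r"^[-\u2212]?\d+$")


def flatten(elem):
    """Return the text of elem with inline tags merged into their neighbours."""
    out = elem.text or ""
    for child in elem:
        inner = flatten(child)
        # Exponent-like superscripts: 10<sup>4</sup> -> 10^4, but [<sup>3</sup>H] -> [3H]
        if child.tag == "sup" and NUMERIC.match(inner) and out[-1:].isalnum():
            inner = "^" + inner.replace("\u2212", "-")
        out += inner + (child.tail or "")
    return out


def flatten_fragment(xml_fragment):
    """Parse a fragment of inline markup and flatten it (hypothetical helper)."""
    return flatten(ET.fromstring(f"<p>{xml_fragment}</p>"))


print(flatten_fragment("incubator containing 5% CO<sub>2</sub>"))
print(flatten_fragment("10<sup>4</sup>"))
```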

Disabling tables as default

Converted table text isn't well represented in the BioC format. Currently the code tries to pull the table text through into BioC passages in a tab-delimited format, but this causes issues downstream when attempting to detect sentences within these tables. For now, tables are disabled by default and can be re-enabled as needed.

Troublesome PMC file stalls conversion

Hey Cara, a very large single PMC file seems to be stalling the strip_annotation_markers function during conversion. I've left it running for a few hours and it never finishes.

The problem article is PMC4829797 which seems to be a book. The file is very big but I don't think it should stall the conversion completely. The file is: PMC4829797.xml.gz

For the moment, I'm basically skipping strip_annotation_markers so that I can do a full run for Cancermine, etc.

The conversion command was:

python src/convert.py --i PMC4829797.xml --iFormat pmcxml --o test.bioc.xml --oFormat biocxml

Keep Citation Information as Annotations

As discussed offline, it would be useful to be able to keep the in-text citation information as annotations in bioc format. I've had a crack at this as an offshoot of my tables PR #2 since it lays some groundwork that helps. I made this ticket for discussing the particulars.

Better table header handling

I've been using the linearized tables, but one thing I've noticed is that when we have something complex like a multi-level header, just linearizing means the number of cells doesn't always match up. So something like this

example article used: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2873663/

Currently gets turned into

p53 MUTATION	FUNCTIONALa STATUS	IARC DATABASEb	FEATURESc	SOMATIC	GERMLINE FAMILIES	TOTAL	BREAST

And we lose a lot of meaning, not to mention it becomes impossible to match these headers up properly to the cell text from the body of the table (see below).

p53 MUTATION	FUNCTIONALa STATUS	IARC DATABASEb	FEATURESc	SOMATIC	GERMLINE FAMILIES	TOTAL	BREAST
T125R	ALTERED	2	1	0

So I'd like to try something more complex where we simplify the header into a single row before we linearize. This would require making the text differ slightly from the original by repeating some words, which I'm not sure about. The end result would look like this

p53 MUTATION	FUNCTIONALa STATUS	IARC DATABASEb SOMATIC TOTAL	IARC DATABASEb SOMATIC BREAST	IARC DATABASEb GERMLINE FAMILIES	FEATURESc
T125R	ALTERED	2	1	0

@jakelever what do you think? I've already been implementing this for my own purposes but would be happy to put up a PR if you like the idea
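For discussion, here is a minimal sketch of one way to collapse a multi-level header into a single row. It is not the implementation referred to above. It assumes each header cell has already been parsed into a hypothetical `(text, colspan, rowspan)` tuple, and the example header is made up rather than taken from the article.

```python
def flatten_header(rows):
    """Collapse header rows of (text, colspan, rowspan) cells into one row."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip columns already filled by a rowspan from an earlier row
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_cols = max(col for _, col in grid) + 1
    flattened = []
    for c in range(n_cols):
        parts = []
        for r in range(len(rows)):
            text = grid.get((r, c))
            # Avoid repeating a cell that spans several rows
            if text and text not in parts[-1:]:
                parts.append(text)
        flattened.append(" ".join(parts))
    return flattened


header = [
    [("Mutation", 1, 2), ("Database", 2, 1), ("Features", 1, 2)],
    [("Somatic", 1, 1), ("Germline", 1, 1)],
]
print("\t".join(flatten_header(header)))
```

With this approach every flattened header lines up with exactly one body column, at the cost of repeating the parent header text.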

Dealing with broken PMC files without xlink namespace

A small number of PMC files use the xlink namespace without defining it first. For example, the documents include "xlink:href" where "xlink" hasn't been defined. This breaks the XML parser and gives errors like the ones below.

Traceback (most recent call last):
  File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 390, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 274, in process_pmc_file
    for event, elem in etree.iterparse(source, events=("start", "end", "start-ns", "end-ns")):
  File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1222, in iterator
    yield from pullparser.read_events()
  File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1297, in read_events
    raise event
  File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1269, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: unbound prefix: line 12, column 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/convertPMC.py", line 56, in <module>
    for bioc_doc in pmcxml2bioc(io.StringIO(data)):
  File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 450, in pmcxml2bioc
    raise RuntimeError("Parsing error in PMC xml file: %s" % source)
RuntimeError: Parsing error in PMC xml file: <_io.StringIO object at 0x7f04d1099c18>

An initial hacky fix was implemented in 63663fe and e30c3e9. It tried to fix href-specific cases. This needs to be explored further, as a new file with a non-href-related case has appeared.
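One possible more general approach, sketched below, is to declare the missing prefix on the root element before parsing, so that any xlink attribute works and not just href. This is not what the commits above do. It assumes the whole file is available as a string, that the root tag itself has no namespace prefix, and that no comment precedes the root element; `declare_xlink` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
# First real element tag; skips "<?xml ...?>" and "<!DOCTYPE ...>"
ROOT_TAG = re.compile(r"<([A-Za-z_][\w.\-]*)")


def declare_xlink(xml_text):
    """Add an xmlns:xlink declaration to the root element if it is missing."""
    if "xlink:" not in xml_text or "xmlns:xlink" in xml_text:
        return xml_text
    match = ROOT_TAG.search(xml_text)
    if match is None:
        return xml_text
    declaration = f' xmlns:xlink="{XLINK_NS}"'
    return xml_text[:match.end()] + declaration + xml_text[match.end():]


broken = '<article><body><ext-link xlink:href="http://example.org">x</ext-link></body></article>'
root = ET.fromstring(declare_xlink(broken))  # parsing `broken` directly raises ParseError
print(root.find("body/ext-link").attrib)
```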

Special Case: Table with multiple separated header lines

Examples from here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3704624/

Mutations detected in samples.

BRAFV600 (Sanger)	Melanoma (n=45)	Associated nevus (n=46)	Control nevus (n=25)	Matched melanoma (n=28)
V600E	51.1% (n=23)	63.0% (n=29)	52.0% (n=13)	39.3% (n=11)
V600K	0	0	4.0% (n=1)	0
Wildtype	48.9% (n=22)	37.0% (n=17)	44.0% (n=11)	60.7% (n=17)
BRAFV600E (Sanger + VE1 IHC)	Melanoma (n=46)	Associated nevus (n=46)	Control nevus (n=25)	Matched melanoma (n=29)
V600E	63.0% (n=29)	65.2% (n=30)	54.2% (n=13)	41.4% (n=12)
Wildtype	37.0% (n=17)	34.8% (n=16)	48.0% (n=12)	58.6% (n=17)
NRAS Exon 2 (Sanger)	Melanoma (n=42)	Associated nevus (n=44)	Control nevus (n=21)	Matched melanoma (n=26)
Silent mutations	2.4% (n=1; A66A)	2.3% (n=1; L52L)	0	0
Q61K	4.8% (n=2)	4.5% (n=2)	14.3% (n=3)	0
Q61L	2.4% (n=1)	2.3% (n=1)	0	0
Q61R	2.4% (n=1)	9.1% (n=4)	0	7.7% (n=2)
Wildtype	88.1% (n=37)	81.8% (n=36)	85.7% (n=18)	92.3% (n=24)

This should really be 3 separate tables, so I'm not sure how we can parse this properly. Right now the parser assumes anything in the tbody isn't a header.

Special Case: Pubmed Abstract Headers dropped in bioc output

For the following articles, the content of the section headers "Aim", "Conclusion", etc. is dropped in the final bioc output, which means we lose some context.

https://pubmed.ncbi.nlm.nih.gov/26161928/

Input XML	Proposed Parse	Current Parse
`<AbstractText Label="AIM" NlmCategory="OBJECTIVE">To investigate the impact of KRAS mutation variants on the activity of regorafenib in SW48 colorectal cancer cells.</AbstractText>`	AIM: To investigate the impact of KRAS mutation variants on the activity of regorafenib in SW48 colorectal cancer cells.	To investigate the impact of KRAS mutation variants on the activity of regorafenib in SW48 colorectal cancer cells.
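The proposed parse could be produced along the lines of the sketch below, which prefixes each labelled abstract section with its Label attribute. It is only an illustration using the standard library, not the converter's code; `abstract_passages` is a hypothetical name and the "LABEL: text" layout simply follows the proposal in the table.

```python
import xml.etree.ElementTree as ET


def abstract_passages(abstract_elem):
    """Return abstract section texts, prefixed with their Label when present."""
    passages = []
    for part in abstract_elem.iter("AbstractText"):
        text = "".join(part.itertext()).strip()
        label = part.get("Label")
        passages.append(f"{label}: {text}" if label else text)
    return passages


xml = (
    '<Abstract><AbstractText Label="AIM" NlmCategory="OBJECTIVE">'
    "To investigate the impact of KRAS mutation variants on the activity of "
    "regorafenib in SW48 colorectal cancer cells.</AbstractText></Abstract>"
)
print(abstract_passages(ET.fromstring(xml)))
```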

Citation annotation offset outside passage

Hey @creisle , I've come across a citation annotation that is outside the associated passage. One of my scripts checks some things on BioC files and this got flagged. I think that it doesn't seem right. What do you think?

Below is an example where the passage offset is 56733 but the zero-length citation is at offset 56732 which is just before the passage starts.

    <passage>
      <infon key="section">floating</infon>
      <infon key="subsection">None</infon>
      <infon key="xml_path">floats-group/table-wrap/table/thead</infon>
      <offset>56733</offset>
      <text> (µg/mL)    S-2366 K M (mM) V MAX (mAU/min)</text>
      <annotation id="ANN_c6f8f533-0764-4484-8663-c18655ca06f3">
        <infon key="citation_text">1</infon>
        <infon key="type">citation</infon>
        <location offset="56732" length="0"/>
        <text/>
      </annotation>
    </passage>

To reproduce, I've included the source PMC XML file: PMC8466798.xml.gz and I converted it with the line below.

python src/convert.py --i PMC008xxxxxx/PMC8466798.xml --iFormat pmcxml --o test.bioc.xml --oFormat biocxml
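A check like the one that flagged this can be sketched with the standard library as below. This is not the script mentioned above. It assumes offsets are counted in characters and that annotations are nested inside their passage, as in the example; `out_of_bounds_annotations` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def out_of_bounds_annotations(bioc_root):
    """Return (annotation id, annotation offset, passage start, passage end)
    for every annotation location that falls outside its passage."""
    problems = []
    for passage in bioc_root.iter("passage"):
        start = int(passage.findtext("offset"))
        end = start + len(passage.findtext("text") or "")
        for annotation in passage.findall("annotation"):
            for location in annotation.findall("location"):
                ann_start = int(location.get("offset"))
                ann_end = ann_start + int(location.get("length"))
                if ann_start < start or ann_end > end:
                    problems.append((annotation.get("id"), ann_start, start, end))
    return problems


# Usage, with the converted file from the command above:
# root = ET.parse("test.bioc.xml").getroot()
# for problem in out_of_bounds_annotations(root):
#     print(problem)
```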

Testing re-write-parser branch

Hey, I'm just running some tests on the re-write-parser branch as we discussed. I tried to do a full-run and ran into an error below. I narrowed it down to a file (I think) and had to fix an error there too.

Full Run Issue

# Commands for a full run
snakemake --cores 1 downloaded.flag
snakemake --cores 8 converted.flag

Traceback (most recent call last):
  File "src/convertPMC.py", line 48, in <module>
    for bioc_doc in pmcxml2bioc(io.StringIO(data)):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 372, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 324, in process_pmc_file
    article_elem, tag_handlers=tag_handlers
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 134, in extract_article_content
    article_elem.findall("./body"), tag_handlers=tag_handlers, annotations_map=annotations_map
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 368, in extract_text_chunks
    raw_text_chunks.extend(tag_handler(elem, tag_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  [Previous line repeated 5 more times]
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 225, in tag_handler
    for child in merge_adjacent_xref_siblings(elem):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 175, in merge_adjacent_xref_siblings
    prev_tail = siblings[-1].tail.strip()
AttributeError: 'NoneType' object has no attribute 'strip'
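The error arises because ElementTree sets `tail` to None when no text follows an element, which is common for adjacent xref tags. A None-safe way of reading the tail is sketched below; `tail_text` is a hypothetical helper and not necessarily how the branch should fix `merge_adjacent_xref_siblings`.

```python
import xml.etree.ElementTree as ET


def tail_text(elem):
    """Return the stripped tail of an element, treating a missing tail as empty."""
    return (elem.tail or "").strip()


paragraph = ET.fromstring("<p>colon cancer<xref>6</xref><xref>7</xref> have been</p>")
print([tail_text(child) for child in paragraph])
```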

Small File Test Case Error

I got it to dump out which file it was processing when it crashed and it seems to be the file Molecules/PMC6259225.nxml from the comm_use.I-N.xml.tar.gz archive. I have attached it.

PMC6259225.nxml.gz

I got a different error (due to my hacky fix for the invalid PMC XML files)

(mypy3) [jlever@munin biotext]$ python src/convert.py --i Molecules/PMC6259225.nxml --iFormat pmcxml --o test.bioc.xml --oFormat biocxml
Converting 1 files to test.bioc.xml
Traceback (most recent call last):
  File "src/convert.py", line 26, in <module>
    convert(inFiles,inFormat,args.o,outFormat)
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/main.py", line 49, in convert
    for bioc_doc in docs2bioc(in_file, in_format, **kwargs):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 372, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 252, in process_pmc_file
    content = source.read()
AttributeError: 'str' object has no attribute 'read'
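This second error comes from passing a filename where a file-like object is expected. One way to tolerate both is sketched below; `read_source` is a hypothetical helper, and UTF-8 is an assumed encoding.

```python
import io


def read_source(source):
    """Return the contents of source, which may be a path or a file-like object."""
    if hasattr(source, "read"):
        return source.read()
    # Assumption: files on disk are UTF-8 encoded
    with open(source, encoding="utf-8") as handle:
        return handle.read()


print(read_source(io.StringIO("<article/>")))
```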

New PMC data format (baseline + increments)

As noted in #5, there is a new PMC bulk download format described at https://www.ncbi.nlm.nih.gov/pmc/about/new-in-pmc/#2021-09-21. We'll need to make a few adjustments to how PMC data is dealt with. Our code actually does its own baseline + increments system, so it'll mostly be cutting out code that isn't needed anymore.

PubMed Central (PMC) has made significant improvements to the bulk retrieval of two of the PMC Article Datasets from our FTP service. The improvements were made to bulk packages which include metadata and full text files of articles in XML or plain text formats for the PMC Open Access (OA) Subset and the Author Manuscript Dataset, which combined encompass more than half of the 7 million articles in PMC. To improve the usability of these two datasets, PMC has redesigned the bulk download directory structure and file packages on our FTP service. The new structure includes:

baseline packages that contain all articles available in PMC as of the baseline date for each respective dataset or grouping; and

daily incremental packages for each respective dataset or grouping that only contain articles that are new to the dataset or that have been updated since the baseline or previous incremental file was created.

Retain supplementary table/figure mentions

I ran across an example where, in our latest version, supplementary table/figure mentions are still being stripped out in some cases. For example:

The NTRK3 G623R mutation conferred even greater loss of sensitivity to the other tested Trk inhibitors, TSR-011 (Tesaro) and LOXO-101 (LOXO), eliciting IC50 proliferation values of >1000 nM (supplementary Figure S4C, available at Annals of Oncology online).

Relevant PMC article: 4843186

Table normalization cases

Case	Example Article	License	table index	Tests
header hierarchical colspans	PMC5029658	CC-BY	1,2,3	✅
body hierarchical rowspans	PMC5029658	CC-BY	0	✅
header hierarchical colspans	PMC2873663	author version		redundant
body full colspans	PMC7461630	author version		redundant
body full colspans	PMC4919728	CC BY-NC-ND	0	✅
body partial colspans	PMC4049792	CC-BY NC	0
paragraphs inside table cells	PMC6580637	CC-BY	2	✅
empty cells that should be repeated	PMC4816447	CC-BY	0

Get PMC IDs from Europe PMC files

As noted in #21, the Europe PMC files use "pmcid" and not "pmc" for the id attribute. We need to cover both.
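A minimal sketch of covering both values is below. It assumes the ID lives in an article-id element with a pub-id-type attribute, as in JATS-style PMC XML; the attribute and element names are not stated in this issue, and `get_pmc_id` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# "pmc" is used by PMC files, "pmcid" by Europe PMC files (per the issue above)
PMC_ID_TYPES = ("pmc", "pmcid")


def get_pmc_id(article_elem):
    """Return the PMC ID of an article element, or None if it has none."""
    for article_id in article_elem.iter("article-id"):
        if article_id.get("pub-id-type") in PMC_ID_TYPES:
            return (article_id.text or "").strip() or None
    return None


europe_pmc = ET.fromstring(
    '<article><front><article-meta>'
    '<article-id pub-id-type="pmid">123</article-id>'
    '<article-id pub-id-type="pmcid">PMC456</article-id>'
    '</article-meta></front></article>'
)
print(get_pmc_id(europe_pmc))
```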