greenelab / pubtator

Retrieve and process PubTator annotations

License: Other

Python 99.75% Shell 0.25%

pubtator snorkel nlp pubmed text-mining data tool

pubtator's Introduction

PubTator: tagged PubMed abstracts for literature mining

PubTator and its 2.0 version (PubTator Central) use text mining to tag PubMed abstracts/articles with standardized concepts. This repository retrieves and processes PubTator annotations for use in greenelab/snorkeling and elsewhere.

Get Started

Deprecation Notice

If you arrived at this page to convert PubTator annotations into BioC XML format, you no longer need to. PubTator Central now provides its own BioC XML files, which can be found here.

Set-up Environment

Conda

Install the conda environment.
Create the pubtator environment by running:

conda create --name pubtator python=3.8

Install packages via pip by running the following:

pip install -r requirements.txt

Activate with conda activate pubtator.

Pip

Make sure you have Python 3.8 installed.
Install packages by running the following:

pip install -r requirements.txt

Execution

To start processing PubTator/PubTator Central, run the following command:

python execute.py --config config_files/pubtator_central_config.json

If the original PubTator is desired, replace pubtator_central_config.json with pubtator_config.json. The JSON file contains all the parameters needed to run. More information about the JSON file can be found here.

License

This repository is dual licensed as BSD 3-Clause and CC0 1.0, meaning any repository content can be used under either license. This licensing arrangement ensures source code is available under an OSI-approved license, while non-code content (such as figures, data, and documentation) is maximally reusable under a public domain dedication.

pubtator's Issues

Hetnet IDs for Disease/Compound IDs

Currently PubTator uses MeSH IDs for disease and compound mentions, but for our purposes we will need to convert these into the same IDs that hetnets use. @dhimmel, let me know if you want to grab the list, or I can if that is desired.

Modify execute.sh to only download missing/incomplete files?

Currently, execute.sh will re-download all files each time it is run, regardless of whether the files have already been successfully downloaded and processed.

Since it requires downloading files which are quite large, there is a decent chance that the script will need to be run more than once due to interrupted downloads.

It would be great if the script could check each file to see if it has been fully downloaded, and only download those which are missing/incomplete.

One possible approach might be to generate md5sums for each output, and check against this, at least for the files that are only updated at periodic intervals.
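
A minimal sketch of the checksum idea, assuming a trusted MD5 digest is available for each file. The function names md5_of_file and needs_download are hypothetical and not part of this repository:

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, expected_md5):
    """Return True if the file is missing or its digest differs from expected_md5.

    expected_md5 is assumed to come from a trusted source, such as a
    checksum recorded after a previous successful run.
    """
    path = Path(path)
    if not path.is_file():
        return True
    return md5_of_file(path) != expected_md5
```

A download step could call needs_download before fetching each file and skip the ones that already match.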

Thanks for taking the time to put this together and share it with the community!

Unspecified numpy version breaks execute.py

Steps to reproduce

clone repo
create and activate virtual environment (Python 3.8)
install requirements.txt with pip
run python execute.py --config config_files/pubtator_central_config.json

Error

(venv) C:\Repos\pubtator>python execute.py --config config_files/pubtator_central_config.json
C:\Repos\pubtator\venv\lib\site-packages\pandas\util\testing.py:27: FutureWarning: In the future `np.bool` will be defined as the corresponding NumPy scalar.
  import pandas._libs.testing as _testing
Traceback (most recent call last):
  File "execute.py", line 7, in <module>
    from scripts.download_full_text import download_full_text, merge_full_text
  File "C:\Repos\pubtator\scripts\download_full_text.py", line 10, in <module>
    import pandas as pd
  File "C:\Repos\pubtator\venv\lib\site-packages\pandas\__init__.py", line 182, in <module>
    import pandas.testing
  File "C:\Repos\pubtator\venv\lib\site-packages\pandas\testing.py", line 7, in <module>
    from pandas.util.testing import (
  File "C:\Repos\pubtator\venv\lib\site-packages\pandas\util\testing.py", line 27, in <module>
    import pandas._libs.testing as _testing
  File "pandas/_libs/testing.pyx", line 10, in init pandas._libs.testing
  File "C:\Repos\pubtator\venv\lib\site-packages\numpy\__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'bool'.
`np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Solution

Downgrade numpy to 1.23.1: pip install numpy==1.23.1
Add numpy with specific version to requirements.txt

Modify PubTator example file to have more diverse/tricky documents

Let's add the following document to our example datasets:

27075942|t|Beh  et's disease: A comprehensive review with a focus on epidemiology, etiology and clinical features, and management of mucocutaneous lesions.
27075942|a|UNASSIGNED: Beh  et's disease (BD) is a chronic, relapsing, inflammatory multisystem disease of unknown etiology. Oral ulcers, genital ulcers, cutaneous lesions, and ocular and articular involvement are the most frequent features of the disease. Mucocutaneous lesions are considered hallmarks of the disease, and often precede other manifestations. Therefore, their recognition may permit earlier diagnosis and treatment with beneficial results for prognosis. BD is particularly prevalent in "Silk Route" populations but has a global distribution. The disease usually starts around the third or fourth decade of life. Sex distribution is roughly equal. The diagnosis is based on clinical criteria, as there is no pathognomonic test. Genetic factors have been investigated extensively, and association with human leukocyte antigen (HLA)-B51 is still known as the strongest genetic susceptibility factor. The T-helper 17 and interleukin (IL)-17 pathways are active, and play an important role, particularly in acute attacks of BD. Neutrophil activity is increased in BD, and the affected organs show a significant neutrophil and lymphocyte infiltration. HLA-B51 association and increased IL-17 response are thought to play a role in neutrophil activation. Treatment is mainly based on the suppression of inflammatory attacks of the disease using immunomodulatory and immunosuppressive agents. Although treatment has become much more effective in recent years with the introduction of newer drugs, BD is still associated with considerable morbidity and increased mortality. Male sex, younger age of onset and increased number of organs involved at the diagnosis are associated with a more severe disease and, therefore, require more aggressive treatment.
27075942	605	607	BD	Disease	MESH:D000544
27075942	1170	1172	BD	Disease	MESH:D000544
27075942	1210	1212	BD	Disease	MESH:D000544
27075942	1640	1642	BD	Disease	MESH:D000544
27075942	157	174	Beh  et's disease	Disease	MESH:D000544
27075942	176	178	BD	Disease	MESH:D000544
27075942	259	286	Oral ulcers, genital ulcers	Disease	MESH:D014456
27075942	0	17	Beh  et's disease	Disease	MESH:D000544
27075942	122	143	mucocutaneous lesions	Disease	MESH:D001927
27075942	957	984	leukocyte antigen (HLA)-B51	Gene	3126
27075942	1068	1087	interleukin (IL)-17	Gene	3605
27075942	1331	1336	IL-17	Gene	3605

I'm choosing this one because it contains a double-quote character ("). It also shows what PubTator does to Unicode in this file: it replaces it with whitespace, which sucks but at least will decrease encoding issues. FYI, the original name is Behçet's.
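
A rough sketch of how such a tricky document could be used as a test case: it checks that each annotation's offsets select the stated mention. It assumes offsets index into the title and abstract joined by a single separator character, which is consistent with the example above (the title is 144 characters long and the first abstract mention starts at 157, after the 12-character "UNASSIGNED: " prefix). The function name find_offset_mismatches is hypothetical:

```python
def find_offset_mismatches(stanza_lines):
    """Return (start, end, mention, found) for annotations whose offsets miss the mention."""
    title, abstract, annotations = "", "", []
    for raw in stanza_lines:
        line = raw.rstrip("\n")  # keep other whitespace: it counts toward offsets
        head = line.split("|", 2)
        if len(head) == 3 and head[0].isdigit() and head[1] in ("t", "a"):
            # Title ("t") or abstract ("a") line: pmid|t|text
            if head[1] == "t":
                title = head[2]
            else:
                abstract = head[2]
        elif line:
            # Annotation line: pmid, start, end, mention, type, identifier
            fields = line.split("\t")
            annotations.append((int(fields[1]), int(fields[2]), fields[3]))
    # Assumption: one separator character sits between title and abstract.
    text = title + " " + abstract
    return [(s, e, m, text[s:e]) for s, e, m in annotations if text[s:e] != m]
```

An empty result means every mention lines up with its offsets, including the whitespace that replaced the original Unicode character.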

@danich1, we can do this in a separate PR.

Migrate greenelab/snorkeling#11 to this repo

@danich1 has created code to convert pubtator format to BioC XML and TSV in greenelab/snorkeling#11. We'll want to incorporate that code in this repo.

So @danich1, can you copy your latest files from greenelab/snorkeling#11 and open a pull request here?

error processing bioconcepts2pubtator_offsets.gz

I would like to try out pubtator. I was running execute.sh and it gave an error:

~/pubtator$ python scripts/pubtator_to_xml.py --documents download/bioconcepts2pubtator_offsets.gz --output data/pubtator-docs.xml.xz
5079543it [2:57:28, 370.11it/s]Traceback (most recent call last):
  File "scripts/pubtator_to_xml.py", line 205, in <module>
    convert_pubtator(args.documents, args.output)
  File "scripts/pubtator_to_xml.py", line 164, in convert_pubtator
    for article in tqdm.tqdm(article_generator):
  File "/home/ksoh/anaconda3/envs/pubtator/lib/python3.8/site-packages/tqdm/std.py", line 1093, in __iter__
    for obj in iterable:
  File "scripts/pubtator_to_xml.py", line 131, in read_bioconcepts2pubtator_offsets
    yield pubtator_stanza_to_article(g)
  File "scripts/pubtator_to_xml.py", line 101, in pubtator_stanza_to_article
    annts = list(annts)
  File "/home/ksoh/anaconda3/envs/pubtator/lib/python3.8/csv.py", line 111, in __next__
    row = next(self.reader)
_csv.Error: line contains NUL
5079543it [2:57:28, 477.03it/s]

Please advise. Thank you.
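
One possible workaround, sketched under the assumption that the NUL characters are stray bytes rather than meaningful data, is to strip them before the lines reach csv.reader. The helper name read_tsv_without_nul is hypothetical and is not the repository's code:

```python
import csv


def read_tsv_without_nul(lines):
    """Parse tab-separated lines after dropping NUL characters."""
    cleaned = (line.replace("\0", "") for line in lines)
    # QUOTE_NONE keeps double quotes inside mentions as literal characters.
    return list(csv.reader(cleaned, delimiter="\t", quoting=csv.QUOTE_NONE))
```

Note that removing characters shifts positions, so if a NUL falls inside the title or abstract text, replacing it with a space instead may be the safer choice for keeping annotation offsets aligned.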

bioconcepts2pubtatorcentral.offset.gz not available any more

Hi,

I just wanted to bring to your attention that 'python execute.py --config config_files/pubtator_central_config.json' doesn't run because the file bioconcepts2pubtatorcentral.offset.gz hosted on the ftp is no longer available.

Also, I think the README file at step 3 of the conda installation should be moved to after the conda activate (I don't think conda always automatically switches to the newly created env)

Thank you for making this repo public!
~Ali

Consider itertools.groupby for a slick refactoring

Consider using itertools.groupby() as an elegant method of separating bioconcepts2pubtator_offsets into articles. See pubtator_to_xml.py#L124-L137.
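
A minimal sketch of the suggested approach, assuming articles in the file are separated by blank lines. The function name iter_stanzas is hypothetical:

```python
import itertools


def iter_stanzas(lines):
    """Yield one list of lines per article, splitting on blank lines."""
    stripped = (line.rstrip("\n") for line in lines)
    # groupby clusters consecutive lines by truthiness: runs of non-empty
    # lines form an article, runs of empty lines are separators.
    for has_content, group in itertools.groupby(stripped, key=bool):
        if has_content:
            yield list(group)
```

Because groupby is lazy, this streams through the file one article at a time instead of loading it all into memory.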

Compound conversion to DrugBank / Hetionet IDs

Appears to be broken as of #13. See these cells: all compound IDs are 9606.

Move .py files out of data

Let's move the .py scripts to a scripts folder in the root directory.

Let's move the examples README into data/example.

Tutorial for its usage

Can you add a tutorial on how to use this? I see the repository, but I'm confused about what I'm supposed to run. The web version of PubTator is straightforward: I just enter PMIDs and it returns the results. I would be glad if you could add a tutorial.

I ran this

bash execute.sh
wget: download/bioconcepts2pubtatorcentral_offset.gz.log: No such file or directory

but this file exists here: "https://github.com/greenelab/pubtator/blob/master/download/bioconcepts2pubtator_offsets.gz.log"

I'm not sure what I'm doing wrong.

Process and upload the complete pubtator catalog

Available at ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator/bioconcepts2pubtator_offsets.gz.

We'll want to use Git LFS to track these files.

Error while starting pubtator locally

Hello,
I followed the instructions in the README in order to run pubtator locally.
But when I execute the last command
python execute.py --config config_files/pubtator_central_config.json
after the repository is downloaded, I get an error that may be caused by the downloaded file failing to extract correctly.
I append the error message:
Article that broke: 35401401
228155it [3:57:28, 19.85it/s]Traceback (most recent call last):
  File "execute.py", line 43, in <module>
    convert_pubtator(
  File "/home/gabbo/Tesi/pubtator/scripts/pubtator_to_xml.py", line 181, in convert_pubtator
    for article in tqdm.tqdm(article_generator):
  File "/home/gabbo/anaconda3/envs/pubtator/lib/python3.8/site-packages/tqdm/_tqdm.py", line 833, in __iter__
    for obj in iterable:
  File "/home/gabbo/Tesi/pubtator/scripts/pubtator_to_xml.py", line 146, in read_bioconcepts2pubtator_offsets
    g = list(g)
  File "/home/gabbo/Tesi/pubtator/scripts/pubtator_to_xml.py", line 141, in <genexpr>
    lines = (line.rstrip() for line in f)
  File "/home/gabbo/anaconda3/envs/pubtator/lib/python3.8/gzip.py", line 305, in read1
    return self._buffer.read1(size)
  File "/home/gabbo/anaconda3/envs/pubtator/lib/python3.8/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/gabbo/anaconda3/envs/pubtator/lib/python3.8/gzip.py", line 487, in read
    uncompress = self._decompressor.decompress(buf, size)
zlib.error: Error -3 while decompressing data: invalid code lengths set

Could you please check this problem, and let me know if a fresh install works?

Thank you.
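
A zlib error partway through reading often points to a corrupted or incomplete download. As a rough diagnostic sketch (the function name gzip_is_intact is hypothetical and not part of this repository), the archive can be read end to end before processing starts:

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip file decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data; only the absence of errors matters.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers missing files and bad gzip headers, EOFError a
        # truncated stream, and zlib.error corrupted compressed data.
        return False
    return True
```

If this returns False for the downloaded file, deleting it and downloading again is likely a reasonable first step.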