krassowski / easy-entrez

Retrieve PubMed articles, text-mining annotations, or molecular data from >35 Entrez databases via an easy-to-use Python package built on top of the Entrez E-utilities API.

Home Page: https://easy-entrez.readthedocs.io/en/latest/

License: GNU Lesser General Public License v3.0

Python 84.37% Jupyter Notebook 15.60% Shell 0.02%

entrez pubmed literature-search literature-mining entrez-eutilities eutilities pubmed-central gene-annotations meta-analysis

easy-entrez's Introduction

easy-entrez

Python REST API for Entrez E-Utilities, aiming to be easy to use and reliable.

Easy-entrez:

makes common tasks easy thanks to a simple Pythonic API,
is typed and integrates well with mypy,
is tested on Windows, Mac and Linux across Python 3.7 to 3.12,
is limited in scope, which allows the project to focus on the reliability of the core code,
does not use the stateful API, as it is error-prone (as seen in the alternative, entrezpy).

Examples

from easy_entrez import EntrezAPI

entrez_api = EntrezAPI(
    'your-tool-name',
    '[email protected]',
    # optional
    return_type='json'
)

# find up to 10 000 results for cancer in human
result = entrez_api.search('cancer AND human[organism]', max_results=10_000)

# data will be populated with JSON or XML (depending on the `return_type` value)
result.data

See more in the Demo notebook and documentation.

For a real-world example (i.e. used for this publication) see notebooks in multi-omics-state-of-the-field repository.

Fetching genes for a variant from dbSNP

Fetch the SNP record for rs6311:

rs6311 = entrez_api.fetch(['rs6311'], max_results=1, database='snp').data[0]
rs6311

Display the result:

from easy_entrez.parsing import xml_to_string

print(xml_to_string(rs6311))

Find the gene names for rs6311:

namespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}
genes = [
    name.text
    for name in rs6311.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
]
print(genes)

['HTR2A']
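
The `namespaces` mapping matters because every element of the document summary lives in an XML namespace. Below is a minimal offline sketch of that lookup using only the standard library. The XML snippet is a hand-written, heavily simplified stand-in for a real record (an assumption for illustration, not actual dbSNP output).

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a dbSNP document summary.
SAMPLE_XML = """
<DocumentSummary xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum" uid="6311">
  <GENES>
    <GENE_E>
      <NAME>HTR2A</NAME>
    </GENE_E>
  </GENES>
</DocumentSummary>
"""

namespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}
document_summary = ET.fromstring(SAMPLE_XML.strip())

# The prefix ('ns0') is arbitrary; only the URI it maps to has to match.
genes_with_ns = [
    name.text
    for name in document_summary.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
]
# Without the namespace mapping the same path matches nothing.
genes_without_ns = [
    name.text for name in document_summary.findall('.//GENE_E/NAME')
]

print(genes_with_ns)     # ['HTR2A']
print(genes_without_ns)  # []
```

If a query unexpectedly returns an empty list, a missing or mistyped namespace URI is a likely cause worth checking first.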

Fetch data for multiple variants at once:

result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')
gene_names = {
    'rs' + document_summary.get('uid'): [
        element.text
        for element in document_summary.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
    ]
    for document_summary in result.data
}
print(gene_names)

{'rs6311': ['HTR2A'], 'rs662138': ['SLC22A1']}

Obtaining the chromosomal position from SNP rsID number

from pandas import DataFrame

result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')

variant_positions = DataFrame([
    {
        'id': 'rs' + document_summary.get('uid'),
        'chromosome': chromosome,
        'position': position
    }
    for document_summary in result.data
    for chrom_and_position in document_summary.findall('.//ns0:CHRPOS', namespaces)
    for chromosome, position in [chrom_and_position.text.split(':')]
])

variant_positions

	id	chromosome	position
0	rs6311	13	46897343
1	rs662138	6	160143444
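
In the table above both columns hold strings, because they come straight from splitting the `CHRPOS` text. If numeric positions are needed (for sorting or range checks), a small helper along these lines could be used. The name `split_chrpos` is hypothetical, and the sketch assumes the text always has the `chromosome:position` form shown above.

```python
from typing import Tuple


def split_chrpos(chrpos: str) -> Tuple[str, int]:
    """Split a 'chromosome:position' string, e.g. '13:46897343'."""
    chromosome, position = chrpos.split(':')
    # The chromosome stays a string because of names such as 'X' or 'Y'.
    return chromosome, int(position)


print(split_chrpos('13:46897343'))  # ('13', 46897343)
print(split_chrpos('6:160143444'))  # ('6', 160143444)
```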

Converting full variation/mutation data to tabular format

Parsing utilities can quickly extract the data into a VariantSet object holding pandas DataFrames with coordinates and alternative allele frequencies:

from easy_entrez.parsing import parse_dbsnp_variants

variants = parse_dbsnp_variants(result)
variants

<VariantSet with 2 variants>

To get the coordinates:

variants.coordinates

rs_id	ref	alts	chrom	pos	chrom_prev	pos_prev	consequence
rs6311	C	A,T	13	46897343	13	47471478	upstream_transcript_variant,intron_variant,genic_upstream_transcript_variant
rs662138	C	G	6	160143444	6	160564476	intron_variant

For frequencies:

variants.alt_frequencies.head(5)  # using head to only display first 5 for brevity

	rs_id	allele	source_frequency	total_count	study	count
0	rs6311	T	0.44349	2221	1000Genomes	984.991
1	rs6311	T	0.411261	1585	ALSPAC	651.849
2	rs6311	T	0.331696	1486	Estonian	492.9
3	rs6311	T	0.35	14	GENOME_DK	4.9
4	rs6311	T	0.402529	56309	GnomAD	22666

Obtaining the SNP rs ID number from chromosomal position

You can use the query string directly:

results = entrez_api.search(
    '13[CHROMOSOME] AND human[ORGANISM] AND 31873085[POSITION]',
    database='snp',
    max_results=10
)
print(results.data['esearchresult']['idlist'])

['59296319', '17076752', '7336701', '4']

Or pass a dictionary (no validation of arguments is performed, AND conjunction is used):

results = entrez_api.search(
    dict(chromosome=13, organism='human', position=31873085),
    database='snp',
    max_results=10
)
print(results.data['esearchresult']['idlist'])

['59296319', '17076752', '7336701', '4']

The base position should use the latest genome assembly (GRCh38 at the time of writing); you can use the position in previous assembly coordinates by replacing POSITION with POSITION_GRCH37. For more information on the arguments accepted by the SNP database, see the Entrez help page on the NCBI website.
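
To make the relation between the two forms concrete, here is a rough sketch of how a dictionary could be turned into the equivalent query string. `build_term` is a hypothetical helper written for illustration, not the library's own implementation. The second call uses the `pos_prev` value of rs6311 from the coordinates table above, on the assumption that the previous assembly there is GRCh37.

```python
from typing import Dict, Union


def build_term(fields: Dict[str, Union[str, int]]) -> str:
    """Join field/value pairs with AND, each in the `value[FIELD]` form."""
    return ' AND '.join(
        f'{value}[{field.upper()}]'
        for field, value in fields.items()
    )


term = build_term(dict(chromosome=13, organism='human', position=31873085))
print(term)
# 13[CHROMOSOME] AND human[ORGANISM] AND 31873085[POSITION]

# Previous-assembly lookup: only the field name changes.
old_term = build_term(
    dict(chromosome=13, organism='human', position_grch37=47471478)
)
print(old_term)
# 13[CHROMOSOME] AND human[ORGANISM] AND 47471478[POSITION_GRCH37]
```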

Obtaining amino acid change information for variants in a given range

First we search for dbSNP rs identifiers of variants in a given region:

dbsnp_ids = (
    entrez_api
    .search(
        '12[CHROMOSOME] AND human[ORGANISM] AND 21178600:21178720[POSITION]',
        database='snp',
        max_results=100
    )
    .data
    ['esearchresult']
    ['idlist']
)

Then fetch the variant data for identifiers:

variant_data = entrez_api.fetch(
    ['rs' + rs_id for rs_id in dbsnp_ids],
    max_results=10,
    database='snp'
)

And parse the data, extracting the HGVS from the summary:

from easy_entrez.parsing import parse_dbsnp_variants
from pandas import Series


def select_protein_hgvs(items):
    return [
        [sequence, hgvs]
        for entry in items
        for sequence, hgvs in [entry.split(':')]
        if hgvs.startswith('p.')
    ]


protein_hgvs = (
    parse_dbsnp_variants(variant_data)
    .summary
    .HGVS
    .apply(select_protein_hgvs)
    .explode()
    .dropna()
    .apply(Series)
    .rename(columns={0: 'sequence', 1: 'hgvs'})
)
protein_hgvs.head()

rs_id	sequence	hgvs
rs1940853486	NP_006437.3	p.Gly203Ter
rs1940853414	NP_006437.3	p.Glu202Gly
rs1940853378	NP_006437.3	p.Glu202Lys
rs1940853299	NP_006437.3	p.Lys201Thr
rs1940852987	NP_006437.3	p.Asp198Glu
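
If the reference residue, position and alternative residue are needed as separate values, the `hgvs` strings can be split further. The sketch below is an illustration only: `ProteinChange` and `parse_substitution` are hypothetical names, and the pattern covers just simple substitutions written with three-letter codes (as in the rows above), not the full HGVS grammar.

```python
import re
from typing import NamedTuple, Optional


class ProteinChange(NamedTuple):
    reference: str
    position: int
    alternative: str


# Covers only simple substitutions written with three-letter codes.
SUBSTITUTION = re.compile(r'^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$')


def parse_substitution(hgvs: str) -> Optional[ProteinChange]:
    """Return the parsed change, or None if the string is not a simple substitution."""
    match = SUBSTITUTION.match(hgvs)
    if match is None:
        return None
    reference, position, alternative = match.groups()
    return ProteinChange(reference, int(position), alternative)


print(parse_substitution('p.Gly203Ter'))
# ProteinChange(reference='Gly', position=203, alternative='Ter')
print(parse_substitution('p.Lys201fs'))
# None
```

With pandas, something like `protein_hgvs.hgvs.apply(parse_substitution)` would apply it to the whole column.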

Fetching more than 10 000 entries

Use the in_batches_of method to fetch more than 10k entries (e.g. variant_ids):

snps_result = (
    entrez_api
    .in_batches_of(1_000)
    .fetch(variant_ids, max_results=5_000, database='snp')
)

The result is a dictionary with keys being the identifiers used in each batch (because the Entrez API does not always return the identifiers back) and values representing the result. You can use parse_dbsnp_variants directly on this dictionary.
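
The shape of that dictionary can be pictured with a small standalone sketch. This is not the library's code: `in_batches` is a hypothetical helper, the identifiers are made up, and a placeholder string stands in for each batch's response.

```python
from typing import List, Sequence, Tuple


def in_batches(ids: Sequence[str], size: int) -> List[Tuple[str, ...]]:
    """Cut a sequence of identifiers into consecutive tuples of at most `size`."""
    return [tuple(ids[i:i + size]) for i in range(0, len(ids), size)]


# Made-up identifiers, only to show the resulting structure.
variant_ids = [f'rs{n}' for n in range(1, 2501)]
batches = in_batches(variant_ids, 1_000)

# One entry per batch, keyed by the identifiers that were requested.
# Tuples are used as keys because they are hashable, unlike lists.
results_by_batch = {
    batch: f'<placeholder response for {len(batch)} ids>'
    for batch in batches
}

print([len(batch) for batch in results_by_batch])  # [1000, 1000, 500]
```

Keying by the requested identifiers means each response can still be traced back to its inputs even when the response itself does not echo them.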

Find PubMed ID from DOI

When searching the GWAS catalog, a PMID is needed rather than a DOI. You can convert one to the other using:

def doi_term(doi: str) -> str:
    """Clean a DOI string by removing URL prefix."""
    doi = (
        doi
        .replace('http://', 'https://')
        .replace('https://doi.org/', '')
    )
    return f'"{doi}"[Publisher ID]'


result = entrez_api.search(
    doi_term('https://doi.org/10.3389/fcell.2021.626821'),
    database='pubmed',
    max_results=1
)
print(result.data['esearchresult']['idlist'])

['33834021']

Installation

Requires Python 3.6+ (though only 3.7+ is tested). Install with:

pip install easy-entrez

If you wish to enable (optional, tqdm-based) progress bars use:

pip install easy-entrez[with_progress_bars]

If you wish to enable (optional, pandas-based) parsing utilities use:

pip install easy-entrez[with_parsing_utils]

Contributing

To build the documentation locally:

pip install -e .[docs]
sphinx-build docs docs/_build
open docs/_build/index.html

Alternatives

You might want to try:

biopython.Entrez - biopython is a heavy dependency, but probably a good choice if you already use it
pubmedpy - provides interesting utilities for parsing the responses
entrez - appears to have a comparable scope but quite different API
entrezpy - this one did not work well for me (hence this package), but may have improved since

easy-entrez's Issues

Docs request: significance of `email`

Can we add more details on the email parameter of EntrezAPI?

I found using [email protected] worked fine, my request went through. Is email really a required part of the payload to send?

Request: officially supporting Python 3.12

Hello team, love this package! Excited to start using it today, except, I use Python 3.12.

Any chance we can:

Add Python 3.12 to the test matrix
Add 3.12 to setup.py's classifiers

Also, may be good to add python_requires=">=3.7" to the setup() call as well.

Problem: ReadTimeout (API timeout) with easy_entrez

I used easy-entrez to get gene names from SNP IDs. I have a large dataset of 7 million SNPs. I tried with just 4000 (in a for loop, 1000 at a time) and it gave me an error in the last iteration.

HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=443): Read timed out. (read timeout=10)
So how can I solve this problem?
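
One common way to cope with occasional timeouts is to retry the failing call with an increasing delay. The sketch below is generic and uses only the standard library; `with_retries` is a hypothetical helper, not part of easy-entrez. With real requests one would presumably pass `requests.exceptions.ReadTimeout` as `retry_on`, assuming the library lets that exception propagate unchanged. The demonstration simulates the failure with the built-in `TimeoutError`.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar('T')


def with_retries(
    action: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`, retrying on the given exceptions with doubling delays."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(delay)
            delay *= 2
    raise ValueError('attempts must be at least 1')


# Simulated flaky call: fails twice, then succeeds.
calls: List[int] = []


def flaky_fetch() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError('simulated read timeout')
    return 'ok'


waits: List[float] = []  # record the delays instead of really sleeping
outcome = with_retries(flaky_fetch, retry_on=(TimeoutError,), sleep=waits.append)
print(outcome, waits)  # ok [1.0, 2.0]
```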

Suggestion: `pytest-vcr` for reliable testing

I believe the test suite is actually making requests to Entrez each time it's run.

To fix this, I suggest using pytest-vcr, a pytest plug-in for caching the response of requests in a subfolder of the test folder.

It's very easy to use, and may help with the seeming flakiness of CI at the moment

How to (1) search by title and (2) download abstract from matching paper

I am trying to figure out how to:

Search for a matching title in PubMed
Download the relevant abstract from there

Point 1 is failing, I am using [Title] filter with exact title, and it's not getting a match:

# SEE: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8021862/
from easy_entrez import EntrezAPI

TITLE_SUBSTRING = "Interpreting Genetic Variation in Human and Cancer Genomes"

api = EntrezAPI(tool="easy-entrez", email="[email protected]")
search_result = api.search(
    term=f'"{TITLE_SUBSTRING}"[Title]', max_results=1, database="pubmed"
)
result = search_result.data["esearchresult"]  # count is 0 here :/

Can you help me piece it together from here?

Request: `StrEnum` for all filters

So far, I am aware of a few filters:

Publisher ID: filter by DOI
Title: filter by title

It would be cool if easy-entrez defined a StrEnum defining all of the possible filters
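
A rough sketch of what such an enumeration could look like, listing only the two fields named in this issue (so it is deliberately incomplete). `SearchField` and its `term` method are hypothetical names. The `str` mixin is used so the sketch also runs on Python versions before 3.11, where `enum.StrEnum` is not available.

```python
from enum import Enum


class SearchField(str, Enum):
    """Only the two fields mentioned in this issue; not a complete list."""

    PUBLISHER_ID = 'Publisher ID'
    TITLE = 'Title'

    def term(self, value: str) -> str:
        """Build a quoted `"value"[Field]` search term."""
        return f'"{value}"[{self.value}]'


print(SearchField.TITLE.term('Interpreting Genetic Variation'))
# "Interpreting Genetic Variation"[Title]
print(SearchField.PUBLISHER_ID.term('10.3389/fcell.2021.626821'))
# "10.3389/fcell.2021.626821"[Publisher ID]
```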

Entrez search result limit

Hello!
I've come to this project since the BioPython Entrez search failed me.
It used to return more than 9999 results, but now there's this cursed limit.
So, several questions:

Is the default search the same as the one in BioPython?
Are the articles returned in order of relevance? In BioPython they are, and the first articles' PMIDs differ here and there.
And the most important one: how can I get more than 9999 results? I've tried 'in_batches_of' with the entrez_api.search function, but I still get only 9999 results.

I need the simplest use of these functions: I want to put in a term ('T cell', for example) and get a list of PMIDs for the 100k most relevant articles. That's the only thing standing in my project's way.

Cheers

How can I obtain the SNP rs ids using CHR:POS_A1_A2 format for > 3K SNPs?

I have a list of variants as follows and I would like to obtain rs ids using the CHR:POS_A1_A2 format:

1:26860336_C_T
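
As a starting point, identifiers in this form could be parsed and turned into the position query shown in the README. The sketch below is only an illustration: `Variant`, `parse_variant` and `to_search_term` are hypothetical names, the coordinates are assumed to be human GRCh38, and the alleles are not part of the query, so they would still need to be checked against the fetched records afterwards.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chromosome: str
    position: int
    allele_1: str
    allele_2: str


def parse_variant(text: str) -> Variant:
    """Parse the 'CHR:POS_A1_A2' form, e.g. '1:26860336_C_T'."""
    location, allele_1, allele_2 = text.split('_')
    chromosome, position = location.split(':')
    return Variant(chromosome, int(position), allele_1, allele_2)


def to_search_term(variant: Variant) -> str:
    """Build a dbSNP position query; alleles are intentionally left out."""
    return (
        f'{variant.chromosome}[CHROMOSOME] AND human[ORGANISM] '
        f'AND {variant.position}[POSITION]'
    )


variant = parse_variant('1:26860336_C_T')
print(variant)
# Variant(chromosome='1', position=26860336, allele_1='C', allele_2='T')
print(to_search_term(variant))
# 1[CHROMOSOME] AND human[ORGANISM] AND 26860336[POSITION]
```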

`fetch` method with JSON response raises exception

Hi @krassowski ,

thanks for providing this cool package. :)

I just ran into an issue. When calling

result = eapi.fetch(['36999552', '36999549', '36999539'], max_results=3, return_type="xml")
result.data

everything is fine. However, when changing to JSON response:

result = eapi.fetch(['36999552', '36999549', '36999539'], max_results=3, return_type="json")
result.data

I get the exception: json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 9)

Can you reproduce that? Can you find out, why this happens?

Thanks a lot!

Cheers,

Adrian

Batch querying

Hi @krassowski thanks for the easy API!

I was wondering if there is a way to query in batches. I have a list of 1000 coordinates I want to query for rsids. I would have done it in a for-loop but the API is set to limit to 3 queries per second which becomes impossible to implement.

My main question: is there a method I can use to query the 1000 coordinates to get their rsids without using a loop? I believe this would be more efficient and faster, besides avoiding the rate limit set by NCBI.
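
If a loop does turn out to be necessary, spacing the calls keeps it within the limit mentioned above; at three queries per second, 1000 queries take roughly five and a half minutes. The sketch below shows one generic way to do the spacing with the standard library. `Throttle` is a hypothetical helper, not part of easy-entrez, and the loop body only marks where a real search call would go.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space out calls so that at most `per_second` happen each second."""

    def __init__(
        self,
        per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


throttle = Throttle(per_second=3)
started = time.monotonic()
for coordinate in ['13:31873085', '13:46897343', '6:160143444']:
    throttle.wait()
    # A real loop would run the search for `coordinate` here.
elapsed = time.monotonic() - started
print(f'three spaced calls took {elapsed:.2f}s')
```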

Examples fail to run

Description

Running the example code from the documentation fails with this exception:
TypeError: batches_support_wrapper() missing 1 required positional argument: 'collection'

Reproduce

Run this code with easy_entrez 0.3.0:

from easy_entrez import EntrezAPI

entrez_api = EntrezAPI(
    'test',
    '[email protected]'
)

print(entrez_api.link(database=None, ids=[15718680, 157427902], database_from='protein', command='acheck'))

Expected behavior

It should return results instead of throwing this exception.

Context

OS: Ubuntu 18.04.6 LTS
Python: 3.6.9
easy_entrez 0.3.0 was installed in a virtualenv.

Request: `async` support

It would be nice to support async usage. As of easy-entrez==0.3.7, it seems async isn't part of this package.

So the request is to either:

Support async methods in EntrezAPI
Generate an equivalent class AsyncEntrezAPI, like how huggingface_hub does it here

Docs request: possible tool names

The first parameter of EntrezAPI is a tool. Where can possible tool names be found?

To share, I am trying to look up if a DOI returned by a LLM actually exists.

Running pytest sometimes fails

Running pytest (in a venv) on commit 6cd14fb sometimes fails with the following error:

(.venv) jfreige@sl-akali-p-cs1:easy-entrez (main)$ pytest
=================================================================== test session starts ====================================================================
platform linux -- Python 3.10.12, pytest-7.4.2, pluggy-1.3.0
rootdir: /data/local/jfreige/geo-mining/easy-entrez
plugins: cov-4.1.0
collected 15 items

tests/test_api.py .F.                                                                                                                                [ 20%]
tests/test_parsing.py ....                                                                                                                           [ 46%]
tests/test_queries.py ........                                                                                                                       [100%]

========================================================================= FAILURES =========================================================================
_______________________________________________________________________ test_search ________________________________________________________________________

    def test_search():
        result = entrez_api.search('cancer AND human[organism]', max_results=1)
        assert is_response_for(result, SearchQuery)
        assert not is_response_for(result, FetchQuery)
>       assert result.data['esearchresult']['count'] != 0

tests/test_api.py:29:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
easy_entrez/api.py:45: in data
    if self.content_type == 'json':
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <EntrezResponse status=502 for SearchQuery 'cancer AND human[organism]' in pubmed>

    @property
    def content_type(self) -> ReturnType:
        declared_type = self.response.headers['Content-Type']
        if declared_type.startswith('application/json'):
            return 'json'
        if declared_type.startswith('text/xml'):
            return 'xml'
>       raise ValueError(f'Unknown content type: {declared_type}')
E       ValueError: Unknown content type: text/plain

easy_entrez/api.py:41: ValueError
===================================================================== warnings summary =====================================================================
tests/test_parsing.py:39
  /data/local/jfreige/geo-mining/easy-entrez/tests/test_parsing.py:39: PytestUnknownMarkWarning: Unknown pytest.mark.optional - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
    @pytest.mark.optional

tests/test_parsing.py:78
  /data/local/jfreige/geo-mining/easy-entrez/tests/test_parsing.py:78: PytestUnknownMarkWarning: Unknown pytest.mark.optional - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
    @pytest.mark.optional

tests/test_parsing.py:89
  /data/local/jfreige/geo-mining/easy-entrez/tests/test_parsing.py:89: PytestUnknownMarkWarning: Unknown pytest.mark.optional - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
    @pytest.mark.optional

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================= short test summary info ==================================================================
FAILED tests/test_api.py::test_search - ValueError: Unknown content type: text/plain
======================================================== 1 failed, 14 passed, 3 warnings in 13.15s =========================================================
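
The traceback shows a response with status 502 being inspected for its content type, which surfaces as a confusing "Unknown content type" error. One possible way to make such failures easier to read is to check the status before the type. The function below is a standalone sketch of that idea, not the project's actual fix; `detect_content_type` and its parameters are hypothetical.

```python
def detect_content_type(status_code: int, declared_type: str) -> str:
    """Map a Content-Type header to 'json' or 'xml', checking the status first."""
    # Report server-side failures first, so that a 502 with a plain-text
    # body is not mistaken for a response-format problem.
    if status_code >= 400:
        raise ConnectionError(
            f'Entrez request failed with HTTP status {status_code}'
        )
    if declared_type.startswith('application/json'):
        return 'json'
    if declared_type.startswith('text/xml'):
        return 'xml'
    raise ValueError(f'Unknown content type: {declared_type}')


print(detect_content_type(200, 'application/json; charset=UTF-8'))  # json

try:
    detect_content_type(502, 'text/plain')
except ConnectionError as error:
    print(error)  # Entrez request failed with HTTP status 502
```

A test could then treat the connection error as a transient failure (for example by skipping or retrying) rather than as a defect in the parsing code.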