Hello! I've come to this project because the BioPython Entrez search failed me. It

Entrez search result limit about easy-entrez HOT 20 CLOSED

krassowski commented on May 23, 2024

Entrez search result limit

from easy-entrez.

Comments (20)

noamiz5060 commented on May 23, 2024

I really need your help; my project is stuck and I'm really desperate.

krassowski commented on May 23, 2024

Can you post a reproducible example code using in_batches_of that you tried and did not work so I could take a look?

noamiz5060 commented on May 23, 2024

> Can you post a reproducible example code using in_batches_of that you tried and did not work so I could take a look?

Thank you very much for the reply!
Rather than posting my fruitless attempts (which probably failed because of my own lack of understanding), it would be more efficient to ask: what is the simplest way to get, for example, 100,000 IDs for 't cell' from PubMed/NCBI?

noamiz5060 commented on May 23, 2024

Actually, I'll post an example that partly succeeded.
I managed to get a result, but the limit is still 9999:
a = entrez_api.in_batches_of(1_000).search("t cell", max_results=100_000, database='pubmed')

When I try to use in_batches_of(1_000).fetch, I get this error:
ValueError: Received str but a list-like container of identifiers was expected
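
One plausible reading of the error above is that fetch was handed a string where it expects a list-like container of identifiers. The following is a minimal sketch, assuming the search response has the JSON shape used elsewhere in this thread (`data["esearchresult"]["idlist"]`). `extract_ids` and `chunked` are hypothetical helpers for illustration, not part of easy-entrez, and the payload is hand-written rather than a real response.

```python
from typing import Iterator, List


def extract_ids(esearch_json: dict) -> List[str]:
    """Pull the list of UIDs out of an ESearch-style JSON payload."""
    ids = esearch_json["esearchresult"]["idlist"]
    if isinstance(ids, str):
        # A bare string would be iterated character by character downstream,
        # so reject it early.
        raise ValueError("expected a list of identifiers, got a str")
    return list(ids)


def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` identifiers."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Hand-written payload standing in for a real response (illustrative IDs only).
payload = {"esearchresult": {"count": "5", "idlist": ["11", "12", "13", "14", "15"]}}
ids = extract_ids(payload)
batches = list(chunked(ids, 2))
```

The extracted list (not the response object, and not a single string) is what would then be handed to the batched call, for example `entrez_api.in_batches_of(1_000).fetch(ids, max_results=len(ids), database='pubmed')`, assuming fetch takes the identifiers as its first argument.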

jonasfreimuth commented on May 23, 2024

I am currently attempting to work with GEO series in the gds database. It would be very useful to me if I could just ask for all of something during EntrezAPI.search(), e.g. all the series released within the last 3 months. While in most cases setting max_results to 100_000 should be fine, there are currently slightly under 210'000 series in the database, so there are time intervals for which this limit is too small. There should either be an option to also specify a retmax so that I can manually construct batches for search, or the batching system should be extended to EntrezAPI.search(). I notice there is already the batching-improvements branch, but there appear to be no commits there yet. Is there already some work on adding the batching functionality, or would it be worth it for me to invest some time to come up with something myself?

So instead of

from easy_entrez import EntrezAPI
EntrezAPI("", "").in_batches_of(10**4 - 1).search({"entrytype": "GSE"}, max_results=100_000, database="gds")

I'd like to just say

from easy_entrez import EntrezAPI
EntrezAPI("", "").in_batches_of(10**4 - 1).search({"entrytype": "GSE"}, max_results=None, database="gds")

and get useful results (which will then be summarized and get their accession field extracted).

krassowski commented on May 23, 2024

batching-improvements was merged in #15 which is why you see no commits. I removed that branch now.

jonasfreimuth commented on May 23, 2024

Ok, and there is no work so far for adding batching to search? I don't understand the system very well yet, is there a special reason why search can't be decorated with supports_batches?

krassowski commented on May 23, 2024

Thanks for your interest! Implementing batching for search requires a different approach than for the other methods. Essentially one would need to send subsequent requests with increasing retstart (see ESearch and the Perl example); this is of course a poor API on Entrez's side, because if the database gets updated between queries you may miss some records or retrieve some records twice, which is why I was not keen on implementing it in the first place. However, I am happy to accept a pull request if it comes with reasonable documentation/warning explaining this potential pitfall.
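
The retstart approach described above could look roughly like the sketch below. It is not the easy-entrez implementation: `fetch_page` and `search_all_ids` are hypothetical names, and the code talks to the public ESearch endpoint directly with `db`, `term`, `retmode`, `retstart` and `retmax`. De-duplication handles records returned twice, but nothing here can detect records that were missed because the database changed between requests.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, List

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fetch_page(params: Dict[str, str]) -> dict:
    """Send one ESearch request and return the parsed JSON (needs network)."""
    url = ESEARCH_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def search_all_ids(
    term: str,
    database: str,
    page_size: int = 10_000,
    fetch: Callable[[Dict[str, str]], dict] = fetch_page,
) -> List[str]:
    """Collect UIDs page by page by increasing retstart."""
    seen = set()
    ordered: List[str] = []
    retstart = 0
    while True:
        result = fetch({
            "db": database,
            "term": term,
            "retmode": "json",
            "retstart": str(retstart),
            "retmax": str(page_size),
        })["esearchresult"]
        page = result["idlist"]
        for uid in page:
            # Drop repeats: records can shift between pages if the database
            # is updated while we paginate.
            if uid not in seen:
                seen.add(uid)
                ordered.append(uid)
        retstart += page_size
        # Stop on an empty page or once retstart has passed the reported count.
        if not page or retstart >= int(result["count"]):
            break
    return ordered
```

Passing the page fetcher in as an argument keeps the loop testable without network access. A real client would also need to pause between requests to respect NCBI's rate limits.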

jonasfreimuth commented on May 23, 2024

Alright, I just understood why the batching doesn't work with search. OK, I will see if I am able to wrangle up something that fulfills these criteria while doing what it's supposed to. Thanks for outlining what you expect 👍

jonasfreimuth commented on May 23, 2024

Actually, why is there this limit of 100'000 records for (e)search? I experimentally removed it and, at least for the gds database, everything seems to work fine.

resp = EntrezAPI("", "").search({"entrytype": "GSE"}, max_results = 300_000,  database="gds")
# The number of unique ids corresponds to the number of total results as reported by entrez.
len(set(resp.data["esearchresult"]["idlist"])) == int(resp.data["esearchresult"]["count"])

The documentation talks about a 10'000 UID limit, but for gds at least, that seems to be just as non-binding as the 100'000-record limit.

Some further experimentation revealed that the actual limit after which Entrez refuses to send anything is 2'147'483'647, i.e. the maximum value of a 32-bit signed integer. So if this works for all databases, retrieving all UIDs for a query would just entail setting max_results to that.

# Max limit test for convenience
resp = EntrezAPI("", "").search({"entrytype": "GSE"}, max_results = 2 ** 31 - 1,  database="gds")
len(set(resp.data["esearchresult"]["idlist"])) == int(resp.data["esearchresult"]["count"])

krassowski commented on May 23, 2024

Well the limit on search() is there exactly because the documentation of ESearch states that there is such a limit.

> retmax
>
> Total number of UIDs from the retrieved set to be shown in the XML output (default=20). By default, ESearch only includes the first 20 UIDs retrieved in the XML output. [...] Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 10,000 records.
>
> To retrieve more than 10,000 UIDs from databases other than PubMed, submit multiple esearch requests while incrementing the value of retstart (see Application 3). For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved.

It is good to hear that it works for you with a specific database but I suspect it might not work for all databases.

> Some further experimentation revealed the actual limit after which entrez refuses to send anything is 2'147'483'647, aka the maximum number for a 32 bit signed integer

Are there really as many records in the gds database? Or is len(set(resp.data["esearchresult"]["idlist"])) actually a lower number?

jonasfreimuth commented on May 23, 2024

> Well the limit on search() is there exactly because the documentation of ESearch states that there is such a limit.

But that would be 10'000 not 100'000, or is there just a typo somewhere?

> It is good to hear that it works for you with a specific database but I suspect it might not work for all databases.

I am currently checking what I get back for the other ones, as defined in easy_entrez/data/entrez_databases.tsv

> Are there really as many records in the gds database? Or is len(set(resp.data["esearchresult"]["idlist"])) actually a lower number?

In gds there are ~210'000 records for that query. In total though, gds comprises 6'961'960 records. I just wanted to find out what the actual maximum possible number is, because Entrez always returns as many records as the minimum of retmax and the actual number of results. I'd put Infinity there if it worked; the idea is just to get everything by default. Having any number other than "everything" is just arbitrary, no?
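
The arithmetic behind this can be written out as a small sketch. It assumes the observation above holds (a response carries the smaller of retmax and the number of matches) and uses the approximate figure of 210'000 matches from this thread; the function names are made up for illustration.

```python
import math


def returned_ids(retmax: int, total_matches: int) -> int:
    """Number of UIDs in one response: the smaller of retmax and the match count."""
    return min(retmax, total_matches)


def requests_needed(total_matches: int, page_size: int) -> int:
    """How many paginated requests are needed to cover all matches."""
    return math.ceil(total_matches / page_size)


INT32_MAX = 2**31 - 1  # 2_147_483_647, the largest retmax reported as accepted

print(returned_ids(INT32_MAX, 210_000))   # 210000
print(requests_needed(210_000, 10_000))   # 21
print(requests_needed(210_000, 100_000))  # 3
```

So a single request with a very large retmax would return everything, whereas sticking to the documented 10,000-record page size would take 21 requests for the same query.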

krassowski commented on May 23, 2024

> But that would be 10'000 not 100'000, or is there just a typo somewhere?

Yes, it appears that the current limit is too lax and should have been 10k, not 100k.

jonasfreimuth commented on May 23, 2024

Should this limit then be dynamic, depending on the database? I can see if I can determine individual database limits by experimentation...

krassowski commented on May 23, 2024

Well these limits can change. I would be more inclined to have a separate argument force_override_max_results_i_know_what_i_am_doing: Optional[int] (ok maybe the name could be shortened).
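
An explicit opt-out along those lines might be sketched as follows. This is only an illustration of the idea, not what easy-entrez ships: `resolve_max_results` and the shortened parameter name `override_max_results` are hypothetical, and the default of 10,000 is taken from the ESearch documentation quoted earlier in the thread.

```python
from typing import Optional

DOCUMENTED_ESEARCH_LIMIT = 10_000  # from the ESearch documentation quoted above


def resolve_max_results(
    max_results: int,
    override_max_results: Optional[int] = None,
    documented_limit: int = DOCUMENTED_ESEARCH_LIMIT,
) -> int:
    """Return the retmax to send, enforcing the documented limit unless overridden."""
    if override_max_results is not None:
        # The caller explicitly opted out of the safety check.
        return override_max_results
    if max_results > documented_limit:
        raise ValueError(
            f"max_results={max_results} exceeds the documented limit of "
            f"{documented_limit}; pass override_max_results to bypass this check"
        )
    return max_results
```

Keeping the override as a separate argument means the default behaviour still follows the documentation, while users who have verified a higher limit for their database can state that intent explicitly.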

jonasfreimuth commented on May 23, 2024

This seems reasonable 👍

jonasfreimuth commented on May 23, 2024

Could you look over #18 please, @krassowski? This should solve my immediate problem, and at least logically I don't see why the actual request limit for eSearch should ever lie below the number of search results which would necessitate batched search. eSearch returns nothing (much) besides the IDs...

krassowski commented on May 23, 2024

It turns out I had an implementation of pagination locally: #21. I hope it is no longer needed with #18, but if the Entrez API becomes more restrictive we can always get back to #21.

krassowski commented on May 23, 2024

v0.3.7 is now released and available on PyPI: https://pypi.org/project/easy-entrez/0.3.7/

jonasfreimuth commented on May 23, 2024

Great, thank you very much!
