Comments (19)
The last fix works. Here is an example with your SRP list: https://colab.research.google.com/drive/1pNeuZJjjHliYFk582kGNRpGJ1Fa2h9cn?usp=sharing
Let me know if you still face any errors. I prefer giving it a few seconds of sleep time to make sure it doesn't hit NCBI's API limits.
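A minimal sketch of the pause-between-queries idea, for readers who want it outside the notebook. The helper name `fetch_with_pause`, its parameters, and the one-second default are illustrative assumptions, not part of pysradb; `fetch` stands for any callable that returns the metadata for one accession.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def fetch_with_pause(
    accessions: Iterable[str],
    fetch: Callable[[str], object],
    pause_seconds: float = 1.0,
) -> Tuple[Dict[str, object], List[str]]:
    """Call fetch(accession) for each id, sleeping between calls.

    Hypothetical helper: returns (results, failed) so that one bad
    accession does not stop the whole loop.
    """
    results: Dict[str, object] = {}
    failed: List[str] = []
    for accession in accessions:
        try:
            results[accession] = fetch(accession)
        except Exception as exc:  # broad on purpose: report and move on
            print(f"{accession}: {exc!r}")
            failed.append(accession)
        time.sleep(pause_seconds)  # pause after every call to stay under the rate limit
    return results, failed
```

With pysradb, `fetch` would presumably be the `sra_metadata` method of the database object used in the notebook; the right pause length depends on NCBI's current limits and is not something this sketch tries to guess.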
from pysradb.
Thanks for reporting @anwarMZ, I will be taking a look at it later tomorrow.
Thanks!
Saket
from pysradb.
Sorry about the delay in responding. I am able to obtain results for the first two of these ids:
- SRP040281
- SRP046387
https://colab.research.google.com/drive/1UQpJG32BbjHOf0cV6rxmljf8vhqw22R-?usp=sharing
The problem with the third id, ERP000171, is a missing organism tag (which ideally should have been Yersinia). I will have a fix for this soon, but this is not really a bug at the pysradb end.
from pysradb.
Also, SRP040281 has 120k+ records, so it takes approximately 7 minutes on Colab to fetch it which I think is reasonable.
from pysradb.
Thanks a lot for the bug report @anwarMZ. Would you be able to share an SRP or SRR so that I can reproduce it at my end?
from pysradb.
I just pushed b1fa5d6 which might fix it.
Can you try this with the version on master?:
conda create -y -n pysradb_fix && conda activate pysradb_fix && pip install git+https://github.com/saketkc/pysradb.git
from pysradb.
Thanks for the prompt reply. I have attached the file of study accessions here:
SRA_srp.txt
from pysradb.
Thanks for the SRP list. I will update here once I have a proper fix.
from pysradb.
Hi @saketkc, this works well for querying the IDs. However, in this case it creates a separate file for each query, and I would like to have one combined file for all SRP queries. I am not sure whether the except clause can catch the error if the list is passed directly. Any thoughts?
from pysradb.
You should be able to concat the dataframes using pandas:
master_df = pd.concat([df1, df2, df3, ...])
It is possible to query multiple SRPs at once; however, given NCBI's API limits, it might time out if there are many SRRs (100s of them, as in this case).
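A small sketch of the concat step, ending in a single tab-separated table. The two frames below are made-up stand-ins for the per-study results, and the column names and accession values are placeholders rather than real pysradb output.

```python
import io

import pandas as pd

# Stand-ins for the dataframes returned by each per-study query
# (made-up values, for illustration only).
df1 = pd.DataFrame({"study_accession": ["SRP_A"], "run_accession": ["SRR_1"]})
df2 = pd.DataFrame(
    {"study_accession": ["SRP_B", "SRP_B"], "run_accession": ["SRR_2", "SRR_3"]}
)

# Stack the frames and renumber the rows so the index is not repeated.
master_df = pd.concat([df1, df2], ignore_index=True)

# Write one tab-separated table; a file path can be passed instead of the buffer.
buffer = io.StringIO()
master_df.to_csv(buffer, sep="\t", index=False)
```

Collecting the per-study frames in a list inside the loop and concatenating once at the end avoids writing one file per query.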
from pysradb.
Sure, so I just wanted to confirm that querying multiple (100s of) IDs at once doesn't work with NCBI's API.
Thank you for answering all queries.
I have a quick question: for IDs where certain metadata is missing, does it still make a column for that and leave the cell empty? When concatenating, I need to make sure that two files don't have varying columns and column order.
from pysradb.
I have a quick question: for IDs where certain metadata is missing, does it still make a column for that and leave the cell empty? When concatenating, I need to make sure that two files don't have varying columns and column order.
That's correct.
The only scenario in which this is not true is when you request detailed metadata: sra_metadata(srp, detailed=True). But you can still concat the dataframes by passing sort=False to pd.concat.
Closing this, feel free to reopen if you still encounter issues.
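To illustrate what sort=False does when the frames do not share the same columns, here is a short sketch. The frames, column names, and values are invented for the example; only `pd.concat` and its `sort` and `ignore_index` parameters are real.

```python
import pandas as pd

# Made-up frames: the second one carries an extra column, as can happen
# when detailed metadata differs between studies.
df_a = pd.DataFrame({"run_accession": ["SRR_1"], "organism_name": ["Yersinia"]})
df_b = pd.DataFrame(
    {"run_accession": ["SRR_2"], "host": ["human"], "organism_name": ["Yersinia"]}
)

# The result has the union of all columns. sort=False keeps them in the
# order they were first seen instead of sorting them alphabetically;
# cells with no value in a given frame become NaN.
combined = pd.concat([df_a, df_b], sort=False, ignore_index=True)

print(list(combined.columns))
```

If a fixed column order is needed across several output files, `combined.reindex(columns=wanted_columns)` with an explicit list is one way to enforce it.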
from pysradb.
It worked well for me when we last spoke, but now I am gradually increasing my list to fetch metadata and I am facing an issue. When a study accession fails to return metadata for some reason, it takes a long time to catch the exception and move on to the next one. For example, in the current loop as we discussed, here in Colab it stalls on the following IDs and takes significant time to get past them.
In this case I checked, and for example these accession IDs have had issues:
SRP040281
SRP046387
ERP000171
Also, after looking at #47, I tried to update pysradb to v=0.10.5.dev0, after commit #6904315.
Thanks,
Zohaib
from pysradb.
Hi @saketkc did you get a chance to reproduce the error?
Cheers,
Zohaib
from pysradb.
Okay, I was trying to get the details about the host species, which only come with the detailed flag, e.g. db.sra_metadata(srp, detailed=True). In this case, when I got an error in one of the accession IDs, it just froze for a significant time. But it is good to see I can now calculate the time for each. Thanks
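One way to measure the time spent on each accession is sketched below. `time_each` is a hypothetical helper, not a pysradb function, and `fetch` is whatever callable performs the query for a single ID.

```python
import time
from typing import Callable, Dict, Iterable


def time_each(
    accessions: Iterable[str], fetch: Callable[[str], object]
) -> Dict[str, float]:
    """Return the seconds spent in fetch() for each accession."""
    elapsed: Dict[str, float] = {}
    for accession in accessions:
        start = time.perf_counter()
        try:
            fetch(accession)
        except Exception as exc:
            print(f"{accession} failed: {exc!r}")
        finally:
            # Record the time whether the call succeeded or raised.
            elapsed[accession] = time.perf_counter() - start
    return elapsed
```

Sorting the returned dictionary by value makes the slow or stalling accessions easy to spot.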
from pysradb.
Yes, for a project with a lot of runs, the retrieval time for metadata will increase (though only linearly, as you can see in the last Colab notebook). The detailed mode adds overhead; I haven't done any benchmarking, but it should take at least 2x the time of the non-detailed mode.
I have fixed the issue with ERP000171, so I am closing this. Please feel free to reopen this if you face any issues. For projects with a lot of runs, you can expect it to take ~ 0.004 * nrecords seconds if you are on Colab using the non-detailed mode.
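The rule of thumb above can be turned into a quick estimate. This is only a sketch: the constant is the rough Colab figure quoted in this comment, the 2x factor for detailed mode is the unbenchmarked lower bound mentioned above, and `estimate_seconds` is a hypothetical name.

```python
SECONDS_PER_RECORD = 0.004  # rough non-detailed figure on Colab, from this thread
DETAILED_FACTOR = 2.0       # "at least 2x" per the comment above; not benchmarked


def estimate_seconds(nrecords: int, detailed: bool = False) -> float:
    """Rough lower-bound estimate of metadata retrieval time in seconds."""
    seconds = SECONDS_PER_RECORD * nrecords
    return seconds * DETAILED_FACTOR if detailed else seconds


# About 120,000 records, roughly the size quoted for SRP040281 in this thread.
print(round(estimate_seconds(120_000) / 60, 1))  # prints 8.0 (minutes)
```

That is in the same range as the roughly 7 minutes reported for SRP040281; actual times will vary with network conditions.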
from pysradb.
Hi again @saketkc, thank you for the insights, I managed to get this done. I am now trying to download the sra files for the fetched metadata. I used the example mentioned here in ipynb. I am running this script as a job on a Sun GridEngine based cluster, and the script ended with this error:
    self.retrieve()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 921, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '/projects/NCBI_seqdata/pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra.part'
With this the process was killed. Do you have any idea about this? I believe it could be because the API timed out and needs a time delay between successive downloads. Also, is there a way to skip the files that are already downloaded?
Thank you
from pysradb.
The download method first downloads to a temporary location, which in this case is pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra.part (notice the .part). Downloads are resumable by default. Once a download finishes, the .part extension is removed to mark it complete.
In this case the error likely arises because the parallel module is confused about whether this particular file has already been downloaded (it thinks it hasn't been, but its download is probably already complete).
You should have SRR12100406.sra
Please feel free to open a new issue otherwise.
Thanks,
Saket
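Based on the .part convention described above, a script could check the file system before queueing a download. The following is a sketch under that assumption only; `download_state` is a hypothetical helper and does not reflect how pysradb itself decides what to skip.

```python
from pathlib import Path


def download_state(target: Path) -> str:
    """Classify a download by looking at the file system only.

    Assumes the convention described above: "<name>.part" while the
    download is in progress, "<name>" once it has finished.
    """
    part = target.with_name(target.name + ".part")
    if target.exists():
        return "complete"
    if part.exists():
        return "partial"
    return "missing"
```

For the run in the traceback, `download_state(Path(".../SRR12100406.sra"))` returning "complete" would indicate that the file can be skipped (the directory prefix is left elided here on purpose).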
from pysradb.
Thanks, I will open a new issue to discuss downloading.
from pysradb.