Comments (5)
Thanks for reporting back. I haven't done any benchmarking against prefetch. Until now I was using a wget-like approach for downloading .sra files.
With the latest commit on the master branch, pysradb supports multithreaded downloads. This works both for downloading .sra files and for directly downloading .fastq.gz files. Feel free to give it a try and let me know if you have any comments.
Example notebook here: https://colab.research.google.com/drive/1rpQ00uUdaa6evB9QjLxOCUzITcckNTjN?usp=sharing
from pysradb.
To add to this, running the script again raised a different exception:
Traceback (most recent call last):
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 1332, in getresponse
response.begin()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 303, in begin
version, status, reason = self._read_status()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 272, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 724, in urlopen
retries = retries.increment(
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/util/retry.py", line 403, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/packages/six.py", line 734, in reraise
raise value.with_traceback(tb)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 1332, in getresponse
response.begin()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 303, in begin
version, status, reason = self._read_status()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 272, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
r = call_item()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 252, in __call__
return [func(*args, **kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 252, in <listcomp>
return [func(*args, **kwargs)
File "/projects/test/parallel_download_pysradb.py", line 8, in single_download
db.download(df=df_single, skip_confirmation=True)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pysradb/sradb.py", line 1318, in download
file_sizes = df.apply(get_file_size, axis=1)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pandas/core/frame.py", line 6878, in apply
return op.get_result()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pandas/core/apply.py", line 186, in get_result
return self.apply_standard()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pandas/core/apply.py", line 295, in apply_standard
result = libreduction.compute_reduction(
File "pandas/_libs/reduction.pyx", line 618, in pandas._libs.reduction.compute_reduction
File "pandas/_libs/reduction.pyx", line 128, in pandas._libs.reduction.Reducer.get_result
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pysradb/download.py", line 54, in get_file_size
return float(requests.head(url).headers["content-length"])
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/api.py", line 104, in head
return request('head', url, **kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/projects/test/parallel_download_pysradb.py", line 34, in <module>
Parallel(n_jobs=jobs)(
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 1042, in __call__
self.retrieve()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 921, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Appreciate your help with this.
Thanks,
Zohaib
If I run the script again, does it in any way check which files are already downloaded and skip them? Or, say, resume from where it left off?
It resumes downloads as long as the x.sra.part file exists. In the error you posted, it seems it doesn't. I am not able to replicate this at my end, unfortunately, so I am not sure how to help.
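For anyone curious what resuming from a .part file can look like, here is a minimal sketch using only the standard library. It is not pysradb's actual implementation; the function names (resume_headers, download_with_resume) are made up for illustration, and it assumes the server honours HTTP Range requests.

```python
import os
import shutil
import urllib.request


def resume_headers(part_path):
    """Build a Range header that continues after the bytes already on disk."""
    if os.path.exists(part_path):
        offset = os.path.getsize(part_path)
        if offset > 0:
            return {"Range": f"bytes={offset}-"}
    return {}


def download_with_resume(url, dest, timeout=60):
    """Download url to dest, continuing from dest + '.part' when it exists.

    Hypothetical helper, not part of pysradb.
    """
    if os.path.exists(dest):
        # A finished file is skipped on rerun.
        return dest
    part_path = dest + ".part"
    request = urllib.request.Request(url, headers=resume_headers(part_path))
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        # 206 means the server accepted the Range request, so append.
        # Any other success status means it sent the whole file, so start over.
        mode = "ab" if resp.status == 206 else "wb"
        with open(part_path, mode) as fh:
            shutil.copyfileobj(resp, fh)
    # Rename only after the transfer finished without raising.
    os.replace(part_path, dest)
    return dest
```

If the connection drops mid-transfer, the exception propagates and the .part file stays on disk, so the next call picks up from its current size.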
Do you have any opinion on using the example mentioned here on a SunGridEngine-based job queue system?
You will have to write a custom script to take an SRP, split it into subsets, and submit each subset dataframe to pysradb download. For example, if you want to download only one SRR:
pysradb metadata SRR12100406 --detailed | pysradb download
(1)
You can get a list of SRRs using:
pysradb srp-to-srr SRP251618 --saveto SRP251618.tsv && cut -f 2 SRP251618.tsv
This list of SRRs can then be passed on to (1). You can use snakemake to take care of parallelization and to ensure jobs are rerun if they fail (because of network issues).
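If you would rather not set up snakemake, a rough Python sketch of the same idea is below: split the SRR list into subsets and rerun command (1) for any SRR that fails. The helper names (chunked, download_command, run_with_retries) and the retry count and wait time are assumptions for illustration, not part of pysradb. Each subset could go into its own job script for your queue system.

```python
import shlex
import subprocess
import time


def chunked(items, size):
    """Split a list of SRRs into subsets of at most `size` entries."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def download_command(srr):
    """Build command (1) from this thread for a single SRR."""
    return f"pysradb metadata {shlex.quote(srr)} --detailed | pysradb download"


def run_with_retries(command, retries=3, wait=30, runner=None):
    """Run a shell command, rerunning it on a non-zero exit code.

    `runner` is injectable so the retry logic can be tried without pysradb.
    Note that the exit code of a shell pipeline is that of its last command.
    """
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd, shell=True).returncode
    for attempt in range(1, retries + 1):
        if runner(command) == 0:
            return True
        if attempt < retries:
            time.sleep(wait)  # give a flaky network some time to recover
    return False


def download_subset(srrs, **kwargs):
    """Download every SRR in one subset and return the ones that still failed."""
    return [s for s in srrs if not run_with_retries(download_command(s), **kwargs)]
```

A job for one subset would then call download_subset(subset) and report whatever it returns as still failing.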
Any updates on this?
I tried to optimize this outside of nextflow first, but it seems to take significantly more time to download compared to prefetch from sra-toolkit. I have not had time to look into the nextflow option yet, but will do so at some point. pysradb has worked well for getting the metadata; if the download option is optimized, it may increase the usability significantly.