Comments (5)

saketkc commented on July 20, 2024

Thanks for reporting back. I haven't done any benchmarking against prefetch. Until now I was using a wget-like approach for downloading .sra files.

With the latest commit on the master branch, pysradb supports multithreaded downloads. This works both for downloading .sra files and for downloading .fastq.gz files directly. Feel free to give it a try and let me know if you have any comments.

Example notebook here: https://colab.research.google.com/drive/1rpQ00uUdaa6evB9QjLxOCUzITcckNTjN?usp=sharing
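
A minimal sketch of the equivalent Python usage, assuming the SRAweb metadata call (the accession is illustrative; the download call matches the one visible in the traceback later in this thread):

from pysradb.sraweb import SRAweb

db = SRAweb()

# Detailed run-level metadata for a study (accession is illustrative).
df = db.sra_metadata("SRP251618", detailed=True)

# Download the runs; skip_confirmation=True avoids the interactive prompt
# (this is the same call seen in the traceback later in this thread).
db.download(df=df, skip_confirmation=True)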

anwarMZ commented on July 20, 2024

To add to this, running the script again raised a different exception:

Traceback (most recent call last):
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 1332, in getresponse
    response.begin()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 303, in begin
    version, status, reason = self._read_status()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 272, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 724, in urlopen
    retries = retries.increment(
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 1332, in getresponse
    response.begin()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 303, in begin
    version, status, reason = self._read_status()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 272, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
    r = call_item()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 252, in __call__
    return [func(*args, **kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 252, in <listcomp>
    return [func(*args, **kwargs)
  File "/projects/test/parallel_download_pysradb.py", line 8, in single_download
    db.download(df=df_single, skip_confirmation=True)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pysradb/sradb.py", line 1318, in download
    file_sizes = df.apply(get_file_size, axis=1)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pandas/core/frame.py", line 6878, in apply
    return op.get_result()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pandas/core/apply.py", line 186, in get_result
    return self.apply_standard()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pandas/core/apply.py", line 295, in apply_standard
    result = libreduction.compute_reduction(
  File "pandas/_libs/reduction.pyx", line 618, in pandas._libs.reduction.compute_reduction
  File "pandas/_libs/reduction.pyx", line 128, in pandas._libs.reduction.Reducer.get_result
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pysradb/download.py", line 54, in get_file_size
    return float(requests.head(url).headers["content-length"])
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/api.py", line 104, in head
    return request('head', url, **kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/projects/test/parallel_download_pysradb.py", line 34, in <module>
    Parallel(n_jobs=jobs)(
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 1042, in __call__
    self.retrieve()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 921, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
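
For context, the frames above suggest a joblib fan-out over single-run dataframes. A rough reconstruction, based only on the file name and calls visible in the traceback (the accession, job count, and splitting logic are assumptions):

from joblib import Parallel, delayed
from pysradb.sraweb import SRAweb

def single_download(df_single):
    # One run per call; the failure above happens inside download() while
    # it issues a HEAD request to look up the remote file size.
    db = SRAweb()
    db.download(df=df_single, skip_confirmation=True)

df = SRAweb().sra_metadata("SRP251618", detailed=True)  # accession assumed
jobs = 8  # assumed
Parallel(n_jobs=jobs)(
    delayed(single_download)(df_single)
    for _, df_single in df.groupby("run_accession")
)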

Appreciate your help with this.
Thanks,
Zohaib

saketkc commented on July 20, 2024

> If I run the script again, does it check which runs are already downloaded and skip them, or resume from where it left off?

It resumes downloads as long as the x.sra.part file exists. In the error you posted it seems it doesn't. Unfortunately, I am not able to replicate this at my end, so I am not sure how to help.

> Do you have any thoughts on using the example mentioned here on a SunGridEngine-based job queue system?

You will have to write a custom script that takes an SRP, splits it into subsets, and submits each subset dataframe to pysradb download. For example, if you want to download only one SRR:

pysradb metadata SRR12100406 --detailed | pysradb download (1)

You can get a list of SRRs using:

pysradb srp-to-srr SRP251618 --saveto SRP251618.tsv && cut -f 2 SRP251618.tsv

This list of SRRs can then be passed on to (1). You can use Snakemake to take care of parallelization and ensure jobs are rerun if they fail (for example, because of network issues).
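
A minimal sketch of that custom splitting script in Python, assuming the SRAweb API used elsewhere in this thread (the accession and the "run_accession" column name are assumptions):

from pysradb.sraweb import SRAweb

db = SRAweb()

# One row per run in the detailed metadata (accession is illustrative).
df = db.sra_metadata("SRP251618", detailed=True)

# Split into one single-row dataframe per run; each subset could instead be
# submitted as its own SGE or Snakemake job ("run_accession" is the assumed
# column name for the run ID).
for run, df_single in df.groupby("run_accession"):
    db.download(df=df_single, skip_confirmation=True)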

saketkc commented on July 20, 2024

Any updates on this?

anwarMZ commented on July 20, 2024

I tried to optimize this outside of Nextflow first, but it seems to take significantly more time to download compared to prefetch from sra-toolkit. I have not had time to look into the Nextflow option yet, but will at some point. pysradb has worked well for fetching metadata; if the download option is optimized, it could increase its usability significantly.
