louiejtaylor / grabseqs

A utility for easy downloading of reads from next-gen sequencing repositories like NCBI SRA

License: MIT License

Python 75.80% Shell 24.20%

bioinformatics conda metagenomics ncbi-sra ngs python sra

grabseqs's Issues

For SRA downloading, make sure pigz zips only the correct sequence files

This is extremely unlikely to be an issue in practice, but if someone downloads two accessions where one accession number is a substring of the other, pigz might clobber the files belonging to the shorter accession because of the way files are selected for compression.
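One way to avoid this kind of collision is to match output files against the exact accession instead of a prefix pattern. The following is a minimal sketch, assuming output names of the form `ACC.fastq` or `ACC_<n>.fastq`; the function name is hypothetical and this is not grabseqs's actual logic.

```python
import os
import re

def fastq_files_for_accession(outdir, acc):
    """Return FASTQ files in outdir that belong to exactly this accession.

    Assumes files are named ACC.fastq or ACC_<n>.fastq (an assumption about
    the downloader's naming, not taken from the grabseqs source).
    """
    # Anchor both ends so "SRR123" cannot match "SRR1234_1.fastq".
    pattern = re.compile(r"^" + re.escape(acc) + r"(_\d+)?\.fastq$")
    return sorted(
        os.path.join(outdir, name)
        for name in os.listdir(outdir)
        if pattern.match(name)
    )
```

Because the list is built from files that actually exist, the compressor is also never handed a path that is missing.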

grabseqs and geofetch

Hey, I just came across grabseqs and at first glance it looks really similar to a tool I've been developing called geofetch. I wondered if you had any interest in exploring the possibility of working together on this. Or perhaps you'd be interested in the idea of a PEP, which geofetch produces: a standardized way to represent the sample metadata that is downloaded from GEO. I haven't delved too deep into grabseqs yet as I just found it, but I thought I'd reach out to see if we could make a connection and alert you to some related projects.

Version 0.5.0 breaks numpy

Multiple errors, including:

module 'numpy' has no attribute '__version__'

and

ImportError: Something is wrong with the numpy installation. While importing we detected an older version of numpy in ['/.../miniconda2/envs/sunbeam/lib/python3.6/site-packages/numpy']. One method of fixing this is to repeatedly uninstall numpy until none is found, then reinstall this version.

I suspect this is due to duplicated requirements/dependencies between setup.py (pip) and other packages grabbing numpy in their environment.yml (conda) although I'm not sure what the correct workaround is for this...

Migrate tests to be platform-independent

(i.e. work on more than just circleci)

Filter output from `-l` to conform with expected result from `--no_parsing` flag

Grabseqs error

Hi,
I ran into this problem when using grabseqs.

Traceback (most recent call last):
File "/media/home/user05/anaconda3/envs/python36/bin/grabseqs", line 11, in <module>
sys.exit(main())
File "/media/home/user05/anaconda3/envs/python36/lib/python3.6/site-packages/grabseqslib/__init__.py", line 58, in main
acclist, metadata_agg = get_sra_acc_metadata(sra_identifier, args.outdir, args.list, not args.SRR_parsing, metadata_agg)
File "/media/home/user05/anaconda3/envs/python36/lib/python3.6/site-packages/grabseqslib/sra.py", line 52, in get_sra_acc_metadata
run_col = lines[0].index("Run")
ValueError: 'Run' is not in list
Could you please tell me how to solve this problem? Thanks.

Abstract metadata-parsing and downloading functions to separate modules

Minimal library (grabseqslib?)
One module per repository
Unit tests per function

fasterq-dump error:

Thanks for making this tool, it's a real time saver!

I'm attempting to download a list of SRS accessions. This worked fine at first, but after a few hours it has been consistently erroring:

downloading SRR2192724 using fasterq-dump
2019-04-09T00:56:16 fasterq-dump.2.9.1 err: storage exhausted while creating directory within file system module - error with https open 'https://sra-download.ncbi.nlm.nih.gov/traces/sra33/SRR/002141/SRR2192724'
2019-04-09T00:56:16 fasterq-dump.2.9.1 err: invalid accession 'SRR2192724'
pigz: skipping: SRS475922/SRR2192724*fastq does not exist
SRA download for acc SRR2192724 failed, retrying 0 more times.
Traceback (most recent call last):
  File "/localscratch/EisenRa/miniconda2/bin/grabseqs", line 11, in <module>
    sys.exit(main())
  File "/localscratch/EisenRa/miniconda2/lib/python3.6/site-packages/grabseqslib/__init__.py", line 59, in main
    run_fasterq_dump(acc, args.retries, args.threads, args.outdir, args.force, args.fastqdump)
  File "/localscratch/EisenRa/miniconda2/lib/python3.6/site-packages/grabseqslib/sra.py", line 114, in run_fasterq_dump
    raise Exception("download for "+acc+" failed. fast(er)q-dump returned "+str(retcode)+", pigz returned "+str(rgzip)+".")
Exception: download for SRR2192724 failed. fast(er)q-dump returned 0, pigz returned 0.

It claims invalid accession, but the SRA file link is downloadable with wget. Is this some kind of cache error? I've got enough space on the disk.

Commands run:
while read SRS; do grabseqs sra -t 50 -m -o $SRS -r 3 $SRS; done < SRS.txt
Where SRS.txt = a list of SRS accessions, one per line.

Best wishes,
Raphael

SRA metadata not including BioSample attributes

I thought I was missing some metadata in one of our own previously-submitted SRA datasets because it wasn't showing up in the CSV file, but then the SRA admins pointed out that it does show up on the web interface and the TSV file generated there, just not the version downloaded by grabseqs via the SRA CGI URL.

This is for BioProject PRJNA506241, where you can see the full metadata (columns like dsODN) when viewing it here:

https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA506241

But downloading via this URL gives only the core SRA columns and not the BioSample ones:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=PRJNA506241

Looking closer they're almost two completely different sets of columns except for BioSample (and Consent, for some reason). Did something change server-side with this behavior, maybe? I also asked the SRA admins so I'll post an update if I learn anything.

pigz attempting to zip files that don't exist

Saw this when a test was running:

downloading SRR1913936 using fasterq-dump
spots read      : 11
reads read      : 22
reads written   : 22
pigz: skipping: /home/circleci/grabseqs_unittest/test_tiny_sra_paired/SRR1913936.fastq does not exist
✔ SRA paired sample download test passed

pigz should not attempt to zip a paired sample..?

Or, do I explicitly zip all possible files that come down to hedge against unpaired sequences? Either way, should fix this and #8 at the same time.

Figure out why iMicrobe breaks on circleCI

Use "conda deactivate" instead of "source deactivate"

For the test scripts. As in sunbeam-labs/sunbeam#198

Warn the user if the SRA metadata "paired/unpaired" information does not match which reads come down

Since sometimes the SRA metadata says the samples are paired-end (but only one file comes down!)
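A check of this kind could compare the layout recorded in the metadata with the number of files that actually arrived. This is a minimal sketch: the function name is hypothetical, the layout values `PAIRED` and `SINGLE` are assumed to come from a metadata column such as `LibraryLayout`, and the accession in the usage lines is a placeholder.

```python
import sys

def layout_warning(acc, metadata_layout, fastq_files):
    """Return a warning string if the metadata layout disagrees with the
    downloaded files, or None if they are consistent.

    metadata_layout is assumed to be "PAIRED" or "SINGLE"; fastq_files is
    the list of files actually downloaded for this accession.
    """
    layout = metadata_layout.strip().upper()
    n = len(fastq_files)
    if layout == "PAIRED" and n < 2:
        return f"Warning: {acc} is listed as paired-end but {n} file(s) came down."
    if layout == "SINGLE" and n > 1:
        return f"Warning: {acc} is listed as single-end but {n} files came down."
    return None

# Usage with a placeholder accession: warnings go to stderr, not stdout.
msg = layout_warning("SRR0000001", "PAIRED", ["SRR0000001.fastq.gz"])
if msg:
    print(msg, file=sys.stderr)
```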

Make --no_SRR_parsing default

Makes more sense this way--problems have come from people needing to use it
(Keep the option around until v1.0 at least)

Add support to download data from EBI

It would be nice if grabseqs supported downloading data from EBI.

Add integration tests for a few (very small) .fastq files

grabseqs sra
grabseqs mgrast

Add documentation/FAQs for all subparsers

It would be useful to have a list of "things I have seen go wrong and how I diagnosed/fixed them" for each of the repos, I think

SRA
MG-RAST

Refactor code

Simplify/break up funcs to facilitate testing
Abstract most stuff into repo-specific functions from __init__.py
Write warnings to stderr, not stdout
Re-write tests to be more modular (and make it easier to run specific ones)

Running with previously downloaded file prints hundreds of thousands of "helpful" messages

Re-downloading a SRA sample prints thousands of lines telling you that it found the sample already and won't re-download. Nice to have, but one line will do.

It is:

not specific to HMP samples
Independent of the -m and --no_parsing flags

More intelligently handle HTTP errors in downloading MG-RAST files

Add callout to FAQ in main README

Look into downloading from iMicrobe

See https://www.imicrobe.us/

allow restarting if download fails

don't re-download already complete files (add --force flag to re-download)
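The skip-unless-forced behavior might look something like the sketch below. It assumes a finished download leaves a non-empty `ACC.fastq.gz` or `ACC_1.fastq.gz` in the output directory; both the naming and the function name are assumptions for illustration.

```python
import os

def needs_download(outdir, acc, force=False):
    """Decide whether an accession should be (re-)downloaded.

    An accession counts as complete if a non-empty gzipped FASTQ exists for
    it (assumed names: ACC.fastq.gz or ACC_1.fastq.gz). force=True always
    triggers a download.
    """
    if force:
        return True
    for name in (f"{acc}.fastq.gz", f"{acc}_1.fastq.gz"):
        path = os.path.join(outdir, name)
        # An empty file is treated as an interrupted download.
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return False
    return True
```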

gzip fasterq-dump output

fasterq-dump does not have a gzip option--do manually using the SRR#
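Compressing the output as a separate step can be sketched with the Python standard library alone. The logs elsewhere in this tracker suggest grabseqs shells out to pigz or gzip, so this is only an illustration of the manual step, with a hypothetical function name.

```python
import gzip
import os
import shutil

def gzip_fastq(path):
    """Compress path to path + '.gz', remove the original, return the new path."""
    gz_path = path + ".gz"
    # Stream in chunks so large FASTQ files are not loaded into memory.
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return gz_path
```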

Not compatible with Python 3.7?

Can only install it in Python 3.6, but standard conda is already on Python 3.7. Is there a reason for this restriction?

error handling when sra accession doesn't exist / doesn't return runs

Hi -
when I run grabseqs with a project identifier that has no links to any runs (example: grabseqs sra -l PRJNAXXXXX), grabseqs dies with

ValueError: 'Run' is not in list

Rightly so, because list.index() raises a ValueError when there is no matching item (see e.g. https://docs.python.org/3/tutorial/datastructures.html)

solution:
in line 98 of sra.py the error should be caught with
except ValueError: raise ValueError("Could not find samples for accession: "+pacc+". If this accession number is valid, try re-running.")

Best wishes -
Anna
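The fix proposed in this issue can be sketched as a small helper around the `lines[0].index("Run")` call seen in the traceback. The helper name and its parameters are hypothetical; only the error message is taken from the suggestion above.

```python
def find_run_column(header, accession):
    """Return the index of the "Run" column in a metadata header row.

    header is the list of column names from the first metadata line;
    accession is used only to build a readable error message.
    """
    try:
        return header.index("Run")
    except ValueError:
        # Replace the bare "'Run' is not in list" with an actionable message.
        raise ValueError(
            "Could not find samples for accession: " + accession
            + ". If this accession number is valid, try re-running."
        ) from None
```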

Add a -l option for MG-RAST

Nicely format SRA, MG-RAST, and iMicrobe metadata

Add tests for not clobbering already-downloaded samples

SRA
MG-RAST

Add support to download data from National Genomics Data Center

In the same vein as issue #53, it would be great if this tool could be used to pull data from the National Genomics Data Center also.

Retries parameter should be improved

Related to this Sunbeam issue. It seems as though the grabseqs retry functionality isn't working as intended--make sure that all errors for SRA downloading are caught properly and make the error messages a little clearer.
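A retry wrapper that treats both exceptions and reported failures as retryable could look like the following. This is a sketch under stated assumptions, not the grabseqs implementation: `download` stands for any callable that returns True on success, and the names and defaults are hypothetical.

```python
import time

def run_with_retries(download, acc, retries=2, wait=0.0):
    """Call download(acc) until it succeeds or attempts run out.

    download returns True on success and either returns False or raises on
    failure. Makes 1 + retries attempts in total and returns the number of
    attempts used.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            if download(acc):
                return attempt + 1
            last_error = RuntimeError("download reported failure")
        except Exception as err:  # any error counts as a failed attempt
            last_error = err
        if attempt < retries:
            time.sleep(wait)
    raise RuntimeError(
        f"download for {acc} failed after {retries + 1} attempts"
    ) from last_error
```

Chaining the last underlying error onto the final exception keeps the original cause visible in the traceback.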

Dependency check

Using shutil.which

Brought up originally in this context: #35
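A dependency check based on `shutil.which` might be sketched as below. The function name and the preference order are assumptions; the pigz-then-gzip fallback mirrors the "pigz not found, using gzip" message that appears in logs in this tracker.

```python
import shutil

def pick_compressor(preferred=("pigz", "gzip")):
    """Return the first tool from preferred that is found on PATH.

    Raises RuntimeError with a clear message if none is available, so the
    problem surfaces before any download starts.
    """
    for tool in preferred:
        if shutil.which(tool):
            return tool
    raise RuntimeError(
        "None of these tools were found on PATH: " + ", ".join(preferred)
    )
```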

grabseqs sra -l PRJDB5400 fails with ValueError: 'Run' is not in list

When I use the command "grabseqs sra -l PRJDB5400", I get the following errors:
pigz not found, using gzip
Traceback (most recent call last):
File "/home/tools/anaconda3/bin/grabseqs", line 8, in <module>
sys.exit(main())
File "/home/tools/anaconda3/lib/python3.8/site-packages/grabseqslib/__init__.py", line 58, in main
metadata_agg = process_sra(args, zip_func)
File "/home/tools/anaconda3/lib/python3.8/site-packages/grabseqslib/sra.py", line 27, in process_sra
acclist, metadata_agg = get_sra_acc_metadata(sra_identifier,
File "/home/tools/anaconda3/lib/python3.8/site-packages/grabseqslib/sra.py", line 97, in get_sra_acc_metadata
run_col = lines[0].index("Run")
ValueError: 'Run' is not in list

All grabseqs SRA downloads failing

Looks like some changes on the NCBI side led to failures in SRA downloads:

grabseqs sra SRR11733975
Traceback (most recent call last):
  File "/users/cdiener/miniconda3/envs/sra/bin/grabseqs", line 11, in <module>
    sys.exit(main())
  File "/users/cdiener/miniconda3/envs/sra/lib/python3.7/site-packages/grabseqslib/__init__.py", line 58, in main
    metadata_agg = process_sra(args, zip_func)
  File "/users/cdiener/miniconda3/envs/sra/lib/python3.7/site-packages/grabseqslib/sra.py", line 31, in process_sra
    metadata_agg)
  File "/users/cdiener/miniconda3/envs/sra/lib/python3.7/site-packages/grabseqslib/sra.py", line 97, in get_sra_acc_metadata
    run_col = lines[0].index("Run")
ValueError: 'Run' is not in list

This seems to be caused by a hardcoded address to download the SRA manifest that is not reachable anymore.

Add tests for iMicrobe

Downloading in .fasta format
Downloading in .fastq format
-l (listing)
Not clobbering an already-downloaded file
Forcing download of an already existing file