draftgenomes's Introduction

draftGenomes


Collect all the NCBI WGS sequences for any taxonomic subtree



IMPORTANT NOTICE

LEGACY SOFTWARE: Due to changes in the NCBI database framework, this software no longer works as expected. It will still retrieve old projects from the WGS database, but you will need a complementary method to get new projects. See the most recent issues for examples of the problems. Sorry, I had to move to a different research topic and have no time to develop the major update needed to keep this working after the NCBI changes.



Overview

NCBI WGS (Whole Genome Shotgun) is a huge NCBI database of sequences from incomplete genomes that were sequenced with a whole genome shotgun strategy. Those sequences belong to hundreds of thousands of different sequencing projects, each of which must be located and downloaded individually.

draftGenomes greatly simplifies the otherwise arduous task of collecting all the NCBI WGS sequences related to a taxonomic identifier (taxid) at any taxonomic level. The script downloads the appropriate sequence files from NCBI WGS projects and processes them into a single coherent FASTA file, parsing the sequence headers and updating them where needed.
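As a rough illustration of the "many project files into one FASTA" step, here is a minimal sketch using only the Python Standard Library. It is not the code of draftGenomes: the function name `merge_fasta`, the `project=` tag appended to each header, and the way the project name is derived from the file name are all assumptions made for this example.

```python
import gzip
from pathlib import Path
from typing import Iterable


def merge_fasta(gz_paths: Iterable[Path], out_path: Path) -> int:
    """Concatenate gzipped FASTA files into one plain FASTA file.

    Returns the number of sequence records written.
    """
    records = 0
    with open(out_path, "w") as out:
        for gz_path in gz_paths:
            # Hypothetical tag: the file name up to the first dot, e.g. "AAAD01".
            project = gz_path.name.split(".")[0]
            with gzip.open(gz_path, "rt") as handle:
                for line in handle:
                    if line.startswith(">"):
                        # Illustrative header update: append the project tag.
                        line = f"{line.rstrip()} project={project}\n"
                        records += 1
                    out.write(line)
    return records
```

Reading line by line keeps memory use flat regardless of file size, which matters when the combined download is very large.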

Details

draftGenomes was originally conceived as a Python version of the taxid2wgs Perl script from NCBI, but it now goes beyond that initial purpose.

As downloading and parsing NCBI WGS projects could take a long time (and require a lot of disk space) depending on the taxid selected, the script has progress indicators and recovers from several errors. It has a resume mode in case of any fatal interruption of the process.
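One situation a resume step has to cope with is a partially downloaded archive. The sketch below shows one possible way to test whether a `.gz` file decompresses completely before deciding to reuse it. The helper name `is_complete_gzip` is hypothetical, and this is not necessarily how draftGenomes itself checks its downloads.

```python
import gzip
import zlib
from pathlib import Path


def is_complete_gzip(path: Path, chunk_size: int = 1 << 20) -> bool:
    """Return True if the whole gzip stream can be decompressed."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data in chunks; only errors matter here.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: truncated stream; OSError: missing file or not gzip data.
        return False
    return True
```

A file that fails this check could be deleted and downloaded again before resuming, rather than being parsed and failing midway.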

In addition, there are some other modes of operation:

  • The reverse mode lets a second instance of the script handle the download of sequences without interfering with the first instance, which keeps parsing the sequences to generate the resulting FASTA file.
  • The force mode ignores previous downloads and recreates the final FASTA file regardless of any previous run.
  • The download mode only downloads the WGS project files, without parsing them.
  • The verbose mode replaces the progress indicator with details about every project parsed.

It has been tested successfully on downloads on the order of terabytes, with several forced and unforced interruptions.

Installing

Just clone the GitHub repository or, even easier, download the script or copy and paste its source code.

Running

draftGenomes only requires a Python 3 interpreter. No other packages beyond the Python Standard Library ones are needed.

Output file names have the format WGS4taxid{include}-{exclude}.fa, where {include} is the taxid of the root of the taxonomic subtree of interest, and {exclude} (optional) is the taxid of the root of the taxa excluded from that subtree. Both taxids are options of the script (a run with no taxid-related arguments will test the script).
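The naming scheme can be sketched as a small function. The name `output_name` is hypothetical, and the behaviour when no excluded taxid is given (dropping the "-{exclude}" part) is an assumption, since the text above only states the full format.

```python
from typing import Optional


def output_name(include: int, exclude: Optional[int] = None) -> str:
    """Build an output file name following WGS4taxid{include}-{exclude}.fa."""
    # Assumption: with no excluded taxid, the "-{exclude}" part is dropped.
    if exclude is None:
        return f"WGS4taxid{include}.fa"
    return f"WGS4taxid{include}-{exclude}.fa"
```

For example, `output_name(2, 1224)` gives `WGS4taxid2-1224.fa`; the two taxids here are only illustrative values.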

Please run ./draftGenomes --help to see all the possibilities and details.

draftgenomes's People

Contributors

khyox

draftgenomes's Issues

downloading the entire WGS

If I wanted to download the entire WGS into 1 fasta file, what command should I run? Also, can a blastn search be performed on the WGS database?

WGS project count different than expected

Hi,
Neat tool, looking forward to taking it for a test drive... Noticed that for bacterial projects, draftGenomes says there are 191660, while the following entrez query of Nucleotide on the NCBI website returns 515087 hits: "wgs master"[Properties] AND txid2[orgn] . Is the discrepancy because draftGenomes is excluding duplicate refseqs? I got an error on my third download and multiple attempts at using the -r flag did not help:
./draftGenomes.py -r -t 2
FAILED! Unexpected EOF while parsing file AAAD01.1.fsa_nt.gz. Is it corrupted?
NOTE: You can try to solve any issue and resume
the process using the -r/--resume flag.
Thanks!

Unresolved 'resume DG with the -r flag enabled'

Hi,
Sorry about the two issues in one. Since the other one remains unresolved, I am making it a separate issue now.
./draftGenomes.py -t 2
FAILED! Unexpected EOF while parsing file AAAD01.1.fsa_nt.gz. Is it corrupted?
NOTE: You can try to solve any issue and resume
the process using the -r/--resume flag.
Then the following iterated 4X:

rm AAAD01.1.fsa_nt.gz
./draftGenomes.py -t 2 -r

all with the same error between iterations
I did this on both ubuntu and centos linux systems with the same result. Would you double check that you don't also replicate on your system?

Thanks!
