Giter Site home page Giter Site logo

create_krakendb's Introduction

create_krakenDB: a script to create an MDU-specific kraken DB

The MDU kraken DB should include bacteria, virus, fungi, and synthetic sequences (e.g., Illumina adaptors). At some point, we will require also parasites.

Dependencies

* Latest taxonomy from NCBI
* Genomic sequences from the aformentioned organisms
* A FASTA file with adaptor sequences formatted with the *synthetic DNA*
  NCBI code (details below)

Algorithm outline

How will this script accomplish its goal?

1. Download taxonomy
   
   Sample command:
   > kraken-build --download-taxonomy --db $DBNAME

2. Add sequences to it
   Sample command:
   > kraken-build --add-to-library chr1.fa --db $DBNAME

	* Assumption 1: Sequences are in FASTA file
	* Assumption 2: The header includes GI number so kraken can
			match up to NCBI's taxonomic database
	* Assumption 3: If assumption 2 is not met, then the sequence
			header includes the karken taxon-id
	* Assumption 4: Unclear if this step can be done in parallel

3. Build the database
   Sample command:
   > kraken-build --build --db $DBNAME

	* Assumption 1: The default command assumes kmer length 31, it is
			not clear to me at the outset if this is suitable
			for MDU's purposes. I say we go with 31 for now.
	* Assumption 2: The default minimiser length is 12. This is a 
			parameter that will mostly affect speed of search.
			But, the exact effect of a value (faster/slower)
			queries must be empirically determined. I say we
			start with 12, unless we have reason to believe 
			we can change this.
	* Assumption 3: Maximum size of the DB. We can set a maximum size,
			which once reached, will cause Kraken to subsample
			kmer:taxon pairs to fit within the limits. Or, allow
			it to be as big as it needs to be. The default is to
			let it grow as much as it needs to. As an initial pass
			that might do us ok, so at least we can see how big
			it can get!
	* Assumption 4: We can and should parallelise this step. I am thinking
			at least 36 threads?

4. OPTIONAL Thin the database
   Sample command:
   > kraken-build --shrink 10000 --db $DBNAME --new-db minikraken

	* The above command would make the DB exactly 10000 kmer:taxon
	  pairs. Seems small. Here, a decision can be later made. However,
	  presumably the smaller the database, the faster the search. So,
	  there is likely some room here for optimisation. Unclear how much
	  time we would gain by going from 1M kmer:taxon pairs to 100K or 
   	  less, and what that would mean for our accuracy.

Things we will need

Adding sequences to the DB is the most problematic step.

Our data is stored in *.gbk.gz files. This holds two problems:

1. First none of the kraken tools read compressed files
2. Second none of the kraken tools read Genbank files

Solving the first one is relatively straightforward. A call to gunzip might suffice. The second step is deceiving. While there are tools readily available that will easily convert genbank format into fasta format, I have not found one yet that will correctly add the GI number to the header that is necessary for kraken to correctly parse the taxon information associated with the sequence. The solution might require writing a Python or PERL script to wrap around the whole of step 2 of our algorithm. This might allow us to take advantage of some parallelism by adding suitable semaphores.

create_krakendb's People

Contributors

andersgs avatar

Stargazers

Andreas Stroehlein avatar

Watchers

Torsten Seemann avatar James Cloos avatar Andrew Buultjens avatar Jason Kwong avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.