csb5 / germs_16s_pipeline Goto Github PK

14.0 9.0 7.0 58 KB

Pipeline for Illumina shotgun sequencing of 16S rRNA amplicon sequences

Python 96.64% Shell 3.36%

germs_16s_pipeline's Introduction

This is a remake of the original 16S pipeline developed with in the Genome Institute of Singapore, designed for Illumina shotgun sequencing of 16S rRNA amplicon sequences (Ong et al., 2013, PMID 23579286).

Running the pipeline

The pipeline is based on snakemake. The main program (S16.py) will write a config file (conf.json) and snakemake file (snake.make) in the given output directory. These are then used to call snakemake via qsub using the also created wrapper script snake.sh. The system will send an email to you upon completion (be it successful or not).

For help see S16.py --help.

Only upon successful completion the output directory will contain an empty file called COMPLETE.

Results (abundance tables and piecharts) can then be found in results subdirectory (see report.html there).

Steps involved

Reads are first preprocessed (merging of multiple files, trimming etc.) with famas.
Preprocessed reads are classified into 'unspecific', 'spike-in' and '16S', based on a BWA-MEM mapping against the original EMIRGE/ARB SSU database containing the spike-in (see ratios.txt for corresponding ratios).
Only 16S sequences are used for reconstructing the full-length sequences with EMIRGE (Miller et al., 2011, PMID 21595876) or EMIRGE amplicon (Miller et al., 2013; PMID 23405248)
Primers are clipped from the reconstructed sequences
Clipped sequences are mapped with Graphmap against a preclustered version (99% OTU) of the Greengenes database for classification.
Abundance-tables and piecharts are created in the results subdirectory of the output directory. Pairwise identity thresholds for the different taxonomi ranks are implemented as determined by Yarza et al. (2014; PMID 25118885).

Tip

If you want to first see what the pipeline would do, call S16.py with --no-run. Then check the created files (see above).

To print all commands that would be run, use:

snakemake -s snake.make --configfile conf.json -n -p

To get a graphical representation of the workflow, run (from the output directory):

snakemake -s snake.make --configfile conf.json --dag --forceall | dot -Tpdf > dag.pdf

and have a look at dag.pdf. Once you're satisfied just run bash snake.sh.

Setup

This is tuned towards an in-house setup. If you want to replicate it see the CONF variable in S16.py, which lists expected binaries and databases.

germs_16s_pipeline's People

Stargazers

Watchers

Forkers

betacollins daxiw xxz19900 tankmermaid dongqing7 haoziyeung xiangrong131

germs_16s_pipeline's Issues

Get rid of hardcoded working dir in snake.sh

set all scripts to force overwrite

otherwise you can for example get, "Cowardly refusing to overwrite existing file 'emirge_outprimer_trimmed.fa':

rpos >= len(ref)

See dir bug-rpos-not-lt-ref
Must be Graphmap
Reported by Collins 2016-05-06

Compute rarefaction automatically

Keep BAM from final EMIRGE iteration
Extract counts from BAM and assume fixed detection limit
The downsample in silico

rarefaction curve: blue max line sometimes clipped

e.g. /mnt/projects/desessions/teohhk/emirge/danone_June2015/batch9_May2016/AHS150/results

Plot changes over iteration

Use probN.pkl.gz as in paper

Use local rules in snakefile (and increase mem in master job)

Trim GG before clustering

The classification database (99% OTU) should have been trimmed before clustering instead of using the preclustered database. Pointed out by Christophe LAY

Set Bowtie and Usearch paths pre calling

Hardcoded and different Python interpreters used

Create own env?

Determine source of EMIRGE/Bowtie unaligned reads

example emirge-bowtie-unmapped-source-PHH4059
note 1: emirge is a b**** and sequentially renames reads before mapping starting from 0
note 2: only properly paired reads seem to be listed in bam
created tmp copy: emirge_tmp_reads_[12].fastq.keep.gz
samtools view -F 0x4 emirge/iter.10/bowtie.iter.10.PE.u.b.bam | cut -f 1 | sort -us -o bowtie.iter.10.PE.u.b.bam.mapped-ids.txt

Empty BAM

No errors in logs, but BAM files empty. See bug-empty-bam-PHH4043

Graphmap 0.3.0 results in many unaligned seqs

See also isovic/graphmap#28

See e.g. AHS182

cd /mnt/projects/wilma/16s/S16_V2/bugs/AHS182

source /mnt/software/etc/gis.bashrc
source activate py3k
snakemake -s snake.make --configfile conf.json -p -T --dryrun --force


new:

samtools view graphmap-0.3.0-1d16f07.ident.bam | head -n 1
82|ACCG02000012.525518.527039   0   742595  281 1   477M1D249M  *   0   0   <removed>   *   MD:Z:3C473^G246T2   NM:i:3  AS:i:3604   H0:i:4  ZE:f:0  ZF:f:0.334728   ZQ:i:726    ZR:i:1382   Xi:f:99.6

samtools view greengenes-hits-graphmap.bam | grep '821|AF414833.1.1363'
821|AF414833.1.1363 4   *   0   255 *   *   0   0   <removed> * NM:i:-1 AS:i:-731   H0:i:0  ZE:f:inf    ZF:f:0  ZQ:i:731    ZR:i:0  Xi:f:0

/mnt/projects/wilma/16s/S16_V2_inst/f714a48/classify_hits.py -f -q emirge_outprimer_trimmed.fa -i graphmap-0.3.0-1d16f07.ident.bam -t /mnt/genomeDB/misc/greengenes.secondgenome.com/downloads/13_5/gg_13_5_otus/taxonomy/99_otu_taxonomy.txt -o -
82|ACCG02000012.525518.527039   0.290834    742595  99.6    k:Bacteria  p:Actinobacteria    c:Actinobacteria    o:Bifidobacteriales f:Bifidobacteriaceae    g:Bifidobacterium   s:breve
821|AF414833.1.1363 0.127213    *   0   k:  p:  c:  o:  f:  g:  s:

old:

samtools view graphmap-0.2.2-dev-604a386.ident.bam | head -n 1
82|ACCG02000012.525518.527039   0   484303  338 1   726M    *   0   0   <removed>   *   NM:i:2  AS:i:3612   H0:i:1  ZE:f:0  ZF:f:0.334728   ZQ:i:726    ZR:i:1535   Xi:f:99.7
821|AF414833.1.1363 0   4480544 347 0   731M    *   0   0   <removed> * NM:i:5  AS:i:3610   H0:i:1  ZE:f:0  ZF:f:0.335099   ZQ:i:731    ZR:i:1536   Xi:f:99.3

/mnt/projects/wilma/16s/S16_V2_inst/f714a48/classify_hits.py -f -q emirge_outprimer_trimmed.fa -i graphmap-0.2.2-dev-604a386.ident.bam -t /mnt/genomeDB/misc/greengenes.secondgenome.com/downloads/13_5/gg_13_5_otus/taxonomy/99_otu_taxonomy.txt -o -
82|ACCG02000012.525518.527039   0.290834    484303  99.7    k:Bacteria  p:Actinobacteria    c:Actinobacteria    o:Bifidobacteriales f:Bifidobacteriaceae    g:Bifidobacterium   s:breve
821|AF414833.1.1363 0.127213    4480544 99.3    k:Bacteria  p:Bacteroidetes c:Bacteroidia   o:Bacteroidales f:Prevotellaceae    g:Prevotella    s:nigrescens

What does "Skipping because of primer matches with wrong orientation" mean?

When I run primer_trimmer.py, I get a lot of warning messages about “Skipping ___ because of primer matches with wrong orientation”. What does this mean?

The code that probably caused this situation was in lines 181-183 of primer_trimmer.py and it looked like this:

        if pos['fw'] > pos['rv']:
            LOG.info("Skipping %s because of primer matches with wrong orientation" % (seqrec.id))
            Continue