ctgap's Introduction

CtGAP - Chlamydia trachomatis Genome Assembly Pipeline

Install

  1. Clone this repo:
git clone https://github.com/ammaraziz/ctgap
  2. Install miniconda or, preferably, mamba.

  3. Install snakemake:

mamba install -c bioconda snakemake

  4. Manually install rust and scrubby:

mamba install -c conda-forge rust
cargo install scrubby
  5. Download the human genome and rename it to resources/grch38.fasta.

  6. Download one of the kraken dbs with bacterial genomes and rename it to resources/standardDB.

  7. Done. The pipeline will handle the remaining dependencies internally.

Usage

  1. Create a folder ctgap/input/

  2. Add your fastq.gz files in ctgap/input/.

    • Ensure they're named as follows: {sample_name}_{direction}.fastq.gz.
      • e.g. SRR12345_R1.fastq.gz and SRR12345_R2.fastq.gz.
  3. In ctgap/ folder run the pipeline:

snakemake -j 8 --use-conda -k
  • -j 8 sets the total number of cores (threads) the pipeline may use at once.
  • --use-conda tells snakemake to create the conda environments that provide each rule's dependencies.
  • -k tells snakemake to keep going with the remaining samples if one fails.
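
The naming convention from step 2 can be checked before launching the pipeline. The following is a minimal Python sketch and is not part of CtGAP. It assumes the direction tags are R1 and R2, as in the example above; the helper name `pair_fastq_files` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern: {sample_name}_{direction}.fastq.gz with direction R1 or R2.
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<direction>R[12])\.fastq\.gz$")


def pair_fastq_files(filenames):
    """Group file names by sample and collect anything that does not fit.

    Returns (paired, problems):
      paired   maps sample name -> {"R1": file, "R2": file} for complete pairs
      problems is a sorted list of misnamed or unpaired file names
    """
    by_sample = defaultdict(dict)
    problems = []
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            problems.append(name)  # does not follow the naming convention
            continue
        by_sample[match["sample"]][match["direction"]] = name

    paired = {}
    for sample, reads in by_sample.items():
        if set(reads) == {"R1", "R2"}:
            paired[sample] = reads
        else:
            problems.extend(reads.values())  # mate file is missing
    return paired, sorted(problems)


if __name__ == "__main__":
    example = ["SRR12345_R1.fastq.gz", "SRR12345_R2.fastq.gz", "notes.txt"]
    paired, problems = pair_fastq_files(example)
    print("paired samples:", sorted(paired))
    print("files to fix:", problems)
```

To check a real run, pass the result of `os.listdir("input")` instead of the example list.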

Dependencies

  • Snakemake
  • SPAdes
  • Shovill
  • Bowtie2
  • Samtools
  • fastp
  • bbmap (bbnorm)
  • kraken2
  • multiqc
  • blast+

Output

TBA

Cite

Pipeline is created by Shola Olagoke with assistance from Ammar Aziz.

ctgap's Issues

some quick fixes:

rule index:
prefix = "resources/ctReferences"
#from "ctReference" to "ctReferences"


scaffold.yaml

dependencies:

  • ragtag
    #from "rag-tag" to "ragtag"

Originally posted by @gokeson in #2 (comment)

How to handle plasmids?

@gokeson Could you provide advice on how the pipeline should handle plasmids? In a test run I detected a 7.5kb plasmid, full length!

automate ref-denovo for specific reference strain

Automate the ref-denovo assembly arm of CtGAP.
Users supply a file listing each sample name and the reference strain to use for that sample in the ref-denovo step.
The reference strain may be chosen from the CtGAP coverage output (or simply from prior knowledge).
See the example below:

sample_ID ref_strain
ERR12345678 A_Sa1
ERR87654321 Ba_Apache2
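
A sample sheet in this layout could be read as follows. This is a minimal Python sketch, assuming a whitespace-delimited file with the exact header shown above; the function name `read_reference_choices` is hypothetical and not part of CtGAP.

```python
def read_reference_choices(text):
    """Parse a whitespace-delimited sample sheet into {sample_ID: ref_strain}.

    Assumes the first non-empty line is the header "sample_ID ref_strain".
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows or rows[0] != ["sample_ID", "ref_strain"]:
        raise ValueError("expected header: sample_ID ref_strain")

    choices = {}
    for fields in rows[1:]:
        if len(fields) != 2:
            raise ValueError(f"expected 2 columns, got: {fields}")
        sample, strain = fields
        if sample in choices:
            raise ValueError(f"duplicate sample: {sample}")
        choices[sample] = strain
    return choices


if __name__ == "__main__":
    sheet = "sample_ID ref_strain\nERR12345678 A_Sa1\nERR87654321 Ba_Apache2\n"
    for sample, strain in read_reference_choices(sheet).items():
        print(sample, "->", strain)
```

Rejecting duplicate sample names up front keeps a later step from silently picking one of two conflicting reference choices.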

add phylogeny output

Create phylogeny for each sample.

  • Create a representative reference set to use as a whole-genome backbone
  • Do the same for the ompA gene only
  • Create a bed file of recombination regions to mask
  • Use kSNP to generate reference-free SNPs

Add snippy for SNP detection

Add snippy to detect SNPs between multiple assembly methods.

We should be able to use this feature for between-sample comparisons too. What do you think?

So, we will have

  • between assembly methods comparison and
  • between samples comparison

Add QUAST

Assembly metrics with QUAST.
MultiQC too?

v0.3.0

  • add second scrubby step to keep chlamydia reads only
  • consensus sequences:
    • use ragtag for scaffolding
    • separate genome from plasmid
    • collate genomes in single output folder
    • collate plasmids in single output folder
    • add sample name to fasta headers
  • add mlst typing for two existing schemes
  • remove bbnorm as shovill can downsample
  • collate blast results genotype
  • collate coverage
  • use config to specify cpus used
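
As an illustration of the "add sample name to fasta headers" item, here is a minimal Python sketch. The `{sample}_{original header}` layout is an assumption made for this example, not the format CtGAP uses, and `add_sample_to_headers` is a hypothetical helper.

```python
def add_sample_to_headers(fasta_text, sample_name):
    """Prefix every FASTA header with the sample name.

    Assumed output layout: ">{sample_name}_{original header}".
    Sequence lines are passed through unchanged.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(f">{sample_name}_{line[1:]}")
        else:
            out.append(line)
    return "\n".join(out) + "\n" if out else ""


if __name__ == "__main__":
    fasta = ">contig00001 len=1042519\nACGTACGT\n>contig00002 len=7500\nTTGA\n"
    print(add_sample_to_headers(fasta, "SRR12345"), end="")
```

Tagging headers this way keeps contigs traceable once genomes from several samples are collated into one folder or file.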

reference 'guided' consensus assembly

From sola, process to generate a consensus from mapping + denovo.

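#!/bin/bash

# Each CSV row: sample_name,ref_strain,ref_file_name,R1_path,R2_path
while IFS="," read -r sample_name ref_strain ref_file_name R1_path R2_path
do

# Map reads against the bowtie2 index of the chosen reference strain
bowtie2 -x "$ref_strain" -1 "$R1_path" -2 "$R2_path" -S "${sample_name}_${ref_strain}.sam" --local --threads 6

# Convert sam to bam, sort, index, and calculate genome coverage
samtools view -bSh -o "${sample_name}.bam" "${sample_name}_${ref_strain}.sam"
samtools sort "${sample_name}.bam" -o "${sample_name}_sorted.bam"
samtools index "${sample_name}_sorted.bam"
samtools coverage "${sample_name}_sorted.bam" > "${sample_name}_coverage.txt"

# Convert bam to fastq. bamToFastq expects name-sorted input for paired output
# and writes uncompressed FASTQ, hence the extra sort and the .fastq extension.
samtools sort -n "${sample_name}_sorted.bam" -o "${sample_name}_namesorted.bam"
bamToFastq -i "${sample_name}_namesorted.bam" -fq "${sample_name}_refg_R1.fastq" -fq2 "${sample_name}_refg_R2.fastq"

# Shovill final assembly
shovill --outdir "shovill/${sample_name}ref-guided" --gsize 1.04M --R1 "${sample_name}_refg_R1.fastq" --R2 "${sample_name}_refg_R2.fastq"

# tail -n +1 reads every row; use +2 instead if the CSV has a header line
done < <(tail -n +1 purple_ref-genome_assembly_guide.csv)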

to-do

at some point, we may want to

  • reconfigure scrubby to keep chlamydia reads only
  • remove bbnorm as shovill can downsample
  • collate all assembled genomes, genotype and coverage outputs into a folder each when working with multiple samples.

Scrubby updates

@gokeson Ammar mentioned you were integrating Scrubby with the pipeline! Really cool, it was mostly a small side project thing, but people seem to be using it here and there, so will do my best to upgrade it accordingly over the next two weeks or so.

Is there anything specific you were keen to see besides easy deployment via BioConda and/or binaries? We can keep a checklist here, including anything relevant for your lab that you'd like to add.

Scrubby wishlist:

  • Distribution via BioConda or at least private channel
  • HPRG reference genome database for depletion
  • Reference database downloader with pre-built indices

Changes for v0.02

  • Use scrubby for human removal step
  • Filter for chlamydia spp
  • Add fastp for adapter and quality trimming
  • Collate coverage stats from samtools coverage, compute meanmapq for each genome hit.
  • get uniquely mapped reads for ct_ref.fasta alignment
  • use top hit reference for
    • alignment based genome assembly
    • denovo assembly
  • replace spades with shovill
  • look at genome completeness for the alignment step

create plurality consensus from 24 reference genomes

To create the plurality consensus:

  1. Reference genomes were oriented with dnaapler:
for f in *.fasta; do echo $f; dnaapler all -i $f -o plurality/${f/.fasta/.reorient} -t 6; done
  2. The dnaapler output was aligned all-to-all with mugsy using default settings:
mugsy -p mugsyout *.fasta
  3. Each region (separated by =) was extracted manually.
  4. For each region, goalign consensus was run:
goalign consensus --ignore-gaps -i input.fasta -o output.cons.fa
  5. All cons.fa sequences were concatenated and renamed:
cat *.cons.fa | seqkit replace -p "$" -r "_{nr}" > plurality_all.fasta
  6. The combined plurality consensus was scaffolded against Ct Genotype D using ragtag:
ragtag.py scaffold -u scaffoldReference.fasta plurality_all.fasta -o plurality_1_scaffold
  7. The output was renamed again:
seqkit replace -p "(.*)" -r "plurality_{nr}" plurality_1_scaffold/ragtag.scaffold.fasta > plurality_final.fasta

plurality_final.txt
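
The manual extraction of regions could in principle be scripted. The following Python sketch assumes the alignment has already been converted to a FASTA-like text file in which a line starting with "=" closes each region and lines starting with "#" are comments; that layout is an assumption, and `split_alignment_blocks` is a hypothetical helper rather than the procedure actually used.

```python
def split_alignment_blocks(text):
    """Split alignment text into regions delimited by lines starting with '='.

    Comment lines ('#') and blank lines are dropped. Each returned block is
    the text of one region, ready to be written to its own FASTA file.
    """
    blocks, current = [], []
    for line in text.splitlines():
        if line.startswith("="):
            if current:
                blocks.append("\n".join(current) + "\n")
            current = []
        elif line.strip() and not line.startswith("#"):
            current.append(line)
    if current:  # trailing region without a closing '=' line
        blocks.append("\n".join(current) + "\n")
    return blocks


if __name__ == "__main__":
    demo = "#header\n>ref1\nACGT\n>ref2\nACGA\n=\n>ref1\nTT\n>ref2\nTA\n=\n"
    for i, block in enumerate(split_alignment_blocks(demo), start=1):
        print(f"region {i}:")
        print(block, end="")
```

Each block could then be saved as the input.fasta for the goalign consensus step.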

error:

RuleException in rule scrub in file /home/oolago2/assembly_pipeline/fullTest_Dec23/CtGAP_test/ctgap/workflow/rules/2-scrub.smk, line 1:
AttributeError: 'OutputFiles' object has no attribute 'r1tmp', when formatting the following:

    scrubby scrub-reads     -i {input.r1} {input.r2}        -o {output.r1tmp} {output.r2tmp}        --kraken-db {params.db}         --kraken-taxa "Archaea Eukaryota Holozoa Nucletmycea"   --min-len {params.minlen}       --minimap2-index {params.human}         --kraken-threads {threads}      --workdir {params.workdir:q} 2> {log}

    echo -e "
Scrubby Kraken Extract
" >> {log}

    scrubby scrub-kraken    -i {output.r1tmp} {output.r2tmp}        -o {output.r1} {output.r2}      --extract       --kraken-taxa {params.kraken_taxa_extract}      --kraken-reads {params.workdir}/0-standardDB.kraken     --kraken-report {params.workdir}/0-standardDB.report    --kraken-threads {threads} 2>> {log}

    touch {output.status}

Originally posted by @gokeson in #2 (comment)
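
The AttributeError above means the shell command refers to {output.r1tmp}, but the rule's output section declares no file under that name. A mismatch like this can be spotted by comparing the placeholders in the command with the declared names. The Python sketch below is a standalone illustration and does not use Snakemake's API. The declared names in the demo are hypothetical, because the rule's actual input, output, and params sections are not shown in the report.

```python
import re

# Matches {input.x}, {output.x}, {params.x}, with an optional format flag such as ":q".
PLACEHOLDER = re.compile(r"\{(input|output|params)\.(\w+)(?::\w+)?\}")


def missing_placeholders(shell_template, declared):
    """List placeholders used in the command but absent from `declared`.

    declared maps a section name ("input", "output", "params") to the set of
    names defined in that section of the rule.
    """
    missing = set()
    for section, name in PLACEHOLDER.findall(shell_template):
        if name not in declared.get(section, set()):
            missing.add(f"{section}.{name}")
    return sorted(missing)


if __name__ == "__main__":
    command = (
        "scrubby scrub-reads -i {input.r1} {input.r2} "
        "-o {output.r1tmp} {output.r2tmp} --workdir {params.workdir:q} 2> {log}; "
        "touch {output.status}"
    )
    # Hypothetical declarations for illustration only.
    declared = {
        "input": {"r1", "r2"},
        "output": {"r1", "r2", "status"},
        "params": {"workdir"},
    }
    print(missing_placeholders(command, declared))
```

Any name the check reports needs either a matching entry in the rule or a corrected placeholder in the command.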
