The ctgap from ammaraziz

rename outputs to sample name

The contigs need to be renamed.

shovill -> scaffold -> gapfiller -> dnaapler -> split+rename
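
The final split+rename step could look roughly like the sketch below. It assumes contigs should be named `<sample>_<n>` in file order; both the naming scheme and the function name `rename_contigs` are illustrative assumptions, not what the pipeline currently does.

```shell
#!/bin/sh
# Hypothetical helper: rename every FASTA header to <sample>_<index>.
# $1 = sample name, $2 = input FASTA; the renamed FASTA is written to stdout.
rename_contigs() {
    awk -v s="$1" '
        /^>/ { n++; print ">" s "_" n; next }  # replace each header with a sample-based name
        { print }                            # sequence lines pass through unchanged
    ' "$2"
}
```

Example use: `rename_contigs ERR12345678 contigs.fa > ERR12345678.fasta`.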

some quick fixes:

rule index:
prefix = "resources/ctReferences"
# changed from "ctReference" to "ctReferences"

scaffold.yaml, under dependencies:
ragtag
# changed from "rag-tag" to "ragtag"

Originally posted by @gokeson in #2 (comment)

How to handle plasmids?

@gokeson Could you provide advice on how the pipeline should handle plasmids? In a test run I detected a 7.5kb plasmid, full length!

automate ref-denovo for specific reference strain

Automate the ref-denovo assembly arm of CtGAP.
Users supply a file listing each sample name and the reference strain chosen for that sample in the ref-denovo step.
The reference strain can be chosen from the CtGAP coverage output (or from prior knowledge).
See the example below:

sample_ID ref_strain
ERR12345678 A_Sa1
ERR87654321 Ba_Apache2
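
A minimal sketch of how the pipeline might read such a file, assuming it is whitespace-delimited with an optional `sample_ID ref_strain` header row as in the example above. The function name `ref_for_sample` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the reference strain chosen for one sample.
# $1 = sample sheet, $2 = sample ID. Exits non-zero if the sample is not listed.
ref_for_sample() {
    awk -v id="$2" '
        NR == 1 && $1 == "sample_ID" { next }   # skip the header row if present
        $1 == id { print $2; found = 1; exit }  # first match wins
        END { if (!found) exit 1 }
    ' "$1"
}
```

For the example sheet, `ref_for_sample samples.txt ERR87654321` prints `Ba_Apache2`. A non-zero exit status lets the calling rule fail early when a sample has no reference assigned.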

add phylogeny output

Create phylogeny for each sample.

Create a representative reference set to use as a whole-genome backbone
Do the same for the ompA gene only
Create a BED file of recombination regions to mask
Use kSNP to generate reference-free SNPs

Add snippy for SNP detection

Add snippy to detect SNPs between multiple assembly methods.

We should be able to have this feature for between sample comparisons too. What do you think?

So we will have:

a between-assembly-methods comparison, and
a between-samples comparison
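
Both comparison modes reduce to running snippy on pairs of assemblies, so one possible approach is sketched below. It only prints the commands (a dry run) and assumes assemblies are FASTA files ending in `.fasta`; the function name `snippy_pairs` and the `snippy/` output layout are assumptions. The `--ref`, `--ctgs` and `--outdir` options are snippy's own.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) one snippy command per pair of assemblies.
# The first file of each pair is used as --ref, the second as --ctgs.
snippy_pairs() (
    while [ "$#" -gt 1 ]; do
        ref="$1"
        shift
        for qry in "$@"; do
            name="$(basename "${ref%.fasta}")_vs_$(basename "${qry%.fasta}")"
            echo "snippy --outdir snippy/$name --ref $ref --ctgs $qry"
        done
    done
)
```

Passing the different assemblies of one sample gives the between-methods comparison. Passing one assembly per sample gives the between-samples comparison.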

Add QUAST

assembly metrics with QUAST
MultiQC too?

v0.3.0

reference 'guided' consensus assembly

From Sola: the process to generate a consensus from mapping plus de novo assembly.

#!/bin/bash
# CSV columns: sample_name,ref_strain,ref_file_name,R1_path,R2_path

while IFS="," read -r sample_name ref_strain ref_file_name R1_path R2_path
do

# map reads to the bowtie2 index named after the reference strain
bowtie2 -x "$ref_strain" -1 "$R1_path" -2 "$R2_path" -S "${sample_name}_${ref_strain}.sam" --local --threads 6

# convert sam to bam, sort, index, calculate genome coverage
samtools view -bSh -o "${sample_name}.bam" "${sample_name}_${ref_strain}.sam"
samtools sort "${sample_name}.bam" -o "${sample_name}_sorted.bam"
samtools index "${sample_name}_sorted.bam"
samtools coverage "${sample_name}_sorted.bam" > "${sample_name}_coverage.txt"

# convert bam to fastq: bamToFastq needs name-sorted input for paired output and writes uncompressed files
samtools sort -n "${sample_name}_sorted.bam" -o "${sample_name}_namesorted.bam"
bamToFastq -i "${sample_name}_namesorted.bam" -fq "${sample_name}_refg_R1.fastq" -fq2 "${sample_name}_refg_R2.fastq"

# Shovill final assembly
shovill --outdir "shovill/${sample_name}ref-guided" --gsize 1.04M --R1 "${sample_name}_refg_R1.fastq" --R2 "${sample_name}_refg_R2.fastq"

# tail -n +1 passes every line through; use +2 if the CSV has a header row
done < <(tail -n +1 purple_ref-genome_assembly_guide.csv)

to-do

at some point, we may want to

reconfigure scrubby to keep chlamydia reads only
remove bbnorm as shovill can downsample
collate all assembled genomes, genotype and coverage outputs into a folder each when working with multiple samples.

Scrubby updates

@gokeson Ammar mentioned you were integrating Scrubby with the pipeline! Really cool, it was mostly a small side project thing, but people seem to be using it here and there, so will do my best to upgrade it accordingly over the next two weeks or so.

Is there anything specific you were keen to see besides easy deployment via BioConda and/or binaries? We can keep a checklist here, including anything you'd like to add that is relevant for your lab.

Scrubby wishlist:

Distribution via BioConda or at least private channel
HPRG reference genome database for depletion
Reference database downloader with pre-built indices

Changes for v0.02

create plurality consensus from 24 reference genomes

To create the plurality consensus:

Reference genomes were oriented with dnaapler

for f in *.fasta; do echo "$f"; dnaapler all -i "$f" -o "plurality/${f%.fasta}.reorient" -t 6; done

Aligned with all-to-all mugsy default settings on the output of dnaapler

mugsy -p mugsyout *.fasta

Each region (separated by =) is extracted manually
For each region goalign consensus is run:

goalign consensus --ignore-gaps -i input.fasta -o output.cons.fasta

Concatenate all .cons.fasta sequences and rename:

cat *.cons.fasta | seqkit replace -p "$" -r "_{nr}" > plurality_all.fasta

The concatenated plurality consensus was scaffolded against Ct genotype D using ragtag:

ragtag.py scaffold -u scaffoldReference.fasta plurality_all.fasta -o plurality_1_scaffold

Output is renamed again:

seqkit replace -p "(.*)" -r "plurality_{nr}" plurality_1_scaffold/ragtag.scaffold.fasta > plurality_final.fasta

plurality_final.txt
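
The manual region extraction in the steps above could be automated along the following lines. This is a sketch only: it assumes the alignment has been exported as a single multi-FASTA file in which blocks are separated by lines beginning with `=`. The function name `split_regions` and the `region_<n>.fasta` naming are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write each "="-separated alignment block to its own FASTA file.
# $1 = alignment file, $2 = output directory (created if missing).
split_regions() {
    mkdir -p "$2"
    awk -v dir="$2" '
        /^=/ { if (out != "") close(out); n++; out = ""; next }  # separator: finish the current region
        /^[ \t]*$/ { next }                                      # ignore blank lines
        {
            if (out == "") out = dir "/region_" (n + 1) ".fasta"
            print > out
        }
    ' "$1"
}
```

Each resulting `region_<n>.fasta` can then be passed to the `goalign consensus` command shown above, for example in a `for f in regions/*.fasta` loop.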

add dnaapler to reorientate contigs

shovill -> scaffold -> gapfiller -> dnaapler

dnaapler all \
-i input.fasta \
-o output_directory_path \
-p my_genome_name

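error:

The scrub rule fails because its shell command references {output.r1tmp}, which is not defined in the rule's output block: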

RuleException in rule scrub in file /home/oolago2/assembly_pipeline/fullTest_Dec23/CtGAP_test/ctgap/workflow/rules/2-scrub.smk, line 1:
AttributeError: 'OutputFiles' object has no attribute 'r1tmp', when formatting the following:

    scrubby scrub-reads     -i {input.r1} {input.r2}        -o {output.r1tmp} {output.r2tmp}        --kraken-db {params.db}         --kraken-taxa "Archaea Eukaryota Holozoa Nucletmycea"   --min-len {params.minlen}       --minimap2-index {params.human}         --kraken-threads {threads}      --workdir {params.workdir:q} 2> {log}

    echo -e "
Scrubby Kraken Extract
" >> {log}

    scrubby scrub-kraken    -i {output.r1tmp} {output.r2tmp}        -o {output.r1} {output.r2}      --extract       --kraken-taxa {params.kraken_taxa_extract}      --kraken-reads {params.workdir}/0-standardDB.kraken     --kraken-report {params.workdir}/0-standardDB.report    --kraken-threads {threads} 2>> {log}

    touch {output.status}

Originally posted by @gokeson in #2 (comment)

CtGAP - Chlamydia trachomatis Genome Assembly Pipeline
