
Castanet

Scripts for the analysis of targeted metagenomics data, as described in https://doi.org/10.1101/716902

Dependencies:

  • Python 2.7+
  • numpy
  • pandas
  • matplotlib
  • samtools
  • optional, for mapping: bwa (or another mapper of your choice)
  • optional, for adapter removal: Trimmomatic
  • optional, for prettier plots: seaborn
  • optional, for removing human reads: Kraken (v1 or v2) with a standard database of human plus bacterial/viral RefSeq genomes

Workflow:

  1. Optionally, pre-process your raw reads to remove human reads (genomic and mitochondrial) and common contaminants. Assuming you have run Kraken on a sample called SeqName and saved the output to ${SeqName}.kraken.gz, the following creates a set of filtered fastq files, ${SeqName}_[12]_filt.fastq:
    filter_keep_reads.py -i ${SeqName}_[12].fastq.gz -k ${SeqName}.kraken.gz --xT Homo,Alteromonas,Achromobacter -x 1969841 --suffix filt
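For illustration, the core of this filtering step can be sketched in pure Python. Kraken's per-read output is tab-separated (classification flag C/U, read ID, assigned taxid, read length, LCA mapping), and a read is kept if it is unclassified or its taxid is not in the excluded set. This is a simplified sketch of the idea, not the actual filter_keep_reads.py implementation (name-based exclusion via --xT is omitted):

```python
def reads_to_keep(kraken_lines, excluded_taxids):
    """Return the set of read IDs to keep, given Kraken per-read output lines.

    Kraken columns: C/U flag, read ID, assigned taxid, read length, LCA mapping.
    Unclassified reads (flag 'U', taxid 0) are always kept.
    """
    keep = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        flag, read_id, taxid = fields[0], fields[1], fields[2]
        if flag == "U" or int(taxid) not in excluded_taxids:
            keep.add(read_id)
    return keep

# Example: drop reads assigned to taxid 9606 (Homo sapiens)
kraken_output = [
    "C\tread1\t9606\t150\t9606:120",
    "U\tread2\t0\t150\t",
    "C\tread3\t562\t150\t562:120",
]
print(sorted(reads_to_keep(kraken_output, {9606})))  # ['read2', 'read3']
```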
  1. Trim adapters and poor-quality reads:
    trimmomatic PE -threads 8 SeqName_1_filt.fastq SeqName_2_filt.fastq SeqName_1_clean.fastq tmp_SeqName_1_trimmings.fq SeqName_2_clean.fastq tmp_SeqName_2_trimmings.fq ILLUMINACLIP:${AdaptersPath}:2:10:7:1:true MINLEN:80 
  1. Map each sample's reads to the reference containing the consensus target sequences with your mapper of choice, and produce a sorted BAM file of mapped reads only for each sample. Do not deduplicate the BAM files.
    RefStem="rmlst_virus_extra_ercc.fasta"
    bwa index ${RefStem}
    bwa mem ${RefStem} ${SeqName}_[12]_clean.fastq | samtools view -F4 -b - | samtools sort -o ${SeqName}.bam -

If using your own reference file, you can generate a probelengths CSV for analysis using the supplied helper script:

    bam2probelengths.sh ${SeqName}.bam 1> ${RefStem}_probelengths.csv
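The target names and lengths needed for this CSV are all present in the BAM header's @SQ lines (SN and LN tags), so the equivalent logic is a few lines of Python over the output of `samtools view -H`. The column names below are illustrative; check the probelengths CSV supplied with Castanet for the exact header expected downstream:

```python
import csv
import io

def probelengths_from_sam_header(header_lines, out):
    """Write a target/length CSV from the @SQ lines of a SAM/BAM header."""
    writer = csv.writer(out)
    writer.writerow(["target_id", "target_len"])  # illustrative column names
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        # Each @SQ line is "@SQ\tSN:<name>\tLN:<length>..." -> parse TAG:VALUE pairs
        tags = dict(field.split(":", 1) for field in line.rstrip("\n").split("\t")[1:])
        writer.writerow([tags["SN"], int(tags["LN"])])

# Example with a minimal header (in practice, feed in `samtools view -H SeqName.bam`)
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:probeA\tLN:1200",
    "@SQ\tSN:probeB\tLN:850",
]
buf = io.StringIO()
probelengths_from_sam_header(header, buf)
print(buf.getvalue())
```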
  1. Generate a CSV file containing the counts for each uniquely mapped sequence in the entire pool, including improper pairs:
    for BamFilePath in *.bam; do
        count_duplicates_single_bam_improper.sh ${BamFilePath}
    done > PosCounts.csv
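Conceptually, this step counts how many reads share each (sample, target, start position) combination, which is what later lets duplicates be identified and reassigned. A minimal Python equivalent over SAM records (field layout per the SAM specification: QNAME, FLAG, RNAME, POS, ...) might look like the following; the grouping key and output layout are illustrative, not necessarily identical to what count_duplicates_single_bam_improper.sh emits:

```python
from collections import Counter

def position_counts(sample_id, sam_lines):
    """Count mapped reads per (target, start position) for one sample's SAM records."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.split("\t")
        rname, pos = fields[2], int(fields[3])  # SAM cols 3-4: reference name, 1-based position
        counts[(sample_id, rname, pos)] += 1
    return counts

sam = [
    "r1\t99\tprobeA\t100\t60\t100M\t=\t250\t250\tACGT\tIIII",
    "r2\t99\tprobeA\t100\t60\t100M\t=\t250\t250\tACGT\tIIII",  # same start: duplicate
    "r3\t99\tprobeB\t55\t60\t100M\t=\t200\t245\tACGT\tIIII",
]
for (sample, target, pos), n in sorted(position_counts("S1", sam).items()):
    print(f"{sample},{target},{pos},{n}")
```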
  1. Analyse all organisms detected in the pool, and combine with information on samples (including raw read number) and study participants (eg. clinical diagnosis):
process_pool_grp.py -h

usage: process_pool_grp.py [-h] -b BATCHNAME -i INFILE [-o OUTDIR]
                           [-p PROBELENGTHS] [-d] --samples SAMPLES
                           [--clin CLIN] [--depth_inf DEPTH_INF]

optional arguments:
  -h, --help            show this help message and exit
  -b BATCHNAME, --batchname BATCHNAME
                        Batch name for these samples. Must be alphanumeric.
  -i INFILE, --infile INFILE
                        Data frame (csv[.gz]) file to process. If gzipped,
                        filename must end in .gz.
  -o OUTDIR, --outdir OUTDIR
                        Output directory. Default: current working directory.
  -p PROBELENGTHS, --probelengths PROBELENGTHS
                        Path to file containing probe lengths. Default:
                        $CASTANET_PATH/scripts/probelengths_rmlst_virus_extra_ercc.csv
  -d, --keepdups        If true, do not reassign duplicates to the sample with
                        the majority in each duplicate cluster (Default:
                        False).
  --samples SAMPLES     Path to CSV file containing information about raw
                        reads (must have at least following fields: sampleid,
                        pt, rawreadnum). Field "pt" must match clinical data.
  --clin CLIN           Path to CSV file containing clinical data (must have
                        at least following fields: pt, clin_int; the field
                        "sampleid" if present will be ignored). Other fields
                        will be ignored.
  --depth_inf DEPTH_INF
                        (For regenerating full CSV with new clinical info):
                        Path to previously generated CSV file of read depth
                        per position for each probe, for all samples in this
                        batch. Must contain the following fields: sampleid,
                        target_id, depth_mean, depth_std, depth_25pc,
                        depth_median, depth_75pc, prop_target_covered,
                        prop_target_covered_mindepth2,
                        prop_target_covered_mindepth5,
                        prop_target_covered_mindepth10, udepth_mean,
                        udepth_std, udepth_25pc, udepth_median, udepth_75pc,
                        uprop_target_covered, uprop_target_covered_mindepth2,
                        uprop_target_covered_mindepth5,
                        uprop_target_covered_mindepth10

Example: process_pool_grp.py -i /path/to/my/dataframe.csv.gz --samples
/path/to/sampleinfo.csv

This produces three key outputs: a directory of coverage plots, Depth_BATCHNAME; and two files, BATCHNAME_depth.csv and BATCHNAME_reads_to_drop.csv.

BATCHNAME_depth.csv contains the number and proportion of reads for each sample and reference genome. Positives may need to be calibrated against a reference set, but in general, the proportion of all clean reads that match the given target (clean_prop_of_reads_on_target) is a good place to start.
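The headline proportion is simply the number of a sample's clean reads mapping to the target divided by that sample's clean read total, and calling positives then amounts to comparing it against a calibrated cutoff. A sketch with illustrative numbers and an illustrative threshold (calibrate against your own reference set, as noted above):

```python
def prop_on_target(n_reads_on_target, n_clean_reads):
    """Proportion of a sample's clean reads mapping to the given target."""
    return n_reads_on_target / n_clean_reads

# Illustrative: 1,500 of 2,000,000 clean reads map to the target
p = prop_on_target(1500, 2_000_000)
print(f"{p:.2e}")  # 7.50e-04

threshold = 1e-4  # example value only; derive from known positives/negatives
print(p >= threshold)  # True -> candidate positive
```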

BATCHNAME_reads_to_drop.csv can be used to clean BAM files by removing misassigned reads before downstream processing.

  1. If required, you can filter BAM files of interest to remove reads marked as contamination due to index misassignment:
    samtools view -h ${SAMPLENAME}.bam | filter_bam.py SAMPLENAME BATCHNAME_reads_to_drop.csv
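filter_bam.py consumes the SAM stream on stdin and drops the flagged reads. The core operation can be sketched as follows; this is a simplified stand-in, assuming the relevant rows of the reads_to_drop CSV can be reduced to a set of read names for the sample in question:

```python
def filter_sam(sam_lines, drop_names):
    """Yield SAM lines, keeping headers and any read whose QNAME is not flagged."""
    for line in sam_lines:
        # Header lines start with '@'; otherwise the first field is the read name (QNAME)
        if line.startswith("@") or line.split("\t", 1)[0] not in drop_names:
            yield line

sam = [
    "@SQ\tSN:probeA\tLN:1200",
    "good_read\t99\tprobeA\t100\t60\t100M\t=\t250\t250\tACGT\tIIII",
    "bad_read\t99\tprobeA\t100\t60\t100M\t=\t250\t250\tACGT\tIIII",
]
kept = list(filter_sam(sam, {"bad_read"}))
print(len(kept))  # 3 input lines -> 2 kept (header + good_read)
```

In the real pipeline the filtered stream would be piped back through `samtools view -b` to produce a cleaned BAM.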
