iamh2o / covid-19-signal

This project forked from jaleezyy/covid-19-signal

Files and methodology pertaining to the sequencing and analysis of SARS-CoV-2, causative agent of COVID-19.

SARS-CoV-2 Illumina GeNome Assembly Line (SIGNAL)

This is a complete, standardized workflow for the assembly and subsequent analysis of short-read viral sequencing data. The core workflow is compatible with the Illumina artic nf pipeline and produces the same consensus and variants using ivar (1.3) Grubaugh, 2019. However, it performs far more extensive quality control and visualisation of results, including an interactive HTML summary of run results.

Briefly, raw reads undergo QC using fastqc Andrews before removal of host-related reads by competitive mapping against a composite human and viral reference with BWA-MEM (0.7.5) Li, 2013, samtools (1.9) Li, 2009, and a custom script. This ensures that data as raw as possible can be deposited in central databases. After this, reads undergo adapter trimming and further QC with trim-galore (0.6.5) Martin. Residual TruSeq sequencing adapters are then removed through another custom script. Reads are then mapped to the viral reference with BWA-MEM, and amplicon primer sequences are trimmed using ivar (1.3) Grubaugh, 2019. fastqc is then used to perform a QC check on the reads that map to the viral reference. After this, ivar is used to generate a consensus genome, and variants are called using both ivar variants and breseq (0.35) Deatherage, 2014. Coverage statistics are calculated using bedtools before a final QC via quast and a kraken2 taxonomic classification of mapped reads. Finally, data from all samples are collated via a post-processing script into an interactive summary for exploration of results and quality control. Optionally, users can run ncov-tools to generate additional quality control and summary plots and statistics.
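The idea behind host removal by competitive mapping can be pictured with a minimal sketch. It assumes each read has already been reduced to a (read name, reference name) pair for its best alignment, and that the viral contig is named MN908947.3 (the accession used in the setup steps). The function name, the data layout, and the choice to retain unmapped reads are all hypothetical; this is not SIGNAL's custom script.

```python
from typing import Iterable, List, Tuple

# Assumption: the viral contig in the composite reference carries the
# accession that is passed to get_data_dependencies.sh.
VIRAL_CONTIG = "MN908947.3"


def remove_host_reads(
    alignments: Iterable[Tuple[str, str]],
    viral_contig: str = VIRAL_CONTIG,
    unmapped: str = "*",
) -> List[str]:
    """Return names of reads whose best hit is NOT a host contig.

    `alignments` holds one (read_name, reference_name) pair per read,
    taken from mapping against the composite human and viral reference.
    Reads that map best to the viral contig are kept; unmapped reads are
    also kept here (an assumption); everything else is treated as host.
    """
    kept = []
    for read_name, reference_name in alignments:
        if reference_name in (viral_contig, unmapped):
            kept.append(read_name)
    return kept


example = [
    ("read1", "MN908947.3"),  # best hit is viral: kept
    ("read2", "chr1"),        # best hit is human: removed
    ("read3", "*"),           # unmapped: kept under the assumption above
]
print(remove_host_reads(example))  # ['read1', 'read3']
```

Because human and viral sequences compete for each read in a single index, a read is only discarded when a host contig is its best match, which is what keeps the retained data close to raw.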

If you use this software please cite:

Nasir, Jalees A., Robert A. Kozak, Patryk Aftanas, Amogelang R. Raphenya, Kendrick M. Smith, Finlay Maguire, Hassaan Maan et al. "A Comparison of Whole Genome Sequencing of SARS-CoV-2 Using Amplicon-Based Sequencing, Random Hexamers, and Bait Capture." Viruses 12, no. 8 (2020): 895.
https://doi.org/10.3390/v12080895

Setup/Execution

  1. Clone the git repository (--recursive is only needed to run ncov-tools postprocessing)

     git clone --recursive https://github.com/jaleezyy/covid-19-signal
    
  2. Install conda and snakemake (version >5) e.g.

     wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
     bash Miniconda3-latest-Linux-x86_64.sh # follow instructions
     source $(conda info --base)/etc/profile.d/conda.sh
     conda create -n signal -c conda-forge -c bioconda -c defaults snakemake pandas conda
     conda activate signal 
    

conda sometimes fails to install newer versions of snakemake. As an alternative, install mamba and use that instead (snakemake has beta support for it within the workflow):

    conda install -c conda-forge mamba
    mamba create -c conda-forge -c bioconda -n signal snakemake conda
    conda activate signal

Additional software dependencies are managed directly by snakemake using conda environment files:

  • trim-galore 0.6.5 (docs)
  • kraken2 2.1.1 (docs)
  • quast 5.0.2 (docs)
  • bwa 0.7.17 (docs)
  • samtools 1.7/1.9 (docs)
  • bedtools 2.26.0 (docs)
  • breseq 0.35.0 (docs)
  • ivar 1.3 (docs)
  • ncov-tools postprocessing scripts require additional dependencies (see file).

  3. Download necessary database files

The pipeline requires:

  • Amplicon primer scheme sequences
  • SARS-CoV-2 reference fasta
  • SARS-CoV-2 reference gbk
  • SARS-CoV-2 reference gff3
  • kraken2 viral database
  • Human GRCh38 reference fasta (for composite human-viral BWA index)

     bash scripts/get_data_dependencies.sh -d data -a MN908947.3
    
  4. Configure your config.yaml file

Either use the convenience Python script or modify example_config.yaml to suit your system.

  5. Specify your samples in CSV format (e.g. sample_table.csv)

See the example table example_sample_table.csv for an idea of how to organise this table. You can attempt to use generate_sample_table.sh to circumvent manual creation of the table.
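If generate_sample_table.sh does not fit your file layout, the table can also be assembled with a few lines of Python. The sketch below is illustrative only: it assumes paired reads named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`, and it assumes the column names `sample`, `r1_path` and `r2_path`. Check both against example_sample_table.csv before relying on it.

```python
import csv
from pathlib import Path

R1_SUFFIX = "_R1.fastq.gz"  # assumed naming convention for forward reads
R2_SUFFIX = "_R2.fastq.gz"  # assumed naming convention for reverse reads


def build_sample_rows(read_dir):
    """Pair each *_R1.fastq.gz file with its *_R2.fastq.gz mate."""
    rows = []
    for r1 in sorted(Path(read_dir).glob("*" + R1_SUFFIX)):
        sample = r1.name[: -len(R1_SUFFIX)]
        r2 = r1.with_name(sample + R2_SUFFIX)
        if r2.exists():  # skip samples that lack a mate file
            rows.append({"sample": sample, "r1_path": str(r1), "r2_path": str(r2)})
    return rows


def write_sample_table(rows, out_csv="sample_table.csv"):
    """Write the rows as CSV; the header names are an assumption."""
    with open(out_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["sample", "r1_path", "r2_path"])
        writer.writeheader()
        writer.writerows(rows)
```

A typical call would be `write_sample_table(build_sample_rows("data/reads"))`, where `data/reads` is a placeholder for wherever your FASTQ files live.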

  6. Execute pipeline (optionally explicitly specify --cores)

    snakemake -kp --configfile config.yaml --cores=NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda all

    If --conda-prefix is not set like this, all envs will be reinstalled each time you change the results_dir in config.yaml.

  7. Postprocessing analyses:

    snakemake -p --configfile config.yaml --cores=NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda postprocess

After postprocessing finishes, you'll see the following summary files:

  - summary.html                top-level summary, with links to per-sample summaries
  - {sample_name}/sample.html   per-sample summaries, with links for more detailed info
  - {sample_name}/sample.txt    per-sample summaries, in text-file format instead of HTML
  - summary.zip                 zip archive containing all of the above summary files.
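A quick way to confirm that postprocessing produced everything listed above is to enumerate the expected paths and report any that are absent. This is a minimal sketch; it assumes the summaries are written beneath your configured results directory, and the function names are hypothetical.

```python
from pathlib import Path


def expected_summaries(result_dir, sample_names):
    """List the summary files named above for one results directory."""
    root = Path(result_dir)
    paths = [root / "summary.html", root / "summary.zip"]
    for name in sample_names:
        paths.append(root / name / "sample.html")
        paths.append(root / name / "sample.txt")
    return paths


def missing_summaries(result_dir, sample_names):
    """Return the expected summary files that do not exist."""
    return [p for p in expected_summaries(result_dir, sample_names) if not p.exists()]
```

For example, `missing_summaries("results", ["sampleA", "sampleB"])` returns an empty list once all six files are present (the directory and sample names here are placeholders).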

Note that the pipeline postprocessing ('snakemake postprocess') is separated from the rest of the pipeline ('snakemake all'). This is because in a multi-sample run, it's likely that at least one pipeline stage will fail. The postprocessing script should handle failed pipeline stages gracefully, by substituting placeholder values when expected pipeline output files are absent. However, this confuses snakemake's dependency tracking, so there seems to be no good alternative to separating pipeline processing and postprocessing into 'all' and 'postprocess' targets.
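The placeholder behaviour described here can be illustrated with a small sketch. It is not the SIGNAL postprocessing script: the placeholder value, the function names, and the one-value-per-file layout are assumptions made purely for illustration.

```python
from pathlib import Path

PLACEHOLDER = "NA"  # hypothetical placeholder value


def read_metric(path, placeholder=PLACEHOLDER):
    """Return the stripped contents of `path`, or the placeholder when the
    file is absent (for example because that pipeline stage failed)."""
    p = Path(path)
    if not p.is_file():
        return placeholder
    return p.read_text().strip()


def collect_metrics(sample_dir, filenames):
    """Build one summary row, tolerating missing per-stage outputs."""
    return {name: read_metric(Path(sample_dir) / name) for name in filenames}
```

Because a missing file yields a value instead of an error, the summary can still be built for a sample whose stages partly failed. Those same files are what a workflow manager would normally demand as inputs, which is why the two targets are kept separate.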

Related: because pipeline stages can fail, we recommend running 'snakemake all' with the -k flag ("Go on with independent jobs if a job fails").
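When driving runs from a script, the two invocations can be assembled as argument lists so that -k is applied to 'all' but not to 'postprocess'. The sketch below uses only the flags shown in the commands above; the helper name and the example core count are assumptions.

```python
import os
import shlex


def snakemake_command(target, configfile="config.yaml", cores=None, keep_going=False):
    """Assemble the snakemake invocation for `target` as an argument list."""
    flags = "-kp" if keep_going else "-p"  # -k: continue past failed jobs
    cmd = ["snakemake", flags, "--configfile", configfile]
    if cores is not None:
        cmd.append(f"--cores={cores}")
    # Fixed conda prefix so envs are reused across results directories.
    cmd += ["--use-conda", f"--conda-prefix={os.getcwd()}/.snakemake/conda", target]
    return cmd


# Main pipeline with -k, then postprocessing without it.
for target, keep_going in (("all", True), ("postprocess", False)):
    print(shlex.join(snakemake_command(target, cores=4, keep_going=keep_going)))
```

Each list can be passed directly to `subprocess.run(cmd, check=False)`; `check=False` fits the 'all' target, where a non-zero exit from a partly failed run should not stop postprocessing.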

Additionally, SIGNAL can prepare output for use with @jts' ncov-tools to generate phylogenies and alternative summaries.

    snakemake --use-conda --cores 10 ncov_tools

SIGNAL manages installing the dependencies and will generate the necessary hard links from SIGNAL's outputs to the input files ncov-tools requires, provided that ncov-tools has been cloned as a sub-module and a fasta containing sequences to include in the tree has been specified using phylo_include_seqs: in the main SIGNAL config.yaml.

Outputs will be written as specified within the ncov-tools folder and documentation. At present, invoking ncov-tools should be done manually as per its documentation.

Docker

Alternatively, the pipeline can be deployed using Docker (see resources/Dockerfile_pipeline for specification). To pull from dockerhub:

    docker pull finlaymaguire/signal

Download data dependencies into a data directory that already contains your reads (data in this example, but use whatever name you wish):

    mkdir -p data && docker run -v $PWD/data:/data finlaymaguire/signal:latest bash scripts/get_data_dependencies.sh -d /data

Generate your config.yaml and sample_table.csv (with paths to the readsets underneath /data) and place them into the data directory:

    cp config.yaml sample_table.csv $PWD/data

WARNING: result_dir in config.yaml must be within /data (e.g. /data/results) for the results to be automatically copied to your host system. Otherwise, they will be deleted when the container finishes running (unless docker is run interactively).
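A small pre-flight check can catch a misplaced result_dir before the container is started. This is a sketch under the assumption that the value has already been read from config.yaml as a string; the function name is hypothetical, and paths containing ".." are not resolved.

```python
from pathlib import PurePosixPath


def is_under_data(result_dir, mount_point="/data"):
    """True if `result_dir` lies inside the mounted /data volume.

    Paths are compared component by component, so "/database" does not
    count as being inside "/data". ".." segments are not resolved.
    """
    path = PurePosixPath(result_dir)
    mount = PurePosixPath(mount_point)
    return path.is_absolute() and (path == mount or mount in path.parents)


print(is_under_data("/data/results"))      # True
print(is_under_data("/tmp/results"))       # False
print(is_under_data("/database/results"))  # False
```

PurePosixPath is used because the path refers to the container's filesystem, which is POSIX regardless of the host operating system.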

Then execute the pipeline:

    docker run -v $PWD/data:/data finlaymaguire/signal conda run -n snakemake snakemake --configfile /data/config.yaml --use-conda --conda-prefix /covid-19-signal/.snakemake/conda --cores 8 all

Summaries:

  • postprocessing and ncov_tools as described above generate many summaries including interactive HTML reports.

  • Generate summaries of BreSeq among many samples, see

Pipeline details:

For a step-by-step walkthrough of the pipeline, see pipeline/README.md.

A diagram of the workflow is shown below.

[Diagram: Workflow Version 8]

Possible Artefacts

  • @jts: Host-derived poly-A reads that sneak through the composite host removal stage can align to the viral poly-A tail, giving it enough coverage to be called in the consensus. Having this poly-A tail in the consensus can mess up the later analyses that require MSA. If a sample is causing issues, check for host-derived poly-A reads.
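One rough way to look for such reads is to flag sequences that are almost entirely A (or T, for fragments sequenced from the opposite strand). The sketch below is a simple heuristic, not part of SIGNAL; both thresholds are illustrative assumptions, and the input is taken to be plain read sequences already extracted from a FASTQ or BAM file.

```python
def poly_a_fraction(read):
    """Fraction of bases that are A, or that are T, whichever is larger.

    T is counted as well because a poly-A fragment sequenced from the
    opposite strand appears as poly-T.
    """
    read = read.upper()
    if not read:
        return 0.0
    return max(read.count("A"), read.count("T")) / len(read)


def is_poly_a_read(read, min_fraction=0.9, min_length=20):
    """Flag reads that are long enough and almost entirely A (or T).

    Both thresholds are illustrative assumptions, not SIGNAL settings.
    """
    return len(read) >= min_length and poly_a_fraction(read) >= min_fraction


reads = ["A" * 48 + "GC", "ACGTTGCAAGGCTTACGATCGATCGGATC"]
print([is_poly_a_read(r) for r in reads])  # [True, False]
```

A high count of flagged reads among those mapped near the 3' end of the viral reference would be a hint that the consensus poly-A tail is supported by host-derived reads.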

Contributors

fmaguire, hkeward, jaleezyy, kmsmith137, nknox, nodrogluap, pvanheus, raphenya
