This is a Snakemake workflow for the somatic variant valling from targeted tumor-only NGS data. Workflow covers the preprocessing of NGS reads (trimming, alignment, optional deduplication and base quality score recalibration) and calls somatic and germline variants with Mutect2 algorithm. Called variants are annotated with the SnpEff.
MSc Jouni Kujala, University of Eastern Finland, Kuopio, Finland
Following dependencies must be installed before running this workflow:
We recommend you to first install Conda and then install Mamba and Snakemake with following commands. These steps install Mamba to your local system and generate a virtual environment for Snakemake workflow. Actual workflow runs are performed within this environment. Rest of the dependencies are handled automatically by Snakemake.
# Install Mamba package manager.
conda install -c conda-forge mamba
# Use Mamba to create virtual environment for Snakemake workflow.
mamba create -c conda-forge -c bioconda -n snakemake snakemake
Next step is to obtain your working copy of this workflow. You can either download this repository as .zip file or clone repository with git
. Following command can be used to clone this repository:
# Go to directory to which the repository should be cloned.
cd path/to/target/directory
# Clone repository.
git clone https://github.com/jounikuj/tumor_only_variant_calling_workflow.git
Preparing your workflow run is straightforwards as it only contains three steps.
- Import your FASTQ files to
input/
directory. Naming convention{sample_id}_S1_R1_001.fastq.gz
should be used used for FASTQ files where the{sample_id}
is the sample identifier used by workflow. - Import your gene panel design as BED file to
config/
directory. - Check that the configuration file
config/config.yaml
is correctly set up. This file defines the used adapter sequences and reference files (reference genome etc.) and it is essential to configurate workflow before your analysis.
Go to workflow directory and activate Snakemake's virtual environment. It is recommended to test Snakemake in its built-in dryrun mode as it will usually identify possible errors in the workflow configuration.
# Activate virtual environment.
conda activate snakemake
# Run Snakemake in dryrun mode.
snakemake --use-conda --cores 1 --dryrun
If everything seems to be fine, proceed to actual workflow by dropping the --dryrun
mode from the command. Parameter --cores
specifies the number of available cores for the workflows. Provided value must be at least one, but you can significantly speed up the workflow by providing more cores for the analysis.
# Run workflow.
snakemake --use-conda --cores [int]
Workflow should now start. It will take some time to generate all required virtual environments especially if this is your first run.
- Adapter and primer sequences are first trimmed from raw FASTQ reads with cutadapt. By default, reads are quality-trimmed so that the bases with base quality scores less than Q10 are trimmed from 3' end of the read and reads shorter than 10 bp are discarded from further analysis.
- Trimmed reads are aligned against human reference genome hg38 with BWA-MEM algorithm.
- Aligned reads are deduplicated with Picard MarkDuplicates tool to identify possible PCR artifacts and optical sequencing errors. Although deduplication is performed by default, it can be disabled from
config/config.yaml
if required by specific sequencing approaches (such as amplicon based target enrichment). - Base quality score recalibration (BQSR) is applied to detect and compensate systematic errors in the sequencing data that originate from sequencing machine.
- Somatic and germline variants are called with GATK4 Mutect2. Low-quality somatic variants are subsequently discarded by GATK4 FilterMutectCalls to obtain high-quality variant calls.
- Variant calls are annotated with gene name, amino acid changes, functional consequence etc. beneficial information by using SnpEff. Variant calls are further annotated by comparing variants to existing ClinVar and OncoKB database records.
- Key results of variant calling workflow are parsed to Excel file.
Found a bug from source code or have idea for improvements and new features? Open a new Issue or make a new Pull request through GitHub to help improve the development of this workflow!
This workflow has not been published yet. If you decide to use this workflow in your research, simply refer to this GitHub repository.