Integrated Analysis of Germline SNVs, INDELs, and CNVs in Breast Cancer Whole Exome Sequencing Data

Description
- Motivation
- Results
Installation
Dataset Information
Directory Structure

Description

Motivation

The rapidly evolving field of genomics emphasizes the need for a holistic understanding of the genetic structures associated with diseases like breast cancer. Current methods often analyze genomic variants such as Single Nucleotide Variants (SNVs), insertions and deletions (INDELs), and Copy Number Variants (CNVs) separately, leaving a gap for integrated analysis.

Results

We introduce an adaptable method for analyzing SNVs, INDELs, and CNVs from Whole Exome Sequencing (WES) data, with an emphasis on germline variants. The approach rigorously validates and filters variants for accuracy. Reanalysis of two public WES datasets demonstrates the method's versatility, uncovering both novel and known variants crucial for breast cancer predisposition.

Installation

Prerequisites

Ensure roughly 200 GB of free disk space is available.
Clone or download this repository.
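
The disk-space prerequisite can be checked before downloading any data. The snippet below is a minimal sketch, not part of the repository: the function name has_free_space is hypothetical, and the check assumes the data will live on the filesystem that holds the current directory.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 GB free.
# has_free_space is a hypothetical helper, not a script shipped with the pipeline.
has_free_space() {
    local dir="$1" required_gb="$2" available_kb
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # on the second line, the fourth field is the available space.
    available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$available_kb" -ge "$(( required_gb * 1024 * 1024 ))" ]
}

# Check the current directory against the ~200 GB stated in the prerequisites.
if has_free_space . 200; then
    echo "OK: at least 200 GB free"
else
    echo "WARNING: less than 200 GB free in $(pwd)"
fi
```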

Dataset Retrieval

Download the first dataset from the ENA Browser.
For the second dataset, request access.

Tools & Setup

ANNOVAR: Download and configure ANNOVAR, following its official documentation.
GATK Bundle: Download the GATK resource bundle for the hg19 genome.
AnnotSV: Install AnnotSV, following its official documentation.

Pipeline Directory Setup

Use VS Code to navigate to the pipeline directory.
Follow the provided structure for placing data, configuration files, and tools.

Pipeline Configuration

Choose the appropriate Snakefile from the Snakefiles folder.
Open the chosen Snakefile and set the number of threads for Snakemake.
Create config files for your datasets based on the provided examples.

Installation Options

Using Docker:
- Download and install Docker.
- Configure Docker via the app settings.
- Build and run the pipeline using the provided commands.
- Monitor progress and retrieve results within the Docker app.
- Optionally, execute specific parts of the pipeline as directed.

Using Snakemake (No Docker):

Download and install Anaconda.

Set up the environment and install the necessary tools with the following commands:

 # Create and activate the environment. The name "approach" matches the
 # activate command below; any name works if it is used consistently.
 conda create -c conda-forge -c bioconda -n approach snakemake -y
 conda activate approach
 # Command-line tools
 conda install -y -c conda-forge perl
 conda install -y -c conda-forge r-base
 conda install -y -c bioconda samtools
 conda install -y -c bioconda trimmomatic
 conda install -y -c conda-forge -c bioconda gatk4
 conda install -y -c bioconda annotsv
 conda install -y -c bioconda bcftools
 # R packages used by the CNV callers
 R -e "install.packages('BiocManager', repos='http://cran.us.r-project.org')"
 R -e 'BiocManager::install("ExomeDepth")'
 R -e 'BiocManager::install("DNAcopy")'
 R -e 'BiocManager::install("cn.mops")'
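
After installation, it can help to confirm that the tools are actually on the PATH of the active environment. The following is a sketch under assumptions: missing_tools is a hypothetical helper, and the executable names (in particular Rscript, gatk, trimmomatic, and AnnotSV) are the names the conda packages are assumed to install.

```shell
# Print each command from the argument list that is not found on PATH.
# Returns 0 when every command is available, 1 otherwise.
# missing_tools is a hypothetical helper, not part of the repository.
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Executable names are assumptions about what the conda packages provide.
missing_tools snakemake perl Rscript samtools trimmomatic gatk AnnotSV bcftools ||
    echo "Install the tools listed above before running the pipeline." >&2
```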

Navigate to the appropriate folder inside "replication_cnv_snakemake/Snakefiles" and execute the pipeline with Snakemake.
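
The execution step can be sketched as a dry run followed by the real run. This is an illustration only: run_pipeline, the SNAKEMAKE override variable, and the default of 4 cores are hypothetical, and the actual Snakefile name depends on the folder chosen. Only the standard Snakemake options --snakefile, --cores, and -n (dry run) are used.

```shell
# Run a Snakefile: first a dry run (-n) to list the planned jobs, then the
# real run if the dry run succeeds.
# $1 = path to the Snakefile, $2 = maximum number of cores (defaults to 4,
# an arbitrary choice). The SNAKEMAKE variable is a hypothetical override
# for the executable name, useful for testing.
run_pipeline() {
    local snakefile="$1" cores="${2:-4}"
    local snakemake_bin="${SNAKEMAKE:-snakemake}"
    "$snakemake_bin" --snakefile "$snakefile" --cores "$cores" -n &&
        "$snakemake_bin" --snakefile "$snakefile" --cores "$cores"
}

# Example (run from the chosen folder inside Snakefiles/):
#   run_pipeline Snakefile 8
```

The --cores value is an upper bound on the cores Snakemake may use overall; the thread count set inside the Snakefile still applies per rule.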

Dataset Information

Datasets are publicly sourced. The second dataset includes 7 Illumina samples from this paper. Requests for data can be made to the corresponding author.

Key Features

Accessibility: The dataset promotes open and collaborative research.

Accessing the Dataset

Visit the project pages for PRJEB3235 and PRJEB31704.
Download the reads using the methods described by NCBI SRA.
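
One possible way to script the download is through the ENA Portal API file report, which lists the FASTQ locations for a study accession. The sketch below is an assumption about workflow, not the authors' procedure: both function names are hypothetical, and the parser assumes the report returns the columns in the requested order (run_accession, then fastq_ftp) with FTP paths that lack the ftp:// prefix.

```shell
# Build the ENA Portal API "filereport" URL for a study accession ($1).
ena_filereport_url() {
    printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=run_accession,fastq_ftp&format=tsv\n' "$1"
}

# Read a filereport TSV on stdin and print one FASTQ URL per line.
# Skips the header row and runs without FASTQ files; paired-end runs list
# their files separated by ";" in the second column.
fastq_urls_from_report() {
    awk -F '\t' 'NR > 1 && $2 != "" {
        n = split($2, paths, ";")
        for (i = 1; i <= n; i++) print "ftp://" paths[i]
    }'
}
```

With network access, the two functions could be combined as `curl -s "$(ena_filereport_url PRJEB3235)" | fastq_urls_from_report > fastq_urls.txt`, and the resulting list passed to a downloader such as `wget -i fastq_urls.txt -P reads/`.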

Note: Adherence to usage guidelines ensures ethical data usage.

Directory Structure

1. Using Docker

Ensure your directory structure matches this repository to avoid errors.

100bp_exon.bed: BED file describing the genomic regions of interest
alignedFiles/: Folder containing intermediate files for SNV and INDEL calling
AnnotSV/
annovar/
clean_and_merge.py: Python script for merging results from the different CNV callers
config_paired.csv: Configuration file for paired-end data
config_single.csv: Configuration file for single-end data
dockerfile: Dockerfile containing the instructions to build the Docker image
exomeDepth_paired.r: R script that runs the ExomeDepth method for paired-end reads
exomeDepth_single.r: R script that runs the ExomeDepth method for single-end reads
final/: Folder containing the final aligned, deduplicated, and recalibrated BAM files
gatkbundle/: Folder containing the resources (variant files, genome files, ...) required by GATK
GENDB/
index/: Folder containing index files for the reference genome
logs/: Folder containing execution logs
mapped/: Folder containing intermediate aligned and sorted BAM files
reads/: Folder containing the raw sequencing reads to be analyzed, in the single/ and paired/ subdirectories
results/: Folder containing results from CNV calling
scriptmops.r: cn.mops script for CNV detection and analysis
Snakefiles/: Folder containing the Snakemake files used to run the method
trimmed/: Folder containing trimmed files
tmpgenomicsdb
tmpPicard
tables

2. Using Snakemake (No Docker)

Ensure your directory structure matches this repository to avoid errors.

100bp_exon.bed: BED file describing the genomic regions of interest
alignedFiles/: Folder containing intermediate files for SNV and INDEL calling
clean_and_merge.py: Python script for merging results from the different CNV callers
config_paired.csv: Configuration file for paired-end data
config_single.csv: Configuration file for single-end data
exomeDepth_paired.r: R script that runs the ExomeDepth method for paired-end reads
exomeDepth_single.r: R script that runs the ExomeDepth method for single-end reads
final/: Folder containing the final aligned, deduplicated, and recalibrated BAM files
gatkbundle/: Folder containing the resources (variant files, genome files, ...) required by GATK
GENDB/
index/: Folder containing index files for the reference genome
logs/: Folder containing execution logs
mapped/: Folder containing intermediate aligned and sorted BAM files
reads/: Folder containing the raw sequencing reads to be analyzed, in the single/ and paired/ subdirectories
results/: Folder containing results from CNV calling
scriptmops.r: cn.mops script for CNV detection and analysis
Snakefiles/: Folder containing the Snakemake files used to run the method
trimmed/: Folder containing trimmed files
tmpgenomicsdb
tmpPicard
tables
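
Since a mismatched layout is flagged as a source of errors, the listing above can be checked with a short script before a run. This is a sketch under assumptions: missing_entries is a hypothetical helper, and it treats every listed entry as required up front, although the source does not say whether the pipeline creates some of them (for example logs or the temporary folders) by itself. The Docker layout additionally lists AnnotSV/, annovar/, and dockerfile.

```shell
# Print every expected entry that is absent from the pipeline directory $1.
# Returns 0 when nothing is missing, 1 otherwise.
# missing_entries is a hypothetical helper, not part of the repository.
missing_entries() {
    local root="$1" entry status=0
    for entry in 100bp_exon.bed alignedFiles clean_and_merge.py \
                 config_paired.csv config_single.csv \
                 exomeDepth_paired.r exomeDepth_single.r final gatkbundle \
                 GENDB index logs mapped reads results scriptmops.r \
                 Snakefiles trimmed tmpgenomicsdb tmpPicard tables; do
        if [ ! -e "$root/$entry" ]; then
            echo "$entry"
            status=1
        fi
    done
    return "$status"
}

# Check the directory given as the first argument, or the current directory.
missing_entries "${1:-.}" ||
    echo "Some expected entries are missing (listed above)." >&2
```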

Repository: anbianchi/integratedsnvindelsandcnv (GitHub)
