The rapidly evolving field of genomics emphasizes the need for a holistic understanding of the genetic structures associated with diseases like breast cancer. Current methods often analyze genomic variants such as Single Nucleotide Variants (SNVs), insertions and deletions (INDELs), and Copy Number Variants (CNVs) separately, leaving a gap for integrated analysis.
We introduce an adaptable method for analyzing SNVs, INDELs, and CNVs from Whole Exome Sequencing (WES) data, with an emphasis on germline variants. The approach validates and filters variants to improve accuracy. Reanalysis of two public WES datasets demonstrates its versatility, uncovering both novel and known variants relevant to breast cancer predisposition.
- Ensure ~200GB of disk space.
- Clone or download this repository.
- Download the first dataset from ENA browser.
- For the second dataset, request access.
- Annovar: Download and configure Annovar, following its official documentation.
- GATK Bundle: Acquire the GATK bundle (for hg19 genome).
- AnnotSV: Install AnnotSV, following its official documentation.
- Use VS Code to navigate to the pipeline directory.
- Follow the provided structure for placing data, configuration files, and tools.
- Choose the appropriate snakefile from the Snakefiles folder.
- Open Snakefile and set the number of threads for Snakemake.
- Create config files for datasets based on provided examples.
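Before downloading data, it can help to confirm that the target disk has room for the run. The sketch below is not part of the pipeline; it only uses Python's standard `shutil.disk_usage` and takes the ~200GB figure from the list above. The function names are illustrative.

```python
import shutil

REQUIRED_GB = 200  # approximate figure from the setup notes


def free_gigabytes(path="."):
    """Return the free space, in GB (10**9 bytes), on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """True when the filesystem holding `path` has at least `required_gb` GB free."""
    return free_gigabytes(path) >= required_gb


if __name__ == "__main__":
    free = free_gigabytes(".")
    status = "OK" if has_enough_space(".") else "too little"
    print(f"{free:.0f} GB free, about {REQUIRED_GB} GB needed: {status}")
```

Run it from the directory where the repository and datasets will live, since the check applies to the filesystem holding that path.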
Using Docker:
- Download and install Docker.
- Configure Docker via the app settings.
- Build and run the pipeline using the provided commands.
- Monitor progress and retrieve results within the Docker app.
- Optionally, execute specific parts of the pipeline as directed.
Using Snakemake (No Docker):
- Download and install Anaconda.
- Set up the environment and install necessary tools using the provided commands.
    # "approach" is the environment name; change it in both commands if you prefer another
    conda create -c conda-forge -c bioconda -n approach snakemake -y
    conda activate approach
    conda install -y -c conda-forge perl
    conda install -y -c conda-forge r-base
    conda install -y -c bioconda samtools
    conda install -y -c bioconda trimmomatic
    conda install -y -c conda-forge -c bioconda gatk4
    conda install -y -c bioconda annotsv
    conda install -y -c bioconda bcftools
    R -e "install.packages('BiocManager', repos='http://cran.us.r-project.org')"
    R -e 'BiocManager::install("ExomeDepth")'
    R -e 'BiocManager::install("DNAcopy")'
    R -e 'BiocManager::install("cn.mops")'
- Navigate to the appropriate folder inside "replication_cnv_snakemake/Snakefiles" and execute the pipeline with Snakemake.
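As a rough pre-flight check for the non-Docker route, the sketch below looks for the installed tools on `PATH` and assembles a Snakemake command line. It is an illustration, not the repository's own launcher. The executable names in `EXPECTED_TOOLS` are assumptions about what the conda packages install, and the default `Snakefile` name and core count are placeholders. `--snakefile` and `--cores` are standard Snakemake options.

```python
import shutil

# Assumed executable names for the conda packages listed above; adjust if yours differ.
EXPECTED_TOOLS = ["snakemake", "samtools", "bcftools", "gatk", "trimmomatic", "perl", "R"]


def missing_tools(tools=EXPECTED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def snakemake_command(snakefile="Snakefile", cores=4):
    """Build the argument list for a Snakemake run (placeholder defaults)."""
    return ["snakemake", "--snakefile", snakefile, "--cores", str(cores)]


if __name__ == "__main__":
    absent = missing_tools()
    if absent:
        print("Missing from PATH:", ", ".join(absent))
    # Print the command rather than running it, so nothing starts by accident.
    print(" ".join(snakemake_command(cores=4)))
```

The printed command can be pasted into a shell inside the chosen Snakefiles subfolder once the environment is activated.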
Datasets are publicly sourced. The second dataset includes 7 Illumina samples from this paper. Requests for data can be made to the corresponding author.
- Accessibility: The dataset promotes open and collaborative research.
- Visit PRJEB3235 and PRJEB31704.
- Download using methods from NCBI SRA.
Note: Adherence to usage guidelines ensures ethical data usage.
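One way to list the FASTQ files for the two accessions is the ENA Portal API file report. The sketch below builds the report URL and parses a downloaded TSV; it is an assumption about a convenient route, not the download procedure this repository prescribes. Helper names are illustrative, and availability depends on each project's access status.

```python
import csv
import io
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"
PROJECTS = ["PRJEB3235", "PRJEB31704"]  # accessions named above


def filereport_url(accession):
    """Build the file report URL for one project accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def fastq_urls(report_tsv):
    """Extract FASTQ locations from report text; paired runs list files separated by ';'."""
    urls = []
    for row in csv.DictReader(io.StringIO(report_tsv), delimiter="\t"):
        urls.extend(u for u in (row.get("fastq_ftp") or "").split(";") if u)
    return urls


if __name__ == "__main__":
    for project in PROJECTS:
        print(filereport_url(project))
```

Fetch each printed URL with a browser or download tool, then pass the saved text to `fastq_urls` to get the list of files to place under `reads/`.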
Ensure your directory structure matches this repository to avoid errors.
- 100bp_exon.bed: BED file containing information about genomic regions of interest
- alignedFiles/: Folder containing intermediate files for determining SNVs and INDELs
- AnnotSV/:
- annovar/
- clean_and_merge.py: Python file for merging results from different CNV callers
- config_paired.csv: Configuration file for paired_end data
- config_single.csv: Configuration file for single_end data
- dockerfile: Dockerfile containing the instructions to build the Docker image
- exomeDepth_paired.r: R script that manages ExomeDepth method for paired_end reads
- exomeDepth_single.r: R script that manages ExomeDepth method for single_end reads
- final/: Folder containing the final aligned, deduplicated, and recalibrated BAM files
- gatkbundle/: Folder containing all the resources (variant files, genome files, ...) required by GATK
- GENDB/:
- index/: Folder containing index files for the reference genome
- logs/: Folder containing logs from the execution
- mapped/: Folder containing partially aligned and sorted BAM files
- reads/: Folder containing the raw sequencing reads to be analyzed, in the /single/ and /paired/ directories
- results/: Folder containing results from CNV calling
- scriptmops.r: cn.mops script for CNV detection and analysis
- Snakefiles/: Folder containing the Snakemake files used to run the method
- trimmed/: Folder containing trimmed files
- tmpgenomicsdb
- tmpPicard
- tables
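A quick way to compare a local checkout against the layout above is to test that each listed entry exists. The sketch below takes its names from the list; the function name is illustrative and the script is not shipped with the repository. It only checks existence, without assuming whether an entry is a file or a folder.

```python
from pathlib import Path

# Entries copied from the layout listed above.
EXPECTED_ENTRIES = [
    "100bp_exon.bed", "alignedFiles", "AnnotSV", "annovar", "clean_and_merge.py",
    "config_paired.csv", "config_single.csv", "dockerfile", "exomeDepth_paired.r",
    "exomeDepth_single.r", "final", "gatkbundle", "GENDB", "index", "logs",
    "mapped", "reads", "results", "scriptmops.r", "Snakefiles", "trimmed",
    "tmpgenomicsdb", "tmpPicard", "tables",
]


def missing_entries(root=".", expected=EXPECTED_ENTRIES):
    """Return the expected names that do not exist directly under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    absent = missing_entries(".")
    if absent:
        print("Missing:", ", ".join(absent))
    else:
        print("Directory structure looks complete.")
```

Some folders, such as `logs/` or `results/`, may only appear after a first run, so a missing entry is a prompt to check rather than a definite error.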
Ensure your directory structure matches this repository to avoid errors.
- 100bp_exon.bed: BED file containing information about genomic regions of interest
- alignedFiles/: Folder containing intermediate files for determining SNVs and INDELs
- clean_and_merge.py: Python file for merging results from different CNV callers
- config_paired.csv: Configuration file for paired_end data
- config_single.csv: Configuration file for single_end data
- exomeDepth_paired.r: R script that manages ExomeDepth method for paired_end reads
- exomeDepth_single.r: R script that manages ExomeDepth method for single_end reads
- final/: Folder containing the final aligned, deduplicated, and recalibrated BAM files
- gatkbundle/: Folder containing all the resources (variant files, genome files, ...) required by GATK
- GENDB/:
- index/: Folder containing index files for the reference genome
- logs/: Folder containing logs from the execution
- mapped/: Folder containing partially aligned and sorted BAM files
- reads/: Folder containing the raw sequencing reads to be analyzed, in the /single/ and /paired/ directories
- results/: Folder containing results from CNV calling
- scriptmops.r: cn.mops script for CNV detection and analysis
- Snakefiles/: Folder containing the Snakemake files used to run the method
- trimmed/: Folder containing trimmed files
- tmpgenomicsdb
- tmpPicard
- tables