gnomADextractor

This pipeline extracts population allele frequencies (AF) from gnomAD for WGS/WES variants stored in a Variant Call Format (VCF) file. It is designed to run under the SLURM scheduler used on high-performance computing clusters.

Prerequisites

Before running the pipeline, make sure you have the following dependencies installed:

R (version 3.0.2 or higher)

The pipeline assumes R is loaded on the cluster using:

$ module load R

Input files

To run gnomADextractor you need:

  1. A VCF file to be annotated with population allele frequency (AF) - note: both chromosome notations are accepted (e.g. chr1 or 1)
  2. The gnomAD reference files, one per chromosome, in the hg38_gnomAD_AF_chrom/ subdirectory of the working directory (WORKDIR). These files are large and available upon request.

How it works

There are three key steps in this pipeline:

  • Generate one input file per chromosome from the VCF file. Each is a space-delimited file with five columns: chromosome, start position, end position, reference allele and alternate allele - this is implemented in R/gnomAD_input_from_vcf.R
  • Look up the gnomAD population allele frequency (AF) of the variants observed on each chromosome of the VCF file - this is executed by BASH/variant_searcher_array.sh, which uses SLURM job arrays (-a) to process the chromosomes in parallel.
  • Collate the chromosome-level output files from the variant search step into one genome-wide output file - this is implemented in R/gnomAD_output_collate.R
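
As an illustration of the first step, the following Python sketch turns VCF records into five-column, space-delimited rows grouped by chromosome. It is not the pipeline's R code: the function name vcf_to_rows, the end-coordinate rule (start + length of REF - 1) and the splitting of multi-allelic ALT fields into one row per allele are all assumptions made for this example.

```python
from collections import defaultdict


def vcf_to_rows(vcf_lines):
    """Group VCF records by chromosome as 'chrom start end ref alt' rows."""
    rows = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF meta/header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        # Assumption: a multi-allelic ALT field yields one row per allele.
        for alt in fields[4].split(","):
            end = pos + len(ref) - 1  # assumed end-coordinate convention
            rows[chrom].append(f"{chrom} {pos} {end} {ref} {alt}")
    return dict(rows)


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
    "chr2\t300\t.\tC\tT,G\t50\tPASS\t.",
]

for chrom, chrom_rows in vcf_to_rows(example).items():
    print(chrom, chrom_rows)
```

Grouping by chromosome mirrors the per-chromosome layout that the search step relies on, since each chromosome can then be handled by a separate array task.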

Usage

To run the pipeline, follow these steps:

1. Clone the repository:

$ git clone https://github.com/nansari-pour/gnomADextractor

2. Navigate to the directory containing the pipeline scripts - this is considered the working directory:

$ cd gnomADextractor

Then copy the reference files folder (hg38_gnomAD_AF_chrom/) to this directory:

$ cp -r /path/to/hg38_gnomAD_AF_chrom/ .

The VCF file can be stored anywhere, but its full path must be provided as vcf_file in the pipeline (i.e. the first argument passed to the pipeline script; see below).

3. Submit the pipeline job using the sbatch command:

$ sbatch gnomADextractor.sh /full/path/to/VCF_file.vcf "filter1,filter2,filter3"

The pipeline script takes two arguments:

a) The full path to the VCF file

b) The filter types in the FILTER column of the VCF file whose variants are to be retained (at least one; comma-delimited when more than one), e.g. "PASS" or "PASS,germline_risk,somatic"
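
The effect of the second argument can be sketched in Python as follows. This is an illustration, not the pipeline's implementation: the names parse_filters and keep_record are hypothetical, and the sketch assumes a record is kept only when its entire FILTER field (the seventh tab-separated column of a VCF record) exactly matches one of the supplied filter types.

```python
def parse_filters(filter_arg):
    """Split a comma-delimited filter argument into a set of filter names."""
    return {name.strip() for name in filter_arg.split(",") if name.strip()}


def keep_record(vcf_line, retained):
    """Return True when the record's FILTER field is one of the retained types."""
    fields = vcf_line.rstrip("\n").split("\t")
    # Assumption: the whole FILTER field must match one retained type exactly.
    return fields[6] in retained


retained = parse_filters("PASS,germline_risk")

# Made-up records, for illustration only.
records = [
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t50\tgermline_risk\t.",
    "chr1\t300\t.\tG\tA\t50\tLowQual\t.",
]
kept = [rec for rec in records if keep_record(rec, retained)]
print(len(kept), "of", len(records), "records retained")
```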

Output file format

  1. The main output file (*_gnomAD_output.txt) is a tab-delimited text file with a header line and the following columns:

Chromosome (CHROM)

Position (POS)

Reference allele (REF)

Alternate allele (ALT)

Population allele frequency (AF): variants with a value of NA are absent from gnomAD (i.e. novel or rare variants)

  2. An additional file (specifically required by the Tumour-only mode of the Battenberg copy number pipeline) will also be generated (*_chrom_notation_length.txt), which contains a single numeric value:

The numeric value (either 1 or 4) represents the length of the chromosome notation in the VCF file (emanating from the respective BAM file).
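
For downstream use, the two outputs described above can be read roughly as in the Python sketch below. It rests on several assumptions: that the header names are CHROM, POS, REF, ALT and AF as abbreviated in the column list, that the value 4 corresponds to the "chr1" style and 1 to the "1" style, and that the helper names load_af_table and chrom_notation_length (both hypothetical) are not part of the pipeline.

```python
import csv
import io


def load_af_table(handle):
    """Parse the tab-delimited output, mapping an AF of 'NA' to None."""
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        af = None if row["AF"] == "NA" else float(row["AF"])
        records.append({
            "CHROM": row["CHROM"],
            "POS": int(row["POS"]),
            "REF": row["REF"],
            "ALT": row["ALT"],
            "AF": af,
        })
    return records


def chrom_notation_length(chrom):
    """Assumed meaning of the 1-or-4 value: 4 for 'chr1' style, 1 for '1' style."""
    return 4 if chrom.startswith("chr") else 1


# Made-up output content, for illustration only.
example_output = (
    "CHROM\tPOS\tREF\tALT\tAF\n"
    "chr1\t100\tA\tG\t0.0123\n"
    "chr1\t200\tAT\tA\tNA\n"
)
records = load_af_table(io.StringIO(example_output))
absent = [rec for rec in records if rec["AF"] is None]
print(len(records), "variants,", len(absent), "absent from gnomAD")
print("notation length:", chrom_notation_length(records[0]["CHROM"]))
```

Mapping NA to None rather than to 0 keeps variants absent from gnomAD distinguishable from variants with a very low reported frequency.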

Note for TOBB users

Both output files should be copied or symlinked to the TOBB working directory that contains your BAM file, where they are read by the getSNVrho function in TOBB.

Contact

If you have any questions or issues with the pipeline, please contact [email protected]
