Giter Site home page Giter Site logo

forestqc's Introduction

ForestQC

This classifier uses random forest model to identify good or bad variants from grey variants.

Installation

To install the software. This soft is compatible with OSX, 64-bit Linux and 64-bit Windows systems.

$ conda install forestqc -c avallonking

To update the software.

$ conda update forestqc -c avallonking

After installation, you can check the usage

$ ForestQC -h
# and check the usage for each command, for example, stat
$ ForestQC stat -h

We provide example usages. Note that all example files are only for testing. Do not use them to analyze real data.

$ git clone [email protected]:avallonking/ForestQC.git
$ cd ForestQC/examples/example_script_and_output/
$ cat example.script.sh
$ sh example.script.sh

Workflow

example/example_output/example.script.sh provides step-by-step tutorial about how to run this software.

Before doing the classification, we should set the thresholds for Outlier_GQ and Outlier_DP. It would print out Outlier_DP threshold and Outlier_GQ threshold on the screen, which would be used as inputs in the next step. Idealy, all vcf files in this analysis should be included, but we suggest to use Chromosome 1 only to speed up the computation if the dataset is very large. Please make sure the memory size (set with -m argument) smaller than the total size of input files and the memory limit of your system

$ ForestQC set_outlier -m 1G -i [vcf_file1],[vcf_file2],[...]

If you use other reference genome other than hg19, please use our script to generate a GC content table for your reference genome. Otherwise, we have prepared hg19 GC content table in examples/gc_content_hg19.tsv and you can directly use it.

# generate GC content table for your reference genome
$ ForestQC compute_gc -i [ref_genome.fasta] -o [output_file]

First, we need to calculate ths statistics from vcf file. It will output a file containing all statistics information for each variant. Note:

  • It is highly recommended to calculate the statistics for different regions of the genome at the same time, which can make this procedure much faster
  • The vcf file should have the information of each individual on eahc site
  • You don't have to merge all vcf files together
  • If no discordant genotype file provided, the number of discordant genotype of all variants will be NA
  • If no pedigree file provided, the mendel errors of all variants will be NA
  • If no HWE p-value file provided, HWE p-value would be computed by ForestQC. Or if ForestQC cannot find HWE p-value for some sites in the provided HWE p-value file, ForestQC would compute HWE p-value for those sites
$ ForestQC stat -i [input_vcf] -o [output_filename] -c [gc_content_file] -g [gender_file(optional)] -p [ped_file (optional)] -d [discordant_genotype_file (optional)] -w [hwe_file(optional)] --gq [Outlier_GQ] --dp [Outlier_DP] -as [user_defined_statistics_file (optional)]

Second, we need to divide the dataset into good, bad and gray variants. The output files would be three: good.xx, bad.xx and grey.xx. The output filename is only the suffix, that is, the latter part of the file name. All output files would be in the same folder with input files. Note that you don't have to merge the input file together. Also, the model used in classification and splitting must be the same. Or you would get ValueError

$ ForestQC split -i [input_file] -o [output_filename_suffix (optional)] -t [user_defined_threshold_file (optional)] -as [user_defined_statistics_names (if user-defined statistics added in last step, this is required)]

Third, we can train our random forest model and apply it on the classification of gray variants. The output files would be predicted_good_xx and predictred_bad_xx. The output filename is only the suffix, that is, the latter part of the file name. All output files would be in the same folder with input files. Note that you need to merge all good variants into one file, all bad variants into one file and all grey variants into one file. Also, the model used in classification and splitting must be the same. Otherwise, you would get ValueError

$ ForestQC classify -g [good_variants] -b [bad_variants] -y [grey_variants] -o [output_filename_suffix (optional)] -t [probability_threshold (optional)] -f [selected_features_names (optional)]

Fourth, to get the final sets of good variants and bad variants, we need to combine the variants predicted by the random forest model (step 3) and the variants separated by the filters (step 2).

$ cat [good_variants(from step 2)] [predicted_good_variants(from step 3)] > [total_good_variants]
$ cat [bad_variants(from step 2)] [predicted_bad_variants(from step 3)] > [total_bad_variants]

Fifth, to get a VCF file containing the variants passing ForestQC, you can get the positions of bad variants and remove them from the original VCF file.

# if chromosome names in the original VCF file are like "chr1", "chr2", etc.
$ awk -F "\t" 'NR>1{print $2"\t"$3}' [total_bad_variants] > [bad_variants_positions]
# if chromosome names in the original VCF file are like "1", "2", etc.
$ awk -F "\t" 'NR>1{print $2"\t"$3}' [total_bad_variants] | sed 's/chr//' > [bad_variants_positions]
# remove bad variants from the original VCF file
$ vcftools --gzvcf [original_gziped_vcf_file] --exclude-positions [bad_variants_positions] --recode --recode-INFO-all -c | gzip -c [output_vcf]

And VCFtools would need to be installed.

File format

Note that all files are tab-separated. If you came across any IndexError, please check your input files and make sure they are all tab-separated.

Output file

  • Statistics file (tab-separated, assuming that we included user-defined features)
RSID CHR POS REF ALT MAF Mean_DP Mean_GQ SD_DP SD_GQ Outlier_DP Outlier_GQ Discordant_Geno Mendel_Error Missing_Rate HWE ABHet ABHom GC user-defined_feature1 user-defined_feature2
chr1:144  1 144 A T 0.03  54.00 54.00 23.00 13.24 0.43 0.23 0.1 0.06 0.01 1.0 0.45 0.99 0.435 2.0 4.3
chr1:145  1 145 A T 0.03  54.02 52.00 26.00 11.64 0.33 0.43 0.2 0.03 0.03 1.0 0.49 0.98 0.435 NA 4.3
  • Final result file (Prbability means the probability of a variant being good. The variant is a good variant if Good = 1 or 1.0, or a bad variant if Good = 0, grey variants do not have Good column before it is predicted)
RSID CHR POS REF ALT MAF Mean_DP Mean_GQ SD_DP SD_GQ Outlier_DP Outlier_GQ Discordant_Geno Mendel_Error Missing_Rate HWE ABHet ABHom GC user-defined_feature1 user-defined_feature2 Probability Good
chr1:144  1 144 A T 0.03  54.00 54.00 23.00 13.24 0.43 0.23 0.1 0.03 0.01 1.0 0.45 0.99 0.435 2.0 4.3 1

Input file

  • Gender file(2 columns, tab-separated. f for female, m for male):
SampleID  Gender
12312sd12  m
4342sd423  f
7hi987989  0     # Any samples whose gender is neither m nor f will be ignored
  • Pedigree file(9 columns, tab-separated). Note that SeqID is the sample ID occuring in VCF files. If control information is provided, ForestQC only calculate HWE p-value for each site with control samples.
FamilyID IndividualID FatherID MotherID Sex PhenotypeID(1 if it is control) DBPID SamID SeqID
C1  2 3 4 m  1  xxx xxx xxx
  • Discordant genotype rate file (2 columns, tab-separated, gzip-compressed or not)
SNP_ID  Discordant_Genotype_Rate
chr1:259   0.01
chr3:122   0.20
  • HWE p-value file (2 columns, tab-separated):
SNP_ID  HWE_p-value
chr3:899   1.0
chr2:900   0.77
  • User-defined features (tab-separated, data are calculated by users in advance. In practice, missing values will be imputed with the median of the sample)
RSID feature1 feature2 feature3
chr1:32 2.3 4.5 2.3
chr2:34 4.3 NA 5.6
  • User-defined filter thresholds (tab-separated, 4 columns, "rare" means filters for rare variants, "common" for common variants, "all" for both)
good all filter threshold
bad all MAF threshold
bad rare Mendel_Error threshold
bad common Missing_Rate threshold
outlier rare Missing_Rate threshold
outlier common Missing_Rate threshold
  • GC content file (tab-separated, 3 columns, including chrosome, starting positions and GC content)
chrM 1 0.3
chrM 1001 0.46

forestqc's People

Contributors

avallonking avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.