SNOWCAT: Streamlining Our Workflow: Code Alternative for TRACTOR

Snowcat is a genome-wide association study (GWAS) pipeline with correction for local ancestry as estimated with RFMIX. It follows the model originally introduced by Elizabeth G. Atkinson in the TRACTOR pipeline (see manuscript, pipeline).

Snowcat offers a number of advantages

Fast, parallel code.
- Allows one to rerun the whole GWAS in a matter of hours on a single computer.
Returns more test statistics and QC metrics for each variant
- T-statistic and p-value for each ancestry
- F-test and p-value for the whole model
- Minor allele frequency (MAF) and minor allele count (MAC) for each ancestry

Overview of SNOWCAT pipeline

Input datasets

Genotypes to run GWAS on
- file format: VCF as returned by an imputation pipeline (e.g. Michigan, TopMed)
Reference genotypes data
- VCF files, typically 1000 Genomes Reference panel (available for hg19 and GRCh38 genome builds)
Ancestry labels for the reference samples (link)
Genetic map files for the target genome (hg19 info)

Pipeline steps

Prepare reference panel files
- Convert the reference panel VCF files
  - Convert to binary vcf format BCF
  - Exclude SNPs with low MAF
  - Keep only genotype information (remove dosage and posterior probability info)
- Prepare Ancestry labels in RFMIX-friendly format
- Prepare genetic map files in RFMIX-friendly format
Convert and filter genotypes
- Convert to binary vcf format BCF
- Exclude SNPs with low MAF and poor imputation quality
- Keep only genotype information (remove dosage and posterior probability info)
Run RFMIX estimation of local ancestry
- Create MSP.TSV file with ancestry info for each genomic range.
Convert Imputed genotypes into a custom binary format for fast parallel GWAS.
- Convert BCFs from step 2 into VCF for processing in R
- Convert VCF into custom binary format.
Run PCA on imputed genotypes for use as covariates
- Convert imputed genotypes to plink format by chromosome
- Prune SNPs for each chromosome
- Combine pruned SNPs across chromosomes
- Run PCA on the combined dataset
Run GWAS with correction for local ancestry
- Run R script for every chromosome
- Combine results, QC, QQ-plot and Manhattan plot.

Here is the dependency chart for the pipeline steps:

graph TD;
    1-->3;
    2-->3;
    3-->6;
    4-->6;
    5-->6;

Install Snowcat R package

To install Snowcat R package from GitHub, run the following code within R

if(!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("andreyshabalin/snowcat")

Running SNOWCAT pipeline

Input data

Imputed genotypes

Imputation servers typically return imputed genotypes in files named chr1.dose.vcf.gz to chr22.dose.vcf.gz. The code below assumes the imputed genotypes are stored in data_vcf_by_chr directory:

Genotypes: data_vcf_by_chr/chr"$i".dose.vcf.gz

1000 Genomes reference panel genotypes

The 1000 Genomes project genotypes are freely available online (hg19 and GRCh38) in VCF format. The code below assumes the reference panel file names to be those for hg19 genome build.

Reference: ref_vcf/ALL.chr"$i".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

Ancestry information for the reference samples

The sample map file for RFMIX. The RFMIX reference manual defines its format as follows: It is tab delimited text with two columns. The first column gives the sample name or identifier, which must match the one used in the reference VCF/BCF. The second column is a string naming a subpopulation and may contain spaces (e.g., "European", or "East African").

The code below assumes the sample map to be located in the map directory:

Sample map: map/1KG_map.txt

Genetic map file for the used genome build.

Genetic map file for RFMIX. The RFMIX reference manual defines its format as follows: The 3 columns are chromosome, physical position in bp, genetic position in cM.

Historically, the source data for creation of this file has been available at link which seems to be currently broken. You may have better luck here or googling for it.

The code below assumes the genetic map to be located in the map directory:

Genetic map: map/map_file.txt

1. Convert the reference panel files

We convert the reference panel VCF files to BCF to achieve several goals.

Convert from VCF to binary vcf format BCF to greatly speed up processing by RFMIX.
Exclude SNPs with low MAF. Very low minor allele veriants are not useful for the analysis.
Keep only genotype information (remove dosage and posterior probability info). This greatly reduces the file size and speeds up processing by RFMIX.

The converted files are saved in ref_bcf directory.

Simple bash loop (may be slow)

mkdir -p ref_bcf
for chr in {1..22}; do
  bcftools annotate \
    ref_vcf/ALL.chr"$chr".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
    -x 'FORMAT' \
    -i 'AF>.001 && AF<.999' \
    -O b9 \
    -o ref_bcf/ref_chr"$chr".bcf.gz
  bcftools index ref_bcf/ref_chr"$chr".bcf.gz
done

Parallel bash loop (needs a beefy machine)

mkdir -p ref_bcf
parallel --linebuffer "\
  bcftools annotate \
    ref_vcf/ALL.chr{}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
    -x 'FORMAT' \
    -i 'AF>.001 && AF<.999' \
    -O b9 \
    -o ref_bcf/ref_chr{}.bcf.gz && \
  bcftools index ref_bcf/ref_chr{}.bcf.gz" ::: {1..22}

2. Convert and filter genotypes

Similarly, we convert the genotype VCF files to BCF. In addition to previous filtering criteria, we exclude variants with imputation R2 below 0.5.

The converted files are saved in data_bcf_by_chr_GT_QC directory.

Simple bash loop (may be slow)

mkdir -p data_bcf_by_chr_GT_QC
for chr in {1..22}; do
  bcftools annotate \
    data_vcf_by_chr/chr"$chr".dose.vcf.gz \
    -x 'FORMAT' \
    -i 'R2>.5 & MAF>.001' \
    -O b9 \
    -o data_bcf_by_chr_GT_QC/GT_R2_.5_MAF_.001_chr"$chr".bcf.gz
  bcftools index data_bcf_by_chr_GT_QC/GT_R2_.5_MAF_.001_chr"$chr".bcf.gz
done

Parallel bash loop (needs a beefy machine)

mkdir -p data_bcf_by_chr_GT_QC
parallel --linebuffer "\
  bcftools annotate \
    data_vcf_by_chr/chr{}.dose.vcf.gz \
    -x 'FORMAT' \
    -i 'R2>.5 & MAF>.001' \
    -O b9 \
    -o data_bcf_by_chr_GT_QC/GT_R2_.5_MAF_.001_chr{}.bcf.gz && \
  bcftools index data_bcf_by_chr_GT_QC/GT_R2_.5_MAF_.001_chr{}.bcf.gz" ::: {1..22}

3. Run RFMIX estimation of local ancestry

We can now run RFMIX for each chromosome. We do not parallelize running RFMIX as it is very a memory intensive program.

The RFMIX output is saved in rfmix_out directory.

mkdir -p rfmix_out
for i in {22..1}; do
  rfmix/rfmix \
    --query-file=data_bcf_by_chr_GT_QC/GT_R2_.5_MAF_.001_chr"$i".bcf.gz \
    --reference-file=ref_bcf/ref_chr"$i".bcf.gz \
    --sample-map=map/1KG_map.txt \
    --genetic-map=map/map_file.txt \
    --chromosome="$i" \
    --n-threads=30 \
    --output-basename=rfmix_out/rfmix_chr"$i"_R2_.5_MAF_.001
done

Note: Replace 30 in --n-threads=30 to the number of cores on the machine being used.

4. Convert Imputed genotypes into a custom binary format for fast parallel GWAS.

First, we convert BCF files from step 2 to VCF for processing in R.

The converted files are saved in data_vcf_by_chr_GT_QC.

Simple bash loop (may be slow)

mkdir -p data_vcf_by_chr_GT_QC
for chr in {1..22}; do
  bcftools view \
    data_bcf_by_chr_GT_QC/GT_R2_.5_MAF_.001_chr"$chr".bcf.gz \
    -o data_vcf_by_chr_GT_QC/GT_R2_.5_MAF_.001_chr"$chr".vcf.gz \
    -O z1
done

Parallel bash loop (needs a beefy machine)

mkdir -p data_vcf_by_chr_GT_QC
parallel --linebuffer "\
  bcftools view \
    data_bcf_by_chr_GT_QC/GT_R2_.5_MAF_.001_chr{}.bcf.gz \
    -o data_vcf_by_chr_GT_QC/GT_R2_.5_MAF_.001_chr{}.vcf.gz \
    -O z1 " ::: {1..22}

Next, we convert the genotypes to a binary format for fast random access by GWAS code.

The converted files are saved in data_bcf_by_chr_GT_QC_fm.

Simple R loop (may be slow)

library(snowcat)
for( chr in 1:22 ){
  vcffilename = paste0("data_vcf_by_chr_GT_QC/GT_R2_.5_MAF_.001_chr",chr,".vcf.gz")
  fmnameroot = paste0("data_bcf_by_chr_GT_QC_fm/chr",chr)
  convertVCFtoFilematrix(vcffilename, fmnameroot)
}

Parallel bash loop (needs a beefy machine)

parallel --linebuffer "\
  Rscript -e \"\
    chr = {}; \
    library(snowcat); \
    vcffilename = paste0('data_vcf_by_chr_GT_QC/GT_R2_.5_MAF_.001_chr',chr,'.vcf.gz'); \
    fmnameroot = paste0('data_bcf_by_chr_GT_QC_fm/chr',chr); \
    convertVCFtoFilematrix(vcffilename, fmnameroot); \
  \"" ::: {1..22}

5. Run PCA on imputed genotypes for use as covariates

Any GWAS analysis should include correction for a number of ancestry PCs. Thus, we run PCA analysis on the imputed genotypes

The PCA analysis and intermediate plink files are saved in data_plink directory. \

Simple bash loop (may be slow)

mkdir -p data_plink

for chr in {1..22}; do

  # Convert imputed genotypes to plink format by chromosome
  # and filter by minor allele frequency
  plink2 \
    --vcf data_vcf_by_chr_GT_QC/GT_R2_.5_MAF_.001_chr"$chr".vcf.gz \
    --double-id \
    --make-bed \
    --maf 0.01 \
    --out data_plink/chr"$chr"
    
  # Prune SNPs for each chromosome
  plink \
    --bfile data_plink/chr"$chr" \
    --indep-pairphase 1000 10 0.2 \
    --out data_plink/prune."$chr"
    
  plink2 \
    --bfile data_plink/chr"$chr" \
    --extract data_plink/prune."$chr".prune.in \
    --make-pgen \
    --out data_plink/chr"$chr".p
    
done

# Combine pruned SNPs across chromosomes
if [ -f data_plink/1list.txt ]; then
  rm data_plink/1list.txt
fi
for chr in {1..22}; do
  echo data_plink/chr"$chr".p >> data_plink/1list.txt
done

plink2 \
  --pmerge-list data_plink/1list.txt \
  --make-pgen \
  --out data_plink/all_chr

# Run PCA in pruned SNPs, extract top 250 PCs
plink2 \
  --pfile data_plink/all_chr \
  --pca 250 \
  --out data_plink/pca

Parallel bash loop (needs a beefy machine)

mkdir -p data_plink

# Convert imputed genotypes to plink format by chromosome
# Filter by minor allele frequency
# Prune SNPs for each chromosome
parallel --linebuffer "\
  plink2 \
    --vcf data_vcf_by_chr_GT_QC/GT_R2_.5_MAF_.001_chr{}.vcf.gz \
    --double-id \
    --make-bed \
    --maf 0.01 \
    --out data_plink/chr{} && \
  plink \
    --bfile data_plink/chr{} \
    --indep-pairphase 1000 10 0.2 \
    --out data_plink/prune.{} && \
  plink2 \
    --bfile data_plink/chr{} \
    --extract data_plink/prune.{}.prune.in \
    --make-pgen \
    --out data_plink/chr{}.p" ::: {1..22}


# Combine pruned SNPs across chromosomes
if [ -f data_plink/1list.txt ]; then
  rm data_plink/1list.txt
fi
for chr in {1..22}; do
  echo data_plink/chr"$chr".p >> data_plink/1list.txt
done

plink2 \
  --pmerge-list data_plink/1list.txt \
  --make-pgen \
  --out data_plink/all_chr

# Run PCA in pruned SNPs, extract top 250 PCs
plink2 \
  --pfile data_plink/all_chr \
  --pca 250 \
  --out data_plink/pca

The PCs are saved in the data_plink/pca.eigenvec file.

6. Run GWAS with correction for local ancestry.

Running GWAS for each chromosome can be parallelized across CPU cores of the machine.

The code below assumes the cases and controls differ in the first letter of the sample name.

library(data.table)
library(snowcat)

# Number of principal components to correct for
npcs = 20

# Chromosome number
chr = 22

# File name for the filematrix with genotypes
fmifilename = paste0("data_bcf_by_chr_GT_QC_fm/chr",chr)

# Estimated local ancestry from RFMIX
mspfilename = paste0("rfmix_out/rfmix_chr",chr,"_R2_.5_MAF_.001.msp.tsv")

# Output directory
outputdir = paste0("gwas/pc",npcs, "_linear")

# Data frame with covariates
pca = fread("data_plink/pca.eigenvec",
            header = TRUE,
            data.table = FALSE)
cvrt = as.matrix(pca[ paste0("PC",seq_len(npcs)) ])
rownames(pca) = pca$IID
rm(npcs, pca)

# Outcome variable
samples = fread(file = paste0(fmifilename,".samples"),
                sep = "",
                quote = FALSE,
                header = FALSE)[[1]]
tmp = substr(samples, 1, 1)
phenotype = as.numeric(tmp == tail(tmp,1))
rm(samples, tmp)

# Run the GWAS
run.gwas.linear(
    chr,
    fmifilename,
    mspfilename,
    outputdir,
    cvrt,
    phenotype)

# Look at the first 2 variants
gwas = data.table::fread("gwas/pc20_linear/gwas_chr22.sumstat.txt", nrow = 2)
show(t(gwas))

Here is the preview of GWAS results with QC metrics:

Field	SNP1	SNP2
chr	22	22
pos	17,031,494	17,066,596
ID	22:17031494:A:G	22:17066596:A:G
ref	A	A
alt	G	G

AlleleTotal_AFR	392	392
AlleleTotal_AMR	1,345	1,345
AlleleTotal_EAS	8,806	8,806
AlleleTotal_EUR	56,642	56,642
AlleleTotal_SAS	1,511	1,511

AltAlleleCnt_AFR	29	0
AltAlleleCnt_AMR	35	6
AltAlleleCnt_EAS	95	20
AltAlleleCnt_EUR	2,750	430
AltAlleleCnt_SAS	58	2

MinorAlleleFreq_AFR	0.07397	0.00000
MinorAlleleFreq_AMR	0.02602	0.00446
MinorAlleleFreq_EAS	0.01078	0.00227
MinorAlleleFreq_EUR	0.04855	0.00759
MinorAlleleFreq_SAS	0.03838	0.00132

Nca_AFR	53.0	53.0
Nca_AMR	304.5	304.5
Nca_EAS	89.5	89.5
Nca_EUR	5119.5	5119.5
Nca_SAS	19.5	19.5

Nco_AFR	143.0	143.0
Nco_AMR	368.0	368.0
Nco_EAS	4,313.5	4,313.5
Nco_EUR	23,201.5	23,201.5
Nco_SAS	736.0	736.0

beta_AFR	-0.05349	NA
beta_AMR	0.03792	0.25132
beta_EAS	-0.00133	-0.00941
beta_EUR	0.00360	-0.02690
beta_SAS	0.00093	-0.00779

zscore_AFR	-0.779	NA
zscore_AMR	0.645	1.797
zscore_EAS	-0.037	-0.123
zscore_EUR	0.530	-1.635
zscore_SAS	0.020	-0.032

pvalue_AFR	0.4356	NA
pvalue_AMR	0.5183	0.0722
pvalue_EAS	0.9698	0.9019
pvalue_EUR	0.5959	0.1020
pvalue_SAS	0.9839	0.9742

R2	0.00013	0.00026
Ftest	0.502	1.142
Pvalue	0.873	0.330

andreyshabalin / snowcat Goto Github PK

snowcat's Introduction

SNOWCAT: Streamlining Our Workflow: Code Alternative for TRACTOR

Overview of SNOWCAT pipeline

Input datasets

Pipeline steps

Install Snowcat R package

Running SNOWCAT pipeline

Input data

1. Convert the reference panel files

2. Convert and filter genotypes

3. Run RFMIX estimation of local ancestry

4. Convert Imputed genotypes into a custom binary format for fast parallel GWAS.

5. Run PCA on imputed genotypes for use as covariates

6. Run GWAS with correction for local ancestry.

snowcat's People

Contributors

Stargazers

Watchers

Recommend Projects

Recommend Topics

Recommend Org