broadinstitute / gnomad_local_ancestry

Hail batch pipeline and scripts for local ancestry inference

License: MIT License

Python 100.00%

gnomad_local_ancestry's Issues

Investigate larger mixed vs. smaller homogeneous panel for RFMix

Is it better to have admixed individuals in the reference panel (similar to the admixed population) than homogeneous individuals (likely more diverged)? In other words, does including admixed individuals in the reference panel help or hurt?
RFMix has a flag to run on the reference panel as well (--reanalyze-reference; -e 1). We should do this once; it will add time but is worth it.
There is a computational incentive to have a smaller panel.
What is the trade-off between accuracy and compute/storage cost?
RFMix notes it is better to run on the entire genome (we can't do this).
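
A minimal sketch of how the reference re-analysis run could be assembled, assuming RFMix v2's command-line interface. The `--reanalyze-reference` and `-e 1` flags come from the notes above; the remaining flags (`-f`, `-r`, `-m`, `-g`, `-o`, `--chromosome`) reflect the RFMix v2 usage text as I understand it and are worth checking against `rfmix --help`. The function name and all file paths are hypothetical, and nothing is executed.

```python
from typing import List


def build_rfmix_command(
    query_vcf: str,
    reference_vcf: str,
    sample_map: str,
    genetic_map: str,
    output_prefix: str,
    chromosome: str,
    reanalyze_reference: bool = True,
    em_iterations: int = 1,
) -> List[str]:
    """Assemble an RFMix argument list; the command is only built, not run."""
    cmd = [
        "rfmix",
        "-f", query_vcf,        # phased query (cohort) VCF
        "-r", reference_vcf,    # phased reference panel VCF
        "-m", sample_map,       # reference sample -> population map
        "-g", genetic_map,      # recombination map
        "-o", output_prefix,
        f"--chromosome={chromosome}",
    ]
    if reanalyze_reference:
        # Flags quoted in the issue: --reanalyze-reference with -e 1.
        cmd += ["--reanalyze-reference", "-e", str(em_iterations)]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_rfmix_command(
    query_vcf="amr.chr20.phased.vcf.gz",
    reference_vcf="ref_panel.chr20.phased.vcf.gz",
    sample_map="ref_panel.sample_map.tsv",
    genetic_map="genetic_map.chr20.txt",
    output_prefix="amr.chr20.rfmix",
    chromosome="chr20",
)
print(" ".join(cmd))
```

Building the argument list separately from running it makes it easy to compare a run with and without reference re-analysis by toggling one parameter.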

Phase gnomAD v3.1, 1KG, and HGDP

Phase the subsetted gnomAD populations (AMR and AFR) together with the 1KG and HGDP samples.

Input: subsetted VCF containing only variants that passed variant QC
Tool: Eagle
Output: phased VCF per chromosome
QC: switch errors
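
As a rough illustration of the switch-error QC, the sketch below counts how often the inferred phase flips relative to a truth phasing between consecutive heterozygous sites. It assumes a truth set is available (for example, from trio-phased probands) and that genotypes are already loaded as allele pairs; the function name and toy data are hypothetical.

```python
from typing import List, Tuple

Genotype = Tuple[int, int]  # phased alleles: (haplotype 0, haplotype 1)


def switch_error_rate(truth: List[Genotype], inferred: List[Genotype]) -> float:
    """Fraction of adjacent heterozygous-site pairs whose relative phase differs."""
    if len(truth) != len(inferred):
        raise ValueError("truth and inferred must cover the same sites")

    # For each informative site, record whether the inferred phase has the
    # same orientation as the truth (True) or the flipped one (False).
    orientations = []
    for t, i in zip(truth, inferred):
        if t[0] == t[1]:
            continue  # homozygous in truth: carries no phase information
        if i == t:
            orientations.append(True)
        elif i == (t[1], t[0]):
            orientations.append(False)
        # Any other case is a genotype discordance and is skipped here.

    if len(orientations) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(orientations, orientations[1:]))
    return switches / (len(orientations) - 1)


# Toy example: one switch between the second and third heterozygous sites.
truth = [(0, 1), (1, 0), (0, 0), (0, 1), (1, 0)]
inferred = [(0, 1), (1, 0), (0, 0), (1, 0), (0, 1)]
rate = switch_error_rate(truth, inferred)
print(f"switch error rate: {rate:.3f}")
```

A single switch followed by consistently flipped sites counts once, which is what distinguishes a switch error rate from a simple per-site mismatch rate.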

Create test set for batch pipeline

Create a full set of test contigs using the AMR probands within gnomAD trios

Add Tractor to Batch LAI pipeline

Overlay rare variants on top of local ancestry intervals (analysis) [Nice to have]

Write script to generate allele frequency spectra and quality metrics

Need to determine which quality metrics to use

Determine file storage for pipeline

Need to determine which files to keep from pipeline and in which bucket to keep them. Should any files be saved for future runs?

Export all per-chromosome VCFs for LAI cohort subset

Create AMR reference panel for RFMix

Dummy issue

Subset gnomAD v3.1 to AMR, AFR, HGDP, and 1KG samples from variant QC tables

Mike will perform rough variant QC on callrate and MAF

Create gnomAD LAI release

Add Eagle to Batch LAI pipeline

Run Tractor on AMR population

Create AFR reference panel for RFMix

Run full Batch pipeline on each chromosome

Analyze Eagle/RFMix output for AMR chr20 test

Run script to generate AF spectra and quality metrics

Run RFMix on phased VCFs for AMR and AFR populations

Run RFMix for local ancestry on AMR populations

Input: Phased sample VCF, Phased reference panel VCF
Output:
QC:

Build Docker image for Batch pipeline

Need a Docker image that includes:

Eagle
RFMix
bcftools
htslib
Tractor

sudo apt-get install:

libgomp1
build-essential
python3-dev
unzip
update
autoconf
wget
zlib1g-dev
libbz2-dev
liblzma-dev

LAI results are available for download

Analyze how many variants change frequency once we have local ancestry inference
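
One way to frame this analysis is to compare a variant's overall allele frequency with its frequency within each local ancestry background. The sketch below assumes each haplotype at a site has already been reduced to an (allele, local ancestry) pair; the function names and toy data are hypothetical and do not reflect the pipeline's actual output format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One entry per haplotype at a single variant:
# (alt allele carried: 0 or 1, local ancestry label at that position)
Haplotype = Tuple[int, str]


def overall_af(haplotypes: List[Haplotype]) -> float:
    """Alt allele frequency ignoring local ancestry."""
    return sum(allele for allele, _ in haplotypes) / len(haplotypes)


def ancestry_specific_af(haplotypes: List[Haplotype]) -> Dict[str, float]:
    """Alt allele frequency computed separately within each ancestry background."""
    alt_counts: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for allele, ancestry in haplotypes:
        totals[ancestry] += 1
        alt_counts[ancestry] += allele
    return {anc: alt_counts[anc] / totals[anc] for anc in totals}


# Toy data: the alt allele sits only on AFR-ancestry haplotypes.
haplotypes = [
    (1, "AFR"), (0, "AFR"), (1, "AFR"),
    (0, "AMR"), (0, "AMR"), (0, "AMR"),
    (0, "EUR"), (0, "EUR"),
]
af_all = overall_af(haplotypes)
af_by_ancestry = ancestry_specific_af(haplotypes)
print(af_all, af_by_ancestry)
```

In the toy data the variant looks moderately rare overall but is common on one ancestry background, which is the kind of frequency change the issue asks about.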

Estimate initial global ancestry proportion

Use ADMIXTURE to sanity check admixture proportions

Correlation between RFMix and ADMIXTURE
Tractor paper, Supplementary Figure 1C,E
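
A minimal sketch of the sanity check, assuming local ancestry calls are summarized as tracts with genetic-map coordinates and ADMIXTURE proportions are available per sample. Global ancestry is approximated here as the fraction of tract length assigned to each ancestry, which is an assumption of this example; sample names and values are made up.

```python
from math import sqrt
from typing import Dict, List, Tuple

Tract = Tuple[float, float, str]  # (start_cM, end_cM, ancestry label)


def global_proportions(tracts: List[Tract]) -> Dict[str, float]:
    """Fraction of total tract length assigned to each ancestry."""
    lengths: Dict[str, float] = {}
    for start, end, ancestry in tracts:
        lengths[ancestry] = lengths.get(ancestry, 0.0) + (end - start)
    total = sum(lengths.values())
    return {anc: length / total for anc, length in lengths.items()}


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy per-sample local ancestry tracts (hypothetical).
lai_tracts = {
    "s1": [(0.0, 60.0, "AFR"), (60.0, 100.0, "EUR")],
    "s2": [(0.0, 20.0, "AFR"), (20.0, 100.0, "EUR")],
    "s3": [(0.0, 90.0, "AFR"), (90.0, 100.0, "EUR")],
}
# Toy ADMIXTURE estimates of the AFR proportion (hypothetical).
admixture_afr = {"s1": 0.58, "s2": 0.25, "s3": 0.88}

samples = sorted(lai_tracts)
lai_afr = [global_proportions(lai_tracts[s])["AFR"] for s in samples]
adm_afr = [admixture_afr[s] for s in samples]
r = pearson(lai_afr, adm_afr)
print(f"Pearson r (AFR proportion): {r:.4f}")
```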

Plot ancestry karyograms

Get plotting script: https://github.com/armartin/ancestry_pipeline (Step 2.1 in ancestry_pipeline repo)
Fig 1, supplementary material for Tractor paper

Investigate tract distributions for AMR and AFR populations

Want to make sure you capture all the haplotypes at the level of resolution you expect [EA]

More recently admixed populations will have longer tracts; make sure tract lengths make sense given demographic history
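
To look at tract lengths, adjacent windows with the same ancestry call first need to be merged into tracts. The sketch below assumes per-window calls for one haplotype given as (start, end, ancestry) in centimorgans; the window layout and function names are hypothetical rather than the actual RFMix output schema.

```python
from typing import List, Tuple

Window = Tuple[float, float, str]  # (start_cM, end_cM, ancestry label)


def merge_tracts(windows: List[Window]) -> List[Window]:
    """Merge contiguous windows that share an ancestry call into tracts."""
    tracts: List[Window] = []
    for start, end, ancestry in sorted(windows):
        if tracts and tracts[-1][2] == ancestry and tracts[-1][1] == start:
            # Same ancestry and directly adjacent: extend the current tract.
            tracts[-1] = (tracts[-1][0], end, ancestry)
        else:
            tracts.append((start, end, ancestry))
    return tracts


def mean_tract_length(tracts: List[Window], ancestry: str) -> float:
    """Mean length (cM) of tracts with the given ancestry; 0.0 if none."""
    lengths = [end - start for start, end, anc in tracts if anc == ancestry]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Toy haplotype: two AMR tracts separated by a short EUR tract.
windows = [
    (0.0, 5.0, "AMR"), (5.0, 10.0, "AMR"),
    (10.0, 15.0, "EUR"),
    (15.0, 20.0, "AMR"), (20.0, 25.0, "AMR"), (25.0, 30.0, "AMR"),
]
tracts = merge_tracts(windows)
print(tracts)
print(mean_tract_length(tracts, "AMR"))
```

Comparing the per-ancestry tract length distributions against what the population's admixture history would predict is then a matter of collecting these lengths across haplotypes.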

Run PCA on reference panels

Create subsets for phasing with and without reference panel

Generate small subset for LAI phase input test (separately phasing samples and reference) and then run through LAI pipeline

How to quantify which is better:

Look at switch errors from Tractor's output

Update batch to a production pipeline

Download HapMap combined for recombination map

RFMix needs this

Dummy epic

Export chr20 VCF from gnomAD v3.1 LAI cohort subset

Determine LAI breakdown in browser

Overlay rare variants on top of ancestry intervals

Compare trio tracts

Learn pattern of admixture in AMR reference panel

Many of the AMR reference samples are admixed. We do not have true Native American samples. It would be nice to get a better understanding of the reference panel.

Update subsetted VCFs to variant QC'd variants

Test updated Batch pipeline on ~7k AMR samples for chr20

Run PCA on ancestry-specific components for AMR samples

PCA 2.3 from Alicia’s ancestry pipeline: https://github.com/armartin/ancestry_pipeline
Check out Alicia's AJHG paper

Run chr20 through complete Batch pipeline (including Tractor)

Add Tractor to Batch pipeline

Look into XGMix replacing RFMix

Elizabeth is benchmarking XGMix (essentially a new RFMix with heavier training but faster analysis). It produces the same outputs, so we should be able to switch with minimal work if it seems advantageous.

Update RFMix Batch pipeline

From Google Doc:

Phase the cohort data informed by the reference panel rather than joint-phased with the reference panel
Remove some data harmonizing/QC steps at the front end of the pipeline that won't be needed now
PLINK is currently used for QC
Replace SHAPEIT with Eagle2 for phasing
Eagle automatically filters SNPs and individuals whose missing rates exceed 0.1 for PLINK input but not for VCF input; switch to Hail/gnomad_methods for this filtering?
Be aware of overhead in submitting jobs and copying input/output files per job
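
Since Eagle does not apply its missingness filter to VCF input, an equivalent filter would have to run upstream. The sketch below shows the logic in plain Python on a small in-memory genotype matrix, using the 0.1 threshold from the notes above. It is an illustration only: both rates are computed on the unfiltered matrix, which may not match the order Eagle uses internally, and a real implementation would more likely use Hail.

```python
from typing import List, Optional, Sequence, Tuple

Call = Optional[int]  # alt allele count, or None when the genotype is missing


def missing_rate(calls: Sequence[Call]) -> float:
    """Fraction of missing calls in a row or column of the genotype matrix."""
    return sum(call is None for call in calls) / len(calls)


def filter_by_missingness(
    matrix: List[List[Call]], threshold: float = 0.1
) -> Tuple[List[int], List[int]]:
    """Return (kept variant indices, kept sample indices).

    matrix[v][s] is the call for variant v in sample s. Entries whose
    missing rate exceeds the threshold are dropped.
    """
    n_samples = len(matrix[0])
    keep_variants = [
        v for v, row in enumerate(matrix) if missing_rate(row) <= threshold
    ]
    keep_samples = [
        s for s in range(n_samples)
        if missing_rate([row[s] for row in matrix]) <= threshold
    ]
    return keep_variants, keep_samples


# Toy matrix: 10 variants x 10 samples, all called, then add missing data.
matrix: List[List[Call]] = [[0] * 10 for _ in range(10)]
matrix[2][0] = matrix[2][1] = matrix[2][2] = None  # variant 2: 30% missing
matrix[5][7] = None                                # variant 5: 10% missing
matrix[0][9] = matrix[1][9] = None                 # sample 9: 20% missing

kept_variants, kept_samples = filter_by_missingness(matrix)
print(kept_variants)
print(kept_samples)
```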

Analyze results from chr20 full-pipeline test run

Stress-test Eagle with chr1 for 21k AFR cohort [still needed?]

Pick vignettes for talk: a common-disease and a rare-disease example of a variant known to be associated with a specific phenotype, showing how LAI identifies the ancestry of that allele