szhan / onekg_analysis

Evaluation of genotype imputation methods using the unified genealogy dataset

License: MIT License

Jupyter Notebook 99.96% Shell 0.04%

bioinformatics genomics

onekg_analysis's Issues

Split `prepare_dataset.ipynb` into separate notebooks

Moved from szhan/tsimpute#93

Right now, this one notebook does the following:

1. Download unified genealogies.
2. Simplify the trees down to only high-coverage individuals.
3. Split the individuals into reference panel and target cohort (one set of trees per group).
4. Prepare data objects and files (VCFs and samples) for imputation.
5. Impute using BEAGLE.
6. Impute using tskit.lshmm.

It would be easier to divide these steps into the following stages, one per notebook:

- Steps 1 to 3.
- Step 4. This involves writing to VCF and making samples compatible, but it should soon be accelerated using sgkit.
- Step 5.
- Step 6.

Call `bgzip` and `bcftools` from Python

This should make pipelining run_lshmm.ipynb easier (see #14).
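
A minimal sketch of how the two tools could be called via `subprocess`, assuming both are on `PATH`. The helper names (`bgzip_cmd`, `index_cmd`, `compress_and_index`) are hypothetical and not part of the notebooks; the only external options used are `bgzip -f` and `bcftools index -f`.

```python
import shutil
import subprocess
from pathlib import Path


def bgzip_cmd(vcf_path):
    """Command that compresses vcf_path in place to vcf_path + '.gz'."""
    return ["bgzip", "-f", str(vcf_path)]


def index_cmd(vcf_gz_path):
    """Command that builds an index for a bgzipped VCF, overwriting any old one."""
    return ["bcftools", "index", "-f", str(vcf_gz_path)]


def compress_and_index(vcf_path):
    """Compress and index a VCF; raise if a tool is missing or exits non-zero."""
    for tool in ("bgzip", "bcftools"):
        if shutil.which(tool) is None:
            raise FileNotFoundError(f"{tool} not found on PATH")
    vcf_gz_path = Path(str(vcf_path) + ".gz")
    # check=True turns a non-zero exit status into CalledProcessError,
    # so a failed step stops the pipeline instead of passing silently.
    subprocess.run(bgzip_cmd(vcf_path), check=True)
    subprocess.run(index_cmd(vcf_gz_path), check=True)
    return vcf_gz_path
```

Keeping the command construction separate from the call makes it easy to log or inspect the exact command line before running it.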

Compare sample paths obtained using `_tskit.lshmm` and Duncan's 'lshmm'

Compare and contrast Viterbi traceback paths from _tskit.lshmm and Duncan's lshmm.

Looking into the source code of BEAGLE

This continues the thread started at szhan/tsimpute#138. I've decided to move it over to here because it is more about analysis than tooling.

Initial analysis

This shows the entire workflow mostly in Jupyter notebooks.

Check switch site positions in HMM copying paths

Switches should happen only at chip sites, not between chip sites. Checking this should serve as a sanity check. Also, I wonder if there are hotspots of switching, even though no recombination map is provided.
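
A rough sketch of the check, assuming a copying path is stored as an array with one reference haplotype index per site and that a switch is attributed to the first site at which the new haplotype is copied. The function names and the array layout are assumptions for illustration, not the actual output format of `tskit.lshmm`.

```python
import numpy as np


def find_switch_positions(path, site_positions):
    """Return the positions of sites where the copied haplotype changes.

    path[i] is the index of the reference haplotype copied at site i.
    """
    path = np.asarray(path)
    site_positions = np.asarray(site_positions)
    # Index i is a switch if the haplotype at i differs from the one at i - 1.
    switch_idx = np.flatnonzero(path[1:] != path[:-1]) + 1
    return site_positions[switch_idx]


def switches_off_chip(path, site_positions, chip_positions):
    """Return switch positions that do not coincide with a chip site."""
    switch_pos = find_switch_positions(path, site_positions)
    return switch_pos[~np.isin(switch_pos, chip_positions)]
```

An empty result from `switches_off_chip` means the path passes the sanity check. A histogram of `find_switch_positions` pooled across samples would be one way to look for switching hotspots.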

Choice of genotype array to generate pseudo-chip data

Currently, I'm using the genomic site positions covered by the Infinium OmniExpress-24 to generate pseudo-chip genetic variation data from the WGS data contained in the unified genealogies. I wonder if there are more appropriate array designs to use for this purpose.

Investigate differences in imputation results between `tskit.lshmm` and BEAGLE

Genotype imputation using the two methods currently yields dramatically different results, with tskit.lshmm performing far worse relative to BEAGLE than expected.

I'm using high-coverage (>30x) individual samples (n = 876) in the unified genealogies as a case study to figure out why the imputation results are so different, despite the fact that both methods use the same underlying Li & Stephens HMM for sample matching (albeit with different implementations, obviously).

I've randomly partitioned the high-coverage samples into a mock reference panel (n = 700) and a mock target study cohort (n = 176). Also, I'm focussing on only chip-like sites (n = 7,899), which are covered by a commercially available genotyping array (see #10). This means that I'm subsetting both the reference panel and target cohort to only the genetic variation data at the chip-like sites.
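
For illustration, the partitioning and site subsetting could look roughly like this with NumPy. The function names and the use of a fixed seed are assumptions; the notebooks may do this differently.

```python
import numpy as np


def partition_samples(sample_ids, n_ref, seed=None):
    """Randomly split sample IDs into a reference panel and a target cohort."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(sample_ids))
    # The first n_ref shuffled IDs form the panel; the rest form the cohort.
    return np.sort(shuffled[:n_ref]), np.sort(shuffled[n_ref:])


def chip_site_mask(site_positions, chip_positions):
    """Boolean mask marking the sites whose positions are on the array."""
    return np.isin(site_positions, chip_positions)


# 876 high-coverage samples split into 700 reference and 176 target samples.
ref_ids, target_ids = partition_samples(np.arange(876), n_ref=700, seed=1)
```

Passing a seed keeps the split reproducible across runs, which matters when comparing results obtained under different parameter settings.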

The idea is to run tskit.lshmm to match the target samples against the reference panel under different parameters and then to see how the number of wrongly imputed sites varies. The results are then to be compared with those obtained using BEAGLE.
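
One simple way to count wrongly imputed sites, assuming the true and imputed alleles are held in integer arrays of shape (num_samples, num_sites). The function name and array layout are hypothetical.

```python
import numpy as np


def count_imputation_errors(true_genotypes, imputed_genotypes):
    """Return, per sample, the number of sites where the imputed allele is wrong."""
    true_genotypes = np.asarray(true_genotypes)
    imputed_genotypes = np.asarray(imputed_genotypes)
    if true_genotypes.shape != imputed_genotypes.shape:
        raise ValueError("true and imputed arrays must have the same shape")
    # Compare allele by allele and sum the mismatches along the site axis.
    return np.sum(true_genotypes != imputed_genotypes, axis=1)
```

Running this on the output of each method with the same truth array gives per-sample error counts that can be compared directly.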

Replace use of `tsinfer.SampleData` with `sgkit`

This should greatly facilitate all sorts of analyses of the genotype data.

Use Dask to distribute runs of `tskit.lshmm`

Instead of running tskit.lshmm sequentially on each sample genome, use Dask to distribute the jobs locally, one job per sample genome.
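
A minimal sketch of the idea using `dask.delayed`, with one task per sample genome. `match_sample` is a hypothetical placeholder standing in for a single `tskit.lshmm` run; it is not a real function from either library.

```python
import dask
from dask import delayed


def match_sample(sample_id):
    """Hypothetical placeholder for one tskit.lshmm run on one sample genome."""
    return sample_id, f"path-for-{sample_id}"


def run_all(sample_ids, num_workers=4):
    """Build one delayed task per sample and compute them in parallel locally."""
    tasks = [delayed(match_sample)(sample_id) for sample_id in sample_ids]
    # The threaded scheduler is used here for simplicity.
    results = dask.compute(*tasks, scheduler="threads", num_workers=num_workers)
    return dict(results)
```

If the matching step holds the GIL, `scheduler="processes"` or a `dask.distributed` local cluster is likely a better fit than threads.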

Test sensitivity of sample matching to exact values of mismatch probabilities and switch probabilities

Right now, small values of 1e-8 and 1e-20 are used for the mismatch and switch probabilities, respectively. A preliminary analysis showed that the precision used to compute HMM path likelihood values can affect the results (noticeably so when going from 10 to 20). So the exact values of the mismatch and switch probabilities may also have an effect. It would be good to check by setting them to, say, 1e-6 and 1e-18, respectively, or possibly higher. It should be quick to test this on a handful of sample genomes anyway.
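
The settings to test could be enumerated as a small grid, for example as below. The dictionary keys are hypothetical labels, not actual `tskit.lshmm` argument names.

```python
from itertools import product


def parameter_grid(mismatch_probs, switch_probs, precisions):
    """Return every combination of the given settings as a list of dicts."""
    return [
        {"mismatch": mismatch, "switch": switch, "precision": precision}
        for mismatch, switch, precision in product(
            mismatch_probs, switch_probs, precisions
        )
    ]


# Current values plus the alternatives suggested above.
grid = parameter_grid(
    mismatch_probs=[1e-8, 1e-6],
    switch_probs=[1e-20, 1e-18],
    precisions=[10, 20],
)
```

Each entry can then be used for one run on the same handful of sample genomes, so that any change in the results can be attributed to a single setting.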

Convert `run_lshmm.ipynb` into pipeline

It would be more convenient to run run_lshmm.ipynb as a pipeline, which would make it easier to compare imputation results obtained under different parameter settings (see #11).
