szhan / onekg_analysis
Evaluation of genotype imputation methods using the unified genealogy dataset
License: MIT License
Moved from szhan/tsimpute#93
Right now, this one notebook does the following:
BEAGLE.
tskit.lshmm.
It is easier to divide them up into the following stages, one per notebook:
This should make pipelining run_lshmm.ipynb easier (see #14).
Compare and contrast Viterbi traceback paths from _tskit.lshmm and Duncan's lshmm.
This continues the thread started at szhan/tsimpute#138. I've decided to move it over to here because it is more about analysis than tooling.
This shows the entire workflow mostly in Jupyter notebooks.
Switches should happen only at chip sites, not between chip sites. This should be a sanity check. Also, I wonder if there are hotspots of switching, even though no recombination map is provided.
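A minimal sketch of that sanity check, in plain Python. It assumes a Viterbi path is available as a list of reference haplotype indices, one per site, and that a switch is attributed to the site where the new haplotype first appears. The names `switch_sites`, `switches_off_chip`, and `switch_counts` are hypothetical helpers, not part of tskit or lshmm, and the path below is made-up toy data.

```python
from collections import Counter


def switch_sites(path):
    """Return site indices j where the copied haplotype differs from site j - 1."""
    return [j for j in range(1, len(path)) if path[j] != path[j - 1]]


def switches_off_chip(path, is_chip_site):
    """Return switch sites that are NOT chip sites (expected to be empty)."""
    return [j for j in switch_sites(path) if not is_chip_site[j]]


def switch_counts(paths):
    """Count, per site index, how many paths switch there (to look for hotspots)."""
    counts = Counter()
    for path in paths:
        counts.update(switch_sites(path))
    return counts


# Toy path over 8 sites; chip sites are at indices 0, 3 and 7.
path = [3, 3, 3, 7, 7, 7, 2, 2]
is_chip = [True, False, False, True, False, False, False, True]

print(switch_sites(path))                # [3, 6]
print(switches_off_chip(path, is_chip))  # [6]
```

In this toy case the switch at index 6 falls between chip sites, so the check would flag it. Summing `switch_counts` over all target samples gives a per-site tally that could be plotted along the chromosome to look for hotspots.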
Currently, I'm using the genomic site positions covered by the Infinium OmniExpress-24 to generate pseudo-chip genetic variation data from the WGS data contained in the unified genealogies. I wonder if there are more appropriate array designs to use for this purpose.
Genotype imputation using the two methods currently yields dramatically different results, with tskit.lshmm performing far worse relative to BEAGLE than expected.
I'm using high-coverage (>30x) individual samples (n = 876) in the unified genealogies as a case study to figure out why the imputation results are so different despite the fact that both the methods use the same underlying Li & Stephens HMM model for sample matching (albeit different implementations, obviously).
I've randomly partitioned the high-coverage samples into a mock reference panel (n = 700) and a mock target study cohort (n = 176). Also, I'm focussing on only chip-like sites (n = 7,899), which are covered by a commercially available genotyping array (see #10). This means that I'm subsetting both the reference panel and target cohort to only the genetic variation data at the chip-like sites.
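The partitioning and subsetting described above could be sketched as follows. This is an illustration under stated assumptions, not the code used in the notebooks: samples are represented by integer IDs, genotypes are a list of per-site rows, and `partition_samples` and `subset_to_chip_sites` are hypothetical helper names. The seed is arbitrary.

```python
import random


def partition_samples(sample_ids, n_ref, seed=1):
    """Randomly split sample IDs into a reference panel and a target cohort."""
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)  # local RNG, so the split is reproducible
    return sorted(shuffled[:n_ref]), sorted(shuffled[n_ref:])


def subset_to_chip_sites(positions, genotypes, chip_positions):
    """Keep only the sites whose position is covered by the array.

    positions: site positions, one per row of `genotypes`.
    genotypes: list of per-site rows (one allele per sample).
    """
    chip = set(chip_positions)
    kept = [i for i, pos in enumerate(positions) if pos in chip]
    return [positions[i] for i in kept], [genotypes[i] for i in kept]


# 876 high-coverage samples split into 700 reference and 176 target samples.
ref_ids, target_ids = partition_samples(range(876), n_ref=700)
print(len(ref_ids), len(target_ids))  # 700 176
```

The same `subset_to_chip_sites` call would be applied to both the reference panel and the target cohort so that they share an identical set of sites.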
The idea is to run tskit.lshmm to match the target samples against the reference panel under different parameters and then to see how the number of wrongly imputed sites varies. Also, the results are to be compared with those obtained using BEAGLE.
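Counting wrongly imputed sites can be sketched as a per-site comparison against the known alleles. The function name `count_wrong_sites`, the use of -1 for missing truth, and the toy arrays below are all assumptions made for illustration; they are not results from either method.

```python
def count_wrong_sites(truth, imputed, missing=-1):
    """Count sites where the imputed allele differs from the true allele.

    Sites whose true allele is `missing` are skipped.
    """
    if len(truth) != len(imputed):
        raise ValueError("truth and imputed must cover the same sites")
    return sum(1 for t, i in zip(truth, imputed) if t != missing and t != i)


# Made-up alleles for one target haplotype at six sites.
truth = [0, 1, 1, 0, 1, -1]
imputed_by_method = {
    "method_a": [0, 1, 1, 0, 0, 1],
    "method_b": [0, 0, 1, 1, 0, 0],
}

for name, imputed in imputed_by_method.items():
    print(name, count_wrong_sites(truth, imputed))
# method_a 1
# method_b 3
```

Running the same function on the output of each method, for each parameter setting, gives directly comparable error counts.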
This should greatly facilitate all sorts of analyses of the genotype data.
Instead of running tskit.lshmm sequentially on each sample genome, use Dask to distribute the jobs locally, one job per sample genome.
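The one-job-per-sample shape can be sketched as below. Note the substitution: this uses the standard library's `concurrent.futures` as a stand-in for Dask, purely so the sketch has no third-party dependencies. `match_sample` is a hypothetical placeholder for the real per-sample tskit.lshmm call (here it just picks the reference haplotype with the fewest allele differences), and `run_all` is likewise an assumed name.

```python
from concurrent.futures import ProcessPoolExecutor


def match_sample(sample_id, query, ref):
    """Placeholder for the per-sample matching job.

    Returns (sample_id, index of the reference haplotype with fewest differences).
    """
    def n_diffs(h):
        return sum(a != b for a, b in zip(ref[h], query))

    return sample_id, min(range(len(ref)), key=n_diffs)


def run_all(queries, ref, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Submit one independent job per sample and collect results by sample ID.

    executor_cls can be swapped (e.g. for ThreadPoolExecutor) when debugging.
    """
    with executor_cls(max_workers=max_workers) as pool:
        futures = [
            pool.submit(match_sample, sample_id, query, ref)
            for sample_id, query in queries.items()
        ]
        return dict(future.result() for future in futures)
```

Because each sample is matched independently against the same reference panel, the jobs share no state, which is what makes this kind of local distribution straightforward. When calling `run_all` with the default process pool from a script, the call should sit under an `if __name__ == "__main__":` guard.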
Right now, small values of 1e-8 and 1e-20 are used for mismatch and switch probabilities, respectively. A previous preliminary analysis shows that the precision set for computing HMM path likelihood values can affect results (noticeably, when going from 10 to 20). So, even the set values of mismatch and switch probabilities may have an effect. It would be good to check by setting the mismatch and switch probabilities to, say, 1e-6 and 1e-18, respectively, or possibly higher. It should be quick to test it on a handful of sample genomes anyway.
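To illustrate why the chosen probabilities can matter, here is a toy haploid Li & Stephens Viterbi in log space, run over a small grid of mismatch and switch values. It is a sketch under stated assumptions, not the tskit.lshmm implementation: it assumes biallelic sites, a single switch probability shared by all sites, a stay probability of 1 - switch + switch/n, and a switch probability of switch/n to any given haplotype. The function `viterbi_ls` and the reference and query data are made up.

```python
import math
from itertools import product


def viterbi_ls(ref, query, mismatch, switch):
    """Most likely copying path of `query` through the haplotypes in `ref`."""
    n, m = len(ref), len(query)
    log_stay = math.log(1 - switch + switch / n)
    log_switch = math.log(switch / n)
    log_match, log_mismatch = math.log(1 - mismatch), math.log(mismatch)

    def emit(h, j):
        return log_match if ref[h][j] == query[j] else log_mismatch

    V = [-math.log(n) + emit(h, 0) for h in range(n)]  # uniform start
    back = []  # back[k][h]: best previous haplotype at site k, given h at site k + 1
    for j in range(1, m):
        best = max(range(n), key=V.__getitem__)
        jump = V[best] + log_switch
        row, new_V = [], []
        for h in range(n):
            stay = V[h] + log_stay
            if jump > stay:
                row.append(best)
                new_V.append(jump + emit(h, j))
            else:
                row.append(h)
                new_V.append(stay + emit(h, j))
        back.append(row)
        V = new_V

    h = max(range(n), key=V.__getitem__)
    path = [h]
    for row in reversed(back):  # trace back from the last site to the first
        h = row[h]
        path.append(h)
    path.reverse()
    return path


# Toy data: two reference haplotypes; the query looks like hap 0, then hap 1.
ref = [[0] * 7, [1] * 7]
query = [0, 0, 0, 0, 1, 1, 1]

results = {}
for mismatch, switch in product([1e-8, 1e-6], [1e-20, 1e-18]):
    path = viterbi_ls(ref, query, mismatch, switch)
    n_switches = sum(a != b for a, b in zip(path, path[1:]))
    results[(mismatch, switch)] = (path, n_switches)
    print(mismatch, switch, path, n_switches)
```

On this toy input, a mismatch probability of 1e-8 makes three mismatches costlier than one switch, so the path switches once; with 1e-6 the path stays on the first haplotype and absorbs the mismatches instead. The real data will behave differently, but the sketch shows how the ratio between the two probabilities, rather than either value alone, decides where the path switches.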
It is more convenient to run run_lshmm.ipynb as a pipeline. It makes it easier when I want to compare imputation results obtained under different parameter settings (see #11).