onekg_analysis's People

Contributors

szhan

onekg_analysis's Issues

Split `prepare_dataset.ipynb` into separate notebooks

Moved from szhan/tsimpute#93

Right now, this one notebook does the following:

  1. Download unified genealogies.
  2. Simplify the trees down to only high-coverage individuals.
  3. Split the individuals into reference panel and target cohort (one set of trees per group).
  4. Prepare data objects and files (VCFs and samples) for imputation.
  5. Impute using BEAGLE.
  6. Impute using tskit.lshmm.

The workflow would be easier to manage if these steps were divided into the following stages, one per notebook:

  • Steps 1 to 3.
  • Step 4. This involves writing to VCF and making the samples compatible, but it should soon be accelerated using sgkit.
  • Step 5.
  • Step 6.

Check switch site positions in HMM copying paths

Switches in the copying path should occur only at chip sites, never between them. Checking this would be a useful sanity check. I also wonder whether there are hotspots of switching, even though no recombination map is provided.
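A minimal sketch of how this check could look, assuming a copying path is available as one reference haplotype index per site, alongside the site positions. The function names and the convention that a switch is attributed to the first site copied from the new haplotype are assumptions made for illustration, not part of tskit.lshmm.

```python
from collections import Counter


def find_switch_positions(path, site_positions):
    """Return the positions of sites where the copying path changes haplotype.

    A switch is attributed to the first site copied from the new haplotype
    (an assumed convention).
    """
    return [
        site_positions[i]
        for i in range(1, len(path))
        if path[i] != path[i - 1]
    ]


def find_off_chip_switches(path, site_positions, chip_positions):
    """Return switch positions that do not coincide with a chip site."""
    chip = set(chip_positions)
    return [p for p in find_switch_positions(path, site_positions) if p not in chip]


def count_switches_per_position(paths, site_positions):
    """Tally switches per position across many paths to look for hotspots."""
    counts = Counter()
    for path in paths:
        counts.update(find_switch_positions(path, site_positions))
    return counts


# Toy data: five sites, three of which are chip sites.
site_positions = [10, 20, 30, 40, 50]
chip_positions = [10, 30, 50]
good_path = [3, 3, 7, 7, 7]  # switches at position 30 (a chip site)
bad_path = [3, 7, 7, 7, 2]   # switches at positions 20 (off chip) and 50

print(find_off_chip_switches(good_path, site_positions, chip_positions))
print(find_off_chip_switches(bad_path, site_positions, chip_positions))
print(count_switches_per_position([good_path, bad_path], site_positions))
```

An empty list from `find_off_chip_switches` for every target sample would pass the sanity check; the per-position tally gives a first look at possible switching hotspots.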

Investigate differences in imputation results between `tskit.lshmm` and BEAGLE

Genotype imputation using the two methods currently yields dramatically different results, with tskit.lshmm performing far worse than BEAGLE, which is unexpected.

I'm using high-coverage (>30x) individual samples (n = 876) in the unified genealogies as a case study to figure out why the imputation results are so different, even though both methods use the same underlying Li & Stephens HMM for sample matching (albeit with different implementations).

I've randomly partitioned the high-coverage samples into a mock reference panel (n = 700) and a mock target study cohort (n = 176). Also, I'm focussing only on chip-like sites (n = 7,899), i.e. sites covered by a commercially available genotyping array (see #10). This means that both the reference panel and the target cohort are subset to only the genetic variation data at the chip-like sites.
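The random partition described above could be sketched as follows. This is only an illustration: the function name, the integer sample IDs, and the seed are assumptions, and the actual notebook may do this differently.

```python
import random


def partition_samples(sample_ids, n_reference, seed=None):
    """Randomly split sample IDs into a reference panel and a target cohort."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    reference = sorted(shuffled[:n_reference])
    target = sorted(shuffled[n_reference:])
    return reference, target


# 876 high-coverage individuals, labelled 0..875 here for illustration.
reference_ids, target_ids = partition_samples(range(876), n_reference=700, seed=42)
print(len(reference_ids), len(target_ids))
```

Fixing the seed keeps the reference panel and target cohort identical across the BEAGLE and tskit.lshmm runs, so that differences in results are not due to a different split.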

The idea is to run tskit.lshmm to match the target samples against the reference panel under different parameters and see how the number of wrongly imputed sites varies. The results are then to be compared with those obtained using BEAGLE.
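As a rough sketch of the comparison metric, the number of wrongly imputed sites can be counted by comparing imputed alleles against the known true alleles site by site. The function name and the toy allele vectors below are hypothetical; they do not come from either tool's output format.

```python
def count_imputation_errors(true_alleles, imputed_alleles):
    """Return (number of wrongly imputed sites, error rate) for one haplotype."""
    if len(true_alleles) != len(imputed_alleles):
        raise ValueError("Allele sequences must have the same length.")
    n_wrong = sum(t != i for t, i in zip(true_alleles, imputed_alleles))
    rate = n_wrong / len(true_alleles) if true_alleles else 0.0
    return n_wrong, rate


# Toy example: one target haplotype imputed by two hypothetical methods.
truth = [0, 1, 1, 0, 0, 1, 0, 1]
imputed_by_method = {
    "method_a": [0, 1, 1, 0, 0, 1, 0, 1],
    "method_b": [0, 1, 0, 0, 1, 1, 0, 0],
}
for name, imputed in imputed_by_method.items():
    n_wrong, rate = count_imputation_errors(truth, imputed)
    print(name, n_wrong, rate)
```

Running the same function over the outputs of both methods, for each parameter setting, gives directly comparable error counts per target sample.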

Test sensitivity of sample matching to exact values of mismatch probabilities and switch probabilities

Right now, small values of 1e-8 and 1e-20 are used for the mismatch and switch probabilities, respectively. A previous preliminary analysis showed that the precision setting used when computing HMM path likelihood values can affect the results (noticeably so when going from 10 to 20). So the exact values of the mismatch and switch probabilities may have an effect too. It would be good to check by setting them to, say, 1e-6 and 1e-18, respectively, or possibly higher. It should be quick to test this on a handful of sample genomes anyway.
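To illustrate why the exact values might matter, here is a toy haploid Li & Stephens Viterbi written in plain Python in log space. It is not the tskit.lshmm implementation; the function names, the uniform initial distribution, and the simple transition model (switch to any given haplotype with probability `switch_prob / n`) are assumptions made for this sketch.

```python
import math


def viterbi_path(ref_panel, query, switch_prob, mismatch_prob):
    """Return (best copying path, its log-likelihood) for a haploid query."""
    n, m = len(ref_panel), len(query)
    log_stay = math.log(1 - switch_prob + switch_prob / n)
    log_switch = math.log(switch_prob / n)
    log_match = math.log(1 - mismatch_prob)
    log_mismatch = math.log(mismatch_prob)

    def emit(j, site):
        return log_match if ref_panel[j][site] == query[site] else log_mismatch

    # Uniform prior over reference haplotypes at the first site.
    scores = [-math.log(n) + emit(j, 0) for j in range(n)]
    backpointers = []
    for site in range(1, m):
        best_prev = max(range(n), key=scores.__getitem__)
        new_scores, pointers = [], []
        for j in range(n):
            stay = scores[j] + log_stay
            switch = scores[best_prev] + log_switch
            if stay >= switch:
                new_scores.append(stay + emit(j, site))
                pointers.append(j)
            else:
                new_scores.append(switch + emit(j, site))
                pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    last = max(range(n), key=scores.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, max(scores)


def count_switches(path):
    """Number of sites at which the path changes haplotype."""
    return sum(a != b for a, b in zip(path, path[1:]))


# Toy panel: the query is a mosaic of the two reference haplotypes.
panel = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
query = [0, 0, 0, 1, 1, 1]
for mismatch_prob, switch_prob in [(1e-8, 1e-20), (1e-6, 1e-18)]:
    path, loglik = viterbi_path(panel, query, switch_prob, mismatch_prob)
    print(mismatch_prob, switch_prob, path, count_switches(path))
```

In this toy case, the first setting pays for one switch because a single switch costs less than three mismatches, whereas under the second setting three mismatches are slightly cheaper than one switch, so the path never switches. What decides the path is the relative cost of a switch versus a run of mismatches, which is why testing a few alternative values on real sample genomes seems worthwhile.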
