mauerjh / tracts

This project forked from sgravel/tracts

A set of tools for modelling ancestry patterns along the genome.

License: GNU General Public License v2.0

Tracts

Tracts is a set of classes and definitions used to model migration histories based on ancestry tracts in admixed individuals. Time-dependent gene-flow from multiple populations can be modeled.

Examples

Examples contains sample HapMap data and scripts to analyze them, including two different gene flow models. It also contains a 3-population model for 1000 Genomes Puerto Rican data.

Installation

Installation support for tracts is rudimentary. Apologies.

Copy all files and folders locally.

"tracts.py" is a python module. It is currently compatible with python 2.7.

To load the package, you currently have to tell python where to look for the package by doing something like

import sys
tractspath = path_to_tracts_directory  # the path to tracts if not in your default pythonpath
sys.path.append(tractspath)
import tracts

I have added tracts_conda_env.yml, an anaconda environment file with the relevant dependencies. If you are using anaconda, you can load this environment using

conda env create -f tracts_conda_env.yml
conda activate py2_tracts

Input

Tracts input is a set of BED-style files describing the local ancestry of segments along the genome. Each file has 2 extra columns for the cM positions of the segments. There are two input files per individual (one for each haploid genome copy).

chrom   begin      end        assignment  cmBegin  cmEnd
chr13   0          18110261   UNKNOWN     0.0      0.19
chr13   18110261   28539742   YRI         0.19     22.193
chr13   28539742   28540421   UNKNOWN     22.193   22.193
chr13   28540421   91255067   CEU         22.193   84.7013
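
As a rough illustration of this layout, the sketch below reads such a file into a list of records. It is not part of tracts, which has its own loading code. The function name read_tract_file, the Tract record, and the assumption that columns are whitespace-separated with a single header line are ours.

```python
import collections
import io

# Hypothetical record type for one line of the file (not a tracts class).
Tract = collections.namedtuple(
    "Tract", ["chrom", "begin", "end", "assignment", "cm_begin", "cm_end"])


def read_tract_file(handle):
    """Parse a BED-style local-ancestry file from an open text handle.

    Assumes whitespace-separated columns and a single header line
    whose first field is 'chrom'.
    """
    records = []
    for line in handle:
        fields = line.split()
        if not fields or fields[0] == "chrom":
            continue  # skip blank lines and the header
        chrom, begin, end, label, cm_begin, cm_end = fields[:6]
        records.append(Tract(chrom, int(begin), int(end), label,
                             float(cm_begin), float(cm_end)))
    return records


# The example rows shown above, fed through an in-memory file.
example = io.StringIO(u"""chrom begin end assignment cmBegin cmEnd
chr13 0 18110261 UNKNOWN 0.0 0.19
chr13 18110261 28539742 YRI 0.19 22.193
chr13 28539742 28540421 UNKNOWN 22.193 22.193
chr13 28540421 91255067 CEU 22.193 84.7013
""")
records = read_tract_file(example)
```

Keeping both the base-pair and cM coordinates in each record makes it easy to check that segments are contiguous and that cM positions never decrease along a chromosome.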

Driver File

To maintain maximum flexibility, the options and models in tracts are set up in a driver file and a "model" file. Examples of both are provided in the distribution; these examples are the best starting points for first-time users. Tracts can also be used interactively: when using the (i)python console, it is easy to examine and plot the different variables.

Output

The 3-population example files produce 7 output files, e.g.

 boot0_-252.11_bins	boot0_-252.11_liks	boot0_-252.11_ord	boot0_-252.11_pred
 boot0_-252.11_dat	boot0_-252.11_mig	boot0_-252.11_pars

boot0 means that this is bootstrap iteration 0, which in the convention used here means the fit with the real data. (In the two-population example, there is no bootstrap, so the output is named "out" and "out2" instead.) -252.11 is the likelihood of the best-fit model.

  • _bins: the bins used in the discretization
  • _dat: the observed counts in each bin
  • _pred: the predicted counts in each bin, according to the model
  • _mig: the inferred migration matrix, with the most recent generation at the top, and one column per migrant population. Entry i,j in the matrix represents the proportion of individuals in the admixed population who originate from the source population j at generation i in the past.
  • _pars: the optimal parameters. I.e., if these parameters are passed to the admixture model, it will return the inferred migration matrix.
  • _liks: the likelihoods in the model parameter space in the output format of scipy.optimize's "brute" function: the first number is the best likelihood, the top matrices define the grid of parameters used in the search, and the last matrix defines the likelihood at all grid points. See http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.brute.html
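
When collecting results from many bootstrap runs, it can help to split these file names back into their parts. The following is a minimal sketch that assumes names always follow the prefix_likelihood_suffix pattern shown above; the helper names split_output_name and group_by_run are hypothetical and not part of tracts.

```python
import re

# Assumed pattern: <prefix>_<likelihood>_<suffix>, e.g. "boot0_-252.11_bins".
_NAME = re.compile(
    r"^(?P<prefix>.+?)_(?P<lik>-?\d+(?:\.\d+)?)_(?P<kind>[a-z]+)$")


def split_output_name(filename):
    """Return (prefix, likelihood, kind) for a name like 'boot0_-252.11_bins'."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError("unrecognized output name: %r" % filename)
    return (match.group("prefix"),
            float(match.group("lik")),
            match.group("kind"))


def group_by_run(filenames):
    """Group file kinds (bins, dat, pred, ...) under (prefix, likelihood)."""
    runs = {}
    for name in filenames:
        prefix, lik, kind = split_output_name(name)
        runs.setdefault((prefix, lik), []).append(kind)
    return runs


runs = group_by_run(["boot0_-252.11_bins", "boot0_-252.11_mig"])
```

Grouping by (prefix, likelihood) keeps all files of one fit together, so the best-likelihood fit per bootstrap iteration can be picked by comparing the second element of each key.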

Setting up a demographic model

The space of possible incoming migration matrices is quite large: if we have p migrant populations over g generations, there can be p*g different migration rates. To simplify this, we introduce parametrized models that describe the full migration matrix in terms of a few parameters. These models may, for example, involve a discrete number of admixture pulses, or periods of constant migration rate. The user has full flexibility in defining these models: in Python, one needs to write a function that takes parameters as input (such as the time of the onset of migration and the migration rate) and returns a migration matrix.

Here is the simplest example of such a function, implementing a single pulse of migration:

import numpy

def pp(params):
        """ A simple model in which populations Eu and AFR arrive
            discretely at first generation. If a time is not integer, the
            migration is divided between neighboring times proportional to
            the non-integer time fraction.  """
        # unpack inside the body so the definition is also valid in python 3
        (init_Eu, tstart) = params

        # the time is scaled by a factor 100 in this model to ease
        # optimization with some routines that expect all parameters to
        # have the same scale
        tstart *= 100

        if  tstart < 0:
                #time shouldn't be negative: that should be caught by
                #constraint function (below). Return empty matrix
                gen = int(numpy.ceil(max(tstart, 0))) + 1
                mig = numpy.zeros((gen+1, 2))
                return mig

        # number of generations in the migration matrix
        gen  = int(numpy.ceil(tstart)) + 1
        # how close we are to the integer approximation
        frac = gen - tstart - 1
        # placeholder migration matrix
        mig  = numpy.zeros((gen + 1, 2))

        #initial migration rates must sum up to one.
        initNat = 1 - init_Eu

        # Replace a fraction at second generation to ensure a continuous
        # model distribution with generation
        mig[-1,:] = numpy.array([init_Eu, initNat])
        mig[-2,:] = frac * numpy.array([init_Eu, initNat])

        return mig

Some parameter values are inconsistent: times must be positive, and proportions of migrants must be between 0 and 1. We define an auxiliary function that verifies whether these conditions are met. It returns a number that is nonnegative if the constraints are satisfied, and gets increasingly negative as they are more strongly violated.

def outofbounds_pp(params):
        """ Constraint function evaluating below zero when constraints not
            satisfied. """
        ret = 1 #initialize the return variable to a positive value.
        (init_Eu, tstart) = params

        # migration proportion must be between 0 and 1
        ret = min(1, 1 - init_Eu)
        ret = min(ret, init_Eu)


        # generate the migration matrix and test for possible issues
        func = pp #specify the model
        mig = func(params) #get the migration matrix
        # calculate the migration rate per generation
        totmig = mig.sum(axis=1)

        # first generation migration must sum up to 1
        ret = min(ret, -abs(totmig[-1] - 1) + 1e-8)
        # no migrations are allowed in the first two generations
        ret = min(ret, -totmig[0], -totmig[1])

        # migration at any given generation cannot be greater than 1
        ret = min(ret, 10 * min(1 - totmig), 10 * min(totmig))

        # start time must be at least two generations ago
        ret = min(ret, tstart - .02)

        return ret
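
To illustrate how a constraint function with this sign convention can be used, here is a minimal sketch of one possible approach: wrapping an objective so that out-of-bounds parameters receive a large penalty. This is not necessarily how the optimizers in tracts apply the constraint. The wrapper guarded, the penalty value and the two toy functions are all assumptions made for the example.

```python
def guarded(objective, bound_fun, penalty=1e32):
    """Wrap `objective` so out-of-bounds parameters get a large penalty.

    `bound_fun` follows the convention above: a negative return value
    means that at least one constraint is violated.
    """
    def wrapped(params):
        if bound_fun(params) < 0:
            return penalty
        return objective(params)
    return wrapped


# Toy stand-ins (hypothetical): a proportion in [0, 1] and a start time
# of at least 0.02, i.e. two generations after scaling by 100.
def toy_bounds(params):
    prop, tstart = params
    return min(prop, 1 - prop, tstart - 0.02)


def toy_objective(params):
    # a quantity to be minimized, smallest at prop = 0.3, tstart = 0.1
    prop, tstart = params
    return (prop - 0.3) ** 2 + (tstart - 0.1) ** 2


safe_objective = guarded(toy_objective, toy_bounds)
```

Because the constraint value grows more negative the further the parameters stray, it can also be passed directly to optimizers that accept inequality constraints instead of being used as a hard cutoff.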

The population is founded when two populations meet; at the first generation, we consider all individuals in the population as “migrants”, so the sum of migration frequencies at the first generation must be one. If it isn’t, tracts will complain.

Importantly, the optimizers in tracts assume that all parameters are continuous, but the underlying Markov model uses discrete generations. When a time falls between two integers, the migrants are distributed across the neighboring integers in such a way that the migration matrix changes “continuously”, in the sense that the expected number of migrants at each generation varies smoothly with the time parameter. Continuous change is important, because likelihood optimizers can really struggle if the model is discontinuous in parameter space.
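
The sketch below makes this concrete for the single-pulse model. The helper pulse_matrix is a hypothetical, condensed copy of the logic in pp, with the time given directly in generations rather than scaled by 100. It prints the total migration in the two oldest generations for several fractional times.

```python
import numpy


def pulse_matrix(init_eu, t):
    """Single-pulse migration matrix for a (possibly fractional) time t.

    Mirrors the logic of pp above, with t already in generations.
    """
    gen = int(numpy.ceil(t)) + 1
    frac = gen - t - 1
    mig = numpy.zeros((gen + 1, 2))
    props = numpy.array([init_eu, 1 - init_eu])
    mig[-1, :] = props          # founding generation: everyone is a migrant
    mig[-2, :] = frac * props   # partial replacement one generation later
    return mig


def founding_weights(t):
    """Return (index of oldest generation, its total migration,
    total migration one generation later)."""
    totals = pulse_matrix(0.4, t).sum(axis=1)
    return len(totals) - 1, totals[-1], totals[-2]


for t in (5.0, 5.25, 5.75, 6.0):
    print(t, founding_weights(t))
```

As t increases from just above 5 to 6, the oldest generation stays at index 7 while the replacement fraction one generation later shrinks from nearly 1 to 0. A near-complete replacement at generation 6 behaves like founding the population at generation 6, which is the matrix returned at t = 5 exactly, so there is no jump at integer times.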

Contact

See the example files for example usage. If something isn't clear, please let me know by filing a "new issue", or emailing me.

FAQ

The distribution of tract lengths decreases as a function of tract length, but increases at the very last bin. This was not seen in the original paper. What is going on?

In tracts, the last bin represents the number of chromosomes with no ancestry switches. It does not correspond to a specific length value, and for this reason was not plotted in the tracts paper.

When I have a single pulse of admixture, I would expect an exponential distribution of tract length, but the distribution of tract lengths shows steps in the expected length. Why is that?

"Tracts" takes into account the finite length of chromosomes. Since ancestry tracts cannot extend beyond chromosomes, we expect this departure from an exponential distribution.

I have migrants from the last generation. "tracts" tells me that migrants in the last two generations are not allowed. Why is that?

Haploid genomes from the last two generations have no ancestry switches and should be easy to identify in well-phased data--they should be removed from the sample before running tracts. If this is impossible (e.g., because of inaccurate phasing across chromosomes), tracts will likely attempt to assign last-generation migrants to two generations ago. This should be observable by an excess of very long tracts in the data compared to the model.
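
A simple way to flag such haploid genomes before running tracts is sketched below. It assumes the ancestry labels of all segments of one haploid genome (across all chromosomes) have already been collected into a list, and that segments labelled UNKNOWN should be ignored. The function names are hypothetical and not part of tracts.

```python
def has_no_switch(assignments, unknown="UNKNOWN"):
    """True if a haploid genome carries a single known ancestry.

    `assignments` holds the ancestry label of every segment of one
    haploid genome; segments labelled `unknown` are ignored.
    """
    known = set(label for label in assignments if label != unknown)
    return len(known) == 1


def split_sample(haplotypes):
    """Split {name: labels} into (kept, flagged) lists of names."""
    kept, flagged = [], []
    for name in sorted(haplotypes):
        if has_no_switch(haplotypes[name]):
            flagged.append(name)
        else:
            kept.append(name)
    return kept, flagged


sample = {
    "ind1_A": ["UNKNOWN", "YRI", "UNKNOWN", "CEU"],
    "ind1_B": ["UNKNOWN", "CEU", "CEU"],
}
kept, flagged = split_sample(sample)
```

Flagged haplotypes should still be inspected by hand, since a single-ancestry genome can also result from local ancestry calls that missed short tracts.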

Individuals in my population vary considerably in their ancestry proportion. Is that a problem?

It is not a problem as long as the population was close to random mating. If admixture is recent, random mating is not inconsistent with ancestry variance. If admixture is ancient, however, variation in ancestry proportion may indicate population structure, and the random mating assumption may fail.

I ran the optimization steps many times, and found different optimal likelihoods. Why is that?

Optimizing functions in many dimensions is hard, and sometimes optimizers get stuck in local maxima. If you haven't tried already, you can attempt to fix the ancestry proportions a priori (see the _fix examples in the documentation). In most cases, the optimization will converge to the global maximum a substantial proportion of the time: running the optimization a few times from random starting positions and comparing the best values may help control for this.

If you fail to revisit the same optimum after running, say, 10 optimizations, then something else might be going on. If the model is not continuous as a function of a parameter, the optimization can become much harder. Defining a continuous model would help, or you could try the brute-force optimization method if the number of parameters is small.
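
The restart strategy described above can be sketched generically as follows. The fit and draw_start callables are placeholders for whatever optimization call and starting-point sampler your driver file uses; they are assumptions, not tracts functions. The toy fit only imitates an optimizer that can land on one of two local maxima.

```python
import random


def best_of_restarts(fit, draw_start, n_starts=10, seed=0):
    """Run `fit` from several random starts, best likelihood first.

    `fit(start)` must return (params, likelihood); `draw_start(rng)`
    returns a random starting point. Both are user-supplied.
    """
    rng = random.Random(seed)
    results = [fit(draw_start(rng)) for _ in range(n_starts)]
    results.sort(key=lambda item: item[1], reverse=True)
    return results


def n_agreeing(results, tol=1e-3):
    """Count runs whose likelihood is within `tol` of the best one."""
    best = results[0][1]
    return sum(1 for _, lik in results if best - lik <= tol)


# Toy stand-in: starts below 0 end at a worse local maximum.
def toy_fit(start):
    if start < 0:
        return (-1.0, -5.0)
    return (2.0, -3.0)


def toy_start(rng):
    return rng.uniform(-3, 3)


results = best_of_restarts(toy_fit, toy_start)
agreeing = n_agreeing(results)
```

If only one run in ten reaches the best likelihood, that is a hint to run more restarts or to check the model for discontinuities before trusting the result.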

Contributors

sgravel, tsani