pixy

pixy is a command-line tool for painlessly and correctly estimating average nucleotide diversity within (π) and between (dxy) populations from a VCF. In particular, pixy facilitates the use of VCFs containing invariant (AKA monomorphic) sites, which are essential for the correct computation of π and dxy in the face of missing data (i.e. always).

pixy is currently in active development and has not undergone peer review. Our manuscript describing pixy can be found here: https://www.biorxiv.org/content/10.1101/2020.06.27.175091v1

Authors

Kieran Samuk and Katharine Korunes

Duke University

Citation

If you use pixy, please cite our preprint here: https://www.biorxiv.org/content/10.1101/2020.06.27.175091v1

Documentation

https://pixy.readthedocs.io/en/latest/

Installation

pixy is currently available for installation on Linux/OSX systems via conda-forge. To install pixy using conda, you will first need to add conda-forge as a channel (if you haven't already):

conda config --add channels conda-forge

Then install pixy:

conda install pixy

You can test your pixy installation by running:

pixy --help

For information on installing conda, see here:

anaconda (more features and initial modules): https://docs.anaconda.com/anaconda/install/

miniconda (lighter weight): https://docs.conda.io/en/latest/miniconda.html

Background

Population geneticists are often interested in quantifying nucleotide diversity within and nucleotide differences between populations. The two most common summary statistics for these quantities were described by Nei and Li (1979), who discuss summarizing variation in the case of two populations (denoted 'x' and 'y'):

  • π - Average nucleotide diversity within populations, also sometimes denoted πx and πy to indicate the population of interest.
  • dxy - Average nucleotide difference between populations, also sometimes denoted πxy (pixy, get it?), to indicate that the statistic is a comparison between populations x and y.

Both of these statistics use the same basic formula:

$$\hat{\pi} = \frac{n}{n-1} \sum_{ij} x_i x_j \pi_{ij}$$

xi and xj are the respective frequencies of the ith and jth sequences, πij is the number of nucleotide differences per nucleotide site between the ith and jth sequences, and n is the number of sequences in the sample. (Source: Wikipedia)

In the case of π, all comparisons are made between sequences from the same population, whereas for dxy all comparisons are made between sequences from two different populations.
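
The distinction between the two statistics can be illustrated with a minimal Python sketch. It assumes fully observed, aligned haploid sequences of equal length, and it simply averages per-site differences over all pairs of sequences. The function names and the toy sequences are hypothetical, and this is not pixy's implementation.

```python
from itertools import combinations, product


def per_site_differences(seq_a, seq_b):
    """Proportion of sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def pi_within(seqs):
    """Average per-site difference over all pairs within one population."""
    pairs = list(combinations(seqs, 2))
    return sum(per_site_differences(a, b) for a, b in pairs) / len(pairs)


def dxy_between(seqs_x, seqs_y):
    """Average per-site difference over all pairs with one sequence from each population."""
    pairs = list(product(seqs_x, seqs_y))
    return sum(per_site_differences(a, b) for a, b in pairs) / len(pairs)


# Toy data (hypothetical): three sequences from population x, two from y.
pop_x = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
pop_y = ["ACGAACGT", "ACGAACGA"]

pi_x = pi_within(pop_x)
pi_y = pi_within(pop_y)
dxy = dxy_between(pop_x, pop_y)
print(pi_x, pi_y, dxy)
```

The only difference between the two functions is which pairs are enumerated: pairs within a single population for π, and pairs spanning both populations for dxy.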

The problem

In order to be comparable across samples/studies/genomic regions, π and dxy are generally standardized by dividing by the total number of sites in the sequences being compared. As such, one must explicitly keep track of variable vs. missing vs. monomorphic (invariant) sites. Failure to do this results in biased estimates of π and dxy. Prior to the genomic era, missing/invariant sites were almost always explicitly included in datasets because sequence data was in FASTA format (e.g. Sanger reads). However, most modern genomics tools encode variants as VCFs, which by design often omit invariant sites. With variants-only VCFs, there is often no way to distinguish missing sites from invariant sites. Further, when one does include invariant sites in a VCF, it generally results in very large files that are difficult to manipulate with standard tools.
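
A small worked example may make the bias concrete. The Python sketch below uses a hypothetical six-site region with four haploid samples, where `None` marks a missing genotype. It compares an estimate that counts differences and comparisons only among called genotypes with a naive variants-only estimate that divides by the full region length, which implicitly treats missing sites as invariant. All names and data are illustrative assumptions, not pixy's code.

```python
from itertools import combinations


def count_diffs_and_comps(site):
    """Return (differences, comparisons) among called genotypes at one site."""
    called = [allele for allele in site if allele is not None]
    pairs = list(combinations(called, 2))
    diffs = sum(a != b for a, b in pairs)
    return diffs, len(pairs)


def pi_all_sites(sites):
    """Total differences divided by total comparisons actually made."""
    counts = [count_diffs_and_comps(site) for site in sites]
    total_diffs = sum(d for d, _ in counts)
    total_comps = sum(c for _, c in counts)
    return total_diffs / total_comps


def pi_variants_only(sites, region_length):
    """Naive estimate: sum per-site pi at variant sites, divide by region length."""
    variant_sites = [
        site for site in sites
        if len({allele for allele in site if allele is not None}) > 1
    ]
    per_site = [d / c for d, c in map(count_diffs_and_comps, variant_sites)]
    return sum(per_site) / region_length


# Hypothetical region: rows are sites, columns are haploid samples.
all_sites = [
    ["A", "A", "A", "A"],        # invariant, fully called
    ["C", "T", "C", "T"],        # variant, fully called
    [None, None, None, None],    # entirely missing
    [None, None, None, None],    # entirely missing
    ["G", "G", None, "G"],       # invariant, one genotype missing
    ["T", "T", "A", None],       # variant, one genotype missing
]

pi_correct = pi_all_sites(all_sites)
pi_naive = pi_variants_only(all_sites, region_length=len(all_sites))
print(pi_correct, pi_naive)
```

In this toy region the naive estimate is lower than the all-sites estimate, because the two entirely missing sites are counted in the denominator as if they had been observed and found invariant.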

The solution

pixy provides the following solutions to problems inherent in computing π and dxy from a VCF:

  1. Fast and efficient handling of invariant sites VCFs via conversion to on-disk chunked databases (Zarr format).
  2. Standardized individual-level filtration of variant and invariant genotypes.
  3. Computation of π and dxy for arbitrary numbers of populations.
  4. Computation of all statistics in arbitrarily sized windows, with output containing all raw information for all computations (e.g. numerators and denominators).

The majority of this is made possible by extensive use of the existing data structures and functions found in the brilliant Python library scikit-allel.
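
To illustrate the idea of windowed statistics that keep their raw numerators and denominators, here is a minimal Python sketch. It assumes per-site counts of differences and comparisons have already been computed, and it sums them within fixed-size windows. The function name, the input layout, and the output field names are all hypothetical; this is not pixy's implementation or its output format.

```python
from collections import defaultdict


def windowed_pi(site_counts, window_size):
    """Aggregate (position, differences, comparisons) tuples into windows.

    Positions are assumed to be 1-based. Each output row keeps the raw
    numerator and denominator alongside the resulting estimate.
    """
    totals = defaultdict(lambda: [0, 0])
    for position, diffs, comps in site_counts:
        window_index = (position - 1) // window_size
        totals[window_index][0] += diffs
        totals[window_index][1] += comps

    rows = []
    for window_index in sorted(totals):
        diffs, comps = totals[window_index]
        rows.append({
            "window_start": window_index * window_size + 1,
            "window_end": (window_index + 1) * window_size,
            "count_diffs": diffs,
            "count_comparisons": comps,
            # No estimate is reported when no comparisons were possible.
            "pi": diffs / comps if comps else None,
        })
    return rows


# Hypothetical per-site counts: (position, differences, comparisons).
site_counts = [(1, 0, 6), (2, 4, 6), (3, 0, 0), (4, 0, 0), (5, 0, 3), (6, 2, 3)]
rows = windowed_pi(site_counts, window_size=3)
for row in rows:
    print(row)
```

Keeping the numerator and denominator in each row means windows can later be merged by summing the two counts and dividing, rather than by averaging per-window estimates that rest on different amounts of data.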
