`pixy`

pixy is a command-line tool for painlessly and correctly estimating average nucleotide diversity within (π) and between (d_xy) populations from a VCF. In particular, pixy facilitates the use of VCFs containing invariant (AKA monomorphic) sites, which are essential for the correct computation of π and d_xy in the face of missing data (i.e. always).

pixy is currently in active development and has not undergone peer review. Our manuscript describing pixy can be found here: https://www.biorxiv.org/content/10.1101/2020.06.27.175091v1

Authors

Kieran Samuk and Katharine Korunes

Duke University

Citation

If you use pixy, please cite our preprint here: https://www.biorxiv.org/content/10.1101/2020.06.27.175091v1

Documentation

https://pixy.readthedocs.io/en/latest/

Installation

pixy is currently available for installation on Linux/OSX systems via conda-forge. To install pixy using conda, you will first need to add conda-forge as a channel (if you haven't already):

conda config --add channels conda-forge

Then install pixy:

conda install pixy

You can test your pixy installation by running:

pixy --help

For information on installing conda, see here:

anaconda (more features and initial modules): https://docs.anaconda.com/anaconda/install/

miniconda (lighter weight): https://docs.conda.io/en/latest/miniconda.html

Background

Population geneticists are often interested in quantifying nucleotide diversity within populations and nucleotide differences between populations. The two most common summary statistics for these quantities were described by Nei and Li (1979), who discuss summarizing variation in the case of two populations (denoted 'x' and 'y'):

π - Average nucleotide diversity within populations, also sometimes denoted π_x and π_y to indicate the population of interest.
d_xy - Average nucleotide difference between populations, also sometimes denoted π_xy (pixy, get it?), to indicate that the statistic is a comparison between populations x and y.

Both of these statistics use the same basic formula:

$$\pi = \sum_{ij} x_i x_j \pi_{ij} = 2 \sum_{i=2}^{n} \sum_{j=1}^{i-1} x_i x_j \pi_{ij}$$

Here x_i and x_j are the respective frequencies of the i-th and j-th sequences, π_ij is the number of nucleotide differences per nucleotide site between the i-th and j-th sequences, and n is the number of sequences in the sample. (Source: Wikipedia)

In the case of π, all comparisons are made between sequences from the same population, whereas for d_xy all comparisons are made between sequences from two different populations.
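The distinction between the two statistics can be sketched with a toy example. This is a minimal illustration on short, aligned, haploid sequences with no missing data, not pixy's implementation; the function names and the sequences are made up for the example. It averages over distinct pairs of sequences, a common sample estimator that differs from the frequency-weighted sum by a factor of n/(n-1).

```python
from itertools import combinations, product


def per_site_differences(seq_a, seq_b):
    """Proportion of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def pi_within(seqs):
    """Mean per-site difference over all pairs drawn from ONE population."""
    pairs = list(combinations(seqs, 2))
    return sum(per_site_differences(a, b) for a, b in pairs) / len(pairs)


def dxy_between(seqs_x, seqs_y):
    """Mean per-site difference over all pairs spanning TWO populations."""
    pairs = list(product(seqs_x, seqs_y))
    return sum(per_site_differences(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: three sequences from population x, two from y.
pop_x = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
pop_y = ["ACGAACGT", "ACGAACGT"]

pi_x = pi_within(pop_x)          # comparisons within x only
d_xy = dxy_between(pop_x, pop_y)  # comparisons between x and y only

print(f"pi_x = {pi_x:.4f}, d_xy = {d_xy:.4f}")
```

The only difference between the two functions is which pairs of sequences are compared: `combinations` within one population for π, `product` across two populations for d_xy.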

The problem

In order to be comparable across samples/studies/genomic regions, π and d_xy are generally standardized by dividing by the total number of sites in the sequences being compared. As such, one must explicitly keep track of variable vs. missing vs. monomorphic (invariant) sites. Failure to do this results in biased estimates of π and d_xy. Prior to the genomic era, missing/invariant sites were almost always explicitly included in datasets because sequence data was in FASTA format (e.g. Sanger reads). However, most modern genomics tools encode variants as VCFs, which by design often omit invariant sites. With variants-only VCFs, there is often no way to distinguish missing sites from invariant sites. Further, when one does include invariant sites in a VCF, it generally results in very large files that are difficult to manipulate with standard tools.
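A small worked example may help show where the bias comes from. The sketch below is an illustration under simplifying assumptions (haploid samples, `None` marking a missing call, made-up genotypes), not a description of pixy's internals. It compares an estimate that counts only the comparisons actually possible at each site with a naive estimate that sees only variable sites and divides by the length of the region, which silently treats every missing site as invariant.

```python
from itertools import combinations


def site_counts(calls):
    """Return (differences, comparisons) among non-missing calls at one site."""
    observed = [c for c in calls if c is not None]
    pairs = list(combinations(observed, 2))
    diffs = sum(a != b for a, b in pairs)
    return diffs, len(pairs)


def is_variable(calls):
    """True if more than one allele is observed at the site."""
    return len({c for c in calls if c is not None}) > 1


# Hypothetical region: 6 sites x 4 haploid samples, None = missing call.
sites = [
    ["A", "A", "A", "A"],        # invariant, fully genotyped
    ["C", "T", "C", "T"],        # variable
    [None, None, None, None],    # entirely missing
    ["G", "G", None, None],      # invariant, partly missing
    ["T", "T", "T", "T"],        # invariant, fully genotyped
    [None, None, None, None],    # entirely missing
]

# All-sites estimate: total differences over total comparisons made.
total_diffs = 0
total_comps = 0
for calls in sites:
    d, c = site_counts(calls)
    total_diffs += d
    total_comps += c
pi_all_sites = total_diffs / total_comps

# Naive variants-only estimate: sum per-site pi over variable sites,
# then divide by region length as if unlisted sites were all invariant.
variant_sites = [calls for calls in sites if is_variable(calls)]
naive_sum = sum(d / c for d, c in map(site_counts, variant_sites))
pi_naive = naive_sum / len(sites)

print(f"all-sites pi = {pi_all_sites:.4f}, naive pi = {pi_naive:.4f}")
```

In this toy region the naive estimate is roughly half the all-sites estimate, because the two fully missing sites and the partly missing site are counted in the denominator as though they had been observed to be invariant.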

The solution

pixy provides the following solutions to problems inherent in computing π and d_xy from a VCF:

Fast and efficient handling of invariant-sites VCFs via conversion to on-disk chunked databases (Zarr format).
Standardized individual-level filtration of variant and invariant genotypes.
Computation of π and d_xy for arbitrary numbers of populations.
Computation of all statistics in arbitrarily sized windows, with output that contains all raw information for all computations (e.g. numerators and denominators).

The majority of this is made possible by extensive use of the existing data structures and functions found in the brilliant Python library scikit-allel.
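To make the idea of windowed output with raw numerators and denominators concrete, here is a rough sketch in plain Python. It does not use scikit-allel or Zarr and does not reproduce pixy's code or output format; the function `windowed_pi`, the record layout, and the column names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def windowed_pi(site_records, window_size):
    """Aggregate per-site counts into fixed-size windows.

    site_records: iterable of (position, differences, comparisons),
    with 1-based positions. Returns one row per window that contains
    at least one record, keeping the raw numerator and denominator.
    """
    totals = defaultdict(lambda: [0, 0])
    for pos, diffs, comps in site_records:
        start = ((pos - 1) // window_size) * window_size + 1
        totals[start][0] += diffs
        totals[start][1] += comps

    rows = []
    for start in sorted(totals):
        diffs, comps = totals[start]
        rows.append({
            "window_start": start,
            "window_end": start + window_size - 1,
            "diffs": diffs,          # numerator
            "comparisons": comps,    # denominator
            # Undefined when no comparisons were possible in the window.
            "pi": diffs / comps if comps else None,
        })
    return rows


# Hypothetical per-site counts for a 6 bp region (position, diffs, comps).
records = [(1, 0, 6), (2, 4, 6), (3, 0, 0), (4, 0, 1), (5, 0, 6), (6, 0, 0)]

rows = windowed_pi(records, window_size=3)
for row in rows:
    print(row)
```

Keeping the numerator and denominator in each row means windows can later be merged by summing the two columns and dividing, rather than by averaging per-window ratios.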
