mfumagalli / ngspopgen Goto Github PK

View Code? Open in Web Editor NEW

26.0 7.0 7.0 94 KB

Population genetics analyses from NGS data

License: GNU General Public License v3.0

Makefile 0.45% C++ 71.16% R 24.54% Perl 3.85%

ngspopgen's Introduction

ngsPopGen

Several tools to perform population genetic analyses from NGS data:

ngsFst - Quantificate population genetic differentiation
ngsCovar - Population structure via PCA (principal components analysis)
ngs2dSFS - Estimate 2D-SFS from posterior probabilities of sample allele frequencies
ngsStat - Estimates number of segregating sites, expected average heterozygosity, and number of fixed differences and Dxy (if 2 populations provided).

IMPORTANT NOTE i):

In all analises involving 2 populations, input data must refer to the exact same sites. If they differ (e.g. because of different filtering), you must first get the overlapping subset of sites for both populations. To achieve this, you can follow instructions given in the tutorial (here).

IMPORTANT NOTE ii):

The use of folded data (spectrum or sample allele frequencies probabilities) is no longer supported. In case you do not have a reliable ancestral information, please use your reference sequence to polarise your data and follow all the steps as documented here. However, do not attempt to make any inference based on the resulting unfolded reference/non-reference based - site frequency spectrum.

IMPORTANT NOTE iii):

It may be practical to perform a non-stringent SNP calling before running the following analyses, in order to reduce the computational time and data dimensions. Moreover, this will reduce noise due to monomorphic sites, especially when the species' polymorphic rate is very low.

Installation

To install the entire package just download the source code:

% git clone https://github.com/mfumagalli/ngsPopGen.git

and run:

% cd ngsPopGen
% make

To run the tests (only if installed through ngsTools):

% make test

Executables are built into the main directory. If you wish to clean all binaries and intermediate files:

% make clean

However, we recommend to download and install the whole ngsTools package.

ngsFST

Program to estimate FST from NGS data. It computes expected genetic variance components and estimate per-site FST from those using methods-of-moments estimator. See Fumagalli et al. Genetics 2013 for more details. In input it receives sample allele frequencies likelihoods for each population and a 2D-SFS as a prior.

The output is a tab-separated text file. Each row represents a site. Columns are ordered as: A, AB, f, FST, Pvar; where A is the expectation of genetic variance between populations, AB is the expectation of the total genetic variance, f is the correcting factor for the ratio of expectations, FST is the per-site FST value, Pvar is the probability for the site of being variable.

Usage

Examples:

using a 2D-SFS as a prior, estimated using ngs2dSFS:

% ./ngsFST -postfiles pop1.saf pop2.saf -priorfile spectrum2D.txt -nind 20 20 -nsites 100000 -outfile pops.fst -verbose 0

Parameters:

-postfiles FILE_1 FILE_2: files with sample allele frequencies likelihoods for each population
-priorfile FILE: 2D-SFS to be used as a prior; you can use ngs2dSfs with parameter -relative set to 1
-nind INT: number of individuals for each population
-nsites INT: total number of sites; in case of a site subset this is the upper limit
-firstbase INT: in case of a site subset, this is the lower limit
-outfile FILE: name of the output file
-block_size INT: number of sites in each chunk (for memory reasons, increase it if you can use more RAM)
-verbose INT: level of verbosity

ngsCovar

Program to compute the expected correlation matrix between individuals from genotype posterior probabilities. It receives as input genotype posterior probabilities. It can receive in input also posterior probabilities of sample allele frequencies for computing the probability of each site to be variant.

Usage

Examples:

not calling genotypes but with SNP calling (preferred way under most circumstances):

% ./ngsCovar -probfile pop.geno -outfile pop.covar -nind 40 -nsites 100000 -block_size 20000 -call 0 -minmaf 0.05

not calling genotypes and no SNP calling (weighting by each site's probability of being variable; recommended if depth is extremely low):

% gunzip -f pop.saf.gz
% ./ngsCovar -probfile pop.geno -outfile pop.covar -nind 40 -nsites 100000 -block_size 20000 -call 0 -norm 0 -sfsfile pop.saf

calling genotypes (this is kept for compatibility but should not be used unless you have high-depth data, > 20X):

% ./ngsCovar -probfile pop.geno -outfile pop.covar -nind 40 -nsites 100000 -block_size 20000 -call 1

Parameters:

-probfile FILE: file with genotype posterior probabilities
-sfsfile FILE: file with per site allele frequency posterior probabilities
-nind INT: number of individuals
-nsites INT: total number of sites; in case of a site subset this is the upper limit
-offset INT: in case of a site subset, this is the lower limit
-genoquality FILE: text file with 'nsites' lines stating whether to use (1) or ignore (0) the site
-norm INT: normalization procedure; either "0" for no normalization (recommended if no SNP calling was performed), "1" for normalization by p(1-p) (Patterson et al, 2006) or "2" for normalization by 2p(1-p)
-minmaf FLOAT: ignore sites below this threshold of minor allele frequency
-call: call genotypes based on the maximum posterior probability
-outfile FILE: name of output file
-block_size INT: number of sites in each chunk (for memory reasons)
-verbose INT: level of verbosity

ngs2dSFS

Program to estimate 2D-SFS from posterior probabilities of sample allele frequencies. Output file reports the occurrence of sites at distinct joint allele frequencies. This spectrum is a (2N1+1)x(2N2+2) matrix with N1 and N2 number of individuals at the two populations. Please note that cells are zero-based ordered. As an example, value reported in the cell [4,3] represents the frequency of sites with allele frequency 3 and 2 at population 1 and 2 respectively.

Example:

% ./ngs2dSFS -postfiles pop1.saf pop2.saf -outfile spectrum.txt -relative 1 -nind 20 20 -nsites 100000

Parameters:

-postfiles FILE: file with sample allele frequency posterior probabilities (or likelihoods) for each population
-nind INT: number of individuals
-nsites INT: total number of sites; in case of a site subset this is the upper limit
-offset INT: in case of a site subset, this is the lower limit
-outfile FILE: name of output file
-maxlike INT: how to compute the MLE; either, "1" (preferred) to sum across sites' joint allele frequency, or "0" to sum of the products of likelihoods
-relative INT: whether input are absolute counts of sites with a specific joint allele frequency (0) or relative frequencies (1)
-block_size INT: number of sites for each chunk (for memory efficiency only)

ANGSD can compute a ML estimate of the 2D-SFS which should be preferred when many sites are available. However, ANGSD output file should be transformed (from log to un-log and from space-separated to tab-separated) before being used in ngsFST.

ngsStat

Program to compute estimates of the number of segregating sites, the expected average heterozygosity, and the number of fixed differences (if 2 populations data is provided). It receives as input sample allele frequency posterior probabilities (from ANGSD) from 1 or 2 populations. Output is a text file with columns: start, end, segregating sites (pop 1), heterozygosity (pop 1), segregating sites (pop 2), heterozygosity (pop 2), fixed differences, dxy.

Example:

2 populations, sliding windows of 100 sites (latter recommended only if no missing data is present, however in most cases you will have some missing sites so this command should not be used):

./ngsStat -npop 2 -postfiles pop1.saf pop2.saf -nsites 1000 -iswin 1 -nind 10 10 -outfile pops.stat -verbose 0 -block_size 100

1 populations, sliding windows of 100 sites (latter recommended only if no missing data is present, so again in many cases this should not be used):

./ngsStat -npop 1 -postfiles pop1.saf -nsites 1000 -iswin 1 -nind 10 -outfile pops.stat -block_size 100

1 population, values estimated at each site (recommended in case of missing data, and then the computation of values in sliding windows will be performed using the R script provided):

./ngsStat -npop 1 -postfiles pop1.saf -nsites 1000 -iswin 0 -nind 10 -outfile pops.stat

Parameters:

-npop INT: number of populations (should be the first one to be specified)
-postfiles: file with sample allele frequency posterior probabilities for each population
-nind INT: number of individuals
-nsites INT: total number of sites; in case of a site subset this is the upper limit
-firstbase INT: in case of a site subset, this is the lower limit
-iswin INT: if set to 1, chuncks are considered non-overlapping sliding-windows
-outfile FILE: name of output file
-block_size INT: number of sites in each chunk (for memory reasons)
-verbose INT: level of verbosity

Further examples can be found in the Tutorial.

ngspopgen's People

Contributors

Stargazers

Watchers

Forkers

harvardinformatics jiangchb idiv-biodiversity robertdigital sh1ne111 callithrix-omics

ngspopgen's Issues

Can't install on Mac

I'm trying to install ngsPopGen on a Mac, running OS X 10.11.4 and I get the following error:

agro200953237:ngsPopGen hoffmanp$ make
g++ -lm -lz -O3 -Wall ngs2dSFS.cpp -o ngs2dSFS
In file included from ngs2dSFS.cpp:10:
In file included from ./ngs2dSFS.hpp:1:
./shared.hpp:41:1: error: unknown type name 'array'
array<int> getStart(int nsites, int firstbase, int block_size) {
^
./shared.hpp:41:6: error: expected unqualified-id
array<int> getStart(int nsites, int firstbase, int block_size) {
     ^
./shared.hpp:65:1: error: unknown type name 'array'
array<int> getEnd(int nsites, int firstbase, int block_size) {
^
./shared.hpp:65:6: error: expected unqualified-id
array<int> getEnd(int nsites, int firstbase, int block_size) {
     ^
./shared.hpp:147:1: error: unknown type name 'array'
array<double> readArray(const char *fname, int nInd) {
^
./shared.hpp:147:6: error: expected unqualified-id
array<double> readArray(const char *fname, int nInd) {
     ^
./shared.hpp:185:3: error: reference to 'array' is ambiguous
  array<double> allvalues;
  ^
./shared.hpp:18:8: note: candidate found by name lookup is 'array'
struct array{
       ^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/__tuple:116:65: note: 
      candidate found by name lookup is 'std::__1::array'
template <class _Tp, size_t _Size> struct _LIBCPP_TYPE_VIS_ONLY array;
                                                                ^
In file included from ngs2dSFS.cpp:10:
In file included from ./ngs2dSFS.hpp:1:
./shared.hpp:186:3: error: use of undeclared identifier 'allvalues'
  allvalues.x = nrow*ncol;
  ^
./shared.hpp:187:3: error: use of undeclared identifier 'allvalues'
  allvalues.data = tmp;
  ^
./shared.hpp:193:15: error: use of undeclared identifier 'allvalues'
      tmp2[k]=allvalues.data[index];
              ^
./shared.hpp:202:13: error: use of undeclared identifier 'allvalues'
  delete [] allvalues.data;
            ^
./shared.hpp:270:1: error: unknown type name 'array'
array<int> readGenoQuality(const char *fname, int nsites) {
^
./shared.hpp:270:6: error: expected unqualified-id
array<int> readGenoQuality(const char *fname, int nsites) {
     ^
./shared.hpp:333:35: error: unknown type name 'array'
void getPvar(matrix<double> &sfs, array<double> pvar) {
                                  ^
./shared.hpp:333:40: error: expected ')'
void getPvar(matrix<double> &sfs, array<double> pvar) {
                                       ^
./shared.hpp:333:13: note: to match this '('
void getPvar(matrix<double> &sfs, array<double> pvar) {
            ^
./shared.hpp:335:5: error: use of undeclared identifier 'pvar'
    pvar.data[i]=1-sfs.data[i][0]-sfs.data[i][2*(sfs.y-1)];
    ^
./shared.hpp:339:6: error: variable has incomplete type 'void'
void writearray(array<double> &m,FILE *fp) {
     ^
./shared.hpp:339:17: error: reference to 'array' is ambiguous
void writearray(array<double> &m,FILE *fp) {
                ^
./shared.hpp:18:8: note: candidate found by name lookup is 'array'
struct array{
       ^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/__tuple:116:65: note: 
      candidate found by name lookup is 'std::__1::array'
template <class _Tp, size_t _Size> struct _LIBCPP_TYPE_VIS_ONLY array;
                                                                ^
In file included from ngs2dSFS.cpp:10:
In file included from ./ngs2dSFS.hpp:1:
./shared.hpp:339:29: error: expected '(' for function-style cast or type
      construction
void writearray(array<double> &m,FILE *fp) {
                      ~~~~~~^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.
make: *** [ngs2dSFS] Error 1

I can, however, install on Linux. Is there a dependency that I'm missing?

ngs2dSFS and ngsFst giving "Possible error reading SFS, binary file might be broken..." even after using old SAF

Hi Matteo,

I was trying ngsFst for getting some Fst estimates that was mentioned in this paper. My motivation trying this was I can't find any clean outlier from two populations that are supposed to be different when using other Fst estimators in ANGSD.

So what I did was this:

~/bin/ngsPopGen/ngsFST -postfiles wildPop.folded.oldsaf.gz domesticPop.folded.oldsaf.gz -priorfile wildPop.domePop.unfolded.2dsfs -outfile wildPop.domePop.ngsFST -nind 17 39 -nsites 100000 -verbose 1

The .oldsaf.gz files were coming from realSFS print -oldfile 1 as was said in the "Using NGStools/NGSpopgen" section of this page.
The .2dsfs file was coming from realSFS of the saf.idx of the 2 populations used there.
The wildPop has 17 diploid individuals and the domesticPop has 39 diploid individuals.

The message I get is this

        ->Dumping file: wildPop.domePop.ngsFST
        ->Using some of these args: -nind 56 -nind1 17 -nind2 39 -nsites 100000 -postfiles wildPop.folded.oldsaf.gz domesticPop.folded.oldsaf.gz -priorfiles (null) (null) -priorfile wildPop.domePop.unfolded.2dsfs -outfile wildPop.domePop.ngsFST -verbose 1 -firstbase 1 -block_size 20000

Reading 2D prior...
 num win 5 win0 is 0 19999
Block 0 out of 4 from 0 to 19999

        -> Possible error reading SFS, binary file might be broken...

I thought that was because the 2dSFS not right (unfolded 2DSFS made with ANGSD) so I change into folded 2DSFS made with ANGSD. The message I got was still the same.

Then I was trying to make a 2D SFS file using ngs2dsfs:

~/bin/ngsPopGen/ngs2dSFS -postfiles wildPop.folded.oldsaf.gz domesticPop.folded.oldsaf.gz -outfile wildPop.domePop.ngs2dSFS -nind 34 78 -nsites 100000 -relative 1

        -> Possible error reading SFS, binary file might be broken...

Still the same message.

Do you know how can I get ngsFst working?

ngsStat Floating point exception (core dumped)

Hello,
I'm trying to use ngsStat by doing:
./ngsStat -npop 1 -postfiles file.saf -nind 11 -outfile file.stat
where my .saf (out of angsd) looks like:

NC_007897.1     2       0.000000        -5.094189       -10.313600      -15.673835      -21.194906      -26.905237      -32.847622      -39.091530      -45.764591      -53.156807      -62.342842      -73.078079      -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf
NC_007897.1     3       0.000000        -5.217402       -10.582065      -16.106823      -21.804443      -27.695183      -33.814251      -40.223988      -47.043053      -54.543686      -63.845039      -75.157547      -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf
NC_007897.1     4       0.000000        -5.217465       -10.582220      -16.107111      -21.804922      -27.695948      -33.815476      -40.226013      -47.046764      -54.552528      -63.910259      -75.850700      -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf
NC_007897.1     5       0.000000        -5.217465       -10.582220      -16.107111      -21.804922      -27.695948      -33.815476      -40.226013      -47.046764      -54.552528      -63.910259      -75.850700      -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf    -inf

but I'm getting the error:

Floating point exception (core dumped)

Any ideas what could be wrong?

Covar and new ANGSD 2D SFS

I am having trouble getting ngsCovar to play nice with the 2D SFS from the recent version of ANGSD.

As mentioned, I am trying to use a 2D SFS from the new ANGSD output because it is quite simple to generate. (It does output all values in a single line, but I converted it to matrix of (2N1+1)x(2N2+2) as noted in the ngs2dSFS documentation.) The consistently produced error is -> Possible error reading SFS, binary file might be broken....

Granted the file may be incorrect because I notice that in the ngsCovar documentation it states the SFS file contains sample allele frequency posterior probabilities while the ANGSD 2D SFS documentation states its output ANGSD contains the expected values. Additionally, the file is plain text.

Any help or suggestions would be greatly appreciated and please let me know if you need additional information.

Error reading SFS file in ngsStat

Hi,

I would like to calculate expected heterozygosity for my populations using .saf files and ngsStat, but I keep getting the error: "Possible error reading SFS, binary file might be broken".

I have followed the ngsTutorial in creating the .sfs and .saf files required, and haven't had this issue in previous analyses. I've seen online that someone else having this problem had to use realSFS to generate the SFS files, but I have been using that from the beginning.

The command where the error occurs (from the ngsTutorial):
$NGSTOOLS/ngsPopGen/ngsStat -npop 1 -postfiles Results/PEL.saf -nsites $N_SITES -nind 20 -outfile Results/PEL.stats.txt

Unfortunately, my saf.gz file is too big to be added on here. I have included my SFS file in text format, if that is any help.
MIS01.sfs.txt

I hope this can be solved!
Diede

Order of samples in covariance matrix

Greetings,

I've got a probably obvious question that I can't find an answer to after some digging. When I calculate the genotype posterior probabilities with ANGSD and feed them into ngsCovar, I get a matrix of numbers where the number of columns and rows equals the number of samples. Am I right to assume that column/row #1 corresponds to the first bam file in the list of bam files that I input into ANGSD, column/row #2 corresponds to the second bam file, etc...?

Thanks so much,
Evan

getDxy.pl comment

If I am not mistaken, according to the file input from maf,

if(($parts[6]>=$minInd)&&($parts2[6]>=$minInd))
should read

if(($parts[7]>=$minInd)&&($parts2[7]>=$minInd))

since nInd is column 8,

mfumagalli / ngspopgen Goto Github PK

ngspopgen's Introduction

ngsPopGen

Installation

ngsFST

Usage

Examples:

Parameters:

ngsCovar

Usage

Examples:

Parameters:

ngs2dSFS

Example:

Parameters:

ngsStat

Example:

Parameters:

ngspopgen's People

Contributors

Stargazers

Watchers

Forkers

ngspopgen's Issues

Recommend Projects

Recommend Topics

Recommend Org