

Estimation of per-individual inbreeding coefficients under a probabilistic framework

License: Other

Languages: C 64.86%, C++ 29.21%, Makefile 3.64%, Shell 2.28%
Topics: population-genomics, inbreeding, ngs, genotype-likelihoods

ngsF's Introduction

ngsF

ngsF is a program to estimate per-individual inbreeding coefficients under a probabilistic framework that takes the uncertainty of genotype assignment into account. It avoids calling genotypes by working directly on genotype likelihoods or posterior probabilities.

Citation

ngsF was published in 2013 in Genome Research, so please cite it if you use it in your work:

Vieira FG, Fumagalli M, Albrechtsen A, Nielsen R
Estimating inbreeding coefficients from NGS data: Impact on genotype calling and allele frequency estimation.
Genome Research (2013) 23: 1852-1861

Installation

ngsF can be easily installed but has some external dependencies:

  • Mandatory:
    • gcc >= 4.9.2 (tested on Debian 7.8 "wheezy")
    • zlib v1.2.7 (tested on Debian 7.8 "wheezy")
    • gsl v1.15 (tested on Debian 7.8 "wheezy")
  • Optional (only needed for testing or auxiliary scripts):
    • md5sum

To install the entire package just download the source code:

% git clone https://github.com/fgvieira/ngsF.git

and run:

% cd ngsF
% make

To run the tests (only if installed through ngsTools):

% make test

Executables are built into the main directory. If you wish to clean all binaries and intermediate files:

% make clean

Usage

% ./ngsF [options] --n_ind INT --n_sites INT --glf glf/in/file --out output/file

Parameters

  • --glf FILE: Input GL file.
  • --init_values CHAR or FILE: Initial values of individual F and site frequency. Can be (r)andom, (e)stimated from data assuming a uniform prior, (u)niform at 0.01, or read from a FILE.
  • --calc_LRT: Estimate MAFs and calculate the likelihood assuming F=0 (H0, the null hypothesis) for a Likelihood Ratio Test (LRT). If parameters from a previous run (H1, the alternative hypothesis) are provided through --init_values, ngsF checks whether the estimates of F differ significantly from 0 via an LRT, assuming a chi-square distribution with one degree of freedom (see the example after this list).
  • --freq_fixed: Assume the initial MAFs are fixed parameters (only estimate F).
  • --out FILE: Output file name.
  • --n_ind INT: Sample size (number of individuals).
  • --n_sites INT: Total number of sites.
  • --chunk_size INT: Size of each analysis chunk. [100000]
  • --approx_EM: Use the faster approximated EM ML algorithm
  • --call_geno: Call genotypes before running analyses.
  • --max_iters INT: Maximum number of EM iterations. [1500]
  • --min_iters INT: Minimum number of EM iterations. [10]
  • --min_epsilon FLOAT: Maximum RMSD between iterations to assume convergence. [1e-5]
  • --n_threads INT: Number of threads to use. [1]
  • --seed: Set seed for random number generator.
  • --quick: Quick run.
  • --verbose INT: Selects verbosity level. [1]
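
A typical workflow (a minimal sketch; the GL file name, sample size, and site count below are placeholders) is to get starting values from a fast approximate run, seed the full run with the resulting .pars file and, optionally, test F=0 with an LRT:

% ./ngsF --glf data.glf --n_ind 20 --n_sites 100000 --init_values r --approx_EM --out approx.indF
% ./ngsF --glf data.glf --n_ind 20 --n_sites 100000 --init_values approx.indF.pars --out final.indF
% ./ngsF --glf data.glf --n_ind 20 --n_sites 100000 --init_values final.indF.pars --calc_LRT --out final.LRT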

Input data

As input, ngsF needs a Genotype Likelihood (GL) file, formatted as 3*n_ind*n_sites doubles in binary. It can be uncompressed [default] or in BGZIP format; if "-" is given, an uncompressed stream is read from STDIN. Currently, all sites in the file must be variable, so a prior SNP-calling step is needed. This file is the output of ANGSD when the option -doGlf 3 is specified; this is the recommended way but, if you ran ANGSD with the option -doGlf 2 and have a beagle.gz file, you can try to convert it to GLF. Keep in mind, however, that this conversion is something of a "hack" and the output file might not be completely correct:

% zcat INPUT.beagle.gz | tail -n +2 | perl -an -e 'for($i=3; $i<=$#F; $i++){print(pack("d",($F[$i]==0 ? -inf : log($F[$i]))))}' > OUTPUT.glf
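
When using ANGSD with -doGlf 3 directly, the compressed GL file can simply be streamed into ngsF. A minimal sketch (the BAM list bam.filelist and the ANGSD output prefix angsdput are placeholders) for deriving --n_ind and --n_sites:

% N_IND=$(wc -l < bam.filelist)
% N_SITES=$(zcat angsdput.mafs.gz | tail -n +2 | wc -l)    # skip the .mafs.gz header line
% zcat angsdput.glf.gz | ./ngsF --glf - --n_ind $N_IND --n_sites $N_SITES --out angsdput.indF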

Output files

ngsF prints out two (or three) output files: the main output file (specified with option --out), the parameters file (same name plus the suffix .pars), and, if --calc_LRT and --init_values have been specified, the LRT file (same name plus the suffix .lrt). The main output file is a text file with the per-individual inbreeding coefficients, one per line. The parameters file is a binary file with the final parameters, namely the global log-likelihood (1 double), per-individual log-likelihoods (N_IND doubles), per-individual inbreeding coefficients (N_IND doubles), and per-site minor allele frequencies (N_SITES doubles); this is the same format as the --init_values input file. The LRT file is a text file with the global and per-individual likelihoods for H1 (alternative hypothesis; 1st column), H0 (null hypothesis; 2nd column), and the p-value for rejection of H0 (following a chi-square distribution with 1 degree of freedom; 3rd column).
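
If you need the .pars values in text form (e.g. for plotting), they can be unpacked following the binary layout described above. A minimal sketch using Perl (output.pars and the individual count are placeholders for your own run):

% N_IND=20    # number of individuals used in the run
% perl -e '
    open(P, "<:raw", "output.pars") or die $!;
    read(P, $buf, 8);                # global log-likelihood
    printf("global loglkl: %f\n", unpack("d", $buf));
    read(P, $buf, 8 * $ARGV[0]);     # per-individual log-likelihoods (skipped here)
    read(P, $buf, 8 * $ARGV[0]);     # per-individual inbreeding coefficients
    print join("\n", unpack("d*", $buf)), "\n";
' $N_IND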

Stopping Criteria

A key issue with iterative algorithms is the stopping criterion. ngsF implements a dual-condition threshold: the relative difference in log-likelihood and the RMSD of the estimates (F and freq) between iterations. As for which threshold to use, simulations show that 1e-5 is a reasonable value. However, if you are dealing with low-coverage data (2x-3x), it might be worth using stricter thresholds (between 1e-6 and 1e-9).

Debug

Some available options are intended for debugging purposes only and should not be used in any real analysis!

  • --verbose: verbosity levels above 4.
  • --quick: Only computes initial "freq" and "indF" values with no EM optimization.

Hints

  • Dataset: as a rule of thumb, use at least 1000 high-confidence, independent SNP sites.

  • Low coverage data: since the initial estimates are not reliable, it is recommended to use random starting points and stricter stopping criteria (e.g. --init_values r --min_epsilon 1e-9).

  • High coverage data: although F is not really useful in the prior, lower initial values seem to perform better (--init_values u).

  • Memory usage: by default, ngsF loads the entire file into memory. However, if the file is too big and not enough memory is available, ngsF can also load chunks as they are needed. This is implemented with the BGZF library (from the SAMtools package), which allows fast random access to BGZIP-compressed files through an internal virtual index. The library can only deal with BGZIP files, but a binary to compress them is provided. If you want to use this library, just add -D_USE_BGZF to the FLAGS in the Makefile (see the sketch below).
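
As a sketch of that setup (file names and counts are placeholders; the bgzip binary is built under the htslib/ subdirectory by make), after recompiling ngsF with -D_USE_BGZF:

% ./htslib/bgzip data.glf          # compresses data.glf into data.glf.gz (BGZF format)
% ./ngsF --glf data.glf.gz --n_ind 20 --n_sites 100000 --out data.indF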

Contact

For questions on the usage of ngsF please check the tutorial or contact Dr Filipe G. Vieira.

ngsF's People

Contributors

fgvieira


ngsF's Issues

ReferenceID err

Dear ANGSD group,

I tried to run ANGSD to estimate a bunch of things, however, it stopped at some point with the error message as follows:
ReferenceID:(r:0) for read:CL100033396L1C017R046_454501 is not: 254 but is:253

and my command for running is:
-P 1 -b list.bam -trim 3 -out angsd -gl 1 -minQ 20 -minMapQ 30 -uniqueOnly 1 -only_proper_pairs 1 -setMaxDepthInd 100 -doMajorMinor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.01 -doIBS 1 -doCounts 1 -makeMatrix 1 -doCov 1 -doGlf 2 -doDepth 1 -doGeno 3 -dopost 1

Allow for non-variable sites

Currently ngsF only works when all sites are variable (SNPs must be called first). One approach would be to use a weighted average, but it only seems to work with extreme weights (pvar^200).

ngsF leads to unexpected inbreeding estimate of zero

Thank you very much for providing this package! I just applied it to an inbred species using the parameters that are recommended for low-coverage data ("--init_values r -min_epsilon 1e-9") since my individuals show quite a spread of coverage (between ~ 3 and 20). For a few individuals, I receive an estimated inbreeding coefficient of zero, which is not expected according to what we know about the species. May I ask if you have an idea about why this might be the case? Thank you very much for your help!

Calculate individual loglkl

Currently, the individual loglkl is disabled since, for the sake of speed, only the fast_lkl is calculated.
The correct individual loglkl should be implemented.

Problem with gsl during installation

Hello, I have a problem with gsl as a dependency of ngsF.
I work with conda 4.12.0 in a dedicated environment.
The installation of ngsF warned that the gsl.pc and gsl/gsl_rng.h files are not found:
In the ngsF directory:

make
make -C htslib bgzip
make[1]: Entering directory '/mnt/shared/home/usr/ngsF/htslib'
cc -c -Wall -O3  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_USE_KNETFILE -D_CURSES_LIB=1 -I. bgzip.c -o bgzip.o
cc -c -Wall -O3  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_USE_KNETFILE -D_CURSES_LIB=1 -I. bgzf.c -o bgzf.o
bgzf.c: In function ‘bgzf_close’:
bgzf.c:630:8: warning: variable ‘count’ set but not used [-Wunused-but-set-variable]
    int count, block_length = deflate_block(fp, 0);
        ^~~~~
cc -c -Wall -O3  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_USE_KNETFILE -D_CURSES_LIB=1 -I. knetfile.c -o knetfile.o
cc -Wall -O3  -o bgzip bgzf.o bgzip.o knetfile.o -lz
make[1]: Leaving directory '/mnt/shared/home/usr/ngsF/htslib'
Package gsl was not found in the pkg-config search path.
Perhaps you should add the directory containing `gsl.pc'
to the PKG_CONFIG_PATH environment variable
No package 'gsl' found
g++ -O3 -Wall -I -I/mnt/shared/home/usr/ngsF/htslib  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_USE_KNETFILE  -c parse_args.cpp
Package gsl was not found in the pkg-config search path.
Perhaps you should add the directory containing `gsl.pc'
to the PKG_CONFIG_PATH environment variable
No package 'gsl' found
g++ -O3 -Wall -I -I/mnt/shared/home/usr/ngsF/htslib  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_USE_KNETFILE  -c read_data.cpp
read_data.cpp:1:10: fatal error: gsl/gsl_rng.h: No such file or directory
 #include <gsl/gsl_rng.h>
          ^~~~~~~~~~~~~~~

Therefore I have re-installed gsl with conda, and the installed version is 2.28.
conda install -c conda-forge gsl

I have also checked for the presence of gsl/gsl_rng.h:
/mnt/shared/scratch/usr/apps/conda/pkgs/gsl-2.7-he838d99_0/include/gsl/gsl_rng.h

But the same issue happens again.
Have you seen this issue before?

Supplementary figures 7, 8 and 9

Dear @fgvieira,

I have run ngsF and obtained as output a text file with per-individual inbreeding values and a .pars binary file. I would like to know how you processed those files to obtain a figure similar to supplementary figures 7, 8 and 9 of the paper (https://academic.oup.com/bioinformatics/article/32/14/2096/1743296). I have read that you only used ggplot2, but I am not sure how to process the binary file to get a table with the values.

Thanks a lot,

Sincerely,

Marc

Number of runs to avoid convergence to local maxima

Hello Dr. Vieira,

I have read the tutorial and it says that ngsF should be run several times to avoid convergence to local maxima. My question is how many runs should be performed to avoid this. Should the seed argument be different for each run? (my guess is that no)

Additionally, after the certain number of runs are done, should we obtain the mean value for inbreeding for each sample?

For example, if I run ngsF (with the same parameters and seed) 10 times, I will have 10 output files with the different values for inbreeding coefficients. Then, I would have to put together the inbreeding values for each sample and get the mean value per sample, right?

Thank you in advance!

construction of init_values input file

Hi Felipe,
I was wondering if you could tell me how to construct the input file for --init_values?
I would like to use --fixed_freq and a list of MAF values that I previously estimated from a subset of my data. The total dataset contains a lot of close relatives which I think would skew the MAF estimates.
Thanks a lot,
-Lauren

How to enable reading gz output from angsd --glf 3

I generated my genotype likelihood file using the following flags in ANGSD,

# generating the genotype likelihood files
"angsd -b "$BAMS" \
-ref "$REFERENCE_FASTA" \
-doMajorMinor 4 -doMaf 1 -doCounts 1 \
- "$chr" -P 10 \
-skipTriallelic 1 \
-minMapQ 25 -minQ 25 -remove_bads 1 \
-GL 1 -doGlf 3 \
-setMinDepth 1 -SNP_pval 1e-6 \
-out "$out"

but I get the following error when attempting to run my input through ngsF. Reading the front page of the github repository indicates that uncompressed is the default but compressed files are still accepted. Is there a flag I am missing from my ngsF script? The manual seems to indicate it can take both compressed and uncompressed GLF files.

# error
'[main] ERROR: Standard library only supports UNCOMPRESSED GLF files!

# ngsF
~/bigdata/ngsF/ngsF -glf out.glf.gz --n_ind "$number_indvs" --n_sites "$sites\
-init_values r --min_epsilon 1e-9 --n_threads 15 --out "$out"

When should I set --Freq_fixed=TRUE in calculating inbreeding Coefficient?

Hi, Filipe,

I am using ngsF to estimate the inbreeding coefficient for a population of giraffes. I am not sure about the parameter "--freq_fixed". The README states "assume initial MAF as fixed parameters (only estimates F)". However, the examples in the ngsTools tutorials do not mention this parameter in either the initial values estimate (with --approx_EM) or the final run.

I tried running ngsF with and without setting "--freq_fixed=TRUE". The results are quite different. I would like to learn more about this parameter.

Number of sites in .mafs.gz different from angsd

Hi, is the number of sites in the .mafs.gz file supposed to be different from the number in the ANGSD log file? For one of my populations I get a different value when counting with wc -l on .mafs.gz. For all other populations, the number of sites in the ANGSD log file and in .mafs.gz are the same.

Suggestions for dealing with batch effects

Dear Dr. Viera,

I'm a new ngsF user, and I just ran the program on my samples. I found a very stark pattern in the resulting inbreeding coefficients:

0.000000
0.000000
0.034296
0.000000
0.000000
0.000000
0.475336
0.913737
0.620628
0.623424
0.397937
0.649332
0.731328
0.487009
0.480450
0.689157
0.706471
0.610795
0.577949
0.664756
0.868305
0.516195
0.503299

These data are ddRAD-seq from a plant species subject to considerable amounts of selfing, so I expect fairly high inbreeding coefficients. Notably, the first six samples all have very low F, and these six samples were all sequenced in a separate run from the remaining samples and also have lower coverage (~ 5-10x) than the rest of the samples (~15-20x).

I am guessing that this batch effect is somehow responsible for the lower F scores in the first six samples. Does this seem plausible? If so, I figure that I should separate these first six samples and analyze them separately. Does that seem like a reasonable approach?
Thanks!
Dave

Merge glf.gz from different chromosome

Hi @fgvieira

I have run -doGlf 3 for each chromosome using ANGSD. In order to estimate the inbreeding coefficient, I think I should merge the binary glf.gz files from each chromosome together, but I do not know which software can do this.

Best
Zhuqing

cannot read glf

Hi Felipe - I am sure it is something silly, but I am out of ideas... For some reason I cannot get this to work:

NIND=$(cat bams.nr | wc -l)
NS=$(zcat g3.mafs.gz | wc -l)
zcat g3.glf.gz | ngsF --glf - --n_ind $NIND --n_sites $NS --out inbr

I'm getting "ERROR: cannot read GLF file!"

I made sure to make g3.glf.gz with -doGlf 3. What am I doing wrong?
I have the most recent angsd, version 0.933-25-g5955d69 (htslib: 1.9-44-g80f3557)

cheers
Misha

Difference estimation ANGSD / ngsF

Hi!

I tried to estimate the inbreeding coefficient for different populations using both ANGSD -doHWE option and ngsF. I use the exact same command to generate the hwe.gz and the glf.gz files (including -SNP_pval to keep only sites that are variable in my pop). I average over the different sites for the ANGSD results, and over the different individuals for the ngsF results.
And I get different results... Here an example for a few populations:

ANGSD doHWE ngsF
0.020 0.047
0.029 0.050
0.028 0.053
0.018 0.054
0.026 0.079
0.028 0.067
0.002 0.044

The difference is maybe not so large (the ngsF values are about twice as large, but still very close to 0), but how can we explain it? Is one method more reliable than the other in this case?

Thanks for your comments!

example of angsd --> ngsF ?

Dear Felipe - does ngsF take the same input GLF file as ngsLD?
For some reason I cannot seem to find the right combination of ANGSD's -doGeno and/or -doGlf to produce a file that ngsF would accept...
many thanks in advance
Misha

using custom glf instead of angsd output

Hi Filipe

When extracting GL tags from a VCF (called with freebayes) and converting them to GLF, I am encountering some problems. ngsF does not want to process the file, and I am wondering why. I know it's not described in the README and it's sort of file hacking, but I don't see why it would be wrong. Maybe you can help.

Here my workflow

bcftools query -f '[%GL,]\n' my.vcf > my.gl
sed -i 's/\.,/0,0,0,/g' my.gl        # replace missing GL with 0,0,0,
sed -i 's/,/\t/g' my.gl                 # make tab sep
cat my.gl | perl -an -e 'for($i=1; $i<=$#F; $i++){print(pack("d",$F[$i]))}' > my.glf
N_IND=$(awk '{print NF/3; exit}' my.gl)
N_SITES=$(cat my.gl | wc -l)
cat my.glf | $NGSF --n_ind $N_IND --n_sites $N_SITES  --glf - --min_epsilon 0.001 \
       --out my.approx_indF --n_threads 1 --verbose 2 --approx_EM --seed 12345 --init_values r

output

 ==> Input Arguments:
    glf file: -
    init_values: r
    calc_LRT: false
    freq_fixed: false
    out file: fb_run5_all_nn_bi_mj_sn_mm.gl.0s.t.glf.approx_indF
    n_ind: 339
    n_sites: 489999
    chunk_size: 100000
    approx_EM: true
    call_geno: false
    max_iters: 1500
    min_iters: 10
    min_epsilon: 0.0010000000
    n_threads: 1
    seed: 12345
    quick: false
    verbose: 2
    version: 1.2.0-STD (Feb 23 2021 @ 12:18:32)

 ==> Analysis will be run in 5 chunk(s)
 ==> Using native I/O library

 [main] ERROR: cannot read GLF file!

 [main] ERROR: cannot read GLF file!
    : No such file or directory
