illumina / akt Goto Github PK

Ancestry and Kinship Tools

License: GNU General Public License v3.0

CMake 0.24% C++ 77.52% C 17.67% Makefile 0.59% Shell 0.52% M4 0.36% R 0.05% Python 2.83% Roff 0.22% Objective-C 0.01%

akt's Introduction

akt - ancestry and kinship toolkit

Ancestry and Kinship Tools (AKT) provides a handful of useful statistical genetics routines using the htslib API for input/output. This means it can seamlessly read BCF/VCF files and play nicely with bcftools.

Please cite the AKT paper if you find this software useful.

License and dependencies

AKT is freely available under the GPL3 license.

AKT relies on HTSlib and Eigen. Eigen is a header-only library for matrix algebra released under the MPL2 license. HTSlib is a library for efficently parsing vcf/bcf files released under the MIT/Expat License. Both libraries are included with AKT.

Installation instructions

git clone https://github.com/Illumina/akt.git
cd akt/
make

If you get a warning about 'omp.h not found' (especially on OSX) you can try

make no_omp

Everything will be run on a single thread, so the -n option does nothing in akt kin and akt ibd.

Quick start

akt uses the syntax

./akt COMMAND <options>

To see a list of available functionality use:

Program:        akt (Ancestry and Kinship Tools)
Version:        v0.3.9-33-g2f4bfea
Copyright (c) 2017, Illumina, Inc. All rights reserved. See LICENSE for further details.

Usage:  akt <command> [options]

        pca                      principal component analysis
        kin                      calculate kinship coefficients
        relatives                discover pedigrees
        unrelated                generate a list of unrelated individuals
        pedphase                 Mendelian transmission phasing for duos/trios

Some useful routine analyses (assumes you are in the akt directory):

#Do a PCA:
./akt pca -R data/wgs.grch37.vcf.gz input.bcf > pca.txt
Rscript scripts/pca.R pca.txt

#Project samples onto 1000G principal components
./akt pca -W data/wgs.grch37.vcf.gz input.bcf > 1000G.pca.txt
Rscript scripts/1000G_pca.R 1000G.pca.txt

#Calculate some (robust) kinship coefficients
./akt kin -R data/wgs.grch37.vcf.gz -M 1 input.bcf > kinship.txt

#give me duplicated sample pairs:
awk '{if($6>.4) print $0}' kinship.txt

#reconstruct pedigrees:
akt relatives kinship.txt -p pedigree

#give me a list of unrelated samples
akt unrelated kinship.txt > unrelated.ids

Full documentation is available at http://illumina.github.io/akt.

akt's People

Contributors

Stargazers

Watchers

Forkers

ramabit vivekgopalan haoziyeung jaredo jxshi skyclub3 bpow wook2014 yyxql snijesh matty234 dridk xiaosheng8361

akt's Issues

Segmentation fault error

I am running PCA for 1kGP merged SV vcf files and I am getting the seg fault error when I run the PBS script with 150G memory. The run command is in the last line. It worked fine when I extracted a single chromosome (chr21) from the VCF and run the same script.

Input: bcftool_merge.sv_chr1_22_maf_ge0.05.annotated.vcf.gz
MAF lower bound: 0
Thin: 1
Number principle components: 20
Reading data...
1793 samples
Kept 2028657 markers out of 2028657
/var/spool/torque/mom_priv/jobs/3377732.sug-moab.SC: line 12: 11059 Segmentation fault      /hgsc_software/akt/akt-0.3.3/bin/akt pca ${vcf}.vcf.gz --force -Oz -o ${vcf}.pca.cf.gz > ${vcf}.txt

documentation on changes to wgs.grch37.vcf.gz site vcf

I'm looking at the history of the "reliable" sites (on many arrays, in linkage equilibrium, reasonable MAF, etc): https://github.com/Illumina/akt/commits/master/data/wgs.grch37.vcf.gz

At 87906b0 this site vcf had 17535 SNPs. At 81be1c5 it has 17491. But I couldn't find any information about what/why this changed. Can you offer a brief update on the remapping, what got dropped? Thank you!

update docs

-R data/1000G.snps.nochr.vcf.gz

this file doesn't exist any more and apparently -F is now to be used instead of -R; so the documentation should be updated.

explanation of basic resource requirements?

Hello Jaredo:

Is there a discuss available of resource requirements (memory and threads)?

I see the threads are referred to as compression associated (presumably IO).

But how might one think about the chromosome and the need to chunk it?

I wonder if the constraints are discuss somewhere?

Cheers,
Chuck

Bring back prune?

In earlier versions of akt, the "prune" subcommand provided a nice way to prune/thin variants on the basis of linkage disequilibrium. However, the command has been deprecated and is no longer available in akt.

My understanding is that you favour using a pre-pruned set of human variants that you provide in a VCF. However, this has several limitations: most profoundly, it's of course completely inapplicable to non-human organisms - people do work in plants, yeast, dogs, cattle, flies, basically anything with DNA! But even for human data on the right reference build, the set of provided sites will come with some complex ascertainment bias, might not be optimal when working in more diverse populations with ancestry poorly represented in the 1000 Genomes dataset, and might not overlap well with all present or future genotyping array sites. The "prune" command also provided the additional flexibility of choosing the LD cutoff. There might also be additional use cases for LD pruning other than just obtaining a suitable set of SNPs for genome-wide ancestry analyses.

I would therefore like to request bringing back "prune", which I think would make an already great software even greater. Thanks for your work on akt!

recent github release tarball?

Thanks for akt. I'm currently working on building an akt package for Bioconda based on the latest available release tarball from github (bioconda/bioconda-recipes#8661), which is v0.2.0.

I'd like to also eventually build an updated version, since I see there have been quite a few changes since the 2016 release of v0.2.0.

Would you be able to create a github release of akt for a more recent version so we can build a bioconda recipe for it? Thanks.

format for `--freq-file FILE`?

What's the format for the --freq-file FILE, a file containing population allele frequencies to use in kinship calculation

thanks!

Problem opening input file.

Hi there,

I'm having an issue with akt using the pca option.

It gives me the following error when it looks to open my input bcf file.

command string : $ /g/data/a32/Software/akt/akt pca /short/a32/exi569/GB_CHD_NDD/CHD_NDD_all_samples_SNPS_PCA_ID.bcf -R /g/data/a32/Software/akt/data/wes.hg19.vcf.gz -O b -o /short/a32/exi569/GB_CHD_NDD/pca.bcf > /short/a32/exi569/GB_CHD_NDD/pca.txt

Console reply :
Input: /short/a32/exi569/GB_CHD_NDD/CHD_NDD_all_samples_SNPS_PCA_ID.bcf
MAF lower bound: 0
Thin: 1
Number principle components: 20
Reading data...
Problem opening /short/a32/exi569/GB_CHD_NDD/CHD_NDD_all_samples_SNPS_PCA_ID.bcf
Input: /short/a32/exi569/GB_CHD_NDD/CHD_NDD_all_samples_SNPS_PCA_ID.bcf

I have no problem opening and viewing the "CHD_NDD_all_samples_SNPS_PCA_ID.bcf" input file using bcftools view.

[exi569@raijin3 GB_CHD_NDD]$ bcftools view CHD_NDD_all_samples_SNPS_PCA_ID.bcf | more ##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##source=VarScan2
##INFO=<ID=ADP,Number=1,Type=Integer,Description="Average per-sample depth of bases with Phred score >= 20">
##INFO=<ID=WT,Number=1,Type=Integer,Description="Number of samples called reference (wild-type)">
##INFO=<ID=HET,Number=1,Type=Integer,Description="Number of samples called heterozygous-variant">
##INFO=<ID=HOM,Number=1,Type=Integer,Description="Number of samples called homozygous-variant">
##INFO=<ID=NC,Number=1,Type=Integer,Description="Number of samples not called">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=SDP,Number=1,Type=Integer,Description="Raw Read Depth as reported by SAMtools">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Quality Read Depth of bases with Phred score >= 20">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=PVAL,Number=1,Type=String,Description="P-value from Fisher's Exact Test">
##FORMAT=<ID=RBQ,Number=1,Type=Integer,Description="Average quality of reference-supporting bases (qual1)">
##FORMAT=<ID=ABQ,Number=1,Type=Integer,Description="Average quality of variant-supporting bases (qual2)">
##FORMAT=<ID=RDF,Number=1,Type=Integer,Description="Depth of reference-supporting bases on forward strand (reads1plus)">
##FORMAT=<ID=RDR,Number=1,Type=Integer,Description="Depth of reference-supporting bases on reverse strand (reads1minus)">
##FORMAT=<ID=ADF,Number=1,Type=Integer,Description="Depth of variant-supporting bases on forward strand (reads2plus)">
##FORMAT=<ID=ADR,Number=1,Type=Integer,Description="Depth of variant-supporting bases on reverse strand (reads2minus)">
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##contig=<ID=chr4>
##contig=<ID=chr5>
##contig=<ID=chr6>
##contig=<ID=chr7>
##contig=<ID=chr8>
##contig=<ID=chr9>
##contig=<ID=chr10>
##contig=<ID=chr11>
##contig=<ID=chr12>
##contig=<ID=chr13>
##contig=<ID=chr14>
##contig=<ID=chr15>
##contig=<ID=chr16>
##contig=<ID=chr17>
##contig=<ID=chr18>
##contig=<ID=chr19>
##contig=<ID=chr20>
##contig=<ID=chr21>
##contig=<ID=chr22>
##contig=<ID=chrX>
##bcftools_mergeVersion=1.3.1+htslib-1.3.1
##bcftools_mergeCommand=merge -O z -o CHD_NDD_samples_SNPS.vcf.gz 02766-recalBAMS_SNPS.vcf.gz 00618-recalBAMS_SNPS.vcf.gz 00714-recalBAMS_SNPS.vcf.
gz 01152-recalBAMS_SNPS.vcf.gz 01299-recalBAMS_SNPS.vcf.gz 00773-recalBAMS_SNPS.vcf.gz 00928-recalBAMS_SNPS.vcf.gz 01023-recalBAMS_SNPS.vcf.gz 0185
0-recalBAMS_SNPS.vcf.gz 00626-recalBAMS_SNPS.vcf.gz 00734-recalBAMS_SNPS.vcf.gz 02503-recalBAMS_SNPS.vcf.gz 00720-recalBAMS_SNPS.vcf.gz 01330-recal
BAMS_SNPS.vcf.gz 00748-recalBAMS_SNPS.vcf.gz
##bcftools_mergeCommand=merge -O z -o CHD_NDD_all_samples_SNPS.vcf.gz CHD_NDD_samples_SNPS.vcf.gz CHD_samples_SNPS.vcf.gz CONTROL_samples_SNPS.vcf.
gz
##bcftools_viewVersion=1.3+htslib-1.3
##bcftools_viewCommand=view -o CHD_NDD_all_samples_SNPS.vcf CHD_NDD_all_samples_SNPS.vcf.gz
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewCommand=view -o CHD_NDD_all_samples_SNPS_chr_sort_PCA.vcf -S samples_for_CHD_PCA.csv CHD_NDD_all_samples_SNPS_chr_sort.vcf
##bcftools_viewCommand=view CHD_NDD_all_samples_SNPS_PCA_ID.bcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 02766 00618 00714 01152 01299 00773 00928 01023 01850 006
26 00734 02503 00720 01330 00748 00646 01506 01537 02261 01365 01388 01832 03062 01925 01162 00918 02237 018
33 00786 00768 01581 01584 01589 01697 01941 01953 01955 01957 01962 01963 01967 01970 01972 01974 01979
chr1 107683715 . C T . PASS ADP=542;WT=0;HET=1;HOM=0;NC=0;AC=2;AN=4 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RD
R:ADF:ADR ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.
:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.
:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.
:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.
:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.
:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. 0/1:255:551:542:269:273:50.37%:3.6098E-103:36:39:52
:217:73:200 ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.
:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.
:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. 0/1:255:719:707:367:340:48.09%:8.9707E-127:36:40:92:275:76:264 ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.
:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.
:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:.
chr1 107690692 . G A . PASS ADP=285;WT=0;HET=1;HOM=0;NC=0;AC=1;AN=2 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RD
R:ADF:ADR ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.
:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.
:.:.:.:.:. 0/1:255:293:285:139:146:51.23%:9.9479E-56:40:34:128:11:135:11 ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.
:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.
:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.
:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.
:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.
:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.
:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.
:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.
:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:.

I have also attached the input file and hope you can help me in resolving this file opening error.

Thanks.
Eddie
CHD_NDD_all_samples_SNPS_PCA_ID.zip

Bad genotypes at ...?

Hi,

I have run 'akt pca' with the following command line:
~/sources/akt/akt pca --force chr1_all_sample.bcf > pca.txt

But I get the following error:

Input: chr1_all_sample.bcf
MAF lower bound: 0
Thin: 1 
Number principle components: 20
Reading data...
12 samples
Bad genotypes at chr1:10004

chr1_all_sample.bcf has been indexed. The genotypes are the result of a 'bcftool call' analysis, restricted here to chr1.

Can you suggest a cause for the error?

Thank you,
Ian

ibd options

Hello
This document describes calling the ibd option.
The following error occurs, when I call ibd option:
alexander@statistics-dev:/mnt/fs/akt$ ./akt ibd input.bcf -R panel.bcf -r 2 > sharing
ERROR: ibd is deprecated
Exiting...
So, how can I run searching for IBD segments?

Problem opening input.bcf -- No debug log

I converted a vcf to bcf and tried running your tool with the following command:
./akt pca -W data/wgs.grch37.vcf.gz input.bcf
The only logs I get are:

Input: input.bcf
Using file data/wgs.grch37.vcf.gz for PCA weights
Problem opening input.bcf

Is see no debug option to make this message more verbose and figure out what the issue is, does a flag for verbose output exist ?

I don't believe the problem is permission related, here's the stats output for input.bcf:

  File: input.bcf
  Size: 557747163       Blocks: 1089360    IO Block: 4096   regular file
Device: fd01h/64769d    Inode: 10640983    Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  george)   Gid: ( 1000/  george)
Access: 2020-02-10 15:23:13.371180294 +0200
Modify: 2020-02-10 15:22:44.071013086 +0200
Change: 2020-02-10 15:22:44.071013086 +0200
 Birth:

Allow both -F and -r/-R to be used in `kin`

I'm not sure of the rationale for preventing the use of both the freq-file option and limitation to a region or set of regions. It seems like one might want to use analysis based on a file of population-reference frequencies, but also limit to certain regions (perhaps even just for performance reasons).

Furthermore, the fact that the current code defaults to using the frq_file as also being the regions file contributes to #29 (at least with some of my files). the bcf_sr code should be reasonably efficient even without specifying regions.

Error on suse11.1

Hi. Any ideas here ?

Thanks.

make
echo '#define VERSION "c966c25"' > version.h
echo '#define BCFTOOLS_VERSION "1.3"' >> version.h
g++ -O3 -fopenmp -mpopcnt -c tdt.cpp -Ihtslib-1.3 -I./
g++ -O3 -fopenmp -mpopcnt -c admix.cpp -Ihtslib-1.3 -I./
In file included from htslib-1.3/htslib/synced_bcf_reader.h:56,
from akt.hpp:44,
from admix.cpp:3:
htslib-1.3/htslib/vcf.h: In function ‘void bcf_format_gt(bcf_fmt_t*, int, kstring_t*)’:
htslib-1.3/htslib/vcf.h:817: error: ‘INT8_MIN’ was not declared in this scope
htslib-1.3/htslib/vcf.h:818: error: ‘INT16_MIN’ was not declared in this scope
htslib-1.3/htslib/vcf.h:819: error: ‘INT32_MIN’ was not declared in this scope
htslib-1.3/htslib/vcf.h: In function ‘int bcf_enc_inttype(long int)’:
htslib-1.3/htslib/vcf.h:848: error: ‘INT8_MAX’ was not declared in this scope
htslib-1.3/htslib/vcf.h:848: error: ‘INT8_MIN’ was not declared in this scope
htslib-1.3/htslib/vcf.h:849: error: ‘INT16_MAX’ was not declared in this scope
htslib-1.3/htslib/vcf.h:849: error: ‘INT16_MIN’ was not declared in this scope
htslib-1.3/htslib/vcf.h: In function ‘void bcf_enc_int1(kstring_t*, int32_t)’:
htslib-1.3/htslib/vcf.h:855: error: ‘INT32_MIN’ was not declared in this scope
htslib-1.3/htslib/vcf.h:857: error: ‘INT8_MIN’ was not declared in this scope
htslib-1.3/htslib/vcf.h:860: error: ‘INT8_MIN’ was not declared in this scope
htslib-1.3/htslib/vcf.h:861: error: ‘INT8_MAX’ was not declared in this scope
htslib-1.3/htslib/vcf.h:861: error: ‘INT8_MIN’ was not declared in this scope
htslib-1.3/htslib/vcf.h:864: error: ‘INT16_MAX’ was not declared in this scope
htslib-1.3/htslib/vcf.h:864: error: ‘INT16_MIN’ was not declared in this scope
make: *** [admix.o] Error 1

AKT PCA....Bad genotypes

Hi,

When I used akt pca, I got such message

MAF lower bound: 0
Thin: 1
Number principle components: 20
Reading data...
984 samples
Bad genotypes at 1:758351

Ca you explain to me what does "Bad genotype" mean in the message? How should I handle with this issue?

Thank you!

Kinship coefficient calculation inconsistent with missing data

Hi @jaredo -- I realize you haven't committed to this repo in a few years, but just hoping you could point me in the right direction for fixing a problem I think I've discovered with the kinship calculation here.

I have a VCF with two individuals, id1 and id2, (simulated parent-child). Each sample has a call rate of about 80%, so there are many spots where one sample is genotyped but the other isn't. When I run akt kin using the KING estimator with missing data present the kinship coefficient is underestimated compared to using akt kin when the missing SNPs are removed. In both examples akt correctly reports the actual number of genotypes compared (6238 out of the total 17,530).

The two-sample VCF with ~20% missing SNPs in each sample: sim20mis_all.vcf.gz

When I run akt kin using the KING estimator, I underestimate the kinship coefficient at .19:

akt kin -M1 --force sim20mis_all.vcf.gz

id1        id2      0.00000 1.00000 0.00000 0.19066 6238

I can remove all the missing genotypes from this VCF (bcftools view -g ^miss): sim20mis_nomiss.vcf.gz

And re-run:

akt kin -M1 --force sim20mis_nomiss.vcf.gz

Which gives me the correct kinship coefficient, .24, for parent-offspring:

id1        id2      0.00000 1.00000 0.00000 0.24072 6238

I'm trying to go through the code in kin.cpp and elsewhere but I'm unsure where to start. Thanks for any help you might be able to provide!

new kinship matrices

We should add options to akt kin to calculate the:

genetic relationship matrix
a robust kinship estimator such as king or REAP

akt pca error : Failed to read the regions: pca.bcf

Hi there,

Having an issue using the akt pca when projecting another set of samples into a previously created set of principal components.

I'm trying to follow the procedure given in the akt/doc/usage.md, where an initial sample is used to create a set of precalculated PC values, and then a new set of samples are projected on to it.

So my steps:

$ /g/data/a32/Software/akt/akt pca 1000G_match_CHD_NDD_chr.bcf -R /g/data/a32/Software/akt/data/wes.hg19.vcf.gz -O b -o pca.bcf > pca.txt
Input: 1000G_match_CHD_NDD_chr.bcf
MAF lower bound: 0
Thin: 1
Number principle components: 20
Reading data...
2504 samples
166/32136 of study markers were in the sites file
Printing coefficients to pca.bcf

$ /g/data/a32/Software/akt/akt pca CHD_NDD_all_samples_SNPS_PCA_ID.bcf -W pca.bcf > projections
Input: /short/a32/exi569/GB_CHD_NDD/CHD_NDD_all_samples_SNPS_PCA_ID.bcf
Using file pca.bcf for PCA weights
Failed to read the regions: pca.bcf

Hope I'm not doing something stupid.

Thanks for your assistance in advance.
Eddie

`--freq-file` not using all available SNPs present in both study and freq-file VCF

I'm having an issue supplying allele frequencies as answered in #28. Unfortunately I can't provide example data, but I'll try to provide as much as I can without supplying sensitive data.

background

I have a study VCF (bgzip+tabix indexed). I'm using your set of "reliable" SNPs that happen to be on my platform. Some stats on my study VCF:

$ bcftools stats study.vcf.gz | grep ^SN | head -n2
SN      0       number of samples:      65
SN      0       number of records:      17508

Some stats on my frequencies VCF (also bgzip'd and tabix indexed):

$ bcftools stats frequencies.vcf.gz | grep "^SN" | head -n2
SN      0       number of samples:      0
SN      0       number of records:      17160

Just to be sure, I'm using bcftools isec to check the number of SNPs intersecting between these files.

bcftools isec study.vcf.gz frequencies.vcf.gz --prefix isec

According to the readme produced with isec, I'm primarily interested in how many SNPs are in 0002 and 0003 - the SNPs shared between the files.

$ cat isec/README.txt
This file was produced by vcfisec.
The command line was:   bcftools isec  --prefix isec study.vcf.gz frequencies.vcf.gz

Using the following file names:
isec/0000.vcf   for records private to  study.vcf.gz
isec/0001.vcf   for records private to  frequencies.vcf.gz
isec/0002.vcf   for records from study.vcf.gz shared by both    study.vcf.gz frequencies.vcf.gz
isec/0003.vcf   for records from frequencies.vcf.gz shared by both      study.vcf.gz frequencies.vcf.gz

I have 16,137 overlapping.

$ bcftools stats isec/0002.vcf | grep "^SN" | head -n2
SN      0       number of samples:      65
SN      0       number of records:      16137

$ bcftools stats isec/0003.vcf | grep "^SN" | head -n2
SN      0       number of samples:      0
SN      0       number of records:      16137

problem demonstration

However, when I run akt kin supplying this VCF with frequency data, its telling me that it's only using 9 SNPs in total.

$ akt kin -M 1 --freq-file frequencies.vcf.gz study.vcf.gz > /dev/null
Taking allele frequencies from frequencies.vcf.gz using INFO/AF
WARNING: threading is disabled
WARNING: threading is disabled
WARNING: threading is disabled
Using 0 threads
WARNING: threading is disabled
65 samples
Reading genotypes...done.
Kept 9 markers out of 9 in panel.
9/9 of study markers were in the sites file
Calculating kinship values...done.

I'm using akt 0.3.2.

$ akt

Program:        akt (Ancestry and Kinship Tools)
Version:        0.3.2

I also tried converting everything to BCF:

bcftools view study.vcf.gz -Ob -o study.bcf
bcftools index study.bcf
bcftools view frequencies.vcf.gz -Ob -o frequencies.bcf
bcftools index frequencies.bcf

And running akt kin again, but this time, I run into a different error. Not sure what this is about.

$ akt kin -M 1 --freq-file frequencies.bcf study.bcf > /dev/null
Taking allele frequencies from frequencies.bcf using INFO/AF
WARNING: threading is disabled
WARNING: threading is disabled
WARNING: threading is disabled
Using 0 threads
WARNING: threading is disabled
[E::tbx_index_load2] Invalid index header for frequencies.bcf
[E::bcf_sr_regions_init] Could not parse the file frequencies.bcf, using the columns 1,2[,-1]
ERROR: Failed to read the regions: frequencies.bcf
Exiting...

I did ensure that the frequencies VCF and the study VCF are both sorted, but that didn't appear to be the issue. Any idea what could be happening here?

akt mendel error

Hello,
I am having an error with akt mendel when running the following command:

./akt mendel -p ~/akt/exomes_menonites/out.fam ~/akt/exomes_menonites/all_merged_exomes_mennonites_hg38_GRCh38.bcf > ~/akt/exomes_menonites/mendel.txt

ERROR: mendel command is deprecated. See the bcftools +mendelian plugin for equivalent functionality.
Exiting...

Is there any solution?

pca index assertion error

Any ideas on what could be causing this problem? I get this error message when I run on all 17 samples, 1 sample, filtered/unfiltered (e.g. for PASS, AC, AN).

./akt.v0.3.2/akt pca -R ./akt.v0.3.2/data/wgs.grch38_chr.vcf.gz $bcf
Input: filtered_4033.bcf
MAF lower bound: 0
Thin: 1
Number principle components: 20
Reading data...
1 samples
6002/17491 of study markers were in the sites file
akt: Eigen/src/Core/DenseCoeffsBase.h:425: Eigen::DenseCoeffsBase<Derived, 1>::Scalar& Eigen::DenseCoeffsBase<Derived, 1>::operator()(Eigen::Index) [with Derived = Eigen::Matrix<float, -1, 1>; Eigen::DenseCoeffsBase<Derived, 1>::Scalar = float; Eigen::Index = long int]: Assertion `index >= 0 && index < size()' failed.
Aborted (core dumped)

won't compile on OSX

clang does not have omp

clang: warning: argument unused during compilation: '-fopenmp'
In file included from admix.cpp:3:
./akt.hpp:20:10: fatal error: 'omp.h' file not found
#include <omp.h>

kin with unindexed bcf/vcf files

Hi!
In truth this isn't a bug report but it is instead a feature request (it's a small lift however).
It'd be very nice if we could estimate the kinship coefficient on a stream-- eg, we're using some whole genome simulators and it's much more convenient to keep a summary statistic than a bazillion whole genome simulations that need to be created, indexed, summarized and then deleted.

Setting line 420 in kin.cpp to:

sr->require_index = 0;

Appears to do the trick.
Much thanks.
-August