freeseek / score Goto Github PK
View Code? Open in Web Editor NEWTools to work with GWAS-VCF summary statistics files
License: MIT License
Tools to work with GWAS-VCF summary statistics files
License: MIT License
Could the sample overlap correction be added to the Metal plugin?
Several European research groups release their meta association results which include our collected samples. We would like to include their European samples in our meta analyses but those separate summary association results are not available to us. So the overlap correction would be useful in this situation.
From the Metal documentation: Sample Overlap Correction
Correction for sample overlap in sample size weighted meta-analysis (developed by Sebanti Sengupta and implemented by Daniel Taliun).
First, METAL estimates the number of individuals that are common among two or more studies based on Z-statistics from each study. Then, METAL adjusts for sample overlap when calculating overall Z-statistics by correcting the weights with the estimated number of individuals in common.
To enable correction for sample overlap in your sample size weighted meta-analysis, use OVERLAP ON command (valid only with SCHEME SAMPLESIZE). By default, METAL uses Z-statistics <1 for esimating the number of individuals that are common among studies. To change this threshold, use ZCUTOFF [number] command.
Thanks for the tool, it's extremely fast and seems to work well so far (currently evaluating it)
The output VCF appears to be written variant-by-variant from the source file
Sometimes, the relative order of variants can change. For instance:
GRCh38 input (correct order):
1 13115599 11730 G A . . .
1 13259273 12448 G A . . .
GRCh37 output:
1 13183071 11730 G A . . .
1 13112676 12448 C T . . FLIP
This produces a file that has
Warning: The file is not sorted, for example 1:13112676 comes after 1:13183071
Workaround
Don't use "-o" on liftover but instead pipe into bcftools sort then output to file
Run command:
bcftools +liftover -Oz -o testvcf_output.vcf.gz testvcf_input.vcf.gz -- -s source_fasta.fa -f reference.fa -c chainfile.chain.gz --reject testvcf.rejected.vcf.gz
Error #1:
INFO/AF is handled by AF rule
Number of elements in the VCF record AF should be 1 but 2 found
Use --drop-tags INFO/AF to skip this error
Which arrises in the case of needing to swap alleles. All files do have two values for the INFO/AF field , i.e. AF=0.5,0.3
Attempted fix:
bcftools +liftover -Oz -o testvcf_output.vcf.gz testvcf_input.vcf.gz -- -s source_fasta.fa -f reference.fa -c chainfile.chain.gz --reject testvcf.rejected.vcf.gz --drop-tags [INFO/AF]
Error#2:
No matching tag [INFO/AF]
As a check, generally querying that tag in the VCF does work :
bcftools query -f '%INFO/AF' testvcf_input.vcf.gz
Maybe there is some trick to identifying the drop-tags
? I have tried various inputs/string combinations for the drop-tags
option without any luck.
Is there any plan to get these plugins integrated into the defaults for bcftools so we do not have to separately compile with them?
Excellent collection of plugins, thank you for your work.
About the liftover
plugin which I started to use heavily in the last few days, unless my understanding of the meaning of GT field notation is flawed, I don't think that the GT fields should be "complemented" when the strand is flipped after conversion.
Here is an example of some genotype data with rs1287637 being converted from GRCh37/hg19 (original SNP array data) to hg38. The reference base for this SNV happens to change strands (so it is accordingly marked with SWAP=1 in the INFO field, which is great). The original record looks like this (only first 9 sample columns shown):
$ bcftools view -H gt_5M_gencalls_n208.bcf 1:5935162 | cut -f1-5,8,10-18
1 5935162 kgp1477084 A T . 1/1 0/1 1/1 0/1 1/1 1/1 1/1 0/1 1/1
After the bcftools +liftover to hg38:
$ bcftools view -H hg38_gt_5M.bcf chr1:5875102 | cut -f1-5,8,10-18
chr1 5875102 kgp1477084 T A SWAP=1 0/0 1/0 0/0 1/0 0/0 0/0 0/0 1/0 0/0
My understanding is that the notation 0/0 refers to the reference base, whatever that is (A in the old genome build, T in the new one).
If that reference base changes between genome builds, its meaning should stay the same, correct? i.e. the 0/0 genotype should still be left as 0/0, instead of becoming homozygous for the alt allele, which can alter AF calculations etc.
Liftover can fail for a variety of reasons...
Many of the liftover functions return -1 on any error, and the check is < 0
If you defined a range of negative constants, you could return specifically what went wrong, then look it up then add that to the INFO of rejected variants
For the munge -C colheaders.tsv option, what is the format of the colheaders.tsv file? The -c options does not include ourput from GENESIS (https://github.com/UW-GAC/GENESIS).
FYI, score does not build under clang 16, which is stricter about function pointer assignments by default:
clang 14:
plugins/munge.c:123:74: warning: incompatible function pointer types initializing 'const int (*)(tsv_t *, bcf1_t *, void )' (aka 'const int ()(struct _tsv_t *, struct bcf1_t *, void *)') with an expression of type 'int (tsv_t *, bcf1_t *, void *)' (aka 'int (struct _tsv_t *, struct bcf1_t *, void *)') [-Wincompatible-function-pointer-types]
static const int (*tsv_setters[])(tsv_t *tsv, bcf1_t *rec, void *usr) = {tsv_setter_id_flexible, // SNP
clang 16:
^~~~~~~~~~~~~~~~~~~~~~
plugins/munge.c:123:74: error: incompatible function pointer types initializing 'const int (*)(tsv_t *, bcf1_t *, void )' (aka 'const int ()(struct _tsv_t *, struct bcf1_t *, void *)') with an expression of type 'int (tsv_t *, bcf1_t *, void *)' (aka 'int (struct _tsv_t *, struct bcf1_t *, void *)') [-Wincompatible-function-pointer-types]
static const int (*tsv_setters[])(tsv_t *tsv, bcf1_t *rec, void *usr) = {tsv_setter_id_flexible, // SNP
^~~~~~~~~~~~~~~~~~~~~~
Hi @freeseek ,
Thanks a lot for the tools provided in this repo.
I have a question with regard to the munging plugin, it appears that indel variants are systematically flagged when the alleles are not trimmed (examples attached)
is this an expected behavior of the tool ?
Thanks
Example:
Variants in GRCh37
7 101200669 . A AC . PASS .
7 101200669 . A AG . PASS .
After liftover to GRCh38
chr7 101557388 . AG A,AC . PASS SWAP=-1
chr7 101557388 . AG A . PASS SWAP=1
It should be:
chr7 101557388 . A AC . PASS .
chr7 101557388 . A AG . PASS .
Command I run:
bcftools +liftover <grch37.vcf.gz> -- -s <GRCh37.ref.fa.gz> -f <GRCh38.ref.fa.gz> --reject rejected.vcf
Hello,
I am trying to run liftover on a moderate size (500K PacBio-derived SVs) to lift data from the macaque genome to GRCh37. I am running the command below, but it is reporting:
Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position 1:248821847
however, I think I am providing "-s" to the command. Is there a different argument I am missing, or is the tool expecting some other kind of input? Thanks for any ideas.
bcftools +liftover \
--no-version \
-Ou \
--threads 18 \
-o $VCF_LIFTED \
$VCF_NORM \
-- \
-s <MMul10_Genome_Fasta> \
-f <GRCH37_FASTA> \
-c $CHAIN_FILE \
--reject $UNMAPPED \
--reject-type z \
--write-src \
--fix-tags
Also, I would not have thought 500K SVs is that big a dataset, but this has been running for days (even with 18 threads), which seems rather extreme.
A single record out of range (chr17 contig size is 83,257,441):
17 143074440 45 G C . . .
Expected:
Actual:
Entire program exits with message:
Unable to fetch sequence at 17:143074439-83257441
bcftools +liftover --version
Output:
bcftools 1.20-9-g4bd57e53 using htslib 1.20-4-gc93f5a57
plugin at 1.19 using htslib 1.20-4-gc93f5a57
Workaround
Hello,
I am having issues running bcftools +liftover
for some of our samples, which are exome VCF files from single individuals. I am running bcftools 1.16 / Using htslib 1.16
.
I get the following error for 2/3 of our VCFs:
Assertion failed: (len == n_als * (n_als + 1) / 2), function update_AGR_record, file liftover.c, line 1471.
I was able to identify some of the variants that were causing this error, and it seemed to be cases with irregular genotype GT=1
where one would actually expect a diploid call (not on chromosome X, Y or M). After some more testing, the problem seemed to be with the value of the PL
key and only for some cases.
In the VCF header, PL is defined as "Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification", and should have 3 values for biallelic sites corresponding to AA,AB,BB
. The problematic variants only have 2 values, but only some of the described cases are problematic and others are not, depending on the genomic position.
For example (only showing VCF columns CHROM, POS, INFO, FORMAT, SAMPLE):
## this variant is causing the error
chr14 106055787 SNVHPOL=4;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1:99:19:17:0:0,17:0,6:0,11:-28.3:PASS:297,0
## this variant is not
chr14 76959815 SNVHPOL=3;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1:99:16:16:0:0,16:0,10:0,6:-24.5:PASS:255,0
Thanks in advance for any help!
I am running bcftools +liftover to convert a vcf file generated with hg19 to hg38
Got these errors/warnings:
Merging 1466 temporary files
[E::hts_open_format] Failed to open file "TempDk63j4/01022.bcf" : Too many open files
Could not read TempDk63j4/01022.bcf: Too many open files
[W::vcf_parse_format_fill5] Extreme FORMAT/PL value encountered and set to missing at chr1:1654143
[E::faidx_adjust_position] The sequence "X" was not found
Unable to fetch sequence at X:200820--1
I've used hg19, hg38 and chain file from UCSC
I can not paste any snippet of vcf as it might raise a copyright issue
here is an example, an insertion was liftovered to a deletion. hg37 chr19:54754990_C/CGG, was liftovered to be: hg38 chr19:54251125:CGG/C
Hi @freeseek ,
I am trying to use the liftover
plugin on a vcf file I generated from WGS of a non-model organism. The dataset was generated with gatk, but the pipeline did not give the SNP ID to my vcf dataset. I though that I could use the liftover
plugin to assign SNP id to the vcf file using the old and new reference genome of my species of interest, plus the chain file.
My understanding of the liftover
is that it converts genome coordinates and annotation files from the one to another assembly.
I attempted to use the liftover
plugin of the binaries I downloaded:
bcftools +$BCFTOOLS_PLUGINS/liftover.so -Oz -o out.vcf.gz input.vcf -s $reference_genome -f $old_reference_genome -c $chain_file
But I do get the following error:
$BCFTOOLS_PLUGINS/liftover.so: invalid option -- 's'
$BCFTOOLS_PLUGINS/liftover.so: invalid option -- 'f'
$BCFTOOLS_PLUGINS/liftover.so: invalid option -- 'c'
First of all, sorry if I am actually misunderstanding what liftover is for. Second, in case it makes sense, what do you think I am doing wrong?
Thanks,
Best
Gabriele
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.