freeseek / score Goto Github PK

View Code? Open in Web Editor NEW

98.0 98.0 7.0 1.08 MB

Tools to work with GWAS-VCF summary statistics files

License: MIT License

R 3.64% C 96.32% Makefile 0.04%

score's People

Contributors

Stargazers

Watchers

Forkers

crerecombinase sounkou-bioinfo bigwanglab pwu4 sreeashwinas gaochengprc

score's Issues

Add Sample Overlap Correction for Metal

Could the sample overlap correction be added to the Metal plugin?

Several European research groups release their meta association results which include our collected samples. We would like to include their European samples in our meta analyses but those separate summary association results are not available to us. So the overlap correction would be useful in this situation.

From the Metal documentation: Sample Overlap Correction
Correction for sample overlap in sample size weighted meta-analysis (developed by Sebanti Sengupta and implemented by Daniel Taliun).

First, METAL estimates the number of individuals that are common among two or more studies based on Z-statistics from each study. Then, METAL adjusts for sample overlap when calculating overall Z-statistics by correcting the weights with the estimated number of individuals in common.

To enable correction for sample overlap in your sample size weighted meta-analysis, use OVERLAP ON command (valid only with SCHEME SAMPLESIZE). By default, METAL uses Z-statistics <1 for esimating the number of individuals that are common among studies. To change this threshold, use ZCUTOFF [number] command.

Liftover should probably sort output VCF

Thanks for the tool, it's extremely fast and seems to work well so far (currently evaluating it)

The output VCF appears to be written variant-by-variant from the source file

Sometimes, the relative order of variants can change. For instance:

GRCh38 input (correct order):

1	13115599	11730	G	A	.	.	.
1	13259273	12448	G	A	.	.	.

GRCh37 output:

1	13183071	11730	G	A	.	.	.
1	13112676	12448	C	T	.	.	FLIP

This produces a file that has

Warning: The file is not sorted, for example 1:13112676 comes after 1:13183071

Workaround

Don't use "-o" on liftover but instead pipe into bcftools sort then output to file

Liftover drop-tags finds no matching tags

Run command:
bcftools +liftover -Oz -o testvcf_output.vcf.gz testvcf_input.vcf.gz -- -s source_fasta.fa -f reference.fa -c chainfile.chain.gz --reject testvcf.rejected.vcf.gz

Error #1:

INFO/AF is handled by AF rule
Number of elements in the VCF record AF should be 1 but 2 found
Use --drop-tags INFO/AF to skip this error

Which arrises in the case of needing to swap alleles. All files do have two values for the INFO/AF field , i.e. AF=0.5,0.3
Attempted fix:
bcftools +liftover -Oz -o testvcf_output.vcf.gz testvcf_input.vcf.gz -- -s source_fasta.fa -f reference.fa -c chainfile.chain.gz --reject testvcf.rejected.vcf.gz --drop-tags [INFO/AF]

Error#2:
No matching tag [INFO/AF]

As a check, generally querying that tag in the VCF does work :
bcftools query -f '%INFO/AF' testvcf_input.vcf.gz

Maybe there is some trick to identifying the drop-tags? I have tried various inputs/string combinations for the drop-tags option without any luck.

integration as default bcftools plugins

Is there any plan to get these plugins integrated into the defaults for bcftools so we do not have to separately compile with them?

liftover: should GT stay the same when strand is flipped?

Excellent collection of plugins, thank you for your work.
About the liftover plugin which I started to use heavily in the last few days, unless my understanding of the meaning of GT field notation is flawed, I don't think that the GT fields should be "complemented" when the strand is flipped after conversion.

Here is an example of some genotype data with rs1287637 being converted from GRCh37/hg19 (original SNP array data) to hg38. The reference base for this SNV happens to change strands (so it is accordingly marked with SWAP=1 in the INFO field, which is great). The original record looks like this (only first 9 sample columns shown):

$ bcftools view -H gt_5M_gencalls_n208.bcf 1:5935162 | cut -f1-5,8,10-18
1	5935162	kgp1477084	A	T	.	1/1	0/1	1/1	0/1	1/1	1/1	1/1	0/1	1/1

After the bcftools +liftover to hg38:

$ bcftools view -H hg38_gt_5M.bcf chr1:5875102 | cut -f1-5,8,10-18
chr1	5875102	kgp1477084	T	A	SWAP=1	0/0	1/0	0/0	1/0	0/0	0/0	0/0	1/0	0/0

My understanding is that the notation 0/0 refers to the reference base, whatever that is (A in the old genome build, T in the new one).
If that reference base changes between genome builds, its meaning should stay the same, correct? i.e. the 0/0 genotype should still be left as 0/0, instead of becoming homozygous for the alt allele, which can alter AF calculations etc.

Feature request: Add INFO about cause of rejection in reject VCF

Liftover can fail for a variety of reasons...

Many of the liftover functions return -1 on any error, and the check is < 0

If you defined a range of negative constants, you could return specifically what went wrong, then look it up then add that to the INFO of rejected variants

colheaders.tsv format

For the munge -C colheaders.tsv option, what is the format of the colheaders.tsv file? The -c options does not include ourput from GENESIS (https://github.com/UW-GAC/GENESIS).

clang 16 compatibility

FYI, score does not build under clang 16, which is stricter about function pointer assignments by default:

clang 14:

plugins/munge.c:123:74: warning: incompatible function pointer types initializing 'const int (*)(tsv_t *, bcf1_t *, void )' (aka 'const int ()(struct _tsv_t *, struct bcf1_t *, void *)') with an expression of type 'int (tsv_t *, bcf1_t *, void *)' (aka 'int (struct _tsv_t *, struct bcf1_t *, void *)') [-Wincompatible-function-pointer-types]
static const int (*tsv_setters[])(tsv_t *tsv, bcf1_t *rec, void *usr) = {tsv_setter_id_flexible, // SNP

clang 16:
^~~~~~~~~~~~~~~~~~~~~~
plugins/munge.c:123:74: error: incompatible function pointer types initializing 'const int (*)(tsv_t *, bcf1_t *, void )' (aka 'const int ()(struct _tsv_t *, struct bcf1_t *, void *)') with an expression of type 'int (tsv_t *, bcf1_t *, void *)' (aka 'int (struct _tsv_t *, struct bcf1_t *, void *)') [-Wincompatible-function-pointer-types]
static const int (*tsv_setters[])(tsv_t *tsv, bcf1_t *rec, void *usr) = {tsv_setter_id_flexible, // SNP
^~~~~~~~~~~~~~~~~~~~~~

Systematic IFFY tag when indels are not trimmed

Hi @freeseek ,
Thanks a lot for the tools provided in this repo.
I have a question with regard to the munging plugin, it appears that indel variants are systematically flagged when the alleles are not trimmed (examples attached)

is this an expected behavior of the tool ?
Thanks

liftover erroneously swap ref/alt for insertions

Example:
Variants in GRCh37

7 101200669 . A AC . PASS .
7 101200669 . A AG . PASS .

After liftover to GRCh38

chr7 101557388 . AG A,AC . PASS SWAP=-1
chr7 101557388 . AG A . PASS SWAP=1

It should be:

chr7 101557388 . A AC . PASS .
chr7 101557388 . A AG . PASS .

Command I run:

bcftools +liftover <grch37.vcf.gz> -- -s <GRCh37.ref.fa.gz> -f <GRCh38.ref.fa.gz> --reject rejected.vcf

Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele

Hello,

I am trying to run liftover on a moderate size (500K PacBio-derived SVs) to lift data from the macaque genome to GRCh37. I am running the command below, but it is reporting:

Warning: as option --src-fasta-ref is missing it is impossible to infer which allele is the reference allele at position 1:248821847

however, I think I am providing "-s" to the command. Is there a different argument I am missing, or is the tool expecting some other kind of input? Thanks for any ideas.

	bcftools +liftover \
		--no-version \
		-Ou \
		--threads 18 \
		-o $VCF_LIFTED \
		$VCF_NORM \
		-- \
		-s <MMul10_Genome_Fasta> \
		-f <GRCH37_FASTA> \
		-c $CHAIN_FILE \
		--reject $UNMAPPED \
		--reject-type z \
		--write-src \
		--fix-tags

Also, I would not have thought 500K SVs is that big a dataset, but this has been running for days (even with 18 threads), which seems rather extreme.

Single out of range variant causes program to exit

A single record out of range (chr17 contig size is 83,257,441):

17      143074440       45      G       C       .       .       .

Expected:

Record is written to reject file

Actual:

Entire program exits with message:

Unable to fetch sequence at 17:143074439-83257441

bcftools +liftover --version

Output:

bcftools  1.20-9-g4bd57e53 using htslib 1.20-4-gc93f5a57
plugin at 1.19 using htslib 1.20-4-gc93f5a57

Workaround

preprocess VCF to ensure all variants are in range

Assertion error for some variants with irregular GT value

Hello,

I am having issues running bcftools +liftover for some of our samples, which are exome VCF files from single individuals. I am running bcftools 1.16 / Using htslib 1.16.

I get the following error for 2/3 of our VCFs:

Assertion failed: (len == n_als * (n_als + 1) / 2), function update_AGR_record, file liftover.c, line 1471.

I was able to identify some of the variants that were causing this error, and it seemed to be cases with irregular genotype GT=1 where one would actually expect a diploid call (not on chromosome X, Y or M). After some more testing, the problem seemed to be with the value of the PL key and only for some cases.

In the VCF header, PL is defined as "Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification", and should have 3 values for biallelic sites corresponding to AA,AB,BB. The problematic variants only have 2 values, but only some of the described cases are problematic and others are not, depending on the genomic position.

For example (only showing VCF columns CHROM, POS, INFO, FORMAT, SAMPLE):

## this variant is causing the error
chr14   106055787  SNVHPOL=4;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL    1:99:19:17:0:0,17:0,6:0,11:-28.3:PASS:297,0

## this variant is not
chr14  76959815  SNVHPOL=3;MQ=60  GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL	   1:99:16:16:0:0,16:0,10:0,6:-24.5:PASS:255,0

Thanks in advance for any help!

Too many open files in temp, could not read temp

I am running bcftools +liftover to convert a vcf file generated with hg19 to hg38
Got these errors/warnings:
Merging 1466 temporary files
[E::hts_open_format] Failed to open file "TempDk63j4/01022.bcf" : Too many open files
Could not read TempDk63j4/01022.bcf: Too many open files
[W::vcf_parse_format_fill5] Extreme FORMAT/PL value encountered and set to missing at chr1:1654143
[E::faidx_adjust_position] The sequence "X" was not found
Unable to fetch sequence at X:200820--1

I've used hg19, hg38 and chain file from UCSC
I can not paste any snippet of vcf as it might raise a copyright issue

insertion became deletion after liftover

here is an example, an insertion was liftovered to a deletion. hg37 chr19:54754990_C/CGG, was liftovered to be: hg38 chr19:54251125:CGG/C

Error liftover.so: invalid option -- 's': using liftover for assigning coordinates and annotation to vcf

Hi @freeseek ,

I am trying to use the liftover plugin on a vcf file I generated from WGS of a non-model organism. The dataset was generated with gatk, but the pipeline did not give the SNP ID to my vcf dataset. I though that I could use the liftover plugin to assign SNP id to the vcf file using the old and new reference genome of my species of interest, plus the chain file.

My understanding of the liftover is that it converts genome coordinates and annotation files from the one to another assembly.

I attempted to use the liftover plugin of the binaries I downloaded:

bcftools +$BCFTOOLS_PLUGINS/liftover.so -Oz -o out.vcf.gz input.vcf -s $reference_genome -f $old_reference_genome -c $chain_file

But I do get the following error:

$BCFTOOLS_PLUGINS/liftover.so: invalid option -- 's'
$BCFTOOLS_PLUGINS/liftover.so: invalid option -- 'f'
$BCFTOOLS_PLUGINS/liftover.so: invalid option -- 'c'

First of all, sorry if I am actually misunderstanding what liftover is for. Second, in case it makes sense, what do you think I am doing wrong?

Thanks,

Best
Gabriele