ga4gh / benchmarking-tools

Repository for the GA4GH Benchmarking Team work developing standardized benchmarking methods for germline small variant calls

License: Apache License 2.0

Shell 0.52% Python 0.96% HTML 96.98% JavaScript 1.15% CSS 0.39%
ga4gh variant-calls benchmarking genomics genome-sequencing standardization reference-materials

benchmarking-tools's Introduction

Germline Small Variant Benchmarking Tools and Standards

This repository hosts the work of the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team, which is developing standardized performance metrics and tools for benchmarking germline small variant calls. This Team includes representatives from sequencing technology developers, government agencies, academic bioinformatics researchers, clinical laboratories, and commercial technology and bioinformatics developers. We have worked towards solutions for several challenges faced when benchmarking variant calls, including (1) defining high-confidence variant calls and regions that can be used as a benchmark, (2) developing tools to compare variant calls robust to differing representations, (3) defining performance metrics like false positive and false negative with respect to different matching stringencies, and (4) developing methods to stratify performance by variant type and genome context. We also provide links to our reference benchmarking engines and their implementations, as well as to benchmarking datasets.

A manuscript from the GA4GH Benchmarking Team describing best practices for benchmarking germline small variant calls is on bioRxiv, and we ask that you cite this publication in any work using these tools: https://doi.org/10.1101/270157

Note: This site is still a work in progress.

Standards and Definitions

See doc/standards/ for the current benchmarking standards and definitions.

Reference tool implementations

The primary reference implementation of the GA4GH Benchmarking methods is hap.py, which lets users choose between vcfeval (recommended) and xcmp as the comparison engine and supports GA4GH stratification BED files for assessing performance in different genome contexts. A web-based implementation of this tool is available as the GA4GH Benchmarking app from peter.krusche on precisionFDA.
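As a rough illustration rather than an authoritative recipe, a typical hap.py run selecting the vcfeval engine and a set of stratification regions might be driven from Python roughly as below; the file names are placeholders, and the exact options should be checked against `hap.py --help` for the installed version:

import subprocess

# Placeholder inputs; truth/query VCFs, confident-region BED and reference FASTA
# come from the benchmarking resources described elsewhere in this repository.
cmd = [
    "hap.py", "truth.vcf.gz", "query.vcf.gz",
    "-f", "confident_regions.bed",              # truth-set confident regions
    "-r", "reference.fasta",                    # reference FASTA
    "-o", "benchmark_out",                      # prefix for report files
    "--engine", "vcfeval",                      # recommended comparison engine
    "--stratification", "stratifications.tsv",  # name<TAB>bed list of stratification regions
]
subprocess.run(cmd, check=True)                 # raises CalledProcessError on failure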

Other reference implementations following the standards outlined above are available at tools/. These are submodules which link to the original tool repositories.

Benchmarking Intermediate Files

The benchmarking process contains a variety of steps and inputs. In doc/ref-impl/, we standardise intermediate formats for specifying truth sets, stratification regions, and intermediate outputs from comparison tools.

Benchmarking resources

In resources/, we provide files useful in the benchmarking process. Currently, this includes links to benchmarking calls and datasets from Genome in a Bottle and Illumina Platinum Genomes, as well as standardized bed files describing potentially difficult regions for performance stratification.

benchmarking-tools's People

Contributors

andrewjesaitis, jzook, lenbok, pkrusche, skeenan, tfmorris


benchmarking-tools's Issues

Benchmarking my dataset

Hi,

I want to benchmark my variant calls (not the GIAB or Platinum Genomes datasets). How should I do this with the help of this tool?

Thanks

Comparison around confident region boundaries

@jzook said in #4:

In terms of prefiltering, my one concern with this is where part of a complex variant is excluded by the bed file in one representation and included in another representation. I've found this to be quite common in FPs/FNs, and it can be helped by doing the filtering after comparison, but it gets complicated - maybe this should be a different issue?

And @pkrusche also said:

W.r.t. prefiltering I also agree with @jzook -- the main reasoning for doing the UNK regions after comparison in hap.py is that this allows us to be more accurate at the boundaries of confident regions -- when variants have different representations in the query, they might not necessarily fall into the confident call regions, so filtering before comparison would wrongly give FNs.

This is quite interesting. I've not previously thought a great deal about the confident regions, and had assumed that they were relatively few in number, so that these edge effects would be fairly minor. I was surprised to find that the confident-regions BED that comes with the Illumina PG 8.0.1 truth set (what I have handy) contains 5.8 million regions (and this for only 4.5 million variants in the corresponding truth VCF).

The confident regions are being used to say: "Within the regions, we are confident that the truth set contains all the variants." The implication is that the genomic sequence outside of the regions could not be confidently determined (and so may be variant or non-variant). What are some of the reasons why regions are excluded?

  • The region was not sequenced to enough depth to confidently determine the site is non-variant. This would probably lead to relatively large gaps between confident regions.
  • There were calls made in the region which were inconsistent (either with each other, or with phasing, etc.). Presumably these are not included in the truth set (let's call these truth-rejects).

When doing comparisons near the edges of these regions, these are some of the cases we need to be careful about (there are more, but let's just worry about these):

a) A call made outside the confident region could not be matched to a truth variant (thus assigned match status UNK). (These may or may not match truth-rejects.)

b) A truth variant can be matched by a call made outside the confident region. (This should be a TP, but pre-filtering calls to the regions would incorrectly turn it into an FN.)

c) A call made inside the confident region could not be matched to a truth variant. (Seems like this should be an FP, but hang on: what if the call is equivalent to a truth-reject -- it should be UNK, right? What if it is "nearly equivalent", such as a zygosity error -- should that also be UNK, since we're not confident in the truth-reject's correctness?)

Looking at the distribution of region lengths, the Illumina PG 8 dataset has over 600k confident regions that are a single base long and another 340k that are only two bases long. Why are they so short? How can you be confident at a single base and yet not be confident at the bases on either side? These locations seem like they would be vulnerable to pre-filtering FN (case b above).

Similarly, this dataset has 3.6 million cases where the gap between consecutive confident regions is only a single base long (and another 400k where the gap is 2 bases long). This suggests the position is being excised due to call inconsistency during truth-set construction. These cases seem like they would be very vulnerable to generating FPs at comparison time for any caller that doesn't happen to use the same variant representation as went into the PG construction (whereas a caller using the same positional representation would get UNK instead) (case c above). This is regardless of whether region filtering is pre- or post-comparison, with the exception that it turns out that PG 8 has 200k "silver" variants that are outside the confident regions (which will alleviate things for alternate representations that do match these decoys, but it seems like a drop in the bucket).
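For reference, a minimal sketch of how these region-length and gap distributions can be tallied from a confident-regions BED (assuming a coordinate-sorted, tab-separated, three-column BED; the file name is illustrative):

from collections import Counter
from itertools import groupby

def length_and_gap_counts(bed_path):
    """Count confident-region lengths and the gaps between consecutive regions."""
    lengths, gaps = Counter(), Counter()
    with open(bed_path) as bed:
        rows = (line.rstrip("\n").split("\t")[:3]
                for line in bed if line.strip() and not line.startswith("#"))
        for _, chrom_rows in groupby(rows, key=lambda row: row[0]):
            prev_end = None
            for _, start, end in chrom_rows:
                start, end = int(start), int(end)
                lengths[end - start] += 1
                if prev_end is not None:
                    gaps[start - prev_end] += 1
                prev_end = end
    return lengths, gaps

lengths, gaps = length_and_gap_counts("ConfidentRegions.bed")
print("1 bp regions:", lengths[1], " 2 bp regions:", lengths[2])
print("1 bp gaps:", gaps[1], " 2 bp gaps:", gaps[2])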

It would be interesting to look further at the magnitude of these effects. Just going by the numbers above, case c should be four times larger than case b. Case b is the easier to evaluate: just compare pre- and post-filtering results, although as mentioned above I'm quite sceptical of those confident regions that are only one or two bases long in the first place. Case c effects are harder to evaluate. It would also be interesting to give truth variants a score which is the distance from the variant to its nearest confident-region boundary, and plot a "ROC curve" with this score -- I would assume that those with a larger distance are more likely to be TP in a typical query set.
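A quick sketch of that distance score, assuming the confident-region boundary coordinates for a chromosome have already been collected into a sorted list (the values below are purely illustrative):

import bisect

def distance_to_nearest_boundary(pos, boundaries):
    """Distance from a variant position to the nearest confident-region boundary.

    `boundaries` must be a sorted list of boundary coordinates for one chromosome.
    """
    i = bisect.bisect_left(boundaries, pos)
    candidates = []
    if i < len(boundaries):
        candidates.append(boundaries[i] - pos)      # next boundary at or after pos
    if i > 0:
        candidates.append(pos - boundaries[i - 1])  # previous boundary before pos
    return min(candidates) if candidates else None

boundaries = [1000, 1500, 5000, 5001]                  # illustrative coordinates
print(distance_to_nearest_boundary(1490, boundaries))  # -> 10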

Intermediate VCF, BD clarification

For the decision (BD) field, it is fairly obvious what TP/FP/FN mean, but it could be good to clarify the meaning of 'N' (and what the intended use is).

Am I right in assuming an 'N' is used to indicate that the call was not assessed / skipped / other? And I assume where there is no call (i.e. an empty GT), the field should just have a regular VCF missing value '.'.

In vcfeval, calls may have non TP/FP/FN status either because they were ignored from comparison (e.g. outside of BED region; non-variant GT; SV or symbolic; having non-pass FILTER; exceeding maximum length; off-reference), or if the variant could not be assessed due to being in a region that turned out to exceed maximum evaluation complexity -- I assume these would need to end up with BD as 'N' or perhaps missing.

Representing and benchmarking variant calls on X, Y, and MT

The previous GIAB high-confidence callsets have ignored Y (since NA12878 is female), have ignored MT (because we haven't developed methods to call high-confidence variants there), and have treated X the same as the other chromosomes. Our next GIAB genome is a son in a trio, so we will need to treat X and Y differently, both in the creation of high-confidence calls and in benchmarking. As I understand it, there are a few ways that different tools express variants on X and Y in males.

  1. In pseudo-autosomal regions, call variants as heterozygous or homozygous in X and mask Y. Eventually, I assume long reads, linked reads, or other methods will be able to phase variants and determine which variants are on X and which are on Y, so this might be only a short term solution, but perhaps this is best with current methods?
  2. In other regions of X and Y, for males it seems best to output variants with "1" in the GT field rather than "1/1", but many variant callers do not do this. Should benchmarking tools allow either representation when doing comparisons and output a warning, or should they reject the "1/1" representation? (A sketch of the permissive option follows this list.)
  3. Similarly, for MT, should benchmarking tools reject "1/1" in the GT field? How should heteroplasmy be represented?
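As one possible interpretation of the "allow either representation with a warning" option in point 2, here is a small sketch of a genotype-normalisation step a comparison tool could apply; the contig names and sex handling are illustrative, and pseudo-autosomal regions are deliberately ignored here:

import warnings

HAPLOID_MALE_CONTIGS = {"X", "Y", "chrX", "chrY"}  # illustrative; PARs not handled

def normalize_gt(gt, chrom, sample_sex="male"):
    """Accept either '1' or '1/1' style genotypes on haploid contigs.

    Homozygous diploid calls are collapsed to a haploid call with a warning;
    anything else is returned unchanged for downstream handling.
    """
    alleles = gt.replace("|", "/").split("/")
    if sample_sex == "male" and chrom in HAPLOID_MALE_CONTIGS and len(alleles) == 2:
        if alleles[0] == alleles[1] and alleles[0] != ".":
            warnings.warn("collapsing diploid GT %s to haploid on %s" % (gt, chrom))
            return alleles[0]
    return gt

print(normalize_gt("1/1", "chrY"))  # -> "1", with a warning
print(normalize_gt("1", "chrX"))    # -> "1"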

Add links to datasets for benchmarking?

David Haussler suggested including links to datasets that can be used for benchmarking in this repo, which I think is a good idea. I suggest we might want 2 categories of data for each genome - high-confidence vcf/bed files and raw data files. Does that make sense?

Here are the genomes I'll propose as a start:

  • NA12878
    • high-confidence vcf/bed from GIAB, Platinum Genomes, and RTG; maybe SVs from NIST, MetaSV, Bashir
    • fastq from Illumina WGS, WES, Proton WES, others?
  • NA12877 from Platinum Genomes?
  • Venter genome from Bina?
  • Baylor genome?
  • GIAB PGP trios when calls are available next year

Truth set high-confidence regions and shiftable indels

Gold-standard benchmarking datasets such as the GIAB/NIST NA12878 and the Illumina Platinum Genomes truth sets provide both a VCF containing the variant records themselves and a BED file of "high-confidence regions". The confident regions indicate the parts of the genome where no variants are present other than those contained within the associated VCF. There may be several reasons why some parts of the genome are regarded as low-confidence, but one category is regions where variant calls were made that are inconsistent between callers or inconsistent with pedigree.

The actual variant comparison is usually carried out using haplotype-aware tools like vcfeval/hap.py which can identify equivalence between variants that use alternative representations. Note that there can sometimes be large positional differences between equivalent variants. These tools are smart enough to know that if a query variant outside of the confident regions matches a truth variant inside a confident region, this should be regarded as a true positive. However, we would ideally want the converse handled as well -- that is, if a query variant inside a confident region is equivalent to a truth variant outside a confident region, it should not be considered a false positive (otherwise the accuracy metrics would be biased depending on which representation the caller elected to use). We can't generally do this, firstly because the truth sets rarely include variants outside of the high-confidence regions, and secondly because, by the nature of low-confidence regions, an exact match is unlikely anyway. (Some discussion is at #11.)

Ideally, confident regions should not be defined such that this type of thing is likely to occur. It seems like you would not want to create a narrow low-confidence region around a variant if that variant could equally be placed at an alternative location -- the low-confidence region should cover the range of positions at which the variant could have been represented.

I mentioned this potential issue on one of the GA4GH benchmarking calls and considered an experiment to right-align variants to see when this would shift them across confident-region boundaries. However, a tool was recently published that makes this experiment easier. UPS-Indel aims to provide a universal positioning coordinate for indels regardless of their representation, in order to permit fast detection of redundant/equivalent indels by exact comparison of this coordinate. Essentially, the UPS coordinate contains the reference span across which the indel could be placed, plus the sequence of the indel (there is also some light decomposition that is carried out). While the UPS-Indel comparison method is not capable of identifying complex equivalences involving multiple variants, for one-to-one variants the authors report it working well. As a side note, their preprint reports vcfeval performing poorly in comparison to UPS-Indel; however, it turns out that UPS-Indel looks only at ALTs (i.e. it is sample-free), while their vcfeval runs required diploid sample genotype matches, so it would obviously report fewer matches. I obtained the VCFs from the authors, and running vcfeval in sample-free mode did yield very similar results. (Additionally, their dataset was a subset containing deletions only, which limited the likelihood of encountering situations requiring more than one-to-one matching ability.)

Anyway, the UPS coordinates provide an easy way to look for cases where variants can shift across confident-region boundaries, allowing us to identify potentially problematic sites. To start off, we create a BED file corresponding to the UPS coordinates of all indels in the variant set, and then intersect this with a BED file corresponding only to the borders of the confident regions.
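A rough sketch of that intersection step in plain Python (assuming the UPS spans have already been written to a BED file by some other means; the file names are illustrative):

import bisect

def load_bed(path):
    """Read a three-column BED into {chrom: [(start, end), ...]}."""
    regions = {}
    with open(path) as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue
            chrom, start, end = line.split("\t")[:3]
            regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions

def unsafe_spans(ups_spans, confident):
    """Yield indel UPS spans that straddle a confident-region boundary."""
    for chrom, spans in ups_spans.items():
        # Boundary points are simply the starts and ends of the confident regions.
        points = sorted({p for s, e in confident.get(chrom, []) for p in (s, e)})
        for start, end in spans:
            i = bisect.bisect_right(points, start)
            if i < len(points) and points[i] < end:  # a boundary lies strictly inside the span
                yield chrom, start, end

ups = load_bed("indel_ups_spans.bed")     # span over which each indel can be placed
conf = load_bed("ConfidentRegions.bed")   # truth-set confident regions
for chrom, start, end in unsafe_spans(ups, conf):
    print("%s\t%d\t%d" % (chrom, start, end))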

Some interesting sites that turned up

I ran the intersection process on both GIAB 3.3.2 and PG 2016-1.0 and had a browse around. Since the inputs were truth VCFs rather than low-confidence indels, most of the examples were cases of a confident variant that could also be represented as a variant in a low-confidence region (i.e. the case already handled by our comparison tools). This would be better done using a VCF corresponding to the low-confidence indels that Justin or Illumina use when creating their truth sets, but there are still some interesting cases.

In the following screenshots we have the GIAB truth VCF at the top, the UVCF BED regions showing the indel variant spans, the high-confidence regions, and the intersection between the UVCF spans and the confident-region boundaries ("unsafe"). This sequence of tracks is repeated for the PG set in the lower half of the display.

[Screenshot: 1:5,869,462-5,869,535]

1:5,869,472-5,869,526 - PG contains an indel which is just outside a confident region. The indel is capable of sliding into the downstream region, so a caller that made the same call but positioned it to the right would be assigned an FP.

[Screenshot: 1:6,187,990-6,188,099]

1:6,187,990-6,188,099 - PG contains an indel inside a two-base confident region which itself sits inside roughly 50 bp of low confidence. The indel can be repositioned inside the low-confidence zone. How confident are those two bases really, given the context?

[Screenshot: 1:12,141,963-12,142,110]

1:12,141,963-12,142,110 - GIAB looks to have included a call that PG has spliced out. The call itself partially overlaps the GIAB high-confidence regions. Should the GIAB high-confidence region end mid-call? In addition, the call also has enough wiggle that it could be repositioned entirely within or entirely outside the PG high-conf regions. Can PG really claim to be confident about that adjacent region?

[Screenshot: 1:50,938,017-50,938,164]

1:50,938,017-50,938,164 - GIAB looks to have excluded a call that PG has also spliced out. However, the call has enough wiggle that it could be repositioned to overlap the GIAB high confidence region, and it could happily shift through some PG high-conf regions.

A simple way of making the high-confidence regions a little more robust to these issues would be to compute the union of the UPS-coordinates of rejected indels and subtract this from the existing high-confidence regions.
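A minimal sketch of that union-and-subtract step on one chromosome's intervals (pure Python, with illustrative coordinates):

def merge(intervals):
    """Union of intervals: merge overlapping or adjacent (start, end) pairs."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def subtract(regions, mask):
    """Remove the merged `mask` intervals from the merged `regions` intervals."""
    out = []
    for start, end in regions:
        cur = start
        for mstart, mend in mask:
            if mend <= cur or mstart >= end:
                continue
            if mstart > cur:
                out.append((cur, mstart))
            cur = max(cur, mend)
            if cur >= end:
                break
        if cur < end:
            out.append((cur, end))
    return out

high_conf = merge([(100, 500), (600, 900)])
rejected_ups = merge([(480, 620)])           # UPS span of a rejected indel
print(subtract(high_conf, rejected_ups))     # [(100, 480), (620, 900)]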

Attached is script.zip, containing a script as a starting point if someone wants to try this out themselves.
@pkrusche @jzook @RebeccaTruty

Add metadata to VCF-I header

In the interest of promoting data provenance and reproducibility, I propose to add some headers to the VCF-I spec:

##program=<ID=hap.py, version=0.3.9, commandline=...>
##comparator=<ID=vcfeval, version=3.5.1>
##truthVCF=<ID=GIAB, file=giab_truth.vcf.gz, url=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/>
##queryVCF=<ID=Sample1, file=sample1.vcf.gz, sex=M>
##truthRegions=<ID=GIAB_default, file=giab_truth_regions.bed, url=..., insertionPadded=true>
##targetRegions=<ID=AgilentExome6.0, file=Exome-Agilent_V6.bed, url=https://github.com/AstraZeneca-NGS/reference_data/blob/master/hg19/bed/Exome-Agilent_V6.bed?raw=true>
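For what it's worth, such lines could be appended to an intermediate VCF roughly as follows; this is a sketch assuming pysam is available, and the header values are placeholders rather than a finalised schema:

import pysam

provenance_lines = [
    '##program=<ID=hap.py,version=0.3.9>',
    '##comparator=<ID=vcfeval,version=3.5.1>',
    '##truthVCF=<ID=GIAB,file=giab_truth.vcf.gz>',
    '##queryVCF=<ID=Sample1,file=sample1.vcf.gz,sex=M>',
]

with pysam.VariantFile("intermediate.vcf.gz") as vcf_in:
    header = vcf_in.header.copy()
    for line in provenance_lines:
        header.add_line(line)  # append a raw meta-information line
    with pysam.VariantFile("intermediate.annotated.vcf.gz", "w", header=header) as vcf_out:
        for record in vcf_in:
            vcf_out.write(record)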

Intermediate VCF, decision sub-types (BK)

For decision subtype, I am not clear what values to use for vcfeval results, as the terminology seems to be more oriented toward calls that have been through normalization.

I would probably put any TP as complex match since vcfeval doesn't have any notion of a "direct" match. I guess we could, after the fact, consider direct to be where the GT fields are the same, although this seems a bit pointless (and we would have to be extra careful to handle cases where the input VCFs had multiple records at the same position). Slightly more natural (and useful) would be to count the number of TRUTH/QUERY variants in the sync region and take any 1:1 cases as "direct", even though they may actually be at different coordinates.

For non-TP, the vcfeval semantics are really "this variant could not be included in a matching haplotype pair", which I would probably denote as complex mismatch. Our latest version can automatically run haploid matching on the FN/FP in order to find calls that have a common allele (e.g. typically matching calls that differ due to zygosity errors or half-calls); the closest semantics for any of these "haploid TPs" here may be genotype mismatch?

I'm also not clear what BK=miss means (does it just mean that the call in this cell is '.'?)

Standardizing performance metrics output file

Since we're getting close to outputting performance metrics, I thought we should have a discussion around the output format, which will allow us to exchange outputs from different comparison workflows on different systems.

@pkrusche proposed what I assume is the current hap.py output format in https://github.com/ga4gh/benchmarking-tools/blob/master/doc/ref-impl/outputs.md, and @goranrakocevic discussed Seven Bridges' benchmarking output schema in a presentation a few weeks ago (https://drive.google.com/open?id=0B29EEcQ2PgqjRE91LV9yUTNSRHAtTC1sSTJsUWw5TUszZ0Fj). We've also defined performance metrics for comparison methods with different stringencies (https://github.com/ga4gh/benchmarking-tools/blob/master/doc/standards/GA4GHBenchmarkingPerformanceMetricsDefinitions.md) and stratification methods (https://github.com/ga4gh/benchmarking-tools/blob/master/doc/standards/GA4GHBenchmarkingPerformanceStratification.md).

In looking at our current outputs file, I think we might want to expand it a bit to incorporate some of the ideas from @goranrakocevic and from our performance metrics and stratification methods documents.

If we want a single flat file, I think having more columns may be useful in addition to the metrics columns:

  • Test call set (md5?)
  • Benchmark call set (md5?)
  • Zygosity
  • Variant type
  • Stratification bed
  • ROC field
  • ROC threshold
  • Comparison method (stringency of match from our spec)

For the metrics columns, I'd suggest we take definitions and names from the spec in https://github.com/ga4gh/benchmarking-tools/blob/master/doc/standards/GA4GHBenchmarkingPerformanceMetricsDefinitions.md. I also think we should add columns for 95% confidence intervals for the metrics.
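To make the shape of such a flat file concrete, here is a sketch of writing one row with Python's csv module; the column names follow the list above plus a few metric and confidence-interval columns, and all values are dummy placeholders rather than real results:

import csv

columns = [
    "test_callset_md5", "benchmark_callset_md5", "zygosity", "variant_type",
    "stratification_bed", "roc_field", "roc_threshold", "comparison_method",
    "TP", "FP", "FN", "precision", "recall",
    "precision_ci95_lower", "precision_ci95_upper",
    "recall_ci95_lower", "recall_ci95_upper",
]

with open("benchmark_metrics.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=columns)
    writer.writeheader()
    writer.writerow({                       # dummy values for illustration only
        "test_callset_md5": "<md5 of query VCF>",
        "benchmark_callset_md5": "<md5 of truth VCF>",
        "zygosity": "het", "variant_type": "indel",
        "stratification_bed": "low_complexity.bed",
        "roc_field": "QUAL", "roc_threshold": 30,
        "comparison_method": "genotype_match",
        "TP": 9500, "FP": 120, "FN": 500,
        "precision": 0.9875, "recall": 0.9500,
        "precision_ci95_lower": 0.985, "precision_ci95_upper": 0.990,
        "recall_ci95_lower": 0.946, "recall_ci95_upper": 0.954,
    })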

A couple of questions:

  • Should any of these fields be moved to the header, with a separate file output for each distinct value?
  • How should we report results when multiple stratification bed files are intersected?
  • I think it may also be useful to have some standardized values for some of these fields (e.g., snp, indel, complex, etc. for Variant type). Do others agree?

Can we call a "superlocus" a "window"?

From how a "superlocus" was defined to me during the NY meeting, I think the traditional term "window" may be more appropriate. "Window" is a very standard term for anyone doing serial processing and, to me, is more to the point than "superlocus". Thanks!

CalledProcessError

I am getting a CalledProcessError regarding the benchmark vcf, even though it has been indexed.

2021-05-05 18:27:33,391 WARNING  No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
Hap.py v0.3.14
[E::vcf_hdr_read] no sample line
Failed to open or file not indexed: /u/home/n/neharajk/scratch/HG002_GRCh37_1_22_v4.2.1_benchmark_19.vcf

2021-05-05 18:27:41,571 ERROR    Command 'vcfcheck /u/home/n/neharajk/scratch/HG002_GRCh37_1_22_v4.2.1_benchmark_19.vcf --check-bcf-errors 1' returned non-zero exit status 1
2021-05-05 18:27:41,571 ERROR    Traceback (most recent call last):
2021-05-05 18:27:41,571 ERROR      File "./hap.py", line 540, in <module>
2021-05-05 18:27:41,573 ERROR        main()
2021-05-05 18:27:41,573 ERROR      File "./hap.py", line 316, in main
2021-05-05 18:27:41,573 ERROR        convert_gvcf_to_vcf=convert_gvcf_truth)
2021-05-05 18:27:41,573 ERROR      File "/u/scratch/m/mdistler/hap.py-build/bin/pre.py", line 125, in preprocess
2021-05-05 18:27:41,576 ERROR        mf = subprocess.check_output("vcfcheck %s --check-bcf-errors 1" % pipes.quote(vcf_input), shell=True)
2021-05-05 18:27:41,576 ERROR      File "/u/local/apps/python/2.7.2/lib/python2.7/subprocess.py", line 544, in check_output
2021-05-05 18:27:41,599 ERROR        raise CalledProcessError(retcode, cmd, output=output)
2021-05-05 18:27:41,600 ERROR    CalledProcessError: Command 'vcfcheck /u/home/n/neharajk/scratch/HG002_GRCh37_1_22_v4.2.1_benchmark_19.vcf --check-bcf-errors 1' returned non-zero exit status 1

Any help?

Records in intermediate VCF format

Apologies for jumping into this discussion late. My question is: what are the records in the intermediate VCF format? For example, given these two inputs:

Truth

1 10 AT GT,AC 0/1

Query

1 10 A G 0/1
1 11 T C 0/1

What does the intermediate output look like for this matching case?

And for this non-matching case:

Truth

1 10 AT GT,AC 0/1

Query

1 10 A G 1/1
1 11 T C 1/1

Intermediate VCF, what is BT field for?

Currently the BT description is "Type of comparison performed" and it is in the ROC section. I'm not sure what the intended contents of this field are, so some examples would help. Is it just a textual name for where the QQ value was derived from, perhaps for graph labelling? (e.g. "QUAL", "My Favourite Callset Name")

Intermediate VCF, SUPERLOCUS_ID

There is a potential complication with respect to the SUPERLOCUS_ID (at least as it applies to vcfeval). The boundaries of vcfeval sync regions depend on the variant positions after some internal trimming (e.g. removal of the VCF anchor base where applicable; the "relaxed-reference" mode can also trim many leading same-as-reference bases).

What this means is that for a side-by-side output VCF record that preserves the input variant representation, occasionally the TRUTH may be in one sync region, and the QUERY may be in another.

In the combined output of dev vcfeval, these cases are handled by setting the equivalent field to contain two IDs (comma-separated), which is not really easy to deal with if you plan on doing downstream superlocus-oriented statistics. Reasonable alternatives would be to move SUPERLOCUS_ID to FORMAT (this seems a bit ugly since such cases are fairly rare), or, when such cases arise, to veto the merge and output two separate records (probably a good compromise).
