Partial reimplementation of bcftools for VCF Zarr
This is an early prototype: DO NOT USE!
License: Apache License 2.0
Per-contig variant counts can be easily calculated using the regions index introduced in #37.
This is needed when there are lots of samples, because command lines get too long for the shell.
Note that we should be careful to match the semantics of bcftools here when the file is empty (see also #61).
Shall we move to sgkit-dev so we can reduce the bus-factor @tomwhite? I'll be away for a few weeks and I don't want to block you and others from contributing.
Should be an easy one
bcftools has a mini-language of expressions which support inclusion/exclusion of variants in the output via the -i and -e operators.
We would like to support a small subset of this initially, so that we can support the query that we used in the VCF Zarr preprint (bcftools view -I --include "FORMAT/DP>10 & FORMAT/GQ>20").
Supporting the entire mini-language is probably a non-goal, and we should restrict to supporting pieces of it as we go as they are needed. Nonetheless, some initial analysis on what's involved in supporting the various aspects of the language would be very helpful.
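As a starting point for that analysis, here is a hedged sketch (not vcztools' actual implementation) of how the preprint query above could be evaluated with NumPy. The dp and gq arrays are hypothetical (variants, samples) arrays read from the Zarr store; bcftools' -i keeps a site when a FORMAT expression holds for at least one sample, which is what the any() models.

```python
import numpy as np

# Sketch of evaluating 'FORMAT/DP>10 & FORMAT/GQ>20' against
# (variants, samples) arrays; the per-call result is reduced to a
# per-site inclusion mask, as bcftools' -i does for FORMAT fields.
def include_mask(dp, gq):
    per_call = (dp > 10) & (gq > 20)
    return per_call.any(axis=1)

dp = np.array([[12, 5], [8, 9], [30, 40]])
gq = np.array([[25, 10], [50, 60], [15, 99]])
print(include_mask(dp, gq))  # [ True False  True]
```

A real implementation would parse the expression into an AST first; this only shows the evaluation step for one fixed query.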
While performance on large genotype-only VCFs is excellent (better than bcftools in terms of throughput in my tests), it is quite poor on complicated VCFs. For chromosome 2 on the recent 1000 genomes data I'm getting less than 1MB per second, which is 50X less than we need (bcftools view is doing around 60 MB/s).
My sense is that it's probably not worth chasing perf here using numba, and Jax doesn't seem like a good fit either. I'm actually inclined to write a C extension that follows the logic of the current buffer-based numba approach, as I think it would be less work in the long run and would get rid of the nasty latency issues involved in JIT compiling. For something like this, I think a well-written C extension is less maintenance work than fancy Python-based stuff. Once you write a C extension, it really doesn't need much maintenance.
We should do some profiling first to see where the bottlenecks are, though.
When the user selects an empty subset of samples in bcftools, bcftools calculates the INFO/AC and INFO/AN fields on the whole genotype matrix. vcztools should do the same.
vcztools currently supports a subset of bcftools' expression language for filtering variant sites. However, vcztools does not support genotype comparisons yet (e.g. GT="ref"). The task is to add support for genotype comparisons.
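A hypothetical sketch of what the GT="ref" comparison could look like over a VCF Zarr genotype array: a call matches when every allele in the genotype is 0. How missing calls (-1 alleles) should compare needs checking against bcftools' semantics, so this is an assumption, not a spec.

```python
import numpy as np

# gt has shape (variants, samples, ploidy); -1 marks a missing allele.
# GT="ref" is modelled here as "all alleles are 0" (hom-ref), which
# should be verified against bcftools before relying on it.
def gt_is_ref(gt):
    return np.all(gt == 0, axis=-1)

gt = np.array([[[0, 0], [0, 1]],
               [[1, 1], [0, 0]]])
print(gt_is_ref(gt))
# [[ True False]
#  [False  True]]
```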
I kept this lying around while coding up the C version for comparison. I don't think there's much left to translate over now though, as they take quite different approaches and we can let go of the numba code now.
It would be good to add some tests that check we get consistent results with bcftools.
It would be good to have some tests to check the performance of various vcztools commands. The thing to check is that running highly-selective queries (i.e. ones that just return a few records) on large datasets is fast.
They could be like the ones @jeromekelleher ran in #8. These are not amenable to running on CI, so it's probably fine to have some instructions to run them manually on 1000G chromosome 2 or 20 data.
As predicted @tomwhite we've hit sgkit-dev/vcf-zarr-spec#11 pretty quickly when adding tests on real VCF files. I added a quick kludge to work around the first case I hit, but it would be good to handle this more systematically. Unfortunately, I think it will just need some ugly code like we have in the validation code in vcf2zarr here
Currently you can only use -i/-e/-r/-s/-t with the view command. This would add these options to the query command.
This might be a good opportunity to factor out Click options like we do in bio2zarr: https://github.com/sgkit-dev/bio2zarr/blob/main/bio2zarr/cli.py#L26
Does it make sense to develop this under the sgkit-dev
org?
We should add the equivalent of the bcftools -r/--regions
option to filter by regions.
The work in sgkit-dev/sgkit#658 (which was never merged) could form the basis for the implementation.
Also, it would be simpler to implement -t/--targets
first, since unlike -r/--regions
it only needs to check position, and not the variant length. To support the latter we probably need to store the variant length as a separate Zarr array in order to do efficient queries.
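A sketch of the simpler -t/--targets case, which only needs POS. It assumes variant_position is sorted within the contig (as in VCF Zarr), so a binary search suffices; -r/--regions would additionally need variant lengths to catch indels that start before the region but overlap it.

```python
import numpy as np

# Select variants with start <= POS <= end by binary search over the
# sorted variant_position array for one contig.
def targets_mask(positions, start, end):
    lo = np.searchsorted(positions, start, side="left")
    hi = np.searchsorted(positions, end, side="right")
    mask = np.zeros(len(positions), dtype=bool)
    mask[lo:hi] = True
    return mask

pos = np.array([100, 150, 200, 250, 300])
print(targets_mask(pos, 150, 250))  # [False  True  True  True False]
```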
It would be good to copy across some of the tests for the VCF writing code from sgkit. Hopefully we can do some easy round-trip tests using bio2zarr?
@tomwhite, what would be involved here?
To write directly to a VCF file.
The decimal value for FLT_MAX is 340282346638528859811704183484516925440, which is much larger than we currently support. We need some way to either support this, or to protect the underlying C code.
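One possible guard (an assumption, not a chosen design): check values on the Python side and reject finite values whose magnitude exceeds FLT_MAX before they reach the float32-based C code, rather than letting them silently overflow to infinity.

```python
import numpy as np

FLT_MAX = np.finfo(np.float32).max  # ~3.4028235e+38

# Hypothetical protective conversion: raise on finite float64 values
# that don't fit in float32, otherwise narrow to float32 for the C layer.
def to_float32_checked(values):
    values = np.asarray(values, dtype=np.float64)
    finite = np.isfinite(values)
    if np.any(np.abs(values[finite]) > FLT_MAX):
        raise ValueError("value exceeds float32 range")
    return values.astype(np.float32)
```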
When the user specifies a sample selection in vcztools view, vcztools recalculates the AC and AN INFO fields. This is consistent with bcftools' behavior. vcztools calculates these INFO fields using all of the samples in a variant-wise chunk of genotype data. The current implementation in pure Python using NumPy may be slow and create a lot of overhead. This issue is to improve the computation and memory efficiency. The solution may require calculating AC and AN in a C extension module.
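One direction to explore before reaching for C: a fully vectorised AC/AN computation over a chunk using a single bincount, avoiding per-variant Python loops. This is a sketch under assumed conventions (gt is a (variants, samples, ploidy) int array with -1 for missing alleles); a C extension could follow the same counting logic.

```python
import numpy as np

def compute_ac_an(gt):
    flat = gt.reshape(gt.shape[0], -1)
    # AN: number of called (non-missing) alleles per variant.
    an = (flat >= 0).sum(axis=1)
    n_alleles = int(flat.max()) + 1
    # Offset each variant's alleles so one bincount counts all variants.
    offsets = np.arange(flat.shape[0])[:, None] * n_alleles
    idx = (flat + offsets)[flat > 0]
    counts = np.bincount(idx, minlength=flat.shape[0] * n_alleles)
    # AC: per-variant counts of each ALT allele (columns 1..n-1).
    ac = counts.reshape(flat.shape[0], n_alleles)[:, 1:]
    return ac, an

gt = np.array([[[0, 1], [1, 1]],
               [[0, 0], [-1, 2]]])
ac, an = compute_ac_an(gt)
print(an.tolist())  # [4, 3]
print(ac.tolist())  # [[3, 0], [0, 1]]
```

Whether this beats a C extension in practice should be settled by profiling, as suggested above.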
The original code was added in #77.
We can use hypothesis-vcf to generate VCF files, convert to Zarr with vcf2zarr, then convert back using vcztools - and check that they are equivalent.
Add to CI like we have in bio2zarr
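Round-trip output can't be byte-identical: header provenance lines (tool versions, command lines, dates) will differ between the original and the regenerated VCF. A comparison helper could drop those lines before asserting equivalence; the prefixes listed here are assumptions about which lines vary.

```python
# Normalize VCF text for round-trip comparison by dropping header
# lines that legitimately differ between tools/runs.
def normalized_vcf_lines(text):
    drop_prefixes = ("##bcftools_", "##vcztools_", "##fileDate")
    return [line for line in text.splitlines()
            if not line.startswith(drop_prefixes)]

a = "##fileformat=VCFv4.3\n##vcztools_viewCommand=view x.vcz\n#CHROM\nchr1\t100"
b = "##fileformat=VCFv4.3\n#CHROM\nchr1\t100"
print(normalized_vcf_lines(a) == normalized_vcf_lines(b))  # True
```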
By prefixing the list (or file for #67) with a ^
character.
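A sketch of how the ^ complement could be parsed. Note that for plain inclusion bcftools honours the order given with -s, whereas this sketch keeps file order throughout; that detail would need checking.

```python
# bcftools-style sample selection: a leading '^' means "all samples
# except these" (complement of the comma-separated list).
def select_samples(all_samples, spec):
    complement = spec.startswith("^")
    names = set(spec.lstrip("^").split(","))
    if complement:
        return [s for s in all_samples if s not in names]
    return [s for s in all_samples if s in names]

samples = ["NA00001", "NA00002", "NA00003"]
print(select_samples(samples, "NA00002"))   # ['NA00002']
print(select_samples(samples, "^NA00002"))  # ['NA00001', 'NA00003']
```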
There is a difference between the vcztools' output and bcftools' output in the following commands:
vcztools view -s NA00001 vcz_test_cache/sample.vcf.vcz
bcftools view -s NA00001 tests/data/vcf/sample.vcf.gz
vcztools does not appear to write the INFO fields correctly.
This issue also affects #67.
vcztools view does not recalculate the AC and AN INFO fields like bcftools view does when the user specifies a sample selection.
--no-update option.

I didn't notice this when reviewing #49, but FilterExpressionParser and FilterExpressionEvaluator don't really belong in a general utils.py module; they should be moved to expression.py or filter.py (or similar).
There are quite a few options that bcftools supports: tab-delimited, VCF, BED (compressed and uncompressed) - we may not need all of them straightaway.
vcztools view needs to produce the appropriate header in all situations, and having an input header as an option is not really very helpful. We need to subset the samples and fields in all sorts of ways, and parsing the source header here is not the right approach. Header parsing should be done in bio2zarr.
Note that to do this fully, we will need to update the VCF Zarr specification and vcf2zarr in order to store sufficient information about the header, rather than the header itself. This will require some coordinated updates across repos.
I think the simplest approach here is to just delete the option for providing a complete header, and see what we're missing.
See:
In the same way that bcftools view does, e.g.:
##bcftools_viewVersion=1.13+htslib-1.13+ds
##bcftools_viewCommand=view /home/jk/work/github/bio2zarr/tests/data/vcf/sample.vcf.gz; Date=Tue Aug 6 10:48:28 2024
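A minimal sketch of generating the equivalent provenance line for vcztools. The key name and date format mirror the bcftools example above, but both are assumptions about what we'd want to emit.

```python
import time

# Build a bcftools-style provenance header line recording the command
# that produced the output (hypothetical key name).
def view_command_header(argv):
    date = time.strftime("%a %b %d %H:%M:%S %Y")
    return f"##vcztools_viewCommand={' '.join(argv)}; Date={date}"

print(view_command_header(["view", "sample.vcz"]))
```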
@tomwhite would you mind opening a PR where you copy the files vcf_writer.py and vcf_writer_utils.py into the vcztools
directory here? Don't bother making any changes, I just want to track that you've contributed the starting point for this code.
bcftools recomputes some INFO fields after subsetting by samples. We need to do what it does.
See also this option:
-I, --no-update
do not (re)calculate INFO fields for the subset (currently INFO/AC and INFO/AN)
For example:
vcztools query -f '%POS\n' sample.vcz
We don't need looping over samples ([] notation) yet: https://samtools.github.io/bcftools/bcftools.html#query
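A minimal sketch of expanding such format strings, covering a few site-level tags only (no [] sample looping). Tags map onto hypothetical per-record dicts here; a real implementation would read the Zarr arrays column-wise instead of building per-record dicts.

```python
import re

# Expand bcftools query-style format strings like '%CHROM\t%POS\n'
# by substituting each %TAG with the record's value for that tag.
def format_records(fmt, records):
    def render(record):
        return re.sub(r"%(\w+)", lambda m: str(record[m.group(1)]), fmt)
    return "".join(render(r) for r in records)

records = [{"CHROM": "chr1", "POS": 100}, {"CHROM": "chr1", "POS": 250}]
print(format_records("%CHROM\t%POS\n", records), end="")
```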
In #66, @tomwhite suggests that vcztools should fail if the user tries to write the output (which is uncompressed VCF) to a file path with an extension that does not correspond to uncompressed VCF (e.g. .bcf
or .vcz.gz
).
In bcftools, the user can specify both the output file (with -o
) and the output format (with -O
). If the user specifies an output type that is different from the output type indicated by the output file's extension, then bcftools uses the output type indicated by the extension.
Since vcztools only supports uncompressed VCF at the moment, vcztools should fail if the output type indicated by the output file's extension is not uncompressed VCF.
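A sketch of the proposed check: fail when the output path's extension implies a format other than uncompressed VCF. The extension list here is an assumption about which suffixes to reject.

```python
# Reject output paths whose extension indicates BCF or a compressed
# format, since vcztools currently only writes uncompressed VCF.
def check_output_path(path):
    disallowed = (".bcf", ".gz", ".bgz")
    if path.endswith(disallowed):
        raise ValueError(f"only uncompressed VCF output is supported: {path}")

check_output_path("out.vcf")  # OK: no exception raised
```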
I think we can post an initial 0.0.1 package to PyPI to make this easier to try out, and to squat on the name.
Hopefully we can copy the CD workflow from bio2zarr and plug into that?
Although it might be easier if I set up the initial PyPI package first manually, and then we get the CD workflow going for subsequent releases?
It's because the headers aren't matching due to the new local alleles support in bio2zarr. Should we just disable local alleles for the time being?