Partial reimplementation of bcftools for VCF Zarr
This is an early prototype: DO NOT USE!
License: Apache License 2.0
Per-contig variant counts can be easily calculated using the regions index introduced in #37.
This is needed when there are lots of samples, because command lines get too long for the shell.
Note that we should be careful to match the semantics of bcftools here when the file is empty (see also #61).
Shall we move to sgkit-dev so we can reduce the bus-factor @tomwhite? I'll be away for a few weeks and I don't want to block you and others from contributing.
Should be an easy one
bcftools has a mini-language of expressions which support inclusion/exclusion of variants in the output via the -i and -e operators.
We would like to support a small subset of this initially, so that we can support the query that we used in the VCF Zarr preprint (bcftools view -I --include "FORMAT/DP>10 & FORMAT/GQ>20").
Supporting the entire mini-language is probably a non-goal, and we should restrict to supporting pieces of it as we go as they are needed. Nonetheless, some initial analysis on what's involved in supporting the various aspects of the language would be very helpful.
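As a starting point for that analysis, here is a hedged sketch (not vcztools' actual implementation) of how the preprint query above could be evaluated with NumPy. The dp and gq arrays are hypothetical (variants, samples) arrays read from the Zarr store; bcftools' -i keeps a site when a FORMAT expression holds for at least one sample, which is what the any() models.

```python
import numpy as np

# Sketch of evaluating 'FORMAT/DP>10 & FORMAT/GQ>20' against
# (variants, samples) arrays; the per-call result is reduced to a
# per-site inclusion mask, as bcftools' -i does for FORMAT fields.
def include_mask(dp, gq):
    per_call = (dp > 10) & (gq > 20)
    return per_call.any(axis=1)

dp = np.array([[12, 5], [8, 9], [30, 40]])
gq = np.array([[25, 10], [50, 60], [15, 99]])
print(include_mask(dp, gq))  # [ True False  True]
```

A real implementation would parse the expression into an AST first; this only shows the evaluation step for one fixed query.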
While performance on large genotype-only VCFs is excellent (better than bcftools in terms of throughput in my tests), it is quite poor on complicated VCFs. For chromosome 2 on the recent 1000 genomes data I'm getting less than 1MB per second, which is 50X less than we need (bcftools view is doing around 60 MB/s).
My sense is that it's probably not worth chasing perf here using numba, and Jax doesn't seem like a good fit either. I'm actually inclined to write a C extension that follows the logic of the current buffer-based numba approach, as I think it would be less work in the long run and would get rid of the nasty latency issues involved in JIT compiling. For something like this, I think a well-written C extension is less maintenance work than fancy Python-based stuff. Once you write a C extension, it really doesn't need much maintenance.
We should do some profiling first to see where the bottlenecks are, though.
When the user selects an empty subset of samples in bcftools, bcftools calculates the INFO/AC and INFO/AN fields on the whole genotype matrix. vcztools should do the same.
vcztools currently supports a subset of bcftools' expression language for filtering variant sites. However, vcztools does not support genotype comparisons yet (e.g. GT="ref"). The task is to add support for genotype comparisons.
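A hypothetical sketch of what the GT="ref" comparison could look like over a VCF Zarr genotype array: a call matches when every allele in the genotype is 0. How missing calls (-1 alleles) should compare needs checking against bcftools' semantics, so this is an assumption, not a spec.

```python
import numpy as np

# gt has shape (variants, samples, ploidy); -1 marks a missing allele.
# GT="ref" is modelled here as "all alleles are 0" (hom-ref), which
# should be verified against bcftools before relying on it.
def gt_is_ref(gt):
    return np.all(gt == 0, axis=-1)

gt = np.array([[[0, 0], [0, 1]],
               [[1, 1], [0, 0]]])
print(gt_is_ref(gt))
# [[ True False]
#  [False  True]]
```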
I kept this lying around while coding up the C version for comparison. I don't think there's much left to translate over now though, as they take quite different approaches and we can let go of the numba code now.
It would be good to add some tests that check we get consistent results with bcftools.
It would be good to have some tests to check the performance of various vcztools commands. The thing to check is that running highly-selective queries (i.e. ones that just return a few records) on large datasets is fast.
They could be like the ones @jeromekelleher ran in #8. These are not amenable to running on CI, so it's probably fine to have some instructions to run them manually on 1000G chromosome 2 or 20 data.
As predicted @tomwhite we've hit sgkit-dev/vcf-zarr-spec#11 pretty quickly when adding tests on real VCF files. I added a quick kludge to work around the first case I hit, but it would be good to handle this more systematically. Unfortunately, I think it will just need some ugly code like we have in the validation code in vcf2zarr here
Currently you can only use -i/-e/-r/-s/-t with the view command. This would add these options to the query command.
This might be a good opportunity to factor out Click options like we do in bio2zarr: https://github.com/sgkit-dev/bio2zarr/blob/main/bio2zarr/cli.py#L26
Does it make sense to develop this under the sgkit-dev
org?
We should add the equivalent of the bcftools -r/--regions
option to filter by regions.
The work in sgkit-dev/sgkit#658 (which was never merged) could form the basis for the implementation.
Also, it would be simpler to implement -t/--targets
first, since unlike -r/--regions
it only needs to check position, and not the variant length. To support the latter we probably need to store the variant length as a separate Zarr array in order to do efficient queries.
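A sketch of the simpler -t/--targets case, which only needs POS. It assumes variant_position is sorted within the contig (as in VCF Zarr), so a binary search suffices; -r/--regions would additionally need variant lengths to catch indels that start before the region but overlap it.

```python
import numpy as np

# Select variants with start <= POS <= end by binary search over the
# sorted variant_position array for one contig.
def targets_mask(positions, start, end):
    lo = np.searchsorted(positions, start, side="left")
    hi = np.searchsorted(positions, end, side="right")
    mask = np.zeros(len(positions), dtype=bool)
    mask[lo:hi] = True
    return mask

pos = np.array([100, 150, 200, 250, 300])
print(targets_mask(pos, 150, 250))  # [False  True  True  True False]
```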
It would be good to copy across some of the tests for the VCF writing code from sgkit. Hopefully we can do some easy round-trip tests using bio2zarr?
@tomwhite, what would be involved here?
To write directly to a VCF file.
The decimal value for FLT_MAX is 340282346638528859811704183484516925440, which is much larger than we currently support. We need some way to either support this, or to protect the underlying C code.
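One possible guard (an assumption, not a chosen design): check values on the Python side and reject finite values whose magnitude exceeds FLT_MAX before they reach the float32-based C code, rather than letting them silently overflow to infinity.

```python
import numpy as np

FLT_MAX = np.finfo(np.float32).max  # ~3.4028235e+38

# Hypothetical protective conversion: raise on finite float64 values
# that don't fit in float32, otherwise narrow to float32 for the C layer.
def to_float32_checked(values):
    values = np.asarray(values, dtype=np.float64)
    finite = np.isfinite(values)
    if np.any(np.abs(values[finite]) > FLT_MAX):
        raise ValueError("value exceeds float32 range")
    return values.astype(np.float32)
```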
When the user specifies a sample selection in vcztools view, vcztools recalculates the AC and AN INFO fields. This is consistent with bcftools' behavior. vcztools calculates these INFO fields using all of the samples in a variant-wise chunk of genotype data. The current implementation in pure Python using NumPy may be slow and create a lot of overhead. This issue is to improve the computation and memory efficiency. The solution may require calculating AC and AN in a C extension module.
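One direction to explore before reaching for C: a fully vectorised AC/AN computation over a chunk using a single bincount, avoiding per-variant Python loops. This is a sketch under assumed conventions (gt is a (variants, samples, ploidy) int array with -1 for missing alleles); a C extension could follow the same counting logic.

```python
import numpy as np

def compute_ac_an(gt):
    flat = gt.reshape(gt.shape[0], -1)
    # AN: number of called (non-missing) alleles per variant.
    an = (flat >= 0).sum(axis=1)
    n_alleles = int(flat.max()) + 1
    # Offset each variant's alleles so one bincount counts all variants.
    offsets = np.arange(flat.shape[0])[:, None] * n_alleles
    idx = (flat + offsets)[flat > 0]
    counts = np.bincount(idx, minlength=flat.shape[0] * n_alleles)
    # AC: per-variant counts of each ALT allele (columns 1..n-1).
    ac = counts.reshape(flat.shape[0], n_alleles)[:, 1:]
    return ac, an

gt = np.array([[[0, 1], [1, 1]],
               [[0, 0], [-1, 2]]])
ac, an = compute_ac_an(gt)
print(an.tolist())  # [4, 3]
print(ac.tolist())  # [[3, 0], [0, 1]]
```

Whether this beats a C extension in practice should be settled by profiling, as suggested above.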
The original code was added in #77.
We can use hypothesis-vcf to generate VCF files, convert to Zarr with vcf2zarr, then convert back using vcztools - and check that they are equivalent.
Add to CI like we have in bio2zarr
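Round-trip output can't be byte-identical: header provenance lines (tool versions, command lines, dates) will differ between the original and the regenerated VCF. A comparison helper could drop those lines before asserting equivalence; the prefixes listed here are assumptions about which lines vary.

```python
# Normalize VCF text for round-trip comparison by dropping header
# lines that legitimately differ between tools/runs.
def normalized_vcf_lines(text):
    drop_prefixes = ("##bcftools_", "##vcztools_", "##fileDate")
    return [line for line in text.splitlines()
            if not line.startswith(drop_prefixes)]

a = "##fileformat=VCFv4.3\n##vcztools_viewCommand=view x.vcz\n#CHROM\nchr1\t100"
b = "##fileformat=VCFv4.3\n#CHROM\nchr1\t100"
print(normalized_vcf_lines(a) == normalized_vcf_lines(b))  # True
```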
By prefixing the list (or file for #67) with a ^
character.
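A sketch of how the ^ complement could be parsed. Note that for plain inclusion bcftools honours the order given with -s, whereas this sketch keeps file order throughout; that detail would need checking.

```python
# bcftools-style sample selection: a leading '^' means "all samples
# except these" (complement of the comma-separated list).
def select_samples(all_samples, spec):
    complement = spec.startswith("^")
    names = set(spec.lstrip("^").split(","))
    if complement:
        return [s for s in all_samples if s not in names]
    return [s for s in all_samples if s in names]

samples = ["NA00001", "NA00002", "NA00003"]
print(select_samples(samples, "NA00002"))   # ['NA00002']
print(select_samples(samples, "^NA00002"))  # ['NA00001', 'NA00003']
```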
There is a difference between the vcztools' output and bcftools' output in the following commands:
vcztools view -s NA00001 vcz_test_cache/sample.vcf.vcz
bcftools view -s NA00001 tests/data/vcf/sample.vcf.gz
vcztools does not appear to write the INFO fields correctly.
This issue also affects #67.
vcztools view does not recalculate the AC and AN INFO fields like bcftools view does when the user specifies a sample selection.
--no-update option.

I didn't notice this when reviewing #49, but FilterExpressionParser and FilterExpressionEvaluator don't really belong in a general utils.py module; they should be moved to expression.py or filter.py (or similar).
There are quite a few options that bcftools supports: tab-delimited, VCF, BED (compressed and uncompressed) - we may not need all of them straightaway.
vcztools view needs to produce the appropriate header in all situations, and having an input header as an option is not really very helpful. We need to subset the samples and fields in all sorts of ways, and parsing the source header here is not the right approach. Header parsing should be done in bio2zarr.
Note that to do this fully, we will need to update the VCF Zarr specification and vcf2zarr in order to store sufficient information about the header, rather than the header itself. This will require some coordinated updates across repos.
I think the simplest approach here is to just delete the option for providing a complete header, and see what we're missing.
See:
In the same way that bcftools view does, e.g.:
##bcftools_viewVersion=1.13+htslib-1.13+ds
##bcftools_viewCommand=view /home/jk/work/github/bio2zarr/tests/data/vcf/sample.vcf.gz; Date=Tue Aug 6 10:48:28 2024
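A minimal sketch of generating the equivalent provenance line for vcztools. The key name and date format mirror the bcftools example above, but both are assumptions about what we'd want to emit.

```python
import time

# Build a bcftools-style provenance header line recording the command
# that produced the output (hypothetical key name).
def view_command_header(argv):
    date = time.strftime("%a %b %d %H:%M:%S %Y")
    return f"##vcztools_viewCommand={' '.join(argv)}; Date={date}"

print(view_command_header(["view", "sample.vcz"]))
```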
@tomwhite would you mind opening a PR where you copy the files vcf_writer.py and vcf_writer_utils.py into the vcztools
directory here? Don't bother making any changes, I just want to track that you've contributed the starting point for this code.
bcftools recomputes some INFO fields after subsetting by samples. We need to do what it does.
See also this option:
-I, --no-update
do not (re)calculate INFO fields for the subset (currently INFO/AC and INFO/AN)
For example:
vcztools query -f '%POS\n' sample.vcz
We don't need looping over samples ([] notation) yet: https://samtools.github.io/bcftools/bcftools.html#query
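A minimal sketch of expanding such format strings, covering a few site-level tags only (no [] sample looping). Tags map onto hypothetical per-record dicts here; a real implementation would read the Zarr arrays column-wise instead of building per-record dicts.

```python
import re

# Expand bcftools query-style format strings like '%CHROM\t%POS\n'
# by substituting each %TAG with the record's value for that tag.
def format_records(fmt, records):
    def render(record):
        return re.sub(r"%(\w+)", lambda m: str(record[m.group(1)]), fmt)
    return "".join(render(r) for r in records)

records = [{"CHROM": "chr1", "POS": 100}, {"CHROM": "chr1", "POS": 250}]
print(format_records("%CHROM\t%POS\n", records), end="")
```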
In #66, @tomwhite suggests that vcztools should fail if the user tries to write the output (which is uncompressed VCF) to a file path with an extension that does not correspond to uncompressed VCF (e.g. .bcf
or .vcz.gz
).
In bcftools, the user can specify both the output file (with -o
) and the output format (with -O
). If the user specifies an output type that is different from the output type indicated by the output file's extension, then bcftools uses the output type indicated by the extension.
Since vcztools only supports uncompressed VCF at the moment, vcztools should fail if the output type indicated by the output file's extension is not uncompressed VCF.
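A sketch of the proposed check: fail when the output path's extension implies a format other than uncompressed VCF. The extension list here is an assumption about which suffixes to reject.

```python
# Reject output paths whose extension indicates BCF or a compressed
# format, since vcztools currently only writes uncompressed VCF.
def check_output_path(path):
    disallowed = (".bcf", ".gz", ".bgz")
    if path.endswith(disallowed):
        raise ValueError(f"only uncompressed VCF output is supported: {path}")

check_output_path("out.vcf")  # OK: no exception raised
```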
I think we can post an initial 0.0.1 package to PyPI to make this easier to try out, and to squat on the name.
Hopefully we can copy the CD workflow from bio2zarr and plug into that?
Although it might be easier if I set up the initial PyPI package first manually, and then we get the CD workflow going for subsequent releases?
It's because the headers aren't matching due to the new local alleles support in bio2zarr. Should we just disable local alleles for the time being?