Giter Site home page Giter Site logo

vcztools's Introduction

vcztools

Partial reimplementation of bcftools for VCF Zarr

This is an early prototype: DO NOT USE!

vcztools's People

Contributors

jeromekelleher avatar tomwhite avatar will-tyler avatar jakehagen avatar

Stargazers

Jeff Hammerbacher avatar

Watchers

 avatar  avatar

vcztools's Issues

Add support for reading samples from a file (-S)

This is needed when there are lots of samples, because command lines get too long for the shell.

Note that we should be careful to match the semantics of bcftools here when the file is empty (see also #61)

Move to sgkit-dev?

Shall we move to sgkit-dev so we can reduce the bus-factor @tomwhite? I'll be away for a few weeks and I don't want to block you and others from contributing.

Add basic exclude/include expression support

bcftools has a mini-language of expressions which support inclusion/exclusion of variants in the output via the -i and -e operators.

We would like to support a small subset of this initially, so that we can support the query that we used in the VCF Zarr preprint ( bcftools view -I –include "FORMAT/DP>10 & FORMAT/GQ>20).

Supporting the entire mini-language is probably a non-goal, and we should restrict to supporting pieces of it as we go as they are needed. Nonetheless, some initial analysis on what's involved in supporting the various aspects of the language would be very helpful.

Performance problems

While performance on large genotype only VCFs is excellent (better than bcftools in terms of throughput in my tests), it is quite poor on complicated VCFs. For chromosome 2 on the recent 1000 genomes data I'm getting less than 1MB per second, which is 50X less than we need (bcftools view is doing around 60 MB/s).

My sense is that it's probably not worth chasing perf here using numba. Jax also doesn't seem like a good fit. I'm actually inclined to write C extension that follows the logic of the current buffer-based numba approach, as I think it would be less work in the long run, get rid of the nasty latency issues involved in JIT compiling. For something like this, I think a well written C extension is less maintenance work than fancy python based stuff. Once you write a C extension, it really doesn't need much maintenance.

We should do some profiling first to see where the bottlenecks are, though.

Drop numba implementation

I kept this lying around while coding up the C version for comparison. I don't think there's much left to translate over now though, as they take quite different approaches and we can let go of the numba code now.

Performance tests

It would be good to have some tests to check the performance of various vcztools commands. The thing to check is that running highly-selective queries (i.e. ones that just return a few records) on large datasets is fast.

They could be like the ones @jeromekelleher ran in #8. These are not amenable to running on CI, so it's probably fine to have some instructions to run them manually on 1000G chromosome 2 or 20 data.

Basic regions support

We should add the equivalent of the bcftools -r/--regions option to filter by regions.

The work in sgkit-dev/sgkit#658 (which was never merged) could form the basis for the implementation.

Also, it would be simpler to implement -t/--targets first, since unlike -r/--regions it only needs to check position, and not the variant length. To support the latter we probably need to store the variant length as a separate Zarr array in order to do efficient queries.

Test suite

It would be good to copy across some of the tests for the VCF writing code from sgkit. Hopefully we can do some easy round-trip tests using bio2zarr?

@tomwhite, what would be involved here?

ftoa fails (and unsafe) for floats > INT32_MAX

The decimal value for FLT_MAX is 340282346638528859811704183484516925440, which is much larger than we currently support. We need some way to either support this, or to protect the underlying C code.

INFO field computation performance

Description

When the user specifies a sample selection in vcztools view, vcztools recalculates the AC and AN INFO fields. This is consistent with bcftools' behavior. vcztools calculates these INFO fields using all of the samples in a variant-wise chunk of genotype data. The current implementation in pure Python using NumPy may be slow and create a lot of overhead. This issue is to improve the computation and memory efficiency. The solution may require calculating AC and AN in a C extension module.

The original code was added in #77.

References

Roundtrip testing with hypothesis-vcf

We can use hypothesis-vcf to generate VCF files, convert to Zarr with vcf2zarr, then convert back using vcztools - and check that they are equivalent.

Filtering with samples not outputing info fields correctly

Description

There is a difference between the vcztools' output and bcftools' output in the following commands:

vcztools view -s NA00001 vcz_test_cache/sample.vcf.vcz
bcftools view -s NA00001 tests/data/vcf/sample.vcf.gz

vcztools does not appear to write the INFO fields correctly.

This issue also affects #67.

Root cause

vcztools view does not recalculate the AC and AN INFO fields likebcftools view does when the user specifies a sample selection.

References

Move filter expression code to separate module

I didn't notice this when reviewing #49, but FilterExpressionParser and FilterExpressionEvaluator don't really belong in a general utils.py module, but should be moved to expression.py or filter.py (or similar).

Remove vcf_header option from write_vcf

vcztools view needs to produce the appropriate header in all situations, and having an input header as an option is not really very helpful. We need to subset the samples and fields in all sorts of ways, and parsing the source header here is not the right approach. Header parsing should be done in bio2zarr.

Note that do this fully, we will need to update the VCF Zarr specification and vcf2zarr in order to store sufficient information about the header, rather than the header itself. This will require some coordinated updates across repos.

I think the simplest approach here is to just delete the option for proving a complete header here, and see what we're missing.

See:

Output command line provenance in the header

In the same way that bcftools view does, e.g.:

##bcftools_viewVersion=1.13+htslib-1.13+ds
##bcftools_viewCommand=view /home/jk/work/github/bio2zarr/tests/data/vcf/sample.vcf.gz; Date=Tue Aug  6 10:48:28 2024

Copy in sgkit vcf writer code

@tomwhite would you mind opening a PR where you copy the files vcf_writer.py and vcf_writer_utils.py into the vcztools directory here? Don't bother making any changes, I just want to track that you've contributed the starting point for this code.

Recompute INFO/AC INFO/AN when samples specified.

bcftools recomputes some INFO fields after subsetting by samples. We need to do what it does.

See also this option:

-I, --no-update
do not (re)calculate INFO fields for the subset (currently INFO/AC and INFO/AN)

Fail if output type does not match file extension

Description

In #66, @tomwhite suggests that vcztools should fail if the user tries to write the output (which is uncompressed VCF) to a file path with an extension that does not correspond to uncompressed VCF (e.g. .bcf or .vcz.gz).

In bcftools, the user can specify both the ouput file (with -o) and the output format (with -O). If the user specifies an output type that is different from the output type indicated by the output file's extension, then bcftools uses the output type indicated by the extension.

Since vcztools only supports uncompressed VCF at the moment, vcztools should fail if the output type indicated by the output file's extension is not uncompressed VCF.

References

Add cd workflow and post initial package

I think we can post an initial 0.0.1 package to pypi to make this easier to try out/squat on the name.

Hopefully we can copy the cd workflow from bio2zarr and plug in to that?

Although it might be easier if I set up the initial pypi package first manually, and then we get the cd workflow going for subsequent releases?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.