sgkit-dev / bio2zarr
Convert bioinformatics file formats to Zarr
License: Apache License 2.0
Sometimes we do want to clip. Probably the right thing to do here is to add min_value and max_value fields to the column schemas, and to np.clip() the values during sanitise.
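A minimal sketch of what that sanitise step might look like (the min_value/max_value schema fields and this sanitise_value function are assumptions about the eventual code, not existing API):

    import numpy as np

    def sanitise_value(value, column_schema):
        # Hypothetical: clip values into the [min_value, max_value] range
        # declared on the column schema, when either bound is set.
        lo = getattr(column_schema, "min_value", None)
        hi = getattr(column_schema, "max_value", None)
        if lo is not None or hi is not None:
            value = np.clip(value, lo, hi)
        return value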
It would be useful to partition input VCF files so that we can explode them in parallel. This can directly reuse the code in sgkit, which uses the indexes to partition files.
Note that we need to be careful about indels that span partitions. This is handled by filtering out variants that don't start in the region, as sgkit does here; a sketch of that filter is below.
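A sketch of the region-start filter, assuming a cyvcf2-style record iterator where record.POS is the variant's start position (region_start is the partition's start coordinate):

    def region_filter(records, region_start):
        # An indel spanning a partition boundary is emitted only by the
        # partition in which it starts, so no record is converted twice.
        for record in records:
            if record.POS >= region_start:
                yield record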
@tomwhite I think it would be simplest if we just copied the relevant code from sgkit over here. I can do this, but it would be nice if you got the due git-cred for it. Would you be interested in opening a PR that copies in the relevant code and tests with minimal modifications, and then I can factor it into the current structure?
It doesn't matter what files they go in - we could just make a "ported_vcf.py" module with the new code, and then copy the tests directly into the tests directory or something?
Would be very nice to support inspecting a set of VCFs and Zarrs also. Creating a similar summary table should be easy enough in both cases.
This should truncate the columns down to this size, so that we're not falsely reporting large unpopulated arrays.
Will require breaking up the fixed columns into slices, which is no harm anyway as they are quite slow.
Default chunk sizes are chosen arbitrarily at the moment. We're currently following sgkit's chunk_length=10_000, chunk_width=1_000 (see here), but we can probably do a bit better than that.
Following sgkit, we need:

    # "source" attribute records which version of sgkit the dataset was
    # created with, following CF Conventions
    "source": f"sgkit-{sg.__version__}",
Currently not testing how we handle multiple VCFs.
For validation we should test chunking up the file in various ways, and then validate the final product against the original file.
We're using Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) for contig_length in the Zarr; presumably we're not setting the value somewhere.
Currently have:
Usage: python -m bio2zarr vcf2zarr [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
convert Convert input VCF(s) directly to vcfzarr (not recommended for...
encode Encode intermediate format (see explode) to vcfzarr
explode Convert VCF(s) to columnar intermediate format
inspect Inspect an intermediate format file
mkschema Generate a schema for zarr encoding
validate
which is confusing
To distribute the work of explode and encode it would be useful to run commands such as:

    bio2zarr vcf2zarr explode --job 5 --total-jobs 50 INFILES OUTDIR

where the --job argument would be different on each worker.
Implementation-wise, I think this would mean breaking explode into three parts (sketched after this list):

explode start --total-jobs M - a single local task that does the scan and writes the VCF partition metadata to OUTDIR
explode write --job N - a parallel job that does the conversion; reads the metadata from OUTDIR and writes a partition summary
explode finalise - a single local job that reads the partition summaries and writes the updated metadata, erroring out if a partition summary is missing

If none of these are specified then the usual explode proceeds locally as it does now.
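A minimal sketch of how the three phases might hang together (the function names, metadata layout, and the scan_vcfs/convert_partition helpers are illustrative, not the actual implementation):

    import json
    import pathlib

    def explode_start(vcfs, outdir, total_jobs):
        # Scan the inputs once and persist the partition plan so that
        # independent worker processes can pick up their share later.
        partitions = scan_vcfs(vcfs, num_partitions=total_jobs)  # hypothetical helper
        meta = {"total_jobs": total_jobs, "partitions": partitions}
        (pathlib.Path(outdir) / "metadata.json").write_text(json.dumps(meta))

    def explode_write(outdir, job):
        # Each worker converts only its own partition and records a summary.
        meta = json.loads((pathlib.Path(outdir) / "metadata.json").read_text())
        summary = convert_partition(meta["partitions"][job], outdir)  # hypothetical helper
        (pathlib.Path(outdir) / f"summary_{job}.json").write_text(json.dumps(summary))

    def explode_finalise(outdir):
        # Fail loudly if any worker's summary is missing, then merge.
        meta = json.loads((pathlib.Path(outdir) / "metadata.json").read_text())
        for job in range(meta["total_jobs"]):
            if not (pathlib.Path(outdir) / f"summary_{job}.json").exists():
                raise ValueError(f"Partition summary for job {job} is missing")
        # ... merge the summaries into the final metadata here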
Should be as simple as:

    from tqdm.auto import tqdm

Will need to check how well it works across both environments (terminal and notebook).
Should wait until we've abstracted out the "scan" bit to use the standard parallel worker pattern
Make a new plink.py file and a corresponding test_plink.py, and do a minimal test based on an example plink file copied from sgkit.
Currently we're using the ThreadedZarrEncoder to handle writing zarr chunks out, and assigning one process per column. This doesn't work very well for a few reasons. In particular, progress is very misleading when you use a lot of processes.
It would be better to encode columns in vertical slices, assigning start and end coords (chunk aligned) to each worker process. These could then be chunk-buffered and written out incrementally, synchronously. This would be much simpler.
To do this we would need start, stop values for the iter_values method (which we can do using the per-partition counts). We might also need per-chunk counts to avoid reading in lots of chunks unnecessarily.
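A sketch of the vertical-slice idea (the iter_values(start, stop) signature and this encode_slice function are assumptions about the shape of the code, not current API):

    import numpy as np

    def encode_slice(zarr_array, column, start, stop):
        # start and stop are chunk-aligned variant coordinates assigned to
        # this worker; we fill one chunk-sized buffer at a time and write
        # it out synchronously, so progress maps directly onto work done.
        chunk_length = zarr_array.chunks[0]
        assert start % chunk_length == 0
        buffer = np.empty((chunk_length, *zarr_array.shape[1:]), dtype=zarr_array.dtype)
        j = 0
        chunk_start = start
        for value in column.iter_values(start, stop):  # hypothetical signature
            buffer[j] = value
            j += 1
            if j == chunk_length:
                zarr_array[chunk_start : chunk_start + j] = buffer
                chunk_start += j
                j = 0
        if j > 0:
            zarr_array[chunk_start : chunk_start + j] = buffer[:j]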
Keeping track of progress is also an issue. Current approach is not working all that well (related: #48 )
It's not clear how much we're depending on properties of the indexes produced by htslib. I think everything in the test suite has been produced by bcftools index or tabix, so it would be good to check if we get the same results using indexes generated by htsjdk, for example.
We can now do partitioning within VCF files using the utilities ported over from sgkit.
The info_field_type_combos.vcf.gz file copied from sgkit has a lot of wacky corner cases, and is a pretty exhaustive summary of the things that can happen. It currently fails in various ways.
Also fill out all the corner cases in validate.
The code for inspecting tabix files etc imported from sgkit in #22 depends on fsspec.
If fsspec is a light dependency, then we can just include it. If not, then it's probably simplest to mock it out for now, while evaluating whether it's worth updating the code to use fsspec more generally.
To me, any file that's small enough to convert reliably over the network could just be downloaded and converted. Being dependent on tens of gigabytes downloading without problems seems like an unnecessary complexity to add to this tool. Likewise for writing directly to cloud stores - handling failures and retries etc well seems like a pretty tricky thing which we can reasonably delegate to dedicated tools.
I find it unhelpful adding different dimensions here which don't actually line up with the named xarray dimensions. I think we should use the named dimensions here in some way, rather than "length" and "width".
Options:
I don't really know what the best choice is here.
Any thoughts @tomwhite @benjeffery?
Currently all the code is in the vcf.py file. It would be good to make a plink.py file, and maybe one other core.py file which takes care of the business of BufferedArrays etc.
We're currently putting all the tests and test data into the distribution. There's no point in this.
The all_info_fields example VCF doesn't have enough coverage on Number="." fields where we need to do dimension padding. Probably simplest to update upstream in sgkit to make sure it's handling these fields OK as well.
I'm not sure I see much point in supporting a slice of partitions in the distributed explode operation. If you're distributing across a cluster anyway, you may as well just use a simple array job and provision one core per partition. There's nothing really much to be gained from doing multiple partitions per job. Also, there's no point in over-partitioning.
It's also quite a bit easier to explain...
Am I missing something @benjeffery?
In #91 we added the ability to store the compressor used for the ICF, and switched the default to LZ4. In my example, I had 2m13s and 100M ICF with zstd, and 56s and 214M ICF with LZ4. So, roughly half the time and twice the size. There was no real difference in the zarr encode time in either case (which isn't surprising, because both LZ4 and zstd are super fast at decoding).
Given that this is an intermediate format and we're not going to keep it around for long, it seems OK to use the faster and less compact option as the default. However, we can totally provide the option of specifying the compressor.
So, I'd suggest we add the option --compressor=[list of Blosc compressors] to the explode and dexplode-init commands.
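A minimal sketch of mapping that option onto numcodecs codecs (the ICF_COMPRESSORS table and get_compressor function are illustrative; Blosc and its cname/clevel arguments are real numcodecs API):

    import numcodecs

    ICF_COMPRESSORS = {
        "lz4": numcodecs.Blosc(cname="lz4", clevel=7),
        "zstd": numcodecs.Blosc(cname="zstd", clevel=7),
    }

    def get_compressor(name="lz4"):
        # "lz4" stays the default, per the timings above.
        return ICF_COMPRESSORS[name]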
Rich provides some really nice utilities for CLIs, and helps give a CLI real polish. Using rich would definitely make things look nicer.
I tried out rich progress bars and log formatting in #64. The results are very pretty, but I found the progress bars lacking in some key features that I find useful:
So, until these key features are supported in rich progress bars I think we'll stick with tqdm.
Some programs don't do a great job of typing or sizing the output VCF fields. E.g., this is some SnpEff output (I think)
##INFO=<ID=GDI-Phred,Number=.,Type=String,Description="Phred-scaled GDI scores">
##INFO=<ID=GDI,Number=.,Type=String,Description="gene damage index score, a genome-wide, gene-level metric of the mutational damage that has accumulated in the general population from doi 1>
This has been marked as String, Number=".", where it probably should be Integer, Number=1.
One way we could do this is to check the correspondence between the VCF types in the VCF metadata with the types in the Zarr schema, and to push that field through a "conversion" code path if they are not compatible. So, here, we'd see that the original VCF type is String and the Zarr type is int, and we'd have to pass the field through an integer parsing code path.
A nice feature to add at some point then would be to take an input Zarr schema and to automatically detect some common mistyped fields and correct them.
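A sketch of the conversion code path for the example above (this helper is hypothetical; the real sanitiser would also need to handle vector values and per-sample fields):

    def convert_string_to_int(value, fill=-1):
        # The VCF header declares Type=String but the schema says integer:
        # parse each string, mapping missing or unparseable values to fill.
        try:
            return int(value)
        except (TypeError, ValueError):
            return fill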
See assert_part_counts_non_zero in test_vcf_utils and where it's called. It's in the file CEUTrio.20.21.gatk3.4.g.vcf.bgz.tbi. Not sure whether this is a general thing, or just happens with this small file.
On lots of files we seem to get a disappearing progress bar for encode 1D, like:
Encode 1D: 1.47kchunks [00:09, 150chunks/s]
validation-tmp/1kg_2020_chr20_annotations.vcf.gz.zarr
We often don't completely close the progress meter, like
Scan: 50%|█████████████████████ | 1.00/2.00 [00:00<00:00, 21.8files/s
Will be something to do with how we're joining the progress thread.
We're not currently testing this; we need to do so.
It seems likely that providing multiple VCFs that have overlapping intervals would be an error. We should detect this in explode_init and raise an exception.
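A sketch of the check, assuming each input VCF contributes a (contig, start, end) region derived from its index:

    def check_overlapping_regions(regions):
        # regions: list of (contig, start, end) tuples, one per input VCF.
        # After sorting by contig and start, any region that begins before
        # the previous region on the same contig has ended is an overlap.
        last_end = {}
        for contig, start, end in sorted(regions):
            if contig in last_end and start < last_end[contig]:
                raise ValueError(f"Input VCFs overlap on {contig} at position {start}")
            last_end[contig] = end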
Need:

    vcf_zarr_version   0.2
    vcf_header         The VCF header from ##fileformat to #CHROM inclusive, stored as a single string.
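A sketch of setting these as root attributes on the output store (raw_header_string here stands in for the header text captured during conversion):

    import zarr

    raw_header_string = "##fileformat=VCFv4.3\n...\n#CHROM\t..."  # captured during conversion

    root = zarr.open("out.zarr", mode="a")
    root.attrs["vcf_zarr_version"] = "0.2"
    # The full header, from ##fileformat to the #CHROM line inclusive.
    root.attrs["vcf_header"] = raw_header_string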
The Python API that's emerging is to do things like:

    from bio2zarr import vcf
    vcf.convert(["my_file.vcf.gz"], "out.zarr")

It would probably be better to do it like:

    from bio2zarr import vcf2zarr
    vcf2zarr.convert(["my_file.vcf.gz"], "out.zarr")

Then, all the functions exported are in that flat namespace vcf2zarr. How stuff is split across files within the package is private.
And be more systematic about setting them.
We can just raise an error if the version is not equal to the current version for now.
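A minimal sketch of that check (ICF_VERSION and the metadata key are assumed names):

    ICF_VERSION = "0.2"  # assumed constant for the current format version

    def check_format_version(metadata):
        version = metadata.get("format_version")
        if version != ICF_VERSION:
            raise ValueError(
                f"Unsupported format version {version}; expected {ICF_VERSION}"
            )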
Not currently getting the dimensions of String FORMAT fields appropriately.
Sgkit has some better defaults for VCF columns. See here
Doesn't look like these will make a big difference though. Are there any defaults for FORMAT fields like GT, PL etc?
See also #19
The current plink conversion is very much a proof of concept, just doing the actual genotypes. We need to convert all the other stuff too, including providing APIs to open the linked files.
Tests added in #24 are currently xfailing. Probably something pretty basic.
We use missing files as a way of ensuring integrity. We currently don't tell the user anything useful when those files are missing.
Should define some exceptions which capture the situation, and make sure that some useful information makes its way back to the user (when say, the ICF hasn't completed exploding)
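A sketch of the kind of exceptions we might define (all names here are illustrative; path is a pathlib.Path):

    class Bio2ZarrError(Exception):
        """Base class for errors reported to the user."""

    class IncompleteICFError(Bio2ZarrError):
        """Raised when the intermediate columnar format is missing files,
        e.g. because an explode was interrupted before finalising."""

    def check_icf_complete(path, expected_files):
        missing = [name for name in expected_files if not (path / name).exists()]
        if missing:
            raise IncompleteICFError(
                f"ICF at {path} is incomplete (missing {missing}); "
                "has the explode finished?"
            )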
Might be nice to parallelise this and add progress
Variant level annotations are often included as INFO tags with substructure, e.g.
##SnpEffVersion="4.3i (build 2016-12-15 22:33), by Pablo Cingolani"
##SnpEffCmd="SnpEff -noStats -lof GRCh38.86 /gpfs/commons/home/usevani/compbio/CCDG/Project_CCDG_14151_B01_GRM_WGS/final_annotated_vcfs/tmp_dir_annt/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.annotated.normalize.vcf.gz "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA>
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
It would be very helpful and useful to split these into their own Zarr arrays. We could add this as an option, like --parse-snpeff or something (I'm not sure how stable these formats are across versions, etc, though).
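A sketch of the split this would involve for the ANN field (the field names are taken from the header above; only the first few are listed here):

    ANN_FIELDS = ["Allele", "Annotation", "Annotation_Impact", "Gene_Name"]
    # ... remaining field names from the ANN header description

    def parse_ann(ann_value):
        # ANN is a comma-separated list of annotations, each of which is a
        # '|'-separated record; return one dict per annotation.
        annotations = []
        for entry in ann_value.split(","):
            parts = entry.split("|")
            annotations.append(dict(zip(ANN_FIELDS, parts)))
        return annotations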
Currently missing
On recent 1000 genomes data, we have the following:
39G 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz
The zarr using defaults is:
43G tmp/1kg_chr20_all.zarr
which is 4G more.
This is dominated by the FORMAT cols:
1.1G tmp/1kg_chr20_all.zarr/call_AB
7.1G tmp/1kg_chr20_all.zarr/call_AD
4.5G tmp/1kg_chr20_all.zarr/call_DP
174M tmp/1kg_chr20_all.zarr/call_genotype
31M tmp/1kg_chr20_all.zarr/call_genotype_mask
4.7M tmp/1kg_chr20_all.zarr/call_genotype_phased
6.2G tmp/1kg_chr20_all.zarr/call_GQ
4.7M tmp/1kg_chr20_all.zarr/call_MIN_DP
4.7M tmp/1kg_chr20_all.zarr/call_MQ0
552M tmp/1kg_chr20_all.zarr/call_PGT
966M tmp/1kg_chr20_all.zarr/call_PID
22G tmp/1kg_chr20_all.zarr/call_PL
4.7M tmp/1kg_chr20_all.zarr/call_RGQ
4.7M tmp/1kg_chr20_all.zarr/call_SB
In particular, call_PL is 22G. Hopefully there's some reasonably straightforward combination of filters and compressor options that'll bring this down.
Currently we have:
Usage: python -m bio2zarr vcf2zarr [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
convert
explode
genspec
summarise
to-zarr
validate
I think convert is a good name for the top-level all-in-one command (so, plink2zarr convert would just work).
We should probably change genspec to mkschema (or something; generally changing "spec" to "schema").
Not sure about summarise. Maybe there should be an inspect command that will look at either an exploded intermediate format or zarr and return some info?
I dislike to-zarr quite a lot. Maybe encode or something?
For 1000 Genomes chr20 we get this:
name type chunks size compressed max_n min_val max_val
------------------------ ------- -------- --------- ------------ ------- --------- ---------
CHROM String 160 1.25 MB 138.06 KB 1 n/a n/a
POS Integer 160 606.64 KB 4.43 MB 1 6e+04 6.4e+07
QUAL Float 160 606.64 KB 11.33 MB 1 30 1.2e+08
ID String 160 16 KB 9.6 KB 0 n/a n/a
FILTERS String 160 1.41 MB 739.29 KB 1 n/a n/a
REF String 160 1.32 MB 3.57 MB 1 n/a n/a
ALT String 160 1.91 MB 4.82 MB 6 n/a n/a
FORMAT/AB Float 1097 178.45 MB 584.71 MB 1 0.04 0.95
FORMAT/AD Integer 2496 466.49 MB 7.78 GB 7 0 5.8e+03
FORMAT/DP Integer 1224 204.99 MB 6.08 GB 1 0 5.8e+03
FORMAT/GQ Integer 1224 204.99 MB 5.25 GB 1 0 99
FORMAT/GT Integer 1788 307.2 MB 417.48 MB 3 -1 6
FORMAT/MIN_DP Integer 160 16 KB 9.6 KB 0 n/a n/a
FORMAT/MQ0 Integer 160 16 KB 9.6 KB 0 n/a n/a
FORMAT/PGT String 849 46.13 MB 100.24 MB 1 n/a n/a
FORMAT/PID String 849 59.74 MB 129.69 MB 1 n/a n/a
FORMAT/PL Integer 3992 807.39 MB 18.42 GB 28 0 1.7e+05
FORMAT/RGQ Integer 160 16 KB 9.6 KB 0 n/a n/a
FORMAT/SB Integer 160 16 KB 9.6 KB 0 n/a n/a
Note that cols like FORMAT/DP etc are reporting as much larger compressed than uncompressed. Investigate.
Probably related to #51
We don't currently use the VCF Numbers R, A and G, but instead determine dimensions from the observed data.
A minimal thing we need to do is update the dimensions for these fields as described in the spec.
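For reference, a sketch of what those Numbers imply for the field dimensions (per the VCF spec; for diploids G works out as n(n+1)/2 for n alleles):

    import math

    def field_dimension(vcf_number, num_alleles, ploidy=2):
        # A = one value per alt allele, R = one per allele (ref + alts),
        # G = one per possible genotype.
        if vcf_number == "A":
            return num_alleles - 1
        if vcf_number == "R":
            return num_alleles
        if vcf_number == "G":
            return math.comb(num_alleles + ploidy - 1, ploidy)
        raise ValueError(f"Unhandled Number: {vcf_number}")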
Click supports shell completion, which we should document how to set up.
It should work for basic things like paths, but we can look at how to autocomplete things like column names (if CLI commands take them as parameters)
I've run into this issue a couple of times while attempting to convert 1000G chr22 VCF data to Zarr format. The first time I tried calling vcf2zarr convert ... directly. The second time I did it in the staggered approach suggested in the documentation. However, it still produced this error at the encode stage, particularly when encoding 2D matrices:
(bio2zarr_env) [szabad@narval2 bio2zarr_experiments]$ vcf2zarr encode -p 4 converted/chr22.exploded converted/chr22.zarr
Encode 1D: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 19.7k/19.7k [09:35<00:00, 34.2chunks/s]
Encode 2D: 79%|███████████████████████████████████████████████████████████████████████████████▎ | 8.49k/10.8k [28:22<29:35, 1.30chunks/s]Traceback (most recent call last):
File "/home/szabad/bio2zarr_env/bin/vcf2zarr", line 8, in <module>
sys.exit(vcf2zarr())
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/cli.py", line 127, in encode
vcf.encode(
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1538, in encode
SgvcfZarr.encode(
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1498, in encode
pwm.submit(
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/core.py", line 219, in __exit__
wait_on_futures(self.futures)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/core.py", line 56, in wait_on_futures
raise exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I'm running this on a cluster and it's possible that the processes are killed silently if they use more resources than they should. But in this case, I'm just using 4 workers and it didn't seem like they were using lots of memory.
Also a general question related to this: Do you think it's possible to pick up the encoding work from where it left off if things like this happen instead of starting over?