sgkit-dev / bio2zarr
Convert bioinformatics file formats to Zarr
License: Apache License 2.0
Sometimes we do want to clip. Probably the right thing to do here is to add min_value and max_value fields to the column schemas, and to np.clip() the values during sanitise.
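A minimal sketch of what that sanitise step might look like (the min_value/max_value schema fields and this sanitise_value function are assumptions about the eventual code, not existing API):

    import numpy as np

    def sanitise_value(value, column_schema):
        # Hypothetical: clip values into the [min_value, max_value] range
        # declared on the column schema, when either bound is set.
        lo = getattr(column_schema, "min_value", None)
        hi = getattr(column_schema, "max_value", None)
        if lo is not None or hi is not None:
            value = np.clip(value, lo, hi)
        return value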
It would be useful to partition input VCF files so that we can explode them in parallel. This can directly reuse the code in sgkit, which uses the indexes to partition files.
Note that we need to be careful about indels that span partitions. This is handled by filtering out variants that don't start in the region, as sgkit does here; a sketch of that filter is below.
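A sketch of the region-start filter, assuming a cyvcf2-style record iterator where record.POS is the variant's start position (region_start is the partition's start coordinate):

    def region_filter(records, region_start):
        # An indel spanning a partition boundary is emitted only by the
        # partition in which it starts, so no record is converted twice.
        for record in records:
            if record.POS >= region_start:
                yield record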
@tomwhite I think it would be simplest if we just copied the relevant code from sgkit over here. I can do this, but it would be nice if you got the due git-cred for it. Would you be interested in opening a PR that copies in the relevant code and tests with minimal modifications, and then I can factor it into the current structure?
It doesn't matter what files they go in - we could just make a "ported_vcf.py" module with the new code, and then copy the tests directly into the tests directory or something?
Would be very nice to support inspecting a set of VCFs and Zarrs also. Creating a similar summary table should be easy enough in both cases.
This should truncate the columns down to this size, so that we're not falsely reporting large unpopulated arrays.
Will require breaking up the fixed columns into slices, which is no harm anyway as they are quite slow.
Default chunk sizes are chosen arbitrarily at the moment. We're currently following sgkit's chunk_length=10_000, chunk_width=1_000 (see here), but we can probably do a bit better than that.
Following sgkit, we need:

    # "source" attribute records which version of sgkit the dataset was
    # created with, following CF Conventions
    "source": f"sgkit-{sg.__version__}",
Currently not testing how we handle multiple VCFs.
For validation we should test chunking up the file in various ways, and then validate the final product against the original file.
We're using Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) for contig_length in the Zarr; presumably we're not setting the value somewhere.
Currently have:
Usage: python -m bio2zarr vcf2zarr [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
convert Convert input VCF(s) directly to vcfzarr (not recommended for...
encode Encode intermediate format (see explode) to vcfzarr
explode Convert VCF(s) to columnar intermediate format
inspect Inspect an intermediate format file
mkschema Generate a schema for zarr encoding
validate
which is confusing
To distribute the work of explode and encode it would be useful to run commands such as:

    bio2zarr vcf2zarr explode --job 5 --total-jobs 50 INFILES OUTDIR

where the --job argument would be different on each worker.
Implementation-wise, I think this would mean breaking explode into three parts (sketched after this list):

explode start --total-jobs M - a single local task that does the scan and writes the VCF partition metadata to OUTDIR
explode write --job N - a parallel job that does the conversion; reads the metadata from OUTDIR and writes a partition summary
explode finalise - a single local job that reads the partition summaries and writes the updated metadata, erroring out if a partition summary is missing

If none of these are specified then the usual explode proceeds locally as it does now.
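A minimal sketch of how the three phases might hang together (the function names, metadata layout, and the scan_vcfs/convert_partition helpers are illustrative, not the actual implementation):

    import json
    import pathlib

    def explode_start(vcfs, outdir, total_jobs):
        # Scan the inputs once and persist the partition plan so that
        # independent worker processes can pick up their share later.
        partitions = scan_vcfs(vcfs, num_partitions=total_jobs)  # hypothetical helper
        meta = {"total_jobs": total_jobs, "partitions": partitions}
        (pathlib.Path(outdir) / "metadata.json").write_text(json.dumps(meta))

    def explode_write(outdir, job):
        # Each worker converts only its own partition and records a summary.
        meta = json.loads((pathlib.Path(outdir) / "metadata.json").read_text())
        summary = convert_partition(meta["partitions"][job], outdir)  # hypothetical helper
        (pathlib.Path(outdir) / f"summary_{job}.json").write_text(json.dumps(summary))

    def explode_finalise(outdir):
        # Fail loudly if any worker's summary is missing, then merge.
        meta = json.loads((pathlib.Path(outdir) / "metadata.json").read_text())
        for job in range(meta["total_jobs"]):
            if not (pathlib.Path(outdir) / f"summary_{job}.json").exists():
                raise ValueError(f"Partition summary for job {job} is missing")
        # ... merge the summaries into the final metadata here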
Should be as simple as:

    from tqdm.auto import tqdm

Will need to check how well it works across both environments (terminal and notebook).
Should wait until we've abstracted out the "scan" bit to use the standard parallel worker pattern
Make a new plink.py file and a corresponding test_plink.py, and do a minimal test based on an example plink file copied from sgkit.
Currently we're using the ThreadedZarrEncoder to handle writing zarr chunks out, and assigning one process per column. This doesn't work very well for a few reasons. In particular, progress is very misleading when you use a lot of processes.
It would be better to encode columns in vertical slices, assigning start and end coords (chunk aligned) to each worker process. These could then be chunk-buffered and written out incrementally, synchronously. This would be much simpler.
To do this we would need start, stop values for the iter_values method (which we can do using the per-partition counts). We might also need per-chunk counts to avoid reading in lots of chunks unnecessarily.
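A sketch of the vertical-slice idea (the iter_values(start, stop) signature and this encode_slice function are assumptions about the shape of the code, not current API):

    import numpy as np

    def encode_slice(zarr_array, column, start, stop):
        # start and stop are chunk-aligned variant coordinates assigned to
        # this worker; we fill one chunk-sized buffer at a time and write
        # it out synchronously, so progress maps directly onto work done.
        chunk_length = zarr_array.chunks[0]
        assert start % chunk_length == 0
        buffer = np.empty((chunk_length, *zarr_array.shape[1:]), dtype=zarr_array.dtype)
        j = 0
        chunk_start = start
        for value in column.iter_values(start, stop):  # hypothetical signature
            buffer[j] = value
            j += 1
            if j == chunk_length:
                zarr_array[chunk_start : chunk_start + j] = buffer
                chunk_start += j
                j = 0
        if j > 0:
            zarr_array[chunk_start : chunk_start + j] = buffer[:j]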
Keeping track of progress is also an issue. Current approach is not working all that well (related: #48 )
It's not clear how much we're depending on properties of the indexes produced by htslib. I think everything in the test suite has been produced by bcftools index or tabix, so it would be good to check if we get the same results using indexes generated by htsjdk, for example.
We can now do partitioning within VCF files using the utilities ported over from sgkit.
The info_field_type_combos.vcf.gz file copied from sgkit has a lot of wacky corner cases, and is a pretty exhaustive summary of the things that can happen. It currently fails in various ways.
Also fill out all the corner cases in validate.
The code for inspecting tabix files etc imported from sgkit in #22 depends on fsspec.
If fsspec is a light dependency, then we can just include it. If not, then it's probably simplest to mock it out for now, while evaluating whether it's worth updating the code to use fsspec more generally.
To me, any file that's small enough to convert reliably over the network could just be downloaded and converted. Being dependent on tens of gigabytes downloading without problems seems like an unnecessary complexity to add to this tool. Likewise for writing directly to cloud stores - handling failures and retries etc well seems like a pretty tricky thing which we can reasonably delegate to dedicated tools.
I find it unhelpful adding different dimensions here which don't actually line up with the named xarray dimensions. I think we should use the named dimensions here in some way, rather than "length" and "width".
Options:
I don't really know what the best choice is here.
Any thoughts @tomwhite @benjeffery?
Currently all the code is in the vcf.py file. It would be good to make a plink.py file, and maybe one other core.py file which takes care of the business of BufferedArrays etc.
We're currently putting all the tests and test data into the distribution. There's no point in this.
The all_info_fields example VCF doesn't have enough coverage on Number="." fields where we need to do dimension padding. Probably simplest to update upstream in sgkit to make sure it's handling these fields OK as well.
I'm not sure I see much point in supporting a slice of partitions in the distributed explode operation. If you're distributing across a cluster anyway, you may as well just use a simple array job and provision one core per partition. There's nothing really much to be gained from doing multiple partitions per job. Also, there's no point in over-partitioning.
It's also quite a bit easier to explain...
Am I missing something @benjeffery?
In #91 we added the ability to store the compressor used for the ICF, and switched the default to LZ4. In my example, I had 2m13s and 100M ICF with zstd, and 56s and 214M ICF with LZ4. So, roughly half the time and twice the size. There was no real difference in the zarr encode time in either case (which isn't surprising, because both LZ4 and zstd are super fast at decoding).
Given that this is an intermediate format and we're not going to keep it around for long, it seems OK to use the faster and less compact option as the default. However, we can totally provide the option of specifying the compressor.
So, I'd suggest we add the option --compressor=[list of Blosc compressors] to the explode and dexplode-init commands.
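A minimal sketch of mapping that option onto numcodecs codecs (the ICF_COMPRESSORS table and get_compressor function are illustrative; Blosc and its cname/clevel arguments are real numcodecs API):

    import numcodecs

    ICF_COMPRESSORS = {
        "lz4": numcodecs.Blosc(cname="lz4", clevel=7),
        "zstd": numcodecs.Blosc(cname="zstd", clevel=7),
    }

    def get_compressor(name="lz4"):
        # "lz4" stays the default, per the timings above.
        return ICF_COMPRESSORS[name]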
Rich provides some really nice utilities for CLIs, and helps give a CLI real polish. Using rich would definitely make things look nicer.
I tried out rich progress bars and log formatting in #64. The results are very pretty, but I found the progress bars lacking in some key features that I find useful:
So, until these key features are supported in rich progress bars I think we'll stick with tqdm.
Some programs don't do a great job of typing or sizing the output VCF fields. E.g., this is some SnpEff output (I think)
##INFO=<ID=GDI-Phred,Number=.,Type=String,Description="Phred-scaled GDI scores">
##INFO=<ID=GDI,Number=.,Type=String,Description="gene damage index score, a genome-wide, gene-level metric of the mutational damage that has accumulated in the general population from doi 1>
This has been marked as String, Number=".", where it probably should be Integer, Number=1.
One way we could do this is to check the correspondence between the VCF types in the VCF metadata with the types in the Zarr schema, and to push that field through a "conversion" code path if they are not compatible. So, here, we'd see that the original VCF type is String and the Zarr type is int, and we'd have to pass the field through an integer parsing code path.
A nice feature to add at some point then would be to take an input Zarr schema and to automatically detect some common mistyped fields and correct them.
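A sketch of the conversion code path for the example above (this helper is hypothetical; the real sanitiser would also need to handle vector values and per-sample fields):

    def convert_string_to_int(value, fill=-1):
        # The VCF header declares Type=String but the schema says integer:
        # parse each string, mapping missing or unparseable values to fill.
        try:
            return int(value)
        except (TypeError, ValueError):
            return fill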
See assert_part_counts_non_zero in test_vcf_utils and where it's called. It's in the file CEUTrio.20.21.gatk3.4.g.vcf.bgz.tbi. Not sure whether this is a general thing, or just happens with this small file.
On lots of files we seem to get a disappearing progress bar for encode 1D, like:
Encode 1D: 1.47kchunks [00:09, 150chunks/s]
validation-tmp/1kg_2020_chr20_annotations.vcf.gz.zarr
We often don't completely close the progress meter, like
Scan: 50%|█████████████████████ | 1.00/2.00 [00:00<00:00, 21.8files/s
Will be something to do with how we're joining the progress thread.
We're not currently testing this; we need to do so.
It seems likely that providing multiple VCFs that have overlapping intervals would be an error. We should detect this in explode_init and raise an exception.
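A sketch of the check, assuming each input VCF contributes a (contig, start, end) region derived from its index:

    def check_overlapping_regions(regions):
        # regions: list of (contig, start, end) tuples, one per input VCF.
        # After sorting by contig and start, any region that begins before
        # the previous region on the same contig has ended is an overlap.
        last_end = {}
        for contig, start, end in sorted(regions):
            if contig in last_end and start < last_end[contig]:
                raise ValueError(f"Input VCFs overlap on {contig} at position {start}")
            last_end[contig] = end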
Need:

    vcf_zarr_version   0.2
    vcf_header         The VCF header from ##fileformat to #CHROM inclusive, stored as a single string.
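A sketch of setting these as root attributes on the output store (raw_header_string here stands in for the header text captured during conversion):

    import zarr

    raw_header_string = "##fileformat=VCFv4.3\n...\n#CHROM\t..."  # captured during conversion

    root = zarr.open("out.zarr", mode="a")
    root.attrs["vcf_zarr_version"] = "0.2"
    # The full header, from ##fileformat to the #CHROM line inclusive.
    root.attrs["vcf_header"] = raw_header_string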
The Python API that's emerging is to do things like:

    from bio2zarr import vcf
    vcf.convert(["my_file.vcf.gz"], "out.zarr")

It would probably be better to do it like:

    from bio2zarr import vcf2zarr
    vcf2zarr.convert(["my_file.vcf.gz"], "out.zarr")

Then, all the functions exported are in that flat namespace vcf2zarr. How stuff is split across files within the package is private.
And be more systematic about setting them.
We can just raise an error if the version is not equal to the current version for now.
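A minimal sketch of that check (ICF_VERSION and the metadata key are assumed names):

    ICF_VERSION = "0.2"  # assumed constant for the current format version

    def check_format_version(metadata):
        version = metadata.get("format_version")
        if version != ICF_VERSION:
            raise ValueError(
                f"Unsupported format version {version}; expected {ICF_VERSION}"
            )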
Not currently getting the dimensions of String FORMAT fields appropriately.
Sgkit has some better defaults for VCF columns. See here
Doesn't look like these will make a big difference though. Are there any defaults for FORMAT fields like GT, PL etc?
See also #19
The current plink conversion is very much a proof of concept, just doing the actual genotypes. We need to convert all the other stuff too, including providing APIs to open the linked files.
Tests added in #24 are currently xfailing. Probably something pretty basic.
We use missing files as a way of ensuring integrity. We currently don't tell the user anything useful when those files are missing.
Should define some exceptions which capture the situation, and make sure that some useful information makes its way back to the user (when say, the ICF hasn't completed exploding)
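A sketch of the kind of exceptions we might define (all names here are illustrative; path is a pathlib.Path):

    class Bio2ZarrError(Exception):
        """Base class for errors reported to the user."""

    class IncompleteICFError(Bio2ZarrError):
        """Raised when the intermediate columnar format is missing files,
        e.g. because an explode was interrupted before finalising."""

    def check_icf_complete(path, expected_files):
        missing = [name for name in expected_files if not (path / name).exists()]
        if missing:
            raise IncompleteICFError(
                f"ICF at {path} is incomplete (missing {missing}); "
                "has the explode finished?"
            )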
Might be nice to parallelise this and add progress
Variant level annotations are often included as INFO tags with substructure, e.g.
##SnpEffVersion="4.3i (build 2016-12-15 22:33), by Pablo Cingolani"
##SnpEffCmd="SnpEff -noStats -lof GRCh38.86 /gpfs/commons/home/usevani/compbio/CCDG/Project_CCDG_14151_B01_GRM_WGS/final_annotated_vcfs/tmp_dir_annt/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.annotated.normalize.vcf.gz "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA>
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
It would be very helpful and useful to split these into their own Zarr arrays. We could add this as an option, like --parse-snpeff or something (I'm not sure how stable these formats are across versions, etc, though).
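A sketch of the split this would involve for the ANN field (the field names are taken from the header above; only the first few are listed here):

    ANN_FIELDS = ["Allele", "Annotation", "Annotation_Impact", "Gene_Name"]
    # ... remaining field names from the ANN header description

    def parse_ann(ann_value):
        # ANN is a comma-separated list of annotations, each of which is a
        # '|'-separated record; return one dict per annotation.
        annotations = []
        for entry in ann_value.split(","):
            parts = entry.split("|")
            annotations.append(dict(zip(ANN_FIELDS, parts)))
        return annotations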
Currently missing
On recent 1000 genomes data, we have the following:
39G 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz
The zarr using defaults is:
43G tmp/1kg_chr20_all.zarr
which is 4G more.
This is dominated by the FORMAT cols:
1.1G tmp/1kg_chr20_all.zarr/call_AB
7.1G tmp/1kg_chr20_all.zarr/call_AD
4.5G tmp/1kg_chr20_all.zarr/call_DP
174M tmp/1kg_chr20_all.zarr/call_genotype
31M tmp/1kg_chr20_all.zarr/call_genotype_mask
4.7M tmp/1kg_chr20_all.zarr/call_genotype_phased
6.2G tmp/1kg_chr20_all.zarr/call_GQ
4.7M tmp/1kg_chr20_all.zarr/call_MIN_DP
4.7M tmp/1kg_chr20_all.zarr/call_MQ0
552M tmp/1kg_chr20_all.zarr/call_PGT
966M tmp/1kg_chr20_all.zarr/call_PID
22G tmp/1kg_chr20_all.zarr/call_PL
4.7M tmp/1kg_chr20_all.zarr/call_RGQ
4.7M tmp/1kg_chr20_all.zarr/call_SB
In particular, call_PL is 22G. Hopefully there's some reasonably straightforward combination of filters and compressor options that'll bring this down.
Currently we have:
Usage: python -m bio2zarr vcf2zarr [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
convert
explode
genspec
summarise
to-zarr
validate
I think convert is a good name for the top-level all-in-one command (so, plink2zarr convert would just work).
We should probably change genspec to mkschema (or something; generally changing "spec" to "schema").
Not sure about summarise. Maybe there should be an inspect command that will look at either an exploded intermediate format or zarr and return some info?
I dislike to-zarr quite a lot. Maybe encode or something?
For 1000 Genomes chr20 we get this:
name type chunks size compressed max_n min_val max_val
------------------------ ------- -------- --------- ------------ ------- --------- ---------
CHROM String 160 1.25 MB 138.06 KB 1 n/a n/a
POS Integer 160 606.64 KB 4.43 MB 1 6e+04 6.4e+07
QUAL Float 160 606.64 KB 11.33 MB 1 30 1.2e+08
ID String 160 16 KB 9.6 KB 0 n/a n/a
FILTERS String 160 1.41 MB 739.29 KB 1 n/a n/a
REF String 160 1.32 MB 3.57 MB 1 n/a n/a
ALT String 160 1.91 MB 4.82 MB 6 n/a n/a
FORMAT/AB Float 1097 178.45 MB 584.71 MB 1 0.04 0.95
FORMAT/AD Integer 2496 466.49 MB 7.78 GB 7 0 5.8e+03
FORMAT/DP Integer 1224 204.99 MB 6.08 GB 1 0 5.8e+03
FORMAT/GQ Integer 1224 204.99 MB 5.25 GB 1 0 99
FORMAT/GT Integer 1788 307.2 MB 417.48 MB 3 -1 6
FORMAT/MIN_DP Integer 160 16 KB 9.6 KB 0 n/a n/a
FORMAT/MQ0 Integer 160 16 KB 9.6 KB 0 n/a n/a
FORMAT/PGT String 849 46.13 MB 100.24 MB 1 n/a n/a
FORMAT/PID String 849 59.74 MB 129.69 MB 1 n/a n/a
FORMAT/PL Integer 3992 807.39 MB 18.42 GB 28 0 1.7e+05
FORMAT/RGQ Integer 160 16 KB 9.6 KB 0 n/a n/a
FORMAT/SB Integer 160 16 KB 9.6 KB 0 n/a n/a
Note that cols like FORMAT/DP etc are reporting as much larger compressed than uncompressed. Investigate.
Probably related to #51
We don't currently use the VCF Numbers R, A and G, but instead determine dimensions from the observed data.
A minimal thing we need to do is update the dimensions for these fields as described in the spec.
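For reference, a sketch of what those Numbers imply for the field dimensions (per the VCF spec; for diploids G works out as n(n+1)/2 for n alleles):

    import math

    def field_dimension(vcf_number, num_alleles, ploidy=2):
        # A = one value per alt allele, R = one per allele (ref + alts),
        # G = one per possible genotype.
        if vcf_number == "A":
            return num_alleles - 1
        if vcf_number == "R":
            return num_alleles
        if vcf_number == "G":
            return math.comb(num_alleles + ploidy - 1, ploidy)
        raise ValueError(f"Unhandled Number: {vcf_number}")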
Click supports shell completion, which we should document how to set up.
It should work for basic things like paths, but we can look at how to autocomplete things like column names (if CLI commands take them as parameters)
I've run into this issue a couple of times while attempting to convert 1000G chr22 VCF data to Zarr format. The first time I tried calling vcf2zarr convert ... directly. The second time I did it in the staggered approach suggested in the documentation. However, it still produced this error at the encode stage, particularly when encoding 2D matrices:
(bio2zarr_env) [szabad@narval2 bio2zarr_experiments]$ vcf2zarr encode -p 4 converted/chr22.exploded converted/chr22.zarr
Encode 1D: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 19.7k/19.7k [09:35<00:00, 34.2chunks/s]
Encode 2D: 79%|███████████████████████████████████████████████████████████████████████████████▎ | 8.49k/10.8k [28:22<29:35, 1.30chunks/s]Traceback (most recent call last):
File "/home/szabad/bio2zarr_env/bin/vcf2zarr", line 8, in <module>
sys.exit(vcf2zarr())
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/cli.py", line 127, in encode
vcf.encode(
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1538, in encode
SgvcfZarr.encode(
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1498, in encode
pwm.submit(
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/core.py", line 219, in __exit__
wait_on_futures(self.futures)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/core.py", line 56, in wait_on_futures
raise exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I'm running this on a cluster and it's possible that the processes are killed silently if they use more resources than they should. But in this case, I'm just using 4 workers and it didn't seem like they were using lots of memory.
Also a general question related to this: Do you think it's possible to pick up the encoding work from where it left off if things like this happen instead of starting over?