Giter Site home page Giter Site logo

Run tests against Zarr 3 about bio2zarr HOT 13 OPEN

tomwhite avatar tomwhite commented on September 13, 2024
Run tests against Zarr 3

from bio2zarr.

Comments (13)

tomwhite avatar tomwhite commented on September 13, 2024 2

Hi @d-v-b, thanks for commenting! We'd actually like to encode variable-length strings, which use an object dtype and a numcodecs.vlen.VLenUTF8 codec.

I've added a comment explaining our use case to the discussion at zarr-developers/zarr-specs#83

from bio2zarr.

tomwhite avatar tomwhite commented on September 13, 2024 2

I filed the following issues to improve Zarr Python v3 API compatibility here:

from bio2zarr.

tomwhite avatar tomwhite commented on September 13, 2024 1

I'll try again when the next alpha comes out

There is no next alpha according to zarr-developers/zarr-python#1777.

from bio2zarr.

d-v-b avatar d-v-b commented on September 13, 2024 1

fwiw I would love to chart a path to getting more dtypes into zarr v3 (since i happen to be working on the v3 fill value normalization right now). as you noted, there's a dtype extension mechanism built into the spec but we haven't exercised it yet. Could you share or link to a description of how you are using string arrays, either here or in a discussion over in zarr specs? That might help kick-start things.

from bio2zarr.

tomwhite avatar tomwhite commented on September 13, 2024 1

I updated https://github.com/tomwhite/bio2zarr/tree/zarr-v3 to use the code from zarr-developers/zarr-python#2036 and the test still passes.

from bio2zarr.

jeromekelleher avatar jeromekelleher commented on September 13, 2024

I had a quick look at this and it seems like quite a bit of stuff is broken. I hit a wall at finding a way to specify a name for a new array in a Group. This seems pretty basic and I don't have time to chase down the details, so I'll try again when the next alpha comes out.

from bio2zarr.

jeromekelleher avatar jeromekelleher commented on September 13, 2024

I just had another go with the tip of the v3 branch, and still hitting walls with array creation. The support for creating v2 arrays seems to be pretty thin, and it's not clear at all to me how we're supposed to go about it. I don't really know where to start tbh.

from bio2zarr.

jeromekelleher avatar jeromekelleher commented on September 13, 2024

For reference:

python3 -m pip install pip install git+https://zarr-developers/zarr-python

from bio2zarr.

tomwhite avatar tomwhite commented on September 13, 2024

I'll take a look

from bio2zarr.

tomwhite avatar tomwhite commented on September 13, 2024

Here's the branch where I've been experimenting with Zarr v3: https://github.com/tomwhite/bio2zarr/tree/zarr-v3

After making changes to adapt to the different v3 API, it's now failing because Zarr v3 doesn't support string dtypes:

_____________________________________________________________ TestVcfZarrWriterExample.test_encode_partition[0] _____________________________________________________________

self = <tests.test_vcz.TestVcfZarrWriterExample object at 0x7f9dc9a75b40>
icf_path = PosixPath('/private/var/folders/9j/h1v35g4166z6zt816fq7wymc0000gn/T/pytest-of-tom/pytest-1771/data11/example.exploded')
tmp_path = PosixPath('/private/var/folders/9j/h1v35g4166z6zt816fq7wymc0000gn/T/pytest-of-tom/pytest-1771/test_encode_partition_0_0'), partition = 0

    @pytest.mark.parametrize("partition", [0, 1, 2])
    def test_encode_partition(self, icf_path, tmp_path, partition):
        zarr_path = tmp_path / "x.zarr"
        vcf2zarr.encode_init(icf_path, zarr_path, 3, variants_chunk_size=3)
        partition_path = zarr_path / "wip" / "partitions" / f"p{partition}"
        assert not partition_path.exists()
>       vcf2zarr.encode_partition(zarr_path, partition)

tests/test_vcz.py:508: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
bio2zarr/vcf2zarr/vcz.py:1064: in encode_partition
    writer.encode_partition(partition)
bio2zarr/vcf2zarr/vcz.py:683: in encode_partition
    self.encode_id_partition(partition_index)
bio2zarr/vcf2zarr/vcz.py:801: in encode_id_partition
    vid.flush()
bio2zarr/core.py:145: in flush
    sync_flush_1d_array(
bio2zarr/core.py:163: in sync_flush_1d_array
    zarr_array[offset : offset + np_buffer.shape[0]] = np_buffer
../zarr-python/src/zarr/array.py:984: in __setitem__
    self.set_basic_selection(cast(BasicSelection, pure_selection), value, fields=fields)
../zarr-python/src/zarr/array.py:1198: in set_basic_selection
    sync(self._async_array._set_selection(indexer, value, fields=fields, prototype=prototype))
../zarr-python/src/zarr/sync.py:92: in sync
    raise return_result
../zarr-python/src/zarr/sync.py:51: in _runner
    return await coro
../zarr-python/src/zarr/array.py:518: in _set_selection
    value_buffer = prototype.nd_buffer.from_ndarray_like(value)
../zarr-python/src/zarr/buffer.py:339: in from_ndarray_like
    return cls(ndarray_like)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[AttributeError("'NDBuffer' object has no attribute '_data'") raised in repr()] NDBuffer object at 0x7f9ddd467250>
array = array(['.', '.', 'rs6054257'], dtype=object)

    def __init__(self, array: NDArrayLike):
        # assert array.ndim > 0
>       assert array.dtype != object
E       AssertionError

../zarr-python/src/zarr/buffer.py:286: AssertionError

This is a major limitation for us. The Zarr v3 core spec does not cover strings.

They are likely to be a future extension:

The set of data types specified in v3 is less than in v2. Additional data types will be defined via extensions.

from bio2zarr.

jeromekelleher avatar jeromekelleher commented on September 13, 2024

Thanks Tom!

Yeesh, the lack of strings is scary. Looks like we'll be on v2 for a long time then.

from bio2zarr.

tomwhite avatar tomwhite commented on September 13, 2024

As an experiment I tried creating a VLenUTF8Codec which uses numcodecs.vlen.VLenUTF8. I can successfully run a test that writes a VCF Zarr file and then validates it:

pytest 'tests/test_vcf_examples.py::test_by_validating[sample.vcf.gz]'

from bio2zarr.

jeromekelleher avatar jeromekelleher commented on September 13, 2024

Amazing! 🎉

from bio2zarr.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.