Comments (13)
Hi @d-v-b, thanks for commenting! We'd actually like to encode variable-length strings, which use an object dtype and a numcodecs.vlen.VLenUTF8 codec.
I've added a comment explaining our use case to the discussion at zarr-developers/zarr-specs#83
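For anyone unfamiliar with why variable-length strings need a dedicated codec, here is a minimal, standard-library-only sketch of a length-prefixed encoding. It only illustrates the idea: the function names are made up for this example, and the byte layout (a uint32 item count, then a uint32 byte length before each UTF-8 payload) is an assumption, not a statement of what numcodecs.vlen.VLenUTF8 writes on disk.

```python
import struct


def encode_vlen_utf8(strings):
    """Pack strings as: uint32 item count, then (uint32 byte length, UTF-8 bytes) per item.

    Hypothetical helper for illustration only; the layout is an assumption.
    """
    parts = [struct.pack("<I", len(strings))]
    for s in strings:
        data = s.encode("utf-8")
        parts.append(struct.pack("<I", len(data)))
        parts.append(data)
    return b"".join(parts)


def decode_vlen_utf8(buf):
    """Inverse of encode_vlen_utf8: read the count, then each length-prefixed item."""
    (count,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    out = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        out.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return out


# Variant IDs of very different lengths, like the values we store in VCF Zarr.
ids = [".", ".", "rs6054257"]
encoded = encode_vlen_utf8(ids)
assert decode_vlen_utf8(encoded) == ids
```

Each item costs only its own UTF-8 bytes plus a small length prefix, so short values such as "." are not padded out to the width of the longest string in the chunk.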
from bio2zarr.
I filed the following issues to improve Zarr Python v3 API compatibility here:
from bio2zarr.
"I'll try again when the next alpha comes out"
There is no next alpha according to zarr-developers/zarr-python#1777.
from bio2zarr.
fwiw I would love to chart a path to getting more dtypes into zarr v3 (since i happen to be working on the v3 fill value normalization right now). as you noted, there's a dtype extension mechanism built into the spec but we haven't exercised it yet. Could you share or link to a description of how you are using string arrays, either here or in a discussion over in zarr specs? That might help kick-start things.
from bio2zarr.
I updated https://github.com/tomwhite/bio2zarr/tree/zarr-v3 to use the code from zarr-developers/zarr-python#2036 and the test still passes.
from bio2zarr.
I had a quick look at this and it seems like quite a bit of stuff is broken. I hit a wall at finding a way to specify a name for a new array in a Group. This seems pretty basic and I don't have time to chase down the details, so I'll try again when the next alpha comes out.
from bio2zarr.
I just had another go with the tip of the v3 branch, and still hitting walls with array creation. The support for creating v2 arrays seems to be pretty thin, and it's not clear at all to me how we're supposed to go about it. I don't really know where to start tbh.
from bio2zarr.
For reference:
python3 -m pip install git+https://github.com/zarr-developers/zarr-python
from bio2zarr.
I'll take a look
from bio2zarr.
Here's the branch where I've been experimenting with Zarr v3: https://github.com/tomwhite/bio2zarr/tree/zarr-v3
After making changes to adapt to the different v3 API, it's now failing because Zarr v3 doesn't support string dtypes:
_____________________________________________________________ TestVcfZarrWriterExample.test_encode_partition[0] _____________________________________________________________
self = <tests.test_vcz.TestVcfZarrWriterExample object at 0x7f9dc9a75b40>
icf_path = PosixPath('/private/var/folders/9j/h1v35g4166z6zt816fq7wymc0000gn/T/pytest-of-tom/pytest-1771/data11/example.exploded')
tmp_path = PosixPath('/private/var/folders/9j/h1v35g4166z6zt816fq7wymc0000gn/T/pytest-of-tom/pytest-1771/test_encode_partition_0_0'), partition = 0
    @pytest.mark.parametrize("partition", [0, 1, 2])
    def test_encode_partition(self, icf_path, tmp_path, partition):
        zarr_path = tmp_path / "x.zarr"
        vcf2zarr.encode_init(icf_path, zarr_path, 3, variants_chunk_size=3)
        partition_path = zarr_path / "wip" / "partitions" / f"p{partition}"
        assert not partition_path.exists()
>       vcf2zarr.encode_partition(zarr_path, partition)

tests/test_vcz.py:508:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
bio2zarr/vcf2zarr/vcz.py:1064: in encode_partition
    writer.encode_partition(partition)
bio2zarr/vcf2zarr/vcz.py:683: in encode_partition
    self.encode_id_partition(partition_index)
bio2zarr/vcf2zarr/vcz.py:801: in encode_id_partition
    vid.flush()
bio2zarr/core.py:145: in flush
    sync_flush_1d_array(
bio2zarr/core.py:163: in sync_flush_1d_array
    zarr_array[offset : offset + np_buffer.shape[0]] = np_buffer
../zarr-python/src/zarr/array.py:984: in __setitem__
    self.set_basic_selection(cast(BasicSelection, pure_selection), value, fields=fields)
../zarr-python/src/zarr/array.py:1198: in set_basic_selection
    sync(self._async_array._set_selection(indexer, value, fields=fields, prototype=prototype))
../zarr-python/src/zarr/sync.py:92: in sync
    raise return_result
../zarr-python/src/zarr/sync.py:51: in _runner
    return await coro
../zarr-python/src/zarr/array.py:518: in _set_selection
    value_buffer = prototype.nd_buffer.from_ndarray_like(value)
../zarr-python/src/zarr/buffer.py:339: in from_ndarray_like
    return cls(ndarray_like)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <[AttributeError("'NDBuffer' object has no attribute '_data'") raised in repr()] NDBuffer object at 0x7f9ddd467250>
array = array(['.', '.', 'rs6054257'], dtype=object)
    def __init__(self, array: NDArrayLike):
        # assert array.ndim > 0
>       assert array.dtype != object
E       AssertionError

../zarr-python/src/zarr/buffer.py:286: AssertionError
This is a major limitation for us. The Zarr v3 core spec does not cover strings. They are likely to be a future extension:
"The set of data types specified in v3 is less than in v2. Additional data types will be defined via extensions."
from bio2zarr.
Thanks Tom!
Yeesh, the lack of strings is scary. Looks like we'll be on v2 for a long time then.
from bio2zarr.
As an experiment I tried creating a VLenUTF8Codec which uses numcodecs.vlen.VLenUTF8. I can successfully run a test that writes a VCF Zarr file and then validates it:
pytest 'tests/test_vcf_examples.py::test_by_validating[sample.vcf.gz]'
from bio2zarr.
Amazing! 🎉
from bio2zarr.