Comments (16)
The error originates in fsspec. It seems like something that should be handled gracefully by zarr.
Traceback (most recent call last):
  File "/home/erlend/scripts/zarr/mdiotest.py", line 22, in <module>
    il_mask, il_headers, il_data = mdio[180,100:1300,:1000]
  File "/home/erlend/.local/lib/python3.9/site-packages/mdio/api/accessor.py", line 390, in __getitem__
    self._traces[item],
  File "/home/erlend/.local/lib/python3.9/site-packages/zarr/core.py", line 788, in __getitem__
    result = self.get_basic_selection(pure_selection, fields=fields)
  File "/home/erlend/.local/lib/python3.9/site-packages/zarr/core.py", line 914, in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out,
  File "/home/erlend/.local/lib/python3.9/site-packages/zarr/core.py", line 957, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/home/erlend/.local/lib/python3.9/site-packages/zarr/core.py", line 1252, in _get_selection
    self._chunk_getitems(lchunk_coords, lchunk_selection, out, lout_selection,
  File "/home/erlend/.local/lib/python3.9/site-packages/zarr/core.py", line 1985, in _chunk_getitems
    cdatas = self.chunk_store.getitems(ckeys, on_error="omit")
  File "/home/erlend/.local/lib/python3.9/site-packages/zarr/storage.py", line 1361, in getitems
    results = self.map.getitems(keys_transformed, on_error="omit")
  File "/home/erlend/.local/lib/python3.9/site-packages/fsspec/mapping.py", line 101, in getitems
    return {
  File "/home/erlend/.local/lib/python3.9/site-packages/fsspec/mapping.py", line 104, in <dictcomp>
    if on_error == "return" or not isinstance(out[k2], BaseException)
from mdio-python.
Hi @ErlendHaa!
Thanks for reporting this.
Your interpretation of the missing chunk treatment is correct. When we ingest we don't write empty chunk keys to the store, and Zarr normally understands this and gracefully returns the fill value when a chunk key doesn't exist.
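To illustrate the missing-chunk behaviour described above, here is a minimal toy model in plain Python. It is not Zarr's implementation; the function name `read_slice`, the dict-based store, and the 1-D layout are all assumptions made for illustration. It only shows the idea that a chunk key absent from the store is read back as the fill value rather than raising.

```python
# Toy model of a chunked 1-D array; NOT Zarr's actual implementation.
# `read_slice` and the dict-based store are hypothetical, for illustration only.
def read_slice(store, start, stop, chunk_len, fill_value=0):
    """Read elements [start, stop) from chunks held in a dict-like store."""
    out = []
    for i in range(start, stop):
        chunk_id, offset = divmod(i, chunk_len)
        chunk = store.get(chunk_id)  # None when the chunk key was never written
        out.append(fill_value if chunk is None else chunk[offset])
    return out


# Only chunk 0 was written; chunk 1 is "empty" and has no key in the store.
store = {0: [1, 2, 3, 4]}
values = read_slice(store, start=2, stop=6, chunk_len=4, fill_value=0)
print(values)  # the last two elements come from the fill value
```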
The behavior you're seeing is therefore not expected and we have never seen it before (maybe an adlfs issue; we haven't used Azure much). Can you please share the following versions so we can diagnose further?
- Python
- MDIO
- Zarr
- Fsspec
- Adlfs
Also, are these conda or pip installed?
from mdio-python.
Sure thing, my environment:
erlend:~$ python3.9 --version
Python 3.9.12
erlend:~$ python3.9 -m pip --version
pip 22.2.2 from /home/erlend/.local/lib/python3.9/site-packages/pip (python 3.9)
erlend:~$ python3.9 -c "import mdio; print(mdio.__version__)"
0.2.0
erlend:~$ python3.9 -c "import zarr; print(zarr.__version__)"
2.12.0
erlend:~$ python3.9 -c "import fsspec; print(fsspec.__version__)"
2022.8.2
erlend:~$ python3.9 -c "import adlfs; print(adlfs.__version__)"
2022.0
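The same information can probably be gathered in one go. Below is a small sketch that imports each module and reads its `__version__` attribute, as the commands above do; the helper name `collect_versions` is hypothetical, and it assumes each package exposes `__version__`.

```python
import importlib


def collect_versions(module_names):
    """Return {module name: version string} for the given importable modules."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
            continue
        # Assumption: the package exposes __version__; fall back if it does not.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in collect_versions(["mdio", "zarr", "fsspec", "adlfs"]).items():
        print(f"{name}: {version}")
```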
from mdio-python.
To narrow it down a bit, I stored the same .mdio file to disk. Then it reads just fine.
from mdio-python.
To narrow it down a bit, I stored the same .mdio file to disk. Then it reads just fine.
This is very helpful, thank you.
It is starting to feel like it is an adlfs and Zarr integration issue. I will drill down a little and report it if that is the case.
from mdio-python.
@ErlendHaa, I was able to reproduce your issue.
It worked fine on GCP and AWS but I will double check again in case an update broke something.
If it works on other clouds, we can bring this up with Zarr developers and they'll fix it upstream.
The demo file is also zero-padded; just like your file, it will have "empty" chunks that are not on the object store.
Steps:
- Create Azure Storage Account defaultmdio. Default settings.
- Create container mdio-test.
- Grabbed the account key from Azure Portal.
- Ran the MDIO Quickstart with the following syntax changes:
from mdio import segy_to_mdio

default_storage_options = {'account_name': "defaultmdio", 'account_key': "..."}

segy_to_mdio(
    segy_path="filt_mig.sgy",
    mdio_path_or_buffer="az://mdio-test/filt-mig.mdio",
    index_bytes=(181, 185),
    index_names=("inline", "crossline"),
    storage_options=default_storage_options,
    chunksize=(16, 16, 1024),  # to get the empty chunks because file is small
)
- Can't query the file because of KeyError.
from mdio-python.
Great! How should we proceed? Do you want me to open an issue upstream with zarr?
from mdio-python.
I tracked down the root cause to the adlfs.AzureBlobFileSystem._expand_path method [1], more specifically to this continue [2], which strips out paths to non-existing blobs. This method is called by adlfs.AzureBlobFileSystem.cat [3], which in turn is called by fsspec.FSMap.getitems [4]. The continue basically undermines the "omit" option in getitems and cat by stripping the path list anyway. As a result, getitems raises a KeyError when trying to index one of the stripped-out paths.
I guess we can close this issue now, as the bug is clearly unrelated to mdio. I'll make an upstream issue for it. Thanks for the help!
[1] https://github.com/fsspec/adlfs/blob/591485b9d77448cd6e791b49bda8942ef03507bf/adlfs/spec.py#L1672
[2] https://github.com/fsspec/adlfs/blob/591485b9d77448cd6e791b49bda8942ef03507bf/adlfs/spec.py#L1725
[3] https://github.com/fsspec/adlfs/blob/591485b9d77448cd6e791b49bda8942ef03507bf/adlfs/spec.py#L1610
[4] https://github.com/fsspec/filesystem_spec/blob/bb9989ce5bf0ed0c0a5f7d3540c3a59581d259ce/fsspec/mapping.py#L69
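The mechanism described above can be sketched in plain Python. This is a simplified stand-in, not the actual adlfs or fsspec code: the function names `cat_returning_errors`, `cat_dropping_missing`, and this `getitems` are hypothetical, and only the final dictionary comprehension is modelled on the lines shown in the traceback. It assumes that a well-behaved `cat` returns an exception object for each missing path, while the problematic one silently drops missing paths.

```python
def cat_returning_errors(blobs, paths):
    """Fetch every path; a missing blob yields an exception object as its value."""
    return {p: (blobs[p] if p in blobs else FileNotFoundError(p)) for p in paths}


def cat_dropping_missing(blobs, paths):
    """Mimic the reported behaviour: missing paths are filtered out up front."""
    return {p: blobs[p] for p in paths if p in blobs}


def getitems(cat, blobs, keys, on_error="omit"):
    """Simplified stand-in for a mapping-style getitems built on top of cat."""
    out = cat(blobs, keys)
    # This lookup assumes every requested key is present in `out`.
    return {
        k: out[k]
        for k in keys
        if on_error == "return" or not isinstance(out[k], BaseException)
    }


blobs = {"0.0.0": b"chunk-bytes"}  # only one chunk exists in the store
keys = ["0.0.0", "0.0.1"]          # the second key is an "empty" chunk

# Missing key is reported as an exception value, so "omit" can skip it.
ok = getitems(cat_returning_errors, blobs, keys)

# Missing key is absent from the result, so the lookup itself fails.
try:
    getitems(cat_dropping_missing, blobs, keys)
    raised = False
except KeyError:
    raised = True
```

In this sketch the "omit" filter never gets a chance to run for the dropped path, because the key lookup fails before the `isinstance` check is evaluated.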
from mdio-python.
@ErlendHaa thanks a lot for all the debugging! You can go ahead and open an issue with Zarr. I'll run a couple more tests on other clouds and if it works ok there I'll close this issue.
Again, thanks a lot!
from mdio-python.
@ErlendHaa, I opened an issue with a minimal reproducible example.
I tested, and this does NOT happen on Google Cloud or S3.
Thanks for finding this!
from mdio-python.
Sorry, forgot to reply back! The bug lies with adlfs, not zarr. I submitted a patch for it, which addresses the root cause of the KeyError. Hopefully they'll accept it.
from mdio-python.
Here is a reference to the PR by @ErlendHaa
from mdio-python.
@ErlendHaa, is this issue resolved with the latest adlfs released a couple weeks ago (2022.10.0)?
If it works as expected, we can close this issue. Thanks again!
from mdio-python.
Sadly, no! I'm not sure what happened there tbh. They seemed to approve my PR, but did not include it in their linted version of it.
from mdio-python.
But as the cause definitely lies with adlfs, I think we can close this one.
from mdio-python.
Sounds good. This is a big problem since many enterprise users use Azure. I will follow up with adlfs.
Opened an issue with adlfs to redo that PR: fsspec/adlfs#358
from mdio-python.