tgsai / mdio-python Goto Github PK

View Code? Open in Web Editor NEW

29.0 29.0 10.0 3.09 MB

Cloud native, scalable storage engine for various types of energy data.

Home Page: https://mdio.dev/

License: Other

Python 99.44% Dockerfile 0.42% Shell 0.14%

dask deep-learning energy machine-learning segy zarr

mdio-python's People

Contributors

Stargazers

Watchers

Forkers

srib prathameshmalage sanath-2024 kwinkunks markspec ajaust anthonytorlucci bev005 jcfaracco rahulj2904

mdio-python's Issues

Lossless issue file not found in version 0.2.9 or newer

Segy2mdio conversion works properly with lossless but when lossy library is added it says file is not found.

Installed using pip install multidimio[lossy]
It is looking for a file "/metadata/live_mask/0/0/0"

MDIO documentation improvement

In the MDIO documentation, in the explanation CLI Reference about the chunks (below), maybe improve the documentation (e.g. put a period "." between "2-byte" and "Chunks") ?

3D Seismic Shot Data (Byte Locations Vary): Let’s assume streamer number is at byte 213 as 2-bytes Chunks: 8 shots x 2 cables x 256 channels x 512 samples –header-locations 9,213,13 –header-names shot,cable,chan –header-lengths 4,2,4,4 –chunk-size 8,2,256,512

MDIO copy is slow

mdio copy function is very slow. It starts the copy but nothing happens for a long time on the cpu. Need to profile.

Numba JIT function caching errors in Docker

We enable caching of numba functions so we don't compile them every time. However, this is not "zip safe". Docker happens to install packages as zip files.

We get this error.

RuntimeError: cannot cache function 'ieee2ibm': no locator available for file '/usr/local/lib/python3.9/site-packages/mdio/segy/ibm_float.py'

Solution: Wait till numba 0.57

See numba/numba#4908.

Band-aid solution: set environment variable NUMBA_CACHE_DIR=/tmp

@srib

Support segy_to_mdio when SEG-Y file is on object store

Currently, we can only convert a SEG-Y file to mdio when the SEG-Y file is on a filesystem as make use of segyio.

We need to support converting SEG-Y file without downloading the entirety of it.

Define good default chunks for seismic CDP gathers

Intro

This is regarding regularized CDP offset gathers and is pre-stack.

We need to define good default chunk sizes that satisfy multiple use cases (see access patterns below).

Keys

3D acquisition:

Inline
Crossline
Offset (or Angle)

2D acquisition:

CDP
Offset (or Angle)

+ Time or depth for both.

Access Patterns

Full traces for a geographic point ([il, xl] or CDP)
Visualization (tiled rendering); means we chunk sample axis
Be able to re-chunk into offset/angle volumes (i.e., split the ensemble in 3D)
Power of 2 chunk sizes for ML applications
(holy grail, but rare) also support time slices on offset/angle planes

Sizes

Support 2D and 3D CDP gathers
3,000 - 7,000 time or depth samples
100 - 300 offset values (or 5 - 25 angle values). Usually more offsets on 2D.
4TB+ files

Add dev container

A dev container will improve the developer experience providing a quick way to get started and more consistent experience.

Feature request - add Python 3.11 support

They current mdio project toml limits the Python version:

python = ">=3.8,<3.11"

Now that Python 3.11 has been release, the new feature would add support for it.

Reduce memory requirements for segy cutting

mdio_to_segy requires a significant amount of memory for large datasets. This seems related to dask/dask#10535

Test dask release: 2023.10.0

`copy_mdio` does not honor separator

copy_mdio function does not honor the separator when a file is copied.

Most of our files are ingested with the separator being / but when it gets copied, the separator becomes .

CLI: `mdio --version` returns a `RuntimeError`

Here is a simple repro:

$ conda create --prefix <path_to_your_env> python=3.9
$ conda activate <path_to_your_env>
$ pip install multidimio
$ mdio --version

raises RuntimeError with the error message as follows:

RuntimeError: 'mdio' is not installed. Try passing 'package_name' instead.

Add integration tests for 5D and 6D segy_to_mdio

OBN will potentially be read as ingested as 6D:

receiver line
receiver point with line
receiver component (e.g. hydrophone, vertical component etc)
shot line
shot point within line
time

Streamer data could be ingested as:

shot line
shot point within line
cable
channel
time

Machine learning examples with mdio

Add a few examples/tutorials showing how mdio can be used with ML/DL libraries.

Reduce memory consumption during SEG-Y export

The distributed workers flatten the chunks along the first dimension to write to SEG-Y.

Huge files >2TB use a lot of memory during export.

The output sharding strategy needs to be optimized:

mdio-python/src/mdio/converters/mdio.py

Lines 177 to 241 in 03b9e4f

    
           # We must unify chunks with "trc_chunks" here because 
        
           # headers and live mask may have different chunking. 
        
           # We don't take the time axis for headers / live 
        
           # Still lazy computation 
        
           traces_seq = traces.rechunk(seq_trc_chunks) 
        
           headers_seq = headers.rechunk(seq_trc_chunks[:-1]) 
        
           live_seq = live_mask.rechunk(seq_trc_chunks[:-1]) 
        
           # Build a Dask graph to do the computation 
        
           # Name of task. Using uuid1 is important because 
        
           # we could potentially generate these from different machines 
        
           task_name = "block-to-sgy-part-" + str(uuid.uuid1()) 
        
           trace_keys = flatten(traces_seq.__dask_keys__()) 
        
           header_keys = flatten(headers_seq.__dask_keys__()) 
        
           live_keys = flatten(live_seq.__dask_keys__()) 
        
           all_keys = zip(trace_keys, header_keys, live_keys) 
        
           # tmp file root 
        
           out_dir = path.dirname(output_segy_path) 
        
           task_graph_dict = {} 
        
           block_file_paths = [] 
        
           for idx, (trace_key, header_key, live_key) in enumerate(all_keys): 
        
               block_file_name = f".{idx}_{uuid.uuid1()}._segyblock" 
        
               block_file_path = path.join(out_dir, block_file_name) 
        
               block_file_paths.append(block_file_path) 
        
               block_args = ( 
        
                   block_file_path, 
        
                   trace_key, 
        
                   header_key, 
        
                   live_key, 
        
                   num_samp, 
        
                   sample_format, 
        
                   endian, 
        
               ) 
        
               task_graph_dict[(task_name, idx)] = (write_block_to_segy,) + block_args 
        
           # Make actual graph 
        
           task_graph = HighLevelGraph.from_collections( 
        
               task_name, 
        
               task_graph_dict, 
        
               dependencies=[traces_seq, headers_seq, live_seq], 
        
           ) 
        
           # Note this doesn't work with distributed. 
        
           tqdm_kw = dict(unit="block", dynamic_ncols=True) 
        
           block_progress = TqdmCallback(desc="Step 1 / 2 Writing Blocks", **tqdm_kw) 
        
           with block_progress: 
        
               block_exists = compute_as_if_collection( 
        
                   cls=Array, 
        
                   dsk=task_graph, 
        
                   keys=list(task_graph_dict), 
        
                   scheduler=client, 
        
               ) 
        
           merge_args = [output_segy_path, block_file_paths, block_exists] 
        
           if client is not None: 
        
               _ = client.submit(merge_partial_segy, *merge_args).result() 
        
           else: 
        
               merge_partial_segy(*merge_args)

and

mdio-python/src/mdio/segy/creation.py

Line 111 in 03b9e4f

def merge_partial_segy(output_segy_path, block_file_paths, block_exists):

Greater Shot Ingestion Flexibility

Background

Initial work on shot ingestions flexibility was addressed in PR #180 and contains a detailed description of the problem.

Issue

Although #180 contains a solution it requires an a priori knowledge of the configuration of the headers to successfully ingest correctly.

Solution

Add an option to automatically detect files with structure similar to Type B and ingest with wrapped channels (Type A)

segy_to_mdio(  
    segy_path="prefix/shot_file.segy",  
    mdio_path_or_buffer="s3://bucket/shot_file.mdio",  
    index_bytes=(17, 137, 13),  
    index_lengths=(4, 2, 4),  
    index_names=("shot", "cable", "channel"),  
    chunksize=(8, 2, 128, 1024),  
    grid_overrides={"AutoChannelWrap": True},  
)

Avoiding compression

Is there a way to avoid compression (Blosc) when we run segy_to_mdio? Will this work:

In src/mdio/segy/blocked_io.py

85     elif lossless is None:
86         trace_compressor = header_compressor = None

If yes, could you please support this in MDIO. That way people who do not want any compression can just ship lossless=None explicitly to segy_to_mdio. Thanks.

Loading speed

Hey guys

Nice work with SEG-Y loader! At our team, we use our own library to interact with SEG-Y data, so I've decided to give a try to MDIO and compare the results of multiple approaches and libraries.

Setup

For my tests, I've used a ~21GB sized SEG-Y with IEEE float32 values (no IBM float shenanigans here).
The cube is post-stack, so it is a cube for seismic interpretation: therefore, it has a meaningful regular 3D structure.

Using the same function you provided in tutorials, get_size, I've got the following results:

SEG-Y:	            21504.17 MB
MDIO:	             3943.37 MB
MDIO LOSSY+:	     1723.78 MB
SEG-Y QUANTIZED:     5995.96 MB

The LOSSY+ MDIO file was created by using compression_tolerance=(std of amplitude values).
The SEG-Y QUANTIZED file was created by quantizing the data and writing SEG-Y file (according to the standard) with int8 values.

The system is using Intel(R) Xeon(R) Gold 6242R CPU, just in case that may be of interest.

The tests

Multiple formats are tested against the tasks of loading slices (2D arrays) across three dimensions: INLINE, CROSSLINE, SAMPLES.
Also the ability to load sub-volumes (3D arrays) is tested.
For more advanced usage I have tests for loading batches of data: more on that later.

For tests, I use following engines:

vanilla segyio -- public functions from this great library
segfast -- our in-house library for loading any SEG-Y cubes
segfast with segyio engine -- essentially, a better cooked segyio where we use their private methods
seismiqb -- our library for seismic interpretation (optimized for post-stack cubes) only
seismiqb HDF5 -- converts the data to HDF5 (very similar to zarr you use)
segfast quantized -- automatically quantized (optimally in some information sense) SEG-Y data is written with int8 dtype

To this slew of engines, I've added MDIO loader, which looks very simple:

slide_i = mdio_file[idx, :, :][2]
slide_x = mdio_file[:, idx, :][2]
slide_d = mdio_file[:, :, idx][2]

I also used mdio_file._traces[idx, :, :] but have not noticed significant differences.

The results

An image is better than a thousand words, so a bar-plot of timings for loading INLINE slices:

The situation does not get better on CROSSLINE / SAMPLES axes either:

Note that even naive segyio, which takes a full sweep across file to get depth-slice, has the same speed.

The why

Some of the reasons for this slowness are apparent: during the conversion process, the default chunk_size for ZARR is 64x64x64. Therefore, loading 2D slices is not the forte of this exact chunking.

Unfortunately, even when it comes to 3D sub-volumes, the situation is not much better:

Even with this being the best (and only) scenario for chunked storage, it is still not as fast as plain SEG-Y storage, even with no quantization.

Questions

This leaves a few questions:

is it possible to speed up somehow the loading times? Maybe, I am just not using the right methods from the library.
Or, maybe, this is not the area you focus your format on and the current loading times are fine for the usecases you plan on developing;
is there a way to make a multipurpose file? The way I see it now I can make a file for somewhat fast 2D INLINE slices (by setting chunk_size to 1x64x64 or somewhat like that), but that would be a mess of lots of files;
is there a way to preallocate memory for the data to load into? That is a huge speedup for all ML applications;
is there a way to get values of a particular trace headers for the entire cube?

I hope you can help me with those questions!

Typo in get started in document

In the MDIO ‘get started in 10 minutes’, in the example below, it says dimensions = “inline”, but the plot seems to be for a crossline. Should we use dimensions=”crossline” here instead?

crossline_index = int(mdio.coord_to_index(100, dimensions="inline"))

xl_mask, xl_headers, xl_data = mdio[:, crossline_index]

Add command line utility for copying mdio files

Add the following capability:

mdio cp -i input_file -o output_file

Add an ability to pass a dask client to `MDIOReader`

@markspec thinks that it might be a useful idea to have the ability to pass a Client to MDIOReader.

Configure GridSparsityCheck threshold

A request has been made for the threshold for the GridTraceSparsityError to be configurable. This makes sense for some types of pre-stack data (such as land and OBN).

Current configuration throws an error when the proposed grid is 10 time the number of segy traces.

@tasansal, what is you preferred implementation for this? Command-line or environment variable?

Make `access_pattern` a list of strings instead of a single string

We have a use case where we need multiple chunking schemes for the same file. It needs to take a list of strings with different chunking patterns (access_pattern keyword) instead of a single string.

Documentation build is failing in RtD

Read the docs is failing.

Investigation shows this is the issue:

pradyunsg/furo#710

Need to update Furo to newer version once this is merged.

parse_trace_headers bug

parse trace headers should return an nd array. It currently returns a dictionary as implemented in the legacy solution due to an error introduced in #213 and for floating point header support (#212)

Support for segy ingestion of non-regularized data

Some types of segy data do not have a discrete integer bin on input. E.g. 2D CDP data might be regular in CDP and time but the offset dimension maybe represented by a floating point offset.

To handle cases like this the proposal is to have an ingestion header "index" which will just use a counter starting at 1and append traces to the "index" header.

Make test coverage 100%

Not a trivial task. This would require the following, but needs to be done.

Make or generate test data
Add more tests
Mock object stores, some options:
3.1. S3: Moto
3.2 GCS: fake-gcs-server
3.3 Azure: Azurite

Sparsity check override

Sometimes users need ability to run very sparse grids.

Disable the sparity check/warning with extra option.

In some cases, there can be duplicate 0s in a segy that cause an error/warning to be displayed and the mdio ingestion fails.
Can provide user with a waring but still ingest the segy to mdio.

Mac M2 + segy->mdio conversion error

I'm trying to convert a sample segy file to a mdio file with
Python Version 3.11.7 on Mac M2 is having issues it seems (or perhaps I'm doing something wrong)
(I also tried 3.9.18 and got a similar error)

Hopefully this helps in debugging:

$ python3 -m pip install multidimio
Requirement already satisfied: multidimio in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (0.6.0)
Requirement already satisfied: click<9.0.0,>=8.1.7 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from multidimio) (8.1.7)
Requirement already satisfied: click-params<0.6.0,>=0.5.0 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from multidimio) (0.5.0)
Requirement already satisfied: dask>=2023.10.0 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from multidimio) (2024.1.1)
Requirement already satisfied: fsspec>=2023.9.1 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from multidimio) (2024.2.0)
Requirement already satisfied: numba<0.60.0,>=0.59.0rc1 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from multidimio) (0.59.0)
Requirement already satisfied: psutil<6.0.0,>=5.9.5 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from multidimio) (5.9.8)
Requirement already satisfied: segyio<2.0.0,>=1.9.3 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from multidimio) (1.9.12)
Requirement already satisfied: tqdm<5.0.0,>=4.66.1 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from multidimio) (4.66.1)
Requirement already satisfied: urllib3<2.0.0,>=1.26.18 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from multidimio) (1.26.18)
Requirement already satisfied: zarr<3.0.0,>=2.16.1 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from multidimio) (2.16.1)
Requirement already satisfied: deprecated<2.0.0,>=1.2.14 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from click-params<0.6.0,>=0.5.0->multidimio) (1.2.14)
Requirement already satisfied: validators<0.23,>=0.22 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from click-params<0.6.0,>=0.5.0->multidimio) (0.22.0)
Requirement already satisfied: cloudpickle>=1.5.0 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from dask>=2023.10.0->multidimio) (3.0.0)
Requirement already satisfied: packaging>=20.0 in /Users/macuser/.local/lib/python3.11/site-packages (from dask>=2023.10.0->multidimio) (23.2)
Requirement already satisfied: partd>=1.2.0 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from dask>=2023.10.0->multidimio) (1.4.1)
Requirement already satisfied: pyyaml>=5.3.1 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from dask>=2023.10.0->multidimio) (6.0.1)
Requirement already satisfied: toolz>=0.10.0 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from dask>=2023.10.0->multidimio) (0.12.1)
Requirement already satisfied: importlib-metadata>=4.13.0 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from dask>=2023.10.0->multidimio) (7.0.1)
Requirement already satisfied: llvmlite<0.43,>=0.42.0dev0 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from numba<0.60.0,>=0.59.0rc1->multidimio) (0.42.0)
Requirement already satisfied: numpy<1.27,>=1.22 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from numba<0.60.0,>=0.59.0rc1->multidimio) (1.26.3)
Requirement already satisfied: asciitree in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from zarr<3.0.0,>=2.16.1->multidimio) (0.3.3)
Requirement already satisfied: fasteners in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from zarr<3.0.0,>=2.16.1->multidimio) (0.19)
Requirement already satisfied: numcodecs>=0.10.0 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from zarr<3.0.0,>=2.16.1->multidimio) (0.12.1)
Requirement already satisfied: wrapt<2,>=1.10 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from deprecated<2.0.0,>=1.2.14->click-params<0.6.0,>=0.5.0->multidimio) (1.16.0)
Requirement already satisfied: zipp>=0.5 in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from importlib-metadata>=4.13.0->dask>=2023.10.0->multidimio) (3.17.0)
Requirement already satisfied: locket in /Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages (from partd>=1.2.0->dask>=2023.10.0->multidimio) (1.0.0)

$ cat test.py
from mdio import segy_to_mdio

segy_to_mdio(
    segy_path="volve.segy",
    mdio_path_or_buffer="volve1.mdio",
    index_bytes=(189, 193),
    index_names=("inline", "crossline"),
    lossless = True
)

$ python3 test.py
Scanning SEG-Y for geometry attributes:   0%|                                                                  | 0/6 [00:00<?, ?block/s]Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 131, in _main
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 131, in _main
    prepare(preparation_data)
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 246, in prepare
    prepare(preparation_data)
    _fixup_main_from_path(data['init_main_from_path'])
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 246, in prepare
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
    _fixup_main_from_path(data['init_main_from_path'])
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/macuser/Exp/cname/mdio-python/macuser/test.py", line 3, in <module>
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
    segy_to_mdio(
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages/mdio/converters/segy.py", line 352, in segy_to_mdio
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/macuser/Exp/cname/mdio-python/macuser/test.py", line 3, in <module>
    segy_to_mdio(
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages/mdio/converters/segy.py", line 352, in segy_to_mdio
    dimensions, chunksize, index_headers = get_grid_plan(
    dimensions, chunksize, index_headers = get_grid_plan(
                                           ^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages/mdio/segy/utilities.py", line 66, in get_grid_plan
                                           ^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages/mdio/segy/utilities.py", line 66, in get_grid_plan
    index_headers = parse_trace_headers(
                    ^^^^^^^^^^^^^^^^^^^^
    index_headers = parse_trace_headers(
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages/mdio/segy/parsers.py", line 119, in parse_trace_headers
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages/mdio/segy/parsers.py", line 119, in parse_trace_headers
    lazy_work = executor.map(
                ^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/process.py", line 837, in map
    lazy_work = executor.map(
                ^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/process.py", line 837, in map
    results = super().map(partial(_process_chunk, fn),
    results = super().map(partial(_process_chunk, fn),
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/_base.py", line 608, in map
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/_base.py", line 608, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/_base.py", line 608, in <listcomp>
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/_base.py", line 608, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
          ^^^^^^^^^^^^^^^^^^^^^^
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/process.py", line 808, in submit
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/process.py", line 808, in submit
    self._adjust_process_count()
    self._adjust_process_count()
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/process.py", line 767, in _adjust_process_count
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/process.py", line 767, in _adjust_process_count
    self._spawn_process()
    self._spawn_process()
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/process.py", line 785, in _spawn_process
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/process.py", line 785, in _spawn_process
    p.start()
    p.start()
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/process.py", line 121, in start
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
                  ^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
           ^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
    super().__init__(process_obj)
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 131, in _main
    self._launch(process_obj)
    self._launch(process_obj)
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 42, in _launch
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
    prep_data = spawn.get_preparation_data(process_obj._name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 164, in get_preparation_data
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 164, in get_preparation_data
    _check_not_importing_main()
    prepare(preparation_data)
    _check_not_importing_main()
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 140, in _check_not_importing_main
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 140, in _check_not_importing_main
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 246, in prepare
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

    _fixup_main_from_path(data['init_main_from_path'])
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/macuser/Exp/cname/mdio-python/macuser/test.py", line 3, in <module>
    segy_to_mdio(
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages/mdio/converters/segy.py", line 352, in segy_to_mdio
    dimensions, chunksize, index_headers = get_grid_plan(
                                           ^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages/mdio/segy/utilities.py", line 66, in get_grid_plan
    index_headers = parse_trace_headers(
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages/mdio/segy/parsers.py", line 119, in parse_trace_headers
    lazy_work = executor.map(
                ^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/process.py", line 837, in map
    results = super().map(partial(_process_chunk, fn),
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/_base.py", line 608, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/_base.py", line 608, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/process.py", line 808, in submit
    self._adjust_process_count()
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/process.py", line 767, in _adjust_process_count
    self._spawn_process()
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/process.py", line 785, in _spawn_process
    p.start()
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 164, in get_preparation_data
    _check_not_importing_main()
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/spawn.py", line 140, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

Scanning SEG-Y for geometry attributes:   0%|                                                                  | 0/6 [00:00<?, ?block/s]
Traceback (most recent call last):
  File "/Users/macuser/Exp/cname/mdio-python/macuser/test.py", line 3, in <module>
    segy_to_mdio(
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages/mdio/converters/segy.py", line 352, in segy_to_mdio
    dimensions, chunksize, index_headers = get_grid_plan(
                                           ^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages/mdio/segy/utilities.py", line 66, in get_grid_plan
    index_headers = parse_trace_headers(
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages/mdio/segy/parsers.py", line 139, in parse_trace_headers
    headers = list(lazy_work)
              ^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/process.py", line 620, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/macuser/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.


=============================================================

On Python 3.8.18 on Ubuntu 22.04.03 LTS, I get the following error:

$ python3 test.py
Scanning SEG-Y for geometry attributes: 100%|██████████████████████████████████████████████████████████| 6/6 [00:00<00:00,  7.85block/s]
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    segy_to_mdio(
  File "/home/ubuntu_user/.pyenv/versions/3.8.18/lib/python3.8/site-packages/mdio/converters/segy.py", line 329, in segy_to_mdio
    write_attribute(name="created", zarr_group=zarr_root, attribute=iso_datetime)
  File "/home/ubuntu_user/.pyenv/versions/3.8.18/lib/python3.8/site-packages/mdio/core/utils_write.py", line 17, in write_attribute
    zarr_group.attrs[name] = attribute
  File "/home/ubuntu_user/.pyenv/versions/3.8.18/lib/python3.8/site-packages/zarr/attrs.py", line 89, in __setitem__
    self._write_op(self._setitem_nosync, item, value)
  File "/home/ubuntu_user/.pyenv/versions/3.8.18/lib/python3.8/site-packages/zarr/attrs.py", line 83, in _write_op
    return f(*args, **kwargs)
  File "/home/ubuntu_user/.pyenv/versions/3.8.18/lib/python3.8/site-packages/zarr/attrs.py", line 94, in _setitem_nosync
    d = self._get_nosync()
  File "/home/ubuntu_user/.pyenv/versions/3.8.18/lib/python3.8/site-packages/zarr/attrs.py", line 47, in _get_nosync
    d = self.store._metadata_class.parse_metadata(data)
  File "/home/ubuntu_user/.pyenv/versions/3.8.18/lib/python3.8/site-packages/zarr/meta.py", line 104, in parse_metadata
    meta = json_loads(s)
  File "/home/ubuntu_user/.pyenv/versions/3.8.18/lib/python3.8/site-packages/zarr/util.py", line 75, in json_loads
    return json.loads(ensure_text(s, "utf-8"))
  File "/home/ubuntu_user/.pyenv/versions/3.8.18/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/home/ubuntu_user/.pyenv/versions/3.8.18/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/ubuntu_user/.pyenv/versions/3.8.18/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Add seismic geometry abstractions

It would be nice to have geometry abstractions for standard interface and custom / geometry specific exception handling.

This has the following pros:

Common interface for coordinate conversions.
Trace iterators based on the type of data.
Enforce usage of methods for geometry instances that may be added later.
Improve maintainability.
Allow custom things to be added, i.e., for shots, we add the conversion from unwrapped channels to wrapped channels unwrap_channels, and we still have a base implementation that should follow the extensible interface.
Encapsulate the logic for geometry specific chunk size and access pattern configurations

Long shot, but cython or numba jit versions may be even better. xarray can also be used to handle named dimensions etc.

Would inherit from a base class like

# mdio/segy/geometry.py

from abc import ABC


class SeismicGeometry(ABC):
    def __init__(self, args, kwargs):
        ...

    @abstractmethod
    def __iter__(self):
        ...

    @abstractmethod
    def __getitem__(self):
        ...

    @abstractmethod
    def xy_to_grid(self, x, y, method="nearest"):
        ...

    @property
    @abstractmethod
    def num_traces(self):
       ...

Then we would have 3D as something like

from mdio.segy.geometry import SeismicGeometry


class SeismicStack3d(SeismicGeometry):
    def __init__(self, inlines, crosslines, samples):
        # set attributes, initialize grid etc.

    def __iter__(self):
        # logic to iterate traces on spatial il/xl grid

    @abstractmethod
    def xy_to_grid(self, x, y, method="nearest"):
        # logic to convert CDP-X CDP-Y to inline and crossline

    @property
    def num_traces(self):
        return self.get_size("inline") * self.get_size("crossline")

Or 3D shots like

from mdio.segy.geometry import SeismicGeometry


class SeismicShot3d(SeismicGeometry):
    def __init__(self, shots, cables, channels, samples):
        # set attributes, initialize grid etc.

    def __iter__(self):
        # logic to iterate traces on shot grid

    @abstractmethod
    def xy_to_grid(self, x, y, method="nearest"):
        # logic to convert SHOT-X SHOT-Y to shot number

    @property
    def num_traces(self):
        return self.get_size("shot") * self.get_size("cable") * self.get_size("channel")

    @def unwrap_channels(self, channels_per_streamer: int):
        return self.channel % channels_per_streamer + 1

and so on.

Improved pre-stack indexing for 3D towed streamer data

Currently ingesting on:

shot-line
gun
ffid
cable
channel

leads to a sparse representation of the data if gun is added as an indexing key. This could be handled by performing modulo division of the ffid by the number of guns in most standard towed-streamer acquisitions.

Enabling ingestion with gun will lead to improved support for seismic processing of pre-stack data (as many pre-stack operations are on common channel gathers), ML training for common-channel processes and data visualization and QC on common-channel gathers.

Add printable representation for MDIOReader and MDIOWriter

Currently, MDIOReader and MDIOWriter prints the Python object. It would be useful to have a nice printable representation.

class InfoReporter:

    def __init__(self, obj):
        self.obj = obj

    def __repr__(self):
        items = self.obj.info_items()
        return info_text_report(items)

    def _repr_html_(self):
        items = self.obj.info_items()
        return info_html_report(items)

from here is a good model to follow.

Colab and Binder badge for notebooks

I think it is a good idea to have a Colab and/or Binder badge for the tutorial notebooks.

Here is an action on the GitHub marketplace.

KeyError on missing zarr chunks

Hi!

I'm playing around with mdio and came across this error on one of my test files

mdio = MDIOReader(
       "abfs://somefile.mdio",
       return_metadata=True,
       storage_options={ ... }
)

 _ = mdio[:,:,:100] // KeyError: somefile.mdio/data/chunked_012/5/3/24'

The files is stored in an azure blob store. The error is correct in the sense that the chunk does not exist. The survey is irregular and the segy is padded around the edges to make a nice regular grid. Fill-value is 0, both in zarr and the original segy. Hence I wonder if the missing chunks is a zarr-optimisation:

There is no need for all chunks to be present within an array store. If a chunk is not present then it is considered to be in an uninitialized state. An uninitialized chunk MUST be treated as if it was uniformly filled with the value of the “fill_value” field in the array metadata. If the “fill_value” field isnull then the contents of the chunk are undefined [1].

But I havn't investigated further. It might also be user-error. Any suggestions for the root cause or a solution for this?

Thanks,

[1] https://zarr.readthedocs.io/en/stable/spec/v2.html#chunks

Basic SEG-Y information dump options in CLI

We need CLI options to just look at:

Text header
Binary header
Scan only
Example headers

Typo in README.md

Reads 'mappin' instead of 'mapping'.

OOM when allocating live mask

When user error leads to incorrect values being read for headers, the live mask can be extremely large leading to OOM error. MDIO should QC the dimensions before generating a live mask and raise an exception if a clear error has been made with some meaningful warnings.

C++ bindings for MDIO

Do you have C++ API for MDIO?
If there is, where can I find it?
Thanks

GPU direct storage backend for mdio

We should perhaps consider having a GPU Direct Storage backend for mdio. kvikio.zarr.GDSStore enables it available through kvikio.

Some relevant links here:

xarray-contrib/xbatcher#87

zarr-developers/zarr-python#934

https://zarr.readthedocs.io/en/stable/release.html#release-2-13-0

pydata/xarray#6874

Tests for backward compatibility

We need tests to ensure backward compatibility primarily for MDIOReader and MDIOWriter.

Write tests for rechunking

The newly added rechunking functionality needs tests

Consider tensorstore as backend

Hi,

I know that google's tensorstore is rather fresh and not officially supported by google (hopefully it will be), but I thought I'd make an issue / future feature request on it anyway as it offer significant performance improvements over the other backends.

I did some measurement that might be of interest

Cube dimensions: 6400x3200x1500
Environment: Cloud 2 cloud - client running in same datacenter (Azure) as where the data is stored.
Fetching a depth slice:

[mdio - zarr] Read slice of size 77.51423645019531 MB in 231.33055152895395 s
[mdio - dask] Read slice of size 77.51423645019531 MB in 16.46115321794059 s
[tensorstore] Read slice of size 77.51423645019531 MB in 5.266401566041168 s

For reference, here is the tensorstore script that I used to read the underlying zarr-array from a mdio-file:

import time
import tensorstore as ts


def fetch():
    dataset = ts.open({
        'driver': 'zarr',
        'kvstore' : {
            'driver': 'http',
            'base_url': 'https://account.blob.core.windows.net?<sas>',
            'path': 'somefile.mdio/data/chunked_012'
        }
    }).result()

    start = time.perf_counter()
    zslice = dataset[:,:,200]
    data = zslice.read().result()

    print(f'[tensorstore] Read slice of size {data.nbytes / (1024 * 1024)} MB in {time.perf_counter() - start} s')


if __name__ == '__main__':
    fetch()

Resolve issue with EBCDIC header of SEGY created by Petrel

OS windows10; python 3.9; MDIO v0.1.6

SEGY files generated by Petrel crash with an error related to the EBCDIC.

UnicodeDecodeError Traceback (most recent call last)
Cell In [14], line 1
----> 1 segy_to_mdio(
2 segy_path=f,
3 mdio_path_or_buffer="tst3d_cube.mdio",
4 index_bytes=(5, 9),
5 index_names=("inline", "crossline"),
6 )

File C:\appl\python\env\mdio_tst\lib\site-packages\mdio\converters\segy.py:173, in segy_to_mdio(segy_path, mdio_path_or_buffer, index_bytes, index_names, index_lengths, chunksize, endian, lossless, compression_tolerance, storage_options, overwrite)
169 # Read file specific metadata, build grid, and live trace mask.
170 with segyio.open(
171 filename=segy_path, mode="r", ignore_geometry=True, endian=endian
172 ) as segy_handle:
--> 173 text_header = parse_text_header(segy_handle)
174 binary_header = parse_binary_header(segy_handle)
175 num_traces = segy_handle.tracecount

File C:\appl\python\env\mdio_tst\lib\site-packages\mdio\segy\parsers.py:66, in parse_text_header(segy_handle)
54 def parse_text_header(segy_handle: segyio.SegyFile) -> list[str]:
55 """Parse text header from bytearray to python list of str per line.
56
57 The segyio library returns the text header as a bytearray instance.
(...)
64 Parsed text header in list with lines as elements.
65 """
---> 66 text_header = segy_handle.text[0].decode()
67 text_header = [
68 text_header[char_idx : char_idx + 80]
69 for char_idx in range(0, len(text_header), 80)
70 ]
71 return text_header

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 155: invalid continuation byte

Petrel EBCDIC appear to require encoding='ISO-8859-1' rather than the default UTF-8

Using
segy_handle.text[0].decode(errors='ignore')
should get around this issue

Add "timezone" mdio "created" date

mdio data has a "create" field in it's "zattrs" file.

The propose feature would add time zone information to this datetime in some canonical way.

In a scenario where timezone information is unavailable; the meta data should record that it does not have that information; and not default to something like UTC.

header scan fails incorrect for little endian segy

In segyio, the buf is ALWAYS big endian regardless of the segy bytorder. Thus, def header_scan_worker() will result in incorrect header values for the snippet below from src/mdio/segy/_workers.py:

formats = [type_.numpy_dtype.newbyteorder(endian) for type_ in byte_types]

An issue was raised with segyio issues #559. Confirmed in discussion on Slack.

Example of behavior can be found at anthonytorlucci/segyio_header_buf.

Installation issue on 3.10.6 on macOS

I think I can install skbuild manually and fix this, but was wondering if this was going to be an issue for others?


└──> python3 -m pip install multidimio
Collecting multidimio
  Downloading multidimio-0.2.0-py3-none-any.whl (60 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.2/60.2 kB 1.0 MB/s eta 0:00:00
Requirement already satisfied: click>=8.1.3 in /opt/homebrew/lib/python3.10/site-packages (from multidimio) (8.1.3)
Collecting zarr<3.0.0,>=2.12.0
  Downloading zarr-2.12.0-py3-none-any.whl (185 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 185.8/185.8 kB 3.0 MB/s eta 0:00:00
Collecting click-params<0.4.0,>=0.3.0
  Downloading click_params-0.3.0-py3-none-any.whl (12 kB)
Collecting numba<0.56.0,>=0.55.2
  Downloading numba-0.55.2-cp310-cp310-macosx_11_0_arm64.whl (2.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 15.1 MB/s eta 0:00:00
Collecting segyio<2.0.0,>=1.9.3
  Downloading segyio-1.9.6.tar.gz (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 17.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/private/var/folders/tm/v4szwz995hg06nhvxjlvv7th0000gn/T/pip-install-q00euj7_/segyio_cbb75c3d7b1a4aaeb6b361e8b4253d68/setup.py", line 3, in <module>
          import skbuild
      ModuleNotFoundError: No module named 'skbuild'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.```

mdio format-specification

Hi!

Is there any plans to publish and version the underlying format. Similar to the zarr-specification [1] ? Great work on this project btw, it looks really promising!

[1] https://zarr.readthedocs.io/en/stable/spec.html

SEGY import spawns multiprocessing recursion

The following code will trigger a (near) endless recursion of subprocesses, causing all the conversions to be attempted in parallel, failing all of them:

(I am not using Dask)

import os
from mdio import segy_to_mdio

input_segy = 'C:\\Demo\\Data\\ST10010ZC11_PZ_PSDM_KIRCH_FAR_D.MIG_FIN.POST_STACK.3D.JS-017536.segy'

for level in [0, 0.01, 0.02, 1.0, 10.0]:
    compressed_mdio_file = os.path.join(os.path.splitext(input_segy)[0] + '_' + str(level) + '.mdio')
    segy_to_mdio(segy_path=input_segy,
                 mdio_path_or_buffer=compressed_mdio_file,
                 index_bytes=(189, 193),
                 index_names=('inline', 'crossline'),
                 compression_tolerance=level,
                 lossless=(level == 0))
    print(f'Converted {input_segy} @ {level} tolerance, to {compressed_mdio_file}')

print('Complete')

What I see is a constant looping of the following output:

C:\sources\mdio_test (main)
λ python mdio_test.py
Scanning SEG-Y for geometry attributes:   0%|                        | 0/6 [00:00<?, ?block/s]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\sources\nDBenchmark\mdio_test.py", line 9, in <module>
    segy_to_mdio(segy_path=input_segy,
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\mdio\converters\segy.py", line 177, in segy_to_mdio
    dimensions, index_headers = get_grid_plan(
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\mdio\segy\utilities.py", line 53, in get_grid_plan
    index_headers = parse_trace_headers(
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\mdio\segy\parsers.py", line 120, in parse_trace_headers
    with Pool(num_workers) as pool:
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\pool.py", line 212, in __init__
    self._repopulate_pool()
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\sources\nDBenchmark\mdio_test.py", line 9, in <module>
    segy_to_mdio(segy_path=input_segy,
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\mdio\converters\segy.py", line 177, in segy_to_mdio
    dimensions, index_headers = get_grid_plan(
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\mdio\segy\utilities.py", line 53, in get_grid_plan
    index_headers = parse_trace_headers(
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\mdio\segy\parsers.py", line 120, in parse_trace_headers
    with Pool(num_workers) as pool:
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\pool.py", line 212, in __init__
    self._repopulate_pool()
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Synchronization for `MDIOWriter`

Currently, MDIOWriter does not have ability to control synchronization anyway. It is extraneous to the MDIOWriter class.

I think MDIOWriter is useful if it has ability to expose synchronization for users. I don't know what form it takes but I am currently having a use case where I am having to calculate it to align with the chunk boundaries. This can perhaps be made simple enough for users?

What do you think?

cc @tasansal

Improve logging for autochannelwrap

Improve logging to let users know the reason for sparse grid and take self-corrective action on whether to ingest with autoindex or autochannelwrap.

This has become an issue for streamer data where ingestion with autochannelwrap usually works. In the unusual case of unwrapped channels with duplicate traces then segy2mdio ingestion will fail with a sparse index error. In this case the user would like to see an what the first repeated channel was to aid diagnosis and a suggestion to use autoindex in the case that the job fails.

Missing some tests for 4D data

@markspec

When you have a chance, can you please write some tests for #248:

Test amplitudes, or their statistics (after ingestion) match expected values for random slices
Test round trip import/export with random trace checks

Currently, these tests exist for 3D stack; you can extend them or make new ones :) Thanks a lot!

	# We must unify chunks with "trc_chunks" here because
	# headers and live mask may have different chunking.
	# We don't take the time axis for headers / live
	# Still lazy computation
	traces_seq = traces.rechunk(seq_trc_chunks)
	headers_seq = headers.rechunk(seq_trc_chunks[:-1])
	live_seq = live_mask.rechunk(seq_trc_chunks[:-1])

	# Build a Dask graph to do the computation
	# Name of task. Using uuid1 is important because
	# we could potentially generate these from different machines
	task_name = "block-to-sgy-part-" + str(uuid.uuid1())

	trace_keys = flatten(traces_seq.__dask_keys__())
	header_keys = flatten(headers_seq.__dask_keys__())
	live_keys = flatten(live_seq.__dask_keys__())

	all_keys = zip(trace_keys, header_keys, live_keys)

	# tmp file root
	out_dir = path.dirname(output_segy_path)

	task_graph_dict = {}
	block_file_paths = []
	for idx, (trace_key, header_key, live_key) in enumerate(all_keys):
	block_file_name = f".{idx}_{uuid.uuid1()}._segyblock"
	block_file_path = path.join(out_dir, block_file_name)
	block_file_paths.append(block_file_path)

	block_args = (
	block_file_path,
	trace_key,
	header_key,
	live_key,
	num_samp,
	sample_format,
	endian,
	)

	task_graph_dict[(task_name, idx)] = (write_block_to_segy,) + block_args

	# Make actual graph
	task_graph = HighLevelGraph.from_collections(
	task_name,
	task_graph_dict,
	dependencies=[traces_seq, headers_seq, live_seq],
	)

	# Note this doesn't work with distributed.
	tqdm_kw = dict(unit="block", dynamic_ncols=True)
	block_progress = TqdmCallback(desc="Step 1 / 2 Writing Blocks", **tqdm_kw)

	with block_progress:
	block_exists = compute_as_if_collection(
	cls=Array,
	dsk=task_graph,
	keys=list(task_graph_dict),
	scheduler=client,
	)

	merge_args = [output_segy_path, block_file_paths, block_exists]
	if client is not None:
	_ = client.submit(merge_partial_segy, *merge_args).result()
	else:
	merge_partial_segy(*merge_args)