
vol-async's Introduction


HDF5 Asynchronous I/O VOL Connector

Asynchronous I/O is becoming increasingly popular as scientific applications access ever larger amounts of data. Applications can take advantage of an asynchronous interface by scheduling I/O as early as possible and overlapping computation or communication with I/O operations, which hides the cost of I/O and improves overall performance. This work is part of the ECP-ExaIO project.

Documentation

The Async VOL documentation website has detailed build instructions and examples.

Citation

To cite Async VOL, please use the following:

@ARTICLE{9459479,
  author={Tang, Houjun and Koziol, Quincey and Ravi, John and Byna, Suren},
  journal={IEEE Transactions on Parallel and Distributed Systems}, 
  title={Transparent Asynchronous Parallel I/O Using Background Threads}, 
  year={2022},
  volume={33},
  number={4},
  pages={891-902},
  doi={10.1109/TPDS.2021.3090322}}
  
@INPROCEEDINGS{8955215,
  author={Tang, Houjun and Koziol, Quincey and Byna, Suren and Mainzer, John and Li, Tonglin},
  booktitle={2019 IEEE/ACM Fourth International Parallel Data Systems Workshop (PDSW)}, 
  title={Enabling Transparent Asynchronous I/O using Background Threads}, 
  year={2019},
  volume={},
  number={},
  pages={11-19},
  doi={10.1109/PDSW49588.2019.00006}}

vol-async's People

Contributors

brtnfld, github-actions[bot], houjun, hyoklee, jeanbez, kencasimiro, lrknox, mierl, qkoziol, sbyna, zhenghh04


vol-async's Issues

Checks for < 0 of unsigned variables.

Many of the checks perform this comparison:

if ((attempt_count = check_app_acquire_mutex(task, &mutex_count, &acquired)) < 0)
    goto done;

This check is a no-op because attempt_count is an unsigned int, so the comparison with 0 can never be true. The if condition can be removed.
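
The following is a minimal sketch of why the comparison never fires, and of one possible alternative to simply deleting the check. It assumes the callee signals failure with a negative return value, which the issue does not state. fake_acquire, stored_in_unsigned, and acquire_failed are hypothetical names used only for illustration; they are not part of vol-async.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for check_app_acquire_mutex(): returns the number
 * of attempts on success and a negative value on failure (an assumption;
 * the issue does not describe the real return convention). */
static int fake_acquire(bool fail)
{
    return fail ? -1 : 3;
}

/* What an unsigned variable holds after a failing call: -1 converts to
 * UINT_MAX, so a "< 0" test on that variable can never be true. */
static unsigned int stored_in_unsigned(bool fail)
{
    unsigned int attempt_count = (unsigned int)fake_acquire(fail);
    return attempt_count;
}

/* One possible fix: keep the result in a signed variable so the error
 * check is meaningful, and copy it to the unsigned counter only once it
 * is known to be non-negative. */
static bool acquire_failed(bool fail, unsigned int *attempt_count)
{
    int rc = fake_acquire(fail);
    if (rc < 0)
        return true;
    *attempt_count = (unsigned int)rc;
    return false;
}
```

If the real function can never return a negative value, removing the condition as the issue suggests is the simpler change. Compilers can usually flag this pattern; GCC, for example, reports it under -Wtype-limits.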

E3SM-IO failed on 1-process run

I am using the develop branch of vol-async 73a870d to test the E3SM-IO benchmark.
One of the tests failed. The failed command runs on 1 MPI process, but
the same command runs fine with 16 processes.

Below are the related env variables.

HDF5_PLUGIN_PATH=$HOME/ASYNC_VOL/lib
HDF5_VOL_CONNECTOR=async under_vol=0;under_info={}
LD_LIBRARY_PATH=$HOME/ASYNC_VOL/lib:$HOME/Argobots/1.1/lib:$HOME/HDF5/1.14.1-2-thread/lib

Here is the run command.

e3sm_io -k -r 2 -y 2 datasets/map_f_case_16p.h5 -o blob_f_out.h5 -a hdf5 -x blob

Part of the GDB backtrace is given below.

#26 0x00007f717436f218 in H5D__write (count=count@entry=1, dset_info=dset_info@entry=0x7f71565fff00)
    at ../../hdf5-1.14.1-2/src/H5Dio.c:745
#27 0x00007f71745b1f61 in H5VL__native_dataset_write (count=1, obj=<optimized out>, 
    mem_type_id=<optimized out>, mem_space_id=0x1922630, file_space_id=0x191b230, dxpl_id=<optimized out>, 
    buf=0x191c130, req=0x0) at ../../hdf5-1.14.1-2/src/H5VLnative_dataset.c:407
#28 0x00007f717459db47 in H5VL__dataset_write (cls=<optimized out>, req=0x0, buf=0x191c130, 
    dxpl_id=792633534417207497, file_space_id=0x191b230, mem_space_id=0x1922630, mem_type_id=0x191a430, 
    obj=0x1915350, count=1) at ../../hdf5-1.14.1-2/src/H5VLcallback.c:2236
#29 H5VLdataset_write (count=1, obj=0x1915350, connector_id=648518346341351424, mem_type_id=0x191a430, 
    mem_space_id=0x1922630, file_space_id=0x191b230, dxpl_id=792633534417207497, buf=0x191c130, req=0x0)
    at ../../hdf5-1.14.1-2/src/H5VLcallback.c:2396
#30 0x00007f71725a8ef0 in async_dataset_write_fn (foo=0x1a335a0)
    at /homes/wkliao/ASYNC_VOL/vol-async/src/h5_async_vol.c:9712
#31 0x00007f717238104a in ABTD_ythread_func_wrapper (p_arg=0x7f71566001e0)
    at ../../argobots-1.1/src/arch/abtd_ythread.c:21

HDF5 segfault with vol-async when building FLASHX

Runtime segfault; the valgrind output is below.

(base) jain @ compute001 ~/F5/async_hdf5 (rajeeja/async_hdf5_io)
└─ $ ▶ valgrind --leak-check=full ./flash5
==65029== Memcheck, a memory error detector
==65029== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==65029== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==65029== Command: ./flash5
==65029==
[Driver_initParallel]: Called MPI_Init_thread - requested level 3, given level 3
RuntimeParameters_read: ignoring unknown parameter "nriem"...
Grid_init: resolution based on runtime params:
lrefine dx dy
1 1.250 1.250
2 0.625 0.625
3 0.312 0.312
MaterialProperties initialized
attribute # 1 = 2 ->meshVar 1 1
attribute # 2 = 7 ->meshVar 8 1
pt_gcMaskForAdvance: T F F F F F F F T T F
pt_gcMaskForWrite: T F F F F F F T F F F
Particles_init: pt_velNumAttrib is 2
Particles_init: pt_velAttrib is 9 9 10 10 0 0
Source terms initialized
5.0000000000000000 1 1

flash: 2 dimensional vortex initialization

Parameters read:

gamma = 1.3999999999999999
ambient density = 1.0000000000000000
ambient pressure = 1.0000000000000000
ambient x-velocity = 1.0000000000000000
ambient y-velocity = 1.0000000000000000

vortex_strength = 5.0000000000000000
x center = 5.0000000000000000
y center = 5.0000000000000000
x subintervals = 1
y subintervals = 1

Parameters computed :

ambient temperature = 1.2027239580856474E-008
ambient int. energy = 2.5000000000000004
gas constant = 83144598.000000000

iteration, no. not moved = 0 0
Done with refinement: total blocks = 1
[amr_morton_process]: Initializing surr_blks using standard orrery implementation
INFO: Grid_fillGuardCells is ignoring masking.
iteration, no. not moved = 0 0
Done with refinement: total blocks = 5
iteration, no. not moved = 0 0
Done with refinement: total blocks = 21
Finished with Grid_initDomain, no restart
Ready to call Hydro_init
Hydro initialized
Gravity initialized


Warning: The initial timestep is too large.
initial timestep = 2.5000000000000001E-002
CFL timestep = 0.10170685742456619
Resetting dtinit to dr_tstepSlowStartFactor*dtcfl.


Initial dt verified
Particles_initPositions on processor 0 done, pt_numLocal= 100
arrays freed
==65029== Warning: client switching stacks? SP change: 0xffeffedd0 --> 0xe86c078
==65029== to suppress, use: --max-stackframe=68458982744 or greater
==65029== Warning: client switching stacks? SP change: 0xe86bfa0 --> 0xec6d078
==65029== to suppress, use: --max-stackframe=4198616 or greater
==65029== Warning: client switching stacks? SP change: 0xec6cf60 --> 0xffeffedd0
==65029== to suppress, use: --max-stackframe=68454784624 or greater
==65029== further instances of this message will not be shown.
HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 0:
#000: H5Pfapl.c line 5671 in H5Pget_vol_info(): not a property list
major: Invalid arguments to routine
minor: Inappropriate type
==65029== Use of uninitialised value of size 8
==65029== at 0xB62B875: H5VL_async_file_create (h5_async_vol.c:21067)
==65029== by 0x102D103: H5VL__file_create (H5VLcallback.c:3393)
==65029== by 0x102D37C: H5VL_file_create (H5VLcallback.c:3427)
==65029== by 0xC6C7E8: H5F__create_api_common (H5F.c:613)
==65029== by 0xC6D0BD: H5Fcreate_async (H5F.c:703)
==65029== by 0x8189EC: io_h5init_file_ (io_h5file_interface.c:205)
==65029== by 0x8200F3: io_initfile_ (io_initFile.F90:56)
==65029== by 0x529F0A: io_writecheckpoint_ (IO_writeCheckpoint.F90:112)
==65029== by 0x52966C: io_outputinitial_ (IO_outputInitial.F90:76)
==65029== by 0x412219: driver_initflash_ (Driver_initFlash.F90:194)
==65029== by 0x42C217: MAIN__ (Flash.F90:49)
==65029== by 0x42C284: main (Flash.F90:43)
==65029==
==65029== Invalid read of size 8
==65029== at 0xB62B875: H5VL_async_file_create (h5_async_vol.c:21067)
==65029== by 0x102D103: H5VL__file_create (H5VLcallback.c:3393)
==65029== by 0x102D37C: H5VL_file_create (H5VLcallback.c:3427)
==65029== by 0xC6C7E8: H5F__create_api_common (H5F.c:613)
==65029== by 0xC6D0BD: H5Fcreate_async (H5F.c:703)
==65029== by 0x8189EC: io_h5init_file_ (io_h5file_interface.c:205)
==65029== by 0x8200F3: io_initfile_ (io_initFile.F90:56)
==65029== by 0x529F0A: io_writecheckpoint_ (IO_writeCheckpoint.F90:112)
==65029== by 0x52966C: io_outputinitial_ (IO_outputInitial.F90:76)
==65029== by 0x412219: driver_initflash_ (Driver_initFlash.F90:194)
==65029== by 0x42C217: MAIN__ (Flash.F90:49)
==65029== by 0x42C284: main (Flash.F90:43)
==65029== Address 0x900000000000001 is not stack'd, malloc'd or (recently) free'd
==65029==

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x5272777
#1 0x5272D7E
#2 0x5F31CAF
#3 0xB62B875
#4 0x102D103 in H5VL__file_create at H5VLcallback.c:3393
#5 0x102D37C in H5VL_file_create at H5VLcallback.c:3427
#6 0xC6C7E8 in H5F__create_api_common at H5F.c:613
#7 0xC6D0BD in H5Fcreate_async at H5F.c:703
#8 0x8189EC in io_h5init_file_ at io_h5file_interface.c:205
#9 0x8200F3 in io_initfile_ at io_initFile.F90:56
#10 0x529F0A in io_writecheckpoint_ at IO_writeCheckpoint.F90:112
#11 0x52966C in io_outputinitial_ at IO_outputInitial.F90:76
#12 0x412219 in driver_initflash_ at Driver_initFlash.F90:194
#13 0x42C217 in flash at Flash.F90:49
==65029==
==65029== HEAP SUMMARY:
==65029== in use at exit: 83,157,515 bytes in 6,205 blocks
==65029== total heap usage: 32,452 allocs, 26,247 frees, 94,766,509 bytes allocated
==65029==
==65029== 8 bytes in 1 blocks are possibly lost in loss record 178 of 5,182
==65029== at 0x4C2D110: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0x4001149: allocate_and_init (dl-tls.c:529)
==65029== by 0x4001149: tls_get_addr_tail (dl-tls.c:742)
==65029== by 0xB840C83: local_set_xstream_internal (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB84542C: xstream_launch_root_ythread (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB854F3D: xstream_context_thread_func (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0x4E3F183: start_thread (pthread_create.c:312)
==65029== by 0x5FF903C: clone (clone.S:111)
==65029==
==65029== 336 bytes in 1 blocks are possibly lost in loss record 4,356 of 5,182
==65029== at 0x4C2CC70: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0x4012EE4: allocate_dtv (dl-tls.c:296)
==65029== by 0x4012EE4: _dl_allocate_tls (dl-tls.c:460)
==65029== by 0x4E3FD92: allocate_stack (allocatestack.c:589)
==65029== by 0x4E3FD92: pthread_create@@GLIBC_2.2.5 (pthread_create.c:500)
==65029== by 0xB855073: ABTD_xstream_context_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB845379: xstream_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB845CEC: ABT_xstream_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB5FA7FC: async_instance_init (h5_async_vol.c:1133)
==65029== by 0xB5FAF6D: H5VL_async_init (h5_async_vol.c:1386)
==65029== by 0x104983D: H5VL__register_connector (H5VLint.c:1237)
==65029== by 0x104A2EE: H5VL__register_connector_by_name (H5VLint.c:1379)
==65029== by 0x1046321: H5VL__set_def_conn (H5VLint.c:442)
==65029== by 0x104543D: H5VL_init_phase2 (H5VLint.c:201)
==65029==
==65029== 4,194,432 bytes in 1 blocks are possibly lost in loss record 5,175 of 5,182
==65029== at 0x4C2D110: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0x4C2D227: posix_memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0xB850576: ABTI_ythread_create_root (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB845259: xstream_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB846F7F: ABTI_xstream_create_primary (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB83F24D: ABT_init (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB5FA1C6: async_init (h5_async_vol.c:925)
==65029== by 0xB5FA544: async_instance_init (h5_async_vol.c:1054)
==65029== by 0xB5FAF6D: H5VL_async_init (h5_async_vol.c:1386)
==65029== by 0x104983D: H5VL__register_connector (H5VLint.c:1237)
==65029== by 0x104A2EE: H5VL__register_connector_by_name (H5VLint.c:1379)
==65029== by 0x1046321: H5VL__set_def_conn (H5VLint.c:442)
==65029==
==65029== 4,194,432 bytes in 1 blocks are possibly lost in loss record 5,176 of 5,182
==65029== at 0x4C2D110: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0x4C2D227: posix_memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0xB84967B: ythread_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB850713: ABTI_ythread_create_main_sched (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB845300: xstream_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB846F7F: ABTI_xstream_create_primary (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB83F24D: ABT_init (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB5FA1C6: async_init (h5_async_vol.c:925)
==65029== by 0xB5FA544: async_instance_init (h5_async_vol.c:1054)
==65029== by 0xB5FAF6D: H5VL_async_init (h5_async_vol.c:1386)
==65029== by 0x104983D: H5VL__register_connector (H5VLint.c:1237)
==65029== by 0x104A2EE: H5VL__register_connector_by_name (H5VLint.c:1379)
==65029==
==65029== 4,194,432 bytes in 1 blocks are possibly lost in loss record 5,177 of 5,182
==65029== at 0x4C2D110: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0x4C2D227: posix_memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0xB84967B: ythread_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB850713: ABTI_ythread_create_main_sched (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB845300: xstream_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB845CEC: ABT_xstream_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB5FA7FC: async_instance_init (h5_async_vol.c:1133)
==65029== by 0xB5FAF6D: H5VL_async_init (h5_async_vol.c:1386)
==65029== by 0x104983D: H5VL__register_connector (H5VLint.c:1237)
==65029== by 0x104A2EE: H5VL__register_connector_by_name (H5VLint.c:1379)
==65029== by 0x1046321: H5VL__set_def_conn (H5VLint.c:442)
==65029== by 0x104543D: H5VL_init_phase2 (H5VLint.c:201)
==65029==
==65029== LEAK SUMMARY:
==65029== definitely lost: 0 bytes in 0 blocks
==65029== indirectly lost: 0 bytes in 0 blocks
==65029== possibly lost: 12,583,640 bytes in 5 blocks
==65029== still reachable: 70,573,875 bytes in 6,200 blocks
==65029== suppressed: 0 bytes in 0 blocks
==65029== Reachable blocks (those to which a pointer was found) are not shown.
==65029== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==65029==
==65029== For counts of detected and suppressed errors, rerun with: -v
==65029== Use --track-origins=yes to see where uninitialised values come from
==65029== ERROR SUMMARY: 7 errors from 7 contexts (suppressed: 0 from 0)
Killed
(base) jain @ compute001 ~/F5/async_hdf5 (rajeeja/async_hdf5_io)
└─ $ ▶ uname -a
Linux compute001 3.13.0-170-generic #220-Ubuntu SMP Thu May 9 12:40:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
(base) jain @ compute001 ~/F5/async_hdf5 (rajeeja/async_hdf5_io)
└─ $ ▶ lsb_release -a
LSB Version: core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:core-4.1-amd64:core-4.1-noarch:security-4.0-amd64:security-4.0-noarch:security-4.1-amd64:security-4.1-noarch
Distributor ID: Ubuntu
Description: Ubuntu 14.04.6 LTS
Release: 14.04
Codename: trusty

signal SIGABRT in testing

As I'm developing the FORTRAN async tests in HDF5, I'm seeing an issue with H5Aopen_async_f (backtrace below).

Sometimes the test fails and sometimes it does not. I'm running on 6 ranks.

It is basically doing:


    CALL h5fopen_async_f(filename, H5F_ACC_RDWR_F, file_id, es_id, hdferror, access_prp = fapl_id )
    CALL check("h5fopen_async_f",hdferror, total_error)

    f_ptr = C_LOC(exists0)
    CALL H5Aexists_async_f(file_id, attr_name, f_ptr, es_id, hdferror)
    CALL check("H5Aexists_async_f",hdferror, total_error)

    f_ptr = C_LOC(exists1)
    CALL H5Aexists_async_f(file_id, TRIM(attr_name)//"00", f_ptr, es_id, hdferror)
    CALL check("H5Aexists_async_f",hdferror, total_error)

    f_ptr = C_LOC(exists2)
    CALL H5Aexists_by_name_async_f(file_id, "/", attr_name, f_ptr, es_id, hdferror)
    CALL check("H5Aexists_by_name_async_f",hdferror, total_error)

    f_ptr = C_LOC(exists3)
    CALL H5Aexists_by_name_async_f(file_id, "/", TRIM(attr_name)//"00", f_ptr, es_id, hdferror)
    CALL check("H5Aexists_by_name_async_f",hdferror, total_error)

    CALL H5Aopen_async_f(file_id, attr_name, attr_id0, es_id, hdferror)  <--- fails here
    CALL check("H5Aopen_async_f", hdferror, total_error)


async_test: ../../src/H5Fint.c:631: H5F__get_objects_cb: Assertion `obj_ptr' failed.
async_test: ../../src/H5Fint.c:631: H5F__get_objects_cb: Assertion `obj_ptr' failed.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7f5c5f7734e2 in ???
#1  0x7f5c5f772675 in ???
#2  0x7f5c5e280d4f in ???
#3  0x7f5c5e280cbb in ???
#4  0x7f5c5e282354 in ???
#5  0x7f5c5e278cb9 in ???
#6  0x7f5c5e278d41 in ???
#7  0x7f5c60813ec7 in H5F__get_objects_cb
        at ../../src/H5Fint.c:631
#8  0x7f5c608e3555 in H5I__iterate_cb
        at ../../src/H5Iint.c:1526
#9  0x7f5c608e4eb2 in H5I_iterate
        at ../../src/H5Iint.c:1592
#10  0x7f5c60813dc0 in H5F__get_objects
        at ../../src/H5Fint.c:599
#11  0x7f5c608173a0 in H5F_get_obj_count
        at ../../src/H5Fint.c:475
#12  0x7f5c60920b98 in H5O__attr_find_opened_attr
        at ../../src/H5Oattribute.c:661
#13  0x7f5c60921f31 in H5O__attr_open_by_name
        at ../../src/H5Oattribute.c:473
#14  0x7f5c606fcacc in H5A__open
        at ../../src/H5Aint.c:535
#15  0x7f5c60b04368 in H5VL__native_attr_open
        at ../../src/H5VLnative_attr.c:154
#16  0x7f5c60ae073d in H5VL__attr_open
        at ../../src/H5VLcallback.c:1104
#17  0x7f5c60ae8827 in H5VLattr_open
        at ../../src/H5VLcallback.c:1175
#18  0x7f5c60d8527b in async_attr_open_fn
        at /home/brtnfld/work/vol-async/src/h5_async_vol.c:5675
#19  0x7f5c5c1bbc97 in ???
#20  0x7f5c5c1c1e98 in ???
#21  0xffffffffffffffff in ???

Argobots segfault in MacOS Solution

On macOS, one may encounter the following segfault:

*** Process received signal ***
Signal: Segmentation fault: 11 (11)
Signal code: (0)
Failing at address: 0x0
[ 0] 0 libsystem_platform.dylib 0x00007fff20428d7d _sigtramp + 29
[ 1] 0 ??? 0x0000000000000000 0x0 + 0
[ 2] 0 libabt.1.dylib 0x0000000105bdbdc0 ABT_thread_create + 128
[ 3] 0 libh5async.dylib 0x00000001064bde1f push_task_to_abt_pool + 559
[ 4] 0 libh5async.dylib 0x00000001064e6a02 async_group_create + 1890
[ 5] 0 libh5async.dylib 0x00000001064c4061 H5VL_async_group_create + 321
[ 6] 0 libhdf5.1000.dylib 0x0000000105f6f794 H5VL__group_create + 180
[ 7] 0 libhdf5.1000.dylib 0x0000000105f6f569 H5VL_group_create + 217
[ 8] 0 libhdf5.1000.dylib 0x0000000105d48a04 H5G__create_api_common + 660
[ 9] 0 libhdf5.1000.dylib 0x0000000105d485f5 H5Gcreate2 + 325
[10] 0 async_test_parallel.exe 0x0000000105bbcb43 main + 739
[11] 0 libdyld.dylib 0x00007fff203fef5d start + 1
[12] 0 ??? 0x0000000000000001 0x0 + 1

The solution from the Argobots developers is to set the following environment variable when running the application:
ABT_THREAD_STACKSIZE=100000 ./your_app.exe

2.1 Compile H5_DIR Configure Issue

Dear authors (@houjun, @jeanbez), I was trying to compile step 2.1 but ran into an issue when I ran the second command, which is
> ./configure --prefix=$H5_DIR/install --enable-parallel --enable-threadsafe --enable-unsupported #(may need to add CC=cc or CC=mpicc)
I tried both CC=cc and CC=mpicc, and I also added some flags to try to make it work, but it still gives me errors. ./autogen.sh runs, but it does not generate a Makefile to build with.
I get two different errors depending on whether I add CC=cc or CC=mpicc.
In some cases I added flags such as --with-zlib CFLAGS="03". I also followed this link, but it did not resolve my issue, and I am not sure what is blocking the build: Link:

Screenshot 2022-03-15 160746
Screenshot 2022-03-15 160553

[2.2 works fine.
2.3 is fixed, but I can't run make because H5 is required.]

I tried to figure it out myself, but I could really use some suggestions or help debugging it.

Any suggestions will be highly appreciated. Thank you.

Test errors

Hi,
I am getting errors when I run the test cases in the code.

For example:

$>./async_test_multifile.exe
async_test_multifile.exe: H5CX.c:3610: H5CX__pop_common: Assertion `head && *head' failed

and:

$ ./async_test_serial_event_set_error_stack.exe
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
  #000: H5.c line 1010 in H5open(): library initialization failed
    major: Function entry/exit
    minor: Unable to initialize object
  #001: H5.c line 277 in H5_init_library(): unable to initialize vol interface
    major: Function entry/exit
    minor: Unable to initialize object
  #002: H5VLint.c line 202 in H5VL_init_phase2(): unable to set default VOL connector
    major: Virtual Object Layer
    minor: Can't set value
  #003: H5VLint.c line 444 in H5VL__set_def_conn(): can't register connector
    major: Virtual Object Layer
    minor: Unable to register new ID
  #004: H5VLint.c line 1376 in H5VL__register_connector_by_name(): unable to load VOL connector
    major: Virtual Object Layer
    minor: Unable to initialize object
H5Fcreate start
H5Fcreate done
H5Gcreate start
H5Gcreate done
H5Gcreate 2 start (should fail when executed)
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
  #000: H5G.c line 268 in H5Gcreate_async(): unable to asynchronously create group
    major: Symbol table
    minor: Unable to create file
  #001: H5G.c line 185 in H5G__create_api_common(): unable to create group
    major: Symbol table
    minor: Unable to initialize object
  #002: H5VLcallback.c line 4248 in H5VL_group_create(): group create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #003: H5VLcallback.c line 4215 in H5VL__group_create(): group create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #004: H5VLnative_group.c line 103 in H5VL__native_group_create(): unable to create group
    major: Symbol table
    minor: Unable to initialize object
  #005: H5Gint.c line 328 in H5G__create_named(): unable to create and link to group
    major: Symbol table
    minor: Unable to initialize object
  #006: H5L.c line 2546 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #007: H5L.c line 2788 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #008: H5Gtraverse.c line 838 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #009: H5Gtraverse.c line 614 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #010: H5L.c line 2581 in H5L__link_cb(): name already exists
    major: Links
    minor: Object already exists
Error with group create
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
  #000: H5S.c line 496 in H5Sclose(): not a dataspace
    major: Invalid arguments to routine
    minor: Inappropriate type
Closing dataset's dataspace failed
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
  #000: H5D.c line 472 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
Closing dataset failed

Thanks

error when using H5S_BLOCK

Using Async I/O VOL version 1.4 with H5S_BLOCK as the memory space
causes the errors below.

 HDF5-DIAG: Error detected in HDF5 (1.13.3) MPI-process 0:
   #000: ../../hdf5-1.13.3/src/H5S.c line 487 in H5Scopy(): not a dataspace
     major: Invalid arguments to routine
     minor: Inappropriate type
   [ASYNC ABT LOG] Argobots execute async_dataset_write_fn failed
 free(): invalid pointer
 Abort (core dumped)

Here is a short test program that reproduces the problem.
https://github.com/DataLib-ECP/vol-log-based/blob/master/tests/basic/h5s_block.c

Unable to pass parallel make tests

Hello vol-async team,

I'm trying to get this HDF5 Asynchronous I/O VOL Connector installed on my system and I can get it to a point where it is passing the serial tests (in vol-async/test/pytest.py) but never the parallel ones; I think there may be some inconsistencies with the directory structures / paths as written so hopefully we can clear this up together. Let me walk you through how I got here:

  1. I cloned the HDF5 and vol-async repos and set my environment directories like so:
export H5_DIR=/home1/sneuhoff/nbu11/scratch/hdf5_async/hdf5/
export VOL_DIR=/home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/
export ABT_DIR=/home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/argobots/
  2. I checked out the async_vol_register_optional branch and ran autogen.sh
  3. I ran ./configure --prefix=$H5_DIR/install --enable-parallel --enable-threadsafe --enable-unsupported CC=mpicc using my system's HPE MPT installation for MPI.
  4. I ran make install with no issues, switched to $ABT_DIR, and ran ./autogen.sh && CC=cc ./configure --prefix=$ABT_DIR/build && make install with no issues
  5. Here is where I think things start to break down a little. I cd into $VOL_DIR/src and copy Makefile.summit to Makefile. I edit it so that:
HDF5_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/hdf5/install
ABT_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/argobots/build

Notice these are not as written in the repo's README: I had to add /install to the end of HDF5_DIR for it to find the correct header files. If I did not do this, it would complain that hdf5dev.h could not be found (as it should; that header file is not in $H5_DIR as Makefile.summit would have you believe).
  6. After editing that Makefile, I run make and it completes smoothly. Next, I run

export LD_LIBRARY_PATH=$VOL_DIR/src:$H5_DIR/lib:$LD_LIBRARY_PATH
export HDF5_PLUGIN_PATH="$VOL_DIR"
export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}"

although here again I find that $H5_DIR/lib doesn't exist; perhaps it should be $H5_DIR/install/lib
  7. I copy Makefile.summit to Makefile and again edit it so that:

ASYNC_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/src
HDF5_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/hdf5/install
ABT_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/argobots/build
  8. I run make with no issues
  9. When I run make check (my Python is version 3.7.0), I get the following:
./pytest.py -p
Running serial tests
Test # 1 : async_test_serial.exe PASSED
Test # 2 : async_test_serial2.exe PASSED
ERROR: Test async_test_multifile.exe : returned non-zero exit status= -6 aborting test
run_cmd= ./async_test_multifile.exe
pytest was unsuccessful

Running async_test_multifile.exe alone gives me:

async_test_multifile.exe: H5CX.c:3610: H5CX__pop_common: Assertion `head && *head' failed.
Aborted (core dumped)

In my other attempts changing various things I was able to get it to pass all the way to here:

./pytest.py -p
Running serial tests
Test # 1 : async_test_serial.exe PASSED
Test # 2 : async_test_serial2.exe PASSED
Test # 3 : async_test_multifile.exe PASSED
Test # 4 : async_test_serial_event_set.exe PASSED
ERROR: Test async_test_serial_event_set_error_stack.exe : returned non-zero exit status= 255 aborting test
run_cmd= ./async_test_serial_event_set_error_stack.exe
pytest was unsuccessful

Running that test individually gives:

H5Fcreate start
H5Fcreate done
H5Gcreate start
H5Gcreate done
H5Gcreate 2 start (should fail when executed)
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
  #000: H5G.c line 268 in H5Gcreate_async(): unable to asynchronously create group
    major: Symbol table
    minor: Unable to create file
  #001: H5G.c line 185 in H5G__create_api_common(): unable to create group
    major: Symbol table
    minor: Unable to initialize object
  #002: H5VLcallback.c line 4920 in H5VL_group_create(): group create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #003: H5VLcallback.c line 4887 in H5VL__group_create(): group create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #004: H5VLnative_group.c line 103 in H5VL__native_group_create(): unable to create group
    major: Symbol table
    minor: Unable to initialize object
  #005: H5Gint.c line 328 in H5G__create_named(): unable to create and link to group
    major: Symbol table
    minor: Unable to initialize object
  #006: H5L.c line 2383 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #007: H5L.c line 2625 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #008: H5Gtraverse.c line 838 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #009: H5Gtraverse.c line 614 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #010: H5L.c line 2418 in H5L__link_cb(): name already exists
    major: Links
    minor: Object already exists
Error with group create
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
  #000: H5S.c line 496 in H5Sclose(): not a dataspace
    major: Invalid arguments to routine
    minor: Inappropriate type
Closing dataset's dataspace failed
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
  #000: H5D.c line 472 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
Closing dataset failed

I am wondering if there is anything here that is obviously inconsistent with how I should be installing things. Let me know, thanks!

The problem about async memory limit?

Dear Sir:
When I use the HDF5 Async VOL with HDF5 1.13.1 and run my app with HDF5 async I/O, I get messages like "ASYNC ABT INFO 0 write size 18920009385957 larger than async memory limit 23632764928, switch to synchronous write".
Have I failed to set some environment variable on my system? As it stands, async mode does not work properly.
Thanks
Li Jian
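
The two numbers in the message are easier to judge in larger units. The sketch below only restates the comparison that the log text describes; it is not the connector's actual code, and exceeds_async_limit and bytes_to_gib are hypothetical helper names. With the values from the message, the reported write size is about 17620 GiB and the limit is about 22 GiB, so it may be worth checking the reported size against the amount of data the application really writes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when a write is larger than the memory
 * budget, which is the condition the log message describes. */
static bool exceeds_async_limit(uint64_t write_size, uint64_t mem_limit)
{
    return write_size > mem_limit;
}

/* Convert a byte count to whole GiB (rounded down) for easier reading. */
static uint64_t bytes_to_gib(uint64_t bytes)
{
    return bytes >> 30;
}
```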

Failing tests with HDF5 API tests for VOLS.

For the serial tests (test/API in HDF5), only h5_api_test_attribute fails with:


1: Testing shared datatype for attributes                                *FAILED*
1:     reference count of the named datatype is wrong: 1

For the parallel tests (testpar/API), only h5_api_test_parallel_async fails with:

9: **********************************************
9: *                                            *
9: *      API Parallel Async Tests              *
9: *                                            *
9: **********************************************
9: 
9: Testing single dataset I/O                         
9:   Testing test setup                                                  HDF5-DIAG: Error detected in HDF5 (1.15.0) MPI-process 0:
9:   #000: ../../src/H5VLcallback.c line 6321 in H5VLintrospect_get_conn_cls(): NULL obj pointer
9:     major: Invalid arguments to routine
9:     minor: Bad value
9: HDF5-DIAG: Error detected in HDF5 (1.15.0) MPI-process 0:
9:   #000: ../../src/H5VL.c line 658 in H5VLobject_is_native(): can't determine if object is a native connector object
9:     major: Virtual Object Layer
9:     minor: Can't get value
9:   #001: ../../src/H5VLint.c line 1077 in H5VL_object_is_native(): can't get VOL connector class
9:     major: Virtual Object Layer
9:     minor: Can't get value
9:   #002: ../../src/H5VLcallback.c line 6289 in H5VL_introspect_get_conn_cls(): can't query connector class
9:     major: Virtual Object Layer
9:     minor: Can't get value
9:   #003: ../../src/H5VLcallback.c line 6256 in H5VL__introspect_get_conn_cls(): can't query connector class
9:     major: Virtual Object Layer
9:     minor: Can't get value
9:   #004: ../../src/H5VLcallback.c line 6321 in H5VLintrospect_get_conn_cls(): NULL obj pointer
9:     major: Invalid arguments to routine
9:     minor: Bad value
9: *FAILED*

both ASYNC dynamic and static libraries in LDFLAGS in test/Makefile, conflict?

Hi,
This is just a note about an issue I found in test/Makefile.

I got this error while running the test async_test_serial.exe:

./async_test_serial.exe: symbol lookup error: /home/myuser/hdf5-async/vol-async/src/libh5async.so: undefined symbol: ABT_initialized

I noticed that LDFLAGS in the test/Makefile has:

LDFLAGS = $(DEBUG) -L$(ASYNC_DIR) -L$(ABT_DIR)/lib -L$(HDF5_DIR)/lib -Wl,-rpath=$(ASYNC_DIR) -Wl,-rpath=$(ABT_DIR)/lib -Wl,-rpath=$(HDF5_DIR)/lib -labt -lhdf5 -lh5async -lasynchdf5 -labt

So I removed '-lh5async', which points to the dynamic library, from LDFLAGS. The test async_test_serial.exe now passes.

async_test_multifile.exe fails with segmentation fault

Hi,

I am running on an x86-64 Linux OpenMPI cluster, and I have built following the instructions in the README, but the tests do not complete successfully:

$ make check_serial
python3 ./pytest.py
Running serial tests
Test # 1 : async_test_serial.exe PASSED
Test # 2 : async_test_serial2.exe PASSED
ERROR: Test async_test_multifile.exe : returned non-zero exit status= -11 aborting test
run_cmd= ./async_test_multifile.exe
pytest was unsuccessful

The backtrace is:

$ cat async_vol_test.err
[gadi-login-07:3639707:0:3639707] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x110)
==== backtrace (tid:3639707) ====
 0 0x0000000000012c20 .annobin_sigaction.c()  sigaction.c:0
 1 0x0000000000007f5c get_n_running_task_in_queue_obj()  /home/120/bw0729/vol-async/src/h5_async_vol.c:2138
 2 0x0000000000008f0c H5VL_async_request_wait()  /home/120/bw0729/vol-async/src/h5_async_vol.c:24279
 3 0x000000000045238a H5VL__request_wait()  /home/120/bw0729/hdf5/src/H5VLcallback.c:6435
 4 0x00000000004653f6 H5VL_request_wait()  /home/120/bw0729/hdf5/src/H5VLcallback.c:6469
 5 0x0000000000177597 H5ES__wait_cb()  /home/120/bw0729/hdf5/src/H5ESint.c:669
 6 0x0000000000178ce2 H5ES__list_iterate()  /home/120/bw0729/hdf5/src/H5ESlist.c:171
 7 0x00000000001786a4 H5ES__wait()  /home/120/bw0729/hdf5/src/H5ESint.c:754
 8 0x0000000000174130 H5ESwait()  /home/120/bw0729/hdf5/src/H5ES.c:342
 9 0x000000000040129a main()  /home/120/bw0729/vol-async/test/async_test_multifile.c:61
10 0x0000000000023493 __libc_start_main()  ???:0
11 0x000000000040106e _start()  ???:0
=================================
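
A fault at a small address such as 0x110 is often what a member access through a NULL struct pointer looks like: the faulting address is then just the member's offset. Whether that is what happens in get_n_running_task_in_queue_obj() is an assumption, not something the backtrace proves. The sketch below uses a made-up struct layout (the real connector types differ) to show the effect and a defensive check.

```c
#include <stddef.h>

/* Hypothetical layout for illustration only; not the connector's real struct. */
typedef struct task_list {
    char padding[0x110]; /* stand-in for whatever members come first */
    int  n_running;      /* lands at offset 0x110 in this made-up layout */
} task_list_t;

/* Offset of the member; reading it through a NULL base pointer would
 * access exactly this address. */
size_t n_running_offset(void)
{
    return offsetof(task_list_t, n_running);
}

/* Defensive variant: treat a missing list as "no running tasks" instead of
 * dereferencing a NULL pointer. */
int count_running_tasks(const task_list_t *list)
{
    if (list == NULL)
        return 0;
    return list->n_running;
}
```

If the real function receives a NULL or already-freed object here, a guard like the one in `count_running_tasks` would avoid the crash, though it would not explain why the pointer is invalid in the first place.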

Summit crash with hdf5-iotest and > 1 node

When I try to run hdf5-iotest on more than one node, I get the crash shown below. It works fine on a single node.

#0  0x000020001ac6bfb4 in ABT_thread_create () from /ccs/home/brtnfld/packages/argobots/build/argobots//lib/libabt.so.1
#1  0x0000200003d98870 in push_task_to_abt_pool (qhead=0x4b22fed0, pool=0x4b2a1980) at h5_async_vol.c:2249
#2  0x0000200003db98e4 in async_file_open (qtype=REGULAR, aid=0x4b22fed0, name=0x7fffdc00e840 "hdf5_iotest.h5", flags=0, fapl_id=792633534417208627, dxpl_id=792633534417207304, req=0x0) at h5_async_vol.c:13253
#3  0x0000200003dd4b3c in H5VL_async_file_open (name=0x7fffdc00e840 "hdf5_iotest.h5", flags=0, fapl_id=792633534417207316, dxpl_id=792633534417207304, req=0x0) at h5_async_vol.c:22141
#4  0x00002000004a85e4 in H5VL__file_open (name=<optimized out>, name@entry=0x7fffdc00e840 "hdf5_iotest.h5", flags=flags@entry=0, fapl_id=<optimized out>, fapl_id@entry=792633534417207316, dxpl_id=<optimized out>, 
    dxpl_id@entry=792633534417207304, req=<optimized out>, req@entry=0x0, cls=<optimized out>, cls=<optimized out>) at ../../src/H5VLcallback.c:3497
#5  0x00002000004b199c in H5VL_file_open (connector_prop=0x7fffdc00e440, name=0x7fffdc00e840 "hdf5_iotest.h5", flags=<optimized out>, fapl_id=792633534417207316, dxpl_id=792633534417207304, req=0x0) at ../../src/H5VLcallback.c:3646
#6  0x000020000025346c in H5F__open_api_common (filename=filename@entry=0x7fffdc00e840 "hdf5_iotest.h5", flags=flags@entry=0, fapl_id=<optimized out>, fapl_id@entry=792633534417207316, token_ptr=token_ptr@entry=0x0)
    at ../../src/H5F.c:795
#7  0x0000200000255c38 in H5Fopen_async (app_file=0x1000f878 "../../src/read_test.c", app_func=0x1000fbc8 "read_test", app_line=<optimized out>, filename=0x7fffdc00e840 "hdf5_iotest.h5", flags=<optimized out>, 
    fapl_id=792633534417207316, es_id=0) at ../../src/H5F.c:880
#8  0x0000000010009284 in ?? ()
#9  0x000000001000820c in ?? ()
#10 0x00002000008b4078 in generic_start_main.isra () from /lib64/power9/libc.so.6
#11 0x00002000008b4264 in __libc_start_main () from /lib64/power9/libc.so.6
#12 0x0000000000000000 in ?? ()
