Comments (17)
The branch is https://github.com/brtnfld/hdf5/tree/ASYNC_F
To run the test:
#!/bin/bash
export ABT_DIR=$HOME/work/argobots/build/argobots/
export HDF5_DIR=$HOME/work/hdf5.brtnfld/build/hdf5
export LD_LIBRARY_PATH="$HDF5_DIR/lib64:$HOME/packages/szip-2.1.1/szip/lib64:$ABT_DIR/lib64:$LD_LIBRARY_PATH"
export HDF5_PLUGIN_PATH="$HOME/work/vol-async/build/lib"
export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}"
mpiexec -n 6 ./async_test
from vol-async.
I'm also seeing periodic hangs with 8 ranks, but that is probably a separate issue:
#0 0x00007f5178db5890 in pool_pop_shared () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#1 0x00007f5178db9aea in sched_run () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#2 0x00007f5178da59b9 in thread_main_sched_func () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#3 0x00007f5178db3c98 in ABTD_ythread_func_wrapper () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#4 0x00007f5178da5469 in ABTD_ythread_context_func_wrapper () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#5 0x0000000000000000 in ?? ()
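One way to keep an intermittent hang like this from stalling a test session is to run the launcher under a time limit. This is a minimal sketch, not part of vol-async or the test suite: `run_with_limit` is a hypothetical helper name, and it assumes GNU coreutils `timeout` is available.

```shell
#!/bin/bash
# Hypothetical helper: run a command under a time limit so a hang becomes a failure.
# GNU coreutils `timeout` exits with status 124 when the limit is reached.
run_with_limit() {
    local limit="$1"; shift
    local status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after $limit: possible hang" >&2
    fi
    return "$status"
}
```

For example, `run_with_limit 300 mpiexec -n 8 ./async_test` would report a possible hang if the run has not finished after 300 seconds. Whether every rank is cleaned up on timeout depends on the MPI launcher.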
@brtnfld, can you add your full test code file here?
It is also here:
https://github.com/brtnfld/hdf5/blob/ASYNC_F/fortran/testpar/async.F90
line 252 is the issue.
Got it. Is there a C version of this test code?
No, only Fortran.
@brtnfld I'm able to reproduce the error.
After some debugging, this appears to be an old issue that I thought had been resolved in HDF5. Either it is recurring or I did not test this case thoroughly before.
The issue comes from HDF5 checking whether an attribute is already open in H5Oattribute.c. It does not seem to handle the future IDs used by the async VOL when some attributes are already created/opened and some are not. I found two workarounds that avoid the error:
- Comment out lines 473-479 and 512 of H5Oattribute.c so that HDF5 does not check for already-opened attributes.
- Run "export HDF5_ASYNC_EXE_FCLOSE=1" before the test program. This forces the async VOL to hold off executing I/O operations until ESwait or Fclose is called, so the attribute IDs are true future IDs that the async VOL has not yet filled.
I forgot whether it was Neil or Jordan who looked at this issue before. Can you check with them and see if there is a better solution?
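The second workaround can be folded into the run script posted earlier in this thread. This is a minimal sketch that reuses the plugin path and connector string from that script (both are specific to that machine). The `launch_async_test` function name is made up for illustration.

```shell
#!/bin/bash
# Same VOL settings as the run script earlier in the thread; adjust paths as needed.
export HDF5_PLUGIN_PATH="$HOME/work/vol-async/build/lib"
export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}"
# Workaround: defer async execution until ESwait or Fclose is called.
export HDF5_ASYNC_EXE_FCLOSE=1

# Hypothetical helper: launch the test with the given number of ranks (default 6).
launch_async_test() {
    mpiexec -n "${1:-6}" ./async_test
}
```

Calling `launch_async_test 6` then runs the test with the workaround in effect. The library paths from the original script (LD_LIBRARY_PATH and so on) still need to be set as before.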
Also, the test code seems to always segfault at the end:
nid00074:testpar$ srun -n 6 ./async_test
H5ES API tests PASSED
H5A async API tests PASSED
srun: error: nid00074: tasks 0-4: Segmentation fault
srun: launch/slurm: _step_signal: Terminating StepId=520944.42
srun: error: nid00074: task 5: Segmentation fault
Thanks, I'll ask Jordan and Neil. I've not seen that segmentation fault before. Though I've only run it on a local desktop.
BTW, even if I add an ESwait after the last exists, it still fails.
@brtnfld does setting the environment variable work for you?
I don't think adding an ESwait would help; the issue seems to come from HDF5 checking the cached attribute.
Yes, HDF5_ASYNC_EXE_FCLOSE fixes the issue.
@houjun could you share more details of your debugging? Looking through the future ID code I'm having trouble understanding how this could happen.
Hi @fortnern, I tried two things in my debugging that seem to fix this issue. The first is to comment out the code in the HDF5 library (lines 473-479 and 512 of H5Oattribute.c) so that HDF5 doesn't check whether an attribute is already opened. The second is in vol-async: delay the execution of all attribute operations to a later time (e.g. file close time).
My guess is that something goes wrong when the library checks its cached attributes: either it doesn't like the future ID or there is some interference from vol-async. Interference seems unlikely, though, since only one thread can perform HDF5 operations at a time when thread safety is turned on.
@brtnfld @fortnern, can you check if the latest develop branch fixes all the Fortran test issues?
It passes most of the time, but running it over and over, I can sometimes get it to fail with:
async_test: ../../src/H5Fint.c:631: H5F__get_objects_cb: Assertion `obj_ptr' failed.
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x7f48829964e2 in ???
#1 0x7f4882995675 in ???
#2 0x7f4880518cff in ???
#3 0x7f4880518c6b in ???
#4 0x7f488051a304 in ???
#5 0x7f4880510c69 in ???
#6 0x7f4880510cf1 in ???
#7 0x7f4883466b0a in H5F__get_objects_cb
at ../../src/H5Fint.c:631
#8 0x7f4883536331 in H5I__iterate_cb
at ../../src/H5Iint.c:1526
#9 0x7f4883537c5a in H5I_iterate
at ../../src/H5Iint.c:1592
#10 0x7f4883466a03 in H5F__get_objects
at ../../src/H5Fint.c:599
#11 0x7f4883469ff4 in H5F_get_obj_count
at ../../src/H5Fint.c:475
#12 0x7f4883573ffd in H5O__attr_find_opened_attr
at ../../src/H5Oattribute.c:661
#13 0x7f488357539b in H5O__attr_open_by_name
at ../../src/H5Oattribute.c:473
#14 0x7f488334df3f in H5A__open
at ../../src/H5Aint.c:535
#15 0x7f488375e753 in H5VL__native_attr_open
at ../../src/H5VLnative_attr.c:158
#16 0x7f488373aaac in H5VL__attr_open
at ../../src/H5VLcallback.c:1104
#17 0x7f4883742b96 in H5VLattr_open
at ../../src/H5VLcallback.c:1175
#18 0x7f47f20c2737 in async_attr_open_fn
at /home/brtnfld/work/vol-async/src/h5_async_vol.c:5772
#19 0x7f47f209bc97 in ???
#20 0x7f47f20a1e78 in ???
#21 0xffffffffffffffff in ???
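Since the failure only shows up occasionally, a small loop makes "running it over and over" repeatable. This is a minimal sketch: `repeat_until_fail` is a hypothetical helper, not something shipped with HDF5 or vol-async.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to N times and stop at the first failure.
# Prints the failing iteration and returns that command's exit status.
repeat_until_fail() {
    local max="$1"; shift
    local i status
    for ((i = 1; i <= max; i++)); do
        status=0
        "$@" || status=$?
        if [ "$status" -ne 0 ]; then
            echo "failed on iteration $i with status $status" >&2
            return "$status"
        fi
    done
    echo "all $max runs passed"
    return 0
}
```

For example, `repeat_until_fail 50 mpiexec -n 6 ./async_test` stops at the first run that aborts, so the backtrace from that run is the last thing in the terminal.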
@brtnfld I think this is probably the same issue I mentioned earlier with the opened attribute. Did you set "export HDF5_ASYNC_EXE_FCLOSE=1"?
In my previous debugging, the issue seemed to come from searching the cached attributes in the library. My guess is that the (filled) future ID is not handled properly by the library. I'll see if I can find out more this week.
That was my mistake. It got removed when I edited the run script. With it set, all the tests pass.
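A guard at the top of the run script can catch a setting that was dropped during editing. This is a minimal sketch: `require_env` is a hypothetical helper, and the list of variables to check is an assumption based on the run script in this thread.

```shell
#!/bin/bash
# Hypothetical helper: fail early if any named environment variable is unset or empty.
require_env() {
    local name missing=0
    for name in "$@"; do
        # ${!name:-} is bash indirect expansion with an empty default.
        if [ -z "${!name:-}" ]; then
            echo "missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `require_env HDF5_PLUGIN_PATH HDF5_VOL_CONNECTOR HDF5_ASYNC_EXE_FCLOSE && mpiexec -n 6 ./async_test` refuses to launch if the workaround variable is missing.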