[ip-172-31-20-63:49779:0:49779] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f96ef392140)
==== backtrace (tid: 49779) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000002a2e8 __pyx_f_15pylibwholegraph_7binding_19wholememory_binding_python_cb_wrapper_temp_free() wholememory_binding.cxx:0
2 0x000000000027428c wholememory_ops::wholememory_gather_nccl() ???:0
3 0x0000000000272b75 wholememory_gather() ???:0
4 0x00000000000e9225 wholememory::noncached_embedding::gather() ???:0
5 0x0000000000056fa2 __pyx_pw_15pylibwholegraph_7binding_19wholememory_binding_13EmbeddingGatherForward() wholememory_binding.cxx:0
6 0x000000000015372b _PyObject_MakeTpCall() ???:0
7 0x000000000014c0e7 _PyEval_EvalFrameDefault() ???:0
8 0x000000000015d4ec _PyFunction_Vectorcall() ???:0
9 0x0000000000145c14 _PyEval_EvalFrameDefault() ???:0
10 0x000000000015d4ec _PyFunction_Vectorcall() ???:0
11 0x0000000000146d6b _PyEval_EvalFrameDefault() ???:0
12 0x000000000015d4ec _PyFunction_Vectorcall() ???:0
13 0x0000000000145c14 _PyEval_EvalFrameDefault() ???:0
14 0x000000000016af11 PyMethod_New() ???:0
15 0x0000000000146d6b _PyEval_EvalFrameDefault() ???:0
16 0x000000000015d4ec _PyFunction_Vectorcall() ???:0
17 0x0000000000145a1d _PyEval_EvalFrameDefault() ???:0
18 0x0000000000142176 _PyArg_ParseTuple_SizeT() ???:0
19 0x0000000000237c56 PyEval_EvalCode() ???:0
20 0x0000000000264b18 PyUnicode_Tailmatch() ???:0
21 0x000000000025d96b PyInit__collections() ???:0
22 0x0000000000264865 PyUnicode_Tailmatch() ???:0
23 0x0000000000263d48 _PyRun_SimpleFileObject() ???:0
24 0x0000000000263a43 _PyRun_AnyFileObject() ???:0
25 0x0000000000254c3e Py_RunMain() ???:0
26 0x000000000022abcd Py_BytesMain() ???:0
27 0x0000000000029d90 __libc_init_first() ???:0
28 0x0000000000029e40 __libc_start_main() ???:0
29 0x000000000022aac5 _start() ???:0
=================================
[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49778 closing signal SIGTERM
[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49780 closing signal SIGTERM
[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49781 closing signal SIGTERM
[14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49782 closing signal SIGTERM
[14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49783 closing signal SIGTERM
[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49784 closing signal SIGTERM
[14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
[14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49785 closing signal SIGTERM
[14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
[14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
[14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
[2023-10-20 14:30:37,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 49778 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2023-10-20 14:30:41,255] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 49780 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2023-10-20 14:30:48,381] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 49783 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2023-10-20 14:30:53,370] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 1 (pid: 49779) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/workspace/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-20_14:30:07
  host      : ip-172-31-20-63.us-west-2.compute.internal
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 49779)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 49779
============================================================