The wholegraph from rapidsai

Seg fault while running GraphStorm using WholeGraph

We are using WholeGraph in GraphStorm to train GNN models. We hit a segmentation fault when using WholeGraph with 8 trainers on 4 nodes. With 4 trainers on 4 nodes the error is rare, and clearing shared memory usually helps when it does occur. Could this be an OOM issue with 8 trainers? We could not reproduce the issue with the baseline (without WholeGraph).

[ip-172-31-20-63:49779:0:49779] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f96ef392140)
==== backtrace (tid:  49779) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000002a2e8 __pyx_f_15pylibwholegraph_7binding_19wholememory_binding_python_cb_wrapper_temp_free()  wholememory_binding.cxx:0
 2 0x000000000027428c wholememory_ops::wholememory_gather_nccl()  ???:0
 3 0x0000000000272b75 wholememory_gather()  ???:0
 4 0x00000000000e9225 wholememory::noncached_embedding::gather()  ???:0
 5 0x0000000000056fa2 __pyx_pw_15pylibwholegraph_7binding_19wholememory_binding_13EmbeddingGatherForward()  wholememory_binding.cxx:0
 6 0x000000000015372b _PyObject_MakeTpCall()  ???:0
 7 0x000000000014c0e7 _PyEval_EvalFrameDefault()  ???:0
 8 0x000000000015d4ec _PyFunction_Vectorcall()  ???:0
 9 0x0000000000145c14 _PyEval_EvalFrameDefault()  ???:0
10 0x000000000015d4ec _PyFunction_Vectorcall()  ???:0
11 0x0000000000146d6b _PyEval_EvalFrameDefault()  ???:0
12 0x000000000015d4ec _PyFunction_Vectorcall()  ???:0
13 0x0000000000145c14 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000016af11 PyMethod_New()  ???:0
15 0x0000000000146d6b _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015d4ec _PyFunction_Vectorcall()  ???:0
17 0x0000000000145a1d _PyEval_EvalFrameDefault()  ???:0
18 0x0000000000142176 _PyArg_ParseTuple_SizeT()  ???:0
19 0x0000000000237c56 PyEval_EvalCode()  ???:0
20 0x0000000000264b18 PyUnicode_Tailmatch()  ???:0
21 0x000000000025d96b PyInit__collections()  ???:0
22 0x0000000000264865 PyUnicode_Tailmatch()  ???:0
23 0x0000000000263d48 _PyRun_SimpleFileObject()  ???:0
24 0x0000000000263a43 _PyRun_AnyFileObject()  ???:0
25 0x0000000000254c3e Py_RunMain()  ???:0
26 0x000000000022abcd Py_BytesMain()  ???:0
27 0x0000000000029d90 __libc_init_first()  ???:0
28 0x0000000000029e40 __libc_start_main()  ???:0
29 0x000000000022aac5 _start()  ???:0
=================================
[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49778 closing signal SIGTERM
[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49780 closing signal SIGTERM
[[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49781 closing signal SIGTERM
14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:390: 
User pressed Ctrl+C, Exiting
[[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49782 closing signal SIGTERM
14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:390: 
User pressed Ctrl+C, Exiting
[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49783 closing signal SIGTERM
[14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:[[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49784 closing signal SIGTERM
390: 
User pressed Ctrl+C, Exiting
14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:390[: 
User pressed Ctrl+C, Exiting
[2023-10-20 14:30:07,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 49785 closing signal SIGTERM
14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:390: 
User pressed Ctrl+C, Exiting
[14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:390: 
User pressed Ctrl+C, Exiting
[14:30:07] /opt/dgl/dgl-source/src/rpc/rpc.cc:390: 
User pressed Ctrl+C, Exiting
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
[2023-10-20 14:30:37,147] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 49778 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2023-10-20 14:30:41,255] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 49780 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2023-10-20 14:30:48,381] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 49783 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2023-10-20 14:30:53,370] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 1 (pid: 49779) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/workspace/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-20_14:30:07
  host      : ip-172-31-20-63.us-west-2.compute.internal
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 49779)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 49779
============================================================

[wholegraph] CUDA 12 Conda Packages for ARM

Overview

CUDA 12.0 was successfully released on Conda-Forge for x86-64 systems in RAPIDS 23.08. This is a continuation of that work, bringing support to ARM systems with the target of RAPIDS 23.12.

RAPIDS Dependencies

[RMM] CUDA 12 Conda Packages for ARM rmm#1345 (feature request)
[RAFT] CUDA 12 Conda Packages for ARM raft#1834

Sample PRs

RMM
cuDF

how to use wholegraph

How do I use wholegraph in branch-23.10? I cannot find a guide for it.

use "random" from RAFT API

WM has its own RNG functions in wholegraph/random.cuh. This is unnecessary IMO; we should use RAFT's random namespace instead.

Meet problems while building the project

Hi, I am now trying to compile the wholegraph project. But I met some problems.

Here's my environment.

Need your help, thanks.

WG as a FeatureStore in cugraph-pyg

Work with cugraph-pyg team to get WG integrated as a feature store

WholeGraph Next Refactoring

Improve the initial refactoring

Tasks

Use RAFT communicator instead of a separate communicator in WholeMemory #4 (good first issue, tech-debt)
Use top-k from RAFT #5 (good first issue, tech-debt)
Usage of C10 API's in WholeMemory and its implication on build #6
use "random" from RAFT API #7 (good first issue, tech-debt)
use cudart_utils.cuh/error.hpp from RAFT #8 (good first issue, tech-debt)
WG as a FeatureStore in cugraph-pyg #47
Use RAFT's RngState on APIs, DeviceState within kernels, and use specialized distributions for generating numbers #23 (feature request)
WG Introduction BLOG #49 (doc)
WG Performance BLOG #50 (doc)

WG Performance BLOG

Create a BLOG on the performance of WG

Use top-k from RAFT

PR #3 introduces wholegraph/block_radix_topk.cuh which should be replaced with calls to the top-k kernels in RAFT.

Add docs to WholeGraph build

Create the base Sphinx files to capture the WholeGraph APIs, and have the docs built and published.

Is there any other way to install wholegraph if not using docker in branch-23.06?

use cudart_utils.cuh/error.hpp from RAFT

WM currently also has its own set of similar cuda-rt check macros here: wholegraph/macros.h. Those can be replaced with the corresponding headers (as mentioned in the title) from RAFT.

Use RAFT communicator instead of a separate communicator in WholeMemory

PR #3 introduces include/whole_memory.h which currently seems to have its own communicator. In the future, we should be using the raft communicator handle instead.

WG Introduction BLOG

Create a BLOG Introducing WholeGraph - what it does, how to get code, how to use.

This might need to be tied to the work on getting WG tied into cugraph-{dgl/pyg}.

WholeGraph packaging

Add the build process to produce Pip and Conda packages

Use RAFT's RngState on APIs, DeviceState within kernels, and use specialized distributions for generating numbers

Any API generating random numbers should use raft::random::RngState instead of the direct values for seed etc.
Implementations of these APIs should call RngState.advance accordingly so that they can be invoked several times without the user having to re-seed correctly (this is not trivial!)
Kernels should use DeviceState constructed from RngState in their signatures
Kernels should use <GeneratorClass>(device_state, subsequence) constructors instead of direct seed / subsequence constructors which are for advanced usage in RAFT
Kernels should call the appropriate distributions. For example, both kernels in unweighted_sample_without_replacement_func.cuh use bounded integer generation. They should use the appropriate RAFT distribution (https://github.com/rapidsai/raft/blob/branch-23.08/cpp/include/raft/random/detail/rng_device.cuh#L183 in this case) which are more optimized
Generation for unweighted_sample_without_replacement_kernel seems wrong: idx < M may be ==N which would lead to undefined behavior. Maybe this should be idx < N instead.
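
The advance requirement in the list above can be illustrated with a small host-side model. This is only a sketch in Python, not RAFT code: the RngState class below is a hypothetical stand-in that holds a seed and a base subsequence, sample_bounded is an illustrative API name, and Python's random.Random substitutes for a device-side generator.

```python
import random
from dataclasses import dataclass


@dataclass
class RngState:
    """Hypothetical host-side stand-in for a RAFT-style RNG state."""
    seed: int
    base_subsequence: int = 0

    def advance(self, max_streams):
        # Skip past every subsequence the previous call may have used.
        self.base_subsequence += max_streams


def make_generator(state, subsequence):
    # Each (seed, subsequence) pair gets its own reproducible stream.
    return random.Random(f"{state.seed}:{state.base_subsequence + subsequence}")


def sample_bounded(state, n_streams, upper):
    """Draw one integer in [0, upper) per stream, then advance the state."""
    out = [make_generator(state, s).randrange(upper) for s in range(n_streams)]
    state.advance(n_streams)
    return out


state = RngState(seed=42)
first = sample_bounded(state, n_streams=8, upper=1000)
second = sample_bounded(state, n_streams=8, upper=1000)  # no re-seeding needed
```

Because the API advances the state itself, the second call draws from subsequences 8 to 15 instead of reusing 0 to 7, and the caller never touches the seed.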

Refactor code

WholeMemoryGatherIntFunc - typo in whole_memory_embedding.cu ?

in line: https://github.com/rapidsai/wholegraph/blob/main/wholegraph/whole_memory_embedding.cu#L435

REGISTER_DISPATCH_THREE_TYPES(WholeMemoryGatherIntFunc,
                              WholeMemoryGatherFunc,
                              SINT,
                              SINT,
                              SINT3264)

should be REGISTER_DISPATCH_THREE_TYPES(WholeMemoryGatherIntFunc, WholeMemoryGatherIntFunc, ...) ?

Initial WholeGraph Refactored Release

Integrate WholeGraph into the RAPIDS process and refactor the code to adhere to RAPIDS standards and leverage RAPIDS tools.

Tasks


[Bug] WholeGraph communicators need to be reset in the end after the finalize call

WholeGraph should have a clean shutdown process once it has completed the finalization of the library. For example, after each finalize, users should be able to start/launch the WholeGraph again within the same process, i.e., the following should work

wgth.distributed_launch()
wgth.finalize()
wgth.distributed_launch()
wgth.finalize()
...

just like dist.init_process_group + dist.destroy_process_group + dist.init_process_group + dist.destroy_process_group + ...

Attached is a small repro script:

import torch
import time
import os
import argparse
import torch.distributed as dist
import pylibwholegraph.torch as wgth

class wholegraph_config:
    """Add/initialize default options required for a distributed launch incorporating wholegraph.

    NOTE: This class might be deprecated once wholegraph updates its configuration API.
    """
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)
        self.launch_env_name_world_rank = "RANK"
        self.launch_env_name_world_size = "WORLD_SIZE"
        self.launch_env_name_master_addr = "MASTER_ADDR"
        self.launch_env_name_master_port = "MASTER_PORT"
        self.launch_env_name_local_size = "LOCAL_WORLD_SIZE"
        self.launch_env_name_local_rank = "LOCAL_RANK"
        self.local_rank = int(os.environ["LOCAL_RANK"])
        self.local_size = int(os.environ["LOCAL_WORLD_SIZE"])

def create_wg_sparse_params(nnodes, embedding_dim, location='cpu'): # location = ['cpu'|'cuda']

    global_comm = wgth.comm.get_global_communicator()
    embedding_wholememory_type = 'distributed'
    embedding_wholememory_location = location
    dist_embedding = wgth.create_embedding(global_comm,
                                           embedding_wholememory_type,
                                           embedding_wholememory_location,
                                           torch.float32,
                                           [nnodes, embedding_dim],
                                           optimizer=None,
                                           cache_policy=None, # disable cache for now
                                           random_init=False,
                                           )
    return dist_embedding

def main_func():
    print(f"Rank={wgth.get_rank()}, local_rank={wgth.get_local_rank()}")
    global_comm, local_comm = wgth.init_torch_env_and_create_wm_comm(
        wgth.get_rank(),
        wgth.get_world_size(),
        wgth.get_local_rank(),
        wgth.get_local_size(),
    )
    # dummy operation
    create_wg_sparse_params(100, 100)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mode",
        type=str,
        default="pytorch",
        help="[pytorch | wholegraph].",
    )
    args = parser.parse_args()
    if args.mode == "pytorch":
        for i in range(5):
            torch.cuda.set_device(int(os.environ['LOCAL_RANK']))
            dist.init_process_group(backend="nccl")
            device = torch.cuda.current_device()
            print("starting pytorch")
            # dummy operation
            t = torch.rand(100, 100, device=device)
            dist.all_reduce(t)
            print("finalizing pytorch")
            dist.destroy_process_group()
            time.sleep(1)
    elif args.mode == "wholegraph":
        for i in range(5):
            torch.cuda.set_device(int(os.environ['LOCAL_RANK']))
            config = wholegraph_config(launch_agent="pytorch")
            wgth.distributed_launch(config, main_func)
            wgth.finalize()
            if dist.is_initialized():
                dist.destroy_process_group() # this line needs to be included in wgth.finalize()
            # needs to reset communicators at end
            time.sleep(1)

If I run it with torchrun --standalone --nproc_per_node=1 test.py --mode wholegraph, it crashes with the following stack trace:

Rank=0, local_rank=0
Rank=0, local_rank=0
[7e46ae2ab5f8:30226:0:30226] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x55e84e98cae6)
==== backtrace (tid:  30226) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000162f8d wholememory::nccl_comms::host_allgather()  ???:0
 2 0x000000000014ca3a wholememory::negotiate_handle_id_with_comm_locked()  ???:0
 3 0x000000000014e8d2 wholememory::create_wholememory()  ???:0
 4 0x000000000018d3fb wholememory_create_tensor()  ???:0
 5 0x000000000013af56 wholememory::embedding_base::allocate()  ???:0
 6 0x000000000013bdd4 wholememory_create_embedding()  ???:0
 7 0x0000000000064520 __pyx_pw_15pylibwholegraph_7binding_19wholememory_binding_22PyWholeMemoryEmbedding_3create_embedding()  wholememory_binding.cxx:0
 8 0x000000000005bc69 __pyx_pw_15pylibwholegraph_7binding_19wholememory_binding_11create_embedding()  wholememory_binding.cxx:0
 9 0x0000000000148cfa _PyEval_EvalFrameDefault()  ???:0
10 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
11 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
12 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
13 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0

I think this happens because the communicators declared below are not reset after each run.

wholegraph/python/pylibwholegraph/pylibwholegraph/torch/comm.py

Lines 25 to 27 in 0c61783

    
global_communicators = {}
local_node_communicator = None
local_device_communicator = None
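
A minimal sketch of the reset being requested, assuming a simplified module-level cache. The three variable names mirror the ones quoted above, but get_global_communicator, reset_communicators and finalize as written here are hypothetical stand-ins, not the actual pylibwholegraph implementation, and object() substitutes for a real communicator.

```python
# Hypothetical model of module-level communicator state; not the real comm.py.
global_communicators = {}
local_node_communicator = None
local_device_communicator = None


def get_global_communicator(backend="nccl"):
    """Create the communicator lazily and cache it per backend."""
    if backend not in global_communicators:
        global_communicators[backend] = object()  # placeholder communicator
    return global_communicators[backend]


def reset_communicators():
    """Drop cached handles so the next launch creates fresh ones."""
    global local_node_communicator, local_device_communicator
    global_communicators.clear()
    local_node_communicator = None
    local_device_communicator = None


def finalize():
    # A real finalize would destroy the underlying communicators first.
    reset_communicators()


first = get_global_communicator()
finalize()
second = get_global_communicator()  # a new object, not the stale cached one
```

Without the reset, the second launch would get the cached handle from the first run, which would explain a crash in the first collective call.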

Destroy the entire temp_memory handle before each allocation can be error-prone

Hi,

Within the temp_memory_handle class, can we avoid completely freeing the object each time a *_malloc() function is invoked? I thought that, by design, users/developers should be able to allocate memory multiple times using the same set of env_fns or the same handle object, like:

{
  temp_memory_handle  scratch_space(env_fns);
  ptr = scratch_space.device_malloc();
  /*  do something */
  ptr = scratch_space.device_malloc();  // calling free_memory() before malloc_fn()
  /* do something */
}

However, the free_memory() function serves as the destructor of the entire handle object, for example here:

wholegraph/cpp/src/wholememory_ops/temp_memory_handle.hpp

Lines 32 to 37 in a0ef0d2

    
void* device_malloc(size_t elt_count, wholememory_dtype_t data_type)
{
  free_memory();
  wholememory_tensor_description_t tensor_description;
  get_tensor_description(&tensor_description, elt_count, data_type);
  ptr_ = temp_mem_fns_->malloc_fn(

It not only frees the memory but also deletes the memory context. Thus, the above code snippet would crash with a segfault when calling the *malloc_fn function (memory_context_ becomes nullptr after each free_memory call).

Therefore, I suggest adding a new free_data function that deallocates only the memory, not the memory context, and calling it before each *_malloc() call.
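
The proposed split can be sketched as follows. This is a Python model written for brevity, not the C++ class itself: TempMemoryHandle, its callbacks and the dictionary standing in for memory_context_ are all hypothetical, and only the free_data / free_memory separation is the point.

```python
class TempMemoryHandle:
    """Hypothetical Python model of the C++ temp_memory_handle."""

    def __init__(self, malloc_fn, free_fn):
        self._malloc_fn = malloc_fn
        self._free_fn = free_fn
        self._context = {"live": True}  # stands in for memory_context_
        self._ptr = None

    def free_data(self):
        # Release only the current buffer; keep the context for reuse.
        if self._ptr is not None:
            self._free_fn(self._context, self._ptr)
            self._ptr = None

    def free_memory(self):
        # Full teardown of buffer and context; meant for destruction only.
        self.free_data()
        self._context = None

    def device_malloc(self, elt_count):
        if self._context is None:
            raise RuntimeError("handle already destroyed")
        self.free_data()  # not free_memory(), so the context survives
        self._ptr = self._malloc_fn(self._context, elt_count)
        return self._ptr


freed = []
handle = TempMemoryHandle(
    malloc_fn=lambda ctx, n: bytearray(n),
    free_fn=lambda ctx, ptr: freed.append(len(ptr)),
)
a = handle.device_malloc(16)
b = handle.device_malloc(32)  # second allocation reuses the same context
handle.free_memory()
```

With this split, repeated allocations on one handle release the previous buffer but never invalidate the context that malloc_fn depends on.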

Usage of C10 API's in WholeMemory and its implication on build

@BradReesWork JFYI...
Currently, WM directly uses C10 APIs from PyTorch, e.g. wholegraph/torch/whole_nccl_pytorch_tensor.h. This means that, similar to the pytorch backend of cugraph-ops, we'll also have to think about the topic of CXX_ABI during its build! This can become especially complicated when WM moves inside cuGraph.

[Bug] [Test] In one of test util function, host sampling for all neighbors returns wrong results

In the test utility function host_sample_all_neighbors, the output output_center_localid_tensor contains the global node id.

wholegraph/python/pylibwholegraph/pylibwholegraph/test_utils/test_comm.py

Line 106 in cc633b9

output_center_localid_tensor[output_id + j] = node_id

However, the definitions in the following places indicate otherwise:

wholegraph/cpp/include/wholememory/wholegraph_op.h

Line 64 in cc633b9

    
            * @param output_center_localid_memory_context : memory context to output center local id

wholegraph/python/pylibwholegraph/pylibwholegraph/tests/wholegraph_torch/ops/test_wholegraph_unweighted_sample_without_replacement.py

Line 142 in cc633b9

output_center_localid_tensor[output_id + j] = i

That is, output_center_localid_tensor should hold the local id of the center node.

As a result, the unit test test_wholegraph_unweighted_sample would fail when max_sample_count = -1 (full neighbor sampling).
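
A minimal host-side sketch of the expected behavior, assuming a CSR graph held in plain Python lists. The function signature below is illustrative and does not match the real test utility; the point is that the local id is the position i of the center in the input, not its global node_id.

```python
def host_sample_all_neighbors(csr_row_ptr, csr_col_ind, center_nodes):
    """Return every neighbor of each center node, plus the center's local index."""
    output_sample_offset = [0]
    output_dest = []
    output_center_localid = []
    for i, node_id in enumerate(center_nodes):
        start, end = csr_row_ptr[node_id], csr_row_ptr[node_id + 1]
        for j in range(start, end):
            output_dest.append(csr_col_ind[j])
            # Local id: position of the center in center_nodes, not node_id.
            output_center_localid.append(i)
        output_sample_offset.append(len(output_dest))
    return output_sample_offset, output_dest, output_center_localid


# Tiny graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {}, 3 -> {0, 1, 2}
row_ptr = [0, 2, 3, 3, 6]
col_ind = [1, 2, 2, 0, 1, 2]
offsets, dest, local_ids = host_sample_all_neighbors(row_ptr, col_ind, [3, 0])
# local_ids is [0, 0, 0, 1, 1]; writing node_id instead would give [3, 3, 3, 0, 0]
```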

explicitly include `cstdlib`

The code calls abort() in a few places without explicitly including the matching header. With the Dockerfile provided in this repository, this does not cause any issues. To allow more straightforward builds from source in other environments, it would be better to include those headers explicitly, e.g. cstdlib.

wholegraph: CUDA 12 testing

Currently CUDA 12 testing on wholegraph is skipped

wholegraph/ci/test_python.sh

Line 17 in aafd5be

rapids-logger "Exiting CUDA 12 due to no pytorch stable yet"

Would be interested in understanding why that is the case and figuring out how we can enable testing here

Replace all the GNN layer ops kernels with the corresponding APIs from cugraph-ops

Most, if not all, of the GNN layer ops for which WholeGraph has its own kernels should already be covered by cugraph-ops. Thus, there's a lot of duplicated code between these 2 projects. We must further refactor the WG GNN layer ops to use the ones from cugraph-ops instead of maintaining specific kernels for them in WG.

If not attempted already, please prepare a plan to replace these kernels with the corresponding calls from cugraph-ops in the next release cycle?
If attempted, can you file issues against cugraphops for any missing features that are preventing this from happening?

Rename test_utils subpackage and test_utils/test_comms.py to avoid pytest conflicts

pytest will assume that any file or directory named test_* is a test (see pytest's discovery rules). wholegraph currently puts testing utilities into a file that fits this description, which causes various issues with test discovery (and import order because test_utils is a package and not just a module, which causes additional problems for pytest) depending on exactly what path is used to run the tests. We worked around some of these in #128, but it would be best if this subpackage and the test_comms.py file it contains could be renamed with a different prefix like testing to avoid this conflict.
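
The file-name part of the conflict can be checked with a short sketch. It assumes pytest's default python_files patterns (test_*.py and *_test.py) and uses fnmatch only to approximate the matching; testing_comm.py is a hypothetical name for the renamed file.

```python
from fnmatch import fnmatch

# pytest's default `python_files` patterns.
DEFAULT_PYTHON_FILES = ("test_*.py", "*_test.py")


def looks_like_test_file(filename):
    """True if pytest's default patterns would treat this file name as a test."""
    return any(fnmatch(filename, pattern) for pattern in DEFAULT_PYTHON_FILES)


candidates = ["test_comm.py", "testing_comm.py", "comm_utils.py"]
collected = [name for name in candidates if looks_like_test_file(name)]
# Only "test_comm.py" matches; the "testing" prefix avoids the pattern.
```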

[Bug] WholeMemoryEmbedding gather operation does not work with integer-type of embeddings

🐛 Bug

When creating a WholeMemoryEmbedding instance with memory_type=distributed, the code would crash with dtype=int64 or int32 (working fine with fp32 and fp64).

To Reproduce

Minimum code to reproduce:

feat_size=111059956
feat_dim=1
dtype = torch.int64
node_feat_wm_embedding = wgth.create_embedding(
        global_comm,
        "distributed",
        "cpu",
        dtype,
        [feat_size, feat_dim],
)
sampled_nodes = 128000
input_nodes = torch.randint(0, feat_size, (sampled_nodes,), dtype=torch.int64, device=device)
x = node_feat_wm_embedding.gather(input_nodes)

WholeMemory failure at file=/opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu line=55: File /opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu, line 55, it != ScatterFuncIntegerInt64_dispatch2_map->end() check failed.
WholeMemory failure at file=/opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu line=55: File /opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu, line 55, it != ScatterFuncIntegerInt64_dispatch2_map->end() check failed.
WholeMemory failure at file=/opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu line=55: File /opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu, line 55, it != ScatterFuncIntegerInt64_dispatch2_map->end() check failed.
WholeMemory failure at file=/opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu line=55: File /opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu, line 55, it != ScatterFuncIntegerInt64_dispatch2_map->end() check failed.
[2023-09-19 20:43:35,753] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1203 closing signal SIGTERM
[2023-09-19 20:43:36,017] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 1202) of binary: /usr/bin/python

Environment

Version: 23.08
Source build with bash build.sh libwholegraph pylibwholegraph tests -v --allgpuarch

Additional notes:

It seems like a bug to me... For integer type of data, scatter func impl should dispatch from int types, instead of float types (HALF-FLOAT-DOUBLE), right?

wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu

Lines 40 to 41 in 2e963b9

    
HALF_FLOAT_DOUBLE,
HALF_FLOAT_DOUBLE)
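
The failed check in the log is consistent with a lookup table that was populated for the wrong type group. The sketch below models that in Python under stated assumptions: register_dispatch2 and dispatch are hypothetical helpers (the real macro expansion is not shown in this report), and the dtype names and the contents of INT_TYPES are illustrative.

```python
FLOAT_TYPES = ("float16", "float32", "float64")  # models HALF_FLOAT_DOUBLE
INT_TYPES = ("int8", "int16", "int32", "int64")  # assumed integer group


def register_dispatch2(func, first_types, second_types):
    """Build a lookup table keyed by (first dtype, second dtype)."""
    return {(a, b): func for a in first_types for b in second_types}


def dispatch(table, first_dtype, second_dtype):
    func = table.get((first_dtype, second_dtype))
    if func is None:
        # Mirrors the "it != ...->end() check failed" condition.
        raise LookupError(f"no kernel for ({first_dtype}, {second_dtype})")
    return func


def scatter_integer(*args):
    return "scatter_integer"


# Integer kernel registered under float types: integer lookups cannot succeed.
buggy = register_dispatch2(scatter_integer, FLOAT_TYPES, FLOAT_TYPES)
# Registered under integer types: the int64 lookup finds the kernel.
fixed = register_dispatch2(scatter_integer, INT_TYPES, INT_TYPES)
```

Calling dispatch(buggy, "int64", "int64") raises LookupError, while the same call on fixed returns the kernel.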

[Performance] Remove unnecessary synchronization using thrust::cuda::par_nosync policy

We always launch Thrust calls on a CUDA stream with the thrust::cuda::par policy, which performs an extra stream synchronization inside each Thrust call, e.g.,

wholegraph/cpp/src/wholememory_ops/functions/exchange_ids_nccl_func.cu

Line 63 in 9f290c4

    
           thrust::cuda::par(allocator).on(stream), seq_indices, seq_indices + indices_desc.size, 0);

wholegraph/cpp/src/wholegraph_ops/unweighted_sample_without_replacement_func.cuh

Line 340 in 9f290c4

thrust::exclusive_scan(thrust::cuda::par(thrust_allocator).on(stream),

It would be better to change to thrust::cuda::par_nosync, to make it easier to overlap with other operations.
