
comb's Introduction

Comb v0.3.1

Comb is a communication performance benchmarking tool. It is used to determine performance tradeoffs in implementing communication patterns on high performance computing (HPC) platforms. At its core, Comb runs combinations of communication patterns, execution patterns, and memory spaces in order to find efficient combinations. The current set of capabilities Comb provides includes:

  • Configurable structured mesh halo exchange communication.
  • A variety of communication patterns based on grouping messages.
  • A variety of execution patterns including serial, OpenMP threading, CUDA, and CUDA fused kernels.
  • A variety of memory spaces including default system allocated memory, pinned host memory, CUDA device memory, and CUDA managed memory with different CUDA memory advice.

It is important to note that Comb is very much a work-in-progress. Additional features will appear in future releases.

Quick Start

The Comb code lives in a GitHub repository. To clone the repo, use the command:

git clone --recursive https://github.com/llnl/comb.git

On an LC (Livermore Computing) system you can build Comb using the provided CMake scripts and host-configs.

./scripts/lc-builds/blueos_nvcc_gcc.sh 10.1.243 sm_70 8.3.1
cd build_lc_blueos-nvcc10.1.243-sm_70-gcc8.3.1
make

You can also create your own script and host-config, provided you have a C++ compiler that supports the C++11 standard, an MPI library with a compiler wrapper, and, optionally, an installation of CUDA 9.0 or later.

./scripts/my-builds/compiler_version.sh
cd build_my_compiler_version
make

To run basic tests, create a run directory and add symlinks to the comb executable and scripts. The scripts expect a symlink to comb to exist in the run directory. The run_tests.bash script runs the basic_tests.bash script on 2^3 = 8 processes.

mkdir 2_2_2
cd 2_2_2
ln -s /path/to/comb/build_my_compiler_version/bin/comb .
ln -s /path/to/comb/scripts/* .
./run_tests.bash 2 basic_tests.bash

User Documentation

Minimal documentation is available.

Comb runs every enabled combination of execution pattern and memory space. Each rank prints its results to stdout. The sep_out.bash script may be used to simplify data collection by piping the output of each rank into a separate file. The combine_output.lua script may be used to simplify data aggregation from multiple files.
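
For example, per-rank output might be collected into separate files and then aggregated roughly as follows (an illustrative sketch; the exact arguments accepted by sep_out.bash and combine_output.lua are assumptions here, so check the scripts themselves):

mpirun -n 8 ./sep_out.bash ./comb 100_100_100 -divide 2_2_2
lua ./combine_output.lua Comb_*_proc*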

Comb uses a variety of manual packing/unpacking execution techniques such as sequential, OpenMP, and CUDA. Comb also uses MPI_Pack/MPI_Unpack with MPI derived datatypes for packing/unpacking. (Note: tests using CUDA managed memory with MPI datatypes are disabled, as they sometimes produce incorrect results.)

Comb creates a different MPI communicator for each test. This communicator is assigned a generic name unless MPI datatypes are used for packing and unpacking. When MPI datatypes are used, the name of the memory allocator is appended to the communicator name.

Configure Options

The CMake configuration options determine which execution patterns and memory spaces are enabled.

  • ENABLE_MPI Allow use of MPI and enable test combinations using MPI
  • ENABLE_OPENMP Allow use of OpenMP and enable test combinations using OpenMP
  • ENABLE_CUDA Allow use of CUDA and enable test combinations using CUDA
  • ENABLE_RAJA Allow use of the RAJA performance portability library
  • ENABLE_CALIPER Allow use of the Caliper performance profiling library
  • ENABLE_ADIAK Allow use of the Adiak library for recording program metadata

Runtime Options

The runtime options change the properties of the grid and its decomposition, as well as the communication pattern used. An example command combining several of these options follows the list.

  • #_#_# Grid size in each dimension (Required)
  • -divide #_#_# Number of subgrids in each dimension (Required)
  • -periodic #_#_# Periodicity in each dimension
  • -ghost #_#_# The halo width or number of ghost zones in each dimension
  • -vars # The number of grid variables
  • -comm option Communication options
    • cutoff # Number of elements cutoff between large and small message packing kernels
    • enable|disable option Enable or disable specific message passing execution policies
      • all all message passing execution patterns
      • mock mock message passing execution pattern (do not communicate)
      • mpi mpi message passing execution pattern
      • gdsync libgdsync message passing execution pattern (experimental)
      • gpump libgpump message passing execution pattern
      • mp libmp message passing execution pattern (experimental)
      • umr umr message passing execution pattern (experimental)
    • post_recv option Communication post receive (MPI_Irecv) options
      • wait_any Post recvs one-by-one
      • wait_some Post recvs in groups
      • wait_all Post all recvs
      • test_any Post recvs one-by-one
      • test_some Post recvs in groups
      • test_all Post all recvs
    • post_send option Communication post send (MPI_Isend) options
      • wait_any pack and send messages one-by-one
      • wait_some pack messages then send them in groups
      • wait_all pack all messages then send them all
      • test_any pack messages asynchronously and send when ready
      • test_some pack multiple messages asynchronously and send when ready
      • test_all pack all messages asynchronously and send when ready
    • wait_recv option Communication wait to recv and unpack (MPI_Wait, MPI_Test) options
      • wait_any recv and unpack messages one-by-one (MPI_Waitany)
      • wait_some recv messages then unpack them in groups (MPI_Waitsome)
      • wait_all recv all messages then unpack them all (MPI_Waitall)
      • test_any recv and unpack messages one-by-one (MPI_Testany)
      • test_some recv messages then unpack them in groups (MPI_Testsome)
      • test_all recv all messages then unpack them all (MPI_Testall)
    • wait_send option Communication wait on sends (MPI_Wait, MPI_Test) options
      • wait_any Wait for each send to complete one-by-one (MPI_Waitany)
      • wait_some Wait for all sends to complete in groups (MPI_Waitsome)
      • wait_all Wait for all sends to complete (MPI_Waitall)
      • test_any Wait for each send to complete one-by-one by polling (MPI_Testany)
      • test_some Wait for all sends to complete in groups by polling (MPI_Testsome)
      • test_all Wait for all sends to complete by polling (MPI_Testall)
    • allow|disallow option Allow or disallow specific communications options
      • per_message_pack_fusing Combine packing/unpacking kernels for boundaries communicated in the same message
      • message_group_pack_fusing Fuse packing/unpacking kernels across messages (and variables) in the same message group
  • -cycles # Number of times the communication pattern is tested
  • -omp_threads # Number of openmp threads requested
  • -exec option Execution options
    • enable|disable option Enable or disable specific execution patterns
      • all all execution patterns
      • seq sequential CPU execution pattern
      • omp openmp threaded CPU execution pattern
      • cuda cuda GPU execution pattern
      • cuda_graph cuda GPU batched via cuda graph API execution pattern
      • hip hip GPU execution pattern
      • raja_seq RAJA sequential CPU execution pattern
      • raja_omp RAJA openmp threaded CPU execution pattern
      • raja_cuda RAJA cuda GPU execution pattern
      • raja_hip RAJA hip GPU execution pattern
      • mpi_type MPI datatypes MPI implementation execution pattern
  • -memory option Memory space options
    • UseType enable|disable Optional UseType modifier for enable|disable, default is all. UseType specifies what uses to enable|disable, for example "-memory buffer disable cuda_pinned" disables cuda_pinned buffer allocations.
      • all all use types
      • mesh mesh use type
      • buffer buffer use type
    • enable|disable option Enable or disable specific memory spaces for UseType allocations
      • all all memory spaces
      • host host CPU memory space
      • cuda_hostpinned cuda pinned memory space (pooled)
      • cuda_device cuda device memory space (pooled)
      • cuda_managed cuda managed memory space (pooled)
      • cuda_managed_host_preferred cuda managed with host preferred advice memory space (pooled)
      • cuda_managed_host_preferred_device_accessed cuda managed with host preferred and device accessed advice memory space (pooled)
      • cuda_managed_device_preferred cuda managed with device preferred advice memory space (pooled)
      • cuda_managed_device_preferred_host_accessed cuda managed with device preferred and host accessed advice memory space (pooled)
      • hip_hostpinned hip pinned memory space (pooled)
      • hip_hostpinned_coarse hip coarse grained (non-coherent) pinned memory space (pooled)
      • hip_device hip device memory space (pooled)
      • hip_device_fine hip fine grained device memory space (pooled)
      • hip_managed hip managed memory space (pooled)
  • -cuda_aware_mpi Assert that you are using a cuda aware mpi implementation and enable tests that pass cuda device or managed memory to MPI
  • -hip_aware_mpi Assert that you are using a hip aware mpi implementation and enable tests that pass hip device or managed memory to MPI
  • -cuda_host_accessible_from_device Assert that your system supports pageable host memory access from the device and enable tests that access pageable host memory on the device
  • -use_device_preferred_for_cuda_util_aloc Use device preferred host accessed memory for cuda utility allocations instead of host pinned memory, mainly affects fused kernels
  • -use_device_for_hip_util_aloc Use device memory for hip utility allocations instead of host pinned memory, mainly affects fused kernels
  • -print_packing_sizes Print message and packing sizes to proc files
  • -print_message_sizes Print message sizes to proc files
  • -caliper_config Caliper performance profiling config (e.g., "runtime-report")
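
Putting several of these options together, a typical invocation looks like the following (an illustrative command line modeled on the test scripts; adjust the MPI launcher, process count, and option values to your system):

mpirun -n 8 ./comb 100_100_100 -divide 2_2_2 -periodic 1_1_1 -ghost 1_1_1 -vars 3 -cycles 25 -comm cutoff 250 -comm enable mpi -exec enable seq -memory enable host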

Example Script

The run_tests.bash script is an example that allocates resources and uses a test script, such as focused_tests.bash, to run the code in a variety of configurations. The run_tests.bash script takes two arguments: the number of processes per side used to split the grid into an N x N x N decomposition, and the test script.

mkdir 1_1_1
cd 1_1_1
ln -s path/to/comb/build/bin/comb .
ln -s path/to/comb/scripts/* .
./run_tests.bash 1 focused_tests.bash

The scale_tests.bash script, used with run_tests.bash, shows the options available and how the code may be run with multiple sets of arguments with MPI. Similarly, the focused_tests.bash script shows how the code may be run with one set of arguments with MPI.
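
For instance, to run the scaling tests on a 2 x 2 x 2 decomposition (2^3 = 8 processes), the same pattern as above applies (a sketch assuming the same symlink setup):

mkdir 2_2_2
cd 2_2_2
ln -s path/to/comb/build/bin/comb .
ln -s path/to/comb/scripts/* .
./run_tests.bash 2 scale_tests.bash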

Output

Comb outputs Comb_(number)_summary and Comb_(number)_proc(number) files. The summary file contains aggregated results from the proc files, which contain per-process results. The files contain the argument and code setup information and the results of multiple tests. The results for each test follow a line starting with "Starting test" and the name of the test.

The first set of tests consists of memory copy tests with names of the following form:

Starting test memcpy (execution policy) dst (destination memory space) src (source memory space)
copy_sync-(number of variables)-(elements per variable)-(bytes per element): num (number of repeats) avg (time) s min (time) s max (time) s

Example:

Starting test memcpy seq dst Host src Host
copy_sync-3-1061208-8: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s

This is a test in which memory is copied via sequential CPU execution to one host memory buffer from another host memory buffer. The test involves one measurement:

copy_sync-3-1061208-8 Copying 3 buffers of 1061208 elements of 8 bytes each.

The second set of tests consists of the message passing tests with names of the following form:

Comm (message passing execution policy) Mesh (physics execution policy) (mesh memory space) Buffers (large message execution policy) (large message memory space) (small message execution policy) (small message memory space)
(test phase): num (number of repeats) avg (time) s min (time) s max (time) s
...

Example:

Comm mpi Mesh seq Host Buffers seq Host seq Host
pre-comm:  num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
post-recv: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
post-send: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
wait-recv: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
wait-send: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
post-comm: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
start-up:   num 8 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
test-comm:  num 8 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
bench-comm: num 8 avg 0.123456789 s min 0.123456789 s max 0.123456789 s

This is a test in which a mesh is updated with physics running via sequential CPU execution using memory allocated in host memory. The buffers used for large messages are packed/unpacked via sequential CPU execution and allocated in host memory, and the buffers used with MPI for small messages are packed/unpacked via sequential CPU execution and allocated in host memory. This test involves multiple measurements; the first six time individual parts of the physics cycle and communication.

  • pre-comm "Physics" before point-to-point communication, in this case setting memory to initial values.
  • post-recv Allocating MPI receive buffers and calling MPI_Irecv.
  • post-send Allocating MPI send buffers, packing buffers, and calling MPI_Isend.
  • wait-recv Waiting to receive MPI messages, unpacking MPI buffers, and freeing MPI receive buffers.
  • wait-send Waiting for MPI send messages to complete and freeing MPI send buffers.
  • post-comm "Physics" after point-to-point communication, in this case resetting memory to initial values. The final three measure problem setup, correctness testing, and total benchmark time.
  • start-up Setting up mesh and point-to-point communication.
  • test-comm Testing correctness of point-to-point communication.
  • bench-comm Running benchmark, starts after an initial MPI_Barrier and ends after a final MPI_Barrier.

Execution Policies

  • seq Sequential CPU execution
  • omp Parallel CPU execution via OpenMP
  • cuda Parallel GPU execution via cuda
  • cudaGraph Parallel GPU execution via cuda graphs
  • hip Parallel GPU execution via hip
  • raja_seq RAJA Sequential CPU execution
  • raja_omp RAJA Parallel CPU execution via OpenMP
  • raja_cuda RAJA Parallel GPU execution via cuda
  • raja_hip RAJA Parallel GPU execution via hip
  • mpi_type Packing or unpacking execution done via mpi datatypes used with MPI_Pack/MPI_Unpack

Note: The cudaGraph exec policy updates the graph each cycle. There is currently no option to use the same graph for every cycle.

Memory Spaces

  • Host CPU memory (malloc)
  • HostPinned Cuda/Hip Pinned CPU memory pool (cudaHostAlloc/hipMallocHost)
  • Device Cuda/Hip GPU memory pool (cudaMalloc/hipMalloc)
  • Managed Cuda/Hip Managed GPU memory pool (cudaMallocManaged/hipMallocManaged)
  • ManagedHostPreferred Cuda Managed CPU Pinned memory pool (cudaMallocManaged + cudaMemAdviseSetPreferredLocation cudaCpuDeviceId)
  • ManagedHostPreferredDeviceAccessed Cuda Managed CPU Pinned memory pool (cudaMallocManaged + cudaMemAdviseSetPreferredLocation cudaCpuDeviceId + cudaMemAdviseSetAccessedBy 0)
  • ManagedDevicePreferred Cuda Managed GPU memory pool (cudaMallocManaged + cudaMemAdviseSetPreferredLocation 0)
  • ManagedDevicePreferredHostAccessed Cuda Managed GPU memory pool (cudaMallocManaged + cudaMemAdviseSetPreferredLocation 0 + cudaMemAdviseSetAccessedBy cudaCpuDeviceId)

Note: Some memory spaces are pooled. This is done to amortize the cost of allocation. After the first allocation, the cost of allocating memory should be trivial for pooled memory spaces. The first allocation is done in a warmup step and is not included in any timers.

Related Software

The RAJA Performance Suite contains a collection of loop kernels implemented in multiple RAJA and non-RAJA variants. We use it to monitor and assess RAJA performance on different platforms using a variety of compilers.

The RAJA Proxies repository contains RAJA versions of several important HPC proxy applications.

Contributions

The Comb team follows the GitFlow development model. Folks wishing to contribute to Comb should include their work in a feature branch created from the Comb develop branch. Then, create a pull request with the develop branch as the destination. That branch contains the latest work in Comb. Periodically, we will merge the develop branch into the master branch and tag a new release.

Authors

Thanks to all of Comb's contributors.

Comb was created by Jason Burmark ([email protected]).

Release

Comb is released under an MIT license. For more details, please see the LICENSE, RELEASE, and NOTICE files.

LLNL-CODE-758885

comb's People

Contributors

daboehme, mrburmark, pozulp

comb's Issues

MPI_Pack with device memory

Hi,

I have built this on a system with a single GPU that I would like to share between two MPI ranks (just for the sake of getting things up and running).
The build basically follows ubuntu_nvcc10_gcc8, adjusted for gcc 10.
I built commit e06e54d (the latest at the time of writing).

-- The CXX compiler identification is GNU 10.2.0
-- Check for working CXX compiler: /usr/bin/g++-10
-- Check for working CXX compiler: /usr/bin/g++-10 - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- BLT Version: 0.3.0
-- CMake Version: 3.17.5
-- CMake Executable: /home/pearson/software/cmake-3.17.5/bin/cmake
-- Found Git: /usr/bin/git (found version "2.28.0") 
-- Git Support is ON
-- Git Executable: /usr/bin/git
-- Git Version: 2.28.0
-- MPI Support is ON
-- Enable FindMPI:  ON
-- Found MPI_CXX: /home/pearson/software/openmpi-4.0.5/lib/libmpi.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- BLT MPI Compile Flags:  $<$<NOT:$<COMPILE_LANGUAGE:CUDA>>:-pthread>;$<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=-pthread>
-- BLT MPI Include Paths:  /home/pearson/software/openmpi-4.0.5/include
-- BLT MPI Libraries:      /home/pearson/software/openmpi-4.0.5/lib/libmpi.so
-- BLT MPI Link Flags:     -Wl,-rpath -Wl,/home/pearson/software/openmpi-4.0.5/lib -Wl,--enable-new-dtags -pthread
-- MPI Executable:       /home/pearson/software/openmpi-4.0.5/bin/mpiexec
-- MPI Num Proc Flag:    -n
-- MPI Command Append:   
-- OpenMP Support is OFF
-- CUDA Support is ON
-- The CUDA compiler identification is NVIDIA 11.1.74
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "11.1") 
-- CUDA Version:       11.1
-- CUDA Compiler:      /usr/local/cuda/bin/nvcc
-- CUDA Host Compiler: /usr/bin/g++-10
-- CUDA Include Path:  /usr/local/cuda/include
-- CUDA Libraries:     /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib/x86_64-linux-gnu/librt.so
-- CUDA Compile Flags: 
-- CUDA Link Flags:    
-- CUDA Separable Compilation:  ON
-- CUDA Link with NVCC:         
-- HIP Support is OFF
-- HCC Support is OFF
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE) 
-- Sphinx support is ON
-- Failed to locate Sphinx executable (missing: SPHINX_EXECUTABLE) 
-- Valgrind support is ON
-- Failed to locate Valgrind executable (missing: VALGRIND_EXECUTABLE) 
-- Uncrustify support is ON
-- Failed to locate Uncrustify executable (missing: UNCRUSTIFY_EXECUTABLE) 
-- AStyle support is ON
-- Failed to locate AStyle executable (missing: ASTYLE_EXECUTABLE) 
-- Cppcheck support is ON
-- Failed to locate Cppcheck executable (missing: CPPCHECK_EXECUTABLE) 
-- ClangQuery support is ON
-- Failed to locate ClangQuery executable (missing: CLANGQUERY_EXECUTABLE) 
-- C Compiler family is GNU
-- Adding optional BLT definitions and compiler flags
-- Enabling all compiler warnings on all targets.
-- Fortran support disabled.
-- CMAKE_C_FLAGS flags are:    -Wall -Wextra 
-- CMAKE_CXX_FLAGS flags are:    -Wall -Wextra 
-- CMAKE_EXE_LINKER_FLAGS flags are:  
-- Google Test Support is ON
-- Google Mock Support is OFF
-- The C compiler identification is GNU 10.2.0
-- Check for working C compiler: /usr/bin/gcc-10
-- Check for working C compiler: /usr/bin/gcc-10 - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Found PythonInterp: /usr/bin/python3.8 (found version "3.8.6") 
-- MPI Enabled
-- Cuda Enabled
-- Configuring done
-- Generating done
-- Build files have been written to: /home/pearson/repos/Comb/build_debian-nvcc11-gcc10

I tried to run it with the following:

~/software/openmpi-4.0.5/bin/mpirun -n 2 bin/comb 10_10_10 -divide 2_1_1 -cuda_aware_mpi -comm enable mpi -exec enable mpi_type -memory enable cuda_device

but I get the following error:

Comb version 0.2.0
Args  bin/comb;10_10_10;-divide;2_1_1;-cuda_aware_mpi;-comm;enable;mpi;-exec;enable;mpi_type;-memory;enable;cuda_device
Started rank 0 of 2
Node deneb
Compiler "/usr/bin/g++-10"
Cuda compiler "/usr/local/cuda/bin/nvcc"
Cuda driver version 11010
Cuda runtime version 11010
GPU 0 visible undefined
Cart coords         0        0        0
Message policy cutoff 200
Post Recv using wait_all method
Post Send using wait_all method
Wait Recv using wait_all method
Wait Send using wait_all method
Num cycles          5
Num vars            1
ghost_widths        1        1        1
sizes              10       10       10
divisions           2        1        1
periodic            0        0        0
division map
map                 0        0        0
map                 5       10       10
map                10                  
Starting test memcpy seq dst Host src Host
Starting test Comm mock Mesh seq Host Buffers seq Host seq Host
Starting test Comm mock Mesh seq Host Buffers mpi_type Host mpi_type Host
comb: /home/pearson/repos/Comb/include/comm_pol_mock.hpp:948: void detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::Irecv(detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::context_type&, detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::communicator_type&, detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::message_type**, IdxT, detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::request_type*): Assertion `buf != nullptr' failed.
comb: /home/pearson/repos/Comb/include/comm_pol_mock.hpp:948: void detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::Irecv(detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::context_type&, detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::communicator_type&, detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::message_type**, IdxT, detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::request_type*): Assertion `buf != nullptr' failed.
[deneb:23991] *** Process received signal ***
[deneb:23991] Signal: Aborted (6)
[deneb:23991] Signal code:  (-6)
[deneb:23992] *** Process received signal ***
[deneb:23992] Signal: Aborted (6)
[deneb:23992] Signal code:  (-6)
[deneb:23991] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140)[0x7ffb77f91140]
[deneb:23992] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140)[0x7fd0a2014140]
[deneb:23992] [ 1] [deneb:23991] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x141)[0x7ffb77ac6db1]
[deneb:23991] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x141)[0x7fd0a1b49db1]
[deneb:23992] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x123)[0x7ffb77ab0537]
[deneb:23991] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0x123)[0x7fd0a1b33537]
[deneb:23992] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2540f)[0x7fd0a1b3340f]
[deneb:23992] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2540f)[0x7ffb77ab040f]
[deneb:23991] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x345b2)[0x7fd0a1b425b2]
[deneb:23992] [ 5] bin/comb(+0x32b1d)[0x55baa8e21b1d]
[deneb:23992] [ 6] bin/comb(+0x43508)[0x55baa8e32508]
/lib/x86_64-linux-gnu/libc.so.6(+0x345b2)[0x7ffb77abf5b2]
[deneb:23991] [ 5] bin/comb(+0x32b1d)[0x56455c96ab1d]
[deneb:23991] [ 6] bin/comb(+0x43508)[0x56455c97b508]
[deneb:23991] [ 7] bin/comb(+0x468c6)[0x56455c97e8c6]
[deneb:23991] [ 8] bin/comb(+0x5cdc5)[0x56455c994dc5]
[deneb:23991] [ 9] bin/comb(+0x63303)[0x56455c99b303]
[deneb:23991] [10] bin/comb(+0x635a7)[0x56455c99b5a7]
[deneb:23991] [11] bin/comb(+0xe184)[0x56455c946184]
[deneb:23991] [12] [deneb:23992] [ 7] bin/comb(+0x468c6)[0x55baa8e358c6]
[deneb:23992] [ 8] bin/comb(+0x5cdc5)[0x55baa8e4bdc5]
[deneb:23992] [ 9] bin/comb(+0x63303)[0x55baa8e52303]
[deneb:23992] [10] bin/comb(+0x635a7)[0x55baa8e525a7]
[deneb:23992] [11] bin/comb(+0xe184)[0x55baa8dfd184]
[deneb:23992] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7fd0a1b34cca]
[deneb:23992] [13] bin/comb(+0xf41a)[0x55baa8dfe41a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7ffb77ab1cca]
[deneb:23991] [13] bin/comb(+0xf41a)[0x56455c94741a]
[deneb:23991] *** End of error message ***
[deneb:23992] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node deneb exited on signal 6 (Aborted).
--------------------------------------------------------------------------

I also managed to run the focused tests:

cd <build>/bin
../../scripts/run_tests.bash 1 ../../scripts/focused_tests.bash

which appears to have worked with the following output:

mpirun -np 1 /home/pearson/repos/Comb/build_debian-nvcc11-gcc10/bin/comb -comm post_recv wait_any -comm post_send wait_any -comm wait_recv wait_any -comm wait_send wait_any 100_100_100 -divide 1_1_1 -periodic 1_1_1 -ghost 1_1_1 -vars 3 -cycles 25 -comm cutoff 250 -omp_threads 10 -exec disable seq -exec enable cuda -memory disable host -memory enable cuda_managed -comm enable mock -comm enable mpi
Comb version 0.2.0
Args  /home/pearson/repos/Comb/build_debian-nvcc11-gcc10/bin/comb;-comm;post_recv;wait_any;-comm;post_send;wait_any;-comm;wait_recv;wait_any;-comm;wait_send;wait_any;100_100_100;-divide;1_1_1;-periodic;1_1_1;-ghost;1_1_1;-vars;3;-cycles;25;-comm;cutoff;250;-omp_threads;10;-exec;disable;seq;-exec;enable;cuda;-memory;disable;host;-memory;enable;cuda_managed;-comm;enable;mock;-comm;enable;mpi
Started rank 0 of 1
Node deneb
Compiler "/usr/bin/g++-10"
Cuda compiler "/usr/local/cuda/bin/nvcc"
Cuda driver version 11010
Cuda runtime version 11010
GPU 0 visible undefined
Not built with openmp, ignoring -omp_threads 10.
Cart coords         0        0        0
Message policy cutoff 250
Post Recv using wait_any method
Post Send using wait_any method
Wait Recv using wait_any method
Wait Send using wait_any method
Num cycles         25
Num vars            3
ghost_widths        1        1        1
sizes             100      100      100
divisions           1        1        1
periodic            1        1        1
division map
map                 0        0        0
map               100      100      100
Starting test memcpy cuda dst Managed src HostPinned
Starting test memcpy cuda dst Managed src Device
Starting test Comm mock Mesh cuda Managed Buffers cuda HostPinned cuda HostPinned
Starting test Comm mpi Mesh cuda Managed Buffers cuda HostPinned cuda HostPinned
done

real    0m1.475s
user    0m1.244s
sys     0m0.146s

Is device memory + MPI + MPI_Type a supported configuration at this time? If so, any advice?

Thanks!

Can't use arbitrary memory spaces with buffers

You can't use arbitrary memory spaces with buffers; currently the code only allows using one or two memory spaces, depending on whether the execution space is host or device. For example, with cuda you can only use device or pinned memory for communication buffers.

Add file input

Add some kind of input file to specify the tests to run.
I'm not sure if this should be in addition to or instead of command line input.

CMake failing without errors?

Apologies if this is a CMake n00b problem.

I've been trying to configure on my system with this script (based on the ones in your source tree):

#!/bin/bash

BUILD_SUFFIX=rndv

rm -rf build_${BUILD_SUFFIX} 2>/dev/null
mkdir build_${BUILD_SUFFIX}
cd build_${BUILD_SUFFIX}

PATH=/path/to/mpi/bin:$PATH \
    LD_LIBRARY_PATH=/path/to/mpi/lib:$LD_LIBRARY_PATH \
    cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DENABLE_OPENMP=OFF \
    -DENABLE_CUDA=OFF \
    -DCMAKE_C_FLAGS=-pthread \
    -DCMAKE_CXX_FLAGS=-pthread \
    -DCMAKE_EXE_LINKER_FLAGS=-pthread \
    -C ../host-configs/ansklnvram01/gcc_${BUILD_SUFFIX}.cmake \
    -DCMAKE_INSTALL_PREFIX=../install_${BUILD_SUFFIX} \
    "$@" \
    -Wno-dev \
    ..

When I didn't include the -pthread flags, I kept getting errors about not being able to find the pthread symbols.

Now my output looks like this:

$ ./scripts/my-builds/gcc_rndv.sh
loading initial cache file ../host-configs/ansklnvram01/gcc_rndv.cmake
CMake Error: Error processing file: ../host-configs/ansklnvram01/gcc_rndv.cmake
-- The CXX compiler identification is GNU 7.3.1
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/c++
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /nfs/site/home/wbland/home/tools/x86/bin/git (found version "2.19.2")
-- Git Support is ON
-- Git Executable: /nfs/site/home/wbland/home/tools/x86/bin/git
-- Git Version: 2.19.2
-- MPI Support is ON
-- Found MPI_CXX: /path/to/mpi/lib/libmpicxx.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- MPI C Compile Flags:
-- MPI C Include Path:
-- MPI C Link Flags:
-- MPI C Libraries:
-- MPI CXX Compile Flags:
-- MPI CXX Include Path:  /path/to/mpi/include;/path/to/mpi/include
-- MPI CXX Link Flags:    -Wl,-rpath -Wl,/path/to/mpi/lib -Wl,--enable-new-dtags -L/usr/lib64 -L/path/to/mpi/lib
-- MPI CXX Libraries:     /path/to/mpi/lib/libmpi.so
-- MPI Executable:       /path/to/mpi/bin/mpiexec
-- MPI Num Proc Flag:    -n
-- MPI Command Append:
-- CUDA Support is OFF
-- ROCM Support is OFF
-- Found Doxygen: /usr/bin/doxygen (found version "1.8.5") found components:  doxygen dot
-- Sphinx support is ON
-- Failed to locate Sphinx executable (missing: SPHINX_EXECUTABLE)
-- Valgrind support is ON
-- Found Valgrind: /opt/rh/devtoolset-7/root/usr/bin/valgrind
-- Uncrustify support is ON
-- Failed to locate Uncrustify executable (missing: UNCRUSTIFY_EXECUTABLE)
-- AStyle support is ON
-- Failed to locate AStyle executable (missing: ASTYLE_EXECUTABLE)
-- Cppcheck support is ON
-- Failed to locate Cppcheck executable (missing: CPPCHECK_EXECUTABLE)
-- C Compiler family is GNU
-- OpenMP Support is OFF
-- Adding optional BLT definitions and compiler flags
-- Standard C++11 selected
-- Enabling all compiler warnings on all targets.
-- Fortran support disabled.
-- CMAKE_C_FLAGS flags are:  -pthread  -Wall -Wextra
-- CMAKE_CXX_FLAGS flags are:  -pthread     -Wall -Wextra
-- CMAKE_EXE_LINKER_FLAGS flags are:  -pthread
-- Google Test Support is ON
-- Google Mock Support is OFF
-- The C compiler identification is GNU 7.3.1
-- Check for working C compiler: /opt/rh/devtoolset-7/root/usr/bin/cc
-- Check for working C compiler: /opt/rh/devtoolset-7/root/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Found PythonInterp: /usr/bin/python (found version "2.7.5")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- MPI Enabled
-- Configuring incomplete, errors occurred!
See also "/path/to/Comb/build_rndv/CMakeFiles/CMakeOutput.log".

So it's telling me there are errors, but I don't actually see any errors, and nothing is showing up in that log.

I'm assuming this is probably caused by me manually requiring the pthreads flag, but I'd love it if you could give me a pointer where to look.

identifier "cudaGraphExecKernelNodeSetParams" is undefined

Hi,
I want to compile the CUDA version, and I ran into some trouble during make.

I have changed the scripts ./scripts/ubuntu-builds/ubuntu_nvcc10_gcc8.sh and ./host-configs/ubuntu-builds/nvcc_gcc_X.cmake according to my system environment. One thing worth mentioning is that I added -std=c++11 to COMB_NVCC_FLAGS in nvcc_gcc_X.cmake; otherwise, I get the errors below.

/app/compiler/gcc/4.9.4/include/c++/4.9.4/bits/c++0x_warning.h:32:2: error:#error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
 #error This file requires compiler and library support for the \
  ^
make[2]: *** [CMakeFiles/comb.dir/src/comb.cpp.o] error 1
make[1]: *** [CMakeFiles/comb.dir/all] error 2

After that, I ran the script and got the output below.

loading initial cache file ../host-configs/ubuntu-builds/nvcc_gcc_X.cmake
-- The CXX compiler identification is GNU 4.9.4
-- Check for working CXX compiler: /app/compiler/gcc/4.9.4/bin/g++
-- Check for working CXX compiler: /app/compiler/gcc/4.9.4/bin/g++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- BLT Version: 0.3.0
-- CMake Version: 3.10.2
-- CMake Executable: /GPUFS/app_GPU/application/cmake/3.10.2/bin/cmake
-- Found Git: /usr/bin/git (found version "1.8.3.1")
-- Git Support is ON
-- Git Executable: /usr/bin/git
-- Git Version: 1.8.3.1
-- MPI Support is ON
-- Enable FindMPI:  ON
-- Found MPI_CXX: /app/compiler/intel/15.0.1/impi/5.0.2.044/intel64/lib/libmpicxx.so (found version "3.0")
-- Found MPI: TRUE (found version "3.0")
-- BLT MPI Compile Flags:
-- BLT MPI Include Paths:  /app/compiler/intel/15.0.1/impi/5.0.2.044/intel64/include
-- BLT MPI Libraries:      /app/compiler/intel/15.0.1/impi/5.0.2.044/intel64/lib/libmpicxx.so;/app/compiler/intel/15.0.1/impi/5.0.2.044/intel64/lib/libmpifor
t.so;/app/compiler/intel/15.0.1/impi/5.0.2.044/intel64/lib/release_mt/libmpi.so;/app/compiler/intel/15.0.1/impi/5.0.2.044/intel64/lib/libmpigi.a;/lib64/libdl
.so;/lib64/librt.so;/lib64/libpthread.so
-- BLT MPI Link Flags:     -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker /app/compiler/intel/15.0.1//impi/5.0.2.044/intel64/lib/release_mt -Xlinker -r
path -Xlinker /app/compiler/intel/15.0.1//impi/5.0.2.044/intel64/lib -Xlinker -rpath -Xlinker /opt/intel/mpi-rt/5.0/intel64/lib/release_mt -Xlinker -rpath -X
linker /opt/intel/mpi-rt/5.0/intel64/lib
-- MPI Executable:       /app/compiler/intel/15.0.1/impi/5.0.2.044/intel64/bin/mpiexec
-- MPI Num Proc Flag:    -n
-- MPI Command Append:
-- OpenMP Support is ON
-- Found OpenMP_CXX: -fopenmp (found version "4.0")
-- Found OpenMP: TRUE (found version "4.0")
-- OpenMP Compile Flags: $<$<NOT:$<COMPILE_LANGUAGE:CUDA>>:-fopenmp>;$<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=-fopenmp>
-- OpenMP Link Flags:    -fopenmp
-- CUDA Support is ON
-- The CUDA compiler identification is NVIDIA 10.0.130
-- Check for working CUDA compiler: /GPUFS/app_GPU/compiler/CUDA/10.0.1/bin/nvcc
-- Check for working CUDA compiler: /GPUFS/app_GPU/compiler/CUDA/10.0.1/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /GPUFS/app_GPU/compiler/CUDA/10.0.1 (found version "10.0")
-- CUDA Version:       10.0
-- CUDA Compiler:      /GPUFS/app_GPU/compiler/CUDA/10.0.1/bin/nvcc
-- CUDA Host Compiler: /app/compiler/gcc/4.9.4/bin/g++
-- CUDA Include Path:  /GPUFS/app_GPU/compiler/CUDA/10.0.1/include
-- CUDA Libraries:     /GPUFS/app_GPU/compiler/CUDA/10.0.1/lib64/libcudart_static.a;-lpthread;dl;/usr/lib64/librt.so
-- CUDA Compile Flags:
-- CUDA Link Flags:
-- CUDA Separable Compilation:  ON
-- CUDA Link with NVCC:
-- HIP Support is OFF
-- HCC Support is OFF
-- Found Doxygen: /usr/bin/doxygen (found version "1.8.5") found components:  doxygen dot
-- Sphinx support is ON
-- Failed to locate Sphinx executable (missing: SPHINX_EXECUTABLE)
-- Valgrind support is ON
-- Found Valgrind: /usr/bin/valgrind
-- Uncrustify support is ON
-- Failed to locate Uncrustify executable (missing: UNCRUSTIFY_EXECUTABLE)
-- AStyle support is ON
-- Failed to locate AStyle executable (missing: ASTYLE_EXECUTABLE)
-- Cppcheck support is ON
-- Failed to locate Cppcheck executable (missing: CPPCHECK_EXECUTABLE)
-- ClangQuery support is ON
-- Failed to locate ClangQuery executable (missing: CLANGQUERY_EXECUTABLE)
-- C Compiler family is GNU
-- Adding optional BLT definitions and compiler flags
-- Enabling all compiler warnings on all targets.
-- Fortran support disabled.
-- CMAKE_C_FLAGS flags are:    -Wall -Wextra
-- CMAKE_CXX_FLAGS flags are:    -Wall -Wextra
-- CMAKE_EXE_LINKER_FLAGS flags are:
-- Google Test Support is ON
-- Google Mock Support is OFF
-- The C compiler identification is GNU 4.9.4
-- Check for working C compiler: /app/compiler/gcc/4.9.4/bin/gcc
-- Check for working C compiler: /app/compiler/gcc/4.9.4/bin/gcc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Found PythonInterp: /usr/bin/python (found version "2.7.5")
-- MPI Enabled
-- OpenMP Enabled
-- Cuda Enabled
-- Configuring done
-- Generating done
-- Build files have been written to: /GPUFS/demo/benchmark/comb/build_ubuntu-nvcc10-gcc8

When I run make, I get these errors.

/GPUFS/demo/benchmark/comb/include/graph_launch.hpp(154): error: identifier "cudaGraphExecKernelNodeSetParams" is undefined

/GPUFS/demo/benchmark/comb/include/graph_launch.hpp(174): error: argument of type "const cudaGraphNode_t *" is incompatible with parameter of type "cudaGraphNode_t *"

2 errors detected in the compilation of "/tmp/tmpxft_0004aa35_00000000-6_comb.cpp1.ii".
make[2]: *** [CMakeFiles/comb.dir/src/comb.cpp.o] error 1
make[1]: *** [CMakeFiles/comb.dir/all] error 2
make: *** [all] error 2

I cannot find cudaGraphExecKernelNodeSetParams anywhere in the code except graph_launch.hpp(154). Is this a bug, or something I did wrong (like adding -std=c++11)?

The second error, argument of type "const cudaGraphNode_t *" is incompatible with parameter of type "cudaGraphNode_t *", also seems to be a problem in the code. How can I fix it?

Thanks for your attention. Looking forward to your reply.
