kokkos / kokkos-tools

Kokkos C++ Performance Portability Programming Ecosystem: Profiling and Debugging Tools

License: Other

Languages: C++ 99.52%, CMake 0.36%, Makefile 0.10%, DTrace 0.01%, Shell 0.01%
Topics: profiling, tools, profiler, memory-analysis, timing, kokkos, debug, debugging, snl-performance-workflow, snl-prog-models-runtimes

kokkos-tools's Introduction

Kokkos: Core Libraries

Kokkos Core implements a programming model in C++ for writing performance-portable applications targeting all major HPC platforms. To that end it provides abstractions for both parallel execution of code and data management. Kokkos is designed to target complex node architectures with N-level memory hierarchies and multiple types of execution resources. It can currently use CUDA, HIP, SYCL, HPX, OpenMP, and C++ threads as backend programming models, with several other backends in development.
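As a minimal illustration of those two abstractions (a sketch, not taken from the Kokkos documentation): a View manages data in a backend-appropriate memory space, and parallel_for executes a kernel on the default execution space.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Data management: a View allocates in the default memory space
    // (host, CUDA, HIP, ... depending on the enabled backend).
    Kokkos::View<double*> x("x", 1000);

    // Parallel execution: the same lambda runs on whichever backend
    // Kokkos was built with.
    Kokkos::parallel_for("fill", x.extent(0), KOKKOS_LAMBDA(const int i) {
      x(i) = 2.0 * i;
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
}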

Kokkos Core is part of the Kokkos C++ Performance Portability Programming Ecosystem.

Kokkos is a Linux Foundation project.

Learning about Kokkos

To start learning about Kokkos, see the documentation at kokkos.org/kokkos-core-wiki/.

Obtaining Kokkos

The latest release of Kokkos can be obtained from the GitHub releases page.

The current release is 4.3.01.

curl -OJ -L https://github.com/kokkos/kokkos/archive/refs/tags/4.3.01.tar.gz
# Or with wget
wget https://github.com/kokkos/kokkos/archive/refs/tags/4.3.01.tar.gz

To clone the latest development version of Kokkos from GitHub:

git clone -b develop https://github.com/kokkos/kokkos.git

Building Kokkos

To build Kokkos, you will need to have a C++ compiler that supports C++17 or later. All requirements including minimum and primary tested compiler versions can be found here.

Building and installation instructions are described here.

You can also install Kokkos using Spack: spack install kokkos. Available configuration options can be displayed using spack info kokkos.

For the complete documentation: kokkos.org/kokkos-core-wiki/

Support

For questions, find us on Slack (https://kokkosteam.slack.com) or open a GitHub issue.

For non-public questions send an email to: crtrott(at)sandia.gov

Contributing

Please see this page for details on how to contribute.

Citing Kokkos

Please see the following page.

License

Under the terms of Contract DE-NA0003525 with NTESS, the U.S. Government retains certain rights in this software.

The full license statement used in all headers is available here or here.

kokkos-tools's People

Contributors: adamsimpson, anagainaru, anjohan, aprokop, blegouix, crtrott, csiefer2, cwpearson, dalg24, davidpoliakoff, davidpoliakoff-backup, delpinux, dholladay00, etphipp, ibaned, jrmadsen, khuck, masterleinad, ndellingwood, nmhamster, rombur, romintomasetti, stanmoore1, vbrunini, vlkale, vsurjadidjaja, wcohen, winklerf-zih, zfrye-llnl, zhangchonglin


kokkos-tools's Issues

Warning in roctx-connector

Using ROCm 5.2.0, compiling the roctx connector gives a warning:

g++ -shared -fPIC -O3 -g -I/opt/rocm-5.2.0/roctracer/include -I/autofs/nccs-svm1_home1/jjhu/crusher/kokkos-tools/src/kokkos-tools/profiling/roctx-connector/ -L/opt/rocm-5.2.0/lib \
	-o kp_roctx_connector.so /autofs/nccs-svm1_home1/jjhu/crusher/kokkos-tools/src/kokkos-tools/profiling/roctx-connector/kp_roctx_connector.cpp -lroctx64
In file included from /autofs/nccs-svm1_home1/jjhu/crusher/kokkos-tools/src/kokkos-tools/profiling/roctx-connector/kp_roctx_connector.cpp:43:0:
/opt/rocm-5.2.0/roctracer/include/roctx.h:26:119: note: #pragma message: This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with roctracer
 #pragma message("This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with roctracer")

kp_space_time_stack Assertion `it != alloc_set.end()' failed

@ibaned I'm getting the following error when trying to use space_time_stack, any ideas? I just updated to the latest version of kokkos-tools, and I'm using the latest master version of kokkos.

lmp_kokkos_phi: kp_space_time_stack.cpp:339: void <unnamed>::Allocations::deallocate(std::basic_string<char, std::char_traits<char>, std::allocator<char>> &&, void *, unsigned long, <unnamed>::StackNode *): Assertion `it != alloc_set.end()' failed.

Add profiling for atomics

I'd like to know how much time is spent in atomic operations. Is there currently a way to discover this using the Kokkos profiling tools, and if not could this be added?

Enhancement: Set max file size for kernel logger

I need to run the Kokkos kernel logger on hundreds of thousands of MPI ranks for many hours to debug an issue. I only care about the last few kernels, i.e. the ones running when the code stops. It would be nice if there were an option for the kernel logger to overwrite the output files from the beginning once they grow past some threshold. Otherwise too much data could be generated.
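A minimal sketch of the requested wrap-around behavior (hypothetical helper, not part of the current tool); stale bytes past the wrap point would remain until overwritten, which is acceptable when only the tail matters:

#include <cstdio>

// Hypothetical: append a log line, rewinding to the start of the file
// once it exceeds max_bytes so only the most recent kernels are kept.
void log_line(std::FILE* f, const char* line, long max_bytes) {
  if (std::ftell(f) > max_bytes) std::rewind(f);  // wrap to the top
  std::fputs(line, f);
  std::fflush(f);  // keep the file usable if the run is killed
}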

Output memory files incrementally instead of all at once on Kokkos finalize

If the code crashes while profiling, e.g. due to running out of memory, there is zero output from the memory tools, which doesn't help debugging. It would be nice if the tools could write to the files incrementally instead of all at once at the end. This could also reduce the memory overhead of storing all of the memory events.
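A sketch of the incremental approach (hypothetical writer, assuming it is called from each memory hook):

#include <cstdint>
#include <cstdio>

static std::FILE* out = nullptr;

// Hypothetical: emit each event as it happens instead of buffering until
// kokkosp_finalize_library(), so a crash still leaves a usable trace.
void write_event(const char* kind, const void* ptr, std::uint64_t size) {
  if (!out) out = std::fopen("memory_events.dat", "w");
  std::fprintf(out, "%s %p %llu\n", kind, ptr,
               static_cast<unsigned long long>(size));
  std::fflush(out);  // the key part: flush per event, not at finalize
}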

Memory report wrong for Macs

On my Mac laptop, the Kokkos memory profiling values from getrusage are off by a factor 1024. This is because getrusage uses units of bytes on Mac but kilobytes on Linux.
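A sketch of the normalization fix (getrusage and ru_maxrss are POSIX, and the units really do differ between the two platforms):

#include <sys/resource.h>

// ru_maxrss is reported in bytes on macOS but kilobytes on Linux, so
// normalize to kilobytes before reporting.
long hwm_kilobytes() {
  struct rusage usage;
  getrusage(RUSAGE_SELF, &usage);
#ifdef __APPLE__
  return usage.ru_maxrss / 1024;  // bytes -> kB
#else
  return usage.ru_maxrss;         // already kB
#endif
}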

Document the hooks

Create a wiki page describing the key hooks that the profiling library can define, what their arguments are, and when they are called.
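For reference, the central hooks look roughly like this (signatures as commonly used by the tools in this repo; the exact argument lists should be confirmed against the interface version):

#include <cstdint>

extern "C" void kokkosp_init_library(const int loadSeq,
                                     const uint64_t interfaceVer,
                                     const uint32_t devInfoCount,
                                     void* deviceInfo);
extern "C" void kokkosp_finalize_library();

// Called around each parallel dispatch; kID correlates begin/end pairs.
extern "C" void kokkosp_begin_parallel_for(const char* name,
                                           const uint32_t devID,
                                           uint64_t* kID);
extern "C" void kokkosp_end_parallel_for(const uint64_t kID);

// User-defined regions pushed/popped via the Kokkos::Profiling API.
extern "C" void kokkosp_push_profile_region(const char* name);
extern "C" void kokkosp_pop_profile_region();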

MPI imbalance documentation question

On https://github.com/kokkos/kokkos-tools/wiki/Space-Time-Stack
the documentation says
"The second column is the imbalance across MPI ranks, defined as the maximum time consumed by the kernel in any MPI rank divided by the average time consumed by the kernel over all MPI ranks."

That means the number reported should always be one or greater. But the values shown in the example are percentages all less than 100%.

How should one interpret the MPI imbalance field? For example, does 100% mean perfectly balanced, or that some process takes twice as long as the average?
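One reading consistent with the sub-100% values in the example (an assumption, not confirmed by the wiki): the tool may report (max/avg − 1) as a percentage. Then max = 1.2 s with avg = 1.0 s gives 20%, 0% means perfectly balanced, and 100% means the slowest rank takes exactly twice the average.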

Add times to breakdown in space-time-stack

@ibaned it would be really nice if space-time-stack had actual times in the breakdown instead of just relative percentages plus the total time. This would make it much easier to compare different runs that have a different total time. Right now I have to post-process the results by taking the total time and multiplying by all the percentages to get the breakdown times.

`get_current_dir_name` not supported on OSX

get_current_dir_name() is not supported on osx. Would/should something like:

  #include <unistd.h>

  #define CWD_MAX 1024
  char cwd[CWD_MAX];
  if (getcwd(cwd, CWD_MAX) != NULL)
    printf("KokkosP: Kernel timing written to %s/%s\n", cwd, fileOutput);

be preferred for kp_kernel_timer.cpp?

(As an aside, it would be nice if there were an option to write the file in plain text, to avoid the need for kp_reader and make for an easier workflow on simple projects.)

Add sync/modify to Kokkos memory events

It would be nice if the memory events output could include sync/modify between different memory spaces, since this is a memory event. Right now the tool seems to only include allocate and deallocate.
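For what it's worth, the profiling interface does expose deep-copy hooks where such transfers could be recorded (a sketch; the signatures should be checked against the interface version):

#include <cstdint>

// SpaceHandle as defined by the KokkosP interface: a fixed-size space name.
struct SpaceHandle { char name[64]; };

extern "C" void kokkosp_begin_deep_copy(
    SpaceHandle dst_space, const char* dst_name, const void* dst_ptr,
    SpaceHandle src_space, const char* src_name, const void* src_ptr,
    uint64_t size);
extern "C" void kokkosp_end_deep_copy();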

Tool for catching common anti-patterns at compile-time

This is a feature request to provide a tool for finding/fixing common anti-patterns for GPU development.

For example: using class/structure member variables/methods within device lambda functions.

This supports Sierra (application) development.

This tool would be extremely helpful as typical (Sierra) workflow is to develop an increment on the CPU for faster compile/run turnaround and then test the increment on the GPU. Issues such as accessing through the implicit this pointer are difficult to track down in larger applications as the error may not be reported with the offending kernel.
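A hypothetical example of the anti-pattern described above: inside a device lambda, a bare member name is really this->member, so the [=] capture copies the host this pointer, which the GPU then dereferences.

#include <Kokkos_Core.hpp>

struct Integrator {
  double dt;
  Kokkos::View<double*> x;

  void bad_step() {
    // Anti-pattern: "dt" and "x" are this->dt and this->x; the device
    // dereferences the host "this" pointer, often failing far from here.
    Kokkos::parallel_for("step", x.extent(0), KOKKOS_LAMBDA(const int i) {
      x(i) += dt;
    });
  }

  void good_step() {
    // Fix: copy members to locals so the lambda captures values, not this.
    auto x_local = x;
    const double dt_local = dt;
    Kokkos::parallel_for("step", x_local.extent(0),
                         KOKKOS_LAMBDA(const int i) { x_local(i) += dt_local; });
  }
};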

Tuning Interface revision

My sketch of how I think this could look after the changes. Note how Tools::VariableValue is now only used in places where I actually semantically have to associate values with some id in the tool.

#include <Kokkos_Core.hpp>
#include <Kokkos_Macros.hpp>
#include <cmath>
#include <array>
#include <mpi.h>
#include <chrono>
#include <thread>
#include <iostream>
#include <vector>
constexpr const int num_iterations = 1000;
constexpr const float penalty = 0.1f;
void sleep_difference(int actual, int guess){
  std::this_thread::sleep_for(std::chrono::milliseconds(int(penalty * std::abs(guess-actual))));
}

// This is a simple tuning example, in which we have a function that takes two values and
// sleeps for their difference. The tool is given one value, and asked to guess the other
//
// The correct answer is to guess the same value the tool was given in the context

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  Kokkos::initialize(argc, argv);
  // *Here we want to declare a context variable, something a tool might tune using
  //

  Kokkos::Tools::VariableInfo context_info;
  // ** now we want to let the tool know about the semantics of that variable
  // *** it's an integer
  context_info.type          = kokkos_value_integer;
  // *** it's 'interval', roughly meaning that subtracting two values makes sense
  context_info.category      = kokkos_value_interval;
  // *** and the candidate values are in a range, not a set 
  context_info.valueQuantity = kokkos_value_range;
  Kokkos::Tools::SetOrRange value_range;

  // ** this is just the earlier "range" construction
  // ** the last two values are bools representing whether 
  // ** the range is open (endpoints not included) or closed (endpoints included)
  // ** Why would this need the context_variable_id?
  value_range.range = Kokkos::Tools::ValueRange{
      /*context_variable_id,*/ int64_t(0), int64_t(num_iterations),
      int64_t(1), false, false};
  // ** here we actually declare it to the tool
  size_t context_variable_id = Kokkos::Tools::declareVariableType("target_value",
                                         context_info, value_range);


  // ** its semantics exactly mirror the tuning variable; it's an
  // ** integer interval value
  Kokkos::Tools::VariableInfo tuning_variable_info;
  tuning_variable_info.category      = kokkos_value_interval;
  tuning_variable_info.type          = kokkos_value_integer;

  // ** Here I'm setting the candidate values to be from a set for two reasons
  // ** 1) It shows the other side of this interface
  // ** 2) ... the prototype tool doesn't support guessing from ranges yet
  tuning_variable_info.valueQuantity = kokkos_value_set;

  std::vector<int64_t> candidate_value_array { 0,   100, 200, 300, 400,
                                               500, 600, 700, 800, 900 };

  Kokkos::Tools::SetOrRange candidate_values;
  candidate_values.set = Kokkos::Tools::ValueSet{/*tuning_value_id,*/ 10,
                                                  candidate_value_array.data()};
  size_t tuning_value_id = Kokkos::Tools::declareTuningVariable("tuned_choice", 
                                        tuning_variable_info, candidate_values);

  {
    // * declaring a VariableValue which can hold the results
    // *   of the tuning tool
    Kokkos::Tools::VariableValue tuned_choice;
    tuned_choice.id = tuning_value_id;
    // ** Note that the default value must not crash the program
    tuned_choice.value.int_value = 0;

    // * Looping multiple times so the tool can converge
    for (int attempt = 0; attempt < 120; ++attempt) {
    for (int work_intensity = 0; work_intensity < num_iterations;
         work_intensity += 200) {
      std::cout << "Attempting a run with value "<<work_intensity<<std::endl;
      size_t contextId = Kokkos::Tools::getNewContextId();

      // *Here we tell the tool the value of the context variable
      Kokkos::Tools::VariableValue context_value = Kokkos::Tools::make_variable_value(
              context_variable_id, int64_t(work_intensity));
      Kokkos::Tools::setInputValues(
          contextId, 1, 
          &context_value);
            
      // *Now we ask the tool to give us the value it thinks will perform best
      Kokkos::Tools::getOutputValues(contextId, 1, &tuned_choice); 

      // *Call the function with those two values
      sleep_difference(work_intensity, tuned_choice.value.int_value);

      // *This call tells the tool the context is over, so it can
      // *take measurements
      Kokkos::Tools::endContext(contextId);
    }
    }
  }
  Kokkos::finalize();
  MPI_Finalize();
}

Memory in SpaceTimeStack tool seems wrong

The memory in the SpaceTimeStack tool seems wrong. The high water mark from this tool (MAX BYTES ALLOCATED) doesn't match the other Kokkos memory high water mark tools, and the percentages listed don't add up to 100%.

@ibaned

Profiling Filter Tool Ideas

This is a collection of ideas for setting the profiling filter via expressions

// start and stop based on entering region at a certain iteration
// no :ITERATION given start stop and any entry/exit
global_start [ region REGEX : ITERATION ]
global_stop [ region REGEX : ITERATION ]

// Specific Kernel Always
REGEX

// Specific Kernel Regex inside region
// ITERATION_START/STOP optional, only one number: only profile in that region instance
REGEX [ region REGION_REGEX : ITERATION_START : ITERATION_STOP ]

// Specific Kernel only if as parallel_reduce in two specific nested regions
REGEX [ type parallel_reduce ] [ region REGION_REGEX1 ] [ region REGION_REGEX2 ]

// Specific Kernel but only 1011 call
REGEX:1011

// Specific Kernel but only after sequence of other kernels
REGEX [ sequence REGEX1 REGEX2 REGEX3 ]
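As a minimal sketch of how the simplest form above (a bare REGEX) might be applied, assuming std::regex semantics for the patterns:

#include <regex>
#include <string>

// Hypothetical: decide whether a kernel is profiled under a bare-REGEX
// filter by matching the expression against the kernel name.
bool should_profile(const std::string& kernel_name,
                    const std::string& filter_expr) {
  return std::regex_search(kernel_name, std::regex(filter_expr));
}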

kp-space-time-stack aborts on vortex when more than 1 MPI rank is used

I was testing using the MueLu driver (Trilinos/packages/muelu/test/scaling/Driver.cpp) and 1 MPI rank is OK, more than one leads to an abort.

It woofs here:

if (stack_frame != &stack_root) {
  std::cerr << "Program ended before \"" << stack_frame->get_full_name()
            << "\" ended\n";
  abort();
}

The timer it complains about actually finishes before the code terminates.

[kp_kernel_timer] output format is unreadable

Running with kp_kernel_timer as the profiling library, I see that the output is unreadable:

sajid@LAPTOP-CDJT2P3R ~/p/m/b/p/deposit (develop)> KOKKOS_PROFILE_LIBRARY=/home/sajid/packages/kokkos-tools/kp_kernel_timer.so OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./deposit_profile
KokkosP: Example Library Initialized (sequence is 0, version: 20211015)
timer label                   time(s)             cout
--------------------------------------------------------------
100its of old_method          13.97732            1

KokkosP: Kernel timing written to /home/sajid/packages/minigia/build/profiling/deposit/LAPTOP-CDJT2P3R-3994.dat

sajid@LAPTOP-CDJT2P3R ~/p/m/b/p/deposit (develop)> cat LAPTOP-CDJT2P3R-3994.dat
��?-@4�100its of old_method�Xd�+@nnamed I�[;Kokkos::ScatterView::ReduceDuplicates [duplicated_rho2_dev]d
                                                                                                        m�?uceDupliZ:Kokkos::ScatterView::ResetDuplicates [duplicated_rho2_dev]d@��@etDuplicJ*Kokkos::View::initialization [] via memset�JO�? �\�I�?S3Kokkos::View::initialization [particles] via memset����?tion [pa\<Kokkos::View::initialization [particles_discards] via memset���g?nnamed IY9Kokkos::View::initialization [particles_masks] via memset��?tion [paR2Kokkos::View::initialization [rho2_dev] via memset��]?tion [rh3�Kokkos::ViewCopy-1D��>nnamed I3�Kokkos::ViewCopy-2D�(9?nnamed I<�N12deposit_impl10rho_zeroerEdX��?nnamed IQ1N12deposit_impl31sv_zyx_rho_reducer_non_periodicEd(\*#@rho_redu⏎
sajid@LAPTOP-CDJT2P3R ~/p/m/b/p/deposit (develop)>

However, I am able to use kp_kernel_timer_json to generate a valid json output file.

Provide MPI-aware versions of certain kokkos-tools

When I am running on more than 500,000 MPI ranks, I don't appreciate all the ranks printing out this half a million times to stdout:

KokkosP: Library Loaded: ./kokkos-tools/src/tools/memory-hwm/kp_hwm.so
KokkosP: Finalization of profiling library.
KokkosP: High water mark memory consumption: 420856 kB

Also opening and writing to half a million files for memory-events or memory-usage hammers the I/O system.

I know Kokkos and kokkos-tools don't know about MPI, but it would be useful if we could limit this output to MPI rank 0 and print out total min, average, and max memory consumption.
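A sketch of the requested reduction (hypothetical helper; assumes the tool may call MPI after checking it is initialized):

#include <mpi.h>
#include <cstdio>

// Hypothetical MPI-aware reporting: reduce the per-rank high water mark
// and let only rank 0 print, instead of 500,000 ranks writing to stdout.
void report_hwm(long local_kb) {
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  long min_kb = 0, max_kb = 0, sum_kb = 0;
  MPI_Reduce(&local_kb, &min_kb, 1, MPI_LONG, MPI_MIN, 0, MPI_COMM_WORLD);
  MPI_Reduce(&local_kb, &max_kb, 1, MPI_LONG, MPI_MAX, 0, MPI_COMM_WORLD);
  MPI_Reduce(&local_kb, &sum_kb, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    std::printf("KokkosP: HWM kB min/avg/max: %ld / %ld / %ld\n",
                min_kb, sum_kb / size, max_kb);
}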

Microtool to debug refcounting errors

People run into issues with refcounting, and currently they use kernel-logger to try to find out which pointer is causing that issue. We need a small tool to associate pointers with names so people can see "ah, [address] was associated with [name], that's where the refcount had problems."
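A sketch of the micro-tool's core (hypothetical; the allocate-hook signature should be checked against the interface version): record the label of every allocation so a failing pointer can be named later.

#include <cstdint>
#include <map>
#include <string>

struct SpaceHandle { char name[64]; };  // per the KokkosP interface

// Hypothetical: label every allocation so that a refcounting failure on a
// raw pointer can be mapped back to the View name it belonged to.
static std::map<const void*, std::string> g_names;

extern "C" void kokkosp_allocate_data(const SpaceHandle /*space*/,
                                      const char* label,
                                      const void* const ptr,
                                      const uint64_t /*size*/) {
  g_names[ptr] = label;
}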

kernel names in nvprof

I am wondering how to properly use the nvprof-connector, so that the kokkos kernel names show up in demangled form in the visual debugger.

kp_space_time_stack.cpp doesn't compile on BG/Q

Here is the compile error:

kp_space_time_stack.cpp:397:7: error: no matching function for call to
      'MPI_Send'
      MPI_Send(s.data(), string_size, MPI_CHAR, 0, 42, MPI_COMM_WORLD);
      ^~~~~~~~
/bgsys/drivers/V1R2M4/ppc64/comm/include/mpi.h:811:5: note: candidate function
      not viable: no known conversion from 'const value_type *'
      (aka 'const char *') to 'void *' for 1st argument
int MPI_Send(MPICH2_CONST void*, int, MPI_Datatype, int, int, MPI_Comm) ...
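A minimal workaround for pre-const-correct MPI bindings (a sketch; PR #28 may do something different):

// Older MPICH2-style bindings take void* rather than const void*:
MPI_Send(const_cast<char*>(s.data()), string_size, MPI_CHAR, 0, 42,
         MPI_COMM_WORLD);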

See PR #28

Weird (buggy?) behavior from the space-time-stack tool

I'm seeing some perplexing behavior from the space-time-stack tool. The d_sorted (particle:sorted) View in my code has no other references (reference count = 1 always) and is periodically reallocated:

    if (d_particles.dimension_0() > d_sorted.dimension_0()) {
      d_sorted = t_particle_1d("particle:sorted",d_particles.dimension_0());
    }

I can't understand how, but the d_sorted (particle:sorted) view appears twice in the space-time-stack tool output list for CUDA:

MPI RANK WITH MAX MEMORY: 0
MAX BYTES ALLOCATED: 8574610364
HOST ALLOCATIONS:
================
  19.4% particle:particles_mirror
  2.3% grid:cells_mirror
  1.2% grid:cinfo_mirror
  0.7% particle:mlist_mirror
  0.2% emit/face:tasks_mirror
  0.2% collide:vremax_mirror
  0.2% collide:remain_mirror
  0.2% thermal/grid:vector_grid_mirror
  0.2% thermal/grid:tally_mirror
CUDA ALLOCATIONS:
================
  19.4% particle:sorted
  19.4% particle:particles
  19.4% particle:sorted
  11.5% particle:plist
  2.3% grid:cells
  1.2% grid:cinfo
  0.7% particle:mlist
  0.2% emit/face:tasks
  0.2% collide:vremax
  0.2% collide:remain
  0.2% thermal/grid:vector_grid
  0.2% thermal/grid:tally

If I first destroy the d_sorted view like this:

    if (d_particles.dimension_0() > d_sorted.dimension_0()) {
      d_sorted = t_particle_1d();
      d_sorted = t_particle_1d("particle:sorted",d_particles.dimension_0());
    }

then d_sorted isn't listed twice in the CUDA output, but the overall high water mark doesn't change, and another copy of the particle:particles_mirror host view appears!

MPI RANK WITH MAX MEMORY: 0
MAX BYTES ALLOCATED: 8638309940
HOST ALLOCATIONS:
================
  19.3% particle:particles_mirror
  19.3% particle:particles_mirror
  2.3% grid:cells_mirror
  1.2% grid:cinfo_mirror
  0.7% particle:mlist_mirror
  0.2% emit/face:tasks_mirror
  0.2% collide:vremax_mirror
  0.2% collide:remain_mirror
  0.2% thermal/grid:vector_grid_mirror
  0.2% thermal/grid:tally_mirror
CUDA ALLOCATIONS:
================
  19.3% particle:particles
  19.3% particle:sorted
  12.1% particle:plist
  2.3% grid:cells
  1.2% grid:cinfo
  0.7% particle:mlist
  0.2% emit/face:tasks
  0.2% collide:vremax
  0.2% collide:remain
  0.2% thermal/grid:vector_grid
  0.2% thermal/grid:tally

@ibaned any ideas?

HWM for GPU memory

If I understand #21 correctly, it looks like the HWM tool reports on CPU memory usage (not GPU).

Is there any tool available in this repo that can help me get the HWM for GPU memory used by a program?
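For reference, a per-space high-water mark can be tracked from the allocate/deallocate hooks, which covers GPU memory spaces as well as host (a sketch; hook signatures should be checked against the interface version):

#include <cstdint>
#include <map>
#include <string>

struct SpaceHandle { char name[64]; };  // per the KokkosP interface

// Hypothetical per-space accounting: unlike getrusage (host RSS only),
// summing allocations per space also captures a GPU high-water mark.
static std::map<std::string, uint64_t> g_current, g_hwm;

extern "C" void kokkosp_allocate_data(const SpaceHandle space, const char*,
                                      const void* const, const uint64_t size) {
  uint64_t& cur = g_current[space.name];
  cur += size;
  uint64_t& hwm = g_hwm[space.name];
  if (cur > hwm) hwm = cur;
}

extern "C" void kokkosp_deallocate_data(const SpaceHandle space, const char*,
                                        const void* const, const uint64_t size) {
  g_current[space.name] -= size;
}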

Space-time-stack prints out of order with MPI

@ibaned I noticed space-time-stack sometimes prints out of order when running on more than 1 MPI rank, see below. Perhaps some MPI Barriers are missing?

MAX BYTES ALLOCATED: 4222384
MPI RANK WITH MAX MEMORY: 2
ALLOCATIONS AT TIME OF HIGH WATER MARK:
  40.4% particle:particles
  23.3% grid:cells
  17.1% grid:pcells
  12.4% grid:cinfo
  1.7% comm:sbuf
  1.6% comm:sbuf
  1.6% particle:mlist
  0.8% Irregular:buf
  0.7% particle:plist


BEGIN KOKKOS PROFILING REPORT:
TOTAL TIME: 0.123602 seconds
TOP-DOWN TIME TREE:
<percent of total time> <percent MPI imbalance> <number of calls> <name> [type]
===================
|-> 35.1% 1.0% 130 N9SPARTA_NS16CollideVSSKokkosE/N9SPARTA_NS23TagCollideCollisionsOneILi0ELin1EEE [reduce]
|-> 22.5% 0.8% 190 N9SPARTA_NS12UpdateKokkosE/N9SPARTA_NS13TagUpdateMoveILi3ELi0ELin1EEE [reduce]
|-> 2.4% 1.5% 132 N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS15TagParticleSortILi0EEE [for]
|-> 1.7% 5.5% 323 Kokkos::View::initialization [for]
|-> 0.9% 4.5% 190 N9SPARTA_NS10CommKokkosE/N9SPARTA_NS23TagCommMigrateParticlesILi0EEE [for]
|-> 0.7% 0.6% 570 N9SPARTA_NS15IrregularKokkosE/N9SPARTA_NS22TagIrregularPackBufferILi0EEE [for]
|-> 0.4% 1.7% 190 N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS28TagParticleCompressReactionsE [for]
|-> 0.3% 3.2% 15 N9SPARTA_NS17ComputeTempKokkosE [reduce]
|-> 0.1% 11.4% 1 ZN9SPARTA_NS21CreateParticlesKokkos12create_localElEUliE_ [for]
|-> 0.1% 17.9% 320 "mlist_mirror"="particle:mlist" [copy]
|-> 0.1% 3.4% 131 N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS25TagParticleZero_cellcountE [for]

BOTTOM-UP TIME TREE:
<percent of total time> <percent MPI imbalance> <number of calls> <name> [type]
===================
|-> 35.1% 1.0% 130 N9SPARTA_NS16CollideVSSKokkosE/N9SPARTA_NS23TagCollideCollisionsOneILi0ELin1EEE [reduce]
|-> 22.5% 0.8% 190 N9SPARTA_NS12UpdateKokkosE/N9SPARTA_NS13TagUpdateMoveILi3ELi0ELin1EEE [reduce]
|-> 2.4% 1.5% 132 N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS15TagParticleSortILi0EEE [for]
|-> 1.7% 5.5% 323 Kokkos::View::initialization [for]
|-> 0.9% 4.5% 190 N9SPARTA_NS10CommKokkosE/N9SPARTA_NS23TagCommMigrateParticlesILi0EEE [for]
|-> 0.7% 0.6% 570 N9SPARTA_NS15IrregularKokkosE/N9SPARTA_NS22TagIrregularPackBufferILi0EEE [for]
|-> 0.4% 1.7% 190 N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS28TagParticleCompressReactionsE [for]
|-> 0.3% 3.2% 15 N9SPARTA_NS17ComputeTempKokkosE [reduce]
|-> 0.1% 11.4% 1 ZN9SPARTA_NS21CreateParticlesKokkos12create_localElEUliE_ [for]
|-> 0.1% 17.9% 320 "mlist_mirror"="particle:mlist" [copy]
|-> 0.1% 3.4% 131 N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS25TagParticleZero_cellcountE [for]

KOKKOS HOST SPACE:
===================
KOKKOS CUDA SPACE:
===================
MAX BYTES ALLOCATED: 0
MPI RANK WITH MAX MEMORY: 0
ALLOCATIONS AT TIME OF HIGH WATER MARK:

END KOKKOS PROFILING REPORT.

Profiling requires MPI be initialized

This might be a corner case. I have some tests in Trilinos that are single-node only. They do not need MPI and therefore don't call MPI_Init in main (in CMake we restrict to one MPI process). However, if Trilinos is configured/built with MPI enabled, then it seems the stacked timer profiler requires all executables to initialize MPI. This causes a bunch of unit test failures in Phalanx. An example is shown below. It would be convenient if the profiler could check whether MPI was initialized and perform the init if needed. That way I could leave the profiling flag on during unit testing. When I unset the KOKKOS_PROFILE_LIBRARY flag, the error goes away and all the tests run fine.

[rppawlo@gge Kokkos]$ mpirun -np 1 ./Phalanx_tKokkosNestedLambda.exe 
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
*** The MPI_Comm_rank() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[gge.srn.sandia.gov:23656] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[60414,1],0]
  Exit code:    1
--------------------------------------------------------------------------
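A minimal sketch of the check suggested above, using MPI_Initialized. Note that initializing MPI from inside a tool is risky (the application must not call MPI_Init twice), so this sketch falls back to rank-0 behavior instead of initializing:

#include <mpi.h>

// Hypothetical guard: behave as rank 0 when the application never
// initialized MPI, instead of calling MPI_Comm_rank and aborting.
int tool_rank() {
  int initialized = 0;
  MPI_Initialized(&initialized);
  if (!initialized) return 0;  // serial executable: act as rank 0
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  return rank;
}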

understanding MemoryUsage

Could you update the wiki to make this clear? I see

Time(s) Size(MB) HighWater(MB) HighWater-Process(MB)

in the output. Size: is that the total size of Kokkos Views? Is HighWater max(Size)? And does HighWater-Process cover Views plus malloc plus stack plus everything else?

NGA FY22: improve build system

  • Develop a CMake build system to build all the KokkosTools
  • Extend the tools' CMake build system to optionally build Caliper and Apex (i.e. translate typical Kokkos
    options such as Kokkos_ENABLE_CUDA to the respective settings for Caliper and Apex)
  • Add an option to build all tools as a single library, and add an initialization interface to that library, which uses
    the existing KokkosEventSet capability to load a specific tool in the library upon being called with the
    corresponding option (e.g. KokkosTools::initialize("space_time_stack"))

PAPI connector: per thread metrics

I noticed that the PAPI connector records the events only on the master thread. This is because PAPI performance counters are thread-local and PAPI_hl_region_begin/PAPI_hl_region_end are called from the master thread only.

It would be nice if the PAPI connector could record the performance counters on all threads. However, I'm guessing this needs changes in Kokkos itself to call the profiling hooks from inside the parallel regions.
