kokkos / kokkos-tools

Kokkos C++ Performance Portability Programming Ecosystem: Profiling and Debugging Tools

License: Other

Languages: C++ 99.52%, CMake 0.36%, Makefile 0.10%, DTrace 0.01%, Shell 0.01%
Topics: profiling, tools, profiler, memory-analysis, timing, kokkos, debug, debugging, snl-performance-workflow, snl-prog-models-runtimes

kokkos-tools's Introduction

Kokkos: Core Libraries

Kokkos Core implements a programming model in C++ for writing performance-portable applications targeting all major HPC platforms. To that end it provides abstractions for both parallel execution of code and data management. Kokkos is designed to target complex node architectures with N-level memory hierarchies and multiple types of execution resources. It can currently use CUDA, HIP, SYCL, HPX, OpenMP, and C++ threads as backend programming models, with several other backends in development.
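As a minimal illustration of those two abstractions (a sketch, not taken from the Kokkos documentation): a View manages data in a backend-appropriate memory space, and parallel_for executes a kernel on the default execution space.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Data management: a View allocates in the default memory space
    // (host, CUDA, HIP, ... depending on the enabled backend).
    Kokkos::View<double*> x("x", 1000);

    // Parallel execution: the same lambda runs on whichever backend
    // Kokkos was built with.
    Kokkos::parallel_for("fill", x.extent(0), KOKKOS_LAMBDA(const int i) {
      x(i) = 2.0 * i;
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
}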

Kokkos Core is part of the Kokkos C++ Performance Portability Programming Ecosystem.

Kokkos is a Linux Foundation project.

Learning about Kokkos

To start learning about Kokkos, see the documentation at kokkos.org/kokkos-core-wiki/.

Obtaining Kokkos

The latest release of Kokkos can be obtained from the GitHub releases page.

The current release is 4.3.01.

curl -OJ -L https://github.com/kokkos/kokkos/archive/refs/tags/4.3.01.tar.gz
# Or with wget
wget https://github.com/kokkos/kokkos/archive/refs/tags/4.3.01.tar.gz

To clone the latest development version of Kokkos from GitHub:

git clone -b develop https://github.com/kokkos/kokkos.git

Building Kokkos

To build Kokkos, you will need to have a C++ compiler that supports C++17 or later. All requirements including minimum and primary tested compiler versions can be found here.

Building and installation instructions are described here.

You can also install Kokkos using Spack: spack install kokkos. Available configuration options can be displayed using spack info kokkos.

For the complete documentation: kokkos.org/kokkos-core-wiki/

Support

For questions, find us on Slack (https://kokkosteam.slack.com) or open a GitHub issue.

For non-public questions send an email to: crtrott(at)sandia.gov

Contributing

Please see this page for details on how to contribute.

Citing Kokkos

Please see the following page.

License

Under the terms of Contract DE-NA0003525 with NTESS, the U.S. Government retains certain rights in this software.

The full license statement used in all headers is available here or here.

kokkos-tools's People

Contributors: adamsimpson, anagainaru, anjohan, aprokop, blegouix, crtrott, csiefer2, cwpearson, dalg24, davidpoliakoff, davidpoliakoff-backup, delpinux, dholladay00, etphipp, ibaned, jrmadsen, khuck, masterleinad, ndellingwood, nmhamster, rombur, romintomasetti, stanmoore1, vbrunini, vlkale, vsurjadidjaja, wcohen, winklerf-zih, zfrye-llnl, zhangchonglin


kokkos-tools's Issues

Warning in roctx-connector

Using ROCm 5.2.0, compiling the roctx connector gives a warning:

g++ -shared -fPIC -O3 -g -I/opt/rocm-5.2.0/roctracer/include -I/autofs/nccs-svm1_home1/jjhu/crusher/kokkos-tools/src/kokkos-tools/profiling/roctx-connector/ -L/opt/rocm-5.2.0/lib \
	-o kp_roctx_connector.so /autofs/nccs-svm1_home1/jjhu/crusher/kokkos-tools/src/kokkos-tools/profiling/roctx-connector/kp_roctx_connector.cpp -lroctx64
In file included from /autofs/nccs-svm1_home1/jjhu/crusher/kokkos-tools/src/kokkos-tools/profiling/roctx-connector/kp_roctx_connector.cpp:43:0:
/opt/rocm-5.2.0/roctracer/include/roctx.h:26:119: note: #pragma message: This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with roctracer
 #pragma message("This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with roctracer")

kp_space_time_stack Assertion `it != alloc_set.end()' failed

@ibaned I'm getting the following error when trying to use space_time_stack, any ideas? I just updated to the latest version of kokkos-tools, and I'm using the latest master version of kokkos.

lmp_kokkos_phi: kp_space_time_stack.cpp:339: void <unnamed>::Allocations::deallocate(std::basic_string<char, std::char_traits<char>, std::allocator<char>> &&, void *, unsigned long, <unnamed>::StackNode *): Assertion `it != alloc_set.end()' failed.

Add profiling for atomics

I'd like to know how much time is spent in atomic operations. Is there currently a way to discover this using the Kokkos profiling tools, and if not could this be added?

Enhancement: Set max file size for kernel logger

I need to run the Kokkos kernel logger on hundreds of thousands of MPI ranks for many hours to debug an issue. I only care about the last few kernels, i.e. the ones running when the code stops. It would be nice if there were an option for the kernel logger to overwrite the output files from the beginning once they grow past some threshold. Otherwise too much data could be generated.
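A minimal sketch of the requested wrap-around behavior (hypothetical helper, not part of the current tool); stale bytes past the wrap point would remain until overwritten, which is acceptable when only the tail matters:

#include <cstdio>

// Hypothetical: append a log line, rewinding to the start of the file
// once it exceeds max_bytes so only the most recent kernels are kept.
void log_line(std::FILE* f, const char* line, long max_bytes) {
  if (std::ftell(f) > max_bytes) std::rewind(f);  // wrap to the top
  std::fputs(line, f);
  std::fflush(f);  // keep the file usable if the run is killed
}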

Output memory files incrementally instead of all at once on Kokkos finalize

If the code crashes while profiling, e.g. due to running out of memory, there is zero output from the memory tools, which doesn't help debugging. It would be nice if the tools could write to the files incrementally instead of all at once at the end. This could also reduce the memory overhead of storing all of the memory events.
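A sketch of the incremental approach (hypothetical writer, assuming it is called from each memory hook):

#include <cstdint>
#include <cstdio>

static std::FILE* out = nullptr;

// Hypothetical: emit each event as it happens instead of buffering until
// kokkosp_finalize_library(), so a crash still leaves a usable trace.
void write_event(const char* kind, const void* ptr, std::uint64_t size) {
  if (!out) out = std::fopen("memory_events.dat", "w");
  std::fprintf(out, "%s %p %llu\n", kind, ptr,
               static_cast<unsigned long long>(size));
  std::fflush(out);  // the key part: flush per event, not at finalize
}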

Memory report wrong for Macs

On my Mac laptop, the Kokkos memory profiling values from getrusage are off by a factor 1024. This is because getrusage uses units of bytes on Mac but kilobytes on Linux.
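A sketch of the normalization fix (getrusage and ru_maxrss are POSIX, and the units really do differ between the two platforms):

#include <sys/resource.h>

// ru_maxrss is reported in bytes on macOS but kilobytes on Linux, so
// normalize to kilobytes before reporting.
long hwm_kilobytes() {
  struct rusage usage;
  getrusage(RUSAGE_SELF, &usage);
#ifdef __APPLE__
  return usage.ru_maxrss / 1024;  // bytes -> kB
#else
  return usage.ru_maxrss;         // already kB
#endif
}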

Document the hooks

Create a wiki page describing the key hooks that the profiling library can define, what their arguments are, and when they are called.
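For reference, the central hooks look roughly like this (signatures as commonly used by the tools in this repo; the exact argument lists should be confirmed against the interface version):

#include <cstdint>

extern "C" void kokkosp_init_library(const int loadSeq,
                                     const uint64_t interfaceVer,
                                     const uint32_t devInfoCount,
                                     void* deviceInfo);
extern "C" void kokkosp_finalize_library();

// Called around each parallel dispatch; kID correlates begin/end pairs.
extern "C" void kokkosp_begin_parallel_for(const char* name,
                                           const uint32_t devID,
                                           uint64_t* kID);
extern "C" void kokkosp_end_parallel_for(const uint64_t kID);

// User-defined regions pushed/popped via the Kokkos::Profiling API.
extern "C" void kokkosp_push_profile_region(const char* name);
extern "C" void kokkosp_pop_profile_region();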

MPI imbalance documentation question

On https://github.com/kokkos/kokkos-tools/wiki/Space-Time-Stack
the documentation says
"The second column is the imbalance across MPI ranks, defined as the maximum time consumed by the kernel in any MPI rank divided by the average time consumed by the kernel over all MPI ranks."

That means the number reported should always be one or greater. But the values shown in the example are percentages all less than 100%.

How should one interpret the MPI imbalance field? For example, does 100% mean perfectly balanced, or that some process takes twice as long as the average?
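One reading consistent with the sub-100% values in the example (an assumption, not confirmed by the wiki): the tool may report (max/avg − 1) as a percentage. Then max = 1.2 s with avg = 1.0 s gives 20%, 0% means perfectly balanced, and 100% means the slowest rank takes exactly twice the average.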

Add times to breakdown in space-time-stack

@ibaned it would be really nice if space-time-stack had actual times in the breakdown instead of just relative percentages plus the total time. This would make it much easier to compare different runs that have a different total time. Right now I have to post-process the results by taking the total time and multiplying by all the percentages to get the breakdown times.

`get_current_dir_name` not supported on OSX

get_current_dir_name() is not supported on osx. Would/should something like:

  #include <unistd.h>

  #define CWD_MAX 1024
  char cwd[CWD_MAX];
  if (getcwd(cwd, CWD_MAX) != NULL)
    printf("KokkosP: Kernel timing written to %s/%s\n", cwd, fileOutput);

be preferred for kp_kernel_timer.cpp?

(As an aside, it would be nice if there were an option to write the file in plain text, to avoid the need for kp_reader and make for an easier workflow on simple projects.)

Add sync/modify to Kokkos memory events

It would be nice if the memory events output could include sync/modify between different memory spaces, since this is a memory event. Right now the tool seems to only include allocate and deallocate.
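For what it's worth, the profiling interface does expose deep-copy hooks where such transfers could be recorded (a sketch; the signatures should be checked against the interface version):

#include <cstdint>

// SpaceHandle as defined by the KokkosP interface: a fixed-size space name.
struct SpaceHandle { char name[64]; };

extern "C" void kokkosp_begin_deep_copy(
    SpaceHandle dst_space, const char* dst_name, const void* dst_ptr,
    SpaceHandle src_space, const char* src_name, const void* src_ptr,
    uint64_t size);
extern "C" void kokkosp_end_deep_copy();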

Tool for catching common anti-patterns at compile-time

This is a feature request to provide a tool for finding/fixing common anti-patterns for GPU development.

For example: using class/structure member variables/methods within device lambda functions.

This supports Sierra (application) development.

This tool would be extremely helpful as typical (Sierra) workflow is to develop an increment on the CPU for faster compile/run turnaround and then test the increment on the GPU. Issues such as accessing through the implicit this pointer are difficult to track down in larger applications as the error may not be reported with the offending kernel.
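A hypothetical example of the anti-pattern described above: inside a device lambda, a bare member name is really this->member, so the [=] capture copies the host this pointer, which the GPU then dereferences.

#include <Kokkos_Core.hpp>

struct Integrator {
  double dt;
  Kokkos::View<double*> x;

  void bad_step() {
    // Anti-pattern: "dt" and "x" are this->dt and this->x; the device
    // dereferences the host "this" pointer, often failing far from here.
    Kokkos::parallel_for("step", x.extent(0), KOKKOS_LAMBDA(const int i) {
      x(i) += dt;
    });
  }

  void good_step() {
    // Fix: copy members to locals so the lambda captures values, not this.
    auto x_local = x;
    const double dt_local = dt;
    Kokkos::parallel_for("step", x_local.extent(0),
                         KOKKOS_LAMBDA(const int i) { x_local(i) += dt_local; });
  }
};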

Tuning Interface revision

My sketch of how I think this could look after the changes. Note how Tools::VariableValue is now only used in places where I actually semantically have to associate values with some id in the tool.

#include <Kokkos_Core.hpp>
#include <Kokkos_Macros.hpp>
#include <cmath>
#include <array>
#include <mpi.h>
#include <chrono>
#include <thread>
#include <iostream>
#include <vector>
constexpr const int num_iterations = 1000;
constexpr const float penalty = 0.1f;
void sleep_difference(int actual, int guess){
  std::this_thread::sleep_for(std::chrono::milliseconds(int(penalty * std::abs(guess-actual))));
}

// This is a simple tuning example, in which we have a function that takes two values and
// sleeps for their difference. The tool is given one value, and asked to guess the other
//
// The correct answer is to guess the same value the tool was given in the context

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  Kokkos::initialize(argc, argv);
  // *Here we want to declare a context variable, something a tool might tune using
  //

  Kokkos::Tools::VariableInfo context_info;
  // ** now we want to let the tool know about the semantics of that variable
  // *** it's an integer
  context_info.type          = kokkos_value_integer;
  // *** it's 'interval', roughly meaning that subtracting two values makes sense
  context_info.category      = kokkos_value_interval;
  // *** and the candidate values are in a range, not a set 
  context_info.valueQuantity = kokkos_value_range;
  Kokkos::Tools::SetOrRange value_range;

  // ** this is just the earlier "range" construction
  // ** the last two values are bools representing whether 
  // ** the range is open (endpoints not included) or closed (endpoints included)
  // ** Why would this need the context_variable_id?
  value_range.range = Kokkos::Tools::ValueRange{
      /*context_variable_id,*/ int64_t(0), int64_t(num_iterations),
      int64_t(1), false, false};
  // ** here we actually declare it to the tool
  size_t context_variable_id = Kokkos::Tools::declareVariableType("target_value",
                                         context_info, value_range);


  // ** its semantics exactly mirror the tuning variable; it's an
  // ** integer interval value
  Kokkos::Tools::VariableInfo tuning_variable_info;
  tuning_variable_info.category      = kokkos_value_interval;
  tuning_variable_info.type          = kokkos_value_integer;

  // ** Here I'm setting the candidate values to be from a set for two reasons
  // ** 1) It shows the other side of this interface
  // ** 2) ... the prototype tool doesn't support guessing from ranges yet
  tuning_variable_info.valueQuantity = kokkos_value_set;

  std::vector<int64_t> candidate_value_array { 0,   100, 200, 300, 400,
                                               500, 600, 700, 800, 900 };

  Kokkos::Tools::SetOrRange candidate_values;
  candidate_values.set = Kokkos::Tools::ValueSet{/*tuning_value_id,*/ 10,
                                                  candidate_value_array.data()};
  size_t tuning_value_id = Kokkos::Tools::declareTuningVariable("tuned_choice", 
                                        tuning_variable_info, candidate_values);

  {
    // * declaring a VariableValue which can hold the results
    // *   of the tuning tool
    Kokkos::Tools::VariableValue tuned_choice;
    tuned_choice.id = tuning_value_id;
    // ** Note that the default value must not crash the program
    tuned_choice.value.int_value = 0;

    // * Looping multiple times so the tool can converge
    for (int attempt = 0; attempt < 120; ++attempt) {
    for (int work_intensity = 0; work_intensity < num_iterations;
         work_intensity += 200) {
      std::cout << "Attempting a run with value "<<work_intensity<<std::endl;
      size_t contextId = Kokkos::Tools::getNewContextId();

      // *Here we tell the tool the value of the context variable
      Kokkos::Tools::VariableValue context_value = Kokkos::Tools::make_variable_value(
              context_variable_id, int64_t(work_intensity));
      Kokkos::Tools::setInputValues(
          contextId, 1, 
          &context_value);
            
      // *Now we ask the tool to give us the value it thinks will perform best
      Kokkos::Tools::getOutputValues(contextId, 1, &tuned_choice); 

      // *Call the function with those two values
      sleep_difference(work_intensity, tuned_choice.value.int_value);

      // *This call tells the tool the context is over, so it can
      // *take measurements
      Kokkos::Tools::endContext(contextId);
    }
    }
  }
  Kokkos::finalize();
  MPI_Finalize();
}

Memory in SpaceTimeStack tool seems wrong

The memory in the SpaceTimeStack tool seems wrong. The high water mark from this tool (MAX BYTES ALLOCATED) doesn't match the other Kokkos memory high water mark tools, and the percentages listed don't add up to 100%.

@ibaned

Profiling Filter Tool Ideas

This is a collection of ideas for setting the profiling filter via expressions

// start and stop based on entering region at a certain iteration
// no :ITERATION given start stop and any entry/exit
global_start [ region REGEX : ITERATION ]
global_stop [ region REGEX : ITERATION ]

// Specific Kernel Always
REGEX

// Specific Kernel Regex inside region
// ITERATION_START/STOP optional, only one number: only profile in that region instance
REGEX [ region REGION_REGEX : ITERATION_START : ITERATION_STOP ]

// Specific Kernel only if as parallel_reduce in two specific nested regions
REGEX [ type parallel_reduce ] [ region REGION_REGEX1 ] [ region REGION_REGEX2 ]

// Specific Kernel but only 1011 call
REGEX:1011

// Specific Kernel but only after sequence of other kernels
REGEX [ sequence REGEX1 REGEX2 REGEX3 ]
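As a minimal sketch of how the simplest form above (a bare REGEX) might be applied, assuming std::regex semantics for the patterns:

#include <regex>
#include <string>

// Hypothetical: decide whether a kernel is profiled under a bare-REGEX
// filter by matching the expression against the kernel name.
bool should_profile(const std::string& kernel_name,
                    const std::string& filter_expr) {
  return std::regex_search(kernel_name, std::regex(filter_expr));
}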

kp-space-time-stack aborts on vortex when more than 1 MPI rank is used

I was testing using the MueLu driver (Trilinos/packages/muelu/test/scaling/Driver.cpp) and 1 MPI rank is OK, more than one leads to an abort.

It woofs here:

if (stack_frame != &stack_root) {
  std::cerr << "Program ended before \"" << stack_frame->get_full_name()
            << "\" ended\n";
  abort();
}

The timer it complains about actually finishes before the code terminates.

[kp_kernel_timer] output format is unreadable

Running with kp_kernel_timer as the profiling library, I see that the output is unreadable:

sajid@LAPTOP-CDJT2P3R ~/p/m/b/p/deposit (develop)> KOKKOS_PROFILE_LIBRARY=/home/sajid/packages/kokkos-tools/kp_kernel_timer.so OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./deposit_profile
KokkosP: Example Library Initialized (sequence is 0, version: 20211015)
timer label                   time(s)             cout
--------------------------------------------------------------
100its of old_method          13.97732            1

KokkosP: Kernel timing written to /home/sajid/packages/minigia/build/profiling/deposit/LAPTOP-CDJT2P3R-3994.dat

sajid@LAPTOP-CDJT2P3R ~/p/m/b/p/deposit (develop)> cat LAPTOP-CDJT2P3R-3994.dat
��?-@4�100its of old_method�Xd�+@nnamed I�[;Kokkos::ScatterView::ReduceDuplicates [duplicated_rho2_dev]d
                                                                                                        m�?uceDupliZ:Kokkos::ScatterView::ResetDuplicates [duplicated_rho2_dev]d@��@etDuplicJ*Kokkos::View::initialization [] via memset�JO�? �\�I�?S3Kokkos::View::initialization [particles] via memset����?tion [pa\<Kokkos::View::initialization [particles_discards] via memset���g?nnamed IY9Kokkos::View::initialization [particles_masks] via memset��?tion [paR2Kokkos::View::initialization [rho2_dev] via memset��]?tion [rh3�Kokkos::ViewCopy-1D��>nnamed I3�Kokkos::ViewCopy-2D�(9?nnamed I<�N12deposit_impl10rho_zeroerEdX��?nnamed IQ1N12deposit_impl31sv_zyx_rho_reducer_non_periodicEd(\*#@rho_redu⏎
sajid@LAPTOP-CDJT2P3R ~/p/m/b/p/deposit (develop)>

However, I am able to use kp_kernel_timer_json to generate a valid json output file.

Provide MPI-aware versions of certain kokkos-tools

When I am running on more than 500,000 MPI ranks, I don't appreciate all the ranks printing out this half a million times to stdout:

KokkosP: Library Loaded: ./kokkos-tools/src/tools/memory-hwm/kp_hwm.so
KokkosP: Finalization of profiling library.
KokkosP: High water mark memory consumption: 420856 kB

Also opening and writing to half a million files for memory-events or memory-usage hammers the I/O system.

I know Kokkos and kokkos-tools don't know about MPI, but it would be useful if we could limit this output to MPI rank 0 and print out total min, average, and max memory consumption.
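A sketch of the requested reduction (hypothetical helper; assumes the tool may call MPI after checking it is initialized):

#include <mpi.h>
#include <cstdio>

// Hypothetical MPI-aware reporting: reduce the per-rank high water mark
// and let only rank 0 print, instead of 500,000 ranks writing to stdout.
void report_hwm(long local_kb) {
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  long min_kb = 0, max_kb = 0, sum_kb = 0;
  MPI_Reduce(&local_kb, &min_kb, 1, MPI_LONG, MPI_MIN, 0, MPI_COMM_WORLD);
  MPI_Reduce(&local_kb, &max_kb, 1, MPI_LONG, MPI_MAX, 0, MPI_COMM_WORLD);
  MPI_Reduce(&local_kb, &sum_kb, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    std::printf("KokkosP: HWM kB min/avg/max: %ld / %ld / %ld\n",
                min_kb, sum_kb / size, max_kb);
}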

Microtool to debug refcounting errors

People run into issues with refcounting, and currently they use kernel-logger to try to find out which pointer is causing that issue. We need a small tool to associate pointers with names so people can see "ah, [address] was associated with [name], that's where the refcount had problems."
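A sketch of the micro-tool's core (hypothetical; the allocate-hook signature should be checked against the interface version): record the label of every allocation so a failing pointer can be named later.

#include <cstdint>
#include <map>
#include <string>

struct SpaceHandle { char name[64]; };  // per the KokkosP interface

// Hypothetical: label every allocation so that a refcounting failure on a
// raw pointer can be mapped back to the View name it belonged to.
static std::map<const void*, std::string> g_names;

extern "C" void kokkosp_allocate_data(const SpaceHandle /*space*/,
                                      const char* label,
                                      const void* const ptr,
                                      const uint64_t /*size*/) {
  g_names[ptr] = label;
}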

kernel names in nvprof

I am wondering how to properly use the nvprof-connector, so that the kokkos kernel names show up in demangled form in the visual debugger.

kp_space_time_stack.cpp doesn't compile on BG/Q

Here is the compile error:

kp_space_time_stack.cpp:397:7: error: no matching function for call to
      'MPI_Send'
      MPI_Send(s.data(), string_size, MPI_CHAR, 0, 42, MPI_COMM_WORLD);
      ^~~~~~~~
/bgsys/drivers/V1R2M4/ppc64/comm/include/mpi.h:811:5: note: candidate function
      not viable: no known conversion from 'const value_type *'
      (aka 'const char *') to 'void *' for 1st argument
int MPI_Send(MPICH2_CONST void*, int, MPI_Datatype, int, int, MPI_Comm) ...
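A minimal workaround for pre-const-correct MPI bindings (a sketch; PR #28 may do something different):

// Older MPICH2-style bindings take void* rather than const void*:
MPI_Send(const_cast<char*>(s.data()), string_size, MPI_CHAR, 0, 42,
         MPI_COMM_WORLD);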

See PR #28

Weird (buggy?) behavior from the space-time-stack tool

I'm seeing some perplexing behavior from the space-time-stack tool. The d_sorted (particle:sorted) View in my code has no other references (reference count = 1 always) and is periodically reallocated:

    if (d_particles.dimension_0() > d_sorted.dimension_0()) {
      d_sorted = t_particle_1d("particle:sorted",d_particles.dimension_0());
    }

I can't understand how, but the d_sorted (particle:sorted) view appears twice in the space-time-stack tool output list for CUDA:

MPI RANK WITH MAX MEMORY: 0
MAX BYTES ALLOCATED: 8574610364
HOST ALLOCATIONS:
================
  19.4% particle:particles_mirror
  2.3% grid:cells_mirror
  1.2% grid:cinfo_mirror
  0.7% particle:mlist_mirror
  0.2% emit/face:tasks_mirror
  0.2% collide:vremax_mirror
  0.2% collide:remain_mirror
  0.2% thermal/grid:vector_grid_mirror
  0.2% thermal/grid:tally_mirror
CUDA ALLOCATIONS:
================
  19.4% particle:sorted
  19.4% particle:particles
  19.4% particle:sorted
  11.5% particle:plist
  2.3% grid:cells
  1.2% grid:cinfo
  0.7% particle:mlist
  0.2% emit/face:tasks
  0.2% collide:vremax
  0.2% collide:remain
  0.2% thermal/grid:vector_grid
  0.2% thermal/grid:tally

If I first destroy the d_sorted view like this:

    if (d_particles.dimension_0() > d_sorted.dimension_0()) {
      d_sorted = t_particle_1d();
      d_sorted = t_particle_1d("particle:sorted",d_particles.dimension_0());
    }

then d_sorted isn't listed twice in the CUDA output, but the overall high water mark doesn't change, and another copy of the particle:particles_mirror host view appears!

MPI RANK WITH MAX MEMORY: 0
MAX BYTES ALLOCATED: 8638309940
HOST ALLOCATIONS:
================
  19.3% particle:particles_mirror
  19.3% particle:particles_mirror
  2.3% grid:cells_mirror
  1.2% grid:cinfo_mirror
  0.7% particle:mlist_mirror
  0.2% emit/face:tasks_mirror
  0.2% collide:vremax_mirror
  0.2% collide:remain_mirror
  0.2% thermal/grid:vector_grid_mirror
  0.2% thermal/grid:tally_mirror
CUDA ALLOCATIONS:
================
  19.3% particle:particles
  19.3% particle:sorted
  12.1% particle:plist
  2.3% grid:cells
  1.2% grid:cinfo
  0.7% particle:mlist
  0.2% emit/face:tasks
  0.2% collide:vremax
  0.2% collide:remain
  0.2% thermal/grid:vector_grid
  0.2% thermal/grid:tally

@ibaned any ideas?

HWM for GPU memory

If I understand #21 correctly, it looks like the HWM tool reports on CPU memory usage (not GPU).

Is there any tool available in this repo that can help me get the HWM for GPU memory used by a program?
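For reference, a per-space high-water mark can be tracked from the allocate/deallocate hooks, which covers GPU memory spaces as well as host (a sketch; hook signatures should be checked against the interface version):

#include <cstdint>
#include <map>
#include <string>

struct SpaceHandle { char name[64]; };  // per the KokkosP interface

// Hypothetical per-space accounting: unlike getrusage (host RSS only),
// summing allocations per space also captures a GPU high-water mark.
static std::map<std::string, uint64_t> g_current, g_hwm;

extern "C" void kokkosp_allocate_data(const SpaceHandle space, const char*,
                                      const void* const, const uint64_t size) {
  uint64_t& cur = g_current[space.name];
  cur += size;
  uint64_t& hwm = g_hwm[space.name];
  if (cur > hwm) hwm = cur;
}

extern "C" void kokkosp_deallocate_data(const SpaceHandle space, const char*,
                                        const void* const, const uint64_t size) {
  g_current[space.name] -= size;
}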

Space-time-stack prints out of order with MPI

@ibaned I noticed space-time-stack sometimes prints out of order when running on more than 1 MPI rank, see below. Perhaps some MPI Barriers are missing?

MAX BYTES ALLOCATED: 4222384
MPI RANK WITH MAX MEMORY: 2
ALLOCATIONS AT TIME OF HIGH WATER MARK:
  40.4% particle:particles
  23.3% grid:cells
  17.1% grid:pcells
  12.4% grid:cinfo
  1.7% comm:sbuf
  1.6% comm:sbuf
  1.6% particle:mlist
  0.8% Irregular:buf
  0.7% particle:plist


BEGIN KOKKOS PROFILING REPORT:
TOTAL TIME: 0.123602 seconds
TOP-DOWN TIME TREE:
<percent of total time> <percent MPI imbalance> <number of calls> <name> [type]
===================
|-> 35.1% 1.0% 130 N9SPARTA_NS16CollideVSSKokkosE/N9SPARTA_NS23TagCollideCollisionsOneILi0ELin1EEE [reduce]
|-> 22.5% 0.8% 190 N9SPARTA_NS12UpdateKokkosE/N9SPARTA_NS13TagUpdateMoveILi3ELi0ELin1EEE [reduce]
|-> 2.4% 1.5% 132 N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS15TagParticleSortILi0EEE [for]
|-> 1.7% 5.5% 323 Kokkos::View::initialization [for]
|-> 0.9% 4.5% 190 N9SPARTA_NS10CommKokkosE/N9SPARTA_NS23TagCommMigrateParticlesILi0EEE [for]
|-> 0.7% 0.6% 570 N9SPARTA_NS15IrregularKokkosE/N9SPARTA_NS22TagIrregularPackBufferILi0EEE [for]
|-> 0.4% 1.7% 190 N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS28TagParticleCompressReactionsE [for]
|-> 0.3% 3.2% 15 N9SPARTA_NS17ComputeTempKokkosE [reduce]
|-> 0.1% 11.4% 1 ZN9SPARTA_NS21CreateParticlesKokkos12create_localElEUliE_ [for]
|-> 0.1% 17.9% 320 "mlist_mirror"="particle:mlist" [copy]
|-> 0.1% 3.4% 131 N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS25TagParticleZero_cellcountE [for]

BOTTOM-UP TIME TREE:
<percent of total time> <percent MPI imbalance> <number of calls> <name> [type]
===================
|-> 35.1% 1.0% 130 N9SPARTA_NS16CollideVSSKokkosE/N9SPARTA_NS23TagCollideCollisionsOneILi0ELin1EEE [reduce]
|-> 22.5% 0.8% 190 N9SPARTA_NS12UpdateKokkosE/N9SPARTA_NS13TagUpdateMoveILi3ELi0ELin1EEE [reduce]
|-> 2.4% 1.5% 132 N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS15TagParticleSortILi0EEE [for]
|-> 1.7% 5.5% 323 Kokkos::View::initialization [for]
|-> 0.9% 4.5% 190 N9SPARTA_NS10CommKokkosE/N9SPARTA_NS23TagCommMigrateParticlesILi0EEE [for]
|-> 0.7% 0.6% 570 N9SPARTA_NS15IrregularKokkosE/N9SPARTA_NS22TagIrregularPackBufferILi0EEE [for]
|-> 0.4% 1.7% 190 N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS28TagParticleCompressReactionsE [for]
|-> 0.3% 3.2% 15 N9SPARTA_NS17ComputeTempKokkosE [reduce]
|-> 0.1% 11.4% 1 ZN9SPARTA_NS21CreateParticlesKokkos12create_localElEUliE_ [for]
|-> 0.1% 17.9% 320 "mlist_mirror"="particle:mlist" [copy]
|-> 0.1% 3.4% 131 N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS25TagParticleZero_cellcountE [for]

KOKKOS HOST SPACE:
===================
KOKKOS CUDA SPACE:
===================
MAX BYTES ALLOCATED: 0
MPI RANK WITH MAX MEMORY: 0
ALLOCATIONS AT TIME OF HIGH WATER MARK:

END KOKKOS PROFILING REPORT.

Profiling requires MPI be initialized

This might be a corner case. I have some tests in Trilinos that are single-node only. They do not need MPI and therefore don't call MPI_Init in main (in CMake we restrict to one MPI process). However, if Trilinos is configured/built with MPI enabled, then it seems the stacked timer profiler requires all executables to initialize MPI. This causes a bunch of unit test failures in Phalanx. An example is shown below. It would be convenient if the profiler could check whether MPI was initialized and perform the init if needed. That way I could leave the profiling flag on during unit testing. When I unset the KOKKOS_PROFILE_LIBRARY flag, the error goes away and all the tests run fine.

[rppawlo@gge Kokkos]$ mpirun -np 1 ./Phalanx_tKokkosNestedLambda.exe 
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
*** The MPI_Comm_rank() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[gge.srn.sandia.gov:23656] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[60414,1],0]
  Exit code:    1
--------------------------------------------------------------------------
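A minimal sketch of the check suggested above, using MPI_Initialized. Note that initializing MPI from inside a tool is risky (the application must not call MPI_Init twice), so this sketch falls back to rank-0 behavior instead of initializing:

#include <mpi.h>

// Hypothetical guard: behave as rank 0 when the application never
// initialized MPI, instead of calling MPI_Comm_rank and aborting.
int tool_rank() {
  int initialized = 0;
  MPI_Initialized(&initialized);
  if (!initialized) return 0;  // serial executable: act as rank 0
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  return rank;
}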

understanding MemoryUsage

Could you update the wiki to make this clear? I see

Time(s) Size(MB) HighWater(MB) HighWater-Process(MB)

in the output. Size: is that the total size of Kokkos Views? Is HighWater max(Size)? And does HighWater-Process cover Views plus malloc plus stack plus everything else?

NGA FY22: improve build system

  • Develop a CMake build system to build all the KokkosTools
  • Extend the tools' CMake build system to optionally build Caliper and Apex (i.e. translate typical Kokkos
    options such as Kokkos_ENABLE_CUDA to the respective settings for Caliper and Apex)
  • Add an option to build all tools as a single library, and add an initialization interface to that library, which uses
    the existing KokkosEventSet capability to load a specific tool in the library upon being called with the
    corresponding option (e.g. KokkosTools::initialize("space_time_stack"))

PAPI connector: per thread metrics

I noticed that the PAPI connector records the events only on the master thread. This is because PAPI performance counters are thread-local and PAPI_hl_region_begin/PAPI_hl_region_end are called from the master thread only.

It would be nice if the PAPI connector could record the performance counters on all threads. However, I'm guessing this needs changes in Kokkos itself to call the profiling hooks from inside the parallel regions.
