rocm / omnitrace

Omnitrace: Application Profiling, Tracing, and Analysis

Home Page: https://rocm.docs.amd.com/projects/omnitrace/en/latest/

License: MIT License

CMake 15.10% Shell 2.62% C++ 62.35% C 12.66% Makefile 0.02% Python 4.52% Batchfile 0.02% CSS 2.32% Assembly 0.40%
binary-instrumentation cpu-profiler gpu-profiler profiling sampling-profiler tracing hardware-counters performance-analysis performance-metrics performance-monitoring

omnitrace's People

Contributors

ajanicijamd, amd-jnovotny, benrichard-amd, dependabot[bot], dgaliffiamd, feizheng10, jrmadsen, maetveis, ratamima, tbennun

omnitrace's Issues

Issues with job execution due to DYNINST_API_RT using OMNITRACE_BUILD_DYNINST=ON

I have been having trouble compiling Omnitrace (Ubuntu 22.04) with the recommended build configuration, hitting various compilation failures in the Dyninst third-party components (TBB, ELFUTILS, BOOST). I was able to get it to compile with the following configuration (yes, ELFUTILS is enabled twice, but it would not compile for me otherwise):

cmake                                                  \
    -B omnitrace-build-dyninst                         \
    -D CMAKE_INSTALL_PREFIX=/opt/omnitrace             \
    -D OMNITRACE_USE_HIP=OFF                           \
    -D OMNITRACE_USE_ROCM_SMI=OFF                      \
    -D OMNITRACE_USE_ROCTRACER=OFF                     \
    -D OMNITRACE_USE_PYTHON=ON                         \
    -D OMNITRACE_USE_OMPT=ON                           \
    -D OMNITRACE_USE_MPI_HEADERS=ON                    \
    -D OMNITRACE_BUILD_PAPI=ON                         \
    -D OMNITRACE_BUILD_LIBUNWIND=ON                    \
    -D OMNITRACE_BUILD_DYNINST=ON                      \
    -D DYNINST_BUILD_ELFUTILS=ON                       \
    -D DYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON \
    -D OMNITRACE_BUILD_EXAMPLES=ON                     \
    omnitrace-source

But all the examples fail with the following DYNINST_API_RT assertion:

% omnitrace -- ./openmp-cg 
[omnitrace][exe] 
[omnitrace][exe] command :: '/scratch/software/omnitrace/omnitrace-build-dyninst/openmp-cg'...
[omnitrace][exe] 
[omnitrace][exe] DYNINST_API_RT: /opt/omnitrace/lib/omnitrace/libdyninstAPI_RT.so.11.0.1
openmp-cg: /scratch/software/omnitrace/omnitrace-source/external/dyninst/dyninstAPI_RT/src/RTlinux.c:454: r_debugCheck: Assertion `_r_debug.r_map' failed.
Error #68 (level 0): Dyninst was unable to create the specified process
Error #68 (level 0): create process failed bootstrap
[omnitrace][exe] Failed to create process: '/scratch/software/omnitrace/omnitrace-build-dyninst/openmp-cg '
terminate called after throwing an instance of 'std::runtime_error'
  what():  Failed to create process
Aborted (core dumped)

Segfault printing available for libfabric on Crusher w/ v1.2.0

$ source sw/omnitrace-devel/share/omnitrace/setup-env.sh
$ module load craype-accel-amd-gfx90a
$ module load PrgEnv-cray
$ module load rocm
$ omnitrace --print-available pair -- /opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1
[omnitrace][exe]
[omnitrace][exe] command :: '/opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1.17.0'...
[omnitrace][exe]
[omnitrace][exe] DYNINST_API_RT: /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/omnitrace/libdyninstAPI_RT.so.11.0.1
omnitrace: /ccs/home/nicurtis/omnitrace/external/dyninst/common/src/addrtranslate-linux.C:289: Dyninst::LoadedLib* Dyninst::AddressTranslateSysV::getAOut(): Assertion `phdr_vaddr != (Address) -1' failed.
Aborted

Any way to configure width of Timemory output?

e.g.,

(screenshot: Timemory table output with the kernel-name column truncated)

The kernel name is getting ellipsed (totally a verb) before I can see the relevant part of the name. I imagine the "right" way to do this is to hook up with KokkosP to rename the kernels; however, it would be good to add (or document, if one already exists) a method to control the width of these tables.

Build from release tarball fails

wget -O omnitrace-1.7.2.tar.gz https://github.com/AMDResearch/omnitrace/archive/refs/tags/v1.7.2.tar.gz
tar xvf omnitrace-1.7.2.tar.gz
cmake -B omnitrace-build -DCMAKE_INSTALL_PREFIX=/share/modules/omnitrace/1.7.2 -DOMNITRACE_BUILD_DYNINST=ON -DDYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON omnitrace-1.7.2
...
fatal: not a git repository (or any of the parent directories): .git
-- function(omnitrace_checkout_git_submodule) failed.
CMake Error at cmake/MacroUtilities.cmake:230 (message):
  Command: "/usr/bin/git submodule update --init

                   external/dyninst"
Call Stack (most recent call first):
  cmake/Packages.cmake:263 (omnitrace_checkout_git_submodule)
  CMakeLists.txt:260 (include)

Talked to Jon, likely:

the tarballs don't have .gitmodules set up correctly to do the submodule clone.
The CMake has logic to try to pull a specific repo branch if that file isn't present, but the branch names I have in the CMake may be outdated.

Missing perfetto counter track names

perfetto_counter_track needs to use:

    using name_map_t  = std::map<uint32_t, std::vector<std::unique_ptr<std::string>>>;

instead of

    using name_map_t  = std::map<uint32_t, std::vector<std::string>>;

because the underlying C string passed to perfetto::CounterTrack (i.e., perfetto::CounterTrack{ _name.c_str() }) is occasionally invalidated when the vector is reallocated. This typically shows up in the rocm-smi GPU samples.

Omnitrace hangs during post-processing

I'm experiencing some issues when profiling a Python application. The code I'm running is the language modeling example available in the ROCmSoftwarePlatform/transformers repository. It's executed as follows:

python run_mlm.py --model_name_or_path bert-large-uncased --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --logging_steps 1 --output_dir /tmp/test-mlm-bbu --overwrite_output_dir --per_device_train_batch_size 8 --fp16 --skip_memory_metrics=True --cache_dir /tmp/bert_cache --max_steps 160 

(The first time it's executed, it will download and cache input datasets from public sources.)

One key parameter is --max_steps. In shorter executions with a smaller number of steps (<= 16), Omnitrace seems to work just fine. In longer executions where the value is higher (>= 160), Omnitrace gets stuck during post-processing.

I'm running Omnitrace with Timemory, and disabling process sampling:

OMNITRACE_USE_PROCESS_SAMPLING=OFF OMNITRACE_USE_TIMEMORY=ON python3 -m omnitrace -- run_mlm.py [...] --max_steps 160

Sample Omnitrace log available. These are the last lines I see printed to stderr (edited):

[...]
[pid=13003][tid=3][timemory/source/timemory/operations/types/finalize/merge.hpp:124@'operator()']> [wall_clock]> merging 10816 hash-aliases into existing set of 19625 hash-aliases!...
[pid=13003][tid=3][timemory/source/timemory/operations/types/finalize/merge.hpp:174@'merge']> [wall_clock]> worker is merging 1 records into 62190 records...
[pid=13003][tid=3][timemory/source/timemory/operations/types/finalize/merge.hpp:223@'merge']> wall_clock master has 62191 records...
[pid=13003][tid=3][timemory/source/timemory/operations/types/finalize/merge.hpp:318@'merge']> [wall_clock]> clearing merged storage!...
[pid=13003][tid=3][timemory/source/timemory/storage/impl_storage_true.cpp:151@'~storage']> [tim::component::wall_clock|3]> destroying storage...
[pid=13003][tid=3][timemory/source/timemory/storage/impl_storage_true.cpp:162@'~storage']> [tim::component::wall_clock|3]> merging into primary instance...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:90@'merge']> [wall_clock]> merging rhs=1 into lhs=62191...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:105@'operator()']> [wall_clock]> merging 7207 hash-ids into existing set of 10449 hash-ids!...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:124@'operator()']> [wall_clock]> merging 10816 hash-aliases into existing set of 19625 hash-aliases!...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:174@'merge']> [wall_clock]> worker is merging 1 records into 62191 records...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:223@'merge']> wall_clock master has 62192 records...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:318@'merge']> [wall_clock]> clearing merged storage!...
[pid=13003][tid=4][timemory/source/timemory/storage/impl_storage_true.cpp:151@'~storage']> [tim::component::wall_clock|4]> destroying storage
[pid=13003][tid=3][timemory/source/timemory/storage/impl_storage_true.cpp:180@'~storage']> [tim::component::wall_clock|3]> deleting graph data...
[pid=13003][tid=3][timemory/source/timemory/storage/impl_storage_true.cpp:187@'~storage']> [tim::component::wall_clock|3]> storage destroyed...
[pid=13003][tid=3][timemory/source/timemory/storage/base_storage.cpp:103@'~storage']> base::storage instance 3 deleted for tim::component::wall_clock...
[pid=13003][tid=4][timemory/source/timemory/storage/impl_storage_true.cpp:180@'~storage']> [tim::component::wall_clock|4]> deleting graph data...
[pid=13003][tid=4][timemory/source/timemory/storage/impl_storage_true.cpp:187@'~storage']> [tim::component::wall_clock|4]> storage destroyed...
[pid=13003][tid=4][timemory/source/timemory/storage/base_storage.cpp:103@'~storage']> base::storage instance 4 deleted for tim::component::wall_clock...

New bash versions don't like current setup-env.sh

https://github.com/AMDResearch/omnitrace/blob/2718596e5a6808a9278c3f6c8fddfaf977d3bcb6/cmake/Templates/setup-env.sh.in#L4

On a from-source version of Linux (running GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)), I'm seeing an issue with the current setup-env.sh where it exits with:

"/home/amd/omnitrace does not exist"

After some light debugging, I've found that this version of bash does not seem to like the code pattern:

BASEDIR=$(cd ${BASEDIR}/../.. && pwd)

Not 100% sure why (wasn't able to find any Bash 5 info on this), but:

BASEDIR=$(realpath ${BASEDIR}/../..)

seems to work instead.

CMAKE_INSTALL_RPATH_USE_LINK_PATH does not play well with "set(CMAKE_INSTALL_RPATH" commands in Packages.cmake

I ended up commenting out all of the "set(CMAKE_INSTALL_RPATH" commands, which appears to have done what I want:

$ readelf -d /ccs/home/nicurtis/sw/omnitrace-devel/lib/libomnitrace.so

Dynamic section at offset 0x2035d38 contains 45 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libgotcha.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libunwind.so.8]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libamdhip64.so.5]
 0x0000000000000001 (NEEDED)             Shared library: [libroctracer64.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdrm.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libdrm_amdgpu.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libnuma.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [librocprofiler64.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libhsa-runtime64.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [librocm_smi64.so.5]
 0x0000000000000001 (NEEDED)             Shared library: [libxpmem.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]
 0x000000000000000e (SONAME)             Library soname: [libomnitrace.so.1.3]
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN:$ORIGIN/omnitrace:]

I think that CMAKE_INSTALL_RPATH_USE_LINK_PATH isn't playing harmoniously with those.

Originally posted by @arghdos in #92 (comment)

Enabling OMPT sometimes requires setting OMP_TOOL_LIBRARIES

It has been noted that some versions of OMPT in OpenMP v5 require:

export OMP_TOOL_LIBRARIES=libomnitrace-dl.so

In later versions, OpenMP will dlsym(RTLD_NEXT, ...) and look for ompt_start_tool.
To support older OMPT implementations, omnitrace-dl should do this:

std::string _omni_omp_libs = "libomnitrace-dl.so";
const char* _omp_libs      = getenv("OMP_TOOL_LIBRARIES");
if(_omp_libs) 
    _omni_omp_libs = common::join(':', _omp_libs, "libomnitrace-dl.so");
setenv("OMP_TOOL_LIBRARIES", _omni_omp_libs.c_str(), 1);

Reformulate TRACE_COUNTER names for readability

Currently, CPU/GPU/Thread counters have a prefix like [<DESC> <#>] <NAME>, e.g. [Thread 0] Total Cycles. Perfetto appears to group by alphabetical + numeric order so this causes grouping by the <#> instead of the name, e.g.:

[Thread 0] Total Cycles (S)
[Thread 0] Total Instructions (S)
...
[Thread 2] Total Cycles (S)
[Thread 2] Total Instructions (S)

This makes it difficult to compare values between threads. An alternative scheme like:

Thread Total Cycles [0] (S)
Thread Total Cycles [1] (S)
...
Thread Total Instructions [0] (S)
Thread Total Instructions [1] (S)

is much more readable for comparison.

Segfault after instrumentation phase

Attached code segfaults

Command is

omnitrace \
--verbose \
-E 'cqmc::engine::LMYEngine<double>::get_param' \
-E 'qmcplusplus::SlaterDetBuilder::createMSDFast' \
-E 'qmcplusplus::SoaCartesianTensor<double>::SoaCartesianTensor' \
-E 'qmcplusplus::SpaceGrid::initialize_rectilinear' \
-o qmcpack.inst -- ./bin/qmcpack

Backtrace is

#0  0x00007ffff7e89080 in Dyninst::Relocation::Instrumenter::handleCondDirExits(Dyninst::Relocation::RelocBlock*, Dyninst::Relocation::RelocGraph*, instPoint*) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#1  0x00007ffff7e89ed5 in Dyninst::Relocation::Instrumenter::funcExitInstrumentation(Dyninst::Relocation::RelocBlock*, Dyninst::Relocation::RelocGraph*) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#2  0x00007ffff7e8a0a3 in Dyninst::Relocation::Instrumenter::process(Dyninst::Relocation::RelocBlock*, Dyninst::Relocation::RelocGraph*) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#3  0x00007ffff7e870e8 in Dyninst::Relocation::Transformer::processGraph(Dyninst::Relocation::RelocGraph*) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#4  0x00007ffff7e73065 in Dyninst::Relocation::CodeMover::transform(Dyninst::Relocation::Transformer&) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#5  0x00007ffff7dff98d in AddressSpace::transform(boost::shared_ptr<Dyninst::Relocation::CodeMover>) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#6  0x00007ffff7dffe4b in AddressSpace::relocateInt(std::_Rb_tree_const_iterator<func_instance*>, std::_Rb_tree_const_iterator<func_instance*>, unsigned long) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#7  0x00007ffff7e01050 in AddressSpace::relocate() ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#8  0x00007ffff7ea06c7 in Dyninst::PatchAPI::DynInstrumenter::run() ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#9  0x00007ffff7321e6f in Dyninst::PatchAPI::Patcher::run() ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libpatchAPI.so.11.0
#10 0x00007ffff7321654 in Dyninst::PatchAPI::Command::commit() ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libpatchAPI.so.11.0
#11 0x00007ffff7df8af1 in AddressSpace::patch(AddressSpace*) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#12 0x00007ffff7dcad3f in BPatch_binaryEdit::writeFile(char const*) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#13 0x000055555558906e in ?? ()
#14 0x00007ffff75c3083 in __libc_start_main (main=0x55555557bb90, argc=14, argv=0x7fffffffc728, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fffffffc718) at ../csu/libc-start.c:308
#15 0x000055555558c7fe in ?? ()

Output from the run:
out_omnitrace.txt.gz

Executable: (compiled with LLVM 15, uses OMP offload)
qmcpack.tar.gz

(This file isn't from crusher, but the same problem occurs there with the same backtrace)

Deadlock in PIConGPU when instrumenting locks

To build, follow instructions in: #145
Use binary rewrite to instrument (no exclusions needed, as boost doesn't come in because of #144)

When running, it hangs at MPI_Init with:

0x00007ffff3d0b0ec in __lll_lock_wait_private () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff3d0b0ec in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007ffff3d83810 in malloc () from /lib64/libc.so.6
#2  0x00007ffff4142d6c in operator new(unsigned long) () from /lib64/libstdc++.so.6
#3  0x00007fffde74463d in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#4  0x00007fffde702e98 in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#5  0x00007fffdda4b26f in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#6  0x00007fffde4d80e0 in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#7  0x00007fffde5760d9 in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#8  0x00007fffde578661 in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#9  0x00007fffde55ef65 in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#10 0x00007fffe9d45d48 in ucm_event_enter () at event/event.c:161
#11 0x00007fffe9d46acf in ucm_sbrk (increment=139264) at event/event.c:376
#12 0x00007ffff3d8528d in __default_morecore () from /lib64/libc.so.6
#13 0x00007ffff3d814db in sysmalloc () from /lib64/libc.so.6
#14 0x00007ffff3d82659 in _int_malloc () from /lib64/libc.so.6
#15 0x00007ffff3d84486 in calloc () from /lib64/libc.so.6
#16 0x00007fffea6a36c1 in opal_hash_table_init2 () from /share/modules/gcc-8_3_1/openmpi/5.0.0rc2-ucx1.11.2/lib/libopen-pal.so.80
#17 0x00007fffea7272c2 in mca_base_pvar_init () from /share/modules/gcc-8_3_1/openmpi/5.0.0rc2-ucx1.11.2/lib/libopen-pal.so.80
#18 0x00007fffea723f25 in mca_base_var_init () from /share/modules/gcc-8_3_1/openmpi/5.0.0rc2-ucx1.11.2/lib/libopen-pal.so.80
#19 0x00007fffea6abbd2 in opal_init_util () from /share/modules/gcc-8_3_1/openmpi/5.0.0rc2-ucx1.11.2/lib/libopen-pal.so.80
#20 0x00007ffff60c4c3f in ompi_mpi_init () from /share/modules/gcc-8_3_1/openmpi/5.0.0rc2-ucx1.11.2/lib/libmpi.so.80
#21 0x00007ffff60fe301 in PMPI_Init () from /share/modules/gcc-8_3_1/openmpi/5.0.0rc2-ucx1.11.2/lib/libmpi.so.80
#22 0x00007fffde4cec1d in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#23 0x0000000001a90d67 in ?? ()
#24 0x0000000002026010 in ?? ()
#25 0x00007fffdf5b2b60 in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#26 0x00000000023b39e0 in ?? ()
#27 0x0000000000000000 in ?? ()

Disabling OMNITRACE_TRACE_THREAD_RW_LOCKS and OMNITRACE_TRACE_THREAD_SPIN_LOCKS allows progress.

Libtbb not found

I did a binary rewrite for my executable, and tried running it via

srun -n 1 ./job.sh

in job.sh, I put

module use $OMNITRACE/share/modulefiles
module load omnitrace
omni_hip

I got this error message:

omni_hip: error while loading shared libraries: libtbb.so.2: cannot open shared object file: No such file or directory

To work around the issue, I have to add the following to job.sh before calling the executable:

export LD_LIBRARY_PATH=$OMNITRACE/lib/omnitrace:$LD_LIBRARY_PATH

Improve Dyninst Error Handling

Occasionally, Dyninst segfaults during instrumentation. Eventually we need to track down why the segfaults are happening, but in the meantime, omnitrace needs to make it easier to figure out which function is causing the segfault so it can be excluded as a workaround.

Intermediate sampling flushing

Right now, there is no way to limit the amount of sampling data stored in memory beyond setting OMNITRACE_SAMPLING_DURATION. We need to add a way to occasionally flush the data stored in memory, along with an option to configure it.

Deprecate OMNITRACE_USE_THREAD_SAMPLING setting

The configuration setting OMNITRACE_USE_THREAD_SAMPLING was originally named as such because it enables sampling in a background thread as opposed to sampling during software interrupts (enabled via OMNITRACE_USE_SAMPLING).

The problem is that the former (THREAD_SAMPLING) takes measurements at the system and process scope whereas the latter (SAMPLING) takes measurements at the thread scope.

Thus, OMNITRACE_USE_THREAD_SAMPLING will be deprecated and the new configuration option will be OMNITRACE_USE_PROCESS_SAMPLING.

While it is deprecated, if OMNITRACE_USE_THREAD_SAMPLING is specified in either the env or a config file and OMNITRACE_USE_PROCESS_SAMPLING is NOT specified, a deprecation notice will be emitted and we will use the value of OMNITRACE_USE_THREAD_SAMPLING. If both are specified, a deprecation notice will be emitted and the value of OMNITRACE_USE_THREAD_SAMPLING will be ignored.

After one or two releases, OMNITRACE_USE_THREAD_SAMPLING will be removed and the strict config setting will cause a failure if it is specified.

Unclear how/if OMNITRACE_PERFETTO_COMBINE_TRACES works

Using:

# lvals starting with $ are variables
$ENABLE                         = ON
$SAMPLE                         = OFF

# use fields
OMNITRACE_USE_PERFETTO          = $ENABLE
OMNITRACE_USE_TIMEMORY          = $ENABLE
OMNITRACE_USE_SAMPLING          = $SAMPLE
OMNITRACE_USE_THREAD_SAMPLING   = $SAMPLE
OMNITRACE_CRITICAL_TRACE        = OFF

# debug
OMNITRACE_DEBUG                 = OFF
OMNITRACE_VERBOSE               = 1

# output fields
OMNITRACE_OUTPUT_PATH           = lmp-output
OMNITRACE_OUTPUT_PREFIX         = %tag%/
OMNITRACE_TIME_OUTPUT           = OFF
OMNITRACE_USE_PID               = OFF
OMNITRACE_PERFETTO_COMBINE_TRACES = ON
OMNITRACE_CRITICAL_TRACE        = OFF

# timemory fields
OMNITRACE_PAPI_EVENTS           = 
OMNITRACE_TIMEMORY_COMPONENTS   = wall_clock trip_count
OMNITRACE_MEMORY_UNITS          = MB
OMNITRACE_TIMING_UNITS          = sec

# sampling fields
OMNITRACE_SAMPLING_FREQ         = 10

# rocm-smi fields
OMNITRACE_ROCM_SMI_DEVICES      = 0,1,2,3,4,5,6,7

With 8 ranks, I am trying to get a single unified timeline, but I still get:

> ls -latr lmp-output/lmp.inst/perfetto-trace-*
-rw-r--r-- 1 nicurtis nicurtis 124463449 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-7.proto
-rw-r--r-- 1 nicurtis nicurtis 140834113 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-4.proto
-rw-r--r-- 1 nicurtis nicurtis 176635659 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-3.proto
-rw-r--r-- 1 nicurtis nicurtis 162919265 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-2.proto
-rw-r--r-- 1 nicurtis nicurtis 167251291 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-6.proto
-rw-r--r-- 1 nicurtis nicurtis 159870315 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-0.proto
-rw-r--r-- 1 nicurtis nicurtis 169955609 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-1.proto
-rw-r--r-- 1 nicurtis nicurtis 176147761 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-5.proto

Build error in spack

I'm getting the following error when building in spack:

1 error found in build log:
     81    -- Looking for pthread_create in pthreads
     82    -- Looking for pthread_create in pthreads - not found
     83    -- Looking for pthread_create in pthread
     84    -- Looking for pthread_create in pthread - found
     85    -- Found Threads: TRUE
     86    -- hip::amdhip64 is SHARED_LIBRARY
  >> 87    CMake Error at cmake/Packages.cmake:120 (find_package):
     88      Could not find a package configuration file provided by "ROCmVersion" with
     89      any of the following names:
     90    
     91        ROCmVersionConfig.cmake
     92        rocmversion-config.cmake
     93    

If it helps, here is my spec:

[lee218@rzvernal11:spack]$ ./bin/spack spec omnitrace@main %[email protected]
==> Warning: Missing a source id for omnitrace@main
Input spec
--------------------------------
omnitrace@main%[email protected]

Concretized
--------------------------------
omnitrace@main%[email protected]~caliper~ipo~mpi+mpi_headers+ompt+papi~perfetto_tools~python+rocm~strip~tau build_type=Release arch=cray-rhel8-zen
    ^[email protected]%[email protected]~doc+ncurses+ownlibs~qt build_type=Release arch=cray-rhel8-zen
    ^[email protected]%[email protected]~ipo+openmp~stat_dysect~static build_type=RelWithDebInfo arch=cray-rhel8-zen
        ^[email protected]%[email protected]+atomic+chrono~clanglibcpp~container~context~contract~coroutine+date_time~debug~exception~fiber+filesystem~graph~graph_parallel~icu~iostreams~json~locale~log~math~mpi+multithreaded~nowide~numpy~pic~program_options~python~random~regex~serialization+shared~signals~singlethreaded~stacktrace+system~taggedlayout~test+thread+timer~type_erasure~versionedlayout~wave cxxstd=98 patches=57a8401,a440f96 visibility=hidden arch=cray-rhel8-zen
        ^[email protected]%[email protected]~bzip2~debuginfod+nls~xz arch=cray-rhel8-zen
            ^[email protected]%[email protected]+bzip2+curses+git~libunistring+libxml2+tar+xz arch=cray-rhel8-zen
                ^[email protected]%[email protected]~debug~pic+shared arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] libs=shared,static arch=cray-rhel8-zen
                ^[email protected]%[email protected]~python arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected]~pic libs=shared,static arch=cray-rhel8-zen
                    ^[email protected]%[email protected]+optimize+pic+shared patches=0d38234 arch=cray-rhel8-zen
                ^[email protected]%[email protected]~symlinks+termlib abi=none arch=cray-rhel8-zen
                ^[email protected]%[email protected] zip=pigz arch=cray-rhel8-zen
            ^[email protected]%[email protected]+sigsegv patches=3877ab5,fc9b616 arch=cray-rhel8-zen
        ^[email protected]%[email protected]~ipo+shared+tm build_type=RelWithDebInfo cxxstd=default patches=62ba015,ce1fb16,d62cb66 arch=cray-rhel8-zen
        ^[email protected]%[email protected]+pic arch=cray-rhel8-zen
    ^[email protected]%[email protected]~ipo build_type=Release patches=7ed1232 arch=cray-rhel8-zen
        ^[email protected]%[email protected]~ipo build_type=Release arch=cray-rhel8-zen
            ^[email protected]%[email protected]~ipo~link_llvm_dylib~llvm_dylib~openmp+rocm-device-libs build_type=Release patches=a08bbe1 arch=cray-rhel8-zen
                ^[email protected]%[email protected]+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib patches=0d98e93,4c24573,ebdca64,f2fd060 arch=cray-rhel8-zen
                    ^[email protected]%[email protected]+libbsd arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                            ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected]~docs~shared certs=mozilla arch=cray-rhel8-zen
                    ^[email protected]%[email protected]+column_metadata+dynamic_extensions+fts~functions+rtree arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected]~gmp~ipo~python build_type=RelWithDebInfo arch=cray-rhel8-zen
            ^[email protected]%[email protected]~ipo build_type=Release arch=cray-rhel8-zen
        ^[email protected]%[email protected] arch=cray-rhel8-zen
            ^[email protected]%[email protected]+glx+llvm+opengl~opengles+osmesa~strip buildtype=release default_library=shared patches=ee737d1 arch=cray-rhel8-zen
                ^[email protected]%[email protected] patches=b72914f arch=cray-rhel8-zen
                ^[email protected]%[email protected]+lex~nls arch=cray-rhel8-zen
                ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected]~block_signals~conservative_checks~cxx_exceptions~debug~debug_frame+docs~pic+tests+weak_backtrace~xz~zlib components=none libs=shared,static arch=cray-rhel8-zen
                ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                            ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected]+cpanm+shared+threads arch=cray-rhel8-zen
                        ^[email protected]%[email protected]+cxx~docs+stl patches=26090f4,b231fcc arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] patches=9c87472,aa6c50d arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                            ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
        ^[email protected]%[email protected]+image~ipo+shared build_type=Release patches=71e6851 arch=cray-rhel8-zen
            ^[email protected]%[email protected]~ipo+shared build_type=Release patches=f926273 arch=cray-rhel8-zen
                ^[email protected]%[email protected]~docs arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] patches=4e1d78c,62fc8a8,ff37630 arch=cray-rhel8-zen
                    ^[email protected]%[email protected] patches=7793209 arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
            ^[email protected]%[email protected] arch=cray-rhel8-zen
        ^[email protected]%[email protected] arch=cray-rhel8-zen
        ^[email protected]%[email protected] arch=cray-rhel8-zen
            ^[email protected]%[email protected] arch=cray-rhel8-zen
        ^[email protected]%[email protected]~ipo build_type=Release arch=cray-rhel8-zen
        ^[email protected]%[email protected] arch=cray-rhel8-zen
    ^[email protected]%[email protected]~cuda+example~infiniband~lmsensors~nvml~powercap~rapl~rocm~rocm_smi~sde+shared~static_tools arch=cray-rhel8-zen
    ^[email protected]%[email protected]~ipo+shared build_type=Release patches=8bc40cc arch=cray-rhel8-zen
    ^[email protected]%[email protected]~ipo build_type=Release patches=16754a1 arch=cray-rhel8-zen
    ^[email protected]%[email protected]~ipo build_type=Release arch=cray-rhel8-zen
        ^[email protected]%[email protected] arch=cray-rhel8-zen
            ^[email protected]%[email protected] arch=cray-rhel8-zen

Rename OMNITRACE_ROCM_SMI_DEVICES option

Currently, to specify which CPUs to sample for their frequency, you set the OMNITRACE_SAMPLING_CPUS setting; however, to specify which GPUs to sample for their power, temperature, memory usage, and utilization, you set the OMNITRACE_ROCM_SMI_DEVICES setting.

For consistency, renaming this option to OMNITRACE_SAMPLING_GPUS would make sense.
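For illustration, this is how the asymmetry reads in a config file today (values are examples only):

```
# CPU sampling and GPU (rocm-smi) sampling use inconsistent prefixes:
OMNITRACE_SAMPLING_CPUS    = all
OMNITRACE_ROCM_SMI_DEVICES = 0,1
# proposed spelling for consistency:
# OMNITRACE_SAMPLING_GPUS  = 0,1
```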

Feature: capture stack traces for API tracing

For API tracing (such as HIP and MPI traces), it would be nice to see the call stack.
My use case is tracking down where certain slow or undesirable calls, e.g. hipMalloc, come from. Commonly these functions are not large enough to be selected by the default instrumentation heuristics, and they may also come from external libraries that are not instrumented.
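Until something like this exists, one hedged workaround is to let gdb record a short backtrace at each hipMalloc call. The sketch below only writes the gdb command file; the commented invocation and ./myapp are placeholders, not part of omnitrace.

```shell
# Workaround sketch: capture a short call stack at every hipMalloc via gdb.
# './myapp' is a placeholder for the application being traced.
cat > trace_hipmalloc.gdb <<'EOF'
break hipMalloc
commands
  silent
  bt 8
  continue
end
run
EOF
# gdb -batch -x trace_hipmalloc.gdb --args ./myapp
```

This attributes each call site without recompiling, at the cost of a stop per call.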

omnitrace-python has inconsistent options vs omnitrace exe

$ omnitrace-python --help
  ...
  -a [BOOL], --include-args [BOOL]
                        Encode the argument values
  -l [BOOL], --include-line [BOOL]
                        Encode the function line number
  -f [BOOL], --include-file [BOOL]
                        Encode the function filename
  ...

omnitrace-python should have a --label option similar to the omnitrace executable, e.g. --label args line file

Compile error building v1.3.0 on Crusher

Using:

module load rocm
module load gcc
module swap PrgEnv-cray PrgEnv-gnu
module load boost
module load intel-tbb
module load cray-python
cmake -B build-omnitrace -DOMNITRACE_USE_MPI=ON -DOMNITRACE_BUILD_DYNINST=ON -DDYNINST_BUILD_{LIBIBERTY,ELFUTILS}=ON -DCMAKE_INSTALL_PREFIX=${HOME}/sw/omnitrace-devel .
In file included from /ccs/home/nicurtis/omnitrace/source/bin/omnitrace/module_function.cpp:25:
/ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.hpp: In function 'bool omnitrace_get_is_executable(std::string_view, bool)':
/ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.hpp:168:28: error: 'exists' is not a member of 'tim::filepath'
  168 |         if(!tim::filepath::exists(std::string{ _cmd }))
      |                            ^~~~~~
In file included from /ccs/home/nicurtis/omnitrace/source/bin/omnitrace/details.cpp:25:
/ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.hpp: In function 'bool omnitrace_get_is_executable(std::string_view, bool)':
/ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.hpp:168:28: error: 'exists' is not a member of 'tim::filepath'
  168 |         if(!tim::filepath::exists(std::string{ _cmd }))
      |                            ^~~~~~
In file included from /ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.cpp:23:
/ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.hpp: In function 'bool omnitrace_get_is_executable(std::string_view, bool)':
/ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.hpp:168:28: error: 'exists' is not a member of 'tim::filepath'
  168 |         if(!tim::filepath::exists(std::string{ _cmd }))

Segfault in dyninst when instrumenting boost

Due to #144, I noticed a segfault in dyninst when instrumenting boost in runtime instrumentation mode. It happens during finalization of the dyninst instrumentation:

[omnitrace][exe]  769 instrumented funcs in picongpu
[omnitrace][exe]
[omnitrace][exe] Finalizing insertion set...
[TheraC18:73504:0:73504] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Segmentation fault (core dumped)

I am using my own boost, rather than building it with omnitrace, as PIConGPU needs it as well (boost/1.75.0 built against gcc/8.3.0). This was also reported on Crusher, though, so I doubt it's version-specific.

To repro:

export BASE_FOLDER=$(pwd)
export PICSRC=${BASE_FOLDER}/picongpu
export PIC_EXAMPLES=$PICSRC/share/picongpu/examples
export PIC_BACKEND="hip:gfx90a"

export PATH=$PATH:$PICSRC
export PATH=$PATH:$PICSRC/bin
export PATH=$PATH:$PICSRC/src/tools/bin

export CXX=hipcc
pic-create ${PICSRC}/share/picongpu/benchmarks/TWEAC-FOM/ fom
cd fom
pic-build -t 2

# run PIConGPU in an interactive shell on one GPU for 100 steps and use GPU-aware MPI (--mpiDirect)
omnitrace -v 3 -- ./bin/picongpu --mpiDirect -d 1 1 1 -g 240 272 224 --periodic 1 1 1 -s 100 -r 2

Missing path in modulefile?

I'm testing out omnitrace-1.4.0-opensuse-15.3-ROCm-50200-PAPI-OMPT-Python3 on ORNL's Crusher cluster. After instrumenting my code, I discovered that some TBB shared libraries could not be resolved:

        libtbb.so.2 => not found
        libtbbmalloc_proxy.so.2 => not found
        libtbbmalloc.so.2 => not found

I was able to fix this by adding a line to the omnitrace/1.4.0 modulefile:

prepend-path LD_LIBRARY_PATH "${ROOT}/lib/omnitrace"
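If editing the modulefile is not an option, the same fix can be applied per-shell. The install prefix below is a placeholder for wherever the release tarball was unpacked.

```shell
# Per-shell equivalent of the modulefile fix; the prefix is a placeholder.
OMNITRACE_ROOT="${OMNITRACE_ROOT:-$HOME/sw/omnitrace-1.4.0}"
export LD_LIBRARY_PATH="${OMNITRACE_ROOT}/lib/omnitrace${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
```

Afterwards, ldd on the instrumented binary should resolve the libtbb entries.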

Hang on collecting GRBM_GUI_ACTIVE in LAMMPS

Using:

OMNITRACE_CONFIG_FILE                              = 
OMNITRACE_USE_PERFETTO                             = true
OMNITRACE_USE_TIMEMORY                             = false
OMNITRACE_USE_SAMPLING                             = false
OMNITRACE_USE_PROCESS_SAMPLING                     = false
OMNITRACE_USE_ROCTRACER                            = true
OMNITRACE_USE_ROCM_SMI                             = true
OMNITRACE_USE_KOKKOSP                              = false
OMNITRACE_USE_PID                                  = true
OMNITRACE_USE_RCCLP                                = false
OMNITRACE_USE_ROCPROFILER                          = true
OMNITRACE_USE_ROCTX                                = false
OMNITRACE_OUTPUT_PATH                              = omnitrace-%tag%-output
OMNITRACE_OUTPUT_PREFIX                            = 
OMNITRACE_CRITICAL_TRACE                           = false
OMNITRACE_PAPI_EVENTS                              = PAPI_TOT_CYC
OMNITRACE_PERFETTO_BACKEND                         = inprocess
OMNITRACE_PERFETTO_BUFFER_SIZE_KB                  = 1024000
OMNITRACE_PERFETTO_FILL_POLICY                     = discard
OMNITRACE_PROCESS_SAMPLING_DURATION                = -1
OMNITRACE_PROCESS_SAMPLING_FREQ                    = 0
OMNITRACE_ROCM_EVENTS                              = GRBM_GUI_ACTIVE
OMNITRACE_SAMPLING_CPUS                            = all
OMNITRACE_SAMPLING_DELAY                           = 0.5
OMNITRACE_SAMPLING_DURATION                        = 0
OMNITRACE_SAMPLING_FREQ                            = 200
OMNITRACE_SAMPLING_GPUS                            = 0,1
OMNITRACE_TIME_OUTPUT                              = true
OMNITRACE_TIMEMORY_COMPONENTS                      = wall_clock
OMNITRACE_VERBOSE                                  = 0
OMNITRACE_ENABLED                                  = true
OMNITRACE_SUPPRESS_CONFIG                          = false
OMNITRACE_SUPPRESS_PARSING                         = false

hangs on the first kernel call:

$ AMD_LOG_LEVEL=3 /home/nicurtis/lammps_benchmarking/install/tpl/openmpi/bin/mpirun --mca pml ucx --mca btl ^vader,tcp,openib,uct -np 1 ./lmp -k on g 1 -sf kk -pk kokkos cuda/aware on neigh half neigh/qeq full newton on -v x 6 -v y 6 -v z 8 -v steps 25 -in in.reaxc.hns -nocite -log TheraC63/reaxff//log.lammps
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

    
[066.998]       perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""

[omnitrace][pid=30219] MPI rank: 0 (0), MPI size: 1 (1)
LAMMPS (23 Jun 2022 - Update 1)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:105)
  will use up to 1 GPU(s) per node
:3:rocdevice.cpp            :416 : 81067696131 us: 30219: [tid:0x7f68d9031280] Initializing HSA stack.
:3:comgrctx.cpp             :33  : 81067696207 us: 30219: [tid:0x7f68d9031280] Loading COMGR library.
:3:rocdevice.cpp            :207 : 81067696378 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[2]=0x5b88910(fine=0x5b88b60,coarse=0x5b97640) for gpu agent=0x5df3880
:3:rocdevice.cpp            :1611: 81067696802 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067697588 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[2]=0x5b88910(fine=0x5b88b60,coarse=0x5b97640) for gpu agent=0x5e30cb0
:3:rocdevice.cpp            :1611: 81067697802 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067698438 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[2]=0x5b88910(fine=0x5b88b60,coarse=0x5b97640) for gpu agent=0x5e6e3d0
:3:rocdevice.cpp            :1611: 81067698628 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067699255 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[2]=0x5b88910(fine=0x5b88b60,coarse=0x5b97640) for gpu agent=0x5eabad0
:3:rocdevice.cpp            :1611: 81067699441 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067700248 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[6]=0x5b9bfc0(fine=0x5b9c1e0,coarse=0x5b9c960) for gpu agent=0x5ee91e0
:3:rocdevice.cpp            :1611: 81067700432 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067701884 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[6]=0x5b9bfc0(fine=0x5b9c1e0,coarse=0x5b9c960) for gpu agent=0x5f26930
:3:rocdevice.cpp            :1611: 81067702074 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067703320 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[6]=0x5b9bfc0(fine=0x5b9c1e0,coarse=0x5b9c960) for gpu agent=0x5f64010
:3:rocdevice.cpp            :1611: 81067703500 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067704752 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[6]=0x5b9bfc0(fine=0x5b9c1e0,coarse=0x5b9c960) for gpu agent=0x5fa1710
:3:rocdevice.cpp            :1611: 81067704929 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:hip_context.cpp          :50  : 81067706380 us: 30219: [tid:0x7f68d9031280] Direct Dispatch: 1
:3:hip_device_runtime.cpp   :517 : 81067708010 us: 30219: [tid:0x7f68d9031280] hipGetDeviceCount: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708019 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5c2e0, 0 )
:3:hip_device.cpp           :348 : 81067708219 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708237 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5c5f8, 1 )
:3:hip_device.cpp           :348 : 81067708254 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708258 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5c910, 2 )
:3:hip_device.cpp           :348 : 81067708286 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708298 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5cc28, 3 )
:3:hip_device.cpp           :348 : 81067708312 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708316 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5cf40, 4 )
:3:hip_device.cpp           :348 : 81067708329 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708333 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5d258, 5 )
:3:hip_device.cpp           :348 : 81067708356 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708367 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5d570, 6 )
:3:hip_device.cpp           :348 : 81067708380 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708385 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5d888, 7 )
:3:hip_device.cpp           :348 : 81067708395 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :530 : 81067708403 us: 30219: [tid:0x7f68d9031280] hipSetDevice ( 0 )
:3:hip_device_runtime.cpp   :535 : 81067708424 us: 30219: [tid:0x7f68d9031280] hipSetDevice: Returned hipSuccess : 
:3:hip_memory.cpp           :493 : 81067708445 us: 30219: [tid:0x7f68d9031280] hipMalloc ( 0x7fff288c3f20, 8448 )
:3:rocdevice.cpp            :2093: 81067708474 us: 30219: [tid:0x7f68d9031280] device=0x653dda0, freeMem_ = 0xfeffdf00
:3:hip_memory.cpp           :495 : 81067708478 us: 30219: [tid:0x7f68d9031280] hipMalloc: Returned hipSuccess : 0x7f6051b00000: duration: 33 us
:3:hip_memory.cpp           :1225: 81067708487 us: 30219: [tid:0x7f68d9031280] hipMemcpyAsync ( 0x7f6051b00000, 0x7fff288c40c0, 256, hipMemcpyDefault, stream:<null> )
:3:rocdevice.cpp            :2686: 81067708503 us: 30219: [tid:0x7f68d9031280] number of allocated hardware queues with low priority: 0, with normal priority: 0, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp            :2757: 81067721343 us: 30219: [tid:0x7f68d9031280] created hardware queue 0x7f68680ca000 with size 4096 with priority 1, cooperative: 0
:3:devprogram.cpp           :2675: 81067924077 us: 30219: [tid:0x7f68d9031280] Using Code Object V4.
:3:devprogram.cpp           :2978: 81067925217 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_fillImage
:3:devprogram.cpp           :2978: 81067925223 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_fillBufferAligned2D
:3:devprogram.cpp           :2978: 81067925225 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_fillBufferAligned
:3:devprogram.cpp           :2978: 81067925227 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyImage1DA
:3:devprogram.cpp           :2978: 81067925228 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyBufferAligned
:3:devprogram.cpp           :2978: 81067925229 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_streamOpsWait
:3:devprogram.cpp           :2978: 81067925230 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyBuffer
:3:devprogram.cpp           :2978: 81067925232 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_streamOpsWrite
:3:devprogram.cpp           :2978: 81067925233 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyBufferRectAligned
:3:devprogram.cpp           :2978: 81067925234 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_gwsInit
:3:devprogram.cpp           :2978: 81067925236 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyBufferRect
:3:devprogram.cpp           :2978: 81067925237 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyImageToBuffer
:3:devprogram.cpp           :2978: 81067925238 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyBufferToImage
:3:devprogram.cpp           :2978: 81067925239 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyImage
:3:rocvirtual.hpp           :62  : 81067925542 us: 30219: [tid:0x7f68d9031280] Host active wait for Signal = (0x7f686811d180) for 100000 ns
:3:rocvirtual.cpp           :143 : 81067925558 us: 30219: [tid:0x7f68d9031280] Signal = (0x7f686811d180), start = 81067925545769, end = 81067925547369
:3:hip_memory.cpp           :1226: 81067925567 us: 30219: [tid:0x7f68d9031280] hipMemcpyAsync: Returned hipSuccess : : duration: 217080 us
:3:hip_stream.cpp           :450 : 81067925582 us: 30219: [tid:0x7f68d9031280] hipStreamSynchronize ( stream:<null> )
:3:rocdevice.cpp            :2636: 81067925599 us: 30219: [tid:0x7f68d9031280] No HW event
:3:hip_stream.cpp           :451 : 81067925601 us: 30219: [tid:0x7f68d9031280] hipStreamSynchronize: Returned hipSuccess : 
:3:hip_memory.cpp           :2461: 81067925613 us: 30219: [tid:0x7f68d9031280] hipMemset ( 0x7f6051b00100, 0, 8192 )
:3:rocvirtual.cpp           :679 : 81067925626 us: 30219: [tid:0x7f68d9031280] Arg3: ulong* bufULong = ptr:0x7f6051b00000 obj:[0x7f6051b00000-0x7f6051b02100]
:3:rocvirtual.cpp           :679 : 81067925628 us: 30219: [tid:0x7f68d9031280] Arg4: uchar* pattern = ptr:0x7f686807c080 obj:[0x7f686807c000-0x7f686807d000]
:3:rocvirtual.cpp           :753 : 81067925630 us: 30219: [tid:0x7f68d9031280] Arg5: uint patternSize = val:1
:3:rocvirtual.cpp           :753 : 81067925631 us: 30219: [tid:0x7f68d9031280] Arg6: ulong offset = val:32
:3:rocvirtual.cpp           :753 : 81067925633 us: 30219: [tid:0x7f68d9031280] Arg7: ulong size = val:1024
:3:rocvirtual.cpp           :2723: 81067925634 us: 30219: [tid:0x7f68d9031280] ShaderName : __amd_rocclr_fillBufferAligned
:3:rocvirtual.hpp           :62  : 81067935725 us: 30219: [tid:0x7f68d9031280] Host active wait for Signal = (0x7f686811d080) for -1 ns
# hangs here forever

module file: no environment variable with absolute install path

As a workaround to get HIP kernels to appear in the trace file, I need to set the environment variable

export HSA_TOOLS_LIB=<path to omnitrace install>/omnitrace-1.6.0/rocm-5.1.0/lib/libomnitrace.so

The problem is that the modulefile does not provide an environment variable, e.g. omnitrace_ROOT or omnitrace_HOME, that I can use to set HSA_TOOLS_LIB.
Currently, I need to use $omnitrace_DIR/../../.., which is not very elegant.
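Spelled out, the workaround looks like this; it assumes the module sets omnitrace_DIR to <prefix>/lib/cmake/omnitrace, and the default value is only a placeholder.

```shell
# Workaround: derive the install prefix from omnitrace_DIR, assuming it
# points at <prefix>/lib/cmake/omnitrace (an assumption, not documented).
omnitrace_DIR="${omnitrace_DIR:-/opt/omnitrace/lib/cmake/omnitrace}"  # placeholder default
OMNITRACE_ROOT="${omnitrace_DIR%/lib/cmake/omnitrace}"
export HSA_TOOLS_LIB="${OMNITRACE_ROOT}/lib/libomnitrace.so"
```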

Tested with: omnitrace 1.6.0

Generated omnitrace config file disables reading config and parsing environment

  • When a config file is generated via omnitrace-avail -G, the settings OMNITRACE_SUPPRESS_CONFIG and OMNITRACE_SUPPRESS_PARSING, which suppress reading a config file and suppress parsing the environment, respectively, are always set to true. This is because when omnitrace is initialized, it sets these values to false to ensure that no config files or env values are read after initialization.
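A manual fix after generation is simply to flip the two settings back in the generated file (setting names are taken from the issue above):

```
# edit in the generated omnitrace.cfg so that a secondary config file
# and environment overrides are honored again
OMNITRACE_SUPPRESS_CONFIG  = false
OMNITRACE_SUPPRESS_PARSING = false
```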

Omnitrace v1.7.2 with ROCm 5.3.0 rocprofiler_iterate_info issue

Hi, I've been running into an issue getting omnitrace up and running with ROCm 5.3.0. When running the omnitrace-avail command I get:

$ omnitrace-avail -G omnitrace.cfg --all
[omnitrace][0][0][fatal] 
[omnitrace][0][0][fatal] ERROR :: rocprofiler_iterate_info(), MetricsDict(), metrics .xml open error '/opt/rocm-5.3.0/rocprofiler/lib/metrics.xml'


### ERROR ### [omnitrace][PID=3425133][TID=0] signal=6 (SIGABRT) abort program (formerly SIGIOT). code: -6
Backtrace:
[PID=3425133][TID=0][0/9] __restore_rt
[PID=3425133][TID=0][1/9] gsignal +0x10f
[PID=3425133][TID=0][2/9] abort +0x127
[PID=3425133][TID=0][3/9] kokkosp_dual_view_sync.cold.4330 +0x16f93
[PID=3425133][TID=0][4/9] OnLoad +0x3b820
[PID=3425133][TID=0][5/9] OnLoad +0x3f5f7
[PID=3425133][TID=0][6/9] kokkosp_dual_view_sync.cold.4330 +0x49ee37
[PID=3425133][TID=0][7/9] __libc_start_main +0xf3
[PID=3425133][TID=0][8/9] kokkosp_dual_view_sync.cold.4330 +0x4cae26

Backtrace (demangled):
[PID=3425133][TID=0][0/9] /lib64/libpthread.so.0(+0x12ce0) [0x7f49167afce0]
[PID=3425133][TID=0][1/9] /lib64/libc.so.6(gsignal+0x10f) [0x7f49120e5a9f]
[PID=3425133][TID=0][2/9] /lib64/libc.so.6(abort+0x127) [0x7f49120b8e05]
[PID=3425133][TID=0][3/9] omnitrace-avail() [0x46047b]
[PID=3425133][TID=0][4/9] omnitrace-avail() [0x1044540]
[PID=3425133][TID=0][5/9] omnitrace-avail() [0x1048317]
[PID=3425133][TID=0][6/9] omnitrace-avail() [0x8e831f]
[PID=3425133][TID=0][7/9] /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f49120d1cf3]
[PID=3425133][TID=0][8/9] omnitrace-avail() [0x91430e]

Backtrace (demangled):
[PID=3425133][TID=0][0/9] __restore_rt
[PID=3425133][TID=0][1/9] gsignal +0x10f
[PID=3425133][TID=0][2/9] abort +0x127
[PID=3425133][TID=0][3/9] kokkosp_dual_view_sync.cold.4330 +0x16f93
[PID=3425133][TID=0][4/9] OnLoad +0x3b820
[PID=3425133][TID=0][5/9] OnLoad +0x3f5f7
[PID=3425133][TID=0][6/9] kokkosp_dual_view_sync.cold.4330 +0x49ee37
[PID=3425133][TID=0][7/9] __libc_start_main +0xf3
[PID=3425133][TID=0][8/9] kokkosp_dual_view_sync.cold.4330 +0x4cae26

Backtrace (lineinfo):
[omnitrace] realpath failed for 'omnitrace-avail' :: No such file or directory
[omnitrace] realpath failed for 'omnitrace-avail' :: No such file or directory
[PID=3425133][TID=0][0/6]
    [??:?] __GI_abort
[PID=3425133][TID=0][1/6]
    [/home/semiller/software/omnitrace/rocm-5.3.0/install/omnitrace/bin/omnitrace-avail:?] kokkosp_dual_view_sync.cold.4330
[PID=3425133][TID=0][2/6]
    [/home/semiller/omnitrace-avail:?] OnLoad
[PID=3425133][TID=0][3/6]
    [/home/semiller/omnitrace-avail:?] OnLoad
[PID=3425133][TID=0][4/6]
    [/home/semiller/software/omnitrace/rocm-5.3.0/install/omnitrace/bin/omnitrace-avail:?] kokkosp_dual_view_sync.cold.4330
[PID=3425133][TID=0][5/6]
    [/usr/lib64/libc-2.28.so:?] __libc_start_main

[omnitrace] Finalizing afer signal 6 ::  Signal:    SIGABRT (signal number:   6)          abort program (formerly SIGIOT)
omnitrace :: : Aborted (Signal sent by tkill() 3425133 10042)
Aborted (core dumped)

The same install process built on ROCm 5.2.3 gives:

$ omnitrace-avail -G omnitrace.cfg --all
[omnitrace-avail] Outputting text configuration file './omnitrace.cfg'...

Is there a workaround available for ROCm 5.3.0?

omnitrace-avail --advanced option for settings

  • Several configuration options do not need to be displayed in most scenarios
    • These options make the more important options less visible / noticeable
    • Examples include:
      • OMNITRACE_PERFETTO_SHMEM_SIZE_HINT_KB
      • OMNITRACE_CRITICAL_TRACE_BUFFER_COUNT (most of the critical-trace options, honestly)
      • etc.
  • Propose adding the "advanced" category to several options and only displaying/dumping these command-line options in omnitrace-avail if the --advanced flag is provided

Runtime instrumentation's defaults do not seem to match binary rewrite's

Discovered when looking at PIConGPU, doing a:

omnitrace -v 3 -- ./bin/picongpu

will pull in modules from libc, boost, OMPI, UCX, HIP, HSA, and so on.
Something like 46k functions over 270 modules.

Whereas doing a binary rewrite seems to default to only symbols defined in the main binary (in this case, ~4k functions in 1 module)

omnitrace -v 3 -o picongpu -- ./bin/picongpu

Given dyninst's occasional fragility, I think it would be safer for both modes to default to the binary rewrite's current behaviour, and let the user expand the instrumentation as desired afterwards.

Specifically for PIConGPU, doing runtime instrumentation pulls in symbols from boost, which causes dyninst to segfault.

Segfault instrumenting Cray MPI w/ v1.2.0

To reproduce on Crusher:

source sw/omnitrace-devel/share/omnitrace/setup-env.sh
module load craype-accel-amd-gfx90a
module load PrgEnv-cray
module load rocm
omnitrace -o $(basename /opt/cray/pe/lib64/libmpi_cray.so.12) -v 3 -- /opt/cray/pe/lib64/libmpi_cray.so.12
...
<output in attached log>

Looking at the core file shows:

(gdb) bt
#0  0x00007fffed4ef26f in Dyninst::Relocation::Instrumenter::handleCondDirExits(Dyninst::Relocation::RelocBlock*, Dyninst::Relocation::RelocGraph*, instPoint*) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#1  0x00007fffed4f0015 in Dyninst::Relocation::Instrumenter::funcExitInstrumentation(Dyninst::Relocation::RelocBlock*, Dyninst::Relocation::RelocGraph*) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#2  0x00007fffed4f020b in Dyninst::Relocation::Instrumenter::process(Dyninst::Relocation::RelocBlock*, Dyninst::Relocation::RelocGraph*) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#3  0x00007fffed4ed280 in Dyninst::Relocation::Transformer::processGraph(Dyninst::Relocation::RelocGraph*) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#4  0x00007fffed4d8c32 in Dyninst::Relocation::CodeMover::transform(Dyninst::Relocation::Transformer&) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#5  0x00007fffed45cb4b in AddressSpace::transform(boost::shared_ptr<Dyninst::Relocation::CodeMover>) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#6  0x00007fffed45dcf3 in AddressSpace::relocateInt(std::_Rb_tree_const_iterator<func_instance*>, std::_Rb_tree_const_iterator<func_instance*>, unsigned long) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#7  0x00007fffed461fce in AddressSpace::relocate() () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#8  0x00007fffed506e1a in Dyninst::PatchAPI::DynInstrumenter::run() () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#9  0x00007fffed14f831 in Dyninst::PatchAPI::Patcher::run() () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libpatchAPI.so.11.0
#10 0x00007fffed14f010 in Dyninst::PatchAPI::Command::commit() () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libpatchAPI.so.11.0
#11 0x00007fffed45e97c in AddressSpace::patch(AddressSpace*) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#12 0x00007fffed429c7e in BPatch_binaryEdit::writeFile(char const*) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#13 0x000000000042670f in ?? ()
#14 0x00007fffe86ec2bd in __libc_start_main () from /lib64/libc.so.6
#15 0x00000000004299ea in ?? ()

Question: how to avoid hang during library instrumentation

When I attempt to instrument a particular library in the Trilinos project, the process doesn't finish, even when left running overnight.

This is with omnitrace release 1.7 on Crusher. The library in question is libteuchosnumerics.so.13 and the command is

omnitrace -v -1 --print-instrumented functions -o /ccs/home/jjhu/crusher/libs-instrumented/libteuchosnumerics.so.13

The documentation presents a few options -- are there any that you'd recommend that I try?
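One combination worth trying, borrowed from the libfabric report further down this page, raises the per-function size thresholds so dyninst considers far fewer instrumentation candidates. The library path and threshold values here are illustrative, not a recommendation.

```shell
# Raise instruction/address-range thresholds to shrink dyninst's workload.
# Path and values are illustrative; tune per library.
LIB=libteuchosnumerics.so.13
if command -v omnitrace >/dev/null 2>&1; then
    omnitrace -v 1 -r 64 -i 1024 --min-address-range-loop 64 \
        -o "${LIB}.inst" -- "${LIB}"
else
    echo "omnitrace not on PATH; flags shown for reference only" >&2
fi
```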

I'm using Trilinos develop a76c1c4a9, and my module environment is

Currently Loaded Modules:
  1) libfabric/1.15.0.0                      9) cray-dsmml/0.2.2         17) rocm/5.2.0                        25) metis/5.1.0
  2) craype-network-ofi                     10) cray-mpich/8.1.16        18) cmake/3.22.1                      26) yaml-cpp/0.7.0
  3) perftools-base/22.05.0                 11) cray-libsci/21.08.1.2    19) ninja/1.10.2                      27) zlib/1.2.11
  4) xpmem/2.4.4-2.3_2.12__gff0e1d9.shasta  12) PrgEnv-cray/8.3.3        20) cray-hdf5-parallel/1.12.1.1       28) superlu/5.3.0
  5) cray-pmi/6.1.2                         13) xalt/1.3.0               21) cray-netcdf-hdf5parallel/4.8.1.1  29) omnitrace/1.7.0
  6) cce/14.0.0                             14) DefApps/default          22) parallel-netcdf/1.12.2
  7) tmux/3.2a                              15) craype-accel-amd-gfx90a  23) boost/1.78.0
  8) craype/2.7.15                          16) craype-x86-trento        24) parmetis/4.0.3

Can't instrument libfabric on Crusher

$ omnitrace -v 3 -r 64 -i 1024 --min-address-range-loop 64 -o $(basename /opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1) -- /opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1
[omnitrace][exe] 
[omnitrace][exe] command :: '/opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1.17.0'...
[omnitrace][exe] 
[omnitrace][exe] Option '--min-address-range-loop' specified but '--min-instructions-loop <N>' was not specified. Setting minimum instructions for loops to 0...
[omnitrace][exe] Option '--min-instructions' specified but '--min-instructions-loop <N>' was not specified. Setting minimum instructions for loops to 1024...
[omnitrace][exe] Resolved 'libomnitrace-rt.so' to '/autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/libomnitrace-rt.so.11.0.1'...
[omnitrace][exe] DYNINST_API_RT: /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/libomnitrace-rt.so.11.0.1
[omnitrace][exe] [dyninst-option]> TypeChecking         =   on
[omnitrace][exe] [dyninst-option]> SaveFPR              =   on
[omnitrace][exe] [dyninst-option]> DelayedParsing       =   on
[omnitrace][exe] [dyninst-option]> DebugParsing         =  off
[omnitrace][exe] [dyninst-option]> InstrStackFrames     =  off
[omnitrace][exe] [dyninst-option]> TrampRecursive       =  off
[omnitrace][exe] [dyninst-option]> MergeTramp           =   on
[omnitrace][exe] [dyninst-option]> BaseTrampDeletion    =  off
[omnitrace][exe] instrumentation target: /opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1.17.0
[omnitrace][exe] Opening '/opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1.17.0' for binary rewrite... Done
[omnitrace][exe] Getting the address space image, modules, and procedures...
[omnitrace][exe] Module size before loading instrumentation library: 125
### MODULES ###
|             ../../../libgcc/libgcc2.c |              ../sysdeps/x86_64/crti.S |                   libfabric.so.1.17.0 |            prov/cxi/src/cxip_atomic.c | 
|                prov/cxi/src/cxip_av.c |             prov/cxi/src/cxip_avset.c |              prov/cxi/src/cxip_cntr.c |              prov/cxi/src/cxip_coll.c | 
|                prov/cxi/src/cxip_cq.c |              prov/cxi/src/cxip_ctrl.c |              prov/cxi/src/cxip_curl.c |               prov/cxi/src/cxip_dom.c | 
|                prov/cxi/src/cxip_ep.c |                prov/cxi/src/cxip_eq.c |            prov/cxi/src/cxip_fabric.c |            prov/cxi/src/cxip_faults.c | 
|                prov/cxi/src/cxip_if.c |              prov/cxi/src/cxip_info.c |              prov/cxi/src/cxip_iomm.c |                prov/cxi/src/cxip_mr.c | 
|               prov/cxi/src/cxip_msg.c |       prov/cxi/src/cxip_ptelist_buf.c |          prov/cxi/src/cxip_rdzv_pte.c |            prov/cxi/src/cxip_repsum.c | 
|           prov/cxi/src/cxip_req_buf.c |               prov/cxi/src/cxip_rma.c |               prov/cxi/src/cxip_rxc.c |         prov/cxi/src/cxip_telemetry.c | 
|               prov/cxi/src/cxip_txc.c |            prov/cxi/src/cxip_zbcoll.c | prov/hook/ho...debug/src/hook_debug.c |        prov/hook/perf/src/hook_perf.c | 
|                  prov/hook/src/hook.c |               prov/hook/src/hook_av.c |               prov/hook/src/hook_cm.c |             prov/hook/src/hook_cntr.c | 
|               prov/hook/src/hook_cq.c |           prov/hook/src/hook_domain.c |               prov/hook/src/hook_ep.c |               prov/hook/src/hook_eq.c | 
|             prov/hook/src/hook_wait.c |             prov/rxd/src/rxd_atomic.c |                 prov/rxd/src/rxd_av.c |               prov/rxd/src/rxd_cntr.c | 
|                 prov/rxd/src/rxd_cq.c |             prov/rxd/src/rxd_domain.c |                 prov/rxd/src/rxd_ep.c |             prov/rxd/src/rxd_fabric.c | 
|               prov/rxd/src/rxd_init.c |                prov/rxd/src/rxd_msg.c |                prov/rxd/src/rxd_rma.c |             prov/rxd/src/rxd_tagged.c | 
|             prov/rxm/src/rxm_atomic.c |                 prov/rxm/src/rxm_av.c |               prov/rxm/src/rxm_conn.c |                 prov/rxm/src/rxm_cq.c | 
|             prov/rxm/src/rxm_domain.c |                 prov/rxm/src/rxm_ep.c |             prov/rxm/src/rxm_fabric.c |               prov/rxm/src/rxm_init.c | 
|                prov/rxm/src/rxm_rma.c |              prov/tcp/src/tcpx_attr.c |          prov/tcp/src/tcpx_conn_mgr.c |                prov/tcp/src/tcpx_cq.c | 
|            prov/tcp/src/tcpx_domain.c |                prov/tcp/src/tcpx_ep.c |                prov/tcp/src/tcpx_eq.c |            prov/tcp/src/tcpx_fabric.c | 
|              prov/tcp/src/tcpx_init.c |               prov/tcp/src/tcpx_msg.c |          prov/tcp/src/tcpx_progress.c |               prov/tcp/src/tcpx_rma.c | 
|        prov/tcp/src/tcpx_shared_ctx.c |                prov/udp/src/udpx_cq.c |            prov/udp/src/udpx_domain.c |                prov/udp/src/udpx_ep.c | 
|            prov/udp/src/udpx_fabric.c |              prov/udp/src/udpx_init.c |      prov/util/src/cuda_mem_monitor.c |      prov/util/src/rocr_mem_monitor.c | 
|           prov/util/src/util_atomic.c |             prov/util/src/util_attr.c |               prov/util/src/util_av.c |              prov/util/src/util_buf.c | 
|             prov/util/src/util_cntr.c |             prov/util/src/util_coll.c |               prov/util/src/util_cq.c |           prov/util/src/util_domain.c | 
|               prov/util/src/util_ep.c |               prov/util/src/util_eq.c |           prov/util/src/util_fabric.c |             prov/util/src/util_main.c | 
|        prov/util/src/util_mem_hooks.c |      prov/util/src/util_mem_monitor.c |         prov/util/src/util_mr_cache.c |           prov/util/src/util_mr_map.c | 
|               prov/util/src/util_ns.c |              prov/util/src/util_pep.c |             prov/util/src/util_poll.c |              prov/util/src/util_shm.c | 
|             prov/util/src/util_wait.c |        prov/util/src/ze_mem_monitor.c |                         src/abi_1_0.c |                          src/common.c | 
|                          src/enosys.c |                          src/fabric.c |                        src/fasthash.c |                        src/fi_tostr.c | 
|                            src/hmem.c |                       src/hmem_cuda.c |               src/hmem_cuda_gdrcopy.c |                       src/hmem_rocr.c | 
|                         src/hmem_ze.c |                         src/indexer.c |                             src/iov.c |                     src/linux/rdpmc.c | 
|                             src/log.c |                             src/mem.c |                            src/perf.c |                          src/rbtree.c | 
|                  src/shared/ofi_str.c |                            src/tree.c |                        src/unix/osd.c |                             src/var.c | 
| 

[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/available-instr.json'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/available-instr.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/overlapping-instr.json'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/overlapping-instr.txt'... Done
[omnitrace][exe] function: '_init' ... found
[omnitrace][exe] function: '_fini' ... found
[omnitrace][exe] function: 'main' ... not found
[omnitrace][exe] function: 'omnitrace_user_start_trace' ... not found
[omnitrace][exe] function: 'omnitrace_user_stop_trace' ... not found
[omnitrace][exe] function: 'MPI_Init' ... not found
[omnitrace][exe] function: 'MPI_Init_thread' ... not found
[omnitrace][exe] function: 'MPI_Finalize' ... not found
[omnitrace][exe] function: 'MPI_Comm_rank' ... not found
[omnitrace][exe] function: 'MPI_Comm_size' ... not found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/libomnitrace-dl.so.1.2.0'...
[omnitrace][exe] loading library: '/autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/libomnitrace-dl.so.1.2.0'...
[omnitrace][exe] loadLibrary(/autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/libomnitrace-dl.so.1.2.0) result = success
[omnitrace][exe] Finding instrumentation functions...
[omnitrace][exe] function: 'omnitrace_init' ... found
[omnitrace][exe] function: 'omnitrace_finalize' ... found
[omnitrace][exe] function: 'omnitrace_set_env' ... found
[omnitrace][exe] function: 'omnitrace_set_mpi' ... found
[omnitrace][exe] function: 'omnitrace_push_trace' ... found
[omnitrace][exe] function: 'omnitrace_pop_trace' ... found
[omnitrace][exe] function: 'omnitrace_register_source' ... found
[omnitrace][exe] function: 'omnitrace_register_coverage' ... found
[omnitrace][exe] function: '_main' ... not found
[omnitrace][exe] using '_init' and '_fini' in lieu of 'main'...
[omnitrace][exe] Finding init entry... [omnitrace][exe] Done
[omnitrace][exe] Finding fini exit... [omnitrace][exe] Done
[omnitrace][exe] Beginning insertion set...
[omnitrace][exe] Getting call expressions... [omnitrace][exe] Done
[omnitrace][exe] Getting call snippets... [omnitrace][exe] Done
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/libomnitrace-dl.so.1.2.0'...
[omnitrace][exe] Adding main entry snippets...
[omnitrace][exe] Adding main exit snippets...
[omnitrace][exe] Beginning instrumentation loop...
[omnitrace][exe] 
[omnitrace][exe] [function][Instrumenting] no-constraint :: 'cxip_amo_common'...
[omnitrace][exe] [function][Instrumenting] no-constraint :: 'cxip_amo_emit_idc'...
[omnitrace][exe] [function][Instrumenting] no-constraint :: 'fi_cxi_ini'...
[omnitrace][exe] [function][Instrumenting] no-constraint :: 'cxip_rma_common'...
[omnitrace][exe] [function][Instrumenting] no-constraint :: 'rxm_handle_comp'...
[omnitrace][exe]    2 instrumented funcs in prov/cxi/src/cxip_atomic.c
[omnitrace][exe]    1 instrumented funcs in prov/cxi/src/cxip_info.c
[omnitrace][exe]    1 instrumented funcs in prov/cxi/src/cxip_rma.c
[omnitrace][exe]    1 instrumented funcs in prov/rxm/src/rxm_cq.c
[omnitrace][exe] 
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/available-instr.json'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/available-instr.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/instrumented-instr.json'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/instrumented-instr.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/excluded-instr.json'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/excluded-instr.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/overlapping-instr.json'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/overlapping-instr.txt'... Done
[omnitrace][exe] 
[omnitrace][exe] The instrumented executable image is stored in '/autofs/nccs-svm1_home1/nicurtis/allreduce_issue-master/libfabric.so.1'
[omnitrace][exe] End of omnitrace
[omnitrace][exe] Exit code: 0
(gdb) s
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

    
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 2. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[New Thread 0x7fffb808c700 (LWP 106872)]
[782.641]       perfetto.cc:55903 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""

[New Thread 0x7fff617fd700 (LWP 106876)]
0x00007fffe8a42179 in _dl_catch_exception () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fffe8a42179 in _dl_catch_exception () from /lib64/libc.so.6
#1  0x00007fffe8a4221f in _dl_catch_error () from /lib64/libc.so.6
#2  0x00007fffe7240ba5 in _dlerror_run () from /opt/rocm-5.1.0/lib/../../../lib64/libdl.so.2
#3  0x00007fffe72405bf in dlsym () from /opt/rocm-5.1.0/lib/../../../lib64/libdl.so.2
#4  0x00007fffd479a3d8 in dlsym_wrapper () from /ccs/home/nicurtis/sw/omnitrace-devel/lib/omnitrace/libgotcha.so.2
#5  0x00007fffdf28722d in cuda_hmem_init () at src/common.c:106
#6  0x00007fffdf285d9f in cuda_copy_to_dev (device=140737488316752, dst=0x1, src=0x7fffe8a42179 <_dl_catch_exception+171>, size=5601056) at src/hmem_cuda.c:143
#7  0x00007fffdf27e501 in fi_dupinfo_ (info=0x7fffffff6c80) at src/fabric.c:1154
#8  0x00007fffdf27eac7 in fi_open_ (version=<optimized out>, name=<optimized out>, attr=<optimized out>, attr_len=<optimized out>, flags=140737149617536, fid=0x1000b, context=0x7fffffff6df0) at src/fabric.c:1296
#9  0x00007fffeb4d9ce0 in open_fabric () from /opt/cray/pe/lib64/libmpi_cray.so.12
#10 0x00007fffeb4db0d0 in MPIDI_OFI_mpi_init_hook () from /opt/cray/pe/lib64/libmpi_cray.so.12
#11 0x00007fffeb33667f in MPID_Init () from /opt/cray/pe/lib64/libmpi_cray.so.12
#12 0x00007fffe9a408a5 in MPIR_Init_thread () from /opt/cray/pe/lib64/libmpi_cray.so.12
#13 0x00007fffe9a40674 in PMPI_Init () from /opt/cray/pe/lib64/libmpi_cray.so.12
#14 0x0000000000301d37 in ?? ()
#15 0x00000001ffff0200 in ?? ()
#16 0x00000001ebff80b5 in ?? ()
#17 0x000000000020e38e in ?? ()
#18 0x00007fffffff7418 in ?? ()
#19 0x000000000020e38e in ?? ()
#20 0x0000000000000001 in ?? ()
#21 0x00007fffffff72a8 in ?? ()
#22 0x0000000000000001 in ?? ()
#23 0x00007fffffff7418 in ?? ()
#24 0x00007fffe89e8331 in _getopt_internal () from /lib64/libc.so.6
#25 0x00007fffffff72a8 in ?? ()
#26 0x0000000000000001 in ?? ()
#27 0x00007fffffff7418 in ?? ()
#28 0x000000000020e38e in ?? ()
#29 0x0000000000302493 in ?? ()
#30 0xffff720100000025 in ?? ()
#31 0x0000000000000064 in ?? ()
#32 0x00007fffffff7290 in ?? ()
#33 0x0000000000000000 in ?? ()
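A rewritten shared library only takes effect if the dynamic loader resolves it ahead of the original copy. A minimal sketch of wiring that up (the application name is a placeholder, and the working-directory assumption comes from where the instrumenter wrote its output in the log above):

```shell
# Put the directory holding the rewritten libfabric.so.1 first on the search path
# (per the log, the instrumenter stored it in the current working directory)
export LD_LIBRARY_PATH="$PWD:$LD_LIBRARY_PATH"

# Verify which copy the loader will resolve before launching the job
ldd ./my_app | grep libfabric
```

This also makes it easy to back out the instrumented library by simply dropping the directory from `LD_LIBRARY_PATH`.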

Dyninst trap issue redux

This is the same issue we saw previously in LAMMPS, where Dyninst does not catch traps correctly, but now in PIConGPU.
To reproduce, follow the instructions in #145, but perform a binary rewrite and run with:

./picongpu --mpiDirect -d 1 1 1 -g 240 272 224 --periodic 1 1 1 -s 100 -r 2
...
PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Field solver condition: c * dt <= 1.00502 ? (c * dt = 1)
PIConGPUVerbose PHYSICS(1) | Resolving plasma oscillations?
   Estimates are based on DensityRatio to BASE_DENSITY of each species
   (see: density.param, speciesDefinition.param).
   It and does not cover other forms of initialization
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? (omega_p * dt = 0.00104301)
PIConGPUVerbose PHYSICS(1) | macro particles per device: 365568000
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 1.6384
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 6.53658e-17
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 1.95962e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 1.49248e-30
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 2.62501e-19
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 2.60765e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 86981.7
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 1.34138e-13
PIConGPUVerbose PHYSICS(1) | Resolving Debye length for species "e"?
PIConGPUVerbose PHYSICS(1) | Estimate used momentum variance in 57120 supercells with at least 10 macroparticles each
PIConGPUVerbose PHYSICS(1) | 57120 (100 %) supercells had local Debye length estimate not resolved by a single cell
PIConGPUVerbose PHYSICS(1) | Estimated weighted average temperature 0.00049991 keV and corresponding Debye length 1.31401e-08 m.
   The grid has 0.0821258 cells per average Debye length
Trace/breakpoint trap (core dumped)

Using the workaround of:

export OMNITRACE_IGNORE_DYNINST_TRAMPOLINE=1

fails with:

### ERROR ###  [ rank : 0 ] Error code : 11 @ 0 :  Signal:    SIGSEGV (signal number:  11)                   segmentation violation. Unknown segmentation fault error: 128.
[PID=144196][TID=0][0/5]> omnitrace_pop_region +0x59b3
[PID=144196][TID=0][1/5]> omnitrace_pop_region +0x5ee8
[PID=144196][TID=0][2/5]> __restore_rt
[PID=144196][TID=0][3/5]> _ZN5pmacc11TaskReceiveINS_4math6VectorIfLi3ENS1_16StandardAccessorENS1_17StandardNavigatorENS1_6detail17Vector_componentsIfLi3EEEEELj3EE13executeInternEv +0x23c
[PID=144196][TID=0][4/5]> pmacc::Manager::execute_dyninst +0x186

The current workaround is to simply exclude TaskReceive from instrumentation.
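That exclusion can be applied at instrumentation time with omnitrace's regex-based function exclude option (`-E` / `--function-exclude`); the sketch below assumes a binary-rewrite invocation like the ones above, and the binary name and regex are placeholders:

```shell
# Skip instrumenting anything matching TaskReceive (the regex is an assumption
# about how the mangled pmacc::TaskReceive symbols will match)
omnitrace -E 'TaskReceive' -o ./picongpu.inst -- ./picongpu
```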

Add option to disable debug args in perfetto

In several places, calls into Perfetto attach debug annotations carrying timestamps and function arguments. These annotations can significantly inflate the size of the Perfetto trace file. An option is needed to disable this behavior.
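One possible shape for such a control is an environment toggle consistent with the existing OMNITRACE_* settings; the variable name below is hypothetical and does not exist yet:

```shell
# Hypothetical option: suppress timestamp/argument debug annotations in the trace
export OMNITRACE_PERFETTO_ANNOTATIONS=OFF
```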

Loop instrumentation option does not appear to be instrumenting loops

It appears that the -l / --instrument-loops option to the omnitrace binary instrumenter no longer generates instrumentation around loops. This regression likely arose during the refactoring to support code coverage. A test should be devised to prevent this regression in the future.
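A simple regression check could rewrite a small binary with -l and assert that the instrumentation listing contains loop entries. The output file names follow the pattern shown in the instrumenter logs above, but the assumption that loop entries are tagged with "loop" in the listing is unverified:

```shell
# Rewrite with loop instrumentation enabled, forcing minimal thresholds
omnitrace -l --min-instructions-loop 0 -o ./app.inst -- ./app

# Fail loudly if no loop entries made it into the instrumented listing
grep -qi loop omnitrace-app.inst-output/instrumented-instr.txt \
  || { echo "regression: no loops instrumented"; exit 1; }
```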

Omnitrace 1.7: errors out when data workers are used to asynchronously move minibatches from host to device

Repro:

import torch
import numpy as np


assert torch.cuda.is_available(), "GPU is not available"

device = torch.device("cuda")

# num_workers > 0 triggers the bug (pin_memory has no effect on it)
# https://pytorch.org/docs/stable/data.html#multi-process-data-loading
kwargs = {'num_workers': 1, 'pin_memory': True}

samples = 1000
shape = 5
out_elems = 2

# Inputs
train_tensorx = torch.Tensor(np.ones([samples, shape, shape]))
# Outputs
train_tensory = torch.Tensor(np.ones([samples, out_elems])) 

train_dataset = torch.utils.data.TensorDataset(train_tensorx, train_tensory)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=4, shuffle=True, **kwargs)

sumd = np.zeros([shape,shape])

for batch_idx, (data, _) in enumerate(train_loader):
    data = data.to(device)

print("Complete!") 

fails with e.g.:

(error screenshot attached to the original issue)
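For completeness, the failing run was presumably launched under omnitrace's Python support along these lines (the exact invocation and script name are assumptions, not taken from the report):

```shell
# Assumed invocation: profile the repro script via omnitrace's Python wrapper
omnitrace-python ./repro.py
```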
