tud-zih-energy / lo2s

Linux OTF2 Sampling - A Lightweight Node-Level Performance Monitoring Tool

Home Page: https://tu-dresden.de/zih/forschung/projekte/lo2s?set_language=en

License: GNU General Public License v3.0

CMake 7.61% C++ 92.32% Makefile 0.08%

linux trace otf2 sampling linux-perf-bindings cpu-profiling profiling kernel monitoring-tool

lo2s's Issues

Summary size for large traces is wrong

Trace contains a lot of metrics

$ lo2s -m 1024 -a -t sched/sched_switch -t power/cpu_idle
...
[ lo2s: system mode, monitored processes: 214, 111.666s CPU, 2045.56s total ]
[ lo2s: 69 wakeups, wrote 16777216.00 TiB lo2s_trace_2018-03-16T19-50-50 ]

$ du -sh lo2s_trace_2018-03-16T19-50-50
2.7G lo2s_trace_2018-03-16T19-50-50
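
The reported 16777216.00 TiB equals exactly 2^64 bytes, which hints that an unsigned 64-bit size wrapped around or carried an error value before it was formatted. That reading is an assumption, not a confirmed diagnosis. The following minimal sketch shows one way to format a byte count with binary units; `format_size` is a hypothetical helper and not taken from lo2s.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper, not the lo2s implementation: formats a byte count
// using binary units, going beyond TiB so huge values stay readable.
std::string format_size(std::uint64_t bytes)
{
    static const char* const units[] = { "B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB" };
    const std::size_t unit_count = sizeof(units) / sizeof(units[0]);

    double value = static_cast<double>(bytes);
    std::size_t unit = 0;
    // Divide by 1024 until the value fits the current unit or units run out.
    while (value >= 1024.0 && unit + 1 < unit_count)
    {
        value /= 1024.0;
        ++unit;
    }
    assert(unit < unit_count);

    char buffer[32];
    std::snprintf(buffer, sizeof(buffer), "%.2f %s", value, units[unit]);
    return std::string(buffer);
}
```

A summary line could additionally refuse to print a size when the underlying directory walk reported an error, rather than formatting whatever value came back.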

Minor typo

lo2s/src/config.cpp

Line 134 in 7667799

"clock used for perf timestamps (see --list_clockids for supported arguments)")

You probably mean --list-clockids.

Segfault when tracing cmake

In https://github.com/tud-zih-energy/lo2s/blob/master/src/mmap.cpp#L163 it can happen that mapping.dso == nullptr. This causes dead kittens and segfaults. 😿

As stuff goes wrong there anyways, we could just throw something. Though it would be better to fix the null dso in the first place.

Happens when tracing cmake .. in an empty build dir.

Value of ip:

(const lo2s::Address &) @0x2ad3c40009f0: {v_ = 47190393701150}

Contents of the map_:

std::map with 9 elements = {[{start = {v_ = 4194304}, end = {v_ = 5398528}}] = {start = {v_ = 4194304}, end = {v_ = 5398528}, pgoff = {v_ = 0}, dso = @0x1cbd628}, [{start = {v_ = 47190391128064}, end = {
      v_ = 47190391283712}}] = {start = {v_ = 47190391128064}, end = {v_ = 47190391283712}, pgoff = {v_ = 0}, dso = @0x2ad3a800c128}, [{start = {v_ = 47190391296000}, end = {v_ = 47190391304192}}] = {start = {
      v_ = 47190391296000}, end = {v_ = 47190391304192}, pgoff = {v_ = 0}, dso = @0x2ad3a8015ee8}, [{start = {v_ = 47190393389056}, end = {v_ = 47190396817408}}] = {start = {v_ = 47190393389056}, end = {
      v_ = 47190396817408}, pgoff = {v_ = 0}, dso = @0x0}, [{start = {v_ = 47190396817408}, end = {v_ = 47190398930944}}] = {start = {v_ = 47190396817408}, end = {v_ = 47190398930944}, pgoff = {v_ = 0}, 
    dso = @0x0}, [{start = {v_ = 47190398930944}, end = {v_ = 47190402904064}}] = {start = {v_ = 47190398930944}, end = {v_ = 47190402904064}, pgoff = {v_ = 0}, dso = @0x2ad3a8033128}, [{start = {
      v_ = 47190402904064}, end = {v_ = 47190405107712}}] = {start = {v_ = 47190402904064}, end = {v_ = 47190405107712}, pgoff = {v_ = 0}, dso = @0x2ad3a8033128}, [{start = {v_ = 47190405107712}, end = {
      v_ = 47190407278592}}] = {start = {v_ = 47190405107712}, end = {v_ = 47190407278592}, pgoff = {v_ = 0}, dso = @0x2ad3c4000c68}, [{start = {v_ = 18446744073699065856}, end = {
      v_ = 18446744073699069952}}] = {start = {v_ = 18446744073699065856}, end = {v_ = 18446744073699069952}, pgoff = {v_ = 0}, dso = @0xa15708}}
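
A minimal sketch of the "guard the null dso" idea, assuming simplified stand-in types: `Dso`, `Mapping`, and `lookup_dso_name` are hypothetical and do not mirror the real lo2s classes. It returns a placeholder for mappings without a DSO and throws when no mapping contains the address.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Simplified stand-ins for the lo2s types; names and layout are assumptions.
struct Dso
{
    std::string name;
};

struct Mapping
{
    std::uint64_t start;
    std::uint64_t end; // exclusive
    const Dso* dso;    // may be null, as seen in the dump above
};

// Looks up the mapping containing ip (keyed by start address) and returns
// the DSO name, or a placeholder if the mapping has no DSO attached.
std::string lookup_dso_name(const std::map<std::uint64_t, Mapping>& mappings, std::uint64_t ip)
{
    // upper_bound yields the first mapping that starts after ip; step back one.
    auto it = mappings.upper_bound(ip);
    if (it == mappings.begin())
    {
        throw std::out_of_range("address lies below all known mappings");
    }
    --it;
    const Mapping& mapping = it->second;
    assert(mapping.start <= ip);
    if (ip >= mapping.end)
    {
        throw std::out_of_range("address is not covered by any mapping");
    }
    if (mapping.dso == nullptr)
    {
        return "[unknown dso]";
    }
    return mapping.dso->name;
}
```

This only avoids the crash. As noted above, finding out why the dso is null in the first place remains the better fix.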

Create TraceMerger

We will someday need a tool that allows us to merge two or more OTF2 traces.

Score-P independent environment variables for metric plugins

Currently lo2s reuses the SCOREP_METRIC_PLUGINS variable to add metric plugins. It would be good to have a Score-P independent environment variable for adding metric plugins, e.g. a LO2S_METRIC_PLUGINS variable.

This would avoid confusion with Score-P metric plugins, as lo2s has a slightly different interface (synchronous plugins are not handled as far as I know), and it would avoid unintended side effects if Score-P is traced.
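
A minimal sketch of how such a lookup could work. Two things here are assumptions and not part of the proposal: the fallback to SCOREP_METRIC_PLUGINS for backwards compatibility, and the comma as list separator. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Splits a comma-separated plugin list, skipping empty entries.
std::vector<std::string> split_plugin_list(const std::string& value)
{
    std::vector<std::string> names;
    std::istringstream stream(value);
    std::string name;
    while (std::getline(stream, name, ','))
    {
        if (!name.empty())
        {
            names.push_back(name);
        }
    }
    return names;
}

// Prefers LO2S_METRIC_PLUGINS; falls back to SCOREP_METRIC_PLUGINS
// (the fallback is an assumption made for this sketch).
std::vector<std::string> requested_metric_plugins()
{
    const char* value = std::getenv("LO2S_METRIC_PLUGINS");
    if (value == nullptr)
    {
        value = std::getenv("SCOREP_METRIC_PLUGINS");
    }
    if (value == nullptr)
    {
        return {};
    }
    return split_plugin_list(value);
}
```

Dropping the fallback entirely would give the cleanest separation when Score-P itself is being traced, at the cost of breaking existing setups.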

Best,

Andreas

Make use of SystemTreeDomain definitions

OTF2 allows us to specify what system tree nodes represent. We should write SystemTreeDomain definitions.

Metric leader twice in trace

When the metric leader is chosen as one of the default metrics, it will get added twice.
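
One way to avoid the duplicate is to filter while building the event group. This is a minimal sketch with a hypothetical `build_metric_group` function; it does not reflect how lo2s assembles its groups internally.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Builds the list of events for one metric group: the leader first, then
// every requested event that is not already part of the group.
std::vector<std::string> build_metric_group(const std::string& leader,
                                            const std::vector<std::string>& events)
{
    assert(!leader.empty());
    std::vector<std::string> group{ leader };
    for (const auto& event : events)
    {
        if (std::find(group.begin(), group.end(), event) == group.end())
        {
            group.push_back(event);
        }
    }
    return group;
}
```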

Add support for raw metrics

In perf, I can pass raw metrics via the -e flag. This should also be supported by lo2s for -e and -E.

Raw metrics are encoded as rNNNN. More information is given in man perf-list.

Default metric leader selection

On systems where ref-cycles is not available, one has to manually change the metric leader.
We should have a list of suitable metric leaders that we try before nagging the user about it.

Possible list in this sequence:

ref-cycles
cpu-cycles

@rschoene anything else?
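
The fallback order could be sketched as follows. `select_metric_leader` and the `is_available` predicate are hypothetical; in a real build the predicate would probe the event through perf, which is left out here.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Returns the first candidate for which is_available() is true, or an
// empty string if none of them can be used on this system.
std::string select_metric_leader(const std::vector<std::string>& candidates,
                                 const std::function<bool(const std::string&)>& is_available)
{
    assert(is_available);
    for (const auto& name : candidates)
    {
        if (is_available(name))
        {
            return name;
        }
    }
    return std::string();
}
```

Only when the result is empty would the user need to be asked to pick a leader manually.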

Does not work on kernel.perf_event_paranoid=2

Investigate why lo2s does not work on kernel.perf_event_paranoid=2 and print appropriate error message.
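
A minimal sketch of reading the setting and turning it into a hint for the user. The level descriptions follow the kernel's sysctl documentation for unprivileged users; the function names and the message wording are assumptions for illustration.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Reads an integer sysctl value from the given file, e.g.
// /proc/sys/kernel/perf_event_paranoid; returns false if that fails.
bool read_paranoid_level(const std::string& path, int& level)
{
    std::ifstream file(path);
    return static_cast<bool>(file >> level);
}

// Describes what an unprivileged user may still do at the given level.
std::string describe_paranoid_level(int level)
{
    if (level >= 2)
    {
        return "user-space measurements only, no kernel profiling";
    }
    if (level == 1)
    {
        return "kernel profiling allowed, no CPU-wide events";
    }
    if (level == 0)
    {
        return "CPU-wide events allowed, no raw tracepoints";
    }
    return "no restrictions";
}
```

With such a check in place, a failing perf_event_open call could be reported together with the current level instead of a bare errno.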

Remove default metric channels

I think we shouldn't use any default metrics for various reasons:

They are only there because of legacy reasons and laziness
There is no single obvious set of default metrics
They introduce overhead by default without the possibility to disable them
We haven't documented which metrics are added by default
Passing one additional metric suddenly removes all default metrics, which is a surprising behavior

Refactor Monitors

Currently, multiple monitors sometimes run on the same core. Also, certain monitors that should be FdMonitors are currently IntervalMonitors.

Open Questions:

Flexibility:
How flexibly should this be configurable at run time? For example:

Configure time intervals for reading x86adapt and x86energy separately?
Configure sample reader to be read at interval and/or watermark?

Consolidation:
Should multiple buffers (e.g. tracepoints and metrics) both be read even if only one triggers the watermark?

Pro: Overall less switches to userspace
Contra: Possibly more overhead from reading almost-empty buffers

Dynamic/static:

Static (current): Each specific Monitor class knows its specific reader types. For optional readers: if (foo_reader) { foo_reader->read(); } Indexing for fds (non-consolidated) would be tricky.
Dynamic: All Readers need a common base class with a virtual read(). The Monitor would then contain a vector of them. Somewhat simpler, more generic code and easier fd-indexing. Performance cost of virtual dispatch, once per wake-up and Reader. Unholy mix of virtual and CRTP 😱.

In the end, we want as few threads and wake-ups as possible, but as many as necessary to exploit the parallelism. There is also the lingering TODO of splitting the MetricMonitor.
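
The dynamic variant discussed above could look roughly like this. It is a minimal sketch: `Reader`, `CountingReader`, and `Monitor` are simplified, hypothetical classes, and fd handling as well as the CRTP side are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Common base class: every reader exposes a virtual read().
class Reader
{
public:
    virtual ~Reader() = default;
    virtual void read() = 0;
};

// Example reader that only counts how often it was asked to read.
class CountingReader : public Reader
{
public:
    void read() override
    {
        ++reads;
    }

    int reads = 0;
};

// The monitor owns an arbitrary set of readers and drains all of them on
// each wake-up: one virtual call per reader and wake-up.
class Monitor
{
public:
    void add_reader(std::unique_ptr<Reader> reader)
    {
        assert(reader != nullptr);
        readers_.push_back(std::move(reader));
    }

    void on_wakeup()
    {
        for (auto& reader : readers_)
        {
            reader->read();
        }
    }

    std::size_t reader_count() const
    {
        return readers_.size();
    }

private:
    std::vector<std::unique_ptr<Reader>> readers_;
};
```

Because the readers sit in a vector, a file descriptor's position in a poll set can map directly to the reader's index, which is the "easier fd-indexing" mentioned above.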

parse /proc/cpuinfo just once

src/platform.cpp:detect_processor(void) should be processed only once, and not per created thread.
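
A minimal sketch of caching the result with a function-local static, whose initialization runs exactly once and is thread-safe since C++11. `ProcessorInfo`, `parse_cpuinfo`, and `processor_info` are hypothetical stand-ins for whatever detect_processor actually computes; the parse counter exists only to make the behavior observable.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical result type; the real function may extract other fields.
struct ProcessorInfo
{
    std::string model_name;
};

// Counts real parses so the caching effect can be checked.
int& parse_count()
{
    static int count = 0;
    return count;
}

// Parses the first "model name" line of a cpuinfo-style file.
ProcessorInfo parse_cpuinfo(const std::string& path)
{
    ++parse_count();
    ProcessorInfo info;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line))
    {
        if (line.compare(0, 10, "model name") == 0)
        {
            const auto colon = line.find(':');
            if (colon != std::string::npos)
            {
                const auto first = line.find_first_not_of(" \t", colon + 1);
                if (first != std::string::npos)
                {
                    info.model_name = line.substr(first);
                }
            }
            break;
        }
    }
    return info;
}

// All threads share one parsed result; the file is read only on first use.
const ProcessorInfo& processor_info()
{
    static const ProcessorInfo info = parse_cpuinfo("/proc/cpuinfo");
    return info;
}
```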

Disable --clockid / --list-clockids options if built without USE_PERF_CLOCKID

If USE_PERF_CLOCKID==OFF, any execution of lo2s (regardless of arguments) will result in a warning:

[1513092780631385996][pid: 19291][tid:139737018783488][ WARN]: This installation was built without support for setting a perf reference clock.
[1513092780631422019][pid: 19291][tid:139737018783488][ WARN]: Any parameter to -k/--clockid will only affect the local reference clock.

The warning should not appear, and the respective options should not be available.

Wrong calculation for scaled counter values in legacy code path

I'm not sure whether this is intentional, but it seems the line here should not be

return (static_cast<double>(value) / running) * value;

but rather

return (static_cast<double>(enabled) / running) * value;

What do you think?
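
For reference, the usual scaling for a multiplexed counter extrapolates the raw value by the ratio of the time the event was enabled to the time it was actually running. A minimal sketch, with a hypothetical function name and an assumed policy of returning 0 when the event never ran:

```cpp
#include <cassert>
#include <cstdint>

// Extrapolates a raw counter value for an event that was only scheduled
// on the PMU for `running` out of `enabled` time units.
double scale_counter(std::uint64_t value, std::uint64_t enabled, std::uint64_t running)
{
    // An event cannot run for longer than it was enabled.
    assert(running <= enabled);
    if (running == 0)
    {
        // Never scheduled: nothing to extrapolate from (assumed policy).
        return 0.0;
    }
    return (static_cast<double>(enabled) / running) * value;
}
```

For example, a raw value of 1000 with enabled = 200 and running = 100 scales to 2000, whereas the original expression would yield 1000 / 100 * 1000 = 10000.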

Measurement summary

Show a short summary after execution. It is suppressed with -q.

Possible things to display

Number of spawned threads/processes during execution
Name and arguments of executed binary
Execution wall time / CPU time
Name of generated trace file
Size of generated trace file
Number of wakeups

Maybe this is already too much, we need to keep it concise.

Sort and group options in usage message

This is possible when using more than one po::options_description.

The usage page should look more like this:

Usage:
  ./lo2s [options] ./a.out
  ./lo2s [options] -- ./a.out --option-to-a-out
  ./lo2s [options] --pid $(pidof some-process)

Allowed options:
  --help                                produce help message
  --version                             print version information
  -q [ --quiet ]                        suppress output
  -v [ --verbose ]                      verbose output (specify multiple times 
                                        to get increasingly more verbose 
                                        output)
  -o [ --output-trace ] arg             output trace directory
  --list-clockids                       list all available clockids
  --list-events                         list all available events
  -m [ --mmap-pages ] arg (=16)         number of pages to be used by each 
                                        internal buffer
  -k [ --clockid ] arg (=monotonic-raw) clock used for perf timestamps (see 
                                        --list-clockids for supported 
                                        arguments)

System-wide monitoring:
  -a [ --all-cpus ]                     System-wide monitoring of all CPUs.

Sampling options:
  --command arg
  -c [ --count ] arg (=11010113)        sampling period (# of events specified 
                                        by -e)
  -e [ --event ] arg (=instructions)    interrupt source event for sampling
  -g [ --call-graph ]                   call-graph recording
  -n [ --no-ip ]                        do not record instruction pointers [NOT
                                        CURRENTLY SUPPORTED]
  -p [ --pid ] arg (=-1)                attach to specific pid
  -i [ --readout-interval ] arg (=100)  time interval between metric and 
                                        sampling buffer readouts in 
                                        milliseconds
  --disassemble                         enable augmentation of samples with 
                                        instructions (default if supported)
  --no-disassemble                      disable augmentation of samples with 
                                        instructions
  --kernel                              include events happening in kernel 
                                        space (default)
  --no-kernel                           exclude events happening in kernel 
                                        space

Kernel trace point options:
  -t [ --tracepoint ] arg               enable global recording of a raw 
                                        tracepoint event (usually requires 
                                        root)

Perf metric options:
  -E [ --metric-event ] arg             the name of a perf event to measure
  --metric-leader arg (=ref-cycles)     name of leading perf event
  --metric-count arg                    # of events to elapse by metric leader 
                                        before reading metric buffer
  --metric-frequency arg                metric buffer reads per second

x86_adapt options:
  -x [ --x86-adapt-cpu-knob ] arg       add x86_adapt knobs as recordings. 
                                        Append #accumulated_last for semantics.

Add lo2s version

cmake gets git commit hash or tag
add output on help page // --version
add info in creator otf2 archive metadata

Don't record metric leader when there is no valid metric event

Although nothing isn't a valid metric event, lo2s will still set up a metric recording with only the group leader:

$ lo2s -E nothing -vv ls
[1518784342110364342][pid: 99654][tid: 22907724121920][ INFO]: caching event 'nothing'.
[1518784342110418762][pid: 99654][tid: 22907724121920][ INFO]: failed to cache event (reason: missing '/' in event description)
[1518784342110437550][pid: 99654][tid: 22907724121920][ WARN]: 'nothing' does not name a known event, ignoring! (reason: missing '/' in event description)
[1518784342119710536][pid: 99654][tid: 22907724121920][DEBUG]: perf::counter::Reader: sample_freq: 10Hz
[1518784342119717759][pid: 99654][tid: 22907724121920][DEBUG]: perf::counter::Reader: leader event: 'ref-cycles'

Unexpected exit in -a mode if /sys/kernel/debug is read-only

Run lo2s:

$ ./lo2s -a -- true
[1521149802129234474][pid: 25222][tid: 140210198903680][ERROR]: Aborting: basic_ios::clear: iostream error

I traced this down to the constructor of lo2s::perf::tracepoint::EventFormat throwing here, potentially here too.

What would be the correct behavior here? Rethrow a meaningful exception instead of std::ios_base::failure? Log and exit? Check lo2s::perf::tracepoint::get_sched_switch_event() before starting the trace?

Make radare dependency optional

... just like x86_adapt
@rschoene

Configurable perf metrics

Select perf metrics through command line
Add program_option, which prints all available metrics
Sample perf metrics also for -a
Configurable interrupt source
Use PERF_FORMAT_GROUP and reduce number of reads

Check Linux version during runtime

Currently needs 4.1. uname reports a char* :(
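
Since uname only provides the release as a string, the version has to be parsed. A minimal sketch, assuming the release starts with "major.minor" as in "4.15.0-112-generic"; all names are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <sys/utsname.h>

struct KernelVersion
{
    int kernel_major;
    int kernel_minor;
};

// Parses the leading "major.minor" of a release string.
bool parse_kernel_release(const char* release, KernelVersion& version)
{
    assert(release != nullptr);
    return std::sscanf(release, "%d.%d", &version.kernel_major, &version.kernel_minor) == 2;
}

// True if version is at least required_major.required_minor.
bool is_at_least(const KernelVersion& version, int required_major, int required_minor)
{
    return version.kernel_major > required_major ||
           (version.kernel_major == required_major && version.kernel_minor >= required_minor);
}

// Checks the running kernel; returns false if the version cannot be determined.
bool running_kernel_is_at_least(int required_major, int required_minor)
{
    utsname name;
    KernelVersion version{ 0, 0 };
    if (uname(&name) != 0 || !parse_kernel_release(name.release, version))
    {
        return false;
    }
    return is_at_least(version, required_major, required_minor);
}
```

A call such as running_kernel_is_at_least(4, 1) at startup would allow a clear error message instead of a failure deep inside the perf setup.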

Support older Linux versions

Currently needs 4.1 (data_offset / data_size). Need to figure out if there are other compatibility issues (time_enabled / time_active bug).

Need this for taurus.

Split location container in trace

Split into uniquely identifiable locations, which can be accessed repeatedly with the identifier:

thread_locations_
cpu_locations_
cpu_metric_locations_

And fuzzy-named locations, which yield a new location on every access:

named_metric_locations_

Reduce code duplication in trace::metric_writer

Better time synchronization

Make use of use_clockid (since 4.1) to set a good clock for perf
Add option to list all available clocks
Use time_shift, time_mult, time_offset (since forever) // time_zero (since 3.12) to make time conversion
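
The conversion from a hardware cycle count to a perf timestamp follows the formula documented in the comments of linux/perf_event.h for the time_zero case. The sketch below uses a local stand-in struct with the same field names instead of the real perf_event_mmap_page; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for the relevant fields of perf_event_mmap_page.
struct TimeConversion
{
    std::uint16_t time_shift;
    std::uint32_t time_mult;
    std::uint64_t time_zero;
};

// Converts a cycle count (e.g. from the TSC) to a perf timestamp.
// The value is split into quotient and remainder so that the
// multiplication does not overflow for large cycle counts.
std::uint64_t cycles_to_perf_time(std::uint64_t cycles, const TimeConversion& conversion)
{
    assert(conversion.time_shift < 64);
    const std::uint64_t quot = cycles >> conversion.time_shift;
    const std::uint64_t rem = cycles & ((std::uint64_t{ 1 } << conversion.time_shift) - 1);
    return conversion.time_zero + quot * conversion.time_mult +
           ((rem * conversion.time_mult) >> conversion.time_shift);
}
```

With time_shift = 10 and time_mult = 1024 the conversion is the identity plus time_zero, which makes a convenient sanity check.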

Look into PEBS

Can we use PEBS through perf_event with recent kernels?
Or maybe manually use PEBS? (https://github.com/andikleen/pmu-tools/tree/master/simple-pebs)

Overhead introspection

Write event records for perf mmap buffer flushes
After execution, print number of wakeups if not quiet.

add sampling source as counter

When specifying a different sampling event (e.g., via -e cpu-cycles), the event is not recorded as a counter. Please add it.

Improve error message when x86_energy isn't installed

Starting with d41121e, trying to generate the build files with cmake yields the following message:

CMake Warning at CMakeLists.txt:101 (find_package):
  By not providing "Findx86_energy.cmake" in CMAKE_MODULE_PATH this project
  has asked CMake to find a package configuration file provided by
  "x86_energy", but CMake did not find one.

  Could not find a package configuration file provided by "x86_energy"
  (requested version 2.0) with any of the following names:

    x86_energyConfig.cmake
    x86_energy-config.cmake

  Add the installation prefix of "x86_energy" to CMAKE_PREFIX_PATH or set
  "x86_energy_DIR" to a directory containing one of the above files.  If
  "x86_energy" provides a separate development package or SDK, be sure it has
  been installed.

There is no file cmake/Findx86_energy.cmake, did someone forget to add that to a commit or am I missing something?

lo2s fails as event 'ref-cycles' is not available as a metric leader!

I tried to run lo2s on taurus. lo2s fails with the following error:

[1518523527725817000][pid: 14438][tid: 46969806325664][ INFO]: failed to cache event (reason: missing '/' in event description)
[1518523527725831034][pid: 14438][tid: 46969806325664][ERROR]: event 'ref-cycles' is not available as a metric leader!

Setting the flag -e bus-cycles also gives the above-mentioned error.

"Failed to get pid of" on spawning threads

Apparently threads_.at(newpid); on PTRACE_EVENT_CLONE fails, e.g. with FIRESTARTER

Use perf list syntax for tracepoint events

Currently, tracepoint events are specified with slashes, e.g., exceptions/page_fault_kernel, whereas perf list names them with colons, e.g., exceptions:page_fault_kernel.

I'd vote for the perf list style. Any other opinions?
@tilsche @bmario @AndreasGocht @phijor @cvonelm

(Vote yes/no/0, everyone has one vote, as soon as there's a majority for "yes" the issue will be assigned, as soon as there's a majority for "no", the issue will be closed, 0 is for abstention)
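
Independent of the vote, accepting both spellings is cheap. A minimal sketch with a hypothetical `split_tracepoint_name` function that splits at the first colon or slash:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Splits "group:name" (perf list style) or "group/name" (current lo2s
// style) into its two parts; throws if either part is missing.
std::pair<std::string, std::string> split_tracepoint_name(const std::string& spec)
{
    const auto pos = spec.find_first_of(":/");
    if (pos == std::string::npos || pos == 0 || pos + 1 == spec.size())
    {
        throw std::invalid_argument("expected 'group:name' or 'group/name': " + spec);
    }
    return std::make_pair(spec.substr(0, pos), spec.substr(pos + 1));
}
```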

Multi-node and instrumentation/event support

create a PMPI library that writes MPI events to an mmap'ed buffer. These events should include, for example: timestamp, pid, rank, MPI function, communication partners
read that buffer via lo2s, sort the events with the given sampling events and write them to OTF2
do the same for OMPT
check whether to use caliper for MPI events and buffer.

Error when not passing an executable to trace

Here:
https://github.com/tud-zih-energy/lo2s/blame/c8b867af83150f612b59fcf084ca525fec755caf/src/config.cpp#L157
config.monitor_type is not yet set, which leads to the following output when running lo2s without any argument:
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
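
The out_of_range exception suggests that the first element of an empty command vector is accessed. A minimal sketch of validating the arguments before that point; `ParsedArgs` and `validate_args` are hypothetical, and only the default pid of -1 is taken from the usage text.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified, hypothetical view of the parsed command line.
struct ParsedArgs
{
    bool all_cpus;
    int pid; // -1 means "no pid given"
    std::vector<std::string> command;
};

// Returns true if there is something to monitor; otherwise fills in an
// error message so the caller can print it together with the usage text.
bool validate_args(const ParsedArgs& args, std::string& error)
{
    if (!args.all_cpus && args.pid == -1 && args.command.empty())
    {
        error = "No command, --pid, or --all-cpus given.";
        return false;
    }
    return true;
}
```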

Add some integration tests to Travis

Possible tests (within TravisCI):

Running without any arguments prints usage
--help prints help message
--version prints version info
Short -a works
Short process sampling works (e.g. sleep 1)

Note: Travis is a VM, so perf is probably not available

Check perf counter multiplexing

ref-cycles is only available on kernels >= 3.3.0, but is still the default for metric-leader

The default will fail for older kernels!

lo2s/src/config.cpp

Line 134 in 791ca03

    
           ("metric-leader", po::value(&config.metric_leader)->default_value("ref-cycles"),

Support synchronous per host metrics

It would be nice if synchronous per-host metrics could be read at the readout interval.

Handle cmd argument on lo2s -a

If a command is passed, as in lo2s -a -- ./command, the command is spawned and lo2s records until it finishes, but does not otherwise monitor it.

Add uname information to otf2

When tracing lots of different kernel versions it would be nice to have some uname information available in the trace.

I suggest including several uname outputs, most importantly uname -r and uname -v; others won't hurt.

I suggest using archive_.set_property("LO2S::...") as this is easily accessible in Vampir. It is a bit of "creative use"; technically it should probably be some global definition like a system tree node property. But that would be much harder to show in Vampir.

Fails with tracing bash

When running "lo2s bash", it crashes:
./lo2s bash
[1508939755939220588][pid: 28270][tid:139947869783872][ERROR]: mmap failed. You can decrease the buffer size or try to increase /proc/sys/kernel/perf_event_mlock_kb
[1508939755939303320][pid: 28270][tid:139947869783872][ERROR]: Destructing IntervalMonitor before being stopped. This should not happen, but it's fine anyway.
lo2s: /home/rschoene/tmp/thrift/lo2s/src/monitor/interval_monitor.cpp:44: virtual void lo2s::monitor::IntervalMonitor::stop(): Assertion `thread_.joinable()' failed.
Aborted (core dumped)

When reducing the mmap size to 32, it does not crash immediately, but only once you start another bash within the traced bash:
./lo2s -m 32 bash
$ bash
[1508939940847189757][pid: 28442][tid:140234546366272][ERROR]: mmap failed. You can decrease the buffer size or try to increase /proc/sys/kernel/perf_event_mlock_kb
[1508939940847269042][pid: 28442][tid:140234546366272][ERROR]: Destructing IntervalMonitor before being stopped. This should not happen, but it's fine anyway.
lo2s: /home/rschoene/tmp/thrift/lo2s/src/monitor/interval_monitor.cpp:44: virtual void lo2s::monitor::IntervalMonitor::stop(): Assertion `thread_.joinable()' failed.
Aborted (core dumped)

When reducing the mmap size to 16, it does not crash immediately, but only once you start a bash in a bash in a bash within the traced bash:
./lo2s -m 16 bash
$ bash
$ bash
$ bash
[1508939999239586817][pid: 28592][tid:140343153018688][ERROR]: mmap failed. You can decrease the buffer size or try to increase /proc/sys/kernel/perf_event_mlock_kb
[1508939999239677983][pid: 28592][tid:140343153018688][ERROR]: Destructing IntervalMonitor before being stopped. This should not happen, but it's fine anyway.
lo2s: /home/rschoene/tmp/thrift/lo2s/src/monitor/interval_monitor.cpp:44: virtual void lo2s::monitor::IntervalMonitor::stop(): Assertion `thread_.joinable()' failed.
Aborted (core dumped)

Configurable interrupt source and metrics

Make lo2s more generally useful.

Improve help message

With #82 merged, we should look once again at the categories and sort once more
The useful behavior of -- should also be documented.
How to add perf probes

void __attribute__((optimize("O0")))
my_marker(int some_variable)
{
}

sudo perf probe -x ./a.out my_marker some_variable
lo2s -t probe_a:my_marker ...

document LO2S_OUTPUT_LINK (once merged)
document LO2S_OUTPUT_TRACE

hostname
pid

Other options:

Slurm * id

Enable all pre-defined events as listed by perf list

Please make the events defined in enum perf_hw_id, enum perf_hw_cache*id, and enum perf_sw_ids available under the naming scheme used in perf list,

e.g.,
lo2s -e minor-faults ...
should map to PERF_COUNT_SW_PAGE_FAULTS_MIN
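
For the software events, such a mapping could be a simple lookup table. This is a minimal sketch covering only a handful of names as printed by perf list; `EventId` and `lookup_software_event` are hypothetical, while the constants come from linux/perf_event.h.

```cpp
#include <cassert>
#include <cstdint>
#include <linux/perf_event.h>
#include <map>
#include <string>

// Type and config as they would go into perf_event_attr.
struct EventId
{
    std::uint32_t type;
    std::uint64_t config;
};

// Maps a perf list style name to a software event; returns false if the
// name is not in the (deliberately incomplete) table.
bool lookup_software_event(const std::string& name, EventId& id)
{
    static const std::map<std::string, std::uint64_t> table = {
        { "cpu-clock", PERF_COUNT_SW_CPU_CLOCK },
        { "task-clock", PERF_COUNT_SW_TASK_CLOCK },
        { "page-faults", PERF_COUNT_SW_PAGE_FAULTS },
        { "context-switches", PERF_COUNT_SW_CONTEXT_SWITCHES },
        { "cpu-migrations", PERF_COUNT_SW_CPU_MIGRATIONS },
        { "minor-faults", PERF_COUNT_SW_PAGE_FAULTS_MIN },
        { "major-faults", PERF_COUNT_SW_PAGE_FAULTS_MAJ },
    };

    const auto it = table.find(name);
    if (it == table.end())
    {
        return false;
    }
    id.type = PERF_TYPE_SOFTWARE;
    id.config = it->second;
    return true;
}
```

Hardware and cache events would need analogous tables with PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE; the cache config additionally encodes the operation and result.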

Sampling in global view

The holy grail

Record per-cpu call stack samples into the global view (-a)
Create a thread-view trace from the global recordings: optionally, as both views, or through Vampir