tud-zih-energy / lo2s

Linux OTF2 Sampling - A Lightweight Node-Level Performance Monitoring Tool

Home Page: https://tu-dresden.de/zih/forschung/projekte/lo2s?set_language=en

License: GNU General Public License v3.0

CMake 7.61% C++ 92.32% Makefile 0.08%
linux trace otf2 sampling linux-perf-bindings cpu-profiling profiling kernel monitoring-tool

lo2s's People

Contributors

bertwesarg, bmario, cvonelm, phijor, rschoene, s9105947, tilsche

lo2s's Issues

Summary size for large traces is wrong

The trace contains a lot of metrics:

$ lo2s -m 1024 -a -t sched/sched_switch -t power/cpu_idle
...
[ lo2s: system mode, monitored processes: 214, 111.666s CPU, 2045.56s total ]
[ lo2s: 69 wakeups, wrote 16777216.00 TiB lo2s_trace_2018-03-16T19-50-50 ]

$ du -sh lo2s_trace_2018-03-16T19-50-50
2.7G lo2s_trace_2018-03-16T19-50-50
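
The reported figure is telling in itself: 16777216 TiB is 2^24 × 2^40 = 2^64 bytes, which is what an unsigned 64-bit wrap-around (for example a -1 error value) looks like once it is converted to a floating-point number. That cause is only a guess; the issue does not confirm it. Below is a minimal sketch of a size formatter, under the hypothetical name format_size (not taken from lo2s), that keeps scaling past TiB so that such a value at least stands out:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper for illustration: format a byte count with binary units.
std::string format_size(std::uint64_t bytes)
{
    static const char* const units[] = { "B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB" };
    const std::size_t unit_count = sizeof(units) / sizeof(units[0]);

    double value = static_cast<double>(bytes);
    std::size_t unit = 0;
    // Scale down until the value is below 1024 or we run out of units.
    while (value >= 1024.0 && unit + 1 < unit_count)
    {
        value /= 1024.0;
        ++unit;
    }

    char buf[32];
    const int n = std::snprintf(buf, sizeof(buf), "%.2f %s", value, units[unit]);
    // The longest possible output ("1023.99 KiB") fits easily into the buffer.
    assert(n > 0 && static_cast<std::size_t>(n) < sizeof(buf));
    (void)n;
    return std::string(buf);
}
```

With this sketch, format_size(UINT64_MAX) yields "16.00 EiB", which is easier to recognise as a bogus value than a very large TiB count.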

Segfault when tracing cmake

In https://github.com/tud-zih-energy/lo2s/blob/master/src/mmap.cpp#L163 it can happen that mapping.dso == nullptr. This causes dead kittens and segfaults. 😿

As stuff goes wrong there anyways, we could just throw something. Though it would be better to fix the null dso in the first place.

Happens when tracing cmake .. in an empty build dir.

Value of ip:

(const lo2s::Address &) @0x2ad3c40009f0: {v_ = 47190393701150}

Contents of the map_:

std::map with 9 elements = {[{start = {v_ = 4194304}, end = {v_ = 5398528}}] = {start = {v_ = 4194304}, end = {v_ = 5398528}, pgoff = {v_ = 0}, dso = @0x1cbd628}, [{start = {v_ = 47190391128064}, end = {
      v_ = 47190391283712}}] = {start = {v_ = 47190391128064}, end = {v_ = 47190391283712}, pgoff = {v_ = 0}, dso = @0x2ad3a800c128}, [{start = {v_ = 47190391296000}, end = {v_ = 47190391304192}}] = {start = {
      v_ = 47190391296000}, end = {v_ = 47190391304192}, pgoff = {v_ = 0}, dso = @0x2ad3a8015ee8}, [{start = {v_ = 47190393389056}, end = {v_ = 47190396817408}}] = {start = {v_ = 47190393389056}, end = {
      v_ = 47190396817408}, pgoff = {v_ = 0}, dso = @0x0}, [{start = {v_ = 47190396817408}, end = {v_ = 47190398930944}}] = {start = {v_ = 47190396817408}, end = {v_ = 47190398930944}, pgoff = {v_ = 0}, 
    dso = @0x0}, [{start = {v_ = 47190398930944}, end = {v_ = 47190402904064}}] = {start = {v_ = 47190398930944}, end = {v_ = 47190402904064}, pgoff = {v_ = 0}, dso = @0x2ad3a8033128}, [{start = {
      v_ = 47190402904064}, end = {v_ = 47190405107712}}] = {start = {v_ = 47190402904064}, end = {v_ = 47190405107712}, pgoff = {v_ = 0}, dso = @0x2ad3a8033128}, [{start = {v_ = 47190405107712}, end = {
      v_ = 47190407278592}}] = {start = {v_ = 47190405107712}, end = {v_ = 47190407278592}, pgoff = {v_ = 0}, dso = @0x2ad3c4000c68}, [{start = {v_ = 18446744073699065856}, end = {
      v_ = 18446744073699069952}}] = {start = {v_ = 18446744073699065856}, end = {v_ = 18446744073699069952}, pgoff = {v_ = 0}, dso = @0xa15708}}
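
The "just throw something" option could look roughly like the sketch below. Dso, Mapping and lookup_dso are simplified stand-ins invented for illustration, not the actual lo2s classes, and the map is assumed to be keyed by the start address of each mapping.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Simplified, hypothetical stand-ins for the lo2s types.
struct Dso
{
    std::string name;
};

struct Mapping
{
    std::uint64_t start;
    std::uint64_t end;
    std::uint64_t pgoff;
    const Dso* dso; // may be null, as seen in the dump above
};

// Find the mapping containing ip and return its dso, throwing instead of
// dereferencing a null pointer. Assumes the map key equals Mapping::start.
const Dso& lookup_dso(const std::map<std::uint64_t, Mapping>& map, std::uint64_t ip)
{
    auto it = map.upper_bound(ip); // first mapping starting after ip
    if (it == map.begin())
    {
        throw std::out_of_range("no mapping contains address " + std::to_string(ip));
    }
    --it; // last mapping starting at or before ip
    const Mapping& mapping = it->second;
    assert(mapping.start <= ip);
    if (ip >= mapping.end)
    {
        throw std::out_of_range("no mapping contains address " + std::to_string(ip));
    }
    if (mapping.dso == nullptr)
    {
        throw std::runtime_error("mapping starting at " + std::to_string(mapping.start) +
                                 " has no dso");
    }
    return *mapping.dso;
}
```

With the values from the dump, the ip 47190393701150 falls into the range starting at 47190393389056, whose dso is null, so this sketch would throw std::runtime_error instead of crashing. It does not address why the dso is null in the first place.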

Create TraceMerger

We will someday need a tool that allows us to merge two or more OTF2 traces.

Score-P independent environment variables for metric plugins

Currently lo2s reuses the SCOREP_METRIC_PLUGINS variable to add metric plugins. It would be good to have a Score-P independent environment variable for this, e.g. a LO2S_METRIC_PLUGINS variable.

This would avoid confusion with Score-P metric plugins, as lo2s has a slightly different interface (synchronous plugins are not handled, as far as I know), and it would avoid unintended side effects if Score-P itself is traced.

Best,

Andreas
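
A minimal sketch of reading such a variable follows. The helper names split_list and metric_plugins_from_env are hypothetical, and the comma-separated format is an assumption carried over from how SCOREP_METRIC_PLUGINS is commonly used. The sketch deliberately does not fall back to the Score-P variable, since the point of the request is to keep the two apart.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Split a separator-delimited list, skipping empty entries.
std::vector<std::string> split_list(const std::string& list, char separator = ',')
{
    std::vector<std::string> items;
    std::string::size_type begin = 0;
    while (begin <= list.size())
    {
        std::string::size_type end = list.find(separator, begin);
        if (end == std::string::npos)
        {
            end = list.size();
        }
        if (end > begin)
        {
            items.push_back(list.substr(begin, end - begin));
        }
        begin = end + 1;
    }
    return items;
}

// Read the plugin list from a lo2s-specific variable only.
// The variable name is the one proposed in the issue, not an existing option.
std::vector<std::string> metric_plugins_from_env(const char* variable = "LO2S_METRIC_PLUGINS")
{
    assert(variable != nullptr);
    const char* value = std::getenv(variable);
    if (value == nullptr)
    {
        return {};
    }
    return split_list(value);
}
```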

Add support for raw metrics

In perf, I can pass raw metrics via the -e flag. This should also be supported by lo2s for -e and -E.

Raw metrics are encoded as rNNNN, where NNNN is a hexadecimal event descriptor. More information is given in man perf-list.
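
Parsing the rNNNN form could be sketched as below. parse_raw_event is a hypothetical helper, not an existing lo2s function; it only handles the plain hexadecimal form and ignores any modifiers that perf accepts after the number. The returned value would be used as the config of a PERF_TYPE_RAW event.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Return the raw event code for names of the form rNNNN (hexadecimal),
// or std::nullopt if the name is not a raw event descriptor.
std::optional<std::uint64_t> parse_raw_event(const std::string& name)
{
    // 'r' plus 1 to 16 hex digits; more digits would overflow 64 bits.
    if (name.size() < 2 || name.size() > 17 || name[0] != 'r')
    {
        return std::nullopt;
    }
    std::uint64_t config = 0;
    for (std::size_t i = 1; i < name.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        if (!std::isxdigit(c))
        {
            return std::nullopt; // e.g. "ref-cycles" is rejected at '-'
        }
        const unsigned digit = std::isdigit(c) ? static_cast<unsigned>(c - '0')
                                               : static_cast<unsigned>(std::tolower(c) - 'a' + 10);
        assert(digit < 16);
        config = (config << 4) | digit;
    }
    return config;
}
```

Checking every character matters here because ordinary event names such as ref-cycles also start with an r followed by hexadecimal-looking letters.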

Default metric leader selection

On systems where ref-cycles is not available, one has to manually change the metric leader.
We should have a list of suitable metric leaders that we try before nagging the user about it.

Possible list, in this order:

  • ref-cycles
  • cpu-cycles

@rschoene anything else?
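
The fallback could be sketched as follows. select_metric_leader is a hypothetical name, and the availability check is passed in as a callback because how lo2s probes an event is not shown here.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Return the first candidate for which is_available() is true, in list order.
std::optional<std::string>
select_metric_leader(const std::vector<std::string>& candidates,
                     const std::function<bool(const std::string&)>& is_available)
{
    assert(is_available); // the callback must be set
    for (const auto& name : candidates)
    {
        if (is_available(name))
        {
            return name;
        }
    }
    return std::nullopt; // only now bother the user
}
```

A caller would pass { "ref-cycles", "cpu-cycles" } and report an error only if the result is empty.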

Remove default metric channels

I think we shouldn't use any default metrics for various reasons:

  • They are only there because of legacy reasons and laziness
  • There is no single obvious set of default metrics
  • They introduce overhead by default without the possibility to disable them
  • We haven't documented which metrics are added by default
  • Passing one additional metric suddenly removes all default metrics, which is surprising behavior

Refactor Monitors

Currently there are sometimes multiple monitors running on the same core. Also, certain monitors that should be FdMonitors are currently IntervalMonitors.

Open Questions:

  1. Flexibility:
    How flexibly should this be configurable at run-time? For example:
      • Configure time intervals for reading x86adapt and x86energy separately?
      • Configure sample reader to be read at interval and/or watermark?
  2. Consolidation:
    Should multiple buffers (e.g. tracepoints and metrics) both be read even if only one triggers the watermark?
      • Pro: Overall fewer switches to userspace
      • Contra: Possibly more overhead from reading almost-empty buffers
  3. Dynamic/static:
      • Static (current): Each specific Monitor class knows its specific reader types. For optional readers: if (foo_reader) { foo_reader->read(); } Indexing for fds (non-consolidated) would be tricky.
      • Dynamic: All Readers need a common base class with a virtual read(). The Monitor would then contain a vector of them. Somewhat simpler, more generic code and easier fd-indexing. Performance cost of virtual dispatch, once per wake-up and Reader. Unholy mix of virtual and CRTP 😱.

In the end, we want as few threads and wake-ups as possible, but as many as necessary to exploit the parallelism. There is also the lingering TODO of splitting the MetricMonitor.
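
The dynamic variant could be sketched like this. The class names Reader and Monitor follow the wording of the issue, but the interfaces below (add_reader, on_wakeup, CountingReader) are assumptions made for illustration and do not reflect the existing lo2s classes. The sketch reads every registered reader on each wake-up, i.e. it also takes the consolidated route from question 2.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Common base class: every reader exposes a virtual read().
class Reader
{
public:
    virtual ~Reader() = default;
    virtual void read() = 0;
};

// Example reader that only counts how often it was read.
class CountingReader : public Reader
{
public:
    explicit CountingReader(int& counter) : counter_(counter)
    {
    }

    void read() override
    {
        ++counter_;
    }

private:
    int& counter_;
};

// The monitor owns a vector of readers and drains all of them per wake-up.
class Monitor
{
public:
    void add_reader(std::unique_ptr<Reader> reader)
    {
        assert(reader != nullptr);
        readers_.push_back(std::move(reader));
    }

    // One virtual call per reader and wake-up.
    void on_wakeup()
    {
        for (auto& reader : readers_)
        {
            reader->read();
        }
    }

    std::size_t reader_count() const
    {
        return readers_.size();
    }

private:
    std::vector<std::unique_ptr<Reader>> readers_;
};
```

Because the readers live in a vector, the index of a reader can double as the index of its file descriptor, which is the "easier fd-indexing" mentioned above.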

Disable --clockid / --list-clockids options if built without USE_PERF_CLOCKID

If USE_PERF_CLOCKID==OFF, any execution of lo2s (regardless of arguments) will result in a warning:

[1513092780631385996][pid: 19291][tid:139737018783488][ WARN]: This installation was built without support for setting a perf reference clock.
[1513092780631422019][pid: 19291][tid:139737018783488][ WARN]: Any parameter to -k/--clockid will only affect the local reference clock.

The warning should not appear; instead, the respective options should not be available at all.
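
One way to do this is to register the options conditionally at compile time. The sketch below assumes that the CMake option is forwarded to the compiler as a preprocessor definition of the same name, USE_PERF_CLOCKID, which may not match the actual build setup. general_option_names is a hypothetical function that stands in for the real option registration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for option registration: returns the long names of
// the general options that this build offers.
std::vector<std::string> general_option_names()
{
    std::vector<std::string> names = { "help",         "version",     "quiet",     "verbose",
                                       "output-trace", "list-events", "mmap-pages" };
#ifdef USE_PERF_CLOCKID
    // Only offered when the build supports setting a perf reference clock.
    names.push_back("clockid");
    names.push_back("list-clockids");
#endif
    assert(!names.empty());
    return names;
}
```

With the options absent from the parser, passing -k to a build without clock support is rejected as an unknown option, and no warning is needed on ordinary runs.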

Measurement summary

Show a short summary after execution. It is suppressed with -q.

Possible things to display

  • Number of spawned threads/processes during execution
  • Name and arguments of executed binary
  • Execution wall time / CPU time
  • Name of generated trace file
  • Size of generated trace file
  • Number of wakeups

Maybe this is already too much; we need to keep it concise.

Sort and group options in usage message

This is possible when using more than one po::options_description.

The usage page should look more like this:

Usage:
  ./lo2s [options] ./a.out
  ./lo2s [options] -- ./a.out --option-to-a-out
  ./lo2s [options] --pid $(pidof some-process)

Allowed options:
  --help                                produce help message
  --version                             print version information
  -q [ --quiet ]                        suppress output
  -v [ --verbose ]                      verbose output (specify multiple times 
                                        to get increasingly more verbose 
                                        output)
  -o [ --output-trace ] arg             output trace directory
  --list-clockids                       list all available clockids
  --list-events                         list all available events
  -m [ --mmap-pages ] arg (=16)         number of pages to be used by each 
                                        internal buffer
  -k [ --clockid ] arg (=monotonic-raw) clock used for perf timestamps (see 
                                        --list-clockids for supported 
                                        arguments)

System-wide monitoring:
  -a [ --all-cpus ]                     System-wide monitoring of all CPUs.

Sampling options:
  --command arg
  -c [ --count ] arg (=11010113)        sampling period (# of events specified 
                                        by -e)
  -e [ --event ] arg (=instructions)    interrupt source event for sampling
  -g [ --call-graph ]                   call-graph recording
  -n [ --no-ip ]                        do not record instruction pointers [NOT
                                        CURRENTLY SUPPORTED]
  -p [ --pid ] arg (=-1)                attach to specific pid
  -i [ --readout-interval ] arg (=100)  time interval between metric and 
                                        sampling buffer readouts in 
                                        milliseconds
  --disassemble                         enable augmentation of samples with 
                                        instructions (default if supported)
  --no-disassemble                      disable augmentation of samples with 
                                        instructions
  --kernel                              include events happening in kernel 
                                        space (default)
  --no-kernel                           exclude events happening in kernel 
                                        space

Kernel trace point options:
  -t [ --tracepoint ] arg               enable global recording of a raw 
                                        tracepoint event (usually requires 
                                        root)

Perf metric options:
  -E [ --metric-event ] arg             the name of a perf event to measure
  --metric-leader arg (=ref-cycles)     name of leading perf event
  --metric-count arg                    # of events to elapse by metric leader 
                                        before reading metric buffer
  --metric-frequency arg                metric buffer reads per second

x86_adapt options:
  -x [ --x86-adapt-cpu-knob ] arg       add x86_adapt knobs as recordings. 
                                        Append #accumulated_last for semantics.

Add lo2s version

  • cmake gets git commit hash or tag
  • add output on help page // --version
  • add info in creator otf2 archive metadata

Don't record metric leader when there is no valid metric event

Although 'nothing' isn't a valid metric event, lo2s will still set up a metric recording with only the group leader:

$ lo2s -E nothing -vv ls
[1518784342110364342][pid: 99654][tid: 22907724121920][ INFO]: caching event 'nothing'.
[1518784342110418762][pid: 99654][tid: 22907724121920][ INFO]: failed to cache event (reason: missing '/' in event description)
[1518784342110437550][pid: 99654][tid: 22907724121920][ WARN]: 'nothing' does not name a known event, ignoring! (reason: missing '/' in event description)
[1518784342119710536][pid: 99654][tid: 22907724121920][DEBUG]: perf::counter::Reader: sample_freq: 10Hz
[1518784342119717759][pid: 99654][tid: 22907724121920][DEBUG]: perf::counter::Reader: leader event: 'ref-cycles'

Unexpected exit in -a mode if /sys/kernel/debug is read-only

Run lo2s:

$ ./lo2s -a -- true
[1521149802129234474][pid: 25222][tid: 140210198903680][ERROR]: Aborting: basic_ios::clear: iostream error

I traced this down to the constructor of lo2s::perf::tracepoint::EventFormat throwing here, potentially here too.

What would be the correct behavior here? Rethrow a meaningful exception instead of std::ios_base::failure? Log and exit? Check lo2s::perf::tracepoint::get_sched_switch_event() before starting the trace?
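
The "rethrow a meaningful exception" option could be sketched as follows. read_format_file is a hypothetical helper, not the actual EventFormat constructor; it only shows how to attach the file path to the otherwise opaque stream error.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Read a whole file and turn stream failures into an error that names the file.
std::string read_format_file(const std::string& path)
{
    assert(!path.empty());
    std::ifstream in;
    // Make open() failures throw std::ios_base::failure.
    in.exceptions(std::ifstream::failbit | std::ifstream::badbit);
    try
    {
        in.open(path);
        std::ostringstream content;
        content << in.rdbuf();
        return content.str();
    }
    catch (const std::ios_base::failure& e)
    {
        // "basic_ios::clear: iostream error" alone says nothing about the cause.
        throw std::runtime_error("cannot read tracepoint format file '" + path +
                                 "': " + e.what());
    }
}
```

Whether lo2s should then log and exit, or continue without the tracepoint, remains the open question from the issue.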

Configurable perf metrics

  • Select perf metrics through command line
  • Add program_option, which prints all available metrics
  • Sample perf metrics also for -a
  • Configurable interrupt source
  • Use PERF_FORMAT_GROUP and reduce number of reads

Support older Linux versions

Currently lo2s needs Linux 4.1 (data_offset / data_size). We need to figure out if there are other compatibility issues (time_enabled / time_active bug).

Need this for taurus.

Split location container in trace

Split into containers for uniquely identifiable locations. These can be accessed repeatedly with the identifier:

  • thread_locations_
  • cpu_locations_
  • cpu_metric_locations_

And a container for fuzzily named locations, which gives a new location on every access:

  • named_metric_locations_

Reduce code duplication in trace::metric_writer

Better time synchronization

  • Make use of use_clockid (since 4.1) to set a good clock for perf
  • Add option to list all available clocks
  • Use time_shift, time_mult, time_offset (since forever) // time_zero (since 3.12) to convert timestamps
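
For reference, the conversion from a hardware cycle count to a perf timestamp with time_zero can be sketched as below. It follows the formula given in the perf_event_open(2) man page for the fields of perf_event_mmap_page; the struct TimeConversion and the function name are hypothetical, and the arithmetic should be checked against the man page of the targeted kernel version.

```cpp
#include <cassert>
#include <cstdint>

// Copies of the relevant perf_event_mmap_page fields (hypothetical struct).
struct TimeConversion
{
    std::uint16_t time_shift;
    std::uint32_t time_mult;
    std::uint64_t time_zero;
};

// Convert a cycle counter value (e.g. from rdtsc) to a perf timestamp.
std::uint64_t cycles_to_perf_time(std::uint64_t cyc, const TimeConversion& tc)
{
    assert(tc.time_shift < 64);
    // Split the multiplication to avoid overflowing 64 bits.
    const std::uint64_t quot = cyc >> tc.time_shift;
    const std::uint64_t rem = cyc & ((std::uint64_t{ 1 } << tc.time_shift) - 1);
    return tc.time_zero + quot * tc.time_mult + ((rem * tc.time_mult) >> tc.time_shift);
}
```

In real use the three fields have to be read consistently from the mmap page, which this sketch leaves out.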

Overhead introspection

  • Write event records for perf mmap buffer flushes
  • After execution, print number of wakeups if not quiet.

Add sampling source as counter

When specifying a different sampling event (e.g., via -e cpu-cycles), the event is not recorded as a counter. Please add it.

Improve error message when x86_energy isn't installed

Starting with d41121e, trying to generate the build files with cmake yields the following message:

CMake Warning at CMakeLists.txt:101 (find_package):
  By not providing "Findx86_energy.cmake" in CMAKE_MODULE_PATH this project
  has asked CMake to find a package configuration file provided by
  "x86_energy", but CMake did not find one.

  Could not find a package configuration file provided by "x86_energy"
  (requested version 2.0) with any of the following names:

    x86_energyConfig.cmake
    x86_energy-config.cmake

  Add the installation prefix of "x86_energy" to CMAKE_PREFIX_PATH or set
  "x86_energy_DIR" to a directory containing one of the above files.  If
  "x86_energy" provides a separate development package or SDK, be sure it has
  been installed.

There is no file cmake/Findx86_energy.cmake. Did someone forget to add that to a commit, or am I missing something?

lo2s fails as event 'ref-cycles' is not available as a metric leader!

I tried to run lo2s on taurus. It fails with the following error:

[1518523527725817000][pid: 14438][tid: 46969806325664][ INFO]: failed to cache event (reason: missing '/' in event description)
[1518523527725831034][pid: 14438][tid: 46969806325664][ERROR]: event 'ref-cycles' is not available as a metric leader!

Setting the flag -e bus-cycles also gives the above-mentioned error.

Use perf list syntax for tracepoint events

Currently, tracepoint events are specified with slashes, e.g., exceptions/page_fault_kernel, whereas perf list names them with colons, e.g., exceptions:page_fault_kernel.

I'd vote for the perf list style. Any other opinions?
@tilsche @bmario @AndreasGocht @phijor @cvonelm

(Vote yes/no/0, everyone has one vote, as soon as there's a majority for "yes" the issue will be assigned, as soon as there's a majority for "no", the issue will be closed, 0 is for abstention)
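
Independent of the vote, accepting both spellings is cheap. The sketch below normalises either form to the slash-separated form; tracepoint_path is a hypothetical helper name.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Accept "group:event" (perf list style) or "group/event" and return "group/event".
std::string tracepoint_path(const std::string& name)
{
    const std::string::size_type sep = name.find_first_of(":/");
    const bool malformed = sep == std::string::npos  // no separator at all
                           || sep == 0               // empty group
                           || sep + 1 == name.size() // empty event
                           || name.find_first_of(":/", sep + 1) != std::string::npos;
    if (malformed)
    {
        throw std::invalid_argument("'" + name + "' is not of the form group:event");
    }
    std::string path = name;
    path[sep] = '/';
    assert(path.size() == name.size());
    return path;
}
```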

Multi-node and instrumentation/event support

  • create a PMPI library that writes MPI events to an mmap'ed buffer. These events should include, for example: timestamp, pid, rank, MPI function, communication partners
  • read that buffer via lo2s, sort the events with the given sampling events and write them to OTF2
  • do the same for OMPT
  • check whether to use caliper for MPI events and buffer.
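
The "sort the events with the given sampling events" step is essentially a merge of two time-ordered streams. A minimal sketch, assuming both streams are already sorted by timestamp and share a clock; the Event struct and merge_by_time are hypothetical and much simpler than real MPI or sample records:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical, reduced event record.
struct Event
{
    std::uint64_t timestamp;
    int source; // 0 = perf sample, 1 = MPI event
};

// Merge two timestamp-sorted streams into one timestamp-sorted stream.
std::vector<Event> merge_by_time(const std::vector<Event>& samples,
                                 const std::vector<Event>& mpi_events)
{
    auto earlier = [](const Event& a, const Event& b) { return a.timestamp < b.timestamp; };
    assert(std::is_sorted(samples.begin(), samples.end(), earlier));
    assert(std::is_sorted(mpi_events.begin(), mpi_events.end(), earlier));

    std::vector<Event> merged;
    merged.reserve(samples.size() + mpi_events.size());
    // std::merge is stable: on equal timestamps, samples come first.
    std::merge(samples.begin(), samples.end(), mpi_events.begin(), mpi_events.end(),
               std::back_inserter(merged), earlier);
    return merged;
}
```

A streaming implementation would merge incrementally per buffer flush instead of materialising whole vectors, but the ordering logic stays the same.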

Error when not passing an executable to trace

Here:
https://github.com/tud-zih-energy/lo2s/blame/c8b867af83150f612b59fcf084ca525fec755caf/src/config.cpp#L157
config.monitor_type is not yet set, which leads to the following output when running lo2s without any argument:
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)

Add some integration tests to Travis

Possible tests (within TravisCI):

  • Running without any arguments prints usage
  • --help prints help message
  • --version prints version info
  • Short -a works
  • Short process sampling works (e.g. sleep 1)

Note: Travis is a VM, so perf is probably not available

Handle cmd argument on lo2s -a

If a command is passed, as in lo2s -a -- ./command, the command is spawned and lo2s will record until it is finished, but will not monitor it otherwise.

Add uname information to otf2

When tracing lots of different kernel versions it would be nice to have some uname information available in the trace.

I suggest including several uname outputs, most importantly uname -r and uname -v; others won't hurt.

I suggest using archive_.set_property("LO2S::...") as this is easily accessible in Vampir. It is a bit of "creative use"; technically it should probably be some global definition like a system tree node property. But that would be much harder to show in Vampir.
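
Collecting the values is straightforward with the POSIX uname() call. In the sketch below, uname_properties is a hypothetical helper and the LO2S::UNAME::... key names are made up for illustration; only the LO2S:: prefix comes from the suggestion above.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <sys/utsname.h>

// Return the uname fields as key/value pairs ready to be stored as properties.
std::map<std::string, std::string> uname_properties()
{
    struct utsname info;
    if (uname(&info) != 0)
    {
        throw std::runtime_error("uname() failed");
    }
    std::map<std::string, std::string> properties = {
        { "LO2S::UNAME::SYSNAME", info.sysname },
        { "LO2S::UNAME::NODENAME", info.nodename },
        { "LO2S::UNAME::RELEASE", info.release }, // uname -r
        { "LO2S::UNAME::VERSION", info.version }, // uname -v
        { "LO2S::UNAME::MACHINE", info.machine },
    };
    assert(properties.size() == 5);
    return properties;
}
```

Each pair could then be passed to archive_.set_property(key, value) as suggested above; the exact signature of that call is not verified here.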

Fails when tracing bash

Running "lo2s bash" crashes:
./lo2s bash
[1508939755939220588][pid: 28270][tid:139947869783872][ERROR]: mmap failed. You can decrease the buffer size or try to increase /proc/sys/kernel/perf_event_mlock_kb
[1508939755939303320][pid: 28270][tid:139947869783872][ERROR]: Destructing IntervalMonitor before being stopped. This should not happen, but it's fine anyway.
lo2s: /home/rschoene/tmp/thrift/lo2s/src/monitor/interval_monitor.cpp:44: virtual void lo2s::monitor::IntervalMonitor::stop(): Assertion `thread_.joinable()' failed.
Aborted (core dumped)

When reducing the mmap size to 32, it does not crash immediately, but only when you start another bash within the traced bash:
./lo2s -m 32 bash
$ bash
[1508939940847189757][pid: 28442][tid:140234546366272][ERROR]: mmap failed. You can decrease the buffer size or try to increase /proc/sys/kernel/perf_event_mlock_kb
[1508939940847269042][pid: 28442][tid:140234546366272][ERROR]: Destructing IntervalMonitor before being stopped. This should not happen, but it's fine anyway.
lo2s: /home/rschoene/tmp/thrift/lo2s/src/monitor/interval_monitor.cpp:44: virtual void lo2s::monitor::IntervalMonitor::stop(): Assertion `thread_.joinable()' failed.
Aborted (core dumped)

When reducing the mmap size to 16, it does not crash immediately, but only when you start a bash in a bash in a bash within the traced bash:
./lo2s -m 16 bash
$ bash
$ bash
$ bash
[1508939999239586817][pid: 28592][tid:140343153018688][ERROR]: mmap failed. You can decrease the buffer size or try to increase /proc/sys/kernel/perf_event_mlock_kb
[1508939999239677983][pid: 28592][tid:140343153018688][ERROR]: Destructing IntervalMonitor before being stopped. This should not happen, but it's fine anyway.
lo2s: /home/rschoene/tmp/thrift/lo2s/src/monitor/interval_monitor.cpp:44: virtual void lo2s::monitor::IntervalMonitor::stop(): Assertion `thread_.joinable()' failed.
Aborted (core dumped)

Improve help message

  • With #82 merged, we should look once again at the categories and sort once more
  • The useful behavior of -- should also be documented.
  • How to add perf probes
void __attribute__((optimize("O0")))
my_marker(int some_variable)
{
}

sudo perf probe -x ./a.out my_marker some_variable
lo2s -t probe_a:my_marker ...

  • document LO2S_OUTPUT_LINK (once merged)
  • document LO2S_OUTPUT_TRACE

Use PERF_RECORD_COMM

PERF_RECORD_COMM provides information about name changes of processes and threads. This should allow better naming information. Requires Linux 3.16, so probably only optional for now.

Add process tree

Somehow embed the process tree in the trace, e.g. by making another system-tree with class "process" and hang the LOCATION_GROUPs there.

Enable all pre-defined events as listed at perf list

Please make the events defined in enum perf_hw_id, enum perf_hw_cache*id, and enum perf_sw_ids available under the naming scheme used in perf list.

For example,
lo2s -e minor-faults ...
should map to PERF_COUNT_SW_PAGE_FAULTS_MIN.
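
A lookup table for this could start out like the sketch below. PredefinedEvent and lookup_predefined_event are hypothetical names, the table is deliberately incomplete (no aliases such as cycles, and no cache events), and the name-to-constant pairs reflect the names perf list prints, which should be double-checked against the perf version in use.

```cpp
#include <cassert>
#include <cstdint>
#include <linux/perf_event.h>
#include <map>
#include <optional>
#include <string>

// Type and config as they would go into perf_event_attr.
struct PredefinedEvent
{
    std::uint32_t type;
    std::uint64_t config;
};

// Map a perf-list style name to its predefined hardware or software event.
std::optional<PredefinedEvent> lookup_predefined_event(const std::string& name)
{
    static const std::map<std::string, PredefinedEvent> events = {
        { "cpu-cycles", { PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES } },
        { "instructions", { PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS } },
        { "cache-references", { PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_REFERENCES } },
        { "cache-misses", { PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES } },
        { "branch-instructions", { PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_INSTRUCTIONS } },
        { "branch-misses", { PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_MISSES } },
        { "bus-cycles", { PERF_TYPE_HARDWARE, PERF_COUNT_HW_BUS_CYCLES } },
        { "ref-cycles", { PERF_TYPE_HARDWARE, PERF_COUNT_HW_REF_CPU_CYCLES } },
        { "cpu-clock", { PERF_TYPE_SOFTWARE, PERF_COUNT_SW_CPU_CLOCK } },
        { "task-clock", { PERF_TYPE_SOFTWARE, PERF_COUNT_SW_TASK_CLOCK } },
        { "page-faults", { PERF_TYPE_SOFTWARE, PERF_COUNT_SW_PAGE_FAULTS } },
        { "context-switches", { PERF_TYPE_SOFTWARE, PERF_COUNT_SW_CONTEXT_SWITCHES } },
        { "cpu-migrations", { PERF_TYPE_SOFTWARE, PERF_COUNT_SW_CPU_MIGRATIONS } },
        { "minor-faults", { PERF_TYPE_SOFTWARE, PERF_COUNT_SW_PAGE_FAULTS_MIN } },
        { "major-faults", { PERF_TYPE_SOFTWARE, PERF_COUNT_SW_PAGE_FAULTS_MAJ } },
    };
    assert(!events.empty());
    const auto it = events.find(name);
    if (it == events.end())
    {
        return std::nullopt;
    }
    return it->second;
}
```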

Sampling in global view

The holy grail

  • Record per-cpu call stack samples into the global view (-a)
  • Create a thread-view trace from the global recordings optional, both or through Vampir
