not-perf's Introduction

A sampling CPU profiler for Linux similar to perf

Features

  • Support for AMD64, ARM, AArch64 and MIPS64 architectures (where MIPS64 requires a tiny out-of-tree patch to the kernel to work)
  • Support for offline and online stack trace unwinding
  • Support for profiling of binaries without any debug info (without the .debug_frame section)
    • using .eh_frame based unwinding (this is how normal C++ exception handling unwinds the stack) without requiring .eh_frame_hdr (which, depending on the compiler, may not be emitted)
    • using .ARM.exidx + .ARM.extab based unwinding (which is ARM specific and is used instead of .eh_frame)
  • Support for cross-architectural data analysis
  • Fully architecture-agnostic data format
  • Built-in flamegraph generation

Why should I use this instead of perf?

If perf already works for you - great! Keep on using it.

This project was born out of a few limitations of the original perf which make it non-ideal for CPU profiling in embedded-ish environments. Some of those are as follows:

  • lack of support for MIPS64;
  • the big size of generated CPU profiling data due to offline-only stack unwinding, so if you only have a limited amount of storage space you either need to profile with a very low frequency, or for a very short amount of time;
  • lack of support for cross-architectural analysis - if you run perf record on ARM then you also need to run perf report either on ARM or under QEMU, and running the analysis under QEMU (depending on how you've compiled your binaries and with what flags you've launched perf) can take hours;
  • and poor support for profiling binaries which have limited or no debug info, which is often the case in big, embedded-lite projects where the debug info can't even fit on the target machine, or is not readily available.

Building

  1. Install Rust 1.31 or newer

  2. Build it:

     $ cd cli
     $ cargo build --release
    
  3. Grab the binary from target/release/.

Cross-compiling

  1. Configure the linker for your target architecture in your ~/.cargo/config, e.g.:

     [target.mips64-unknown-linux-gnuabi64]
     linker = "/path/to/your/sdk/mips64-octeon2-linux-gnu-gcc"
     rustflags = [
       "-C", "link-arg=--sysroot=/path/to/your/sdk/sys-root/mips64-octeon2-linux-gnu"
     ]

     [target.armv7-unknown-linux-gnueabihf]
     linker = "/path/to/your/sdk/arm-cortexa15-linux-gnueabihf-gcc"
     rustflags = [
       "-C", "link-arg=--sysroot=/path/to/your/sdk/sys-root/arm-cortexa15-linux-gnueabihf"
     ]

  2. Compile, either for MIPS64 or for ARM:

     $ cargo build --release --target=mips64-unknown-linux-gnuabi64
     $ cargo build --release --target=armv7-unknown-linux-gnueabihf
    
  3. Grab the binary from target/mips64-unknown-linux-gnuabi64/ or target/armv7-unknown-linux-gnueabihf/.
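Note that, depending on how your Rust toolchain was installed, you may also need the standard library for each cross target before step 2. With rustup (an assumption — the README does not say how Rust was installed) that would be:

```shell
# Install the precompiled standard library for the cross targets (rustup only).
rustup target add mips64-unknown-linux-gnuabi64
rustup target add armv7-unknown-linux-gnueabihf
```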

Basic usage

Profiling an already running process by its PID:

$ cargo run record -p $PID_OF_YOUR_PROCESS -o datafile

Profiling a process by its name and waiting if it isn't running yet:

$ cargo run record -P cpu-hungry-program -w -o datafile

Generating a CPU flame graph from the gathered data:

$ cargo run flamegraph datafile > flame.svg

Replace cargo run with the path to the executable if you're running the profiler outside of its build directory.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Contributors

kenta7777, koute, philipc, tumdum


not-perf's Issues

aarch64 crash - asking for help

Hi

During memory profiling with bytehound I encountered a crash with a backtrace containing 2 entries:

#0 0x7faf1993fc in gsignal+0xcc from /usr/lib64/libc.so.6+0x323fc
#1 0x7fb33ada8c from /opt/memprof/aarch64/libbytehound.so+0xf5a8c

This leads to:

addr2line -e libbytehound.so 0xf5a8c
/root/.cargo/git/checkouts/not-perf-af1a46759dd83df9/51003a4/nwind/src/arch/aarch64_trampoline.s:22

aarch64_trampoline.s is from this revision.
Revision 51003a4 is quite old IMO, but maybe you're aware of some bugs related to that part of the code, or could give a hint on how I should debug such issues?

Chromium trace output does not work

When parsing something into the Chromium trace format, the JSON that is output is broken: it has a missing field, and each element in the array starts with a ",", which is illegal JSON (and the Chrome tracing loader complains):

[,{"name": "0x0000000000431FD7 [main]","ph": "B","ts": 3627138130963.042,"pid": 10357,"tid": 10357}

This was after running: # ./nperf trace-events -o main_prof.trace --granularity line main_prof
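As a stopgap, the stray comma can be stripped before loading the trace. This is only a sketch, and it assumes the comma right after the opening bracket is the sole syntax error (the missing field the report mentions may still need fixing separately):

```shell
# Drop the illegal comma that immediately follows the opening bracket.
sed 's/^\[,/[/' main_prof.trace > main_prof.fixed.trace
```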

PowerPC support

Currently nwind does not support PowerPC. From what I could see, to support a new architecture, src/arch/${arch}.rs and src/arch/${arch}_get_regs.s are the main components.

Here is the ABI specification.

Support compressed debug sections

I'm trying to profile an application that has compressed debug sections, but the resulting flamegraph does not show the function names.
Decompressing the symbol file prior to analysis works, but it would be nice if compressed debug sections were supported. Here are the reproduction steps:

mkdir /demo
mkdir /usr/lib/debug/demo/
cd /demo
wget -O demo.c https://gist.githubusercontent.com/thiagovice/d0f65a8b5e8cc254e840e70477cff77e/raw/1eb029244b60b6d52b373a87eb8cb8dd7984539d/demo.c

gcc -ggdb -o demo -Wl,--compress-debug-sections=zlib demo.c
objcopy --only-keep-debug demo demo.sym
strip -s demo
mv demo.sym /usr/lib/debug/demo/demo
./demo &

timeout 10s ./nperf record -F 997 -p `pgrep -f 'demo'` -o datafileDemo
./nperf flamegraph datafileDemo -d /usr/lib/debug/ > flame.svg

The resulting flame.svg won't show the functions properly, only addresses.
Removing the linker option "-Wl,--compress-debug-sections=zlib" and repeating the process produces a correct flamegraph.
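The decompression workaround mentioned above can be done with objcopy. A minimal sketch, assuming binutils is installed and using the paths from the reproduction steps:

```shell
# Rewrite the split debug file with its debug sections decompressed,
# so the symbols can be read without zlib support in the analyzer.
objcopy --decompress-debug-sections /usr/lib/debug/demo/demo demo.sym.decompressed
mv demo.sym.decompressed /usr/lib/debug/demo/demo
```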

Provide some versioning (and maybe upload to crates.io)

I think at least we should add git tags and GitHub releases, to make version alignment more straightforward for users. Each release could hold a short description of the changes in the exposed modules and what adjustments are required.

I think it could also be uploaded to crates.io, but I don't know if there are some cons (maybe nwind should be separated into a different repository, or maybe it can be embedded in the nperf crate?).

abort() called from nwind_on_exception_through_trampoline on AArch64

We have an occasional issue where using not-perf crashes the profiled process on AArch64 hardware.
We don't have much data yet, but the call stack is as follows:

  - tid: 23863 # --------------------------------------------------
    proc_dump: ~
    user_time: 4.170000
    system_time: 2.450000
    registers: [
      0x0000000000000000, 0x0000007ff4cdfdf0, 0x0000000000000000, 0x0000000000000008,
      0x0000000000000000, 0x0000007ff4cdfdf0, 0xffffffffffffffff, 0xffffffffffffffff,
      0x0000000000000087, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff,
      0xffffffffffffffff, 0xffffffffffffffff, 0x0000000000000000, 0x0000000000000035,
      0x0000007f7d2b2a60, 0x0000007f78d08ec8, 0x0000007f78e3a7f4, 0x0000000000000006,
      0x0000007f783b2020, 0x0000007f783b2720, 0x0000000000000100, 0x0000007f77f66028,
      0x0000007ff4ce0618, 0x0000007ff4ce06b0, 0x0000007ff4ce0638, 0x0000000000000000,
      0x0000000000000000, 0x0000007ff4cdfdd0, 0x0000007f78d07f0c, 0x0000007ff4cdfdd0,
      0x0000007f78d07f0c, 0x0000000000000000 ]
    backtrace: [
      { a: 0000007f78d07f0c, s: gsignal,              o:  0x9c, l:  0xcc, e: 0, S: 0, f: "/usr/lib64/libc-2.28.so" },
      { a: 0000007f78d09000, s: abort,                o: 0x138, l: 0x22c, e: 0, S: 0, f: "/usr/lib64/libc-2.28.so" },
      { a: 0000007f7d1b5e1c, s: _ZN5nwind15local_unwinding5abort17h255a5769eb294e0dE,                        o:  0xcc, l:  0xd0, e: 0, S: 0, f: "/opt/memprof/aarch64/libmemory_profiler.so" },
      { a: 0000007f7d1b6a4c, s: nwind_on_exception_through_trampoline,                        o: 0x444, l: 0x448, e: 0, S: 0, f: "/opt/memprof/aarch64/libmemory_profiler.so" },
      { a: 0000007f7d27c1f8, s: nwind_ret_trampoline, o:  0x44, l:  0x50, e: 1, S: 0, f: "/opt/memprof/aarch64/libmemory_profiler.so" } ]

Unfortunately I don't know yet if it's the first or the second abort() in said function :(

Did this happen before? Could it be an issue in the application (e.g. memory corruption), or some corner-case bug in not-perf, as this seems to happen during exception handling (?).

Missing Cargo.lock

Hi,

Thanks for the useful tool! I had trouble building this project though, because the Cargo.lock file is missing, so the build defaulted to using gimli 0.16.1 for some of the dependencies. This caused the replace to not work correctly.

Thanks!

Crashes due to lru on 1.48.0-nightly

We have an app which requires the nightly channel and uses this crate as a dependency. lru version "0.2.0" is probably outdated on 1.48.0-nightly, so we encounter many panics.

running 47 tests
test cmd_trace_events::test_emit_events_4 ... ok
test cmd_trace_events::test_emit_events_5 ... ok
test cmd_trace_events::test_emit_events_1 ... ok
test cmd_trace_events::test_emit_events_2 ... ok
test cmd_trace_events::test_emit_events_3 ... ok
test cmd_trace_events::test_emit_events_8 ... ok
test cmd_trace_events::test_emit_events_6 ... ok
test cmd_trace_events::test_emit_events_7 ... ok
test data_reader::test::collate_aarch64_hot_spot_usleep_in_a_loop_fp ... FAILED
test data_reader::test::collate_aarch64_perfect_unwinding_usleep_in_a_loop_no_fp ... FAILED
test data_reader::test::collate_aarch64_perfect_unwinding_usleep_in_a_loop_fp ... FAILED
test data_reader::test::collate_aarch64_hot_spot_usleep_in_a_loop_no_fp ... FAILED
test data_reader::test::collate_aarch64_noreturn ... FAILED
test data_reader::test::collate_aarch64_perfect_unwinding_floating_point ... FAILED
test data_reader::test::collate_amd64_hot_spot_usleep_in_a_loop_fp ... FAILED
test data_reader::test::collate_amd64_hot_spot_usleep_in_a_loop_no_fp ... FAILED
test data_reader::test::collate_amd64_hot_spot_usleep_in_a_loop_no_fp_online ... FAILED
test data_reader::test::collate_amd64_inline_functions ... FAILED
test data_reader::test::collate_amd64_perfect_unwinding_floating_point ... FAILED
test data_reader::test::collate_amd64_noreturn ... FAILED
test data_reader::test::collate_amd64_perfect_unwinding_pthread_cond_wait ... FAILED
test data_reader::test::collate_amd64_perfect_unwinding_usleep_in_a_loop_fp ... FAILED
test data_reader::test::collate_amd64_perfect_unwinding_usleep_in_a_loop_fp_only_eh_frame_hdr ... FAILED
test data_reader::test::collate_amd64_perfect_unwinding_usleep_in_a_loop_fp_only_loaded_eh_frame ... FAILED
test data_reader::test::collate_arm_hot_spot_usleep_in_a_loop_fp ... FAILED
test data_reader::test::collate_arm_hot_spot_usleep_in_a_loop_no_fp ... FAILED
test data_reader::test::collate_arm_inline_functions ... FAILED
test data_reader::test::collate_arm_perfect_unwinding_floating_point ... FAILED
test data_reader::test::collate_mips64_inline_functions ... ignored
test data_reader::test::collate_arm_noreturn ... FAILED
test data_reader::test::collate_amd64_pthread_cond_wait ... FAILED
test data_reader::test::collate_amd64_perfect_unwinding_usleep_in_a_loop_no_fp ... FAILED
test data_reader::test::collate_arm_perfect_unwinding_usleep_in_a_loop_fp ... FAILED
test data_reader::test::collate_arm_perfect_unwinding_usleep_in_a_loop_no_fp ... FAILED
test mount_info::test_parse_mountinfo ... ok
test mount_info::test_path_resolver ... ok
test profiler::tests::reload_which_clears_base_address_does_not_panic ... ok
test data_reader::test::collate_mips64_hot_spot_usleep_in_a_loop_fp ... FAILED
test profiler::tests::spurious_reload_with_no_base_address_does_not_panic ... ok
test profiler::tests::test_update_maps_basic ... ok
test data_reader::test::collate_mips64_perfect_unwinding_usleep_in_a_loop_fp ... FAILED
test data_reader::test::collate_mips64_perfect_unwinding_floating_point ... FAILED
test data_reader::test::collate_mips64_noreturn ... FAILED
test data_reader::test::collate_mips64_hot_spot_usleep_in_a_loop_no_fp ... FAILED
test data_reader::test::collate_mips64_perfect_unwinding_usleep_in_a_loop_no_fp ... FAILED
test data_reader::test::collate_mips64_pthread_cond_wait ... FAILED
test profiler::tests::reloading_never_panics ... ok

failures:

---- data_reader::test::collate_aarch64_hot_spot_usleep_in_a_loop_fp stdout ----
thread 'data_reader::test::collate_aarch64_hot_spot_usleep_in_a_loop_fp' panicked at 'attempted to leave type `lru::LruEntry<u64, nwind::frame_descriptions::CachedUnwindInfo>` uninitialized, which is invalid', /home/mzwolins/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/mem/mod.rs:658:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
.
.
.
<more same panics>

It looks like no change other than bumping the version is needed. The minimal working lru version is 0.4. Would you mind updating it?

Needs examples

There need to be many more examples of how to use this tool besides the single example in the README.md, which just shows how to profile into a flamegraph (which is not very useful).

What are all the other options about?

Is the tool output compatible with any existing other parsing tools made for "perf"?

How do you get a source-code annotated sampled histogram?

Even with the built-in flamegraph and full debug symbols in the binary I'm profiling, I just get a bunch of 0x123124125 addresses, and the flamegraph SVG is cut off and unreadable at any of the tinier slice sizes.

How to run nperf with "cargo bench"?

I would like to profile running benchmarks instead of a target bin file directly. What would be the correct CLI syntax for that? I've tried the following unsuccessfully (the output datafile is not generated):

% cargo run record -P $(cargo bench) -w -o datafile
   Compiling htsget-benchmarks v0.1.0 (/Users/rvalls/dev/umccr/htsget-rs/htsget-benchmarks)
   Finished bench [optimized + debuginfo] target(s) in 7.37s
   Running benches/refserver_benchmarks.rs (/Users/rvalls/dev/umccr/htsget-rs/target/release/deps/refserver_benchmarks-fecd9ace2aeca2e9)
     Running benches/request_benchmarks.rs (/Users/rvalls/dev/umccr/htsget-rs/target/release/deps/request_benchmarks-f78a77d95f70b5d1)
     Running benches/search_benchmarks.rs (/Users/rvalls/dev/umccr/htsget-rs/target/release/deps/search_benchmarks-407b85d5c1d86b5d)
Benchmarking Queries/[LIGHT] Bam query all
Benchmarking Queries/[LIGHT] Bam query all: Warming up for 3.0000 s
Benchmarking Queries/[LIGHT] Bam query all: Collecting 50 samples in estimated 30.048 s (487k iterations)
Benchmarking Queries/[LIGHT] Bam query all: Analyzing
Benchmarking Queries/[LIGHT] Bam query specific
Benchmarking Queries/[LIGHT] Bam query specific: Warming up for 3.0000 s
Benchmarking Queries/[LIGHT] Bam query specific: Collecting 50 samples in estimated 30.260 s (66k iterations)
Benchmarking Queries/[LIGHT] Bam query specific: Analyzing
Benchmarking Queries/[LIGHT] Bam query header
Benchmarking Queries/[LIGHT] Bam query header: Warming up for 3.0000 s
Benchmarking Queries/[LIGHT] Bam query header: Collecting 50 samples in estimated 30.096 s (282k iterations)
Benchmarking Queries/[LIGHT] Bam query header: Analyzing
error: a bin target must be available for `cargo run`

/cc @mmalenic

profiling process startup / specify command to run and profile

I'm trying to profile the startup of my application, which should be done within the 100 ms that nperf waits between attempts at finding the process. Would there be interest in such a feature? Maybe some guidance on where to start adding it? I.e., I'd be willing to help implement it; I'm just making sure I didn't get the docs wrong and it doesn't already exist, and/or trying to avoid putting in time and effort on a feature that was left out on purpose.

Are there any hosted compiled binaries for linux?

Knowing little to nothing about Rust, I'm unable to compile this.

➜ $?=0 ➀ cargo build --release
   Compiling vec_map v0.8.0
   Compiling lazy_static v1.0.0
   Compiling sc v0.2.2
   Compiling unicode-width v0.1.4
   Compiling regex v0.2.11
   Compiling regex-syntax v0.5.6
   Compiling time v0.1.39
   Compiling rand v0.4.2
   Compiling memmap v0.6.2
   Compiling num_cpus v1.8.0
   Compiling memchr v2.0.1
   Compiling atty v0.2.10
   Compiling num-integer v0.1.36
error[E0554]: #![feature] may not be used on the stable release channel
  --> /home/arastogi/.cargo/registry/src/github.com-1ecc6299db9ec823/sc-0.2.2/src/lib.rs:15:1
   |
15 | #![feature(asm)]
   | ^^^^^^^^^^^^^^^^

error: aborting due to previous error

error: Could not compile `sc`.
warning: build failed, waiting for other jobs to finish...
error: build failed

➜ ➀ cargo --version
cargo 0.26.0


demangling support

Would it be easy to add demangling support based on a filter program, so users can plug in e.g. c++filt?
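In the meantime, a filter program can already be applied externally, since the textual output is just a stream of symbol names. A sketch, assuming a `collate` subcommand that writes folded stacks to stdout (the flamegraph.pl workflow mentioned elsewhere in this document suggests such output exists):

```shell
# Demangle C++ symbols in the textual output by piping it through c++filt.
./nperf collate datafile | c++filt > collated.demangled
```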

Timestamp support?

First of all, amazing work. I just had a chance to test it; it works wonderfully and addresses an important use case for me.

While the collated output works great with flamegraph.pl, I was also hoping to be able to use it with https://github.com/Netflix/flamescope.

Flamescope seems to be unable to parse the timestamps needed to divide the perf output into discrete time intervals. Is it because this information is missing from the output, unlike in normal perf output?

Thanks.

Windows support

Hi, any chance of adding Windows support / removing the Unix dependencies?

RAII API for profiling sections of code

Hi!

I am looking for a profiling tool for rust-analyzer, and I wonder if not-perf could be of help. I have some very specific requirements, but I am not a profiling expert, so I don't know if what I ask is at all possible, hence this feature-request/support issue :) Feel free to just close with "out of scope" if I ask for something silly!

rust-analyzer relies heavily on incremental computation, and I'd love to profile the incremental case. The interesting benchmark looks like this:

load_data_from_disk(); // 200ms of IO
compute_completion(); // triggers initial analysis, 10 seconds
{
    change_single_file();
    compute_completion(); // re computation after a change, 300 ms
}

I am only interested in profiling the block. Although the running-time of the benchmark is dominated by initial analysis, I explicitly don't care much about its performance.

So, what I'd like to do is:

  • sampling profiling (so that I don't have to instrument my code / bias times)
  • of fairly short-lived blocks of CPU-heavy code (hundreds of milliseconds)
  • from within the application itself (so that I can start/stop profiling for specific sections of code)
  • without depending on C code (just because building C is a pain)

Is this possible at least in theory (i.e., are there sampling tools that can give such data)? Could not-perf be of help here? My dream API would look like this:

load_data_from_disk();
compute_completion();
{
    let _p = not_perf::record("~/tmp/profile_data.perf")
        .append(); // append to the data file, so that I can run this in a loop
    change_single_file();
    compute_completion();
    // stops profiling on Drop
}

Feature: Increase concurrency of flamegraph rendering

Background

Typically I use nperf to render a flamegraph like so.

    nperf flamegraph --merge-threads perf.data > perf.svg

htop shows that it uses just a single CPU core (screenshot: 2023-03-22 at 5:12:52 PM).

It takes a long while to render the flamegraph on a machine with 8 CPUs and 16 GB RAM.

Would it be possible to add an option to increase the concurrency of flamegraph generation? I'm not too familiar with the internals of nperf.

I will definitely be more than happy to investigate and try implementing it, if it's a good feature to have.

The perf_event_open syscall failed for PID XXX

Hi, I am trying to run not-perf on an armv7 machine, but when I try to run nperf record -p XXX it says perf_event_open failed. Maybe it's some silly mistake of mine and I have to edit some other file that I have no idea about, but I haven't found more information on the internet.

This is the machine:

# lscpu
Architecture:          armv7l
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
Model name:            ARMv7 Processor rev 10 (v7l)

The perf_event_paranoid file:

# cat /proc/sys/kernel/perf_event_paranoid
0

This is the output from nperf:

[2001-01-08T22:20:43Z INFO  nperf_core::profiler] Opening "20010108_222043_00223_web_server.nperf" for writing...
[2001-01-08T22:20:43Z INFO  nperf_core::cmd_record] Opening perf events for process with PID 223...
[2001-01-08T22:20:43Z ERROR perf_event_open::perf] The perf_event_open syscall failed for PID 223: Operation not permitted (os error 1)
[2001-01-08T22:20:43Z WARN  nwind::frame_descriptions] No .eh_frame section found for '[vdso]'
[2001-01-08T22:20:43Z INFO  nperf_core::cmd_record] Enabling perf events...
[2001-01-08T22:20:43Z INFO  nperf_core::cmd_record] Running...
[2001-01-08T22:20:43Z INFO  nperf_core::profiler] Finished output file initialization
[2001-01-08T22:20:52Z INFO  nperf_core::profiler] Collected 0 samples in total!
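For reference, the usual first things to check when perf_event_open returns EPERM. In this report the paranoid level is already 0, so the cause may lie elsewhere (e.g. in how the kernel was built, or in LSM/capability restrictions); these commands are only a starting point:

```shell
# 2 restricts most profiling; 1 and 0 progressively relax it; -1 allows everything.
cat /proc/sys/kernel/perf_event_paranoid
# If the kernel exposes its config, confirm perf events were compiled in.
zcat /proc/config.gz 2>/dev/null | grep CONFIG_PERF_EVENTS
```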

Add Support for profiling in Tabular form

Hi,

Currently, after profiling a process with this tool we can only get the details as a flame graph in SVG format. A flame graph is a good option, but it is really hard to read and do rapid analysis with when the graph is nearly unreadable.
So if we could provide a good way to view the profile in tabular form after generating the data, it would be really helpful.

Best,
Vibhoothi

Flamegraph backtrace discrepancy

Looking at the following flamegraph:
flaminggraph.zip

follow_finalized_head appears twice as part of tokio-runtime-w [THREAD=96]. The one appearing after std::sys::unix::thread::Thread::new::thread_start [khala-node] also has many more samples. I would have assumed that the one originating from the raw poll would have more samples.

So, my question: is everything working as expected, or is there maybe some bug?

thread 'main' panicked at 'attempt to subtract with overflow',

Hi,

So I was trying to profile rav1e using not-perf, but after a few seconds it stops, saying thread 'main' panicked at 'attempt to subtract with overflow' here:

vibhoothiiaanand@coneBox:~/not-perf$ RUST_BACKTRACE=1 sudo /home/vibhoothiiaanand/not-perf/target/debug/nperf record -P rav1e -w -o datafile
[2019-08-16T05:54:14Z INFO  nperf::ps] Waiting for process named 'rav1e'...
[2019-08-16T05:54:14Z INFO  nperf::ps] Process 'rav1e' found with PID 4032!
[2019-08-16T05:54:14Z INFO  nperf::profiler] Opening "datafile" for writing...
[2019-08-16T05:54:14Z INFO  nperf::cmd_record] Opening perf events for 4032...
[2019-08-16T05:54:14Z INFO  nperf::profiler] Ready to write profiling data!
[2019-08-16T05:54:15Z INFO  nperf::cmd_record] Enabling perf events...
[2019-08-16T05:54:15Z INFO  nperf::cmd_record] Running...
thread 'main' panicked at 'attempt to subtract with overflow', /home/vibhoothiiaanand/not-perf/nwind/src/dwarf.rs:179:56
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
[2019-08-16T05:54:42Z INFO  nperf::profiler] Collected 27445 samples in total!
vibhoothiiaanand@coneBox:~/not-perf$

Device Specs:
Device: Raspberry Pi 3 B+
RAM: 1 GB
Arch: aarch64
Processor: Cortex-A53 (ARMv8) 64-bit SoC @ 1.4GHz
OS: Ubuntu 18.04.2 LTS

Cross-compiling for ARMV7?

Hi! There are some suggestions in the README that this perf version was designed with cross-compilation in mind, and there are some very brief instructions on how to cross-compile it. Could you please expand this into a working example?

For example, on Ubuntu 20.04 you can install gcc-9-arm-linux-gnueabihf, and in it you can find a gcc (the "linker" for Rust's .cargo/config), but where do you find the "sys-root"?

Hard-to-diagnose panic if `/proc/kallsyms` file is absent

Example message from the command nperf record -P <command name> -w:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/execution_queue.rs:28:60
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted

Compare against perf's handling of this case:

Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.

System-wide profiling support

First of all, thank you for this amazing work.
It's a nightmare to cross-compile perf for arm/aarch64 with libdwarf support.
not-perf is really useful for embedded Linux profiling, and the built-in flamegraph generator is also a nice feature. :)

Currently, it seems not-perf only supports profiling a specific process (PID) with the -p/-P argument.
I am wondering whether it would be easy to add support for a system-wide profiling option, just like perf's -a/--all-cpus.
Thanks again for this great work.
