rr-debugger / rr Goto Github PK

Record and Replay Framework

License: Other

Shell 0.69% C++ 63.19% Python 5.24% C 27.33% Assembly 1.36% HTML 0.07% CMake 1.58% GDB 0.01% Cap'n Proto 0.28% Julia 0.23% Dockerfile 0.02%

debugger gdb reverse-execution linux

rr's Introduction

Overview

rr is a lightweight tool for recording, replaying and debugging execution of applications (trees of processes and threads). Debugging extends gdb with very efficient reverse-execution, which in combination with standard gdb/x86 features like hardware data watchpoints, makes debugging much more fun. More information about the project, including instructions on how to install, run, and build rr, is at https://rr-project.org. The best technical overview is currently the paper Engineering Record And Replay For Deployability: Extended Technical Report.

Or go directly to the installation and building instructions.

Please contribute! Make sure to review the pull request checklist before submitting a pull request.

If you find rr useful, please add a testimonial.

rr development is sponsored by Pernosco and was originated by Mozilla.

System requirements

Linux kernel >= 4.7 (for support of __WALL in waitid())
- rr 5.6.0 worked with kernel 3.11 (requiring PTRACE_SETSIGMASK)
rr currently requires either:
- An Intel CPU with Nehalem (2010) or later microarchitecture.
- Certain AMD Zen or later processors (see https://github.com/rr-debugger/rr/wiki/Zen)
- Certain AArch64 microarchitectures (e.g. ARM Neoverse N1 or the Apple Silicon M-series)
Running in a VM guest is supported, as long as the VM supports virtualization of hardware performance counters. (VMware and KVM are known to work; Xen does not.)

rr's People

Contributors

Stargazers

Watchers

Forkers

ctalbert sanyaade-mobiledev joneschrisg smillaedler wisicn tromey imclab atuljangra jonathanmarvens iwpiv shanbady kursh saszilla bowei dardevelin skitt jryans fanfannothing lervard u20024804 lcode jiaqiang bitilandu xjzhou cocoding dipanjannag passimm staktrace shpedoikal ehsan danieldc froydnj spacekpe abioy koolhazz khuey luser gowdasunil neuroradiology zjf354847283 eagle518 bgirard fleischkatapult dannas motoma sarvex garrison benjaminschubert pm215 dturner-tw kavindaw-optumsoft tisma tbsaunde zentner-kyle glandium nbp heycam fitzgen kayceesrk drupalhunter padenot hotelzululima pspacek nicholaswalton 8l zhangf911 yamakaky nilsbore meizhoubao janisozaur zmaril 2015xray snf claudiouzelac melbshark jeremyroman dequis mpdn vinodpanicker huayl blaisorblade szborows xs3c mhoye zhuomingliang chubbymaggie curtiszimmerman martindale mgrundy yuyichao jgoldfar dakotasanchez ckousik dreiss zixan keno digideskio changeyourname rahulpathakgit zhmz90

rr's Issues

Replay divergence caused by redirecting stdout in bash script

This seems to be the cause of the issue referred to in #89, and most likely the cause of #62 as well. What happens is

$ bash breakpoint1.run.disabled 
[snip]
.FAILED: expecting "calling C"
[snip]
before (last 100 chars): ) stop reason: 857f :133  pending sig: 0
Internal error: syscalls out of sync: rec: 252  now: 1
$ rr --replay --autopilot trace_0/
[FATAL] (/home/cjones/rr/rr/src/replayer/rep_process_event.c:202: errno: None) stop reason: 857f :133  pending sig: 0
Internal error: syscalls out of sync: rec: 252  now: 1

Aborted (core dumped)

So running this test using the harness diverges on replay, and also diverges when the test trace is replayed manually from the command line. But if I manually record and replay, it passes normally!

252 is exit_group and 1 is exit, so something in the test harness causes libc to exit_group instead of exit, but not during debug replay, or replay from the command line.

Stack smashing with syscall wrapper when running xpcshell test (blocks #40)

Turns out that #40 hasn't been running with the wrapper lib! There's was a problem with the path passed to rr. But when I fix that, we start crashing

*** stack smashing detected ***: /mnt/hgfs/rr/workbench/.ff/bin/xpcshell terminated
======= Backtrace: =========
/lib/i386-linux-gnu/libc.so.6(__fortify_fail+0x45)[0x589530e5]
/lib/i386-linux-gnu/libc.so.6(+0x10409a)[0x5895309a]
/tmp/librr_wrap_syscalls.so(+0x241a)[0x5577c41a]
[0x55770032]

This is src/share/wrap_syscalls.c:216, which just the entry to static void setup_buffer() {. That's odd.

The wrapper script that drives xpcshell tests uses a brazillon env vars, so we may be tickling a pre-existing bug.

Firefox not recording under filter lib

rr: /home/cjones/rr/rr/src/share/util.c:1112: inject_and_execute_syscall: Assertion `((0xFF0000 & ctx->status) >> 16) == 0' failed.

At the callsite,

(gdb) p ((0xFF0000 & ctx->status) >> 16)
$1 = 8

I don't understand this code particularly well yet, but 8 doesn't seem to be a ptrace status code, so looks like something bad is happening.

Integrate tests with build system

Right now they're built with a script, but as the compilation environment gets more diverse (cross-compiling, various flags, different OSes, etc.) it behooves us to share that configuration with the build system.

Implement wrappers for common syscalls

From @rocallahan

gettimeofday
clock_gettime
read
write
poll
futex
(and maybe a few others)

"ptrace_error: request: 5" when replaying firefox

This happens both with and without the syscall buffer.

rr/src/share/sys.c:184: errno: Input/output error) ptrace_error: request: 5 of tid: 44958: addr 0x557fb728, data 0xaf7a

Investigate why Firefox's layout/generic tests are fast to record, but layout/base is slow

From @rocallahan

On my machine the overhead on layout/generic is pretty good and on layout/base it's pretty bad. I'm not sure exactly why they're different but to get to a really good state, we have to be much more aggressive about wrapping syscalls.

firefox23 uses ~100% CPU when shutting down

To reproduce, just open firefox23 under recording, and then close it. I'm not using the filter lib.

rr not showing most FF output

During replay, the first output seen is

nsStringStats
 => mAllocCount:              4
 => mReallocCount:            0
 => mFreeCount:               4
 => mShareCount:              1
 => mAdoptCount:              0
 => mAdoptFreeCount:          0

which is right around shutdown. If NSPR is dup'ing stdout/stderr for logging, then this might make sense. It would also be really annoying.

Failing to read stack memory

This is obviously making gdb very unhappy! When I do the following

gcc -g -m32 -o spin ../rr/src/test/interrupt.c
rr --record ./spin 
rr --replay --dbgport=1111 trace_0/

and then connect gdb to :1111, then gdb says

(gdb) c
Continuing.
  C-c C-c
[Thread 32144] #1 stopped.
[Switching to Thread 32144]
0x0804847d in spin () at ../rr/src/test/interrupt.c:13
(gdb) bt
#0  0x0804847d in spin () at ../rr/src/test/interrupt.c:13
Cannot access memory at address 0xffffd6ac

Huh? That's weird. proc/maps says

$ cat /proc/32031/maps 
[snip]
fffdd000-ffffe000 rw-p 00000000 00:00 0                                  [stack]

so the address should be mapped and readable. Really hope this isn't a ptrace glitch ....

Unify recorder and replayer syscall definitions

I'm restructuring the way syscalls are defined in replay, but I'm not going to touch src/recorder/ this time around. We should keep this information in one place for easier maintenance.

Make flushing the syscall buffer interruptible during replay

Needed for debugger commands to work in the middle of buffered-syscall execution.

Set up PPA for rr and build deps

This will greatly simplify the setup process. The only snag is that the libpfm build system is kind of a PITA, so getting cross-compiled libs built may be an issue.

Intermittent failure in test "alarm" (different issue than #31)

I've seen this a few times recently, but only was able to get the error output today. I'd been assuming it was #31 but the symptom is different

1: esi registers do not match: syscall now: 80482a0 and recorded: 8048274
rr/src/replayer/replayer.c:267: errno: None) [syscall number 197, state 0, trace file line 129]

This is fstat64 being replayed differently. It only seems to happen in the test run that uses the wrapper lib, and only just after I recompile on a VMWare filesystem, so I wonder if this is a non-rr failure.

Mozilla bug 845190 isn't reproducing in rr

Bug reproduces seemingly every time outside of rr, haven't hit it in rr after hundreds of runs. I think that rr may be changing execution semantics. There's an error message printed from rr runs that I don't think I see sans rr. But unlucky scheduling is always possible.

Investigate why rr can get different address space layout depending on invocation

This is bad, obviously, because it can cause replay divergence.

See #121: when the rr binary is invoked directly during recording, but is invoked through a symlink during replay, the AS layout is different.

rr hangs on firefox shutdown when recording (missed waitpid())

$ ps -e | egrep -e '(rr|firefox)'
 3302 pts/2    00:00:20 rr
 3303 pts/2    00:00:10 firefox

Firefox has shut down and is waiting to be reaped.

(gdb) bt
#0  0x55577430 in __kernel_vsyscall ()
#1  0x557099e3 in __waitpid_nocancel () at ../sysdeps/unix/syscall-template.S:82
#2  0x08050fc2 in rec_sched_deregister_thread (ctx_ptr=0xffffd4e8) at /home/cjones/rr/rr/src/recorder/rec_sched.c:184
#3  0x08052973 in handle_ptrace_event (ctx_ptr=0xffffd4e8) at /home/cjones/rr/rr/src/recorder/recorder.c:560
#4  0x08052f9a in start_recording (rr_flags=...) at /home/cjones/rr/rr/src/recorder/recorder.c:804
#5  0x0804c852 in start (argc=4, argv=0xffffd6b0, envp=0xffffd6c4) at /home/cjones/rr/rr/src/main.c:197
#6  0x0804d083 in main (argc=7, argv=0xffffd6a4, envp=0xffffd6c4) at /home/cjones/rr/rr/src/main.c:372
(gdb) f 2
#2  0x08050fc2 in rec_sched_deregister_thread (ctx_ptr=0xffffd4e8) at /home/cjones/rr/rr/src/recorder/rec_sched.c:184
(gdb) p ctx->child_tid
$1 = 3303

This code is

    do {
=>      ret = waitpid(ctx->child_tid, &ctx->status, __WALL | __WCLONE);
    } while (ret != -1);

so rr is trying to reap ff, it just missed somehow.

Investigate why gdb interface is so slow

It takes seconds to do simple things like setting a single breakpoint. (rr isn't doing anything stupid.) gdb is extremely chatty with rr and makes a ton of redundant requests, so it looks like gdb's main logic is written assuming some primary backend and we're getting a fallback interface. ISTR reading about a "file xfer" extension, so that might be missing magic sauce.

Implement a performace-testing script

Most basically, the script would run a workload, then run the same workload under rr and compare elapsed wall-clock time. Something like

$ ./script/measure.sh firefox --blah
'firefox --blah' took 10.3s
'rr --record firefox --blah' took 13.6s

Later on it might also be interesting to measure context switches, memory usage, and other such things that we find are important.

Revise syscall wrapper implementation

@rocallahan has a WIP at https://github.com/rocallahan/rr/tree/rework-syscall-wrapping, "but it's not complete yet." More details forthcoming.

Get rr working with valgrind (or vice versa), as far as possible

It appears that valgrind identifies itself as Merom, which apparently isn't supported yet. I'm testing with distro-supplied valgrind 3.7.0, so things may be different in later valgrind. If not, then we can either add Merom support to rr, or add later support for a newer architecture to valgrind.

The question of how well rr+valgrind could work is also interesting. Needs some thought. But even basic checking allowed me to diagnose #15.

Investigate enabling -Wall or -Werror

Generally a good idea, but rr's type of low-level code may result in so many meaningless warnings that the cost exceeds the benefit. But worth looking into eventually.

Give an informative error when trying to gdb the wrong binary in replay

Instead of crapping out. For example,

$ rr --record ./foo
$ rr --replay --dbgport=1111 trace_0
$ gdb bar
(gdb) target remote :1111
# crap out, usually

Signal-stop for debugger stops in handler frame, not at offending instruction

We notify the debugger of the stop after setting up the signal handler frame (if there was one). Non-remote gdb instead stops at the instruction that triggered the signal. I don't think this is a big deal yet but it's worth keeping in mind. It would be somewhat difficult to "fix" this.

"unknown syscall 9"

This appears to be link(). It happens when I launch firefox on an ubuntu 12.04 x86-64 machine running on bare metal, but doesn't happen in any of my VMs.

firefox23 doesn't start up correctly when certain syscalls are wrapped

The window doesn't open and the usual glib/gtk spew isn't seen on stderr.

Investigate using faster syscall interfaces

Currently wrap_syscalls.c enters the kernel through an int $0x80 instruction. That's a full synchronous interrupt and can be hurty in syscall-heavy applications. Intel and AMD added faster instructions to do this, syscall and sysenter (though I forget whose is whose). We'll most likely get a speedup by using those when available.

Investigate why recording many Sunspider and V8 tests results in ~2x slowdown

This data is from rr.pdf.

These tests should be doing almost no I/O and should be relatively straight-line code without too many syscalls. A 2x slowdown is surprising. The data is old and AFAIK predates syscall wrapping, so the first step (obviously) is to reproduce the results on tip.

Investigate not recording until the first recorded program execs on top of rr fork

There's a fair amount of code devoted to "have we exec'd yet? if so, do X", where it would be simpler to say, "do X". I don't know yet why things are currently done the way they are; this may not be possible.

Implement handling of blocking syscalls

From @rocallahan

The current implementation has a few problems but the main one is that it doesn't handle potentially-blocking syscalls at all. Blocking syscalls are tricky to handle; ideally when a wrapped syscall suspends its thread, the rr supervisor process would be invoked and handle the wrapped syscall much like a regular syscall --- we'd suspend the thread and to try to wake up another thread. However it's not easy to detect when a syscall will block. For example our read syscalls usually do not block, but might. My current best idea for handling this is to use perf to listen for context-switch events and when a context-switch event is triggered during a wrapped syscall, treat that as a signal that the syscall has blocked. I did some experiements monitoring context-switch events on test programs, and it looks promising, but I never got around to hacking that into rr.

This is a good idea because it doesn't introduce the concept of a "blocking syscall", which is a bit hard to define. Instead it handles all syscalls the same way. More details forthcoming.

Intermittent failure in test "alarm"

I may only be seeing this when I use the filter lib, but I'm not 100% sure. Here's the output from one failed run

1: Test alarm FAILED: output from recording different than replay
1: Output from recording:
1: --------------------------------------------------
1: .....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
1: Signal caught, Counter is 134851179
1: --------------------------------------------------
1: Output from replay:
1: --------------------------------------------------
1: .....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
1: Signal caught, Counter is 134800000

Should kill be a regular EMU, 0 syscall?

Currently it's manually implemented, but the definition is almost exactly the same as for regular EMU, 0 syscalls, except that arguments aren't check on syscall entry.

Replaying FF fails with: "eip registers do not match: replaying: 55577430 and recorded: 55577431"

This happens after we get the "nsStringStats" spew, so around shutdown. Seems like a bad bug but can defer for a bit if it's "just" a shutdown problem. Needs some diagnosis.

Add "magic" mechanism for tests to report failures

Right now some tests (I think only ones I've written) rely on rr not implementing SIGABRT to signal recording failures. When we add abort, that little trick won't work, since we'll happily record the abrt then play it back (which is what all other consumers want).

So we should have some mechanism to signal an always-fatal, non-recorded error, when some flag is passed to rr. A magic unallocated syscall number is one way. Or we could just have a flag to treat SIGABRT as a failure, but that seems fragile to me.

Detect ELF-64 executables and print an informative error message

Instead of just falling over, which is what happens currently. I hit this about once a day.

(Until x86-64 is supported, of course :).)

recorder: unknown syscall 250

Seen with 6/20 ff build. This is fadvise.

Failing to read memory after startup

When gdb connects to rr, it fails to read some memory and then apparently gives up. Edited log follows

[INFO] (/home/cjones/rr/rr/src/replayer/dbg_gdb.c:121) rr debug server listening on :1111
[snip]
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:566: gdb requests memory (addr=0x8049F6C, len=4)
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '$72665F65#BB'
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '+'
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:566: gdb requests memory (addr=0x655F6676, len=4)
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '$#00'
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '+'
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:566: gdb requests memory (addr=0x655F6672, len=4)
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '$#00'
[FATAL] (/home/cjones/rr/rr/src/replayer/dbg_gdb.c:182: errno: Input/output error) Error reading from gdb

Not clear yet what's wrong.

Figure out why gdb tests aren't failing when run with |make check|

But are failing when run on their own. They should be failing currently because when we connect gdb to the rr "inferior", we tell gdb that the process has the wrong executable image loaded. (rr instead of the traced binary.)

rr not recording getpriority() (Failing to record firefox-23 on x86: "recorder: unknown syscall 96 -- bailing out")

This is with today's nightly i686 build, running in an x86 VM on an x86-64 system. Lots of moving parts here, not sure yet where the problem is. Full output below.

$ rr --record ./firefox/firefox -no-remote -P garbage
[INFO] (/home/cjones/rr/rr/src/main.c:164) Start recording...

.[ERROR] (/home/cjones/rr/rr/src/recorder/rec_process_event.c:2162: errno: None) recorder: unknown syscall 96 -- bailing out
[ERROR] (/home/cjones/rr/rr/src/recorder/rec_process_event.c:2163: errno: None) execuction state: 4 sig 0
Printing register file:
eax: 14
ebx: 0
ecx: 0
edx: b7c6aff4
esi: 1
edi: b781a240
ebp: bfffbd98
esp: bfffbd48
eip: b7fdd424
eflags 200292
orig_eax 60
xcs: 73
xds: 7b
xes: 7b
xfs: 0
xgs: 33
xss: 7b

[ERROR] (/home/cjones/rr/rr/src/share/sys.c:117: errno: None) Exiting
rr: /home/cjones/rr/rr/src/share/sys.c:120: sys_exit: Assertion `0' failed.
Aborted (core dumped)

Find libc test suite to import and run under rr

(More of a "note to self".) We're kind of playing whack-a-mole with libc/syscall support, which is fine and a good way to bootstrap quickly, but we want to know ahead of time how complete our support is.

Debugging firefox fails with "Remote failure reply: E3FEF19E..."

I think what's happening here is that we're sending upper-case hex digits, which is ambiguous with the error-packet syntax

Reply:

‘XX...’
    Each byte of register data is described by two hex digits ...

                   -> g
                   <- xxxxxxxx00000000xxxxxxxx00000000


‘E NN’
    for an error.

So here the first register happens to start with hex 'E', and gdb thinks we sent an error reply.

Replaying firefox recording fails with: "eip registers do not match: syscall now: f7fdd1d0 and recorded: 555561d0"

eip registers do not match: syscall now: f7fdd1d0 and recorded: 555561d0
[FATAL] (/home/cjones/rr/rr/src/replayer/rep_process_event.c:88: errno: None) [syscall number 11, state 1, trace file line 7]

This is execve, so something is going wrong in the rr child before firefox is loaded. Appears to be a regression.

prctl(PR_GET_NAME) returns different value under rr than not under rr

STR

Run xpcshell test under rr. See https://github.com/cgjones/rr-workbench for an example of how to do this.
LD_PRELOAD librrmon.so in the workbench repo.
Note that xpcshell prints its name as RRMON[], meaning prctl() returned the empty string.
Run xpcshell outside of rr, also PRELOADing the mon lib. Note that the output is RRMON[xpcshell], as expected.

rr might not be implementing prctl correctly.

Use the exec*p() helper

For example,

rr python foo.py

fails, with

[ERROR] (/home/cjones/rr/rr/src/main.c:127: errno: No such file or directory) The specified file 'python' does not exist or is not executable

This isn't what users expect. Making this change though, means that the external input of the value of $PATH is introduced into the execution.

Failing to replay firefox ("rr/src/share/sys.c:186: errno: Input/output error) ptrace_error: request: 5 of tid ...")

This reproduces 100% on current master. I know replaying firefox worked a couple of weeks ago, so this is an apparent regression.

Find a way to test breakpoints interrupting async signals

To test interrupting a signal event, we have to construct a trace like

(some trace event)
ensure that a specific function is called, that we break on
signal-delivered event

It's really easy to do this for synchronous signals, since we can arrange for a synchronous interrupt right after calling our function. It's also pretty easy to test interrupting time-slice pseudo-signals, because a program can call the magic function and then just spin the CPU until the hpc fires.

It's much harder to test interruption from an async signal, reliably. The signal has to be delivered by a program external to the test, or else the raise() event will come between the breakpoint and the signal delivery. So we have to spin CPU to receive the signal, but then if we do, then we can't reliably call the breakpoint function before the signal arrives or we might see a time-slice pseudo-signal.

In the near future, the async signal code will look almost exactly like the time-slice interrupt code, so this won't be a huge problem.

Fix gettimeofday/clock_gettime in the wrapper lib

Was hoping this would be quick, but getting more involved. We're definitely resolving the symbols, just something bad is happening when they're invoked.

Breakpoint on trap instruction isn't hit

The problem is that if the breakpoint target is already a trap instruction, gdb never asks rr to set the breakpoint. rr has logic to handle breakpoint-on-trap-instruction, but gdb doesn't allow that logic to kick in. And for some reason gdb doesn't know how to handle the notifications that rr sends it, which AFAICT follow spec.

I consider this to be something of a problem, but I don't think it's worth the time right now to figure out what dance gdb wants us to do.

Make epoll_wait and poll IRREGULAR (checked) syscalls

Currently it's (EMU, 1). The man pages say EFAULT is returned if the events structure isn't writeable, so I guess we should be checking this somehow.

Run unit tests with checksumming enabled

In addition to without. Could help catch some subtle bugs.

rr-debugger / rr Goto Github PK

rr's Introduction

Overview

System requirements

rr's People

Contributors

Stargazers

Watchers

Forkers

rr's Issues

Recommend Projects

Recommend Topics

Recommend Org