rr-debugger / rr Goto Github PK
View Code? Open in Web Editor NEWRecord and Replay Framework
Home Page: http://rr-project.org/
License: Other
Record and Replay Framework
Home Page: http://rr-project.org/
License: Other
Instead of crapping out. For example,
$ rr --record ./foo
$ rr --replay --dbgport=1111 trace_0
$ gdb bar
(gdb) target remote :1111
# crap out, usually
STR
LD_PRELOAD
librrmon.so in the workbench repo.RRMON[]
, meaning prctl() returned the empty string.PRELOAD
ing the mon lib. Note that the output is RRMON[xpcshell]
, as expected.rr might not be implementing prctl
correctly.
From @rocallahan
On my machine the overhead on layout/generic is pretty good and on layout/base it's pretty bad. I'm not sure exactly why they're different but to get to a really good state, we have to be much more aggressive about wrapping syscalls.
@rocallahan has a WIP at https://github.com/rocallahan/rr/tree/rework-syscall-wrapping, "but it's not complete yet." More details forthcoming.
From @rocallahan
The current implementation has a few problems but the main one is that it doesn't handle potentially-blocking syscalls at all. Blocking syscalls are tricky to handle; ideally when a wrapped syscall suspends its thread, the rr supervisor process would be invoked and handle the wrapped syscall much like a regular syscall --- we'd suspend the thread and to try to wake up another thread. However it's not easy to detect when a syscall will block. For example our read syscalls usually do not block, but might. My current best idea for handling this is to use perf to listen for context-switch events and when a context-switch event is triggered during a wrapped syscall, treat that as a signal that the syscall has blocked. I did some experiements monitoring context-switch events on test programs, and it looks promising, but I never got around to hacking that into rr.
This is a good idea because it doesn't introduce the concept of a "blocking syscall", which is a bit hard to define. Instead it handles all syscalls the same way. More details forthcoming.
There's a fair amount of code devoted to "have we exec'd yet? if so, do X", where it would be simpler to say, "do X". I don't know yet why things are currently done the way they are; this may not be possible.
From @rocallahan
Currently wrap_syscalls.c enters the kernel through an int $0x80
instruction. That's a full synchronous interrupt and can be hurty in syscall-heavy applications. Intel and AMD added faster instructions to do this, syscall
and sysenter
(though I forget whose is whose). We'll most likely get a speedup by using those when available.
eip registers do not match: syscall now: f7fdd1d0 and recorded: 555561d0
[FATAL] (/home/cjones/rr/rr/src/replayer/rep_process_event.c:88: errno: None) [syscall number 11, state 1, trace file line 7]
This is execve
, so something is going wrong in the rr child before firefox is loaded. Appears to be a regression.
Turns out that #40 hasn't been running with the wrapper lib! There's was a problem with the path passed to rr. But when I fix that, we start crashing
*** stack smashing detected ***: /mnt/hgfs/rr/workbench/.ff/bin/xpcshell terminated ======= Backtrace: ========= /lib/i386-linux-gnu/libc.so.6(__fortify_fail+0x45)[0x589530e5] /lib/i386-linux-gnu/libc.so.6(+0x10409a)[0x5895309a] /tmp/librr_wrap_syscalls.so(+0x241a)[0x5577c41a] [0x55770032]
This is src/share/wrap_syscalls.c:216
, which just the entry to static void setup_buffer() {
. That's odd.
The wrapper script that drives xpcshell tests uses a brazillon env vars, so we may be tickling a pre-existing bug.
The problem is that if the breakpoint target is already a trap instruction, gdb never asks rr to set the breakpoint. rr has logic to handle breakpoint-on-trap-instruction, but gdb doesn't allow that logic to kick in. And for some reason gdb doesn't know how to handle the notifications that rr sends it, which AFAICT follow spec.
I consider this to be something of a problem, but I don't think it's worth the time right now to figure out what dance gdb wants us to do.
This reproduces 100% on current master. I know replaying firefox worked a couple of weeks ago, so this is an apparent regression.
This is bad, obviously, because it can cause replay divergence.
See #121: when the rr binary is invoked directly during recording, but is invoked through a symlink during replay, the AS layout is different.
Currently it's (EMU, 1). The man pages say EFAULT is returned if the events
structure isn't writeable, so I guess we should be checking this somehow.
Seen with 6/20 ff build. This is fadvise.
Right now some tests (I think only ones I've written) rely on rr not implementing SIGABRT to signal recording failures. When we add abort, that little trick won't work, since we'll happily record the abrt then play it back (which is what all other consumers want).
So we should have some mechanism to signal an always-fatal, non-recorded error, when some flag is passed to rr. A magic unallocated syscall number is one way. Or we could just have a flag to treat SIGABRT as a failure, but that seems fragile to me.
Most basically, the script would run a workload, then run the same workload under rr and compare elapsed wall-clock time. Something like
$ ./script/measure.sh firefox --blah
'firefox --blah' took 10.3s
'rr --record firefox --blah' took 13.6s
Later on it might also be interesting to measure context switches, memory usage, and other such things that we find are important.
To reproduce, just open firefox23 under recording, and then close it. I'm not using the filter lib.
In addition to without. Could help catch some subtle bugs.
This happens both with and without the syscall buffer.
rr/src/share/sys.c:184: errno: Input/output error) ptrace_error: request: 5 of tid: 44958: addr 0x557fb728, data 0xaf7a
This data is from rr.pdf.
These tests should be doing almost no I/O and should be relatively straight-line code without too many syscalls. A 2x slowdown is surprising. The data is old and AFAIK predates syscall wrapping, so the first step (obviously) is to reproduce the results on tip.
We notify the debugger of the stop after setting up the signal handler frame (if there was one). Non-remote gdb instead stops at the instruction that triggered the signal. I don't think this is a big deal yet but it's worth keeping in mind. It would be somewhat difficult to "fix" this.
This is obviously making gdb very unhappy! When I do the following
gcc -g -m32 -o spin ../rr/src/test/interrupt.c rr --record ./spin rr --replay --dbgport=1111 trace_0/
and then connect gdb to :1111
, then gdb says
(gdb) c Continuing. C-c C-c [Thread 32144] #1 stopped. [Switching to Thread 32144] 0x0804847d in spin () at ../rr/src/test/interrupt.c:13 (gdb) bt #0 0x0804847d in spin () at ../rr/src/test/interrupt.c:13 Cannot access memory at address 0xffffd6ac
Huh? That's weird. proc/maps says
$ cat /proc/32031/maps [snip] fffdd000-ffffe000 rw-p 00000000 00:00 0 [stack]
so the address should be mapped and readable. Really hope this isn't a ptrace glitch ....
Needed for debugger commands to work in the middle of buffered-syscall execution.
This happens after we get the "nsStringStats" spew, so around shutdown. Seems like a bad bug but can defer for a bit if it's "just" a shutdown problem. Needs some diagnosis.
To test interrupting a signal event, we have to construct a trace like
(some trace event)
ensure that a specific function is called, that we break on
signal-delivered event
It's really easy to do this for synchronous signals, since we can arrange for a synchronous interrupt right after calling our function. It's also pretty easy to test interrupting time-slice pseudo-signals, because a program can call the magic function and then just spin the CPU until the hpc fires.
It's much harder to test interruption from an async signal, reliably. The signal has to be delivered by a program external to the test, or else the raise()
event will come between the breakpoint and the signal delivery. So we have to spin CPU to receive the signal, but then if we do, then we can't reliably call the breakpoint function before the signal arrives or we might see a time-slice pseudo-signal.
In the near future, the async signal code will look almost exactly like the time-slice interrupt code, so this won't be a huge problem.
I may only be seeing this when I use the filter lib, but I'm not 100% sure. Here's the output from one failed run
1: Test alarm FAILED: output from recording different than replay 1: Output from recording: 1: -------------------------------------------------- 1: ..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... 1: Signal caught, Counter is 134851179 1: -------------------------------------------------- 1: Output from replay: 1: -------------------------------------------------- 1: ..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... 1: Signal caught, Counter is 134800000
$ ps -e | egrep -e '(rr|firefox)' 3302 pts/2 00:00:20 rr 3303 pts/2 00:00:10 firefox
Firefox has shut down and is waiting to be reaped.
(gdb) bt #0 0x55577430 in __kernel_vsyscall () #1 0x557099e3 in __waitpid_nocancel () at ../sysdeps/unix/syscall-template.S:82 #2 0x08050fc2 in rec_sched_deregister_thread (ctx_ptr=0xffffd4e8) at /home/cjones/rr/rr/src/recorder/rec_sched.c:184 #3 0x08052973 in handle_ptrace_event (ctx_ptr=0xffffd4e8) at /home/cjones/rr/rr/src/recorder/recorder.c:560 #4 0x08052f9a in start_recording (rr_flags=...) at /home/cjones/rr/rr/src/recorder/recorder.c:804 #5 0x0804c852 in start (argc=4, argv=0xffffd6b0, envp=0xffffd6c4) at /home/cjones/rr/rr/src/main.c:197 #6 0x0804d083 in main (argc=7, argv=0xffffd6a4, envp=0xffffd6c4) at /home/cjones/rr/rr/src/main.c:372 (gdb) f 2 #2 0x08050fc2 in rec_sched_deregister_thread (ctx_ptr=0xffffd4e8) at /home/cjones/rr/rr/src/recorder/rec_sched.c:184 (gdb) p ctx->child_tid $1 = 3303
This code is
do { => ret = waitpid(ctx->child_tid, &ctx->status, __WALL | __WCLONE); } while (ret != -1);
so rr is trying to reap ff, it just missed somehow.
During replay, the first output seen is
nsStringStats => mAllocCount: 4 => mReallocCount: 0 => mFreeCount: 4 => mShareCount: 1 => mAdoptCount: 0 => mAdoptFreeCount: 0
which is right around shutdown. If NSPR is dup'ing stdout/stderr for logging, then this might make sense. It would also be really annoying.
I think what's happening here is that we're sending upper-case hex digits, which is ambiguous with the error-packet syntax
Reply: ‘XX...’ Each byte of register data is described by two hex digits ... -> g <- xxxxxxxx00000000xxxxxxxx00000000 ‘E NN’ for an error.
So here the first register happens to start with hex 'E', and gdb thinks we sent an error reply.
The window doesn't open and the usual glib/gtk spew isn't seen on stderr.
Currently it's manually implemented, but the definition is almost exactly the same as for regular EMU, 0 syscalls, except that arguments aren't check on syscall entry.
I've seen this a few times recently, but only was able to get the error output today. I'd been assuming it was #31 but the symptom is different
1: esi registers do not match: syscall now: 80482a0 and recorded: 8048274 rr/src/replayer/replayer.c:267: errno: None) [syscall number 197, state 0, trace file line 129]
This is fstat64
being replayed differently. It only seems to happen in the test run that uses the wrapper lib, and only just after I recompile on a VMWare filesystem, so I wonder if this is a non-rr failure.
But are failing when run on their own. They should be failing currently because when we connect gdb to the rr "inferior", we tell gdb that the process has the wrong executable image loaded. (rr
instead of the traced binary.)
This will greatly simplify the setup process. The only snag is that the libpfm build system is kind of a PITA, so getting cross-compiled libs built may be an issue.
Bug reproduces seemingly every time outside of rr, haven't hit it in rr after hundreds of runs. I think that rr may be changing execution semantics. There's an error message printed from rr runs that I don't think I see sans rr. But unlucky scheduling is always possible.
It takes seconds to do simple things like setting a single breakpoint. (rr isn't doing anything stupid.) gdb is extremely chatty with rr and makes a ton of redundant requests, so it looks like gdb's main logic is written assuming some primary backend and we're getting a fallback interface. ISTR reading about a "file xfer" extension, so that might be missing magic sauce.
This appears to be link(). It happens when I launch firefox on an ubuntu 12.04 x86-64 machine running on bare metal, but doesn't happen in any of my VMs.
This is with today's nightly i686 build, running in an x86 VM on an x86-64 system. Lots of moving parts here, not sure yet where the problem is. Full output below.
$ rr --record ./firefox/firefox -no-remote -P garbage [INFO] (/home/cjones/rr/rr/src/main.c:164) Start recording... .[ERROR] (/home/cjones/rr/rr/src/recorder/rec_process_event.c:2162: errno: None) recorder: unknown syscall 96 -- bailing out [ERROR] (/home/cjones/rr/rr/src/recorder/rec_process_event.c:2163: errno: None) execuction state: 4 sig 0 Printing register file: eax: 14 ebx: 0 ecx: 0 edx: b7c6aff4 esi: 1 edi: b781a240 ebp: bfffbd98 esp: bfffbd48 eip: b7fdd424 eflags 200292 orig_eax 60 xcs: 73 xds: 7b xes: 7b xfs: 0 xgs: 33 xss: 7b [ERROR] (/home/cjones/rr/rr/src/share/sys.c:117: errno: None) Exiting rr: /home/cjones/rr/rr/src/share/sys.c:120: sys_exit: Assertion `0' failed. Aborted (core dumped)
It appears that valgrind identifies itself as Merom, which apparently isn't supported yet. I'm testing with distro-supplied valgrind 3.7.0, so things may be different in later valgrind. If not, then we can either add Merom support to rr, or add later support for a newer architecture to valgrind.
The question of how well rr+valgrind could work is also interesting. Needs some thought. But even basic checking allowed me to diagnose #15.
Was hoping this would be quick, but getting more involved. We're definitely resolving the symbols, just something bad is happening when they're invoked.
I'm restructuring the way syscalls are defined in replay, but I'm not going to touch src/recorder/ this time around. We should keep this information in one place for easier maintenance.
This seems to be the cause of the issue referred to in #89, and most likely the cause of #62 as well. What happens is
$ bash breakpoint1.run.disabled [snip] .FAILED: expecting "calling C" [snip] before (last 100 chars): ) stop reason: 857f :133 pending sig: 0 Internal error: syscalls out of sync: rec: 252 now: 1 $ rr --replay --autopilot trace_0/ [FATAL] (/home/cjones/rr/rr/src/replayer/rep_process_event.c:202: errno: None) stop reason: 857f :133 pending sig: 0 Internal error: syscalls out of sync: rec: 252 now: 1 Aborted (core dumped)
So running this test using the harness diverges on replay, and also diverges when the test trace is replayed manually from the command line. But if I manually record and replay, it passes normally!
252 is exit_group
and 1 is exit
, so something in the test harness causes libc to exit_group
instead of exit
, but not during debug replay, or replay from the command line.
When gdb connects to rr, it fails to read some memory and then apparently gives up. Edited log follows
[INFO] (/home/cjones/rr/rr/src/replayer/dbg_gdb.c:121) rr debug server listening on :1111 [snip] DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:566: gdb requests memory (addr=0x8049F6C, len=4) DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '$72665F65#BB' DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '+' DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:566: gdb requests memory (addr=0x655F6676, len=4) DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '$#00' DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '+' DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:566: gdb requests memory (addr=0x655F6672, len=4) DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '$#00' [FATAL] (/home/cjones/rr/rr/src/replayer/dbg_gdb.c:182: errno: Input/output error) Error reading from gdb
Not clear yet what's wrong.
Instead of just falling over, which is what happens currently. I hit this about once a day.
(Until x86-64 is supported, of course :).)
rr: /home/cjones/rr/rr/src/share/util.c:1112: inject_and_execute_syscall: Assertion `((0xFF0000 & ctx->status) >> 16) == 0' failed.
At the callsite,
(gdb) p ((0xFF0000 & ctx->status) >> 16)
$1 = 8
I don't understand this code particularly well yet, but 8 doesn't seem to be a ptrace status code, so looks like something bad is happening.
Right now they're built with a script, but as the compilation environment gets more diverse (cross-compiling, various flags, different OSes, etc.) it behooves us to share that configuration with the build system.
Generally a good idea, but rr's type of low-level code may result in so many meaningless warnings that the cost exceeds the benefit. But worth looking into eventually.
For example,
rr python foo.py
fails, with
[ERROR] (/home/cjones/rr/rr/src/main.c:127: errno: No such file or directory) The specified file 'python' does not exist or is not executable
This isn't what users expect. Making this change though, means that the external input of the value of $PATH
is introduced into the execution.
(More of a "note to self".) We're kind of playing whack-a-mole with libc/syscall support, which is fine and a good way to bootstrap quickly, but we want to know ahead of time how complete our support is.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.