rr-debugger / rr

Record and Replay Framework

Home Page: http://rr-project.org/

License: Other

Shell 0.62% C++ 63.19% Python 5.17% C 27.48% Assembly 1.37% HTML 0.07% CMake 1.56% GDB 0.01% Cap'n Proto 0.28% Julia 0.23% Dockerfile 0.02%
debugger gdb reverse-execution linux

rr's People

Contributors

andreasgal, anoll, bernhardu, bgirard, bob131, brooksmoses, dcci, derdakon, dholbert, dilumaluthge, dreiss, dzaima, emilio, espindola, froydnj, gitmensch, glandium, hotsphink, joneschrisg, keno, khuey, luser, nimrodpar, ojura, rocallahan, sidkshatriya, skitt, theidinside, tomandegg, yuyichao

rr's Issues

Implement handling of blocking syscalls

From @rocallahan

The current implementation has a few problems, but the main one is that it doesn't handle potentially-blocking syscalls at all. Blocking syscalls are tricky to handle. Ideally, when a wrapped syscall suspends its thread, the rr supervisor process would be invoked and handle the wrapped syscall much like a regular syscall: we'd suspend the thread and try to wake up another thread. However, it's not easy to detect when a syscall will block. For example, our read syscalls usually do not block, but might. My current best idea for handling this is to use perf to listen for context-switch events and, when a context-switch event is triggered during a wrapped syscall, treat that as a signal that the syscall has blocked. I did some experiments monitoring context-switch events on test programs, and it looks promising, but I never got around to hacking that into rr.

This is a good idea because it doesn't introduce the concept of a "blocking syscall", which is a bit hard to define. Instead it handles all syscalls the same way. More details forthcoming.
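A rough way to see the idea in action, without touching perf, is to watch the process's voluntary context-switch counter around a call. This is only a sketch in Python rather than rr's C, and it swaps in getrusage() for the perf context-switch events proposed above; the helper names voluntary_switches_during and probably_blocked are made up for illustration.

```python
import resource


def voluntary_switches_during(action):
    """Return how many voluntary context switches this process made while action() ran."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw
    action()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw
    return after - before


def probably_blocked(action):
    # Assumption for this sketch: a voluntary context switch during the call
    # means the thread gave up the CPU, i.e. the underlying syscall blocked.
    # The counter is process-wide, so other threads can add noise.
    return voluntary_switches_during(action) > 0
```

For example, probably_blocked(lambda: time.sleep(0.05)) would be expected to report a switch on a typical Linux system, while a call that never leaves the CPU usually would not. Note that no list of "blocking syscalls" is needed; every call is treated the same way.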

Investigate using faster syscall interfaces

Currently wrap_syscalls.c enters the kernel through an int $0x80 instruction. That's a full synchronous interrupt and can be costly in syscall-heavy applications. Intel and AMD added faster instructions to do this: sysenter (Intel) and syscall (AMD). We'll most likely get a speedup by using those when available.

Stack smashing with syscall wrapper when running xpcshell test (blocks #40)

Turns out that #40 hasn't been running with the wrapper lib! There was a problem with the path passed to rr. But when I fix that, we start crashing:

*** stack smashing detected ***: /mnt/hgfs/rr/workbench/.ff/bin/xpcshell terminated
======= Backtrace: =========
/lib/i386-linux-gnu/libc.so.6(__fortify_fail+0x45)[0x589530e5]
/lib/i386-linux-gnu/libc.so.6(+0x10409a)[0x5895309a]
/tmp/librr_wrap_syscalls.so(+0x241a)[0x5577c41a]
[0x55770032]

This is src/share/wrap_syscalls.c:216, which is just the entry to static void setup_buffer() {. That's odd.

The wrapper script that drives xpcshell tests uses a bazillion env vars, so we may be tickling a pre-existing bug.

Breakpoint on trap instruction isn't hit

The problem is that if the breakpoint target is already a trap instruction, gdb never asks rr to set the breakpoint. rr has logic to handle breakpoint-on-trap-instruction, but gdb doesn't allow that logic to kick in. And for some reason gdb doesn't know how to handle the notifications that rr sends it, which AFAICT follow spec.

I consider this to be something of a problem, but I don't think it's worth the time right now to figure out what dance gdb wants us to do.

Add "magic" mechanism for tests to report failures

Right now some tests (I think only ones I've written) rely on rr not implementing SIGABRT to signal recording failures. When we add abort support, that little trick won't work, since we'll happily record the abort and then play it back (which is what all other consumers want).

So we should have some mechanism to signal an always-fatal, non-recorded error, when some flag is passed to rr. A magic unallocated syscall number is one way. Or we could just have a flag to treat SIGABRT as a failure, but that seems fragile to me.
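As a toy model of the "magic syscall number" option (Python, not rr code): the recorder checks each syscall number on entry and, only when a test flag is set, treats one reserved number as a fatal, non-recorded failure. The constant MAGIC_FAILURE_SYSCALL, the fail_on_magic flag and the function name are all hypothetical, and the number is merely assumed to be unallocated.

```python
MAGIC_FAILURE_SYSCALL = 0x7FFFFFF0  # hypothetical; assumed not to be a real syscall number


class RecordingTestFailure(Exception):
    """Raised instead of recording when a test reports failure."""


def on_syscall_entry(number, *, fail_on_magic):
    """Decide what the recorder does with a syscall (toy model only)."""
    if fail_on_magic and number == MAGIC_FAILURE_SYSCALL:
        # Always fatal and never written to the trace.
        raise RecordingTestFailure(f"test reported failure via syscall {number:#x}")
    # Everything else, including abort-related syscalls, is recorded as usual.
    return "record"
```

The point of the design is that SIGABRT stays an ordinary recorded event, while the failure channel is something no real program would trigger by accident.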

Implement a performance-testing script

Most basically, the script would run a workload, then run the same workload under rr and compare elapsed wall-clock time. Something like

$ ./script/measure.sh firefox --blah
'firefox --blah' took 10.3s
'rr --record firefox --blah' took 13.6s

Later on it might also be interesting to measure context switches, memory usage, and other such things that we find are important.
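A minimal sketch of what such a script could look like, written in Python instead of the measure.sh shown above. The function names are illustrative, and the default wrapper simply reuses the 'rr --record' invocation from the example output; nothing here is an existing rr script.

```python
import shlex
import subprocess
import time


def time_command(argv):
    """Run argv to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def measure(argv, wrapper=("rr", "--record")):
    """Time argv on its own, then prefixed with wrapper; return both timings."""
    baseline = time_command(list(argv))
    wrapped = time_command(list(wrapper) + list(argv))
    return baseline, wrapped


def report(argv, wrapper=("rr", "--record")):
    """Print the two timings in the format sketched in the issue."""
    baseline, wrapped = measure(argv, wrapper)
    print(f"'{shlex.join(argv)}' took {baseline:.1f}s")
    print(f"'{shlex.join(list(wrapper) + list(argv))}' took {wrapped:.1f}s")
```

Calling report(["firefox", "--blah"]) would print the two lines from the example. A single run is noisy, so a real script would probably want several runs per configuration.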

Failing to read stack memory

This is obviously making gdb very unhappy! When I do the following

gcc -g -m32 -o spin ../rr/src/test/interrupt.c
rr --record ./spin 
rr --replay --dbgport=1111 trace_0/

and then connect gdb to :1111, then gdb says

(gdb) c
Continuing.
  C-c C-c
[Thread 32144] #1 stopped.
[Switching to Thread 32144]
0x0804847d in spin () at ../rr/src/test/interrupt.c:13
(gdb) bt
#0  0x0804847d in spin () at ../rr/src/test/interrupt.c:13
Cannot access memory at address 0xffffd6ac

Huh? That's weird. /proc/<pid>/maps says

$ cat /proc/32031/maps 
[snip]
fffdd000-ffffe000 rw-p 00000000 00:00 0                                  [stack]

so the address should be mapped and readable. Really hope this isn't a ptrace glitch ....

Find a way to test breakpoints interrupting async signals

To test interrupting a signal event, we have to construct a trace like

(some trace event)
ensure that a specific function is called, that we break on
signal-delivered event

It's really easy to do this for synchronous signals, since we can arrange for a synchronous interrupt right after calling our function. It's also pretty easy to test interrupting time-slice pseudo-signals, because a program can call the magic function and then just spin the CPU until the hpc fires.

It's much harder to reliably test interruption from an async signal. The signal has to be delivered by a program external to the test, or else the raise() event will come between the breakpoint and the signal delivery. So the test has to spin the CPU while it waits for the signal, but if it does, we can't reliably call the breakpoint function before the signal arrives, or we might see a time-slice pseudo-signal instead.

In the near future, the async signal code will look almost exactly like the time-slice interrupt code, so this won't be a huge problem.

Intermittent failure in test "alarm"

I may only be seeing this when I use the filter lib, but I'm not 100% sure. Here's the output from one failed run

1: Test alarm FAILED: output from recording different than replay
1: Output from recording:
1: --------------------------------------------------
1: .....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
1: Signal caught, Counter is 134851179
1: --------------------------------------------------
1: Output from replay:
1: --------------------------------------------------
1: .....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
1: Signal caught, Counter is 134800000

rr hangs on firefox shutdown when recording (missed waitpid())

$ ps -e | egrep -e '(rr|firefox)'
 3302 pts/2    00:00:20 rr
 3303 pts/2    00:00:10 firefox 

Firefox has shut down and is waiting to be reaped.

(gdb) bt
#0  0x55577430 in __kernel_vsyscall ()
#1  0x557099e3 in __waitpid_nocancel () at ../sysdeps/unix/syscall-template.S:82
#2  0x08050fc2 in rec_sched_deregister_thread (ctx_ptr=0xffffd4e8) at /home/cjones/rr/rr/src/recorder/rec_sched.c:184
#3  0x08052973 in handle_ptrace_event (ctx_ptr=0xffffd4e8) at /home/cjones/rr/rr/src/recorder/recorder.c:560
#4  0x08052f9a in start_recording (rr_flags=...) at /home/cjones/rr/rr/src/recorder/recorder.c:804
#5  0x0804c852 in start (argc=4, argv=0xffffd6b0, envp=0xffffd6c4) at /home/cjones/rr/rr/src/main.c:197
#6  0x0804d083 in main (argc=7, argv=0xffffd6a4, envp=0xffffd6c4) at /home/cjones/rr/rr/src/main.c:372
(gdb) f 2
#2  0x08050fc2 in rec_sched_deregister_thread (ctx_ptr=0xffffd4e8) at /home/cjones/rr/rr/src/recorder/rec_sched.c:184
(gdb) p ctx->child_tid
$1 = 3303

This code is

    do {
=>      ret = waitpid(ctx->child_tid, &ctx->status, __WALL | __WCLONE);
    } while (ret != -1);

so rr is trying to reap ff, it just missed somehow.

rr not showing most FF output

During replay, the first output seen is

nsStringStats
 => mAllocCount:              4
 => mReallocCount:            0
 => mFreeCount:               4
 => mShareCount:              1
 => mAdoptCount:              0
 => mAdoptFreeCount:          0

which is right around shutdown. If NSPR is dup'ing stdout/stderr for logging, then this might make sense. It would also be really annoying.
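To make the dup hypothesis concrete, here is a small Python sketch (a pipe stands in for stdout). Data written through a dup()'d descriptor lands in the same open file but under a different descriptor number, so anything that only looks at writes to fd 1 or 2 would miss it. Whether rr's replay actually filters output by descriptor number is an assumption here, not something this issue confirms.

```python
import os


def write_via_dup(payload):
    """Write payload through a dup()'d descriptor and read it back from the pipe."""
    read_fd, write_fd = os.pipe()
    alias_fd = os.dup(write_fd)  # new descriptor number, same underlying open file
    try:
        os.write(alias_fd, payload)
    finally:
        os.close(alias_fd)
        os.close(write_fd)
    try:
        data = os.read(read_fd, len(payload))
    finally:
        os.close(read_fd)
    # First element: did the write go out under a different fd number?
    return alias_fd != write_fd, data
```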

Debugging firefox fails with "Remote failure reply: E3FEF19E..."

I think what's happening here is that we're sending upper-case hex digits, which is ambiguous with the error-packet syntax.

Reply:

‘XX...’
    Each byte of register data is described by two hex digits ...

                   -> g
                   <- xxxxxxxx00000000xxxxxxxx00000000


‘E NN’
    for an error.

So here the first register happens to start with hex 'E', and gdb thinks we sent an error reply.
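The ambiguity is easy to reproduce with the first bytes from the error message. The sketch below is Python, not rr code, and starts_like_error_reply is a deliberately simplified stand-in for whatever check gdb really performs.

```python
def encode_register_bytes(raw):
    """Encode raw register bytes as lower-case hex, two digits per byte."""
    return raw.hex()


def starts_like_error_reply(payload):
    # Simplified assumption about the client: a reply that begins with an
    # upper-case 'E' is taken for an error packet.
    return payload.startswith("E")


raw = bytes.fromhex("e3fef19e")      # first register bytes from the failure message
upper = raw.hex().upper()            # what we appear to be sending today
lower = encode_register_bytes(raw)   # same data in lower-case hex
```

With upper-case digits the payload begins with 'E' and trips the check; the lower-case encoding of the same bytes does not.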

Should kill be a regular EMU, 0 syscall?

Currently it's manually implemented, but the definition is almost exactly the same as for regular EMU, 0 syscalls, except that arguments aren't checked on syscall entry.

Intermittent failure in test "alarm" (different issue than #31)

I've seen this a few times recently, but was only able to get the error output today. I'd been assuming it was #31, but the symptom is different:

1: esi registers do not match: syscall now: 80482a0 and recorded: 8048274
rr/src/replayer/replayer.c:267: errno: None) [syscall number 197, state 0, trace file line 129]

This is fstat64 being replayed differently. It only seems to happen in the test run that uses the wrapper lib, and only just after I recompile on a VMWare filesystem, so I wonder if this is a non-rr failure.

Set up PPA for rr and build deps

This will greatly simplify the setup process. The only snag is that the libpfm build system is kind of a PITA, so getting cross-compiled libs built may be an issue.

Mozilla bug 845190 isn't reproducing in rr

Bug reproduces seemingly every time outside of rr, haven't hit it in rr after hundreds of runs. I think that rr may be changing execution semantics. There's an error message printed from rr runs that I don't think I see sans rr. But unlucky scheduling is always possible.

Investigate why gdb interface is so slow

It takes seconds to do simple things like setting a single breakpoint. (rr isn't doing anything stupid.) gdb is extremely chatty with rr and makes a ton of redundant requests, so it looks like gdb's main logic is written assuming some primary backend and we're getting a fallback interface. ISTR reading about a "file xfer" extension, so that might be the missing magic sauce.

"unknown syscall 9"

This appears to be link(). It happens when I launch firefox on an ubuntu 12.04 x86-64 machine running on bare metal, but doesn't happen in any of my VMs.

rr not recording getpriority() (Failing to record firefox-23 on x86: "recorder: unknown syscall 96 -- bailing out")

This is with today's nightly i686 build, running in an x86 VM on an x86-64 system. Lots of moving parts here, not sure yet where the problem is. Full output below.

$ rr --record ./firefox/firefox -no-remote -P garbage
[INFO] (/home/cjones/rr/rr/src/main.c:164) Start recording...

.[ERROR] (/home/cjones/rr/rr/src/recorder/rec_process_event.c:2162: errno: None) recorder: unknown syscall 96 -- bailing out
[ERROR] (/home/cjones/rr/rr/src/recorder/rec_process_event.c:2163: errno: None) execuction state: 4 sig 0
Printing register file:
eax: 14
ebx: 0
ecx: 0
edx: b7c6aff4
esi: 1
edi: b781a240
ebp: bfffbd98
esp: bfffbd48
eip: b7fdd424
eflags 200292
orig_eax 60
xcs: 73
xds: 7b
xes: 7b
xfs: 0
xgs: 33
xss: 7b

[ERROR] (/home/cjones/rr/rr/src/share/sys.c:117: errno: None) Exiting
rr: /home/cjones/rr/rr/src/share/sys.c:120: sys_exit: Assertion `0' failed.
Aborted (core dumped)

Get rr working with valgrind (or vice versa), as far as possible

It appears that valgrind identifies itself as Merom, which apparently isn't supported yet. I'm testing with distro-supplied valgrind 3.7.0, so things may be different in later valgrind versions. If not, then we can either add Merom support to rr, or teach valgrind to identify itself as a newer architecture.

The question of how well rr+valgrind could work is also interesting. Needs some thought. But even basic checking allowed me to diagnose #15.

Unify recorder and replayer syscall definitions

I'm restructuring the way syscalls are defined in replay, but I'm not going to touch src/recorder/ this time around. We should keep this information in one place for easier maintenance.

Replay divergence caused by redirecting stdout in bash script

This seems to be the cause of the issue referred to in #89, and most likely the cause of #62 as well. What happens is

$ bash breakpoint1.run.disabled 
[snip]
.FAILED: expecting "calling C"
[snip]
before (last 100 chars): ) stop reason: 857f :133  pending sig: 0
Internal error: syscalls out of sync: rec: 252  now: 1
$ rr --replay --autopilot trace_0/
[FATAL] (/home/cjones/rr/rr/src/replayer/rep_process_event.c:202: errno: None) stop reason: 857f :133  pending sig: 0
Internal error: syscalls out of sync: rec: 252  now: 1

Aborted (core dumped)

So running this test using the harness diverges on replay, and also diverges when the test trace is replayed manually from the command line. But if I manually record and replay, it passes normally!

252 is exit_group and 1 is exit, so something in the test harness causes libc to exit_group instead of exit, but not during debug replay, or replay from the command line.

Failing to read memory after startup

When gdb connects to rr, it fails to read some memory and then apparently gives up. Edited log follows

[INFO] (/home/cjones/rr/rr/src/replayer/dbg_gdb.c:121) rr debug server listening on :1111
[snip]
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:566: gdb requests memory (addr=0x8049F6C, len=4)
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '$72665F65#BB'
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '+'
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:566: gdb requests memory (addr=0x655F6676, len=4)
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '$#00'
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '+'
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:566: gdb requests memory (addr=0x655F6672, len=4)
DEBUG /home/cjones/rr/rr/src/replayer/dbg_gdb.c:198: write_flush: '$#00'
[FATAL] (/home/cjones/rr/rr/src/replayer/dbg_gdb.c:182: errno: Input/output error) Error reading from gdb

Not clear yet what's wrong.
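One thing that can be checked by hand from the log: the packets are framed as '$<payload>#<checksum>', where the checksum is the sum of the payload characters modulo 256, and x86 is little-endian. The Python sketch below decodes the first reply. It suggests, as an inference only, that gdb treated the four bytes returned for 0x8049F6C as a pointer, since they decode to 0x655F6672, one of the two addresses it asks for next (the other is 4 bytes higher).

```python
def checksum(payload):
    """Remote-protocol checksum: sum of the payload bytes modulo 256."""
    return sum(payload.encode("ascii")) % 256


def parse_packet(packet):
    """Return the payload of '$<payload>#<xx>' after verifying the checksum."""
    if not (packet.startswith("$") and len(packet) >= 4 and packet[-3] == "#"):
        raise ValueError("not a framed packet")
    payload, received = packet[1:-3], int(packet[-2:], 16)
    if checksum(payload) != received:
        raise ValueError("checksum mismatch")
    return payload


payload = parse_packet("$72665F65#BB")         # reply to the read at 0x8049F6C
memory = bytes.fromhex(payload)                # the four bytes rr returned
as_pointer = int.from_bytes(memory, "little")  # how a 32-bit x86 client reads them
```

The same four bytes are also printable ASCII (rf_e), which may or may not be a hint that rr returned the wrong memory contents. The two '$#00' replies are empty payloads with a valid checksum.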

Firefox not recording under filter lib

rr: /home/cjones/rr/rr/src/share/util.c:1112: inject_and_execute_syscall: Assertion `((0xFF0000 & ctx->status) >> 16) == 0' failed.

At the callsite,

(gdb) p ((0xFF0000 & ctx->status) >> 16)
$1 = 8

I don't understand this code particularly well yet, but 8 doesn't seem to be a ptrace status code, so looks like something bad is happening.
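For reference, the asserted expression extracts the ptrace event code from the wait status. The Python sketch below decodes it against the event numbers in mainline linux/ptrace.h as I remember them; treat the table as an assumption to verify against the headers of the kernel in use, since distro kernels may carry extra values. The status words used with it are made-up examples, not taken from this failure.

```python
# Event codes as I understand mainline <linux/ptrace.h>; verify before relying on them.
PTRACE_EVENTS = {
    1: "PTRACE_EVENT_FORK",
    2: "PTRACE_EVENT_VFORK",
    3: "PTRACE_EVENT_CLONE",
    4: "PTRACE_EVENT_EXEC",
    5: "PTRACE_EVENT_VFORK_DONE",
    6: "PTRACE_EVENT_EXIT",
    7: "PTRACE_EVENT_SECCOMP",
    128: "PTRACE_EVENT_STOP",
}


def ptrace_event(status):
    """Extract the event code, mirroring ((0xFF0000 & status) >> 16)."""
    return (status & 0xFF0000) >> 16


def describe_event(status):
    """Name the ptrace event carried in a wait status, if any."""
    event = ptrace_event(status)
    if event == 0:
        return "no ptrace event"
    return PTRACE_EVENTS.get(event, f"unknown event {event}")
```

Under that table, a status carrying 8 in the event byte decodes as an unknown event, which matches the observation above.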

Integrate tests with build system

Right now they're built with a script, but as the compilation environment gets more diverse (cross-compiling, various flags, different OSes, etc.) it behooves us to share that configuration with the build system.

Investigate enabling -Wall or -Werror

Generally a good idea, but rr's type of low-level code may result in so many meaningless warnings that the cost exceeds the benefit. But worth looking into eventually.

Use the exec*p() helper

For example,

rr python foo.py

fails, with

[ERROR] (/home/cjones/rr/rr/src/main.c:127: errno: No such file or directory) The specified file 'python' does not exist or is not executable

This isn't what users expect. Making this change though, means that the external input of the value of $PATH is introduced into the execution.
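The lookup that the exec*p() family performs can be sketched in a few lines of Python (this is an illustration, not rr's code; resolve_like_execvp is a made-up name). A name containing a slash is used as given, anything else is searched for along $PATH, which is exactly the external input mentioned above.

```python
import os
import shutil


def resolve_like_execvp(name, path=None):
    """Return the file an exec*p()-style lookup would pick, or None if there is none."""
    if "/" in name:
        # Names containing a slash are used as given; PATH is not consulted.
        usable = os.path.isfile(name) and os.access(name, os.X_OK)
        return name if usable else None
    # Bare names are searched for in each PATH directory, in order.
    # path=None means "use the current environment's PATH".
    return shutil.which(name, path=path)
```

Passing path explicitly makes the dependency visible: the same command name can resolve to different binaries, or to nothing, depending on the PATH it is recorded with.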

Find libc test suite to import and run under rr

(More of a "note to self".) We're kind of playing whack-a-mole with libc/syscall support, which is fine and a good way to bootstrap quickly, but we want to know ahead of time how complete our support is.
