Implement handling of blocking syscalls about rr HOT 26 CLOSED

rr-debugger commented on May 9, 2024

Implement handling of blocking syscalls

from rr.

Comments (26)

joneschrisg commented on May 9, 2024

I put together a PoC that signals the ptrace-er when a ptrace-ee has been context-switched a configurable number of times (we'd want that to be "1" for our purposes here). Initial testing suggests that the signals are being delivered as one would expect. The signaling works basically the same way as for the hpc interrupts, except that in our case here the trace-ee has to program the interrupt, from within the syscall buffer lib. There are some corner cases to figure out but this seems like it might work.

joneschrisg commented on May 9, 2024

After a few more experiments, I'm now pretty confident we can make this work. My little test program is here. To make this "actually" work, we need to process the de-sched interrupts in the parent process and disarm the counter there, which is pretty straightforward but needs a bunch of nasty POSIX gloop to pull off. Will prove that out next, since that part must be bulletproof.

@rocallahan I should have asked this before I got started, but (i) was your approach about the same as this? (ii) if not, do you happen to have your old test program around?

rocallahan commented on May 9, 2024

You've already gone way beyond me --- I just had a little test program and manually ran "perf" to observe the progress of the context-switch counter.

joneschrisg commented on May 9, 2024

The ptrace and UNIXbeard hackery is mostly working here, but I'm getting a seemingly-spurious desched notification that I don't understand yet. Or, there's an actual desched happening and I don't know where it's coming from. I have a hackaround that's not so bad and works, but will see if I can tell what's causing it.

joneschrisg commented on May 9, 2024

The extra SIGIO doesn't seem to be an extra desched, because the switch counter is the same in the first and second. I also tried monitoring desched on syscalls that block for increasingly longer durations, and the extra SIGIO is always sent immediately after desched, not at arbitrary times. So I'm pretty close to ascribing this to magic for the time being and moving on.

joneschrisg commented on May 9, 2024

Here's about the only hypothesis I have:

1. child gets descheduled, bumps counter to i and schedules SIGIO
2. SIGIO notification "schedules" child, but it doesn't run
3. child is being ptraced, so we "deschedule" child to notify parent and bump counter to i+1
4. counter signal generated, but SIGIO is already pending so this one is queued
5. parent is notified and sees counter value i+1
6. parent stops delivery of first signal and shuts off counter
7. second SIGIO dequeued and delivered, notifying parent (counter is off, so no pseudo-desched possible here)
8. parent notified and sees counter value i+1 again
9. parent stops delivery of second SIGIO and we continue on

If this is what's happening, then the current hackaround should be a reliable way of dealing with this.

joneschrisg commented on May 9, 2024

(I also confirmed that the $ip doesn't advance during the pseudo-step, when the program being traced isn't blocked on a syscall.)

joneschrisg commented on May 9, 2024

I have a PoC for "rpc syscalls", where a tracee requests that the tracer make syscall(s) on its behalf, here. I think this is the last piece of technology we need.

rocallahan commented on May 9, 2024

Why is that needed?

joneschrisg commented on May 9, 2024

We have to do a pretty tricky dance to set up the cs counter and syscall buffer region across record/replay. During recording, the counter fd has to be shared with rr, but in replay it should be emulated. This requires letting rr know the fd number that's to be shared. In both record and replay, the buffer region has to be actually opened and mmap'd by both the tracee and rr. rr also has to know it's the syscall buffer initialization code it's talking to. The cleanest way I could think of to implement this is with a fake syscall; use an unallocated syscall number (like -42) to trap into rr, then let rr decide what to do. That's what the PoC does.

I think we could make this work by keying off syscall param pattern matching and so forth, but I'd rather get away from that.

rocallahan commented on May 9, 2024

Sounds reasonable.

joneschrisg commented on May 9, 2024

In other words, instead of assuming various things about the sequence of mmap/socket/recv/etc calls and patterns of parameters thereto, we could instead do something like

int counterfd = perf_event_open(...);
char* shmfile = "/dev/shm/rr-tracee-[tid]";
struct msghdr msg;
// set up msg
syscall(MAGIC_SYSCALL_BUF_INIT, counterfd, shmfile, &msg);

then rr decides what to do with the params based on record vs. replay.
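
On the tracer side, that decision could look roughly like the sketch below. Every name in it is hypothetical (including the -42 value, which is only the example number mentioned earlier), and it encodes just the three facts stated above: the counter fd is shared during recording, emulated during replay, and the buffer is mapped in both modes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: an otherwise-unallocated syscall number used as the trap. */
#define MAGIC_SYSCALL_BUF_INIT (-42L)

enum rr_mode { MODE_RECORD, MODE_REPLAY };

/* What the tracer does with each parameter of the magic syscall. */
struct buf_init_plan {
    bool share_counter_fd;   /* record: obtain the tracee's real counter fd */
    bool emulate_counter_fd; /* replay: no real counter, emulate the fd */
    bool map_shared_buffer;  /* both: tracer opens and mmaps the shm file */
};

/* Returns true and fills *plan if `syscallno` is the magic init call.
 * Returns false for ordinary syscalls, which take the usual path. */
bool plan_buf_init(long syscallno, enum rr_mode mode,
                   struct buf_init_plan* plan) {
    assert(plan != NULL);
    if (syscallno != MAGIC_SYSCALL_BUF_INIT) return false;
    plan->share_counter_fd = (mode == MODE_RECORD);
    plan->emulate_counter_fd = (mode == MODE_REPLAY);
    plan->map_shared_buffer = true;
    return true;
}
```

Keying on the syscall number alone is what removes the need for pattern matching on the parameters of ordinary calls.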

joneschrisg commented on May 9, 2024

(Note to self: need to sigprocmask during this setup or things will get very hard.)

joneschrisg commented on May 9, 2024

(Note to self: investigate what happens when may-block syscalls are interrupted with signals.)

joneschrisg commented on May 9, 2024

All the necessary new technology is in place. What remains is

figure out the right buffer-flush dance to do during recording so that we know to replay the first buffered syscall, instead of blowing up trying to advance to where rr saw the desched event
see how this code works when signals are being delivered, revise as necessary

joneschrisg commented on May 9, 2024

I think I got this licked

1. when rr receives the desched notification for task T at syscall s, record a buffer flush event ... but don't zero the byte counter. So then during replay, we stop at the first buffered syscall, and refill the buffer appropriately to let them finish.
2. record a normal syscall-entry into s. Since we haven't reset the record counter yet, just refilled the buffer, T will still compute the same record pointers during replay of the preamble of s as it did during recording.
3. continue normally, descheduling T if necessary, and advance to the syscall-exit of s. Record a normal syscall-exit event for T at s, including outparam data. This will in effect have T use the syscall buffer as undistinguished scratch space for the syscall s.
4. record a zero-syscallbuf-record-counter event. Now T can proceed using the syscall buffer "normally".
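
The ordering of those recorded events can be sketched as below. The event kinds and the fixed-size `struct trace` are hypothetical stand-ins, not rr's trace format. The point is only the sequence: flush without resetting the counter, syscall entry, syscall exit, then the counter reset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds, one per recording step described above. */
enum trace_event {
    EV_SYSCALLBUF_FLUSH_KEEP_COUNTER, /* flush, byte counter left alone */
    EV_SYSCALL_ENTRY,
    EV_SYSCALL_EXIT,                  /* carries outparam data */
    EV_SYSCALLBUF_RESET_COUNTER       /* zero the syscallbuf record counter */
};

#define TRACE_CAP 16

struct trace {
    enum trace_event events[TRACE_CAP];
    size_t len;
};

/* Append one event. Returns 0 on success, -1 if the trace is full. */
int trace_push(struct trace* t, enum trace_event ev) {
    assert(t != NULL);
    if (t->len >= TRACE_CAP) return -1;
    t->events[t->len++] = ev;
    return 0;
}

/* First two steps: run when the desched notification arrives. */
int record_desched_entry(struct trace* t) {
    if (trace_push(t, EV_SYSCALLBUF_FLUSH_KEEP_COUNTER) != 0) return -1;
    return trace_push(t, EV_SYSCALL_ENTRY);
}

/* Last two steps: run once the tracee reaches the syscall exit. */
int record_desched_exit(struct trace* t) {
    if (trace_push(t, EV_SYSCALL_EXIT) != 0) return -1;
    return trace_push(t, EV_SYSCALLBUF_RESET_COUNTER);
}
```

Splitting the sequence in two reflects that the tracee may stay descheduled for an arbitrary time between the entry and the exit.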

joneschrisg commented on May 9, 2024

Hitting a scary problem now where finishing desched-interrupted syscalls with ptrace(SYSCALL) doesn't seem to be stopping at the syscall exit properly.

joneschrisg commented on May 9, 2024

So the situation is pretty strange, and different for different syscalls. My experiments are here. There's a tracee that essentially looks like (in a loop)

arm_desched_event();  // ioctl(...);
[syscall(...)];
disarm_desched_event();  // ioctl(...);

And the tracer is basically (in a loop)

waitpid(tracee);
// status 1
ptrace(SYSCALL, tracee);
waitpid(tracee);
// status 2
ptrace(SYSCALL, tracee);
waitpid(tracee);
// status 3
ptrace(SYSCALL, tracee);
waitpid(tracee);
// status 4
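
To tell the statuses in that loop apart, a tracer could classify each `waitpid` result roughly as follows. This is a sketch with hypothetical names, not the experiment's code. It assumes the tracer set `PTRACE_O_TRACESYSGOOD`, so that syscall stops report `SIGTRAP | 0x80`, and it treats any stop with the desched signal as a desched notification.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

enum stop_kind { STOP_OTHER, STOP_DESCHED, STOP_SYSCALL_TRAP };

/* Classify a waitpid() status. `desched_sig` is the signal the desched
 * counter was programmed to send (SIGIO in this thread). */
enum stop_kind classify_stop(int status, int desched_sig) {
    assert(desched_sig > 0);
    if (!WIFSTOPPED(status)) return STOP_OTHER; /* exited or was killed */
    int sig = WSTOPSIG(status);
    if (sig == desched_sig) return STOP_DESCHED;
    /* Assumes PTRACE_O_TRACESYSGOOD, which marks syscall stops with bit 7. */
    if (sig == (SIGTRAP | 0x80)) return STOP_SYSCALL_TRAP;
    return STOP_OTHER;
}
```

Without `PTRACE_O_TRACESYSGOOD`, a syscall stop shows up as a plain SIGTRAP and cannot be told apart from other traps by the status alone.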

The behaviors I see for the different syscalls are

sched_yield()
1. SIGIO (orig_eax = SYS_sched_yield; other registers look like syscall is finished)
2. SIGIO (orig_eax = SYS_sched_yield; same regs)
3. syscall trap (orig_eax = SYS_ioctl; regs look like syscall entry)
4. syscall trap (orig_eax = SYS_ioctl; regs look like syscall exit)
write(stdout, '.', 1)
1. SIGIO (orig_eax = SYS_write; other registers look like syscall is finished)
2. SIGIO (orig_eax = SYS_write; same regs)
3. syscall trap (orig_eax = SYS_ioctl; regs look like syscall entry)
4. syscall trap (orig_eax = SYS_ioctl; regs look like syscall exit)
system(sleep 1) (this blocks on waitpid)
1. SIGIO (orig_eax = SYS_waitpid; other registers don't look like syscall entry or exit)
2. SIGIO (orig_eax = SYS_waitpid; other registers don't look like syscall entry or exit)
3. syscall trap (orig_eax = SYS_waitpid; regs look like syscall entry)
4. syscall trap (orig_eax = SYS_waitpid; regs look like syscall exit)
nanosleep(1sec)
1. SIGIO (orig_eax = SYS_nanosleep; other registers don't look like syscall entry or exit)
2. SIGIO (orig_eax = SYS_nanosleep; other registers don't look like syscall entry or exit)
3. syscall trap (orig_eax = 0; regs look like syscall entry)
4. syscall trap (orig_eax = 0; regs look like syscall exit)

In the last, an orig_eax of 0 probably means SYS_restart_syscall.

The first two behaviors look like a syscall that finishes, but the task is descheduled on syscall exit. So the tracer just sees the exit. Then we enter and exit the disarm_desched_event() ioctl.

The second two look like actual blocking; the SIGIO seems to arrive before the syscall "really starts" (no idea what that might mean in the kernel). We "restart" the syscall and enter it, and then we see it finish.

I don't understand the SYS_restart_syscall complication.

The annoying things are

in the finish-and-preempt cases, we can't access the tracee regs on syscall entry. This makes the plan above trickier. But not a major obstacle.
I don't see a way yet to reliably distinguish the finished-but-preempted reg state from the not-yet-entered reg state. We can safely assume that the next syscall following the one seen on SIGIO is always ioctl, which lets us machete-hack our way out of the problem. Ugly though. Will poke more.
the SYS_restart_syscall wrinkle is puzzling, but seems relatively easy to paper over

joneschrisg commented on May 9, 2024

I'm getting headaches thinking through the changes I need to make to the recorder code, so I'm afraid I'm going to need to take a detour for some rewritin' :/

joneschrisg commented on May 9, 2024

> take a detour for some rewritin'

(That proved to be more than I wanted to chew on right now too. But will need to happen soon.)

> The first two behaviors look like a syscall that finishes, but the task is descheduled on syscall exit. So the tracer just sees the exit. Then we enter and exit the disarm_desched_event() ioctl.

This proved to be quite easy to handle: we step over the extraneous SIGIO (status 2 above) and then ptrace(SYSCALL) to see what status 3 we got. If it's for sure the ioctl for disarm_desched_event(), which is fairly simple to check, then we just run the ioctl to completion. (It won't block.)
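
One way that check might look, as a sketch only. It assumes the disarm is `ioctl(fd, PERF_EVENT_IOC_DISABLE, ...)` on the desched counter's fd, which this thread does not state explicitly, and it uses a hypothetical simplified register struct rather than the real ptrace register layout.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Hypothetical, simplified view of tracee registers at a syscall stop. */
struct syscall_regs {
    long syscallno; /* orig_eax on x86 */
    long arg1;      /* ebx: the file descriptor, for ioctl */
    long arg2;      /* ecx: the ioctl request */
};

/* True if the tracee is entering the ioctl that disarms the desched
 * counter open on `desched_fd`. */
bool is_disarm_desched_ioctl(const struct syscall_regs* r, int desched_fd) {
    assert(r != NULL);
    return r->syscallno == SYS_ioctl &&
           r->arg1 == (long)desched_fd &&
           r->arg2 == (long)PERF_EVENT_IOC_DISABLE;
}
```

Matching on the fd as well as the request keeps an unrelated ioctl made by the application from being mistaken for the disarm call.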

The cute part is that we don't have to record any trace data or change any execution state. The syscall buffer must not have been full for us to arm the desched event, therefore it doesn't need flushing. And the replayer doesn't need any extra data to know to step over the desched ioctls during replay, so we can just carry on! The next "normal" buffer flush will record what we want.

The other case won't be so easy though ...

rocallahan commented on May 9, 2024

Hmm, so will we have to do these two extra ioctls around every potentially-blocking syscall? That's very unfortunate.

joneschrisg commented on May 9, 2024

I have an idea for how we can avoid that, but it's going to be a bit until I can prove it out. Note though that (i) rr has to do something similar already with the hpc counters; (ii) ptrace traps are not cheap compared to this ioctl; (iii) rr invokes a bunch of not-so-cheap syscalls (many unnecessarily) to process traps now anyways. My guess is that we'll see considerably better perf by wrapping as many common syscalls as we can non-optimally, compared to now. But we'll soon see :).

rocallahan commented on May 9, 2024

True.

joneschrisg commented on May 9, 2024

I've got things working in this patch, except for a couple niggling details. Not pretty, but oh well.

joneschrisg commented on May 9, 2024

This patch ended up being less straightforward than I was hoping, so I'll try to put up a wiki page with a walkthrough in the near future.

joneschrisg commented on May 9, 2024

> Hmm, so will we have to do these two extra ioctls around every potentially-blocking syscall? That's very unfortunate.

These are taking ~600-800ns each, which, unfortunately, ends up being small potatoes compared to the overhead of a traced syscall. >1us per syscall doesn't make me happy but it's not significantly impacting execution time in the traces I'm looking at.
