Comments (26)
I put together a PoC that signals the ptrace-er when a ptrace-ee has been context-switched a configurable number of times (we'd want that to be "1" for our purposes here). Initial testing suggests that the signals are being delivered as one would expect. The signaling works basically the same way as for the hpc interrupts, except that in our case here the trace-ee has to program the interrupt, from within the syscall buffer lib. There are some corner cases to figure out but this seems like it might work.
from rr.
After a few more experiments, I'm now pretty confident we can make this work. My little test program is here. To make this "actually" work, we need to process the de-sched interrupts in the parent process and disarm the counter there, which is pretty straightforward but needs a bunch of nasty POSIX gloop to pull off. Will prove that out next, since that part must be bulletproof.
@rocallahan I should have asked this before I got started, but (i) was your approach about the same as this? (ii) if not, do you happen to have your old test program around?
from rr.
You've already gone way beyond me --- I just had a little test program and manually ran "perf" to observe the progress of the context-switch counter.
from rr.
The ptrace and UNIXbeard hackery is mostly working here, but I'm getting a seemingly-spurious desched notification that I don't understand yet. Or, there's an actual desched happening and I don't know where it's coming from. I have a hackaround that's not so bad and works, but will see if I can tell what's causing it.
from rr.
The extra SIGIO doesn't seem to be an extra desched, because the switch counter is the same in the first and second. I also tried monitoring desched on syscalls that block for increasingly longer durations, and the extra SIGIO is always sent immediately after desched, not at arbitrary times. So I'm pretty close to ascribing this to magic for the time being and moving on.
from rr.
Here's about the only hypothesis I have
- child gets descheduled, bumps counter to i and schedules SIGIO
- SIGIO notification "schedules" child, but it doesn't run
- child is being ptraced, so we "deschedule" child to notify parent and bump counter to i+1
- counter signal generated, but SIGIO is already pending so this one is queued
- parent is notified and sees counter value i+1
- parent stops delivery of first signal and shuts off counter
- second SIGIO dequeued and delivered, notifying parent (counter is off, so no pseudo-desched possible here)
- parent notified and sees counter value i+1 again
- parent stops delivery of second SIGIO and we continue on
If this is what's happening, then the current hackaround should be a reliable way of dealing with this.
from rr.
(I also confirmed that the $ip doesn't advance during the pseudo-step, when the program being traced isn't blocked on a syscall.)
from rr.
I have a PoC for "rpc syscalls", where a tracee requests that the tracer make syscall(s) on its behalf, here. I think this is the last piece of technology we need.
from rr.
Why is that needed?
from rr.
We have to do a pretty tricky dance to set up the cs counter and syscall buffer region across record/replay. During recording, the counter fd has to be shared with rr, but in replay it should be emulated. This requires letting rr know the fd number that's to be shared. In both record and replay, the buffer region has to be actually opened and mmap'd by both the tracee and rr. rr also has to know it's the syscall buffer initialization code it's talking to. The cleanest way I could think of to implement this is with a fake syscall; use an unallocated syscall number (like -42) to trap into rr, then let rr decide what to do. That's what the PoC does.
I think we could make this work by keying off syscall param pattern matching and so forth, but I'd rather get away from that.
from rr.
Sounds reasonable.
from rr.
In other words, instead of assuming various things about the sequence of mmap/socket/recv/etc calls and patterns of parameters thereto, we could instead do something like
int counterfd = perf_event_open(...);
char* shmfile = "/dev/shm/rr-tracee-[tid]";
struct msghdr msg;
// set up msg
syscall(MAGIC_SYSCALL_BUF_INIT, counterfd, shmfile, &msg);
then rr decides what to do with the params based on record vs. replay.
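A rough sketch of the tracer-side decision, assuming the magic number is -42 as floated above. The `rr_mode` enum, `init_plan` struct and function names are invented for illustration and are not rr's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value: an unallocated syscall number "like -42". */
#define MAGIC_SYSCALL_BUF_INIT (-42L)

enum rr_mode { RR_RECORD, RR_REPLAY };

/* Hypothetical summary of what the tracer does with the params. */
struct init_plan {
    bool share_counter_fd; /* record: counter fd is shared with rr */
    bool emulate_counter;  /* replay: counter is emulated instead */
    bool map_buffer;       /* both: tracee and rr map the buffer */
};

/* True if the trapped syscall number is the fake init syscall. */
bool is_magic_syscall(long syscallno)
{
    return syscallno == MAGIC_SYSCALL_BUF_INIT;
}

struct init_plan plan_syscallbuf_init(enum rr_mode mode)
{
    struct init_plan plan;

    assert(mode == RR_RECORD || mode == RR_REPLAY);
    plan.share_counter_fd = (mode == RR_RECORD);
    plan.emulate_counter = (mode == RR_REPLAY);
    plan.map_buffer = true;
    return plan;
}
```

The point of the design is that the tracer keys off one unambiguous syscall number rather than guessing from the shape of mmap/socket/recv arguments.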
from rr.
(Note to self: need to sigprocmask during this setup or things will get very hard.)
from rr.
(Note to self: investigate what happens when may-block syscalls are interrupted with signals.)
from rr.
All the necessary new technology is in place. What remains is
- figure out the right buffer-flush dance to do during recording so that we know to replay the first buffered syscall, instead of blowing up trying to advance to where rr saw the desched event
- see how this code works when signals are being delivered, revise as necessary
from rr.
I think I got this licked
- when rr receives the desched notification for task T at syscall s, record a buffer flush event ... but don't zero the byte counter. So then during replay, we stop at the first buffered syscall, and refill the buffer appropriately to let them finish.
- record a normal syscall-entry into s. Since we haven't reset the record counter yet, just refilled the buffer, T will still compute the same record pointers during replay of the preamble of s as it did during recording.
- continue normally, descheduling T if necessary, and advance to the syscall-exit of s. Record a normal syscall-exit event for T at s, including outparam data. This will in effect have T use the syscall buffer as undistinguished scratch space for the syscall s.
- record a zero-syscallbuf-record-counter event. Now T can proceed using the syscall buffer "normally".
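The four steps above amount to a fixed sequence of trace events per desched-interrupted syscall. A small sketch of that ordering follows; the event names and the `record_desched_sequence` helper are hypothetical, not rr's real trace format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds, in the order they would be recorded. */
enum trace_event {
    EV_FLUSH_KEEP_COUNTER, /* flush the buffer, don't zero the counter */
    EV_SYSCALL_ENTRY,      /* normal entry into syscall s */
    EV_SYSCALL_EXIT,       /* normal exit from s, with outparam data */
    EV_RESET_COUNTER       /* zero the syscallbuf record counter */
};

/* Writes the sequence into `out` and returns the number of events,
 * or 0 if `out` has room for fewer than `cap` >= 4 events. */
size_t record_desched_sequence(enum trace_event* out, size_t cap)
{
    static const enum trace_event seq[] = {
        EV_FLUSH_KEEP_COUNTER, EV_SYSCALL_ENTRY,
        EV_SYSCALL_EXIT, EV_RESET_COUNTER
    };
    const size_t n = sizeof(seq) / sizeof(seq[0]);
    size_t i;

    assert(out != NULL);
    if (cap < n) {
        return 0;
    }
    for (i = 0; i < n; ++i) {
        out[i] = seq[i];
    }
    return n;
}
```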
from rr.
Hitting a scary problem now where finishing desched-interrupted syscalls with ptrace(SYSCALL) doesn't seem to be stopping at the syscall exit properly.
from rr.
So the situation is pretty strange, and different for different syscalls. My experiments are here. There's a tracee that essentially looks like (in a loop)
arm_desched_event(); // ioctl(...);
[syscall(...)];
disarm_desched_event(); // ioctl(...);
And the tracer is basically (in a loop)
waitpid(tracee);
// status 1
ptrace(SYSCALL, tracee);
waitpid(tracee);
// status 2
ptrace(SYSCALL, tracee);
waitpid(tracee);
// status 3
ptrace(SYSCALL, tracee);
waitpid(tracee);
// status 4
The behaviors I see for the different syscalls are
sched_yield()
- SIGIO (orig_eax = SYS_sched_yield; other registers look like syscall is finished)
- SIGIO (orig_eax = SYS_sched_yield; same regs)
- syscall trap (orig_eax = SYS_ioctl; regs look like syscall entry)
- syscall trap (orig_eax = SYS_ioctl; regs look like syscall exit)
write(stdout, '.', 1)
- SIGIO (orig_eax = SYS_write; other registers look like syscall is finished)
- SIGIO (orig_eax = SYS_write; same regs)
- syscall trap (orig_eax = SYS_ioctl; regs look like syscall entry)
- syscall trap (orig_eax = SYS_ioctl; regs look like syscall exit)
system("sleep 1") (this blocks on waitpid)
- SIGIO (orig_eax = SYS_waitpid; other registers don't look like syscall entry or exit)
- SIGIO (orig_eax = SYS_waitpid; other registers don't look like syscall entry or exit)
- syscall trap (orig_eax = SYS_waitpid; regs look like syscall entry)
- syscall trap (orig_eax = SYS_waitpid; regs look like syscall exit)
nanosleep(1sec)
- SIGIO (orig_eax = SYS_nanosleep; other registers don't look like syscall entry or exit)
- SIGIO (orig_eax = SYS_nanosleep; other registers don't look like syscall entry or exit)
- syscall trap (orig_eax = 0; regs look like syscall entry)
- syscall trap (orig_eax = 0; regs look like syscall exit)
In the last, an orig_eax of 0 probably means SYS_restart_syscall.
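The stops listed above come in two kinds, SIGIO stops and syscall traps. A sketch of how a tracer could tell them apart from the `waitpid` status alone, assuming the desched signal is SIGIO; the `stop_kind` enum and `classify_stop` are illustrative names.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

enum stop_kind { STOP_DESCHED, STOP_SYSCALL, STOP_OTHER };

/* Classify a status returned by waitpid() for a ptraced task. */
enum stop_kind classify_stop(int status)
{
    int sig;

    if (!WIFSTOPPED(status)) {
        return STOP_OTHER; /* exited or killed, not a ptrace stop */
    }
    sig = WSTOPSIG(status);
    if (sig == SIGIO) {
        return STOP_DESCHED;
    }
    /* After PTRACE_SYSCALL a syscall stop is reported as SIGTRAP, or
     * as SIGTRAP|0x80 when PTRACE_O_TRACESYSGOOD is set. A plain
     * SIGTRAP can also come from other traps, so real code would
     * need more context than this. */
    if (sig == SIGTRAP || sig == (SIGTRAP | 0x80)) {
        return STOP_SYSCALL;
    }
    return STOP_OTHER;
}
```

Telling syscall entry from syscall exit, or finished-but-preempted from not-yet-entered, still requires inspecting the registers; the status alone doesn't carry that.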
The first two behaviors look like a syscall that finishes, but the task is descheduled on syscall exit. So the tracer just sees the exit. Then we enter and exit the disarm_desched_event() ioctl.
The second two look like actual blocking; the SIGIO seems to arrive before the syscall "really starts" (no idea what that might mean in the kernel). We "restart" the syscall and enter it, and then we see it finish.
I don't understand the SYS_restart_syscall complication.
The annoying things are
- in the finish-and-preempt cases, we can't access the tracee regs on syscall entry. This makes the plan above trickier. But not a major obstacle.
- I don't see a way yet to reliably distinguish the finished-but-preempted reg state from the not-yet-entered reg state. We can safely assume that the next syscall following the one seen on SIGIO is always ioctl, which lets us machete-hack our way out of the problem. Ugly though. Will poke more.
- the SYS_restart_syscall wrinkle is puzzling, but seems relatively easy to paper over
from rr.
I'm getting headaches thinking through the changes I need to make to the recorder code, so I'm afraid I'm going to need to take a detour for some rewritin' :/
from rr.
> take a detour for some rewritin'
(That proved to be more than I wanted to chew on right now too. But will need to happen soon.)
> The first two behaviors look like a syscall that finishes, but the task is descheduled on syscall exit. So the tracer just sees the exit. Then we enter and exit the disarm_desched_event() ioctl.
This proved to be quite easy to handle: we step over the extraneous SIGIO (status 2 above) and then ptrace(SYSCALL) to see what status 3 we got. If it's for sure the ioctl for disarm_desched_event(), which is fairly simple to check, then we just run the ioctl to completion. (It won't block.)
The cute part is that we don't have to record any trace data or change any execution state. The syscall buffer must not have been full for us to arm the desched event, therefore it doesn't need flushing. And the replayer doesn't need any extra data to know to step over the desched ioctls during replay, so we can just carry on! The next "normal" buffer flush will record what we want.
The other case won't be so easy though ...
from rr.
Hmm, so will we have to do these two extra ioctls around every potentially-blocking syscall? That's very unfortunate.
from rr.
I have an idea for how we can avoid that, but it's going to be a bit until I can prove it out. Note though that (i) rr has to do something similar already with the hpc counters; (ii) ptrace traps are not cheap compared to this ioctl; (iii) rr invokes a bunch of not-so-cheap syscalls (many unnecessarily) to process traps now anyways. My guess is that we'll see considerably better perf by wrapping as many common syscalls as we can non-optimally, compared to now. But we'll soon see :).
from rr.
True.
from rr.
I've got things working in this patch, except for a couple niggling details. Not pretty, but oh well.
from rr.
This patch ended up being less straightforward than I was hoping, so I'll try to put up a wiki page with a walkthrough in the near future.
from rr.
> Hmm, so will we have to do these two extra ioctls around every potentially-blocking syscall? That's very unfortunate.
These are taking ~600-800ns each, which, unfortunately, ends up being small potatoes compared to the overhead of a traced syscall. >1us per syscall doesn't make me happy but it's not significantly impacting execution time in the traces I'm looking at.
from rr.