This is the mail archive of the
systemtap@sourceware.org
mailing list for the systemtap project.
Re: Getting user-space stack backtraces in more probe contexts
- From: Roland McGrath <roland at redhat dot com>
- To: William Cohen <wcohen at redhat dot com>
- Cc: SystemTAP <systemtap at sources dot redhat dot com>
- Date: Fri, 14 May 2010 00:36:17 -0700 (PDT)
- Subject: Re: Getting user-space stack backtraces in more probe contexts
- References: <4BEC6954.6090802@redhat.com>
Yes, perf is not even trying to address the same problem. They've just
decided to punt to assuming -fno-omit-frame-pointer everywhere (like Sun).
For relying on CFI as we do, there are three basic approaches possible.
Before getting to those, I'll refine some details from your investigation.
> SIGTRAP signal is generated. This signal is handled by a utrace engine
> executing uprobe_report_signal() in uprobes.c. The utrace engine
> passes a pointer to the pt_regs into the uprobe_report_signal().
Moreover, all utrace callbacks are (by rule of the API) at a place where
it is kosher to use the user_regset interfaces. 'struct pt_regs' is a
convenience if you happen to know what you want in it, but user_regset
is the only thing that gets you all the correct values of all the user
registers on all machines.
> The mechanisms used for the syscalls do not store data in a pt_regs
> structure.
That's not quite true. There is always a 'struct pt_regs' at the
pointer returned by task_pt_regs(), the same one in all utrace callbacks
and such. There is always some correct data there. That's how
asm/syscall.h, instruction_pointer(), user_stack_pointer(), all work.
The "fast" paths store some subset of the 'struct pt_regs'. The details
vary widely by machine. On the x86, the tail end of the struct is
actually stored by the hardware for 'int $0x80' and all non-system-call
kernel entries (this has the PC and SP). In the syscall/sysenter cases
(both 32 and 64, which are different for each flavor with the same
name), it's the kernel's code at those special entry-points that does
this; the x86_64 kernel uses a lot of hairy assembly macros that make
this stuff hard to notice. (The 'struct pt_regs' sits at the base of
the kernel stack and so it's filled by pushing on the stack, later
pushes setting earlier fields, as we read the struct definition in the
source.) Then all kinds of entries push the system call number and
argument registers too. All the non-system-call kinds of entries (page
faults, interrupts, machine traps from user mode code) then push all the
rest of the registers, so they are available for signal handling et al.
(It's consistent AFAIK on other machines that all the non-system-call
entries from user mode make all registers available.)
On the i386, all entries push all the registers in 'struct pt_regs',
because with 6 syscall arguments, the SP, and the syscall number
register, that's all the registers there are. On most other machines
(all that I'm at all familiar with), only about as many as are used for
system call registers are saved in the fast entry paths.
Importantly, in cases like the x86_64 where the kernel stack fills the
'struct pt_regs', the "unavailable" portions of task_pt_regs() are not
merely unrelated garbage that doesn't tell you the right values of user
registers. They are whatever the kernel code in the call chain from the
sys_* function has pushed on the stack. So referring to that data at an
inappropriate time doesn't just give you useless results for userland,
but to consider it as "userland data" might be leaking kernel bits so as
to violate some information security intent.
As you noticed, the clone/fork/vfork syscalls (and also execve) all take
a special path that ensures all the user registers are saved. That's so
of those calls on all machines, and it's no coincidence that these are
the places where there are ptrace/utrace/tracehook report points. (It's
also true of a few other syscalls dealing with signals or arch-specific
weirdness.)
Note that when using syscall tracing (via utrace or the equivalent newer
"syscall tracepoint" path), things are similar but not quite the same.
In those special cases just mentioned, where there are utrace stopping
points inside the call, the full 'struct pt_regs' is on the kernel stack
below the sys_* frame, so you can just access it there (i.e. user_regset
calls will do that safely). But in syscall tracing, the full registers
are only on the stack and visible to user_regset et al during the
tracehook_report_syscall_* call (where you get utrace or tracepoint
callbacks). After the entry tracing functions return, the extra
words are popped back off the stack into the registers, and then the
actual call to the sys_* function is clobbering that with private kernel
stack data. (Then, after the call, all those registers are pushed back
on if there is syscall exit tracing to do.) So while you have full
register access at the tracing callbacks, when actually inside the
particular system call code, you don't have any way to recover it short
of perfect kernel-side CFI unwinding back to the red line (more on that
later).
Now, to those three paths.
1. Work with what you got.
This means, give the user unwinder some arch-specific code to prime
its state from a known-to-be-partial struct pt_regs. The various
registers are marked as in "undefined" CFI state as opposed to the
usual initial state of a known register value. The ones that are
available (roughly the syscall registers, PC and SP) are there. Then
process CFI as usual, and bail with a complaint when you find you
need an undefined register value to figure out the next PC.
This is one of those double-edged things that has the (dis)advantage
that it works 100% perfectly on i386 (what you got is all there is to
have). It may well even work most of the time on other machines, I
have barely a guess off hand about that. (It's sometimes possible to
get a certain someone to write a fancy script that could analyze a
decent corpus of binaries and prestimadigitate about their FDEs'
sensitivity to certain missing registers.) But the bare guess is
that it might well tend to cover just recovering the PC and CFA
(enough to keep doing a basic backtrace) much more often than it
covers all the registers (so that a full debugger or extracting
application $variables from unwound frames, which we don't yet
support anyway, would be happy).
This has the feature that you can just try it in the unwinder code
today without depending on any other moving parts, and see what you
get. It's known complete as is on i386, so you can just declare
victory for that machine. For others, you can see what happens in
practice right away.
It's sure that full-frame unwinding (that is, calculating all the
registers) will hit "undefined"s. On-demand register finding was
previously mentioned as an optimization for the basic backtrace,
which rarely needs to figure many of the registers at each frame just
to find the next PC. Doing that would make the "bail on undefined"
logic hit far less often, one presumes. At that point, perhaps it
becomes good enough, though never fully waterproof.
2. Turtles all the way down!
(The turtles are made of CFI.) That is, unwind in kernel space all
the way back to the red line. (The "red line" is what OS hackers of
my vintage call the kernel-mode/user-mode boundary.) If all the
kernel CFI is correct, then you can unwind from anywhere all the way
back to the frame that's the kernel entry point, and then unwind from
there to full user registers. It should be just like unwinding
through an in-kernel interrupt or trap frame.
In 100% proper CFI these frames are marked as "signal frames" (it's
part of the "augmentation string"), so you can see those and then
check whether the "caller's PC" of that frame is < TASK_SIZE
(i.e. outside kernel text) to tell whether it's the base kernel entry
frame or is an in-kernel trap frame. (The magic registers that
user_mode() checks are not recorded in the CFI, though they could be.
If they were, you could apply the exact check user_mode() does to the
reconstructed registers to decide if you think they're from user
mode.) You can also do something much simpler like see if the
unwound stack pointer matches user_stack_pointer(task_pt_regs(task)).
All this requires is that all kernel code have CFI, that the CFI be
correct, and that you have that CFI. Three small matters. I can
only really speak to these for x86.
For some time now, since sometime after the short-lived in-kernel CFI
unwinder got removed, the linker script used to build vmlinux
discards the .eh_frame section. This is where all the hand-written
CFI in assembly code has been going, since that's where the assembler
puts it for .cfi_* directives. So, in x86 kernels there is believed
to be CFI for all the code and it is imagined to be correct, but it
has not been in any binary you've seen in a long time.
In the interim, the assembler has grown the .cfi_sections directive
that lets us direct whether the .cfi_* directives in assembly code
produce .eh_frame, .debug_frame, or both. I have only just now sent
a fix to the kernel x86 maintainers to use .cfi_sections .debug_frame
in the x86 assembly code, so that CFI is preserved for us to find.
(I've put that patch into the rawhide kernel, so kernel-debuginfo
from rawhide will have full CFI the next time the rest of the rawhide
kernel's patches start building again. We can probably get it into
Fedora 13 and update kernels too.)
So, near-future kernels will have CFI for the kernel entry points
(and other assembly) so we can find out concretely whether there is
any CFI that is missing or wrong. It seems to be maintained fairly
judiciously despite the upstream kernel build not having any way for
anyone ever to see it. In the past I have volunteered to fix it as
needed. Hence, with vigilance, relying on it for "current" kernels
is plausible.
For other machines, I don't really know the details. Unless there is
some magic I don't know about, the powerpc assembly in the kernel has
no CFI, for example. It might be that the base kernel stack layout
is formulaic enough to handle the kernel entry frames generically
with a hard-coded rule on that machine or something like that, but
you'd need an arch expert to tell you for each arch.
Incidentally, ia64 (and arm?) has its own non-DWARF flavor of unwind
info that the assembler generates mostly automagically without the
hand-written directives, and an in-kernel unwinder for it. In fact,
on ia64, that unwinder is the one and only way you ever get the full
user registers. Its unw_unwind_to_user() is used by the ptrace code.
So, while I have no idea about the 'struct pt_regs' story on ia64, I
believe there it's actually safe to use the user_regset calls more or
less anywhere you don't hold spinlocks or whatnot.
This solution can be "smooth round the bend", as they say. There's
no messy "phase change" at all, it's just unwinding all the way
through. There's the minor bump of noticing when you shift from
consulting kernel CFI to user CFI, but perhaps you just think of that
as PC ranges in different modules, as with a kernel module's CFI.
But, my bet is it may prove to be not quite perfect (needing assembly
fixes) on x86, and difficult to get anyone to add hand-assembly CFI
in its entirety to powerpc and other machines where it's absent now.
(The assembler supports the same .cfi_* stuff for powerpc and other
machines just fine, if the kernel assembly code wanted to use it.)
You certainly can't use it on existing kernels, and there is only any
kind of ETA on that as yet for the x86.
3. Two phase with a safe point
This is the notion that Will mentioned, but there is a general and
optimal way to do it. It's a classic "software interrupt" scheme:
at an arbitrary point, put down a marker; when you reach a safe
point (here, just before returning to user mode), pick up the
marker and do the rest of the work.
There are a variety of ways to do this, but there are now (kind of)
some good ones. In recent kernels, the TIF_NOTIFY_RESUME flag
exists just for this sort of thing. In all kernels, TIF_SIGPENDING
does a related thing.
You can safely do set_thread_flag(TIF_NOTIFY_RESUME) from anywhere.
This means tracehook_notify_resume() gets called before returning
to user mode. tracehook_notify_resume() is an arch-independent
inline with one call site on each machine (in arch-specific code).
In any kernel that has it, you could at least use a kprobe on that
inline to get a callback at the safe user-mode boundary (where
user_regset is kosher, if you don't have interrupts disabled or
other locks held or whatnot).
In utrace, this is what passing UTRACE_REPORT to utrace_control()
does. But you can't call utrace_control() from interrupt level or
with locks held or so forth, because of lockdep issues. I'd always
expected to have some manner of utrace call that can be made by the
current task from interrupt level or anywhere, to demand a utrace
report at the next safe point--i.e., what UTRACE_REPORT (or also a
UTRACE_INTERRUPT option) would do if another thread were making a
utrace_control() call. I think we can add a simple thing like that
to utrace easily. But it's not there now.
If you are considering something like a futex syscall, that might
block (or already has), then TIF_NOTIFY_RESUME/UTRACE_REPORT
doesn't do anything until the syscall finishes of its own accord.
(For a problem futex wait you are investigating, that might be
never.) If you know that it's a syscall that restarts properly
(futex does, correctly resuming a timeout if there is one), then
you can use TIF_SIGPENDING (in utrace, UTRACE_INTERRUPT) instead.
That will prevent it from blocking normally, instead going back to
user mode to restart the syscall. On its way back, you can see the
full registers before or when it restarts. Then of course you have
to know not to loop when you hit your probe inside the futex code
the second time. (Note in the case of futex and some others, the
second time around will be NR_restart_syscall rather than the
original syscall, hence that path will be via futex_wait_restart()
rather than sys_futex->do_futex as the first time's path was.) In
the case of a thread already blocked on a futex, you can already
use utrace_control() on it to do UTRACE_INTERRUPT from another
thread. To do it from inside the futex call on the current thread,
e.g. from a timer interrupt in the same thread context or something
like that, you'd need the same new utrace interface as above.
In the absence of utrace or that new feature, you can do a couple
of things with TIF_SIGPENDING. You can actually send a signal from
anywhere (send_sig et al), and then catch that happening by normal
means. That is, you can use an ignored signal and then notice
trace_signal_deliver(); or, with existing utrace you can use any
signal and swallow it in the utrace report_signal handler. Or, you
can do plain set_thread_flag(TIF_SIGPENDING) from anywhere, and
then use a kprobe on get_signal_to_deliver() to catch it before it
checks and sees no signals and does nothing.
Finally, you can just change the user PC to something that leads to
an event you know how to trace. (There's no point in just using an
invalid PC, since that only leads to a signal you might as well
just send.) That could be the vDSO __kernel_rt_sigreturn if you
have a probe in sys_rt_sigreturn or something like that, or could
be some random PC where you previously installed a uprobe. This is
probably a poor choice, because of complications like restoring the
real user PC if another signal comes along first.
With any of those methods, what the low-level implementation provides
is equivalent to two internal probes, hence "two phase". There are
lots of ways to deal with this in the script world. e.g.
a. For backtraces alone, you could have ubacktrace() store a magic
object that is a placeholder for a backtrace to be done at the
safe point. For printing a backtrace from an inside-kernel probe,
some magic nugget would be placed in the output buffer so that the
stapio side would know not to actually deliver this buffer as
printed text until the second phase probe comes along to fill in
that portion of the output out of order.
b. You could force the script/tapset to do it entirely in terms of
two language-level probes:
       probe kernel... { notify_resume() /* could be embedded-c */ }
       probe user.resume { ... }  // could be just = kernel.function("...")
c. You could add two-phase probes as a first-class language feature:
       probe kernel... {
         print_firstpart();
         @resume {
           print_secondpart();
         }
       }
Perhaps with some interrupting variant too, perhaps even one that
rolls in the restart-once logic:
       probe kernel... {
         @restartsys { bt = ubacktrace() }
         printf("blah at %s\n", bt)
       }
Inside the @resume et al clauses, you have no $ context (can use
only kernel globals), but have full user registers. Perhaps if it
appears in a library... probe, then your globals-only $ context is
for the user module named in the probe instead.
This class of approach has the big advantage that it's entirely
compatible with doing user unwinding from the .eh_frame CFI in the
user text rather than relying on prepacking. At these safe points,
it's entirely fine to do full user memory access with uaccess.h or
whatever, block in page faults to bring the necessary text in, etc.
(Back at the dawn of time, I presumed this is how it would always be
done, and hence thought it pretty nutty to be packing up userland CFI
data into stap kernel modules.) At worst, you risk nothing but
wedging that one thread, and it can still be killed cleanly.
Even with the prepacking, this can let the user unwinder code run
preemptible (for voluntary preemption, you can sprinkle it with
cond_resched, and for paranoia, fatal_signal_pending bail-out
checks). This removes the burden of dealing with userland CFI down
in any sensitive places where delaying too long or getting led astray
into an infinite loop is a big problem. Then any unwinding work done
in such places is only for the kernel, where the CFI to contend with
is a finite known set we can have scoured thoroughly for bogons
beforehand. (Until someone compiles another kernel module, of
course, but you get the idea.)
I said three paths, but the careful reader will have noticed when I was
talking about the nuances of entry paths earlier that there is also:
4. Pre-collect via syscall-entry tracing
As I mentioned above, the syscall tracing callbacks have complete
register access and can use user_regset. So, you can enable syscall
tracing via utrace or the "sys_enter" tracepoint. In your callback,
use user_regset (properly) or the 'struct pt_regs' (from argument or
task_pt_regs(), improperly) to save off the complete user register
data somewhere. Then when in a later probe point before the syscall
exit, you can use the saved regset block to prime the user unwinder.
You might optimize out copying the registers based on looking for a
syscall_get_nr() value. Or you might make that syscall-entry probe
the place where you check for and record a futex call, rather than a
separate kernel probe inside the sys_futex call chain.
Conversely, you could have a general mode (maybe even enabled just by
using ubacktrace() in a script!) that just implicitly enables the
syscall-entry tracing everywhere (or in targeted tasks, or whatever)
with a canned probe in the runtime that stores the user registers.
Both the tracing mode and the copying add overhead to every syscall,
which can be measured to think about how desirable this is to do how
automatically. (I happen to know that on x86 there is some work in
the low-level magic we can do to reduce the tracing mode part of the
overhead on the syscall-exit side, which you incur by doing
syscall-entry tracing. So let me know if measurements suggest that
optimizing that part of it could be the tipping point for a decision
that's desirable in other regards such as punting on all the
potential work involved in all the other avenues under discussion.)
Thanks,
Roland