This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



Re: Getting user-space stack backtraces in more probe contexts


Yes, perf is not even trying to address the same problem.  They've just
decided to punt to assuming -fno-omit-frame-pointer everywhere (like Sun).

For relying on CFI as we do, there are three basic approaches possible.
Before getting to those, I'll refine some details from your investigation.

> SIGTRAP signal is generated. This signal is handled by a utrace engine
> executing uprobe_report_signal() in uprobes.c.  The utrace engine
> passes a pointer to the pt_regs into the uprobe_report_signal().

Moreover, all utrace callbacks are (by rule of the API) at a place where
it is kosher to use the user_regset interfaces.  'struct pt_regs' is a
convenience if you happen to know what you want in it, but user_regset
is the only thing that gets you all the correct values of all the user
registers on all machines.

> The mechanisms used for the syscalls do not store data in a pt_regs
> structure.

That's not quite true.  There is always a 'struct pt_regs' at the
pointer returned by task_pt_regs(), the same one in all utrace callbacks
and such.  There is always some correct data there.  That's how
asm/syscall.h, instruction_pointer(), user_stack_pointer(), all work.

The "fast" paths store some subset of the 'struct pt_regs'.  The details
vary widely by machine.  On the x86, the tail end of the struct is
actually stored by the hardware for 'int $0x80' and all non-system-call
kernel entries (this has the PC and SP).  In the syscall/sysenter cases
(both 32 and 64, which are different for each flavor with the same
name), it's the kernel's code at those special entry-points that does
this; the x86_64 kernel uses a lot of hairy assembly macros that make
this stuff hard to notice.  (The 'struct pt_regs' sits at the base of
the kernel stack and so it's filled by pushing on the stack, later
pushes setting earlier fields, as we read the struct definition in the
source.)  Then all kinds of entries push the system call number and
argument registers too.  All the non-system-call kinds of entries (page
faults, interrupts, machine traps from user mode code) then push all the
rest of the registers, so they are available for signal handling et al.
(It's consistent AFAIK on other machines that all the non-system-call
entries from user mode make all registers available.)

On the i386, all entries push all the registers in 'struct pt_regs',
because with 6 syscall arguments, the SP, and the syscall number
register, that's all the registers there are.  On most other machines
(all that I'm at all familiar with), only roughly the registers used
for system calls are saved in the fast entry paths.

Importantly, in cases like the x86_64 where the kernel stack fills the
'struct pt_regs', the "unavailable" portions of task_pt_regs() are not
merely unrelated garbage that doesn't tell you the right values of user
registers.  They are whatever the kernel code in the call chain from the
sys_* function has pushed on the stack.  So referring to that data at an
inappropriate time doesn't just give you useless results for userland,
but to consider it as "userland data" might be leaking kernel bits so as
to violate some information security intent.

As you noticed, the clone/fork/vfork syscalls (and also execve) all take
a special path that ensures all the user registers are saved.  That's
true of those calls on all machines, and it's no coincidence that these are
the places where there are ptrace/utrace/tracehook report points.  (It's
also true of a few other syscalls dealing with signals or arch-specific
weirdness.)

Note that when using syscall tracing (via utrace or the equivalent newer
"syscall tracepoint" path), things are similar but not quite the same.
In those special cases just mentioned, where there are utrace stopping
points inside the call, the full 'struct pt_regs' is on the kernel stack
below the sys_* frame, so you can just access it there (i.e. user_regset
calls will do that safely).  But in syscall tracing, the full registers
are only on the stack and visible to user_regset et al during the
tracehook_report_syscall_* call (where you get utrace or tracepoint
callbacks).  After the entry tracing functions return, the extra
words are popped back off the stack into the registers, and then the
actual call to the sys_* function is clobbering that with private kernel
stack data.  (Then, after the call, all those registers are pushed back
on if there is syscall exit tracing to do.)  So while you have full
register access at the tracing callbacks, when actually inside the
particular system call code, you don't have any way to recover it short
of perfect kernel-side CFI unwinding back to the red line (more on that
later).


Now, to those three paths.

1. Work with what you got.

   This means, give the user unwinder some arch-specific code to prime
   its state from a known-to-be-partial struct pt_regs.  The various
   registers are marked as in "undefined" CFI state as opposed to the
   usual initial state of a known register value.  The ones that are
   available (roughly the syscall registers, PC and SP) are there.  Then
   process CFI as usual, and bail with a complaint when you find you
   need an undefined register value to figure out the next PC.

   This is one of those double-edged things that has the (dis)advantage
   that it works 100% perfectly on i386 (what you got is all there is to
   have).  It may well even work most of the time on other machines, I
   have barely a guess off hand about that.  (It's sometimes possible to
   get a certain someone to write a fancy script that could analyze a
   decent corpus of binaries and prestimadigitate about their FDEs'
   sensitivity to certain missing registers.)  But the bare guess is
   that it might well tend to cover just recovering the PC and CFA
   (enough to keep doing a basic backtrace) much more often than it
   covers all the registers (so that a full debugger or extracting
   application $variables from unwound frames, which we don't yet
   support anyway, would be happy).

   This has the feature that you can just try it in the unwinder code
   today without depending on any other moving parts, and see what you
   get.  It's known complete as is on i386, so you can just declare
   victory for that machine.  For others, you can see what happens in
   practice right away.  

   It's sure that full-frame unwinding (that is, calculating all the
   registers) will hit "undefined"s.  On-demand register finding was
   previously mentioned as an optimization for the basic backtrace,
   which rarely needs to figure many of the registers at each frame just
   to find the next PC.  Doing that would make the "bail on undefined"
   logic hit far less often, one presumes.  At that point, perhaps it
   becomes good enough, though never fully waterproof.

2. Turtles all the way down!

   (The turtles are made of CFI.)  That is, unwind in kernel space all
   the way back to the red line.  (The "red line" is what OS hackers of
   my vintage call the kernel-mode/user-mode boundary.)  If all the
   kernel CFI is correct, then you can unwind from anywhere all the way
   back to the frame that's the kernel entry point, and then unwind from
   there to full user registers.  It should be just like unwinding
   through an in-kernel interrupt or trap frame.  

   In 100% proper CFI these frames are marked as "signal frames" (it's
   part of the "augmentation string"), so you can see those and then
   check whether the "caller's PC" of that frame is < TASK_SIZE
   (i.e. outside kernel text) to tell whether it's the base kernel entry
   frame or is an in-kernel trap frame.  (The magic registers that
   user_mode() checks are not recorded in the CFI, though they could be.
   If they were, you could apply the exact check user_mode() does to the
   reconstructed registers to decide if you think they're from user
   mode.)  You can also do something much simpler like see if the
   unwound stack pointer matches user_stack_pointer(task_pt_regs(task)).
   
   All this requires is that all kernel code have CFI, that the CFI be
   correct, and that you have that CFI.  Three small matters.  I can
   only really speak to these for x86.

   For some time now, since sometime after the short-lived in-kernel CFI
   unwinder got removed, the linker script used to build vmlinux
   discards the .eh_frame section.  This is where all the hand-written
   CFI in assembly code has been going, since that's where the assembler
   puts it for .cfi_* directives.  So, in x86 kernels there is believed
   to be CFI for all the code and it is imagined to be correct, but it
   has not been in any binary you've seen in a long time.

   In the interim, the assembler has grown the .cfi_sections directive
   that lets us direct whether the .cfi_* directives in assembly code
   produce .eh_frame, .debug_frame, or both.  I have only just now sent
a fix to the kernel x86 maintainers to use .cfi_sections .debug_frame
   in the x86 assembly code, so that CFI is preserved for us to find.
   (I've put that patch into the rawhide kernel, so kernel-debuginfo
   from rawhide will have full CFI the next time the rest of the rawhide
   kernel's patches start building again.  We can probably get it into
   Fedora 13 and update kernels too.)

   So, near-future kernels will have CFI for the kernel entry points
   (and other assembly) so we can find out concretely whether there is
   any CFI that is missing or wrong.  It seems to be maintained fairly
   judiciously despite the upstream kernel build not having any way for
   anyone ever to see it.  In the past I have volunteered to fix it as
   needed.  Hence, with vigilance, relying on it for "current" kernels
   is plausible.

   For other machines, I don't really know the details.  Unless there is
   some magic I don't know about, the powerpc assembly in the kernel has
   no CFI, for example.  It might be that the base kernel stack layout
   is formulaic enough to handle the kernel entry frames generically
   with a hard-coded rule on that machine or something like that, but
   you'd need an arch expert to tell you for each arch.

   Incidentally, ia64 (and arm?) has its own non-DWARF flavor of unwind
   info that the assembler generates mostly automagically without the
   hand-written directives, and an in-kernel unwinder for it.  In fact,
   on ia64, that unwinder is the one and only way you ever get the full
   user registers.  Its unw_unwind_to_user() is used by the ptrace code.
   So, while I have no idea about the 'struct pt_regs' story on ia64, I
   believe there it's actually safe to use the user_regset calls more or
   less anywhere you don't hold spinlocks or whatnot.

   This solution can be "smooth round the bend", as they say.  There's
   no messy "phase change" at all, it's just unwinding all the way
   through.  There's the minor bump of noticing when you shift from
   consulting kernel CFI to user CFI, but perhaps you just think of that
   as PC ranges in different modules, as with a kernel module's CFI.

   But, my bet is it may prove to be not quite perfect (needing assembly
   fixes) on x86, and difficult to get anyone to add hand-assembly CFI
   in its entirety to powerpc and other machines where it's absent now.
   (The assembler supports the same .cfi_* stuff for powerpc and other
   machines just fine, if the kernel assembly code wanted to use it.)
   You certainly can't use it on existing kernels, and there is only any
   kind of ETA on that as yet for the x86.

3. Two phase with a safe point

   This is the notion that Will mentioned, but there is a general and
   optimal way to do it.  It's a classic "software interrupt" scheme:
   at an arbitrary point, put down a marker; when you reach a safe
   point (here, just before returning to user mode), pick up the
   marker and do the rest of the work.

   There are a variety of ways to do this, but there are now (kind of)
   some good ones.  In recent kernels, the TIF_NOTIFY_RESUME flag
   exists just for this sort of thing.  In all kernels, TIF_SIGPENDING
   does a related thing.

   You can safely do set_thread_flag(TIF_NOTIFY_RESUME) from anywhere.
   This means tracehook_notify_resume() gets called before returning
to user mode.  tracehook_notify_resume() is an arch-independent
   inline with one call site on each machine (in arch-specific code).
   In any kernel that has it, you could at least use a kprobe on that
   inline to get a callback at the safe user-mode boundary (where
   user_regset is kosher, if you don't have interrupts disabled or
   other locks held or whatnot).

   In utrace, this is what passing UTRACE_REPORT to utrace_control()
   does.  But you can't call utrace_control() from interrupt level or
   with locks held or so forth, because of lockdep issues.  I'd always
   expected to have some manner of utrace call that can be made by the
   current task from interrupt level or anywhere, to demand a utrace
   report at the next safe point--i.e., what UTRACE_REPORT (or also a
   UTRACE_INTERRUPT option) would do if another thread were making a
   utrace_control() call.  I think we can add a simple thing like that
   to utrace easily.  But it's not there now.

   If you are considering something like a futex syscall, that might
   block (or already has), then TIF_NOTIFY_RESUME/UTRACE_REPORT
   doesn't do anything until the syscall finishes of its own accord.
   (For a problem futex wait you are investigating, that might be
   never.)  If you know that it's a syscall that restarts properly
   (futex does, correctly resuming a timeout if there is one), then
   you can use TIF_SIGPENDING (in utrace, UTRACE_INTERRUPT) instead.
   That will prevent it from blocking normally, instead going back to
   user mode to restart the syscall.  On its way back, you can see the
   full registers before or when it restarts.  Then of course you have
   to know not to loop when you hit your probe inside the futex code
   the second time.  (Note in the case of futex and some others, the
   second time around will be NR_restart_syscall rather than the
   original syscall, hence that path will be via futex_wait_restart()
   rather than sys_futex->do_futex as the first time's path was.)  In
   the case of a thread already blocked on a futex, you can already
   use utrace_control() on it to do UTRACE_INTERRUPT from another
   thread.  To do it from inside the futex call on the current thread,
   e.g. from a timer interrupt in the same thread context or something
   like that, you'd need the same new utrace interface as above.

   In the absence of utrace or that new feature, you can do a couple
   of things with TIF_SIGPENDING.  You can actually send a signal from
   anywhere (send_sig et al), and then catch that happening by normal
   means.  That is, you can use an ignored signal and then notice
   trace_signal_deliver(); or, with existing utrace you can use any
   signal and swallow it in the utrace report_signal handler.  Or, you
   can do plain set_thread_flag(TIF_SIGPENDING) from anywhere, and
   then use a kprobe on get_signal_to_deliver() to catch it before it
   checks and sees no signals and does nothing.

   Finally, you can just change the user PC to something that leads to
   an event you know how to trace.  (There's no point in just using an
   invalid PC, since that only leads to a signal you might as well
   just send.)  That could be the vDSO __kernel_rt_sigreturn if you
   have a probe in sys_rt_sigreturn or something like that, or could
   be some random PC where you previously installed a uprobe.  This is
   probably a poor choice, because of complications like restoring the
   real user PC if another signal comes along first.

   With any of those methods, what the low-level implementation provides
   is equivalent to two internal probes, hence "two phase".  There are
   lots of ways to deal with this in the script world.  e.g.

   a. For backtraces alone, you could have ubacktrace() store a magic
      object that is a placeholder for a backtrace to be done at the
      safe point.  For printing a backtrace from an inside-kernel probe,
      some magic nugget would be placed in the output buffer so that the
      stapio side would know not to actually deliver this buffer as
      printed text until the second phase probe comes along to fill in
      that portion of the output out of order.
   b. You could force the script/tapset to do it entirely in terms of
      two language-level probes:
	probe kernel... { notify_resume() /* could be embedded-c */ }
	probe user.resume { ... } // could be just = kernel.function("...")
   c. You could add two-phase probes as a first-class language feature:
	probe kernel... {
		print_firstpart();
		@resume {
			print_secondpart();
		}
	}
      Perhaps with some interrupting variant too, perhaps even one that
      rolls in the restart-once logic:
	probe kernel... {
		@restartsys { bt = ubacktrace() }
		printf("blah at %s\n", bt)
	}
      Inside the @resume et al clauses, you have no $ context (can use
      only kernel globals), but have full user registers.  Perhaps if it
      appears in a library... probe, then your globals-only $ context is
      for the user module named in the probe instead.

   This class of approach has the big advantage that it's entirely
   compatible with doing user unwinding from the .eh_frame CFI in the
   user text rather than relying on prepacking.  At these safe points,
   it's entirely fine to do full user memory access with uaccess.h or
   whatever, block in page faults to bring the necessary text in, etc.
   (Back at the dawn of time, I presumed this is how it would always be
   done, and hence thought it pretty nutty to be packing up userland CFI
   data into stap kernel modules.)  At worst, you risk nothing but
   wedging that one thread, and it can still be killed cleanly.

   Even with the prepacking, this can let the user unwinder code run
   preemptible (for voluntary preemption, you can sprinkle it with
   cond_resched, and for paranoia, fatal_signal_pending bail-out
   checks).  This removes the burden of dealing with userland CFI down
   in any sensitive places where delaying too long or getting led astray
   into an infinite loop is a big problem.  Then any unwinding work done
   in such places is only for the kernel, where the CFI to contend with
   is a finite known set we can have scoured thoroughly for bogons
   beforehand.  (Until someone compiles another kernel module, of
   course, but you get the idea.)

I said three paths, but the careful reader will have noticed when I was
talking about the nuances of entry paths earlier that there is also:

4. Pre-collect via syscall-entry tracing

   As I mentioned above, the syscall tracing callbacks have complete
   register access and can use user_regset.  So, you can enable syscall
   tracing via utrace or the "sys_enter" tracepoint.  In your callback,
   use user_regset (properly) or the 'struct pt_regs' (from argument or
   task_pt_regs(), improperly) to save off the complete user register
   data somewhere.  Then when in a later probe point before the syscall
   exit, you can use the saved regset block to prime the user unwinder.

   You might optimize out copying the registers based on looking for a
   syscall_get_nr() value.  Or you might make that syscall-entry probe
   the place where you check for and record a futex call, rather than a
   separate kernel probe inside the sys_futex call chain.

   Conversely, you could have a general mode (maybe even enabled just by
   using ubacktrace() in a script!) that just implicitly enables the
   syscall-entry tracing everywhere (or in targeted tasks, or whatever)
   with a canned probe in the runtime that stores the user registers.
   Both the tracing mode and the copying add overhead to every syscall,
   which can be measured to think about how desirable this is to do how
   automatically.  (I happen to know that on x86 there is some work in
   the low-level magic we can do to reduce the tracing mode part of the
   overhead on the syscall-exit side, which you incur by doing
   syscall-entry tracing.  So let me know if measurements suggest that
   optimizing that part of it could be the tipping point for a decision
   that's desirable in other regards such as punting on all the
   potential work involved in all the other avenues under discussion.)


Thanks,
Roland

