Re: breakpoint assistance: single-step out of line


On Sun, 2007-03-04 at 13:38 -0800, Roland McGrath wrote:
> The method of single-stepping over an out of line copy of the instruction
> clobbered by breakpoint insertion has been proven by kprobes.  The
> complexities are mitigated in that implementation by the constrained
> context of the kernel and the fixed subset of possible machine code known
> to validly occur in any kernel or module text.
> 
> There are two core problem areas in implementing single-step out of line
> for user mode code.  These are where to store the out of line copies, and
> arch issues with instruction semantics.
> 
> 
> Starting with arch issues, I'll talk about the only ones I know in detail,
> which are x86 and x86_64.  kprobes has done the basic work here.  For the
> user mode context, on the one hand the risks of munging an instruction's
> behavior are confined to the user address space in question, but on the
> other hand we have to deal robustly with the full range of instructions
> that can be executed on the processor in user mode.

Yes.

> 
> Instruction decoding needs to be robust, not presume the canonical subset
> of encodings normally produced by the compiler, as used in the kernel.  On
> machines other than x86, this tends to be quite simple.  On x86, it means
> parsing all the instruction prefixes correctly and so forth.  I think the
> parsing should be done at breakpoint insertion time, caching just a few
> bits saying what fixup strategy we need to use after the step.

I guess that depends on how complicated the switch(opcode) { ... } code
in uprobe_resume_execution() gets.  Parsing the instruction at
probe-insertion time is essential for x86_64, at least partly because of
rip-relative addressing, as you discuss below.
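
To make that concrete, here's roughly the shape I'd imagine for the
insertion-time analysis.  All names are hypothetical, and the decoder is
deliberately incomplete -- just enough to show the cached-strategy idea:

    #include <stdint.h>

    enum fixup_strategy {
            FIXUP_NONE,             /* PC-neutral: just set rip past the probepoint */
            FIXUP_CONTROL_FLOW,     /* call/jmp/ret family: rip needs real fixup */
            FIXUP_RIP_RELATIVE,     /* x86_64 %rip-relative operand */
            FIXUP_REFUSE,           /* can't decode confidently: refuse the probe */
    };

    /* Skip legacy and REX prefixes, then classify the opcode.  The result
     * would be cached in the probepoint at insertion time, so the
     * post-single-step path just switches on the strategy instead of
     * re-parsing the instruction. */
    static enum fixup_strategy classify_insn(const uint8_t *insn, int len)
    {
            int i = 0;

            while (i < len) {
                    uint8_t b = insn[i];
                    /* operand/address size, lock, rep, segment overrides */
                    if (b == 0x66 || b == 0x67 || b == 0xf0 || b == 0xf2 ||
                        b == 0xf3 || b == 0x2e || b == 0x36 || b == 0x3e ||
                        b == 0x26 || b == 0x64 || b == 0x65)
                            i++;
                    else
                            break;
            }
            if (i < len && (insn[i] & 0xf0) == 0x40)        /* REX (x86_64) */
                    i++;
            if (i >= len)
                    return FIXUP_REFUSE;

            switch (insn[i]) {
            case 0xe8:                      /* call rel32 */
            case 0xe9: case 0xeb:           /* jmp rel32 / rel8 */
            case 0xc2: case 0xc3:           /* ret */
                    return FIXUP_CONTROL_FLOW;
            default:
                    /* A real decoder also walks ModRM/SIB here; mod == 00,
                     * r/m == 101 in 64-bit mode means %rip-relative. */
                    return FIXUP_NONE;
            }
    }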

> If we can't
> positively decode the instruction enough to be confident that we know how
> to fix it up, refuse to insert the breakpoint.

Yes.

> (If it's an invalid
> instruction, you don't need a breakpoint because you'll get a trap anyway.)
> 
> The instructions of concern are those that refer to the PC.
> On 32-bit x86, these are only the few control flow instructions.
> 
> On x86_64, there is also %rip-relative addressing.  We cannot presume
> addresses are within the same 4GB window so that the displacement can just
> be adjusted, as we do in the kernel.

Yes.

> However, we can use some other
> tricks.  The only instruction that computes a %rip-relative address as a
> result is lea.  It is not difficult to recognize that one and just emulate
> it outright; there are only a few variations of address-size, data-size,
> and output register.  It's not much easier to fix it up after the step.
> 
> Unless I'm overlooking something, all other %rip-relative uses are implicit
> in the effective address for a memory access.  

I think that's the case, but we haven't done a thorough review of the
instruction list.
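
FWIW, emulating the %rip-relative lea does look contained.  A rough
sketch (hypothetical names; assumes a 64-bit operand size and that
prefixes have already been stripped as above):

    #include <stdint.h>

    /* Recognize "lea disp32(%rip),%r64" and compute its result outright.
     * Returns 1 on a match, filling in the destination register number
     * and the value it should receive. */
    static int emulate_rip_lea(const uint8_t *insn, uint8_t rex,
                               uint64_t next_rip, int *dest_reg,
                               uint64_t *result)
    {
            uint8_t modrm;
            int32_t disp;

            if (insn[0] != 0x8d)                    /* lea */
                    return 0;
            modrm = insn[1];
            /* mod == 00, r/m == 101: disp32(%rip) in 64-bit mode */
            if ((modrm & 0xc0) != 0x00 || (modrm & 0x07) != 0x05)
                    return 0;

            disp = (int32_t)((uint32_t)insn[2] | (uint32_t)insn[3] << 8 |
                             (uint32_t)insn[4] << 16 | (uint32_t)insn[5] << 24);
            /* reg field, extended by REX.R */
            *dest_reg = ((modrm >> 3) & 0x07) | ((rex & 0x04) ? 8 : 0);
            /* %rip-relative means relative to the *next* instruction */
            *result = next_rip + (int64_t)disp;
            return 1;
    }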

> For these, we can use the fs
> or gs segment prefix on the copied instruction, and adjust the displacement
> and the fs or gs base value to come up with the original target address.
> In the unlikely event that the instruction already uses the fs or gs
> prefix, just adjust the appropriate base value and use the instruction as
> it is.  Otherwise, insert a gs prefix in the copied instruction, and set
> the gs base to the difference between the address of the copy (after the
> inserted prefix) and the breakpoint address.  It is a little costly to set
> the fs or gs base value and reset it after the step, much more than setting
> a register in the trap frame; but it's probably not too bad.

The approach we had in mind was to change the rip-relative instruction
to an indirect instruction where the target address is in a scratch
register (one not accessed by the original instruction).  Save the value
of the scratch register, load in the target address, single-step, and
restore the scratch register's real value.  This isn't coded yet.
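
In sketch form it would look something like the following.  None of this
is real code yet; for simplicity it ignores REX.B, so the scratch
register is one of the low eight GPRs, and it must avoid the r/m
encodings 100 and 101 (which mean SIB and %rip-relative respectively):

    #include <stdint.h>
    #include <string.h>

    /* Insertion time: rewrite "op ... disp32(%rip)" in the out-of-line
     * copy to "op ... (%scratch)".  ModRM mod == 00 with an ordinary r/m
     * is plain register-indirect, so the 4 displacement bytes go away.
     * tail_len is the number of bytes after the displacement (e.g. an
     * immediate operand). */
    static void rewrite_riprel_copy(uint8_t *copy, int modrm_off,
                                    int tail_len, int scratch)
    {
            copy[modrm_off] = (copy[modrm_off] & 0x38) | (scratch & 0x07);
            /* slide any trailing immediate up over the old displacement */
            memmove(&copy[modrm_off + 1], &copy[modrm_off + 5], tail_len);
    }

    /* Probe-hit time, bracketing the single-step.  The GPRs are shown as
     * an array here; in the kernel they'd be fields of struct pt_regs. */
    static void pre_sstep(uint64_t *regs, int scratch, uint64_t *saved,
                          uint64_t probed_addr, int insn_len, int32_t disp)
    {
            *saved = regs[scratch];
            /* the original instruction's target address, relative to the
             * *next* instruction at the probed address */
            regs[scratch] = probed_addr + insn_len + (int64_t)disp;
    }

    static void post_sstep(uint64_t *regs, int scratch, uint64_t saved)
    {
            regs[scratch] = saved;
    }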

> 
> 
> Next we come to the problem of where to store copied instructions for
> stepping.  The idea of stealing a stack page for this is a non-starter.
> For both security and robustness, it's never acceptable to introduce a user
> mapping that is both writable and executable, even temporarily.  We need to
> use an otherwise unused page in the address space, that will be
> read/execute only for the user, we can write to it only from kernel mode.

As it turns out, this approach isn't very portable, either.  The s390
and powerpc compilers regularly generate code that accesses data beyond
the top-of-stack, so it's tough to find a "safe" page in the stack vma.

> 
> In some meeting notes I've seen mention of "do what the vdso does".  I
> don't know what this referred to specifically.

We were thinking in terms of a per-process page that's automatically set
up at exec time.  There's no dso involved in our approach, but the
"vdso" reference has been hard to kill.

> There are two things this
> might mean, and those are the two main options I see.  What the i386 vDSO
> used to do (CONFIG_COMPAT_VDSO), what the ia64 vDSO does, and what the
> x86-64 vsyscall page does (not a vDSO but similar), is the fixmap area.
> What the i386 vDSO, the ia32 vDSO on x86_64, and the powerpc vDSO do,
> is insert a vma.

It's the latter.

> 
> The fixmap area is a region of address space that shares some page tables
> across all tasks in the system.  The advantages are that it has no vm setup
> cost since it is done once at boot, and that it is completely outside the
> range of virtual addresses the user task can map normally and so does not
> perturb any mapping behavior or appear in /proc/PID/maps or via
> access_process_vm or such things that might have unintended side effects on
> the user process.  On 32-bit x86, a disadvantage is that when NX page
> protection is not available (older CPUs or non-PAE kernel builds), the
> exec-shield approximation of NX protection via segmentation is defeated by
> having an executable page high in the address space; this can be worked
> around on the exec-shield kernel with some extra effort.  Other machines
> may not already have an analogous region of reserved address space where a
> page can be made user-readable/executable.  Other potential disadvantages
> are the fixed amount of space (chosen at compile-time or boot-time, with
> some small limit on the number of pages available), and the security
> implications of global pages visible to all users on the system.  The
> limited size might mean that slots need to be assigned only momentarily
> while doing the step, meaning fresh icache flushing every time.  Then you'd
> ideally use only one slot per CPU, but that needs some work to be right
> given preemption.  The briefness of this window may mitigate the security
> concerns, but still there are a few bytes of information about a traced
> thread leaking to anyone in the system who wants to try to see them.  The
> setup every time necessitated by the fixed space is costly, but on the
> other hand its CPU use scales linearly with more breakpoints and more
> occurrences and its memory use stays constant, compared to open-ended
> allocation scaling with the number of breakpoints.

We haven't seriously considered the above approach.

> 
> Inserting a vma means essentially doing an mmap from inside the kernel.
> Both the advantages and the disadvantages of this stem from its normalcy.
> Any stray mmap/munmap/mprotect call from the user might wind up clobbering
> this mapping.  

Good point.

> It appears in /proc/PID/maps and will become known to other
> debugging facilities tracing the process, so they will think it's a normal
> user allocation; it might appear in core dumps.  This might have other bad
> effects on processes that look at their own maps file to see what heap
> pages there are, which some GC libraries or suchlike might well do.  The
> mapping also has subtler effects perturbing the process's own mapping
> behavior, which could introduce anomalies or even break some programs that
> need a lot of control over their address space.  The advantages are that
> it's straightforward to implement and easy to be sure that it does the
> right thing vis a vis paging and so forth, it provides the option of using
> an open-ended amount of storage to optimize the use of many breakpoints,
> and it's wholly private to the user address space in question.

Our current approach uses a fixed-size area (1 page for now) that's
allocated at exec time.  Instruction slots are allocated to probepoints
as they are hit, and a probepoint owns the slot until another probepoint
steals it.  For x86[_64], one page gives us 256 slots.  We would see
thrashing due to slot starvation only if the process is hitting more
than 256 different probepoints in a short time span.

We're still debugging this approach.
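
In outline, the slot management looks something like this (condensed,
with hypothetical names; the real code also has to worry about locking
and about flushing the icache when a slot changes hands):

    #include <stddef.h>

    #define SLOT_SIZE       16      /* enough for any x86 insn, plus slop */
    #define NR_SLOTS        256     /* PAGE_SIZE / SLOT_SIZE */

    struct probepoint;              /* opaque for this sketch */

    struct ssol_area {
            struct probepoint *owner[NR_SLOTS];
            size_t next;            /* round-robin eviction cursor */
    };

    /* Return ppt's slot, stealing one if it doesn't own one.  A stolen
     * slot gets the new owner's instruction copy written into it (and
     * the icache flushed) before the single-step. */
    static size_t get_slot(struct ssol_area *area, struct probepoint *ppt)
    {
            size_t i;

            /* the real code would cache the index in the probepoint
             * rather than search */
            for (i = 0; i < NR_SLOTS; i++)
                    if (area->owner[i] == ppt)
                            return i;       /* still own one: no recopy */

            i = area->next;                 /* steal the round-robin victim */
            area->next = (area->next + 1) % NR_SLOTS;
            area->owner[i] = ppt;
            return i;
    }

Round-robin eviction is about the simplest policy that gives the "owns
it until stolen" behavior; something like LRU would be fancier, but
probably overkill for 256 slots.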

> 
> A third option I didn't mention before is doing something in the page
> tables behind the vm system's back (this as distinct, and somewhat simpler
> than, the fancy VM ideas like per-thread page tables).  I don't know enough
> about this to comment in detail.  The attraction is that it would avoid
> some of the interactions I just mentioned with vma's, and might have lower
> overhead to set up.  It might be difficult to make this do reasonable
> things about paging and such.  This is probably not a good bet, but I don't
> know much about it.

I haven't thought about the above approach.

> 
> The fixmap is somewhat attractive at least for x86, x86-64, and ia64.  It's
> nice that it doesn't interact with the normal user address range and set of
> visible mappings.  The overhead of resetting and icache flushing an
> instruction slot on every use is less than the uprobes prototype using a
> stack page already has.  I don't know if the performance of that will be
> good enough in the long run, or if priming a slot once and using it
> repeatedly will perform enough better that we care about this overhead.

We picked per-probepoint multiplexing because of icache issues (and
because it seems best for single-threaded apps), but it turned out to be
no more complex than per-thread or per-cpu muxing.

> 
> The vma is the most straightforward thing to implement, and is generic
> across machines.  It makes sense to implement this first generically

Oh, good.  Glad we got that right.

> and
> then experiment later with the fixmap approach as an arch-specific
> alternative.  The stack randomization done on at least x86/x86-64 means
> that there is normally a good little stretch of address space free above
> the stack vma (the top part of which holds environ and auxv).  (Just try
> "tail -1 /proc/self/maps" a few times.)  This area is unlikely to conflict
> with address space the user's own mappings would ever have considered.
> Allocating at one page above the end of the stack vma (leaving a red zone)
> seems good.

Sounds good, although I personally don't know the incantation for
putting the vma there.  Any help would be appreciated.
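
Our best guess so far is something like the following, against roughly
current APIs -- completely unverified, so corrections are welcome.
Writing the instruction copies into the page would then go through
access_process_vm(), since the user mapping is read/execute only:

    #include <linux/mm.h>
    #include <linux/mman.h>
    #include <linux/sched.h>
    #include <linux/errno.h>

    /* Map one r-x anonymous page just above the stack vma, leaving one
     * unmapped page as a red zone.  MAP_FIXED is cavalier here; the
     * real thing should verify the range is actually free first. */
    static unsigned long map_ssol_page(void)
    {
            struct mm_struct *mm = current->mm;
            struct vm_area_struct *stack;
            unsigned long addr;

            down_write(&mm->mmap_sem);
            stack = find_vma(mm, mm->start_stack);
            if (!stack) {
                    up_write(&mm->mmap_sem);
                    return -ENOMEM;
            }
            addr = do_mmap_pgoff(NULL, stack->vm_end + PAGE_SIZE, PAGE_SIZE,
                                 PROT_READ | PROT_EXEC,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, 0);
            up_write(&mm->mmap_sem);
            return addr;        /* an address, or a negative errno */
    }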

> I'm really more concerned about things monitoring the
> mappings.  Perhaps we could add a VM_* flag that says to omit the vma from
> listings, but I don't know how that would be received by kernel people, let
> alone a flag to disallow user munmap/mmap/mprotect calls to change a mapping.

I'm not so worried about the visibility of the area in /proc/*/maps and
such; protecting it from munmap & friends seems more of a concern.

> 
> I can go into further detail on how I envision implementing the vma and/or
> fixmap plans if it is not clear.
> 
We hope to post the above-described implementation Real Soon Now.  Given
your interest, maybe we'll post it even before it's firing on all
cylinders.

> 
> Thanks,
> Roland

Thanks again for your ideas and interest.
Jim

