This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



notify_page_fault() problem


In testing the ARM Kprobes code, I hit a bug where the kernel would
occasionally lock up.  I tracked this down to the page fault handling
code.

Now that I have a grasp of how the code works, I'm wondering why
other Kprobes implementations aren't hitting the same bug.

On i386, I see they have a fix or workaround for this issue, but in
skimming the code base I don't see anything like it on any other
Kprobes-supporting platform.

The manifestation of the problem is that the kernel can get stuck
in an infinite loop of repeating page faults until its stack
overflows.  I understand how and why this happens, but exactly what
is considered the best fix is unclear to me, since I'm not a Linux
kernel expert.

The start of the problem is that the data space for Kprobes modules
comes out of the module_alloc'd memory region.  Like vmalloc'd
memory, this memory can cause an MMU fault, which is expected.
Also like vmalloc'd memory, the kernel fills in the hardware PTEs
for module_alloc'd memory on an on-demand basis.

The problem is that the call to notify_page_fault() sits right up
front in most architectures' data fault handling code.  The function
notify_page_fault() and its registered chain routines can be invoked
_before_ the kernel has had a chance to fill in its mapping.  See the
problem yet?  The MMU generates a fault, the arch-specific fault
handler is called, it immediately calls notify_page_fault(), which
calls kprobe_exceptions_notify(), which calls kprobe_fault_handler(),
which calls get_kprobe().  The function get_kprobe() dereferences
a kprobe data structure that isn't always mapped in hardware
yet, so it faults.  Now we restart the entire call sequence
again, and then again, and again, and again until the system
falls over.

Now, i386's do_page_fault() avoids the above infinite recursion
by checking whether the fault happened in kernel space.  If it
did, it makes sure it was the right kind of fault and, if so,
calls vmalloc_fault(), which attempts to fill in the missing
page translation with vmalloc_sync_one().  If that succeeds,
do_page_fault() returns without ever letting notify_page_fault()
get called.

This support code on i386 (vmalloc_fault() and vmalloc_sync_one())
lives in its architecture-specific layer.  If this is considered
the best fix for this issue (instead of moving the call, more
on that below), then rather than everyone going off and implementing
it themselves in their own layer, should something like it be
implemented in architecture-independent code?


One thing I don't understand is why notify_page_fault() is called so early in everyone's page fault handling code.  Shouldn't it be called just before the kernel is about to give up, instead of right up front?

Page fault handling is a serious critical path in any kernel.
With the notify_page_fault() call placed so early, just having
the check there, whether or not any hooks are registered, has
to have a detrimental impact on overall system performance.  If
it were treated as an option of last resort like other 'die' chain
services, this wouldn't have been a problem in the first place.  Why
did someone find it important for the check not to go last?


For now, to work around this bug in the ARM Kprobes port, I have simply removed the call to notify_page_fault() for kernel translation faults.  I wanted to hear some comments on this problem before working on a real fix.

Quentin

