This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.
Re: Making the transport layer more robust
On 08/12/2011 10:43 AM, Mark Wielaard wrote:
> commit 46ac9ed5bad86641e552bee4e42a2d973ffc12d0
> Author: Mark Wielaard <mjw@redhat.com>
> Date: Fri Aug 12 19:34:20 2011 +0200
>
> Remove _stp_ctl_work_timer from module transport layer.
>
> The _stp_ctl_work_timer would trigger every 20ms to check whether
> there were cmd messages queued, but not announced yet and to
> check the _stp_exit_flag was set.
>
> This commit makes all control messages announce themselves and
> check the _stp_exit_flag in the _stp_ctl_read_cmd loop (delivery
> is still possibly delayed since the messages are just pushed on
> a wait queue).
This has unfortunately left open an opportunity for deadlock. The
kernel wake_up infrastructure takes a spinlock on the wait queue. If a
probe happens to fire while that lock is held, either via a direct
probe on something called by wake_up or indirectly via NMI, then the
handler must not call anything that would attempt to take the same
lock. But this commit triggers a wake_up on ctl prints, and commit
a85c8aff triggers the same on exit().
For example, __wake_up_common is called with the wait queue lock held,
so either of these will cause a deadlock:
  probe kernel.function("__wake_up_common") { warn(pp()) }
  probe kernel.function("__wake_up_common") { exit() }
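To make the re-entrancy problem concrete, here is a rough userspace analogy in C. It is only a sketch, not kernel or systemtap runtime code: wq_lock, probe_handler and wake_up_with_probe are hypothetical names, and a POSIX error-checking mutex stands in for the wait queue spinlock so that the second lock attempt reports EDEADLK instead of spinning forever as a real spinlock would.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the spinlock protecting the wait queue. */
static pthread_mutex_t wq_lock;

/* Use an error-checking mutex so that a relock by the owning thread
 * returns EDEADLK rather than hanging the demonstration. */
static void wq_lock_init(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&wq_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

/* Stand-in for a probe handler whose warn() or exit() ends up calling
 * wake_up(), which needs the wait queue lock. */
static int probe_handler(void)
{
    int rc = pthread_mutex_lock(&wq_lock);

    if (rc == 0)
        pthread_mutex_unlock(&wq_lock);
    return rc; /* 0 on success, EDEADLK if this thread already holds it */
}

/* Stand-in for wake_up(): takes the lock, and the probe placed on
 * __wake_up_common fires while the lock is still held. */
static int wake_up_with_probe(void)
{
    int rc;

    pthread_mutex_lock(&wq_lock);
    rc = probe_handler(); /* second lock attempt by the same thread */
    pthread_mutex_unlock(&wq_lock);
    return rc;
}
```

After wq_lock_init(), calling probe_handler() on its own succeeds, while wake_up_with_probe() gets EDEADLK back from the nested attempt. In the kernel there is no such error return; the second attempt simply never completes.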
This issue in general is very similar to PR2525. We must take care not
to call any blocking code from arbitrary probe context.
Thanks,
Josh