This is the mail archive of the
libc-hacker@sourceware.cygnus.com
mailing list for the glibc project.
gdb and linuxthreads (A deadlock in linuxthreads.)
- To: Xavier.Leroy@inria.fr (Xavier Leroy)
- Subject: gdb and linuxthreads (A deadlock in linuxthreads.)
- From: hjl@lucon.org (H.J. Lu)
- Date: Sat, 19 Dec 1998 16:46:31 -0800 (PST)
- Cc: libc-hacker@cygnus.com (GNU C Library), drepper@cygnus.com (Ulrich Drepper)
>
> Hello,
>
> I agree there is something wrong in the way the thread manager handles
> terminated threads, but I'm not sure I follow your interpretation of
> the bug, nor your patch.
>
> > The manager is waiting in the loop for
> > a. dead children.
> > b. request from children.
> > Now
> > 1. At some instance, a child exits.
> > 2. The manager wakes up in the loop and finds a dead child. It calls
> > pthread_reap_children () which in turn calls pthread_exited () which
> > calls __pthread_lock (). In __pthread_lock (), there is
> > if (oldstatus != 0) suspend(self);
> > This time "oldstatus" is 1 and suspend(self) is called. Now the manager
> > thinks it has nothing to do and suspends itself. At the same time,
> > another child sends a REQ_CREATE message to the manager and then calls
> > suspend(self). Now both the manager and the child have called suspend(self).
> > We get a deadlock here.
>
> I don't see it. The manager suspends itself because there is another
> thread that currently holds the lock for the terminated thread
> (i.e. a third thread doing a join or detach on the thread that has
> just terminated). However, that third thread is going to release the
> lock eventually. I mean, it cannot be the same thread that has just
> sent a request to the manager and is suspended. If it holds the lock,
I have verified that there was no third thread at all. There were only
2 threads, the manager and the thread that had just sent a request to the
manager. It may be a race condition which can only happen on an SMP
machine.
> it's not suspended. Releasing the lock will restart the manager,
> which then will process the request and restart the requesting thread.
>
> BUT: there is something very wrong in using __pthread_sig_restart to
> signal dead children, because that signal is also used to restart the
> manager thread when it's waiting on a fastlock. So, a child that dies
> while the manager is waiting for a fastlock will restart the manager
> prematurely. In itself that wouldn't be incorrect, just inefficient.
> But the fastlock mechanism relies crucially on __pthread_sig_restart
> being blocked at all times except when waiting on a fastlock.
> (Otherwise, we get a race condition that leads to lost wakeups.)
> And this is not the case in the thread manager, since the "dead
> children" signal must remain unblocked. I strongly suspect the
> manager deadlock you've observed is due to such a lost wakeup.
> I totally missed this point when turning the spinlocks used in the
> original implementation into fast locks.
That is very possible.
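
To make the lost-wakeup hazard concrete, here is a minimal sketch in plain POSIX C. It is not the linuxthreads code: SIGUSR1 is only a stand-in for __pthread_sig_restart, and the names on_restart and early_wakeup_is_kept are made up for illustration. The restart signal stays blocked except inside sigsuspend(), so a wakeup that arrives before the wait begins stays pending instead of being lost.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler; stands in for "the waiter has been restarted". */
static volatile sig_atomic_t restart_seen = 0;

static void on_restart(int sig)
{
    (void)sig;
    restart_seen = 1;
}

/* Hypothetical demo: returns 1 if a wakeup sent before the wait is still
   delivered once the waiter suspends, -1 on a setup error. */
int early_wakeup_is_kept(void)
{
    struct sigaction sa;
    sigset_t block, old, wait_mask;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_restart;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    /* Keep the restart signal blocked outside of the wait itself. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    restart_seen = 0;

    /* The wakeup arrives "too early", before the waiter suspends.
       Because the signal is blocked, it is left pending. */
    if (raise(SIGUSR1) != 0)
        return -1;
    assert(!restart_seen);

    /* sigsuspend() unblocks the signal and waits atomically, so the
       pending wakeup is delivered here rather than lost. */
    wait_mask = old;
    sigdelset(&wait_mask, SIGUSR1);
    while (!restart_seen)
        sigsuspend(&wait_mask);

    sigprocmask(SIG_SETMASK, &old, NULL);
    return 1;
}
```

If the signal were instead left unblocked, as the manager must do today for the "dead children" case, the handler could run between the check of the flag and a plain suspend, and the waiter could then sleep with nobody left to wake it.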
>
> The problem would be easy to fix... if it weren't for the gdb
> interface. The right thing to do is have the children send
> __pthread_sig_cancel instead of __pthread_sig_restart when they die.
> Then, __pthread_manager_sighandler is called from the handler for
> __pthread_sig_cancel. __pthread_sig_restart remains blocked by
> default in the manager thread just like in any other thread.
>
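
The separation of the two signals described above might look roughly like the following sketch. It is only an illustration under stated assumptions: SIGUSR1 and SIGUSR2 stand in for __pthread_sig_restart and __pthread_sig_cancel, the handler and function names are hypothetical, and none of this is taken from the linuxthreads sources. The "dead child" notification is handled at once, while the restart signal stays blocked until the manager explicitly waits for it.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>

#define DEMO_SIG_RESTART SIGUSR1 /* stand-in for __pthread_sig_restart */
#define DEMO_SIG_CANCEL  SIGUSR2 /* stand-in for __pthread_sig_cancel  */

static volatile sig_atomic_t restarts = 0;
static volatile sig_atomic_t dead_children = 0;

static void on_restart_sig(int sig) { (void)sig; restarts++; }
static void on_cancel_sig(int sig)  { (void)sig; dead_children++; }

/* Hypothetical demo: returns 1 if a child-death signal is handled
   immediately without consuming the restart wakeup, -1 on setup error. */
int child_death_keeps_restart_pending(void)
{
    struct sigaction sa;
    sigset_t set, wait_mask;
    int restarts_before_wait;

    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = on_restart_sig;
    if (sigaction(DEMO_SIG_RESTART, &sa, NULL) != 0)
        return -1;
    sa.sa_handler = on_cancel_sig;
    if (sigaction(DEMO_SIG_CANCEL, &sa, NULL) != 0)
        return -1;

    /* Restart stays blocked by default; cancel stays unblocked. */
    sigemptyset(&set);
    sigaddset(&set, DEMO_SIG_RESTART);
    if (sigprocmask(SIG_BLOCK, &set, NULL) != 0)
        return -1;
    sigemptyset(&set);
    sigaddset(&set, DEMO_SIG_CANCEL);
    if (sigprocmask(SIG_UNBLOCK, &set, NULL) != 0)
        return -1;

    restarts = 0;
    dead_children = 0;

    raise(DEMO_SIG_CANCEL);   /* a child dies: handler runs right away */
    raise(DEMO_SIG_RESTART);  /* a lock is released: left pending      */
    restarts_before_wait = restarts;

    /* Only now does the "manager" wait for the restart. */
    sigprocmask(SIG_SETMASK, NULL, &wait_mask); /* read current mask */
    sigdelset(&wait_mask, DEMO_SIG_RESTART);
    while (!restarts)
        sigsuspend(&wait_mask);

    return dead_children == 1 && restarts_before_wait == 0 && restarts == 1;
}
```

The sketch says nothing about how gdb reacts to the extra cancel signals, which is the open question in this thread.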
> The problem is that the gdb interface uses a __pthread_sig_cancel sent
> to the thread manager as an indication that a new thread is created.
> (See the processing of REQ_DEBUG.) I really don't know what happens
> if we send a __pthread_sig_cancel also when a thread dies. Have you
> looked at the OpenGroup patches to gdb? Not being too familiar with
> gdb myself, I don't fully understand what's going on.
Could you please get my gdb 4.17.0.6 from
ftp://ftp.kernel.org/pub/linux/devel/gcc
It has the OpenGroup linuxthreads patch, but it only works with
glibc 2.0. I am enclosing a patch here against 4.17.0.6. With it, gdb
compiles under glibc 2.1, but I run into a problem. Ulrich, could
you please tell me why you added CLONE_PTRACE to the __clone
call? I think that is one thing which breaks gdb.
>
> I'll try to send you tomorrow patches that implement the approach
> above (use __pthread_sig_cancel to signal dead children). Then maybe
> you could test them on a multiprocessor and see whether the problem
> with ex6.c remains.
>
Thanks. I will try. BTW, I only see this deadlock in about 1 out of
40 tries.
Thanks.
--
H.J. Lu (hjl@gnu.org)
----
Index: lnx-thread.c
===================================================================
RCS file: /home/work/cvs/gnu/gdb/gdb/lnx-thread.c,v
retrieving revision 1.2
diff -u -p -r1.2 lnx-thread.c
--- lnx-thread.c 1998/12/03 01:07:42 1.2
+++ lnx-thread.c 1998/12/19 18:48:53
@@ -403,6 +403,8 @@ stop_thread (pid)
printf_unfiltered ("[New %s]\n", target_pid_to_str (pid));
add_thread (pid);
}
+ else
+ perror_with_name ("ptrace in stop_thread");
}
/* Wait for a thread */
@@ -641,6 +643,19 @@ linuxthreads_new_objfile (objfile)
"__pthread_sig_cancel");
return;
}
+
+#ifdef SIGRTMIN
+ if (!linuxthreads_sig_restart && !linuxthreads_sig_cancel)
+ {
+ linuxthreads_sig_restart = __libc_allocate_rtsig (1);
+ linuxthreads_sig_cancel = __libc_allocate_rtsig (1);
+ if (linuxthreads_sig_restart < 0 || linuxthreads_sig_cancel < 0)
+ {
+ linuxthreads_sig_restart = LINUXTHREAD_SIG_EXIT;
+ linuxthreads_sig_cancel = LINUXTHREAD_SIG_CANCEL;
+ }
+ }
+#endif
if ((ms = lookup_minimal_symbol ("__pthread_threads_max",
NULL, objfile)) == NULL