This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.



[RFA] Fix crash on Linux 2.4 when threaded program exits


GDB crashes when the threaded program being debugged exits:

    (gdb) run
    Starting program: /[...]/q 
    [Thread debugging using libthread_db enabled]
    [New Thread 0xb748ebb0 (LWP 9340)]
    [New Thread 0xb728abb0 (LWP 9341)]
    Test2
    Test1
    [Thread 0xb748ebb0 (LWP 9340) exited]
    [Thread 0xb728abb0 (LWP 9341) exited]
    [Thread 0xb75d9b80 (LWP 9337) exited]
    Recursive internal problem.
    zsh: 9330 abort      gdb-head q

This appears to be specific to 2.4 Linux kernels and the way NPTL
behaves on them: with 2.4, we only receive an "exited" notification
for the main thread, whereas with 2.6 we receive the notification for
each and every thread.

What happens in the 2.4 case is that we delete the lp structure for
the thread that exited and then still try to use it shortly after.
At this point, the memory has been freed and its contents have been
corrupted. As a result, we hit an internal error, which in turn
triggers a second internal error, and that one causes the abort.

The code in linux-nat.c:linux_nat_filter_event looks like this:

  if ((WIFEXITED (status) || WIFSIGNALED (status)) && num_lwps > 1)
    {
      [delete threads that have vanished]

      exit_lwp (lp);

      /* If there is at least one more LWP, then the exit signal was
         not the end of the debugged application and should be
         ignored.  */
      if (num_lwps > 0)
        return NULL;
    }

As you can see, in the linux-2.4 case, we end up deleting all the other
threads, then call exit_lwp to delete the main thread. Next we check
num_lwps, which is now zero, so we continue instead of returning. Shortly
after that, in the same routine, we access lp again (around line 2717,
"lp->ignore_sigint"), but the symptoms only appear slightly later, when
we read the lp's ptid in order to set inferior_ptid, which is then used
to get the associated inferior.

The fix is to delete the lp and return NULL only if other lwps still
exist. If this is the last lwp, we keep it, so that the rest of the
routine can still use it.
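
To make the ordering problem easier to see outside of GDB, here is a
minimal standalone sketch of the two control flows. It is not the GDB
code: every toy_* name below is made up for illustration, and the list
handling is only a rough stand-in for what exit_lwp does to the real
lwp list. The old path frees the lp first and checks the counter
afterwards; the patched path checks the counter first and only frees
when another lwp remains.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for GDB's lwp_info; only what the sketch needs.  */
struct toy_lwp
{
  int lwpid;
  struct toy_lwp *next;
};

struct toy_lwp *toy_lwp_list = NULL;
int toy_num_lwps = 0;

/* Allocate a new entry and push it on the list.  */
struct toy_lwp *
toy_add_lwp (int lwpid)
{
  struct toy_lwp *lp = malloc (sizeof *lp);

  assert (lp != NULL);
  lp->lwpid = lwpid;
  lp->next = toy_lwp_list;
  toy_lwp_list = lp;
  toy_num_lwps++;
  return lp;
}

/* Unlink LP from the list and free it (assumed to mirror the effect of
   exit_lwp).  LP must not be dereferenced after this returns.  */
void
toy_exit_lwp (struct toy_lwp *lp)
{
  struct toy_lwp **pp;

  for (pp = &toy_lwp_list; *pp != NULL; pp = &(*pp)->next)
    if (*pp == lp)
      {
        *pp = lp->next;
        free (lp);
        toy_num_lwps--;
        return;
      }
}

/* Old ordering: free first, check afterwards.  Returns 1 when the
   caller would fall through and keep using the already-freed LP,
   0 when it would have returned NULL early.  LP is never dereferenced
   here after the free, so the sketch itself stays well-defined.  */
int
toy_old_exit_path_falls_through (struct toy_lwp *lp)
{
  toy_exit_lwp (lp);
  if (toy_num_lwps > 0)
    return 0;
  return 1;
}

/* Patched ordering: only free LP when another lwp remains.  Returns
   NULL when the exit event should be ignored, otherwise returns LP,
   which is still valid for the caller to use.  */
struct toy_lwp *
toy_fixed_exit_path (struct toy_lwp *lp)
{
  if (toy_num_lwps > 1)
    {
      toy_exit_lwp (lp);
      return NULL;
    }
  return lp;
}
```

With a single remaining entry, toy_fixed_exit_path hands back the same
pointer untouched, whereas toy_old_exit_path_falls_through reports that
the caller would carry on with freed memory, which is the situation
described above for 2.4 kernels.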

2009-04-01  Joel Brobecker  <brobecker@adacore.com>

        * linux-nat.c (linux_nat_filter_event): Do not delete the lwp if
        this is the last one.

Tested on x86-linux (with a 2.4.21 Linux kernel). It fixes ~25 failures.
Tested on x86_64-linux (with a 2.6 kernel). No regression.

Does this look correct?

Thanks,
-- 
Joel
diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index be99ece..feca722 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -2644,13 +2644,14 @@ linux_nat_filter_event (int lwpid, int status, int options)
 			    "LLW: %s exited.\n",
 			    target_pid_to_str (lp->ptid));
 
-      exit_lwp (lp);
-
-      /* If there is at least one more LWP, then the exit signal was
-	 not the end of the debugged application and should be
-	 ignored.  */
-      if (num_lwps > 0)
-	return NULL;
+      if (num_lwps > 1)
+       {
+	 /* If there is at least one more LWP, then the exit signal
+	    was not the end of the debugged application and should be
+	    ignored.  */
+	 exit_lwp (lp);
+	 return NULL;
+       }
     }
 
   /* Check if the current LWP has previously exited.  In the nptl
