This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: RFH: failed assert debugging threaded+fork program over gdbserver

From: Joel Brobecker <brobecker at adacore dot com>
To: Don Breazeal <donb at codesourcery dot com>
Cc: gdb-patches at sourceware dot org
Date: Thu, 23 Jun 2016 15:59:35 -0700
Subject: Re: RFH: failed assert debugging threaded+fork program over gdbserver
Authentication-results: sourceware.org; auth=none
References: <20160512171650 dot GC26324 at adacore dot com> <5734C06C dot 8040008 at codesourcery dot com>

Hello,

I finally had some time to look into this again. I think I have
a better understanding.

On Thu, May 12, 2016 at 10:42:04AM -0700, Don Breazeal wrote:
> On 5/12/2016 10:16 AM, Joel Brobecker wrote:
> > Hello,
> > 
> > I have noticed the following problem, when debugging a program which
> > uses both threads and fork. The program is attached in copy, and
> > it was compiled by simply doing:
> > 
> >     % gnatmake -g a_test
> > 
> > The issue appears only randomly, but it seems to show up fairly
> > reliably when using certain versions of GNU/Linux such as RHES7,
> > or WRSLinux. I also see it on Ubuntu, but less reliably. Here is
> > what I have found, debugging on WRSLinux (we set it up as a cross,
> > but it should be the same with native GNU/Linux distros):
> > 
> >     % gdb a_test
> >     (gdb) break a_test.adb:30
> >     (gdb) break a_test.adb:39
> >     (gdb) target remote my_board:4444
> >     (gdb) continue
> >     Continuing.
> >     [...]
> >     [New Thread 866.868]
> >     [New Thread 866.869]
> >     [New Thread 870.870]
> >     /[...]/gdb/thread.c:89: internal-error: thread_info* inferior_thread(): Assertion `tp' failed.
> >     A problem internal to GDB has been detected,
> >     further debugging may prove unreliable.
> >     Quit this debugging session? (y or n) 
> > 
> > The error happens because GDBserver returns a list of threads
> > to GDB where a new thread as a different PID (870 in the case
> > above, instead of 866).
> 
> This is not supposed to happen.  In remote.c:remote_update_thread_list
> there is a call to remove_new_fork_children that is explicitly supposed
> to prevent this scenario.  The new fork child thread should be deleted
> from the thread list ("context") before we call remote_notice_new_inferior.
> 
> We don't want the remote to report new threads related to the fork child
> until after we have handled the fork using infrun.c:follow_fork. (Note:
> it looks like the function comment for remove_new_fork_children is
> stale, since there are scenarios other than being stopped at a fork
> catchpoint where this can occur.)
> 
> > Any advice on how I should be fixing the issue?
> 
> It looks like there is a case that remove_new_fork_children isn't
> handling correctly with your test+target(s). I'd start there to find out
> how the new thread is getting through.

I started by looking at remove_new_fork_children, which, in essence,
remove threads whose target_waitstatus->kind is fork or vforked.
And for that to happen, I think the remote (gdbserver) has to send
GDB a stop:fork packet, something like this:

    Packet received: T05fork:p4bc4.4bc4;06:0100000000000000;07:d0ddffffff7f0000;10:6a4f83f7ff7f0000;thread:p4ba5.4ba5;core:5;

However, this is not what happens, in our case. When continuing,
we just receive a SIGTRAP event, at which point GDB asks for the
thread list, and by then, the thread for the forked process is
in the list, leading to the situation described above.

Deciphering the gdbserver logs, I think I understand why this is
happening. After we're done single-stepping out of our breakpoint,
and we have resumed our program we go into a wait loop in
linux_wait_for_event_filtered, which, as I understand it,
pulls event from the kernel, but without doing a blocking wait
(WNOHANG), and as long as events are found, we keep waiting
for more:

  /* Always pull all events out of the kernel.  We'll randomly select
     an event LWP out of all that have events, to prevent
     starvation.  */
  while (event_child == NULL)
    {
      [...]
      ret = my_waitpid (-1, wstatp, options | WNOHANG);
      if (ret > 0)
        {
          [...]
          continue;
        }

After resuming, we go through a couple of loops where I think
we discover the new threads that are getting started. Each
time, we resume the threads via:

      /* Now that we've pulled all events out of the kernel, resume
         LWPs that don't have an interesting event to report.  */
      if (stopping_threads == NOT_STOPPING_THREADS)
        for_each_inferior (&all_threads, resume_stopped_resumed_lwps);

But shortly after, we get the following batch of events:

| leader_pid=2178, leader_lp!=NULL=1, num_lwps=3, zombie=0
| sigsuspend'ing
| sigchld_handler
| my_waitpid (-1, 0x40000001)
| my_waitpid (-1, 0x40000001): status(137f), 2190
| LWFE: waitpid(-1, ...) returned 2190, ERRNO-OK
| LLW: waitpid 2190 received Stopped (signal) (stopped)
| my_waitpid (-1, 0x40000001)
| my_waitpid (-1, 0x40000001): status(57f), 2189
| LWFE: waitpid(-1, ...) returned 2189, ERRNO-OK
| LLW: waitpid 2189 received Trace/breakpoint trap (stopped)
| pc is 0x1000336c
| CSBB: LWP 2178.2189 stopped by software breakpoint
| my_waitpid (-1, 0x40000001)
| my_waitpid (-1, 0x40000001): status(57f), 2188
| LWFE: waitpid(-1, ...) returned 2188, ERRNO-OK
| LLW: waitpid 2188 received Trace/breakpoint trap (stopped)
| pc is 0x10003218
| CSBB: LWP 2178.2188 stopped by software breakpoint
| my_waitpid (-1, 0x40000001)
| my_waitpid (-1, 0x40000001): status(1057f), 2178
| LWFE: waitpid(-1, ...) returned 2178, ERRNO-OK
| LLW: waitpid 2178 received Trace/breakpoint trap (stopped)
| pc is 0xfe87158
| HEW: Got fork event from LWP 2178, new child is 2190  <<<<--- fork!
| 
| pc is 0xfe87158
| my_waitpid (-1, 0x40000001)
| my_waitpid (-1, 0x40000001): status(ffffffff), 0
| LWFE: waitpid(-1, ...) returned 0, ERRNO-OK

This tells me that we get 3 SIGTRAP events from our program's
threads hitting a breakpoint, *and* the fork event. The fork
event is confirmed by some logs printed right after:

| LWP 2178: extended event with waitstatus status->kind = forked
| Hit a non-gdbserver trap event.

Now that we've have interesting stuff to report, we decide to
stop all LWPS:

| >>>> entering stop_all_lwps
| stop_all_lwps (stop, except=none)
| wait_for_sigstop: pulling events
| my_waitpid (-1, 0x40000001)
| my_waitpid (-1, 0x40000001): status(ffffffff), 0
| LWFE: waitpid(-1, ...) returned 0, ERRNO-OK
| leader_pid=2178, leader_lp!=NULL=1, num_lwps=3, zombie=0
| leader_pid=2190, leader_lp!=NULL=1, num_lwps=1, zombie=0
| LLW: exit (no unwaited-for LWP)
| stop_all_lwps done, setting stopping_threads back to !stopping
| <<<< exiting stop_all_lwps

Because all our threads had hit a breakpoint, we can see above
that none of the lwps got sent a sigstop signal, and therefore
waitpid returns zero as expected - confirmed by "(no unwaited-for LWP)".

And this is where I think things go wrong - instead of reporting
the fork event to GDB, we look at all our threads, find that none
of them was being single-stepped, and so select one of the threads
that got a SIGTRAP at random, reporting the event of that thread:

| SEL: Found 3 SIGTRAP events, selecting #1
| Checking whether LWP 2178 needs to move out of the jump pad...no
| Checking whether LWP 2190 needs to move out of the jump pad...no
| linux_wait_1 ret = LWP 2178.2188, 1, 5
| <<<< exiting linux_wait_1
| Writing resume reply for LWP 2178.2188:1
| putpkt ("$T05swbreak:;01:482009e0;40:10003218;thread:p882.88c;core:0;#75"); [noack mode]

This is in linux-low.c::select_event_lwp

I'm thinking the right thing to do, here is to enhance select_event_lwp
to look for threads that received a fork/vfork event first, and report
that event to gdb ahead of all the other kinds of events.

But I ran out of time to try this today - so I decided to send my notes
about this so far. I hope I am on the right track!

-- 
Joel

Follow-Ups:
- RFA/gdbserver: GDB internal-error debugging threaded program with breakpoint and forks (was: "Re: RFH: failed assert debugging threaded+fork program over gdbserver")
  - From: Joel Brobecker
- RE: RFH: failed assert debugging threaded+fork program over gdbserver
  - From: Breazeal, Don

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]