This is the mail archive of the gdb-patches@sources.redhat.com mailing list for the GDB project.



[RFC] pb unwinding from pthread_cond_wait on ppc-linux (RFA?)


Hello,

We just noticed the following problem with gdb-6.3 on powerpc-linux.
To reproduce it, use the code provided at the end of this message,
and compile it using the following command:

        % gnatmake -g term

The following scenario shows that GDB is failing to compute the callstack
for a task doing a delay:

        (gdb) b term.adb:33
        Breakpoint 1 at 0x10002a4c: file term.adb, line 33.
        (gdb) run
        Starting program: /[...]/term
        [Thread debugging using libthread_db enabled]
        [New Thread 805493568 (LWP 15202)]
        [New Thread 807594784 (LWP 15205)]
        [New Thread 809691936 (LWP 15206)]
        [Thread 807594784 (LWP 15205) exited]
        [Switching to Thread 805493568 (LWP 15202)]

        Breakpoint 1, term () at term.adb:33
        33         delay 1.0;  -- STOP after STA has terminated.
        (gdb) task 3
 !!! -> Previous frame inner to this frame (corrupt stack?)
        (gdb) bt
        #0  0x0ffe47d8 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
           from /lib/tls/libpthread.so.0
        #1  0x0ffe47bc in pthread_cond_timedwait@@GLIBC_2.3.2 ()
           from /lib/tls/libpthread.so.0
 !!! -> #2  0x0ffe47bc in pthread_cond_timedwait@@GLIBC_2.3.2 ()
           from /lib/tls/libpthread.so.0
        #3  0x0ffe47bc in pthread_cond_timedwait@@GLIBC_2.3.2 ()
           from /lib/tls/libpthread.so.0
        #4  0x0ffe47bc in pthread_cond_timedwait@@GLIBC_2.3.2 ()
           from /lib/tls/libpthread.so.0
        #5  0x0ffe47bc in pthread_cond_timedwait@@GLIBC_2.3.2 ()
           from /lib/tls/libpthread.so.0
        #6  0x0ffe47bc in pthread_cond_timedwait@@GLIBC_2.3.2 ()
           from /lib/tls/libpthread.so.0
        #7  0x0ffe47bc in pthread_cond_timedwait@@GLIBC_2.3.2 ()
           from /lib/tls/libpthread.so.0

Frame #2 and beyond are incorrect.

What we are trying to do in the scenario above is to switch
to task 3, which corresponds to Forever_Task. This task is
simply doing "delay 1.0" forever. The "task 3" command more
or less corresponds to "thread 3", except that we also added
a bit of code that goes up the stack for the user until it
finds a user frame, i.e. the first frame that belongs neither
to the system nor to the GNAT runtime.

This is the context in which the "Previous frame inner [...]" error
message is printed: GDB hits it while walking up the stack during the
task switch.

With gdb-6.0, we used to get the following callstack:

        #0  0x0ffe47d8 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
           from /lib/tls/libpthread.so.0
        #1  0x0ffe47bc in pthread_cond_timedwait@@GLIBC_2.3.2 ()
           from /lib/tls/libpthread.so.0
        #2  0x1000b2c4 in system__soft_links__tasking__timed_delay_t ()
        #3  0x1000e354 in ada__calendar__delays__delay_for ()
        #4  0x10002180 in term.forever_task (<_task>=0x10045eb0) at term.adb:17
        #5  0x10009510 in system__tasking__stages__task_wrapper ()
        #6  0x0ffe1048 in start_thread () from /lib/tls/libpthread.so.0
        #7  0x0ff5b70c in clone () from /lib/tls/libc.so.6

I think that gdb-6.0 worked partly by luck.
Here are the first few instructions of pthread_cond_timedwait():

    0x0ffe45b4 <pthread_cond_timedwait+0>:     stwu    r1,-128(r1)
    0x0ffe45b8 <pthread_cond_timedwait+4>:     mflr    r8
    0x0ffe45bc <pthread_cond_timedwait+8>:     lis     r7,15258
    0x0ffe45c0 <pthread_cond_timedwait+12>:    bl      0xffed01c
    0x0ffe45c4 <pthread_cond_timedwait+16>:    stw     r8,132(r1)

As you can see, the lr register is first copied into r8 @+4, shortly
before it is stored in its stack slot @+16. Unfortunately, the prologue
analyzer is confused by the branch @+12 and stops scanning the prologue
when it encounters it. See skip_prologue():

      else if ((op & 0xfc000001) == 0x48000001)
        {                       /* bl foo, 
                                   to save fprs??? */
           [...]
           continue;
        }

As a consequence, the following code in gdb-6.0 kicked in (see
rs6000_frame_saved_pc()):

  if (fdata.lr_offset == 0 && get_next_frame (fi) != NULL)
    {
      [...]
        return read_memory_addr (DEPRECATED_FRAME_CHAIN (fi)
                                 + tdep->lr_frame_offset,
                                 wordsize);

I interpret the above as: we didn't find where the lr was saved,
and we're not the innermost frame (frame #0), so the lr must have
been saved. And since the ABI says that it should be saved at the
base of the caller's frame + 4, we try our luck there.
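
To make the arithmetic concrete, here is a minimal sketch of why that
guess lands on the right slot for the prologue shown above. The helper
names are hypothetical (they are not GDB functions), and the frame size
(128) and store displacement (132) are simply the values read off the
disassembly of pthread_cond_timedwait().

```c
#include <stdint.h>

/* Illustration only: these helpers are hypothetical and are not part
   of GDB.  32-bit PowerPC is assumed, so one word is 4 bytes.  */

/* Address written by "stw rS,disp(r1)", given the value of r1 after
   the "stwu r1,-framesize(r1)" that opened the frame.  */
uint32_t
lr_slot_from_store (uint32_t sp_after_stwu, uint32_t disp)
{
  return sp_after_stwu + disp;
}

/* Address guessed by the fallback: the caller's frame base plus one
   word, which is where the ABI places the LR save word.  */
uint32_t
lr_slot_from_abi (uint32_t caller_sp, uint32_t wordsize)
{
  return caller_sp + wordsize;
}
```

For any stack pointer value sp after the stwu, the caller's frame base
is sp + 128, so lr_slot_from_store (sp, 132) and
lr_slot_from_abi (sp + 128, 4) name the same address. That is why the
old fallback found the right value even though the prologue scan had
given up before reaching the stw.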

This piece of code disappeared in 6.3, presumably during the transition
to the new frame code. Instead, we have:

  /* If != 0, fdata.lr_offset is the offset from the frame that
     holds the LR.  */
  if (fdata.lr_offset != 0)
    cache->saved_regs[tdep->ppc_lr_regnum].addr = cache->base + fdata.lr_offset;
  /* The PC is found in the link register.  */
  cache->saved_regs[PC_REGNUM] = cache->saved_regs[tdep->ppc_lr_regnum];

I tried putting the same logic back, as an experiment. Something like
this:

  else if (frame_relative_level (next_frame) >= 0)
    cache->saved_regs[tdep->ppc_lr_regnum].addr = cache->base + word_size;

(from memory). This fixed my problem, but it introduced some
regressions in callfunc.exp. It also left the bitter taste of
something that didn't look quite right.

At the same time, I don't understand too well the following code in
skip_prologue():

          /* Don't skip over the subroutine call if it is not within
             the first three instructions of the prologue.  */
          if ((pc - orig_pc) > 8)
            break;

This applies to the "bl" instruction. Why 3? In the
pthread_cond_timedwait function, the bl instruction is at +12
(i.e. one insn too late to be accepted as part of the prologue),
so I changed the limit from 8 to 12, and found that it fixed my problem
without introducing any regression in the testsuite.
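
Here is a toy model of the effect of that limit. It is only a sketch
and not GDB's skip_prologue(): the scanner, its name and the stw/mflr
matching are simplifications I made up for illustration, and the
instruction words were assembled by hand from the disassembly above
(so treat the encodings as an assumption). Only the "bl" mask test is
taken from the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same mask test as in skip_prologue(): primary opcode 18 with
   AA = 0 and LK = 1, i.e. a relative "bl".  */
static int
is_bl (uint32_t op)
{
  return (op & 0xfc000001u) == 0x48000001u;
}

/* Hypothetical simplification: match "stw rS,D(r1)" for a given rS.  */
static int
is_stw_to_sp (uint32_t op, unsigned reg)
{
  return (op & 0xfc1f0000u) == 0x90010000u && ((op >> 21) & 0x1fu) == reg;
}

/* First five instructions of pthread_cond_timedwait(), hand-encoded.  */
const uint32_t timedwait_prologue[5] = {
  0x9421ff80u,  /* stwu r1,-128(r1) */
  0x7d0802a6u,  /* mflr r8          */
  0x3ce03b9au,  /* lis  r7,15258    */
  0x48008a5du,  /* bl   0xffed01c   */
  0x91010084u   /* stw  r8,132(r1)  */
};

/* Return the raw displacement of the store that saves the copy of LR,
   or 0 if the scan stops before finding it.  A "bl" located more than
   BL_LIMIT bytes into the function ends the scan.  */
int
scan_for_lr_offset (const uint32_t *insns, size_t count, unsigned bl_limit)
{
  int lr_reg = -1;  /* Register holding the copy of LR, if any.  */

  assert (insns != NULL);
  for (size_t i = 0; i < count; i++)
    {
      uint32_t op = insns[i];
      unsigned offset = (unsigned) (i * 4);

      if ((op & 0xfc1fffffu) == 0x7c0802a6u)  /* mflr rN */
        lr_reg = (int) ((op >> 21) & 0x1fu);
      else if (is_bl (op))
        {
          if (offset > bl_limit)
            break;  /* Too far in: assume the prologue is over.  */
        }
      else if (lr_reg >= 0 && is_stw_to_sp (op, (unsigned) lr_reg))
        {
          int disp = (int) (op & 0xffffu);
          if (disp >= 0x8000)
            disp -= 0x10000;  /* Sign-extend the 16-bit field.  */
          return disp;
        }
    }
  return 0;
}
```

With a limit of 8, scan_for_lr_offset (timedwait_prologue, 5, 8) stops
at the bl @+12 and returns 0, so the lr save is never seen. With a
limit of 12, the scan continues past the bl and returns 132, the
displacement of the stw @+16.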

I feel that this second approach is a bit better than the first one.
What do you think?

2004-12-08  Joel Brobecker  <brobecker@gnat.com>

        * rs6000-tdep.c (skip_prologue): Allow bl instructions in the first
        four instructions of the function instead of the first three.

Tested on powerpc-linux, no regression.

-- 
Joel

procedure Term is

   task type Short_Task;
   type Short_Task_Access is access Short_Task;

   task type Forever_Task;
   type Forever_Task_Access is access Forever_Task;

   task body Short_Task is
   begin
      delay 0.2;
   end Short_Task;

   task body Forever_Task is
   begin
      loop
         delay 1.0;
      end loop;
   end Forever_Task;

   STA : Short_Task_Access;
   FTA : Forever_Task_Access;

begin

   STA := new Short_Task;
   FTA := new Forever_Task;

   while not STA'Terminated loop
      delay 1.0;
   end loop;

   delay 1.0;  -- line 33: STOP after STA has terminated.

end Term;

Attachment: rs6000-tdep.c.diff
Description: Text document

