This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: race condition in dl code


Hi,

On Wed, Sep 06, 2006 at 03:22:06PM +0530, Sripathi Kodi wrote:
> Hi,
> 
> We  are seeing  a race  condition in  dl code  resulting in  our application
> crashing  with sigsegv.  We have  seen the  problem only  while running  -rt
> (CONFIG_PREEMPT_RT)  kernels. We  have seen  the problem  on glibc-2.4.90-17
> from FC6 as well as latest glibc from cvs.
> 
> Backtrace of the application thread that crashes is as follows. It is a Java
> based testcase.
> 
> (gdb) bt
> #0  0xffffe410 in __kernel_vsyscall ()
> #1  0x4f14c546 in __nanosleep_nocancel () from /lib/libc.so.6
> #2  0x4f14c37b in sleep () from /lib/libc.so.6
> #3  0xb7e65724 in Hack_print_info_before_abort (whichCase=0xb7e71898 "CASE 1: not-SI_USER", signal=11, sigInfo=0xb7ee4e0c,
>     context=0xb7ee4e8c, portLibType=4, previousSignal=0x0) at j9signal.c:414
> #4  0xb7e665dc in masterSynchSignalHandler (signal=11, sigInfo=0xb7ee4e0c, contextInfo=0xb7ee4e8c) at j9signal.c:1134
> #5  <signal handler called>
> #6  0x4ef694ad in do_lookup_x () from /lib/ld-linux.so.2
> #7  0x4ef69883 in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
> #8  0x4ef6d8c4 in _dl_fixup () from /lib/ld-linux.so.2
> #9  0x4ef72cb0 in _dl_runtime_resolve () from /lib/ld-linux.so.2
> #10 0xb7ee9058 in thread_wrapper (arg=0x805bd5c) at j9thread.c:933
> #11 0x4b359f9a in start_thread () from /lib/libpthread.so.0
> #12 0x4f18abae in clone () from /lib/libc.so.6
> 
> Our analysis:
> _dl_fixup  passes   link_map  structure  (l)   as  well  as   l->l_scope  to
> _dl_lookup_symbol_x. In  the link_map structure  passed to fixup,  the field
> l_scope_mem[4]  is  the  statically  allocated  memory  for  the  first  few
> 'r_scope_elem's. When this gets filled  up, more r_scope_elems are allocated
> and attached at 'l_scope'. Memory pointed  by l_scope too can get exhausted.
> When this  happens, this  memory is  doubled using realloc  in this  part of
> dl_open_worker() function:
> 
>               else
>                 {
>                   newp = (struct r_scope_elem **)
>                     realloc (imap->l_scope,
>                              new_size * sizeof (struct r_scope_elem *));
>                   if (newp == NULL)
>                     _dl_signal_error (ENOMEM, "dlopen", NULL,
>                                       N_("cannot create scope list"));
>                   imap->l_scope = newp;
>                 }
> 
>               imap->l_scope_max = new_size;
>             }
> 
> So this  reallocs the  memory for  l_scope (doubling  it's size)  and points
> l_scope to  the new  area of  memory. Old memory  is now  free for  reuse by
> malloc code.
> 
> Sequence of events happening in our testcase is:
> i) Thread  A calls  _dl_lookup_symbol_x from _dl_fixup,  passing it  'l' and
> 'l->l_scope' as parameters. It is preempted before _dl_lookup_symbol_x runs.
> 
> ii) Thread B does (possibly many)  dlopen()s, it exhausts the memory pointed
> to by l_scope. Hence dl_open_worker() reallocs this memory.
> 
> iii) Memory  previously allocated  to l_scope is  reused. However,  thread A
> still  has a  reference to  this area  of memory  through l_scope  passed to
> _dl_lookup_symbol_x.
> 
> iv) Thread  A tries  to access memory  pointed to by  it's copy  of l_scope,
> hence gets a SEGV.
> 
> Occasionally, when  it doesn't  get a  SEGV, thread A  fails to  resolve the
> symbol  and prints  out a  message like  "undefined symbol:  printf, version
> GLIBC_2.0"
> 
> We  feel that  l_scope will  have to  be protected  with a  lock (preferably
> read-write lock) to solve this problem. The idea is to make the thread doing
> dlopen  to hold  a write  lock on  l_scope before  doing realloc.  This will
> prevent  memory being  freed from  under the  feet of  reader threads  doing
> symbol lookup holding read  lock on l_scope. I can send such  a patch if the
> idea is acceptable. If the idea of introducing a lock is not good, could you
> please suggest any other way to solve this problem?

I observed  that __dlsym  uses _dl_load_lock  to protect  against concurrent
loads/unloads of libraries. I tried with  the same lock in fixup code around
the call  to _dl_lookup_symbol_x(). With  this, our testcase is  not hitting
the race  condition I have  explained above. Is  this the right  approach to
solve the problem?

Following is the patch I am running with:

diff -uprN libc_org/elf/dl-runtime.c libc/elf/dl-runtime.c
--- libc_org/elf/dl-runtime.c	2006-09-05 00:25:24.000000000 -0500
+++ libc/elf/dl-runtime.c	2006-09-08 12:35:39.000000000 -0500
@@ -92,9 +92,13 @@ _dl_fixup (
 	    version = NULL;
 	}
 
+     /* Protect against concurrent loads and unloads.  */ 
+      __rtld_lock_lock_recursive (GL(dl_load_lock));
+
       result = _dl_lookup_symbol_x (strtab + sym->st_name, l, &sym,
 				    l->l_scope, version, ELF_RTYPE_CLASS_PLT,
 				    DL_LOOKUP_ADD_DEPENDENCY, NULL);
+      __rtld_lock_unlock_recursive (GL(dl_load_lock));
 
       /* Currently result contains the base load address (or link map)
 	 of the object that defines sym.  Now add in the symbol
@@ -174,11 +178,17 @@ _dl_profile_fixup (
 		version = NULL;
 	    }
 
+	  /* Protect against concurrent loads and unloads.  */
+	  __rtld_lock_lock_recursive (GL(dl_load_lock));
+
+
 	  result = _dl_lookup_symbol_x (strtab + refsym->st_name, l, &defsym,
 					l->l_scope, version,
 					ELF_RTYPE_CLASS_PLT,
 					DL_LOOKUP_ADD_DEPENDENCY, NULL);
 
+	  __rtld_lock_unlock_recursive (GL(dl_load_lock));
+
 	  /* Currently result contains the base load address (or link map)
 	     of the object that defines sym.  Now add in the symbol
 	     offset.  */


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]