This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: race condition in dl code
- From: Sripathi Kodi <sripathik at in dot ibm dot com>
- To: libc-alpha at sources dot redhat dot com
- Cc: drepper at redhat dot com
- Date: Fri, 8 Sep 2006 23:22:17 +0530
- Subject: Re: race condition in dl code
- References: <20060906095206.GC11699@in.ibm.com>
- Reply-to: sripathik at in dot ibm dot com
Hi,
On Wed, Sep 06, 2006 at 03:22:06PM +0530, Sripathi Kodi wrote:
> Hi,
>
> We are seeing a race condition in dl code resulting in our application
> crashing with sigsegv. We have seen the problem only while running -rt
> (CONFIG_PREEMPT_RT) kernels. We have seen the problem on glibc-2.4.90-17
> from FC6 as well as latest glibc from cvs.
>
> Backtrace of the application thread that crashes is as follows. It is a Java
> based testcase.
>
> (gdb) bt
> #0 0xffffe410 in __kernel_vsyscall ()
> #1 0x4f14c546 in __nanosleep_nocancel () from /lib/libc.so.6
> #2 0x4f14c37b in sleep () from /lib/libc.so.6
> #3 0xb7e65724 in Hack_print_info_before_abort (whichCase=0xb7e71898 "CASE 1: not-SI_USER", signal=11, sigInfo=0xb7ee4e0c,
> context=0xb7ee4e8c, portLibType=4, previousSignal=0x0) at j9signal.c:414
> #4 0xb7e665dc in masterSynchSignalHandler (signal=11, sigInfo=0xb7ee4e0c, contextInfo=0xb7ee4e8c) at j9signal.c:1134
> #5 <signal handler called>
> #6 0x4ef694ad in do_lookup_x () from /lib/ld-linux.so.2
> #7 0x4ef69883 in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
> #8 0x4ef6d8c4 in _dl_fixup () from /lib/ld-linux.so.2
> #9 0x4ef72cb0 in _dl_runtime_resolve () from /lib/ld-linux.so.2
> #10 0xb7ee9058 in thread_wrapper (arg=0x805bd5c) at j9thread.c:933
> #11 0x4b359f9a in start_thread () from /lib/libpthread.so.0
> #12 0x4f18abae in clone () from /lib/libc.so.6
>
> Our analysis:
> _dl_fixup passes link_map structure (l) as well as l->l_scope to
> _dl_lookup_symbol_x. In the link_map structure passed to fixup, the field
> l_scope_mem[4] is the statically allocated memory for the first few
> 'r_scope_elem's. When this gets filled up, more r_scope_elems are allocated
> and attached at 'l_scope'. Memory pointed by l_scope too can get exhausted.
> When this happens, this memory is doubled using realloc in this part of
> dl_open_worker() function:
>
> else
> {
> newp = (struct r_scope_elem **)
> realloc (imap->l_scope,
> new_size * sizeof (struct r_scope_elem *));
> if (newp == NULL)
> _dl_signal_error (ENOMEM, "dlopen", NULL,
> N_("cannot create scope list"));
> imap->l_scope = newp;
> }
>
> imap->l_scope_max = new_size;
> }
>
> So this reallocs the memory for l_scope (doubling it's size) and points
> l_scope to the new area of memory. Old memory is now free for reuse by
> malloc code.
>
> Sequence of events happening in our testcase is:
> i) Thread A calls _dl_lookup_symbol_x from _dl_fixup, passing it 'l' and
> 'l->l_scope' as parameters. It is preempted before _dl_lookup_symbol_x runs.
>
> ii) Thread B does (possibly many) dlopen()s, it exhausts the memory pointed
> to by l_scope. Hence dl_open_worker() reallocs this memory.
>
> iii) Memory previously allocated to l_scope is reused. However, thread A
> still has a reference to this area of memory through l_scope passed to
> _dl_lookup_symbol_x.
>
> iv) Thread A tries to access memory pointed to by it's copy of l_scope,
> hence gets a SEGV.
>
> Occasionally, when it doesn't get a SEGV, thread A fails to resolve the
> symbol and prints out a message like "undefined symbol: printf, version
> GLIBC_2.0"
>
> We feel that l_scope will have to be protected with a lock (preferably
> read-write lock) to solve this problem. The idea is to make the thread doing
> dlopen to hold a write lock on l_scope before doing realloc. This will
> prevent memory being freed from under the feet of reader threads doing
> symbol lookup holding read lock on l_scope. I can send such a patch if the
> idea is acceptable. If the idea of introducing a lock is not good, could you
> please suggest any other way to solve this problem?
I observed that __dlsym uses _dl_load_lock to protect against concurrent
loads/unloads of libraries. I tried with the same lock in fixup code around
the call to _dl_lookup_symbol_x(). With this, our testcase is not hitting
the race condition I have explained above. Is this the right approach to
solve the problem?
Following is the patch I am running with:
diff -uprN libc_org/elf/dl-runtime.c libc/elf/dl-runtime.c
--- libc_org/elf/dl-runtime.c 2006-09-05 00:25:24.000000000 -0500
+++ libc/elf/dl-runtime.c 2006-09-08 12:35:39.000000000 -0500
@@ -92,9 +92,13 @@ _dl_fixup (
version = NULL;
}
+ /* Protect against concurrent loads and unloads. */
+ __rtld_lock_lock_recursive (GL(dl_load_lock));
+
result = _dl_lookup_symbol_x (strtab + sym->st_name, l, &sym,
l->l_scope, version, ELF_RTYPE_CLASS_PLT,
DL_LOOKUP_ADD_DEPENDENCY, NULL);
+ __rtld_lock_unlock_recursive (GL(dl_load_lock));
/* Currently result contains the base load address (or link map)
of the object that defines sym. Now add in the symbol
@@ -174,11 +178,17 @@ _dl_profile_fixup (
version = NULL;
}
+ /* Protect against concurrent loads and unloads. */
+ __rtld_lock_lock_recursive (GL(dl_load_lock));
+
+
result = _dl_lookup_symbol_x (strtab + refsym->st_name, l, &defsym,
l->l_scope, version,
ELF_RTYPE_CLASS_PLT,
DL_LOOKUP_ADD_DEPENDENCY, NULL);
+ __rtld_lock_unlock_recursive (GL(dl_load_lock));
+
/* Currently result contains the base load address (or link map)
of the object that defines sym. Now add in the symbol
offset. */