This is the mail archive of the libc-help@sourceware.org mailing list
for the glibc project.
fork hang with corrupted list_all_lock
- From: "Wayne H. Badger" <badger at yahoo-inc dot com>
- To: libc-help at sourceware dot org
- Date: Fri, 25 Jun 2010 17:42:42 -0500
- Subject: fork hang with corrupted list_all_lock
I have discovered an anomaly whose investigation has led to glibc, and
I'm wondering if this has been seen before.
I have a cluster of machines running RHEL5.4 (glibc-2.5 based) on
Nehalem E5530 processors (16 hyperthreaded CPUs, stepping 5) that are
running a Java process (Hadoop TaskTracker). TaskTracker is 32-bit and
multithreaded (~80 threads). The kernel is 64-bit, version
2.6.18-164.2.1.el5.
I have caught the process in a relatively rare event that is one of
those "can't happen" scenarios.
Whenever a process forks, __libc_fork
(nptl/sysdeps/unix/sysv/linux/fork.c) calls _IO_list_lock() to acquire
list_all_lock before calling the fork system call. list_all_lock
contains three fields: lock, cnt, and owner. After the fork system
call, the child resets the lock and the parent releases it. Normally,
this works as you would expect, but when it fails, the parent's lock is
zeroed (.lock=0, .cnt=0, .owner=0) and when subsequently released,
results in a lock in an invalid state. At that time, the lock has these
values:
list_all_lock.lock: 2
list_all_lock.cnt: -1
list_all_lock.owner: <thread that released the lock>
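
The following is a minimal single-threaded sketch of one sequence that
could end with cnt at -1 and a stale owner. It is not glibc's code: it
assumes a recursive lock whose release path only frees the lock when the
count reaches exactly zero, and every name in it (model_lock,
model_acquire, model_release, corrupted_release_scenario) is
hypothetical. It also assumes the zeroing happens while the first thread
still holds the lock, and that a second thread then takes and releases
the lock once.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a recursive lock with the three fields
   named in the report. Not the glibc definition. */
struct model_lock {
    int lock;     /* 0 = free, 1 = held (no blocking in this model) */
    int cnt;      /* recursion count */
    void *owner;  /* identity of the holding thread */
};

/* Acquire: take the low-level lock only if we are not the owner. */
void model_acquire(struct model_lock *l, void *self)
{
    if (l->owner != self) {
        l->lock = 1;  /* a real lock would block here if already held */
        l->owner = self;
    }
    ++l->cnt;
}

/* Release: free the low-level lock only when the count hits zero. */
void model_release(struct model_lock *l)
{
    if (--l->cnt == 0) {
        l->owner = NULL;
        l->lock = 0;
    }
}

/* Thread A takes the lock, the lock is zeroed underneath it, A
   releases, then thread B takes and releases the lock once. */
struct model_lock corrupted_release_scenario(void *thread_a, void *thread_b)
{
    struct model_lock l = {0, 0, NULL};

    assert(thread_a != NULL && thread_b != NULL);

    model_acquire(&l, thread_a);  /* lock=1, cnt=1, owner=A */
    memset(&l, 0, sizeof l);      /* the assumed corruption */
    model_release(&l);            /* cnt goes 0 -> -1, nothing is freed */

    model_acquire(&l, thread_b);  /* lock=1, cnt=0, owner=B */
    model_release(&l);            /* cnt goes 0 -> -1 again, lock never freed */

    return l;
}
```

In this model the final state is lock=1, cnt=-1, owner=B, and no later
release can ever bring cnt back to exactly zero. In a futex-style lock a
value of 2 rather than 1 would typically mean that other threads are
waiting, which the single-threaded model does not simulate.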
From this state, no additional forks can be made. Many of the threads
in the process are waiting for a lock in the malloc code
(malloc_atfork) that runs when a fork is currently outstanding. The
process is hung at this point.
So, the "can't happen" event is that some thread/process has scrozzled
the lock while it is being held by a thread. Unless there is some glibc
code that is just writing out-of-bounds zeroes, it looks like the lock
is being reset with _IO_list_resetlock(). Since only the child calls
this code in its own address space, it ought not affect the parent's
version of the lock.
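
The expected behavior can be sketched with a small test, assuming an
ordinary fork() with copy-on-write address spaces. The structure and
function names (demo_lock, child_reset_leaves_parent_intact) are
hypothetical stand-ins, not glibc symbols. The child zeroes its copy of
a lock-like global and exits; the parent then checks that its own copy
is unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a lock with lock/cnt/owner fields. */
struct demo_lock {
    int lock;
    int cnt;
    void *owner;
};

static struct demo_lock demo = {0, 0, NULL};

/* Returns 1 if the parent's copy survives the child's reset,
   0 if it was modified, and -1 on a fork/wait error. */
int child_reset_leaves_parent_intact(void)
{
    int marker;  /* its address serves as a fake owner identity */

    /* Parent marks the lock as held before forking. */
    demo.lock = 1;
    demo.cnt = 1;
    demo.owner = &marker;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: reset its own copy, as a reset-after-fork would. */
        memset(&demo, 0, sizeof demo);
        _exit(demo.cnt == 0 ? 0 : 1);
    }

    assert(pid > 0);  /* only the parent reaches this point */

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* Parent: its copy should still show the lock as held. */
    int intact = demo.lock == 1 && demo.cnt == 1 && demo.owner == &marker;

    /* Restore the global so the function can be called again. */
    demo.lock = 0;
    demo.cnt = 0;
    demo.owner = NULL;

    return intact;
}
```

On a correctly behaving system the function should return 1, which is
why a zeroed lock in the parent is so surprising.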
This anomaly occurs only when running RHEL5.4 on the Nehalem
processors. I have not been able to reproduce the issue running either
RHEL5.4 or RHEL5.1 on older E5420 processors.
Remediations tried so far have all resulted in the same TaskTracker
hang:
* latest java (jdk1.6_20)
* set UseMemBar in java
* use latest microcode from Intel for E5530
* restrict the CPU set to all CPUs on a single processor
* disable HyperThreading in the BIOS
* latest RHEL glibc: glibc-2.5-49
I have tried a couple of tests that resulted in the issue not being
reproduced:
* restrict all threads to the same CPU
* add glibc debugging so that the cache line containing list_all_lock
  was rearranged
I have looked at
http://sourceware.org/ml/libc-hacker/2007-02/msg00009.html, but this
doesn't seem quite like the issue that I'm seeing. If that were the
bug, then I would expect to see a deadlock situation, not corrupted
lock fields.
While it looks like this may be a silicon bug, it is possible that it
is not, and so I'm looking for anyone who might have seen this kind of
behavior in glibc.
Wayne
--
Wayne Badger
Yahoo!