
Re: fix cond_race... was RE: src/winsup/cygwin ChangeLog thread.cc thread.h ...


I'm fresh out of ideas.

Here's the system state of a crash (thread number, then the lock it's waiting on):
1  - lock=0xa0cad00
2  - cygwin signal thread
3,4,5,6,7...17, 19 - lock=0xa060050 - fast_cfunction
18 - lock=0xa069868 - restore thread, from posix open
20 - not sure

Now, your python test is meant to create 20 threads, but AFAICT it has only
created 18 (main thread + signal thread + 18 python threads).

Now, the cond variable for lock 0xa060050 shows 14 waiting threads, not the
16 listed above, so we've gotten out of sync on the wait count. Barring code
crashes, the only place that can happen is the Signal spinlock. However,
unless gcc is doing something _funky_, every InterlockedIncrement is matched
by an InterlockedDecrement, and InterlockedIncrement is always called exactly
once.
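
For reference, here's a rough sketch of the kind of bookkeeping I mean. The
names and layout are made up for illustration - this is not the actual
winsup/cygwin/thread.cc code:

    /* Illustrative only: hypothetical waiter bookkeeping, not thread.cc.   */
    #include <windows.h>

    typedef struct
    {
      LONG waiting;         /* threads that have announced a wait           */
      HANDLE win32_obj_id;  /* event the waiters block on with WFSO         */
    } cond_sketch;

    void
    cond_wait_sketch (cond_sketch *cv, HANDLE mutex)
    {
      InterlockedIncrement (&cv->waiting);    /* every increment...         */
      ReleaseMutex (mutex);
      WaitForSingleObject (cv->win32_obj_id, INFINITE);   /* the WFSO       */
      InterlockedDecrement (&cv->waiting);    /* ...is matched by exactly
                                                 one decrement              */
      WaitForSingleObject (mutex, INFINITE);  /* re-acquire before return   */
    }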

As to why the waiters count is out by two... the implication is that two too
many threads were released by the Spinlock code and it didn't detect that
occurring. Or one too many, twice. I can't see that this would negatively
affect Python. (I've checked the python pthread header, and the returning
calls are in a loop on thelock->locked, which will serialise the returning
threads within python, because pthread_cond_wait will block until it has the
mutex.)

Now, the crash appears to occur during the reaping of processes.

Here's another system state:
thread 4 - unknown - system addresses only; could be a thread that has
exited and not been joined yet.
thread 3 - waiting on lock=0xa0727f8
thread 2 - signal thread
thread 1 - waiting on lock=0xa05de88

In this example, it's clearly wedged. The state of the cond variable that
thread 3 is waiting on is internally consistent - waiting == 1 - as is the
cond that thread 1 is waiting on. It _may_ be deadlocked, but as I don't know
whether there is any cross-reference between the two python locks, I can't
comment. The telling detail is that thread 3 can only have reached its
current position if the lock was locked (locked == 1) when it tried to grab
it, and a race occurred with the current owner's unlocking and signalling of
the unlock.

Now that race *could* be a cygwin thing, but I suspect python...

It would be nice if you could show me enough python debug internals to
understand what exact python code each thread is actually stuck on - at
the moment I'm not sure how far through the test sequence it's actually
getting.

The C code involved in the release simply locks the lock's mutex, sets
locked to 0, unlocks the mutex, and signals the cond variable. The acquirer,
meanwhile, grabs the mutex, tests the locked variable's state, and calls
cond_wait IFF it's still locked.
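
Paraphrased (not copied verbatim from Python's thread_pthread.h), the
pattern looks roughly like this:

    /* Paraphrase of the release/acquire pattern described above; the field
       and function names follow the text, not Python's sources.            */
    #include <pthread.h>

    typedef struct
    {
      char locked;                   /* 0 = free, 1 = held                   */
      pthread_mutex_t mut;
      pthread_cond_t  lock_released;
    } py_lock_sketch;

    void
    release_sketch (py_lock_sketch *thelock)
    {
      pthread_mutex_lock (&thelock->mut);
      thelock->locked = 0;                            /* mark it free        */
      pthread_mutex_unlock (&thelock->mut);
      pthread_cond_signal (&thelock->lock_released);  /* wake a waiter       */
    }

    void
    acquire_sketch (py_lock_sketch *thelock)
    {
      pthread_mutex_lock (&thelock->mut);
      while (thelock->locked)                         /* test before waiting */
        pthread_cond_wait (&thelock->lock_released, &thelock->mut);
      thelock->locked = 1;
      pthread_mutex_unlock (&thelock->mut);
    }

A waiter that wakes up and finds locked still set simply re-enters
pthread_cond_wait, which is the serialisation I mentioned above.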

So let's look at this:
python uses a mutex (no known races in cygwin) to protect the locked member.
That member is tested _before calling_ pthread_cond_wait, on line 313 of the
thread_pthread.h file.

So to have a Signal-and-Wait race occurring, you need the value of locked to
propagate slowly (the SMP case), or else cond_wait is never called (if there
is one acquiring thread), or else cygwin is flaky. If there are n acquiring
threads, then n-1 may enter pthread_cond_wait, but one is guaranteed to see
the correct value and execute its code. That thread then calls release in its
turn, and now (worst case) all the n-1 threads are listed as waiting in
cygwin but haven't called WFSO yet. If even one has called WFSO, then no race
occurs. And if the race does occur - well, you've tested with up to 1000
iterations on the spinlock and still had it crash. That's more than enough to
allow a single thread with only a few lines of C left to go before calling
WFSO to get there. (And remember that access to that code block is serialised
via owning the mutex in the first place.) In short, I still cannot see how
this can actually manage to fail (i.e. in 500 Sleep (1) calls every one of
those 20 worker threads should have gotten a time slice, locked and released
the mutex (40 timeslices), and then called WFSO the second time around).

BTW: in thread.cc in your cygwin source, you might like to change s = !s to
s = (s + 1) % 2; I'm not sure that !s is guaranteed to give you Sleep (1) :}.
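
Something like this is what I mean - purely illustrative, I'm guessing at
the exact shape of the loop in your tree:

    /* Illustrative only: a guess at the shape of the spin loop in question. */
    #include <windows.h>

    static void
    spin_sketch (volatile LONG *flag)
    {
      int s = 0;
      while (*flag)           /* whatever the real loop actually tests       */
        {
          Sleep (s);          /* alternate Sleep (0) and Sleep (1)           */
          s = (s + 1) % 2;    /* the suggested replacement for s = !s        */
        }
    }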

Now, the python thread header seems a little buggy to me.
1) The lock struct member locked should be volatile. (Do you have an SMP
system, Jason?)
2) ..acquire_lock can exit with the lock marked as owned, but still return an
error. If the calling function assumes that an error from acquire_lock means
it failed to acquire the lock, the lock will get wedged, and python will
appear to hang. (The "if (error) success = 0;" line is the culprit; it should
also reset thelock->locked, IMO.) This is congruent with the symptoms I see
in your case - see the sketch below.
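
Purely as illustration, here's a rough sketch of 1) and 2) together. It
reuses the py_lock_sketch layout from the earlier sketch (with locked made
volatile); the real acquire_lock in thread_pthread.h is laid out differently,
this only shows the idea of keeping the locked flag consistent with the
return value:

    /* Hypothetical sketch of suggestions 1) and 2); not the real acquire_lock. */
    #include <pthread.h>

    typedef struct
    {
      volatile char locked;          /* 1) volatile, for the SMP case          */
      pthread_mutex_t mut;
      pthread_cond_t  lock_released;
    } py_lock_sketch;

    int
    acquire_sketch (py_lock_sketch *thelock)
    {
      int error = pthread_mutex_lock (&thelock->mut);
      if (error)
        return 0;                    /* never got the mutex; nothing to undo   */

      while (!error && thelock->locked)
        error = pthread_cond_wait (&thelock->lock_released, &thelock->mut);

      if (!error)
        thelock->locked = 1;         /* mark it owned only on a success return */
      else
        thelock->locked = 0;         /* 2) on an error return, reset locked so
                                        the caller's "I didn't get it" view
                                        matches the lock's state               */
      pthread_mutex_unlock (&thelock->mut);
      return !error;                 /* nonzero means the lock was acquired    */
    }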

So in short, I'm not at all convinced this is a _cygwin_ problem.

I'd suggest fixing 1) if you have dual processors, and 2) regardless. My
current bet is that python is getting an error of some sort from pthreads and
handling it badly. (Although this assumes that stderr is still redirected,
else I can't explain why the perror() call that should have resulted is not
displayed.)

If it still occurs after that, I'll spend some more time and get serialised
blow-by-blow activity logged to syslog so we can see exactly what's going on.

The only thing I can think of that could be affecting this is if NT always
yields a timeslice when LeaveCriticalSection is called, and even that
shouldn't cause the symptoms you're seeing.

Rob

