This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug nptl/13165] pthread_cond_wait() can consume a signal that was sent before it started waiting


http://sourceware.org/bugzilla/show_bug.cgi?id=13165

--- Comment #13 from Mihail Mihaylov <mihaylov.mihail at gmail dot com> 2011-09-28 09:02:51 UTC ---
(In reply to comment #11)
> I've confirmed that the issue occurs on my Debian system with their libc6
> package (eglibc 2.13-10, albeit slightly different from glibc).

I originally observed the problem on Debian stable. I've also run my test case
on my laptop, which is running Mint, and on my office workstation, which is
running Kubuntu.

I looked at the eglibc source code before posting the bug and saw that the code
which causes the race is identical to the one in glibc, so the bug is in both
implementations.

> I've also confirmed that the issue does not occur with my implementation of
> condition variables in musl libc(*).

I took a look at your code. As far as I can tell, you are not trying as hard as
glibc to avoid spurious wakeups, which is why you don't have the same race.

> I suspect it's a real bug, but I need to read the code more closely to
> understand what's going on...

Here is my understanding of the root cause: an attempt to prevent spurious
wakeups has gone too far and destroys ordering, waking future waiters instead
of present ones.

There are two checks that NPTL uses to prevent spurious wakeups:

1) It only allows a thread to wake if a signal has been sent after it started
waiting. This is achieved by checking whether cond->__data.__wakeup_seq has
changed since the thread started waiting; if it has remained unchanged, the
wakeup is treated as spurious.

2) It only allows as many threads to wake up as there were signals. This is
achieved by checking whether cond->__data.__woken_seq equals
cond->__data.__wakeup_seq; if the two are equal, the wakeup is treated as
spurious.

If either of these checks indicates a spurious wakeup, the thread retries the
wait.
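
The two checks can be sketched as a small single-threaded model. This is not
the glibc code: cond_model, may_leave and consume_signal are hypothetical
names, and only the two counters named above are modelled (the real structure
has more fields and its own locking).

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the counters in cond->__data (assumption:
   only the two fields discussed above matter for the checks). */
struct cond_model {
    unsigned long long wakeup_seq; /* in this model: one increment per signal */
    unsigned long long woken_seq;  /* one increment per waiter that leaves */
};

/* Returns true if a waiter that recorded seq_at_entry (the value of
   wakeup_seq when it started waiting) is allowed to leave the wait. */
static bool may_leave(const struct cond_model *c,
                      unsigned long long seq_at_entry)
{
    /* Check 1: a signal has been sent since this waiter started. */
    bool signalled_since_entry = c->wakeup_seq != seq_at_entry;
    /* Check 2: not every signal has been consumed yet. */
    bool signal_left = c->woken_seq != c->wakeup_seq;
    return signalled_since_entry && signal_left;
}

/* A waiter that passed both checks consumes one signal. */
static void consume_signal(struct cond_model *c)
{
    assert(c->woken_seq != c->wakeup_seq); /* check 2 must have held */
    c->woken_seq++;
}
```

Note that check 2 looks only at the global counters, not at anything
belonging to the calling thread.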

The problem is in check 2, because the guard is triggered if any thread has
woken spuriously, not just the current thread. Worse, it is triggered only
after the spuriously woken thread has consumed a signal. So in many cases the
spuriously woken thread consumes the signal, and a validly woken thread is
forced to retry. The result is that a spurious wakeup may steal signals that
were sent before it started waiting.
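
To make the effect of check 2 concrete, here is a single-threaded replay of
one possible interleaving. It is a sketch under assumptions: the interleaving
is hypothetical and not taken from my test case, the names (cond_model,
waiter, futex_returned, run_interleaving) are made up, and a return from the
futex wait is modelled simply as re-running the two checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model: only the two counters discussed above. */
struct cond_model {
    unsigned long long wakeup_seq; /* signals sent so far */
    unsigned long long woken_seq;  /* signals consumed by waiters so far */
};

struct waiter {
    unsigned long long seq_at_entry; /* wakeup_seq when the wait began */
    bool left;                       /* true once the waiter has returned */
};

static void start_wait(const struct cond_model *c, struct waiter *w)
{
    w->seq_at_entry = c->wakeup_seq;
    w->left = false;
}

static void send_signal(struct cond_model *c)
{
    c->wakeup_seq++;
}

/* Called whenever a waiter's futex wait returns, for whatever reason. */
static void futex_returned(struct cond_model *c, struct waiter *w)
{
    bool check1 = c->wakeup_seq != w->seq_at_entry;
    bool check2 = c->woken_seq != c->wakeup_seq;
    if (check1 && check2) {
        c->woken_seq++; /* consume one signal */
        w->left = true;
    }
    /* otherwise the waiter retries the wait */
}

struct outcome { bool a1_left, a2_left, b_left; };

static struct outcome run_interleaving(void)
{
    struct cond_model c = {0, 0};
    struct waiter a1, a2, b;

    start_wait(&c, &a1);
    start_wait(&c, &a2);
    send_signal(&c);         /* first signal: futex wake goes to a1 (slow to run) */
    start_wait(&c, &b);      /* b starts waiting after the first signal */
    send_signal(&c);         /* second signal: futex wake goes to a2 */

    futex_returned(&c, &b);  /* b returns without being the target of a wake */
    futex_returned(&c, &a1);
    futex_returned(&c, &a2); /* both signals are already consumed */

    assert(c.woken_seq == c.wakeup_seq);
    struct outcome o = { a1.left, a2.left, b.left };
    return o;
}
```

In this replay b and a1 leave, while a2, which was the target of a real futex
wake and had been waiting since before the first signal, fails check 2 and
goes back to sleep.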

Now, I'm confident that the race is real. But maybe some people would disagree
that it is a bug. That's why I asked in my original message if this behaviour
is intentional or a bug.

It is a bug if pthread condition variables should support the following usage: 

   ...

   pthread_mutex_lock(&m);

   SomeType localState = f(sharedState);

   while ( predicate(sharedState, localState) ) {
      pthread_cond_wait(&c, &m);
   }

   ...

In this case it actually matters which thread will wake up, because if the
wrong thread wakes up, it will retry the wait and the signal will be lost (this
is what happened to me). Unfortunately the spec is not very clear on the issue.
But this is the pattern that the pthread_cond_wait implementation in glibc
itself uses to detect spurious wakeups on the futex.
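
For illustration, here is one concrete instance of that usage pattern. The
generation counter, the struct and the function names are assumptions chosen
for this sketch; f() and predicate() above are deliberately left unspecified.
The waiter's local state is the generation it saw on entry, and it waits
until the shared generation moves past that value. Error checking is omitted
for brevity.

```c
#include <pthread.h>

struct shared {
    pthread_mutex_t m;
    pthread_cond_t c;         /* signalled when generation changes */
    pthread_cond_t ready;     /* signalled once the waiter has its local state */
    unsigned long generation; /* the shared state */
    int waiter_ready;
    unsigned long seen;       /* generation observed by the waiter on exit */
};

static void *wait_for_next_generation(void *arg)
{
    struct shared *s = arg;

    pthread_mutex_lock(&s->m);
    unsigned long my_generation = s->generation; /* localState = f(sharedState) */
    s->waiter_ready = 1;
    pthread_cond_signal(&s->ready);
    while (s->generation == my_generation)       /* predicate(shared, local) */
        pthread_cond_wait(&s->c, &s->m);
    s->seen = s->generation;
    pthread_mutex_unlock(&s->m);
    return NULL;
}

/* Starts one waiter, advances the generation once, and returns the
   generation the waiter saw when it left its loop. */
static unsigned long run_one_handoff(unsigned long start)
{
    struct shared s;
    pthread_t t;

    pthread_mutex_init(&s.m, NULL);
    pthread_cond_init(&s.c, NULL);
    pthread_cond_init(&s.ready, NULL);
    s.generation = start;
    s.waiter_ready = 0;
    s.seen = 0;

    pthread_create(&t, NULL, wait_for_next_generation, &s);

    pthread_mutex_lock(&s.m);
    while (!s.waiter_ready)          /* wait until the local state is recorded */
        pthread_cond_wait(&s.ready, &s.m);
    s.generation++;
    pthread_cond_signal(&s.c);
    pthread_mutex_unlock(&s.m);

    pthread_join(t, NULL);
    pthread_cond_destroy(&s.ready);
    pthread_cond_destroy(&s.c);
    pthread_mutex_destroy(&s.m);
    return s.seen;
}
```

Build with -pthread. With several such waiters, one that started after the
increment would find its predicate still true and wait again, so a signal
delivered to it instead of an earlier waiter is effectively lost.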

