This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.



Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.


Dan,

If you wouldn't mind, I'd like your historical perspective on a glibc issue.

You were the person who last touched
ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c; do you have any
recollection why ARM needed a custom copy of this file?

Dave,

I see that sparc32 also has a unique copy of lowlevellock.c. Why the
use of the *_24_* atomic primitives? Faster?

Andreas, Thomas, Kaz, Marcus,

I've added you to the TO: because machines you maintain are affected
by a bug in their lowlevellock.h implementations. Please see section
(c) below and review my analysis.

> Carlos, any comments on this patch
> <http://sourceware.org/ml/libc-ports/2013-01/msg00084.html>?  It makes the
> ARM version of __lll_timedlock_wait closer to the HPPA version, but I
> don't know if many of the differences between different architecture
> versions of this code are really deliberate....

I do not believe HPPA needs a unique version of lowlevellock.c; it
should instead be using the generic version. I do not recollect why I
would have created a copy of lowlevellock.c, other than that I was
mostly cloning SPARC when doing binutils and glibc work (Dave does a
good job). The HPPA kernel helper emulates compare-and-exchange and
thus provides everything required for a generic NPTL implementation.

> Would you agree that the generic Linux version
> (nptl/sysdeps/unix/sysv/linux/lowlevellock.c) doesn't need such a change
> because the loop is using atomic_exchange_acq rather than
> atomic_compare_and_exchange_bool_acq, so is already setting the futex to 2
> in every loop iteration?

(a) Review of the reporter's claim that a busy loop can occur in the
ARM version of __lll_timedlock_wait.

I have reviewed the ARM function __lll_timedlock_wait and I can
confirm that the current code can indeed cause a busy-wait loop with
three threads under the conditions provided by Uwatoko-san. I see
nothing unusual about those conditions, and with enough cores you
should be able to trigger the busy-wait loop.
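
For reference, the failing loop has roughly the following shape. This
is my paraphrase of the logic, not the verbatim ARM source:
~~~
  /* Upgrade the futex to 2 ("locked, perhaps waiters") once, up front.  */
  if (atomic_exchange_acq (futex, 2) == 0)
    return 0;

  do
    {
      /* ... compute the relative timeout rt from abstime; return
         ETIMEDOUT if it has already expired ...  */

      /* Sleep only while *futex == 2.  If a new locker has meanwhile
         reset the futex to 1, this returns -EWOULDBLOCK immediately,
         and since nothing below re-asserts the value 2, the loop can
         spin through failed syscalls.  */
      lll_futex_timed_wait (futex, 2, &rt, private);
    }
  while (atomic_compare_and_exchange_bool_acq (futex, 2, 0));
~~~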

(b) Review of the generic version of __lll_timedlock_wait.

I can confirm that the generic version of __lll_timedlock_wait is
*not* susceptible to a busy-wait loop. The use of `atomic_exchange_acq
(futex, 2)' in the while loop condition ensures that the code always
conservatively *assumes* the lock is locked with waiters (futex value
2) before sleeping. Note that the expectation is that __lll_timedlock
*must not* reset futex to 1, otherwise new threads may prevent
existing waiters from waiting on a futex value of 2. The expectation
is that __lll_timedlock sets futex to 1 only if it was zero.
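
For comparison, the generic loop has roughly this shape (again my
paraphrase, not the verbatim source):
~~~
  /* Every iteration re-asserts 2 ("locked, perhaps waiters") before
     sleeping: either the exchange returns 0 and we own the lock, or
     the subsequent wait sees 2 and sleeps.  A new locker writing 1
     costs at most one spurious failed wait, never a spin.  */
  while (atomic_exchange_acq (futex, 2) != 0)
    {
      /* ... compute the relative timeout rt from abstime; return
         ETIMEDOUT if it has already expired ...  */
      lll_futex_timed_wait (futex, 2, &rt, private);
    }
  return 0;
~~~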

(c) Solution to the busy-wait loop.

I do not like the proposed solution. It is a band-aid on top of a
function that does not look like it has any architectural merit. I
think ARM should be using the generic NPTL version of lowlevellock.c.

The real problem is that ARM's __lll_timedlock does an
"atomic_exchange_acq (__futex, 1)" while attempting to acquire the
futex (violating what I noted in (b)). This means that any new thread
trying to acquire the futex will reset the futex to "locked, no
waiters", causing all other in-flight threads doing a futex wait
to fail with -EWOULDBLOCK.

It seems to me that this is an ARM bug in the __lll_timedlock macro in
lowlevellock.h that was partly worked around in __lll_timedlock_wait
in lowlevellock.c.

For example, compare ARM's implementation:
#define __lll_timedlock(futex, abstime, private)                              \
  ({                                                                          \
     int *__futex = (futex);                                                  \
     int __val = 0;                                                           \
                                                                              \
     if (__builtin_expect (atomic_exchange_acq (__futex, 1), 0))              \
       __val = __lll_timedlock_wait (__futex, abstime, private);              \
     __val;                                                                   \
  })

To SPARC's implementation:
static inline int
__attribute__ ((always_inline))
__lll_timedlock (int *futex, const struct timespec *abstime, int private)
{
  int val = atomic_compare_and_exchange_val_24_acq (futex, 1, 0);
  int result = 0;

  if (__builtin_expect (val != 0, 0))
    result = __lll_timedlock_wait (futex, abstime, private);
  return result;
}

Note that the SPARC implementation sets the futex to 1 only if it was
zero; otherwise it calls into __lll_timedlock_wait, which will then
set futex to 2, because at that point we cannot know whether there are
one or more waiters (the "perhaps" scenario, if you have read Ulrich
Drepper's paper "Futexes Are Tricky").
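
For readers without the paper at hand, these are the three lock-word
states the protocol moves between (the symbolic names below are mine
for illustration; glibc uses the raw values):
~~~
enum
{
  LLL_UNLOCKED       = 0,  /* free                               */
  LLL_LOCKED         = 1,  /* held, definitely no waiters        */
  LLL_LOCKED_WAITERS = 2   /* held, perhaps one or more waiters  */
};
~~~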

I have reviewed all of the lowlevellock.h implementations and find the
following results:

No chance of busy-wait:
* mips, ia64, alpha, tile, i386, powerpc, x86-64, s390, sparc, hppa
- All of these targets use compare-and-exchange in __lll_timedlock
without resetting futex to 1.

Possible chance of busy-wait:
* m68k, aarch64, arm, sh/sh4
- All of these targets reset futex to 1 using atomic-exchange in
__lll_timedlock.

m68k and sh/sh4 use the generic version of lowlevellock.c, which
means the busy-wait loop bug is not present, but the use of
atomic-exchange causes a different bug. On m68k and sh/sh4 you get at
most N spurious failed futex wait syscalls for each thread calling
__lll_timedlock, where N is the number of in-flight threads expecting
futex == 2.

For example on m68k:

* T1 calls __lll_timedlock setting futex to 1 and taking the lock.
* T2 calls __lll_timedlock setting futex to 1 but does not take the lock.
* T2 calls __lll_timedlock_wait and sets the futex to 2 and does not
gain the lock.
* T3 calls __lll_timedlock setting futex to 1 but does not take the lock.
* T2 calls futex wait but fails with -EWOULDBLOCK because T3 reset futex to 1.
-> One in-flight thread (T2), and one spurious failed futex wait syscall.
* T2 again sets the futex to 2 and does not gain the lock.
* ... T2 and T3 go on to call futex wait syscall and both sleep.

In the ARM case, with the current implementation of
__lll_timedlock_wait we get a busy-loop as outlined by Uwatoko-san.

The real fix is to use atomic-compare-and-swap in __lll_timedlock.

In fact the use of atomic-exchange in ARM's __lll_timedlock seems
dangerous given this unlock:
#define __lll_unlock(futex, private) \
  (void)                                                        \
    ({ int *__futex = (futex);                                  \
       int __oldval = atomic_exchange_rel (__futex, 0);         \
       if (__builtin_expect (__oldval > 1, 0))                  \
         lll_futex_wake (__futex, 1, private);                  \
    })

Thus you *could* have a new thread calling __lll_timedlock and setting
futex to 1 just before the holder calls __lll_unlock; the unlock then
sees __oldval == 1 and skips the wake, leaving any thread sleeping on
a futex value of 2 asleep. If this keeps happening, there can be a
*huge* delay between sleeping and waking as the wake is deferred later
and later. Eventually one of the threads does an atomic operation on
the futex to set the value to 2, takes the lock, and eventually calls
wake. However, the latency is increased, and that's a performance bug.

In summary:
- All machines listed above as `Possible chance of busy-wait' should
be using a generic __lll_timedlock that looks like this:
~~~
  int result = 0;
  if (atomic_compare_and_exchange_bool_acq (futex, 1, 0) != 0)
    result = __lll_timedlock_wait (futex, abstime, private);
  return result;
~~~
(simplest example taken from alpha's lowlevellock.h)
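
Spelled out in the SPARC style shown earlier, a complete version would
look something like this (a sketch, not a drop-in patch):
~~~
static inline int
__attribute__ ((always_inline))
__lll_timedlock (int *futex, const struct timespec *abstime, int private)
{
  int result = 0;

  /* Write 1 ("locked, no waiters") only if the futex was 0; never
     clobber a 2 left behind by a sleeping waiter.  On failure fall
     through to the slow path, which upgrades the futex to 2.  */
  if (atomic_compare_and_exchange_bool_acq (futex, 1, 0) != 0)
    result = __lll_timedlock_wait (futex, abstime, private);
  return result;
}
~~~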

Comments?

Cheers,
Carlos.

