This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: Lock elision test results


On Fri, 2013-06-14 at 12:26 +0200, Dominik Vogt wrote:
> Test results on a zEC12 with eight CPUs, using Andi's lock
> elision v10 patches ported to z/architecture.  Unfortunately I
> cannot provide the source code used for the tests at the moment,
> but I can share relative performance data.  I plan to create a
> collection of test programs that can be used to measure elision
> performance in specific cases.
> 
> The tests were run on 13th of June, 2013.
> 
> Test 1
> ======
> 
> Setup
> -----
> 
> Two concurrent threads using pthread mutexes (m1, m2) and
> counters c1, c2, c3.  All static data structures are allocated
> in separate cache lines.
> 
> thread 1:
> 
>   barrier
>   repeat <n> times

We need to know how large n and m (below) are.  At least a rough range.

>     lock m1
>     lock m2
>     increment c1
>     unlock m1
>     increment c2
>     repeat <m> times
>       waste a minimal amount of cpu
>     unlock m2
>     signal that thread 1 has finished its work

I suppose that this isn't part of the loop (but it's indented the same
way as the loop body)?

>   barrier
> 
> thread 2:
> 
>   barrier
>   get start timestamp
>   while thread 1 has not finished
>     lock m1
>     increment c3
>     unlock m1
>   get end timestamp
> 
> Performance is measured in loops of thread 2 divided by the time
> taken.

Why just thread 2's time?  Without looking at thread 1's performance too
we can't distinguish throughput from fairness effects.
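
To make this concrete, here is the kind of measurement I have in mind
(an untested sketch, not your test 1: the worker loop, the ~3s run
time, and all the names are made up; only the pthread and
clock_gettime interfaces are real).  Each thread reports its own
iteration count and elapsed time, so thread 1's throughput can be put
next to thread 2's:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
static atomic_int stop;
static unsigned long counter;           /* protected by m1 */

struct stats { unsigned long iters; double secs; };

static double
now (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Each worker counts its own loop iterations and measures its own
   elapsed time.  */
static void *
worker (void *arg)
{
  struct stats *s = arg;
  double start = now ();
  while (!atomic_load (&stop))
    {
      pthread_mutex_lock (&m1);
      counter++;
      pthread_mutex_unlock (&m1);
      s->iters++;
    }
  s->secs = now () - start;
  return NULL;
}

int
main (void)
{
  pthread_t t[2];
  struct stats st[2] = { { 0, 0 }, { 0, 0 } };
  for (int i = 0; i < 2; i++)
    pthread_create (&t[i], NULL, worker, &st[i]);
  struct timespec d = { 3, 0 };         /* run for roughly 3 seconds */
  nanosleep (&d, NULL);
  atomic_store (&stop, 1);
  for (int i = 0; i < 2; i++)
    {
      pthread_join (t[i], NULL);
      printf ("thread %d: %.0f iterations/s\n", i, st[i].iters / st[i].secs);
    }
  return 0;
}

With per-thread numbers like these for your actual loops, we could see
whether elision reduces total throughput or merely shifts work between
the threads.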

> 
> Test execution
> --------------
> 
> The test is run ten times each with four different versions and
> setups of glibc:
> 
> (1) current glibc without elision patches (2506109403de)
> (2) glibc-2.15
> (3) current glibc (1) plus elision patches, GLIBC_PTHREAD_MUTEX=none
> (4) current glibc (1) plus elision patches, GLIBC_PTHREAD_MUTEX=elision
> 
> The best results of all runs for each glibc setup are compared.

What's the variance between the results from different test runs?  At
least having min/max/avg would be good, or perhaps the median, or
something like that.
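
For example, something along these lines (just a sketch; report() and
cmp_double() are made-up helpers, assuming the wall-clock time of each
run is recorded in run_secs[]):

#include <stdio.h>
#include <stdlib.h>

/* Report min/median/avg/max over the recorded run times instead of
   only the best run.  The median here is simply the middle element
   after sorting.  */
static int
cmp_double (const void *a, const void *b)
{
  double x = *(const double *) a, y = *(const double *) b;
  return (x > y) - (x < y);
}

static void
report (double *run_secs, int n)
{
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += run_secs[i];
  qsort (run_secs, n, sizeof (double), cmp_double);
  printf ("min %.3fs  median %.3fs  avg %.3fs  max %.3fs\n",
          run_secs[0], run_secs[n / 2], sum / n, run_secs[n - 1]);
}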

> The result for (1) is the reference (i.e. 100%).  Higher values
> mean higher relative performance.
> 
> Result
> ------
> 
> (1) unpatched  : 100.00%
> (2) old glibc  : 101.83%
> (3) elision off:  77.87%

What causes this 25% overhead?  Does it also occur with just a single
thread, or is it a difference in cache misses?

> (4) elision on :  29.37%
> 
> The abort ratio in (4) is >= 75% on thread 1

Why does it abort?  Did you experiment with different tuning params (ie,
the fields in elision_conf)?

>  and < 1% on thread 2.

Did thread 2 use elision often, or not?  If it did, it almost never
aborts, and you measure only thread 2's performance, so where does the
overhead come from?  Are aborts extremely costly on your architecture?
Do you get a lot more cache misses somewhere else?

What's the performance of running just thread 2 without thread 1?  IOW,
how large are the single-thread overheads of elision on your arch/impl?

Do you have tests with larger critical sections?  Right now your
critical sections, esp. the one in thread 2, are extremely small.
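
Both questions could be probed with something as simple as the
following (an untested sketch; the names, the iteration count, and the
command-line argument controlling the critical-section size are all
made up):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
/* Keeps the inner loop from being optimized away.  */
static volatile unsigned long sink;

static double
now (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int
main (int argc, char **argv)
{
  unsigned long iters = 10 * 1000 * 1000;
  unsigned long work = argc > 1 ? strtoul (argv[1], NULL, 0) : 0;
  double start = now ();
  for (unsigned long i = 0; i < iters; i++)
    {
      pthread_mutex_lock (&m);
      /* The critical-section size is controlled by WORK.  */
      for (unsigned long j = 0; j < work; j++)
        sink++;
      pthread_mutex_unlock (&m);
    }
  double secs = now () - start;
  printf ("work=%lu: %.1f ns per lock/unlock pair\n",
          work, secs / iters * 1e9);
  return 0;
}

Running that single-threaded against (1), (3), and (4), once with
work=0 and once with a larger value, would show the uncontended
overhead of the elision path and whether bigger critical sections
change the picture.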

> 
> Test 2 (nested locks)
> ======
> 
> Setup
> -----
> 
> Three concurrent threads using pthread mutexes (m1, ..., m10) and
> counters c1, ..., c10.  All static data structures are allocated
> in separate cache lines.
> 
> all threads:
> 
>   barrier
>   take start timestamp (only thread 1)
>   repeat <n> times
>     lock m1, increment c1
>     lock m2, increment c2
>     ...
>     lock m10, increment c10
>     unlock m10
>     unlock m9
>     ...
>     unlock m1
>   barrier
>   take end timestamp (only thread 1)
> 
> Performance is measured in the inverse of the time taken on thread
> 1.

See above, throughput vs. fairness.

> Test execution
> --------------
> 
> Identical to test 1.
> 
> Result
> ------
> 
> (1) unpatched  : 100.00%
> (2) old glibc  : 134.35%

What causes this difference between (1) and (2)?  If we get a 34%
difference just for stock glibc, that seems to me to be a bigger problem
than similar overheads when we turn on elision, which is still an
experimental feature.

> (3) elision off:  56.45%
> (4) elision on :  31.31%

We need more data to understand these results.  See above.

> The abort ratio in (4) in all threads is between 5% and 10%.

That's interesting because all threads seem to conflict with each other
(ie, they increment the same counters).  How often was elision actually
used?

> 
> Test 3 (cacheline pingpong)
> ======
> 
> Setup
> -----
> 
> Four concurrent threads using a pthread mutex m and counters
> c1, ..., c4.
> All static data structures are allocated in separate cache lines.
> 
> thread <n>:
> 
>   barrier
>   take start timestamp (only thread 1)
>   barrier
>   repeat <m> times
>     lock m
>     increment c<n>
>     unlock m
>   barrier
>   take end timestamp (only thread 1)
> 
> Performance is measured in the inverse of the time taken on thread
> 1.

See above, throughput vs. fairness.

> 
> Test execution
> --------------
> 
> Identical to test 1.
> 
> Result
> ------
> 
> (1) unpatched  : 100.00%
> (2) old glibc  : 103.94%
> (3) elision off:  76.25%
> (4) elision on : 373.38%

So in this case we scale almost linearly, nice.  Do you know why it is
different in test 1?  It would be kind of surprising if the abort ratio
alone were the cause, given that the abort rates don't differ that much
(or aborts would have to be really costly).

> 
> The abort ratio in (4) in all threads is < 0.01%.

In general, I think it would be good if you could port these tests to
the performance benchmark framework that Siddhesh has been working on,
and contribute them to glibc.  That way, we can also work on avoiding
performance regressions such as the one you reported for the 2.15 vs.
current case without the elision patches.
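
To illustrate what I mean, here is roughly how test 3 could look as a
self-contained C program (an untested sketch: the thread count, the
iteration count, and the 256-byte cache-line padding are guesses, and
the glue needed to hook it into the benchtests harness is not shown):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4
#define ITERS (1000 * 1000)

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;

/* One counter per thread, each padded to what I assume is a separate
   cache line.  */
struct counter
{
  unsigned long c;
  char pad[256 - sizeof (unsigned long)];
};
static struct counter counters[NTHREADS] __attribute__ ((aligned (256)));

static double
now (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void *
worker (void *arg)
{
  struct counter *c = arg;
  pthread_barrier_wait (&barrier);      /* start together */
  for (long i = 0; i < ITERS; i++)
    {
      pthread_mutex_lock (&m);
      c->c++;
      pthread_mutex_unlock (&m);
    }
  pthread_barrier_wait (&barrier);      /* signal completion */
  return NULL;
}

int
main (void)
{
  pthread_t t[NTHREADS];
  pthread_barrier_init (&barrier, NULL, NTHREADS + 1);
  for (int i = 0; i < NTHREADS; i++)
    pthread_create (&t[i], NULL, worker, &counters[i]);
  pthread_barrier_wait (&barrier);      /* all workers are ready */
  double start = now ();
  pthread_barrier_wait (&barrier);      /* all workers are done */
  double secs = now () - start;
  for (int i = 0; i < NTHREADS; i++)
    pthread_join (t[i], NULL);
  printf ("%.0f lock/unlock pairs per second\n",
          NTHREADS * (double) ITERS / secs);
  return 0;
}

Having the tests in a shape like this in the tree would make it much
easier to reproduce numbers like the ones above on other machines.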

Torvald

