
Re: Lock elision test results


On Wed, Jul 03, 2013 at 01:03:52PM +0200, Torvald Riegel wrote:
> On Wed, 2013-07-03 at 08:53 +0200, Dominik Vogt wrote: 
> > On Tue, Jul 02, 2013 at 02:18:04PM +0200, Torvald Riegel wrote:
> > > On Fri, 2013-06-14 at 12:26 +0200, Dominik Vogt wrote:
> > > >   barrier
> > > > 
> > > > thread 2:
> > > > 
> > > >   barrier
> > > >   get start timestamp
> > > >   while thread 1 has not finished
> > > >     lock m1
> > > >     increment c3
> > > >     unlock m1
> > > >   get end timestamp
> > > > 
> > > > Performance is measured in loops of thread 2 divided by the time
> > > > taken.
> > > 
> > > Why just thread 2's time?  Without looking at thread 1's performance too
> > > we can't distinguish throughput from fairness effects.
> > 
> > As the threads are synchronized they take the same amount of time.
> 
> Right, I missed that.  Nonetheless, thread 2 can do more work depending
> on whether it gets to acquire m1 more frequently than thread 1 or not.
> If thread 2 does acquire m1 only infrequently for some reason, this will
> decrease thread 1's time but it doesn't mean that we're now necessarily
> faster -- we might just do less work overall, so less throughput.  Or if
> thread 2 holds m1 most of the time (or, acquires it frequently), then
> thread 1's time will be larger but we also do more work.
> 
> IOW, I think we need to report the iterations of thread 2 too, or let
> thread 2 do a fixed number of iterations and measure the time it needs
> for those.

I don't see what difference that makes.  But regardless of whether
<n> is 2000, 5000, 10000 or 20000, thread 2 always completes
roughly the same number of iterations per second.  Also, if I let
thread 2 execute a fixed number of iterations and terminate the
loop in thread 1 when thread 2 is done, the number of iterations
per second is the same.
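
With a fixed iteration count, the measurement loop of thread 2 looks
roughly like the sketch below.  This is only an illustration, not the
actual test code: the mutex m1 and counter c3 are the ones from the
test description above, while NITER, thread2_done and the
clock_gettime/pthread_barrier scaffolding are made up for the sketch
(thread 1, not shown, terminates its loop once thread2_done is set).

  /* Sketch only, not the real test code.  */
  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>

  #define NITER 10000000UL            /* made-up iteration count */

  static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
  static pthread_barrier_t barrier;   /* initialized elsewhere for both threads */
  static unsigned long c3;
  static volatile int thread2_done;   /* read by thread 1's loop */

  void *
  thread2 (void *arg)
  {
    struct timespec start, end;
    unsigned long i;
    (void) arg;

    pthread_barrier_wait (&barrier);
    clock_gettime (CLOCK_MONOTONIC, &start);
    for (i = 0; i < NITER; i++)
      {
        pthread_mutex_lock (&m1);
        c3++;
        pthread_mutex_unlock (&m1);
      }
    clock_gettime (CLOCK_MONOTONIC, &end);
    thread2_done = 1;

    double secs = (end.tv_sec - start.tv_sec)
                  + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf ("thread 2: %lu iterations in %.3f s (%.0f iterations/s)\n",
            NITER, secs, NITER / secs);
    return NULL;
  }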

> > Speaking about the actual test result:
> > 
> > 1. In (4), thread 1 has a high abort ratio, i.e. it needs lots of
> >    additional cpu cycles and the whole test runs much longer than
> >    it would without aborts.
> > 2. Thread 2 does not benefit much from the additional run time,
> >    the number of iterations is roughly the same as without
> >    elision.
> 
> Does it have a high abort rate too?  If not, it should benefit from the
> additional runtime that thread 1 effectively gives to thread 2.

I don't think it has lots of aborts other than the ones that are
caused by the mutex being locked.  To me it looks like the aborts
caused by the write-write conflict always hit thread 1, but I don't
know for sure at the moment.  When thread 2 aborts because the
mutex was locked, it blocks on the lock and waits for thread 1 to
complete before it can continue.
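
That is the expected behaviour with the usual elided-lock pattern:
the lock word is read inside the transaction, so thread 2 either sees
it free and elides, or sees it taken, aborts and falls back to
blocking on the real lock.  Roughly, as a sketch written with GCC's
s390 -mhtm builtins (this is not the actual glibc elision code, and
real_lock is a hypothetical blocking fallback):

  /* Generic lock-elision sketch using GCC's s390 HTM builtins
     (compile with -mhtm); not the glibc implementation.  */
  #include <htmintrin.h>

  extern int real_lock (int *futex);   /* hypothetical blocking path */

  int
  elided_lock (int *futex)             /* 0 = free, 1 = locked */
  {
    if (__builtin_tbegin ((void *) 0) == _HTM_TBEGIN_STARTED)
      {
        /* The load puts the lock word into the transaction's read
           set; a later writer aborts us (conflict).  */
        if (*futex == 0)
          return 0;                    /* elided: the critical section
                                          runs transactionally */
        /* Lock already held: abort explicitly and block on the real
           lock until the owner is done.  256 is assumed here as a
           non-reserved user abort code.  */
        __builtin_tabort (256);
      }
    /* Abort or failed TBEGIN: take the normal blocking path.  */
    return real_lock (futex);
  }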

> What's the ratio between the increase in thread 2's iterations and the
> increase in thread 1's iteration?

See above.

A quick printf shows that thread 2 does about

(1) 14000*n
(2) 13500*n
(3) 14700*n
(4)  6800*n

iterations at the same time thread 1 does one outer iteration.

> > > > Test execution
> > > > --------------
> > > > 
> > > > The test is run ten times each with four different versions and
> > > > setups of glibc:
> > > > 
> > > > (1) current glibc without elision patches (2506109403de)
> > > > (2) glibc-2.15
> > > > (3) current glibc (1) plus elision patches, GLIBC_PTHREAD_MUTEX=none
> > > > (4) current glibc (1) plus elision patches, GLIBC_PTHREAD_MUTEX=elision
> > > > 
> > > > The best results of all runs for each glibc setup are compared.
> > > 
> > > What's the variance between the results from different test runs?  At
> > > least having min/max/avg would be good, or perhaps the median, or
> > > something like that.
> > 
> > It's a bit difficult to answer this without posting absolute
> > numbers.  The range of measurements is large enough to be
> > irritating in all test setups, even in the early morning when
> > nobody but me works on the machine.
> 
> What do you mean by "irritating"?

By irritating I mean that I have no explanation for them.

In test 2 the range is much larger, and sometimes the performance
of a single test setup is roughly 50% or 200% of the "normal"
performance.  However, I never see 30%, 60%, 80% or 120%.
I guess that once a test program runs into unfavourable timing
conditions between the threads, it has trouble getting out of them
again.

> > > If it did, you almost never
> > > abort, and you just measure thread 2's performance, then why the
> > > overhead?
> > 
> > > Are aborts extremely costly on your architecture?
> > 
> > Yes.
> 
> That's interesting.  Can you give a rough range of how costly they are?
> Or what is happening internally?

A full list of the steps during transaction abort handling can be
found on page 5-100 of the "z/Architecture, Principles of
Operation" (SA22-7832-09).  It's available somewhere on the web.
In short, the cpu cleans up the cache (rolling back transactional
stores and committing nontransactional stores), restores the PC
and the general purpose registers (specified by TBEGIN) and writes
a transaction diagnostic block which is a block of 256 bytes that
contains the abort address, the abort reason and a copy of the
general purpose registers at the point of the abort (the caller
provides a pointer to the tdb if he wants one).  All this is done
by millicode.
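
From the program's point of view the TDB is just a 256-byte,
doubleword-aligned buffer whose address is passed to TBEGIN.  With
GCC's s390 -mhtm builtins that looks roughly like this (sketch only;
decoding the individual TDB fields is omitted):

  /* Sketch, assuming GCC's s390 HTM builtins (-mhtm).  If the tdb
     pointer passed to TBEGIN is non-NULL, the machine fills the
     256-byte block on an abort with the abort address, the abort
     code and a copy of the general purpose registers.  */
  #include <htmintrin.h>

  static char tdb[256] __attribute__ ((aligned (8)));

  int
  run_transaction (void)
  {
    int cc = __builtin_tbegin (tdb);
    if (cc == _HTM_TBEGIN_STARTED)
      {
        /* ... transactional work ...  */
        return __builtin_tend ();
      }
    /* Aborted (or TBEGIN failed): cc distinguishes transient from
       persistent conditions, the details are now in tdb.  */
    return cc;
  }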

There's a paper describing the HTM hardware implementation on z:

  http://dl.acm.org/citation.cfm?id=2457483

There are details about the implementation of the instructions on
pages 32 and 33.  TBEGIN, TBEGINC and TEND are implemented in
hardware, TABORT, ETND and PPA are implemented as millicode and
thus slower.  I have access to the internal document that lists
the cycles these instructions may take, but I must not quote from
it.

Maybe I can say that TABORT (and probably a normal abort too)
takes an order of magnitude more time than a TBEGIN, which in
turn takes an order of magnitude more time than TEND.  Roughly:

  TABORT: very expensive
   >>
  TBEGIN: expensive
  ETND:   expensive (extract transaction nesting depth)
   >>
  TEND:   cheap

I heard that ETND execution time is going to be "fixed" in the
future.

Also, an abort does not restore floating point registers, so there
is an additional overhead before TBEGIN to save the non-clobbered
floating point registers and to restore them after TEND when
necessary.
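
This is also visible in GCC's builtin interface: __builtin_tbegin
emits the floating point save/restore around the transaction, while
__builtin_tbegin_nofloat leaves it out for code that keeps no live
values in FPRs.  A sketch (assuming GCC's s390 -mhtm builtins):

  /* Sketch: TBEGIN itself does not save the FPRs, so the compiler
     has to do it in software when __builtin_tbegin is used.  */
  #include <htmintrin.h>

  int
  elide_fprs_live (void)
  {
    /* Saves the FPRs before TBEGIN and restores them on the abort
       path -- the extra overhead described above.  */
    if (__builtin_tbegin ((void *) 0) == _HTM_TBEGIN_STARTED)
      {
        /* ... transactional work that may clobber FPRs ...  */
        return __builtin_tend ();
      }
    return -1;                  /* abort path: FPRs are intact */
  }

  int
  elide_no_fprs (void)
  {
    /* Cheaper variant: no FPR save/restore is emitted, so FPR
       contents are unpredictable on the abort path if the
       transaction body modified them.  */
    if (__builtin_tbegin_nofloat ((void *) 0) == _HTM_TBEGIN_STARTED)
      {
        /* ... transactional work that must not rely on FPRs ...  */
        return __builtin_tend ();
      }
    return -1;                  /* abort path */
  }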

Overall, I'd say a simple TBEGIN-TEND pair around a single
instruction is expensive enough to be worthwhile only if it can
prevent a cache miss (cache pingpong).  It becomes better with
longer transaction bodies.
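
As a back-of-the-envelope model (symbolic only, no real cycle counts
implied), elision pays off roughly when the fixed transaction
overhead plus the expected abort cost stays below what the elided
run saves, i.e. the skipped lock/unlock and the avoided cache line
transfer:

  /* Purely illustrative cost model; every parameter is a symbolic
     input, none of the internal cycle numbers appear here.  */
  static int
  elision_pays_off (double t_tbegin, double t_tend,
                    double p_abort, double t_abort,
                    double t_lock_unlock, double t_pingpong_saved)
  {
    double elided = t_tbegin + t_tend + p_abort * t_abort;
    double locked = t_lock_unlock + t_pingpong_saved;
    return elided < locked;
  }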

> If aborts are very costly, then this certainly affects any tuning
> decisions regarding when to use elision or any kind of transaction-based
> speculative execution.  We'd have to be more conservative.

Yes.

Ciao

Dominik ^_^  ^_^

-- 

Dominik Vogt
IBM Germany

