This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Lock elision test results


On Wed, Jul 03, 2013 at 03:18:04PM +0200, Torvald Riegel wrote:
> On Wed, 2013-07-03 at 14:07 +0200, Dominik Vogt wrote:
> > On Wed, Jul 03, 2013 at 01:03:52PM +0200, Torvald Riegel wrote:
> > I don't see what difference that makes.
> 
> Well, if thread 1 runs n iterations and thread 2 runs n2 iterations,
> then the overall throughput is, simplified, n+n2.  Roughly, if we take
> longer to run n iterations, yet n2 is bigger at the same time, we have
> the same throughput.  Or if we need less time, but n2 is lower, then
> same throughput as well.  That's why I think we need to consider the
> iterations of thread 2 as well.

> > But regardless of whether
> > <n> is 2000, 5000, 10000 or 20000, thread 2 always completes
> > roughly the same number of iterations per second.
> 
> Do you mean the same number of iterations for each value of <n>, or the
> same number of iterations across all these values of <n>?

The same number of iterations per second, no matter what value
<n> is.

> > Also, if I let
> > thread 2 execute a fixed number of iterations and terminate the
> > loop in thread 1 when thread 2 is done, the number of iterations
> > per second is the same.
> 
> Sorry, the same as what?

The same number of iterations per second as in the original test
where the number of iterations of thread 1 is constant.

> > > > Speaking about the actual test result:
> > > > 
> > > > 1. In (4), thread 1 has a high abort ratio, i.e. it needs lots of
> > > >    additional cpu cycles and the whole test runs much longer than
> > > >    it would without aborts.
> > > > 2. Thread 2 does not benefit much from the additional run time,
> > > >    the number of iterations is roughly the same as without
> > > >    elision.
> > > 
> > > Does it have a high abort rate too?  If not, it should benefit from the
> > > additional runtime that thread 1 effectively gives to thread 2.
> > 
> > I don't think it has lots of aborts other than the ones that are
> > caused by the mutex being locked.  To me it looks like the aborts
> > caused by the write-write conflict always hit thread 1, but I don't
> > know for sure at the moment.  When thread 2 aborts because the
> > mutex was locked, it blocks on the lock and waits for thread 1 to
> > complete before it can continue.
> > 
> > > What's the ratio between the increase in thread 2's iterations and the
> > > increase in thread 1's iteration?
> > 
> > See above.
> > 
> > A quick printf shows that thread 2 does about
> > 
> > (1) 14000*n
> > (2) 13500*n
> > (3) 14700*n
> > (4)  6800*n
> > 
> > iterations at the same time thread 1 does one outer iteration.
> 
> The outer iteration is one iteration in the "repeat <n>" loop?

Yes.

> If so, this would be surprising, I believe.

> The current parameters of the
> adaptation algorithms should lead to thread 1 acquiring m1 without
> elision after a few retries (same applies to m2, so worst case two times
> a few retries).

Agreed.

> Once it does that, it won't ever touch m1, so thread 2
> can work without disturbances (unless there is some false sharing
> somewhere).

Yes, because both threads use real locks.

So, while thread 1 does one iteration of 

  waste a minimal amount of cpu

  waste some cpu

thread 2 does

  lock m1
  increment c3
  unlock m1
 
It looks like one iteration of thread 2 takes about six times more
cpu than an iteration of thread 1 (without elision), and with
elision it takes about 14 times more cpu.
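
Roughly how such a per-iteration cost can be measured (just a
sketch, not the test code itself; the iteration count is made up):

  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>

  #define ITERATIONS 1000000L

  static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
  static volatile long c3;

  int
  main (void)
  {
    struct timespec t0, t1;

    clock_gettime (CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERATIONS; i++)
      {
        /* One iteration of thread 2: lock, increment, unlock.  */
        pthread_mutex_lock (&m1);
        c3++;
        pthread_mutex_unlock (&m1);
      }
    clock_gettime (CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf ("%.1f ns per iteration\n", sec * 1e9 / ITERATIONS);
    return 0;
  }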

> At which point we should have the option to do the same as
> in case (3);  thus, the difference is surprising to me.  Do you have any
> explanations or other guesses?

Hm, with the default tuning values (three attempts with elision
and three with locks), *if* thread 1 starts using locks, thread 2
would

  try to elide m1 <------------------\
    begin a transaction              |
    *futex != 0 ==> forced abort     |
  acquire lock on m1                 |
    increment counter                |
  release lock on m1                 |
  acquire lock on m1                 |
    increment counter                |
  release lock on m1                 |
  acquire lock on m1                 |
    increment counter                |
  release lock on m1  ---------------/

I.e. for three successful locks you get one aborted transaction.
This slows down thread 2 considerably.  Actually I'm surprised
that thread 2 does not lose more.
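
In pseudo-C the pattern above is roughly this (only a sketch of the
adaptation scheme as I understand it, not the actual glibc code;
tbegin(), tabort(), lock_futex() and adapt_count are made-up names):

  static int adapt_count;       /* per-mutex adaptation counter */

  static void
  lock_with_elision (int *futex)
  {
    if (adapt_count <= 0)
      {
        /* Try to elide.  If the transaction aborts, execution resumes
           at the tbegin return point with a non-zero value.  */
        if (tbegin () == 0)
          {
            if (*futex == 0)
              return;            /* lock successfully elided */
            tabort ();           /* lock is held: forced abort */
          }
        /* Elision failed: use the real lock for the next three
           acquisitions.  */
        adapt_count = 3;
      }
    adapt_count--;
    lock_futex (futex);          /* plain (futex based) lock */
  }

I.e. after one aborted elision attempt the thread performs three
plain lock/unlock cycles before it tries elision again, which
matches the one-abort-per-three-locks ratio above.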

What I do not understand is why thread 1 starts aborting
transactions at all.  After all, there is no conflict between the
write sets of the two threads.  The only aborts should be caused by
interrupts.  If, once the lock is used, unfortunate timing forces
the code to never switch back to elision (because one of the
threads always holds the lock and forces the other one to lock
too), that would explain the observed behaviour.  But frankly, that
looks too unlikely to me, unless I'm missing some important facts.

> > > > > What's the variance between the results from different test runs?  At
> > > > > least having min/max/avg would be good, or perhaps the median, or
> > > > > something like that.
> > > > 
> > > > It's a bit difficult to answer this without posting absolute
> > > > numbers.  The range of measurements is large enough to be
> > > > irritating in all test setups, even in the early morning when
> > > > nobody but me works on the machine.
> > > 
> > > What do you mean by "irritating"?
> > 
> > By irritating I mean that I have no explanation for them.

The explanation for that is probably the machine architecture.
Our system partition has eight cpus.  Six (or five?) cpus are on
the same chip.  So, if some thread is executed by a cpu on a
different chip, memory latency is much higher.  This effect could
explain the constant factor I see.
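
(One way to check that would be to pin the threads explicitly, e.g.
with something like the sketch below; the cpu numbers would have to
be chosen to match our partition's topology.)

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  /* Pin the calling thread to one cpu so that both test threads can
     be forced onto the same chip or deliberately onto different
     chips.  */
  static int
  pin_to_cpu (int cpu)
  {
    cpu_set_t set;

    CPU_ZERO (&set);
    CPU_SET (cpu, &set);
    return pthread_setaffinity_np (pthread_self (), sizeof (set), &set);
  }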

> > Also, an abort does not restore floating point registers, so there
> > is an additional overhead before TBEGIN to save the non-clobbered
> > floating point registers and to restore them after TEND when
                                                      ^^^^^^
> > necessary.

After an abort, of course.

> > Overall, I'd say a simple TBEGIN-TEND pair around a single
> > instruction is expensive enough to be worthwhile only if it can
> > prevent a cache miss (cache pingpong).
> 
> So, elision can be more efficient if it succeeds in the first attempt
> and the lock is contended (so elision can avoid the cache miss on the
> lock)?

I think so.
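
For illustration, this is roughly what a TBEGIN/TEND pair around a
single instruction looks like with the GCC z builtins (a sketch
only; it assumes -mhtm, and if I read the GCC documentation
correctly, plain __builtin_tbegin also emits the floating point
register save/restore mentioned above, while __builtin_tbegin_nofloat
omits it):

  static volatile long counter;

  static void
  increment_elided (void)
  {
    /* Condition code 0 means the transaction has started.  */
    if (__builtin_tbegin ((void *) 0) == 0)
      {
        counter++;              /* the single protected instruction */
        __builtin_tend ();
      }
    else
      {
        /* Abort path: fall back to a real lock here (not shown).  */
      }
  }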

There may be an effect on the z hardware that plays an important
role in elision performance.  I'm not a hundred percent sure, but I
think the following can happen.

The system architecture is like this:

  * Each cpu has its own L1 and L2 cache
  * Six (or five?) cpus share one L3 cache (all on one chip)
  * Several cpu chips share one L4 cache
  * All caches are inclusive
  * When a cpu writes memory into the L1 cache, the write is
    automatically flushed to the L2 and L3 cache.

(assuming that thread 1 runs on cpu A and thread 2 runs on cpu B)

 * Thread 1 writes memory cell X outside any transaction.
 * The cache of cpu A loads the cache line that contains X
   exclusively.
 * X is modified; the write is flushed to the cpu A's L3 cache.
 * Thread 2 starts a transaction on cpu B.
 * Thread 2 reads or writes X.
 * Now, if cpu B does not use the same L3 cache as cpu A (i.e. it is
   on a different chip), it tries to load the cache line, notices
   that a different chip has already modified the cache line, and
   *aborts*.  This is because it cannot know whether X was modified
   externally before or during the transaction.

   Note that HTM on z guarantees that a cpu with an open
   transaction does not see memory modifications made by other
   cpus inside or outside of transactions.

This could lead to aborts that are unnecessary, strictly speaking.
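
If that is what happens, it should be possible to see it with
something along these lines (sketch only; run dirty_x () on a cpu of
one chip, then read_x_transactionally () on a cpu of the other chip
while nobody writes X anymore):

  static volatile long x;       /* the memory cell X from the example */

  /* Step 1, on cpu A: modify X outside any transaction so that cpu
     A's caches hold the line in a modified state.  */
  static void
  dirty_x (void)
  {
    x++;
  }

  /* Step 2, on cpu B (different chip): read X inside a transaction.
     If the guess above is right, this can abort although there is no
     concurrent modification.  Returns 1 if the transaction aborted.  */
  static int
  read_x_transactionally (void)
  {
    if (__builtin_tbegin ((void *) 0) == 0)   /* 0: transaction started */
      {
        long tmp = x;
        (void) tmp;
        __builtin_tend ();
        return 0;
      }
    return 1;
  }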

Ciao

Dominik ^_^  ^_^

-- 

Dominik Vogt
IBM Germany

