This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC] Lock elision implementation guidelines


On Mon, 2013-06-10 at 13:43 +0200, Dominik Vogt wrote:
> > Focus on PNT, allow E
> > 
> >     Enable LE conservatively. Provide tuning knobs that are not part of the
> >     ABI.
> >     Pro: Hopefully little harm done by LE implementation deficiencies (need
> >     to be conservative with semantics, but performance tuning doesn't need to
> >     be perfect right away).
> >     Con: No stable way to tune LE until a later time. 
> 
> 
> The attached program suffers from a massive performance loss
> caused by lock elision.  It runs two threads; the first one
> repeatedly executes a big workload, while the second one just
> increments a counter in a tight loop.  Two mutexes are used.
> 
> (The test was run on a zEC12 with the elision patches ported to
> the z architecture.)
> 
> thread 1:
> ---------
> 
> for (...)
>   lock A
>   lock B
>   increment counter 1
>   unlock A
>   handle big workload
>   increment counter 3
>   unlock B
> 
> thread 2:
> ---------
> 
> forever
>   lock A
>   increment counter 2
>   lock B

unlock A, I assume
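
Filling that in, the scenario in runnable form would be roughly the
following (a hypothetical pthreads reconstruction for illustration;
the actual attached program may differ):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;
static unsigned long counter1, counter2, counter3;

static void
big_workload (void)
{
  /* Stand-in for the real workload: anything that runs much longer
     than a few lock acquisition attempts.  */
  for (volatile unsigned long i = 0; i < 50000000; i++)
    ;
}

static void *
thread1 (void *arg)
{
  for (int i = 0; i < 100; i++)
    {
      pthread_mutex_lock (&A);
      pthread_mutex_lock (&B);
      counter1++;
      pthread_mutex_unlock (&A);
      big_workload ();
      counter3++;
      pthread_mutex_unlock (&B);
    }
  return NULL;
}

static void *
thread2 (void *arg)
{
  for (;;)
    {
      pthread_mutex_lock (&A);
      counter2++;
      pthread_mutex_unlock (&A);  /* the assumed unlock */
    }
  return NULL;
}

int
main (void)
{
  pthread_t t1, t2;
  pthread_create (&t1, NULL, thread1, NULL);
  pthread_create (&t2, NULL, thread2, NULL);
  pthread_join (t1, NULL);        /* thread 2 runs until exit */
  pthread_mutex_lock (&A);
  printf ("counter2 = %lu\n", counter2);
  pthread_mutex_unlock (&A);
  return 0;
}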

> 
> (For completeness, there might be a third thread that runs at very
> low frequency, and uses lock B to query counter 3.  That thread
> might not need to be very responsive.  I.e. lock B is not
> superfluous.)
> 
> In a real life test, the runtime of the program with elision is
> almost identical to the runtime without elision, using plain locks.

So, "big workload" executes for a long time (i.e., significantly longer
than one would need for trying to acquire a lock for a couple of times)?

> But in the same amount of time I get 1247 million increments of
> counter 2 without elision but just 660 million increments _with_
> elision.

It's surprising that the difference is so large.  If "big workload" runs
for most of the time in thread 1, A should be free most of the time.
Does anything else in "big workload" conflict with what thread 2 does?

Do you have any performance counter data for how often you can run
transactions successfully?

> The explanation for this is that because of lock collisions the
> mutex A is eventually switched to using a real lock instead of an
> elided lock.  Thread 1 then elides lock B.  From this point I'm
> not entirely sure what happens as I'm new to the nptl code.

If locking for A isn't elided, then thread 1 should unlock A with a
transactional write but then see that thread 2 is waiting, and abort
because it calls the futex_wake syscall.  If thread 2 doesn't try to acquire
the lock until after thread 1 unlocked it transactionally, thread 1 will
also abort.
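
For reference, the elided-unlock path being discussed looks roughly
like the following (a simplified sketch with illustrative names, based
on my reading of the posted x86 patches, not the exact nptl code; the
z port may differ):

/* Sketch of an elided unlock.  real_unlock_with_wake stands in for
   the normal unlock path, which wakes waiters via futex_wake.  */
int
elided_unlock (int *futex, int private)
{
  if (*futex == 0)
    /* The lock word was never written, so we hold the lock only as
       part of a transaction: just commit it.  */
    _xend ();
  else
    /* Real unlock.  If this runs inside an enclosing transaction
       (here, B is elided around the unlock of A), the store itself
       is transactional, but a waiting thread makes the unlock path
       issue a futex_wake syscall, and the syscall aborts the
       transaction.  */
    real_unlock_with_wake (futex, private);
  return 0;
}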

> But
> either thread 2 waits on lock A until thread 1 releases lock B

I believe that this shouldn't happen, but it depends on how your HTM
does conflict resolution, so I'm not quite sure.  The only case I can
think of is if your HTM didn't make transactional writes conflict with
concurrent nontransactional reads until commit, and lock A used
spinning with reads.  But normal mutexes don't do this kind of spinning
(at least the x86 and C implementations don't, dunno about z).
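
To spell out what "spinning with reads" would mean here (hypothetical
code, not anything glibc does):

#include <stdatomic.h>

/* Hypothetical test-and-test-and-set acquire.  The inner loop spins
   on plain loads; on an HTM that defers write/read conflict
   detection until commit, these reads would not abort a transaction
   that has transactionally written the lock word.  glibc's mutexes
   instead attempt the atomic operation directly and then block on
   the futex.  */
static void
spin_lock_with_reads (atomic_int *lock)
{
  for (;;)
    {
      while (atomic_load_explicit (lock, memory_order_relaxed) != 0)
        ;                            /* read-only spin */
      if (atomic_exchange (lock, 1) == 0)
        return;                      /* acquired */
    }
}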

> , or
> as thread 2 tries to lock A, it aborts the transaction in thread
> 1.

Which should then lead B to switch to no elision, and because "big
workload" is long, the result should be similar to an execution without
elision.

> 
> The point is:  For certain program logic, lock elision can be
> very bad for performance.

But if the tuning is conservative enough, elision should eventually stop
being used in such cases.  So the key is finding some kind of default
tuning that makes sense.

You cited the "Focus on PNT, allow E" option, but I'm not sure what
conclusions you draw from your test and how they relate to this option.
Would you like to have a very conservative default tuning?  Or do you
think that lock elision should only be enabled explicitly on the z
architecture?  Or something else?

> P.S.:  Test run with default tuning values; I think it switches
> to real locks after three failed transactions, and back to
> transactions after three real locks.  I'm quite sure that given
> _any_ set of tuning parameters, I can write you a test program
> that has similarly bad performance using elision.

Certainly not for _any_.  For example, if we stopped using elision
entirely after 10 failed attempts globally, I guess the performance
degradation that elision could cause would be pretty limited :)
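
To make the adaptation concrete, the per-lock logic we're discussing
is roughly the following (a simplified sketch, shown with the x86 RTM
intrinsics for concreteness since the z port uses different
primitives; the constant names are illustrative, with the defaults of
three from your P.S.):

#include <immintrin.h>

#define RETRY_TRY_XBEGIN      3  /* transaction attempts per acquisition */
#define SKIP_LOCK_AFTER_ABORT 3  /* acquisitions to do with the real lock */

extern int real_lock (int *futex);  /* stand-in: normal futex-based acquire */

int
adaptive_elided_lock (int *futex, short *adapt_count)
{
  if (*adapt_count <= 0)
    {
      for (int attempt = 0; attempt < RETRY_TRY_XBEGIN; attempt++)
        {
          unsigned int status = _xbegin ();
          if (status == _XBEGIN_STARTED)
            {
              if (*futex == 0)
                return 0;            /* lock successfully elided */
              _xabort (0xff);        /* lock is busy: abort and retry */
            }
          if (!(status & _XABORT_RETRY))
            break;                   /* persistent abort: stop retrying */
        }
      /* Three failed transactions (or a persistent abort): use the
         real lock for the next few acquisitions.  */
      *adapt_count = SKIP_LOCK_AFTER_ABORT;
    }
  else
    (*adapt_count)--;                /* counts down the real-lock uses */

  return real_lock (futex);          /* take the lock for real */
}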

Thanks for making the test.  It would be interesting if you could
investigate this further, post more results, and perhaps experiment with
different tuning options.  As Andi said, this data is necessary input for
deciding which steps we need to take to make use of elision.

Torvald

