This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [RFC] Lock elision implementation guidelines
- From: Torvald Riegel <triegel at redhat dot com>
- To: vogt at linux dot vnet dot ibm dot com
- Cc: libc-alpha at sourceware dot org
- Date: Mon, 10 Jun 2013 20:49:19 +0200
- Subject: Re: [RFC] Lock elision implementation guidelines
- References: <1360527652 dot 3065 dot 11521 dot camel at triegel dot csb> <1370618459 dot 16968 dot 11300 dot camel at triegel dot csb> <20130610114346 dot GA1384 at linux dot vnet dot ibm dot com>
On Mon, 2013-06-10 at 13:43 +0200, Dominik Vogt wrote:
> > Focus on PNT, allow E
> >
> > Enable LE conservatively. Provide tuning knobs that are not part of the
> > ABI.
> > Pro: Hopefully little harm done by LE implementation deficiencies (need
> > to be conservative with semantics, but performance tuning doesn't need
> > to be perfect right away).
> > Con: No stable way to tune LE until a later time.
>
>
> The attached program suffers from a massive performance loss
> caused by lock elision. It runs two threads; the first one has a
> repeated, big workload, while the second one just increases a
> counter repeatedly. Two mutexes are used.
>
> (The test was run on a zEC12 with the elision patches ported to z
> architecture.)
>
> thread 1:
> ---------
>
> for (...)
> lock A
> lock B
> increment a counter 1
> unlock A
> handle big workload
> increment counter 3
> unlock B
>
> thread 2:
> ---------
>
> forever
> lock A
> increment a counter 2
> lock B
unlock A, I assume
>
> (For completeness, there might be a third thread that runs at very
> low frequency, and uses lock B to query counter 3. That thread
> might not need to be very responsive. I.e. lock B is not
> superfluous.)
>
> In a real-life test, the runtime of the program with elision is
> almost identical to the runtime without elision, using plain locks.
So, "big workload" executes for a long time (i.e., significantly longer
than it would take to try to acquire a lock a couple of times)?
> But in the same amount of time, I get 1247 million increments of
> counter 2 without elision but just 660 million increments _with_ elision.
It's surprising that the difference is so large. If "big workload" runs
for most of the time in thread 1, A should be free most of the time.
Does anything else in "big workload" conflict with what thread 2 does?
Do you have any performance counter data for how often you can run
transactions successfully?
> The explanation for this is that because of lock collisions the
> mutex A is eventually switched to using a real lock instead of an
> elided lock. Thread 1 then elides lock B. From this point I'm
> not entirely sure what happens as I'm new to the nptl code.
If locking for A isn't elided, then thread 1 should unlock A with a
transactional write but then see that thread 2 is waiting, and abort
because it calls the futex_wake syscall. If thread 2 doesn't try to
acquire the lock until after thread 1 has unlocked it transactionally,
thread 1 will also abort.
> But
> either thread 2 waits on lock A until thread 1 releases lock B
I believe that this shouldn't happen, but this depends on how your HTM
does conflict resolution, so I'm not quite sure. The only case I can
think of is if your HTM wouldn't make transactional writes conflict with
concurrent nontransactional reads until commit, and lock A would use
spinning with reads. But normal mutexes don't do this kind of spinning
(at least the x86 and C implementations don't, dunno about z).
> , or
> as thread 2 tries to lock A, it aborts the transaction in thread
> 1.
That should then lead to B switching to no elision, and because "big
workload" is long, the result should be similar to an execution without
elision.
>
> The point is: For certain program logic, lock elision can be
> very bad for performance.
But if the tuning is conservative enough, elision shouldn't remain
active for such workloads. So the key is finding some kind of default
tuning that makes sense.
You cited the "Focus on PNT, allow E" option, but I'm not sure what kind
of conclusions you draw from your test, and how it relates to this
option. Would you like to have a very conservative default tuning? Or
do you think that lock elision should only be enabled explicitly on z
architecture? Or what else?
> P.S.: Test run with default tuning values; I think it switches
> to real locks after three failed transactions, and back to
> transactions after three real locks. I'm quite sure that given
> _any_ set of tuning parameters, I can write you a test program
> that has similarly bad performance using elision.
Certainly not for _any_. For example, if we don't use elision anymore
at all after 10 failed attempts globally, I guess the performance
degradation that elision would cause would be pretty limited :)
Thanks for making the test, and it would be interesting if you can
investigate this further, post more results, and perhaps experiment with
different tuning options. As Andi said, this data is necessary input to
see which steps we need to take to make use of elision.
Torvald