This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC] Lock elision implementation guidelines


On Wed, Feb 27, 2013 at 03:18:30PM +0100, Torvald Riegel wrote:
> On Fri, 2013-02-22 at 15:02 +0100, Ondřej Bílka wrote:
> > On Wed, Feb 20, 2013 at 09:42:56PM +0100, Torvald Riegel wrote:
> > > On Wed, 2013-02-20 at 11:58 -0500, Rich Felker wrote:
> > > > On Wed, Feb 20, 2013 at 12:21:35PM +0100, Torvald Riegel wrote: 
> > > > > While it's true that the lock acquisition latency for locks that are
> > > > > already in the cache has been substantially decreased in the past, once
> > > > > you use the same locks concurrently with other threads, you get cache
> > > > > misses, and these are costly.  Typically, when you use locks, you use
> > > > > them because you need to synchronize and there can be concurrent actions
> > > > > by other threads.  You can hope that those other threads are not too far
> > > > > away from yourself in terms of the memory hierarchy (i.e., hope for
> > > > > locality), but that's not simple.  But we won't get less concurrency.
> > > > 
> > > > First of all, 99% of users of glibc are using it for LAMP stacks or
> > > > desktop-oriented Linux distributions. High-end computational
> > > > parallelism is in itself a 1% niche. This is the basis for my claim
> > > > that lock elision is a "pet feature".
> > > 
> > > We'll likely get even more parallelism in hardware in the future, so
> > > likely also more parallelism in programs, and this usually results in
> > > more synchronization too.  In fact, the "high-end computational
> > > parallelism" programs (I assume this means well-optimized programs) will
> > > likely often have less synchronization than the average parallel or
> > > concurrent program, I believe.  So lock elision is the more important
> > > the less optimized the synchronization in the program is, I think.
> > > 
> > When you optimize programs, the first thing you look for is which
> > locks can be eliminated by using atomic primitives.
> 
> I suppose you mean the atomic operations provided by current hardware.
> But that has its limits, obviously: Either because those are
> single-location, so you can't use them widely,

Single-location atomics are wide enough to cover most use cases where it
matters.  It is not that hard to fiddle with the data layout until all
changes are made to a single object; then you only need to atomically
swap in a single pointer.
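
For illustration, here is a minimal sketch of that pattern (prepare a
private copy, then publish it by swapping one pointer with a CAS), using
C11 <stdatomic.h>.  The names (struct config, update_config) are made up
for this example, and safe reclamation of the old object is left out:

#include <stdatomic.h>
#include <stdlib.h>

struct config { int a; int b; };

/* Readers only ever see a fully initialized object through this pointer. */
static _Atomic(struct config *) current_cfg;

struct config *get_config(void)
{
    return atomic_load_explicit(&current_cfg, memory_order_acquire);
}

/* Build the new object privately, then publish it with a single CAS. */
int update_config(int new_a, int new_b)
{
    struct config *new = malloc(sizeof *new);
    if (new == NULL)
        return -1;
    new->a = new_a;
    new->b = new_b;

    struct config *old = atomic_load_explicit(&current_cfg,
                                              memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(&current_cfg, &old, new,
                                                  memory_order_release,
                                                  memory_order_relaxed))
        ;  /* the failed CAS refreshed 'old'; just retry */

    /* Freeing 'old' safely needs RCU or hazard pointers; omitted here. */
    return 0;
}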

> or because once you need
> multi-memory-location atomicity, building custom concurrent code for
> this quickly becomes much more complicated.

Not just custom code; building any concurrent code becomes more
complicated.  Reducing the number of moving parts is essential just to
understand what is going on.

> 
> Hence people use abstractions such as locks.  Avoiding locks, or making
> them more coarse-granular, doesn't necessarily improve performance.  In
> essence, you have to use locking properly, and you have to use the right
> kind of locks.  Same as with any other mechanism, be it single-location
> atomic ops or transactions.

You did not mention how difficult it is to choose a mechanism and how
difficult it is to implement it correctly.  If you know what you are
doing, atomics give you the most control and the best performance, but
they are also the hardest to get right.  Transactions are easy to
implement correctly but lack performance.  (That performance gap will
shrink.)

Locks used to be the only tool available, but as languages add
transaction support they will be used less often.
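
To make the trade-off concrete, here is a rough sketch of the same
trivial update written with each of the three mechanisms.  The
transactional variant uses GCC's __transaction_atomic extension (compile
with -fgnu-tm); the counters are separate so each function stands on its
own, and this is an illustration rather than a benchmark:

#include <pthread.h>
#include <stdatomic.h>

static long counter_locked;
static atomic_long counter_atomic;
static long counter_tm;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Easy to reason about, but serializes all threads on one lock. */
void add_locked(long n)
{
    pthread_mutex_lock(&lock);
    counter_locked += n;
    pthread_mutex_unlock(&lock);
}

/* Fastest, but limited to single-location updates. */
void add_atomic(long n)
{
    atomic_fetch_add_explicit(&counter_atomic, n, memory_order_relaxed);
}

/* Easy to write even for multi-location updates; performance depends
   on the TM implementation. */
void add_transactional(long n)
{
    __transaction_atomic { counter_tm += n; }
}
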
> 
> > Optimized programs would benefit 
> > more from better optimized compare and swap.
> 
> I don't think current CAS is a problem; the best-case latency is already
> pretty low compared to past implementations (on x86 at least).  But the
> hardware implementation can't avoid the cache miss costs, so it's up to
> programmers to synchronize in an efficient way.
> 
Of course hardware can avoid the cache-miss cost.  One possible way
would be to add a shared cache and keep data there while the number of
invalidations is above a given threshold.  The real question is whether
that pays for the added wiring/circuitry; since transactional memory
support does, my idea could be profitable too.
Andi, does Intel plan something like that?

