This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Consensus: Tuning runtime behaviour with environment variables.
- From: Rich Felker <dalias at aerifal dot cx>
- To: Alexandre Oliva <aoliva at redhat dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Sun, 2 Jun 2013 22:23:44 -0400
- Subject: Re: Consensus: Tuning runtime behaviour with environment variables.
- References: <51A58A92 dot 4050508 at redhat dot com> <20130529055518 dot GA23030 at domone dot kolej dot mff dot cuni dot cz> <ormwraq3rx dot fsf at livre dot home> <20130601031151 dot GK20323 at brightrain dot aerifal dot cx> <ora9n9i3jc dot fsf at livre dot home> <20130602154150 dot GN20323 at brightrain dot aerifal dot cx> <ortxlgh2an dot fsf at livre dot home> <20130602215358 dot GB29800 at brightrain dot aerifal dot cx> <or38t0dslx dot fsf at livre dot home>
On Sun, Jun 02, 2013 at 08:02:02PM -0300, Alexandre Oliva wrote:
> > The hot path of __tls_get_addr should be just a couple dereferences
> > and branches which are always predicted correctly.
>
> For anyone who didn't know better, it would seem like you're arguing
> that Initial Exec is pointless.
Not pointless, but overrated. And it's not obvious to me that your
optimization is closer to initial-exec in performance than it is to
global-dynamic.
> > If it's slower than that in glibc, that's a bug in glibc.
>
> It's not, but a couple of dereferences and branches is a lot more than
> nothing. With the TLS access model I proposed and implemented, you save
> all of that, including the cost of going through the PLT for the call.
Indeed, these aspects seem more worthwhile than eliminating the body
of __tls_get_addr.
> What you don't save is the cost of a naked call to a function that just
> returns, but that's still a lot less than that plus dereferences plus
> branches plus PLT plus frame setup plus saving and restoring
> call-clobbered registers at the caller, don't you agree?
For certain values of "a lot".
> Phrasing it another way, which of these two scenarios seems faster to
> you?
The one whose _measured_ performance is better. While measurement of
the access itself (e.g. load from TLS and store to a volatile
variable, surrounded by RDTSC) would be very interesting, what really
matters is whether you can find a real-world example where the
performance is measurably different.
Rich
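
A minimal sketch of the kind of micro-measurement described above (load
from TLS, store to a volatile variable, surrounded by RDTSC) might look
like the following. It is an illustration, not code from this thread:
the names tls_counter, sink, read_tls, read_ticks and measure_tls_access
are hypothetical, __thread and __attribute__((noinline)) are GCC/Clang
extensions, and the clock_gettime fallback for non-x86 targets is an
added assumption. Which TLS access model actually gets measured depends
on how the file is built (for example -fPIC or -ftls-model=...) and on
whether the variable lives in a shared library.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>          /* __rdtsc() on GCC and Clang */
#define HAVE_RDTSC 1
#endif

/* Thread-local variable under test (hypothetical name). It is left
   non-static so that a -fPIC build is not free to assume a local model. */
__thread int tls_counter = 42;

/* Volatile sink: every store to it must actually be performed. */
volatile int sink;

/* Kept out of line so the TLS address computation is repeated on every
   call instead of being hoisted out of the timing loop. */
__attribute__((noinline)) int read_tls(void)
{
    return tls_counter;
}

/* Timestamp source: RDTSC where available, otherwise a monotonic clock
   in nanoseconds (an assumption added for portability). */
static uint64_t read_ticks(void)
{
#ifdef HAVE_RDTSC
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Returns the elapsed ticks for `iterations` TLS loads, each followed by
   a store to the volatile sink. The figure includes call and loop
   overhead, so it is only meaningful relative to another build. */
uint64_t measure_tls_access(unsigned long iterations)
{
    assert(iterations > 0);
    uint64_t start = read_ticks();
    for (unsigned long i = 0; i < iterations; i++)
        sink = read_tls();
    uint64_t end = read_ticks();
    return end - start;
}
```

Dividing the result of measure_tls_access(n) by n for builds that differ
only in TLS access model gives a rough per-access comparison. As the
message says, such a number is interesting, but what really matters is
whether a real-world workload shows a measurable difference.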