This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] Compile AVX libm functions with -mavx

From: Matt Turner <mattst88 at gmail dot com>
To: "H.J. Lu" <hjl dot tools at gmail dot com>
Cc: Ondřej Bílka <neleai at seznam dot cz>, Mike Frysinger <vapier at gentoo dot org>, libc-alpha at sourceware dot org
Date: Wed, 3 Oct 2012 13:36:46 -0700
Subject: Re: [PATCH] Compile AVX libm functions with -mavx
References: <20121002135325.GA751@gmail.com> <201210021502.57319.vapier@gentoo.org><CAMe9rOrFsKezFsy-7XOemOk5HRwt8DVQUDGi_oRGAQsRsn7ZJA@mail.gmail.com><201210021531.51494.vapier@gentoo.org> <20121002194701.GA12305@domone.kolej.mff.cuni.cz><CAMe9rOr+7AiFMio0wqTRxf8DmR2dwtCovphYRnP2-hj1ypWpww@mail.gmail.com><CAEdQ38FyamjgtJEtaZZtFmLXaO4_4DVzQaVuXk9nR2Awc_chKg@mail.gmail.com><CAMe9rOra8D7-i-iDVTTD4SCAbw034CcyHOnJ3mxQMRkSVesLuw@mail.gmail.com><CAEdQ38Ey=+LqYb4MKZTfRjQKbBVii3nCWCdc7_snLvEQ9zZcLA@mail.gmail.com><20121003104010.GA14715@domone.kolej.mff.cuni.cz> <CAMe9rOqUSfpvwm2o4-wZL6vqGoCqKUnpkyQQCQZ2-p+zXppj_g@mail.gmail.com>

On Wed, Oct 3, 2012 at 6:12 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Wed, Oct 3, 2012 at 3:40 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> On Tue, Oct 02, 2012 at 10:41:35PM -0700, Matt Turner wrote:
>>> On Tue, Oct 2, 2012 at 4:45 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> > On Tue, Oct 2, 2012 at 4:07 PM, Matt Turner <mattst88@gmail.com> wrote:
>>> >> On Tue, Oct 2, 2012 at 1:19 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> >>> On Tue, Oct 2, 2012 at 12:47 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
>>> >>>>
>>> >>>> Could it be a 60-cycle penalty when switching between legacy SSE and AVX
>>> >>>> state?
>>> >>>
>>> >>> This is true. We can use -mprefer-avx128 to make sure that only 128-bit AVX
>>> >>> instructions are used.
>>> >>>
>>> >>> --
>>> >>> H.J.
>>> >>
>>> >> The latency for switching between old SSE and new (AVX-style
>>> >
>>> > Latency comes from switching between the 128-bit SSE context and
>>> > the 256-bit AVX context.  If we only use the lower 128-bit AVX context,
>>> > there is no latency.
>>>
>>> I'm having a hard time confirming that.
>>>
>>> From pages 53/54 of the pdf -- http://software.intel.com/file/36945 :
>>>
>>> > However, there is a performance impact with intermixing VEX-encoded SIMD
>>> > instructions (AVX, FMA) and legacy SSE instructions that only operate on
>>> > the XMM register state.
>>>
>>> And more to the point:
>>>
>>> > Intermixed 256-bit, 128-bit or scalar SIMD instructions that are encoded
>>> > with VEX prefixes have no transition delay due to internal state management.
>>>
>>> >> 3-operand) form is what causes the penalty. What is the purpose of
>>> >> -mprefer-avx128? I can't find a description of it online.
>>> >
>>> > I just fixed it:
>>> >
>>> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54785
>>> >
>>> > -mprefer-avx128 will avoid 256-bit AVX instructions.  Only 128-bit
>>> > AVX instructions are generated.  It has the same effect on context
>>> > switch as -msse2avx.
>>>
>>> I think that your claim is that legacy 128-bit SSE + 256-bit AVX
>>> produces stalls, but I believe the documentation to say that it's
>>> VEX-prefixed instructions in general (256-bit or otherwise) plus
>>> legacy SSE instructions that lead to stalls.
>>
>> For Intel, a detailed description is in
>> http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
>> chapter 11-3
>>
>> They mention the alternative of adding vzeroupper at the end of each AVX function.
>
> It is not about VEX encoding.  It is about mixing 128-bit SSE instructions,
> which preserve upper 128 bits of YMM registers, with 256-bit AVX instructions.
> If we only use 128-bit AVX instructions, which clear upper 128 bits of YMM
> registers,  upper 128 bits of YMM registers are always zero and no
> vzeroupper is needed.  There is no penalty.
>
> --
> H.J.

This is usually where you'd cite a source, instead of reiterating your claim.

My reading is that, in the presence of legacy SSE instructions, using
only 128-bit AVX instructions will not cause stalls (see "Table 11-2.
State Transitions of Mixing AVX and SSE Code" from the link Ondřej
provided). I suppose that we will still have legacy SSE instructions
in applications and probably in libm, so using -mprefer-avx128 is
correct.

To be clear:
  - Using 128-bit AVX and 256-bit AVX - no stalls
  - Using legacy SSE and 128-bit AVX - no stalls
  - Using legacy SSE and 256-bit AVX - stalls

That does not seem to match the previously quoted paragraph:

> However, there is a performance impact with intermixing VEX-encoded SIMD
> instructions (AVX, FMA) and legacy SSE instructions that only operate on
> the XMM register state.

but I think that paragraph must be wrong, i.e., too imprecise.
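
The three cases listed above can be written down as a small C sketch.
This is only a toy model of the rule as summarized in this message, not
a description of any particular microarchitecture and not code from
glibc or GCC. The names insn_class and has_transition_stall are made up
for illustration, and the only state tracked is an assumed "upper
halves of the YMM registers may be non-zero" flag, set by 256-bit AVX
instructions and cleared by vzeroupper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction classes, named for this sketch only.  */
enum insn_class { LEGACY_SSE, AVX_128, AVX_256, VZEROUPPER };

/* Return true if SEQ would hit a transition stall under the rule
   summarized above: a legacy SSE instruction executes while the
   upper 128 bits of the YMM registers may be non-zero.
   Assumption: only 256-bit AVX instructions dirty the upper halves,
   and vzeroupper makes them clean again.  */
bool
has_transition_stall (const enum insn_class *seq, size_t n)
{
  bool upper_dirty = false;

  assert (seq != NULL || n == 0);

  for (size_t i = 0; i < n; i++)
    switch (seq[i])
      {
      case AVX_256:
        upper_dirty = true;     /* Upper halves now hold live data.  */
        break;
      case VZEROUPPER:
        upper_dirty = false;    /* Upper halves explicitly zeroed.  */
        break;
      case LEGACY_SSE:
        if (upper_dirty)
          return true;          /* Legacy SSE + 256-bit AVX: stall.  */
        break;
      case AVX_128:
        break;                  /* Never stalls in this model.  */
      }

  return false;
}
```

Under this model a sequence of 128-bit AVX, 256-bit AVX, 128-bit AVX
reports no stall, legacy SSE mixed with 128-bit AVX reports no stall,
and 256-bit AVX followed by legacy SSE reports a stall unless a
vzeroupper comes in between.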

Follow-Ups:
- Re: [PATCH] Compile AVX libm functions with -mavx
  - From: H.J. Lu

References:
- [PATCH] Compile AVX libm functions with -mavx
  - From: H.J. Lu
- Re: [PATCH] Compile AVX libm functions with -mavx
  - From: Mike Frysinger
- Re: [PATCH] Compile AVX libm functions with -mavx
  - From: H.J. Lu
- Re: [PATCH] Compile AVX libm functions with -mavx
  - From: Mike Frysinger
- Re: [PATCH] Compile AVX libm functions with -mavx
  - From: Ondřej Bílka
- Re: [PATCH] Compile AVX libm functions with -mavx
  - From: H.J. Lu
- Re: [PATCH] Compile AVX libm functions with -mavx
  - From: Matt Turner
- Re: [PATCH] Compile AVX libm functions with -mavx
  - From: H.J. Lu
- Re: [PATCH] Compile AVX libm functions with -mavx
  - From: Matt Turner
- Re: [PATCH] Compile AVX libm functions with -mavx
  - From: Ondřej Bílka
- Re: [PATCH] Compile AVX libm functions with -mavx
  - From: H.J. Lu
