This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Compile AVX libm functions with -mavx


On Wed, Oct 3, 2012 at 1:36 PM, Matt Turner <mattst88@gmail.com> wrote:
> On Wed, Oct 3, 2012 at 6:12 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Wed, Oct 3, 2012 at 3:40 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>>> On Tue, Oct 02, 2012 at 10:41:35PM -0700, Matt Turner wrote:
>>>> On Tue, Oct 2, 2012 at 4:45 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>> > On Tue, Oct 2, 2012 at 4:07 PM, Matt Turner <mattst88@gmail.com> wrote:
>>>> >> On Tue, Oct 2, 2012 at 1:19 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>> >>> On Tue, Oct 2, 2012 at 12:47 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
>>>> >>>>
>>>> >>>> Could it be a 60-cycle penalty when switching between legacy SSE and AVX
>>>> >>>> state?
>>>> >>>
>>>> >>> This is true. We can use -mprefer-avx128 to make sure that only 128-bit AVX
>>>> >>> instructions are used.
>>>> >>>
>>>> >>> --
>>>> >>> H.J.
>>>> >>
>>>> >> The latency for switching between old SSE and new (AVX-style
>>>> >
>>>> > Latency comes from switching between the 128-bit SSE context and
>>>> > the 256-bit AVX context.  If we only use the lower 128-bit AVX context,
>>>> > there is no latency.
>>>>
>>>> I'm having a hard time confirming that.
>>>>
>>>> From pages 53/54 of the PDF -- http://software.intel.com/file/36945 :
>>>>
>>>> > However, there is a performance impact with intermixing VEX-encoded SIMD
>>>> > instructions (AVX, FMA) and legacy SSE instructions that only operate on
>>>> > the XMM register state.
>>>>
>>>> And more to the point:
>>>>
>>>> > Intermixed 256-bit, 128-bit or scalar SIMD instructions that are encoded
>>>> > with VEX prefixes have no transition delay due to internal state management.
>>>>
>>>> >> 3-operand) form is what causes the penalty. What is the purpose of
>>>> >> -mprefer-avx128? I can't find a description of it online.
>>>> >
>>>> > I just fixed it:
>>>> >
>>>> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54785
>>>> >
>>>> > -mprefer-avx128 will avoid 256-bit AVX instructions.  Only 128-bit
>>>> > AVX instructions are generated.  It has the same effect on context
>>>> > switch as -msse2avx.
>>>>
>>>> I think that your claim is that legacy 128-bit SSE + 256-bit AVX
>>>> produces stalls, but I believe the documentation to say that it's
>>>> VEX-prefixed instructions in general (256-bit or otherwise) plus
>>>> legacy SSE instructions that lead to stalls.
>>>
>>> For Intel, a detailed description is in
>>> http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
>>> chapter 11-3
>>>
>>> They mention the alternative of adding vzeroupper at the end of each AVX function.
>>
>> It is not about VEX encoding.  It is about mixing 128-bit SSE instructions,
>> which preserve upper 128 bits of YMM registers, with 256-bit AVX instructions.
>> If we only use 128-bit AVX instructions, which clear upper 128 bits of YMM
>> registers,  upper 128 bits of YMM registers are always zero and no
>> vzeroupper is needed.  There is no penalty.
>>
>> --
>> H.J.
>
> This is usually where you'd cite a source, instead of reiterating your claim.
>
> My reading is that, in the presence of legacy SSE instructions,
> using only 128-bit AVX instructions will not cause stalls ("Table
> 11-2. State Transitions of Mixing AVX and SSE Code" from the link
> Ondřej provided). I suppose that we will still have legacy SSE
> instructions in applications and probably in libm, so using
> -mprefer-avx128 is correct.
>
> To be clear
>   - Using 128-bit AVX and 256-bit AVX - no stalls
>   - Using legacy SSE and 128-bit AVX - no stalls
>   - Using legacy SSE and 256-bit AVX - stalls
>
> That does not seem to match the previously quoted paragraph:
>
>> However, there is a performance impact with intermixing VEX-encoded SIMD
>> instructions (AVX, FMA) and legacy SSE instructions that only operate on
>> the XMM register state.
>
> but I think that paragraph must be wrong, i.e., too imprecise.

That is correct.  I went back and forth with Intel AVX people on
this when I was implementing -mvzeroupper in GCC.  We only
track the upper 128 bits of YMM registers when emitting vzeroupper
instructions.  Otherwise, we would generate vzeroupper all over
the place when -mavx is used.

-- 
H.J.
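
The three cases listed above can be sketched as a toy state machine in C.
This is only an illustration of the behaviour described in this thread, not
Intel's table reproduced verbatim and not what GCC implements.  The names
(insn, upper_state, count_stalls) are hypothetical, and the three-state
structure, including what vzeroupper does in the "saved" state, is an
assumption made for the sketch.

```c
#include <stddef.h>

/* Hypothetical instruction classes for this toy model. */
enum insn { LEGACY_SSE, AVX_128, AVX_256, VZEROUPPER };

/* Assumed states of the upper 128 bits of the YMM registers. */
enum upper_state { UPPER_CLEAN, UPPER_DIRTY, UPPER_SAVED };

/* Count the modelled transition penalties in a sequence of instructions. */
static int count_stalls(const enum insn *seq, size_t n)
{
    enum upper_state st = UPPER_CLEAN;
    int stalls = 0;

    for (size_t i = 0; i < n; i++) {
        switch (seq[i]) {
        case AVX_256:
            /* Leaving the saved state costs a penalty (assumption). */
            if (st == UPPER_SAVED)
                stalls++;
            /* A 256-bit instruction makes the upper halves dirty. */
            st = UPPER_DIRTY;
            break;
        case AVX_128:
            /* Harmless unless the upper halves were saved earlier. */
            if (st == UPPER_SAVED) {
                stalls++;
                st = UPPER_DIRTY;
            }
            break;
        case LEGACY_SSE:
            /* Legacy SSE with dirty upper halves is the stalling case. */
            if (st == UPPER_DIRTY) {
                stalls++;
                st = UPPER_SAVED;
            }
            break;
        case VZEROUPPER:
            /* Zeroing the upper halves returns to the clean state. */
            st = UPPER_CLEAN;
            break;
        }
    }
    return stalls;
}
```

In this model a sequence containing only LEGACY_SSE and AVX_128 never leaves
UPPER_CLEAN, which is the argument for -mprefer-avx128 made above.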

