Re: [PATCH RFC V2] Improve 64bit memcpy/memmove for Corei7 with unaligned avx instruction
- From: Ling Ma <ling.ma.program@gmail.com>
- To: Ondřej Bílka <neleai@seznam.cz>
- Cc: Liubov Dmitrieva <liubov.dmitrieva@gmail.com>, GNU C Library <libc-alpha@sourceware.org>, Ma Ling <ling.ml@alibaba-inc.com>
- Date: Fri, 12 Jul 2013 22:23:31 +0800
- Subject: Re: [PATCH RFC V2] Improve 64bit memcpy/memmove for Corei7 with unaligned avx instruction
- References: <1373547096-8095-1-git-send-email-ling.ma.program@gmail.com> <CAHjhQ91fVakxKNkEniz0AL-Srn3kNtLf+5AaB+VHozy5_z5zeA@mail.gmail.com> <20130712032333.GA5839@domone.PAOCY>
>> > +L(256bytesormore):
>> > +
>> > +#ifdef USE_AS_MEMMOVE
>> > + cmp %rsi, %rdi
>> > + jae L(copy_backward)
>> > +#endif
>
> Test with the following condition:
> (uint64_t)((src - dest)-n) < 2*n
> It makes the branch predictable instead of two unpredictable branches.
>
> Also alias memmove_avx to memcpy_avx. They differ only when you copy 256+
> bytes, so the performance penalty of this check can be paid for by halving
> memcpy icache usage alone.
Ling: OK, I will try that in the new version.
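
For illustration, here is a minimal C sketch of the single-branch overlap
idea; the helper name is made up and the expression is one common variant,
not necessarily the exact one quoted above:

#include <stddef.h>
#include <stdint.h>

/* Illustrative only: true when dst falls inside [src, src + n), i.e. a
   plain forward copy would overwrite source bytes it has not yet read.
   One unsigned comparison replaces the usual pair of pointer tests, so
   for memcpy-style non-overlapping callers the branch is almost never
   taken and predicts well.  */
static inline int
copy_backward_needed (const char *dst, const char *src, size_t n)
{
  return (uint64_t) ((uintptr_t) dst - (uintptr_t) src) < n;
}

With memmove_avx aliased to memcpy_avx, this check would sit only on the
256-bytes-or-more path shown above.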
>> > + mov %rdx, %rcx
>> > + rep movsb
>> > + ret
>> > +
> Does Haswell have an optimized movsb? If so, in which size range does it work well?
Ling: rep movsb is good for most cases; Haswell enhanced it to move 32
bytes per cycle. Because memcpy may know the loop count before copying
data, rep movsb seems to use a similar loop-counter concept to avoid
branch prediction misses, and it adaptively prefetches the next loop's
data if the current loop's data is not in the L1 cache.
However, it needs a long time to warm up, so when the size is less than
2048 bytes we choose AVX instructions according to our experiments, and
when the data is larger than the L3 cache it doesn't give us a better
result than non-temporal instructions.
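
To make those thresholds concrete, here is a rough, illustrative C sketch
of that strategy; the 2048-byte cutoff is the experimental number above,
"l3_size" stands in for however the real code learns the last-level cache
size, and none of this is the actual glibc implementation:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Unaligned 32-byte AVX moves for small/medium copies (simplified tail).  */
static void
copy_avx_unaligned (char *dst, const char *src, size_t n)
{
  while (n >= 32)
    {
      __m256i v = _mm256_loadu_si256 ((const __m256i *) src);
      _mm256_storeu_si256 ((__m256i *) dst, v);
      src += 32; dst += 32; n -= 32;
    }
  while (n--)
    *dst++ = *src++;
}

/* Let the rep movsb microcode do the copy; it sees the full count up
   front (x86-64, GCC-style inline asm).  */
static void
copy_rep_movsb (char *dst, const char *src, size_t n)
{
  __asm__ volatile ("rep movsb"
                    : "+D" (dst), "+S" (src), "+c" (n) : : "memory");
}

/* Streaming (non-temporal) stores for copies larger than the L3 cache;
   vmovntdq needs an aligned destination, so align it first.  */
static void
copy_nontemporal (char *dst, const char *src, size_t n)
{
  while (((uintptr_t) dst & 31) && n)
    { *dst++ = *src++; n--; }
  while (n >= 32)
    {
      __m256i v = _mm256_loadu_si256 ((const __m256i *) src);
      _mm256_stream_si256 ((__m256i *) dst, v);
      src += 32; dst += 32; n -= 32;
    }
  _mm_sfence ();
  while (n--)
    *dst++ = *src++;
}

static void
copy_dispatch (char *dst, const char *src, size_t n, size_t l3_size)
{
  if (n < 2048)            /* below the rep movsb warm-up cost */
    copy_avx_unaligned (dst, src, n);
  else if (n <= l3_size)   /* fits in cache: enhanced rep movsb */
    copy_rep_movsb (dst, src, n);
  else                     /* larger than L3: non-temporal stores */
    copy_nontemporal (dst, src, n);
}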
Thanks
Ling