Re: [PATCH RFC V2] Improve 64bit memcpy/memmove for Corei7 with unaligned avx instruction
- From: Ling Ma <ling.ma.program@gmail.com>
- To: Ondřej Bílka <neleai@seznam.cz>
- Cc: Liubov Dmitrieva <liubov.dmitrieva@gmail.com>, GNU C Library <libc-alpha@sourceware.org>, Ma Ling <ling.ml@alibaba-inc.com>
- Date: Fri, 12 Jul 2013 22:23:31 +0800
- Subject: Re: [PATCH RFC V2] Improve 64bit memcpy/memmove for Corei7 with unaligned avx instruction
- References: <1373547096-8095-1-git-send-email-ling.ma.program@gmail.com> <CAHjhQ91fVakxKNkEniz0AL-Srn3kNtLf+5AaB+VHozy5_z5zeA@mail.gmail.com> <20130712032333.GA5839@domone.PAOCY>
>> > +L(256bytesormore):
>> > +
>> > +#ifdef USE_AS_MEMMOVE
>> > + cmp %rsi, %rdi
>> > + jae L(copy_backward)
>> > +#endif
>
> Test with the following condition:
> (uint64_t)((src - dest)-n) < 2*n
> It makes the branch predictable instead of two unpredictable branches.
>
> Also alias memmove_avx to memcpy_avx. They differ only when you copy 256+
> bytes, so the performance penalty of this check can be paid for by halving
> memcpy icache usage alone.
Ling: OK, I will try that in the new version.
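
For illustration, here is a minimal C sketch of the single-branch overlap
idea; the helper name is made up and the expression is one common variant,
not necessarily the exact one quoted above:

#include <stddef.h>
#include <stdint.h>

/* Illustrative only: true when dst falls inside [src, src + n), i.e. a
   plain forward copy would overwrite source bytes it has not yet read.
   One unsigned comparison replaces the usual pair of pointer tests, so
   for memcpy-style non-overlapping callers the branch is almost never
   taken and predicts well.  */
static inline int
copy_backward_needed (const char *dst, const char *src, size_t n)
{
  return (uint64_t) ((uintptr_t) dst - (uintptr_t) src) < n;
}

With memmove_avx aliased to memcpy_avx, this check would sit only on the
256-bytes-or-more path shown above.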
>> > + mov %rdx, %rcx
>> > + rep movsb
>> > + ret
>> > +
> Does Haswell have an optimized movsb? If so, in which size range does it work well?
Ling: rep movsb is good for most cases; Haswell enhanced it to move 32
bytes per cycle. Because memcpy may know the loop count before copying
data, rep movsb seems to use a similar loop-counter concept to avoid
branch prediction misses, and it adaptively prefetches the next loop's
data if the current loop's data is not in the L1 cache.
However, it needs a long time to warm up, so when the size is less than
2048 bytes we choose AVX instructions according to our experiments, and
when the data is larger than the L3 cache it doesn't give us a better
result than non-temporal instructions.
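
To make those thresholds concrete, here is a rough, illustrative C sketch
of that strategy; the 2048-byte cutoff is the experimental number above,
"l3_size" stands in for however the real code learns the last-level cache
size, and none of this is the actual glibc implementation:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Unaligned 32-byte AVX moves for small/medium copies (simplified tail).  */
static void
copy_avx_unaligned (char *dst, const char *src, size_t n)
{
  while (n >= 32)
    {
      __m256i v = _mm256_loadu_si256 ((const __m256i *) src);
      _mm256_storeu_si256 ((__m256i *) dst, v);
      src += 32; dst += 32; n -= 32;
    }
  while (n--)
    *dst++ = *src++;
}

/* Let the rep movsb microcode do the copy; it sees the full count up
   front (x86-64, GCC-style inline asm).  */
static void
copy_rep_movsb (char *dst, const char *src, size_t n)
{
  __asm__ volatile ("rep movsb"
                    : "+D" (dst), "+S" (src), "+c" (n) : : "memory");
}

/* Streaming (non-temporal) stores for copies larger than the L3 cache;
   vmovntdq needs an aligned destination, so align it first.  */
static void
copy_nontemporal (char *dst, const char *src, size_t n)
{
  while (((uintptr_t) dst & 31) && n)
    { *dst++ = *src++; n--; }
  while (n >= 32)
    {
      __m256i v = _mm256_loadu_si256 ((const __m256i *) src);
      _mm256_stream_si256 ((__m256i *) dst, v);
      src += 32; dst += 32; n -= 32;
    }
  _mm_sfence ();
  while (n--)
    *dst++ = *src++;
}

static void
copy_dispatch (char *dst, const char *src, size_t n, size_t l3_size)
{
  if (n < 2048)            /* below the rep movsb warm-up cost */
    copy_avx_unaligned (dst, src, n);
  else if (n <= l3_size)   /* fits in cache: enhanced rep movsb */
    copy_rep_movsb (dst, src, n);
  else                     /* larger than L3: non-temporal stores */
    copy_nontemporal (dst, src, n);
}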
Thanks
Ling