This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Ling Ma <ling dot ma dot program at gmail dot com>
- Cc: libc-alpha at sourceware dot org, Ling <ling dot ml at alibaba-inc dot com>, hongjiu dot lu at intel dot com
- Date: Fri, 7 Jun 2013 12:46:13 +0200
- Subject: Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction
- References: <1370424188-4259-1-git-send-email-ling dot ml at alibaba-inc dot com> <20130605121816 dot GA11269 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dMiD=_Qf1EJ=F3hfyQDtQubDEC5pjpXKDCHrUQwhr=vzg at mail dot gmail dot com> <20130605161954 dot GA26401 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dPWPaX5prcL-uAaqS6=_ehzKeBmAFMdwV6aU34jZ0eHtQ at mail dot gmail dot com> <20130606125511 dot GA28565 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dPs9geCtrWhU1L_0DEfOWOknpzFSLmYs4gbYzGX8Zn5Hg at mail dot gmail dot com>
On Thu, Jun 06, 2013 at 08:11:15PM +0800, Ling Ma wrote:
> (To keep mail thread consistent, send again with this email address )
> Hi Ondra,
>
> Thanks for your correction!
> I'm always using test-memcpy.c from glibc to check and compare
> performance before today, based on it we find the best result and send
> out our patch, currently we should discard it?
> Soon I will test those functions with your profile and other release versions.
> If I was wrong, please correct me.
>
> Thanks
> Ling
>
Yes, it is as you wrote.
I had some afterthoughts on how to improve memcpy/memset.
The first is to copy in the backward direction. It may be more cache
friendly: recently constructed data has its end in L1 cache, and a
backward copy finishes with the start of the buffer in L1 cache, which
is more likely to be accessed next.
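
A minimal sketch of the backward-copy idea in plain C, assuming
non-overlapping buffers (memcpy semantics). The name copy_backward is
hypothetical and this is not the glibc implementation; a real version
would use wide vector loads and stores instead of 8-byte words.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not glibc code): copy n bytes from src to dst,
   starting at the highest address and walking down, so the last
   cache lines touched are the ones at the start of the buffers.
   Assumes the two regions do not overlap.  */
void *copy_backward(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst + n;
    const unsigned char *s = (const unsigned char *)src + n;

    /* Main loop: move 8 bytes at a time from the tail toward the head.
       The fixed-size memcpy calls are a portable way to express an
       unaligned 8-byte load and store.  */
    while (n >= 8) {
        uint64_t word;
        d -= 8;
        s -= 8;
        n -= 8;
        memcpy(&word, s, 8);
        memcpy(d, &word, 8);
    }

    /* Remaining 0..7 bytes at the very start of the buffers.  */
    while (n > 0) {
        --d;
        --s;
        --n;
        *d = *s;
    }
    return dst;
}
```

Whether this direction actually wins depends on how the caller touches
the data afterwards, so it needs to be measured rather than assumed.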
The second is to look at how effective prefetches are on Haswell.
I did not add prefetching because I cannot do that generically. For a
single architecture I could determine whether it helps and the size from
which it helps. This was too chaotic to generalize.
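
A rough sketch of what a prefetching copy loop could look like, to make
the tuning problem concrete. The function name, the 64-byte chunk and
the 256-byte prefetch distance are all assumptions for illustration;
__builtin_prefetch is the GCC/Clang builtin, and the right distance (or
whether to prefetch at all) has to be measured per CPU.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 64            /* assumed: one cache line per iteration */
#define PREFETCH_AHEAD 256  /* assumed distance; must be tuned per CPU */

/* Hypothetical helper (not glibc code): forward copy that hints the
   hardware to start loading source data PREFETCH_AHEAD bytes ahead.
   Assumes the two regions do not overlap.  */
void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i = 0;

    for (; i + CHUNK <= n; i += CHUNK) {
#if defined(__GNUC__) || defined(__clang__)
        /* Only prefetch addresses that are still inside the source.  */
        if (n - i > PREFETCH_AHEAD)
            __builtin_prefetch(s + i + PREFETCH_AHEAD, 0, 0);
#endif
        memcpy(d + i, s + i, CHUNK);
    }

    /* Tail shorter than one chunk.  */
    memcpy(d + i, s + i, n - i);
    return dst;
}
```

The two constants are exactly the part that does not generalize: a
value that helps on one microarchitecture can be neutral or harmful on
another.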
You might also try to improve strlen with avx2.
I tried only a simple variant that did everything with avx2, and it
turned out to be asymptotically better but had worse overhead due to the
higher avx2 latency.
I guess an sse header with an avx2 loop should be better. I use a similar
benchmark at
kam.mff.cuni.cz/~ondra/strlen_profile.tar.bz2
Ondra