This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Ling Ma <ling dot ma dot program at gmail dot com>
- Cc: Nix <nix at esperi dot org dot uk>, libc-alpha at sourceware dot org, hongjiu dot lu at intel dot com
- Date: Fri, 7 Jun 2013 18:07:49 +0200
- Subject: Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction
- References: <1370424188-4259-1-git-send-email-ling dot ml at alibaba-inc dot com> <20130605121816 dot GA11269 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dMiD=_Qf1EJ=F3hfyQDtQubDEC5pjpXKDCHrUQwhr=vzg at mail dot gmail dot com> <20130605161954 dot GA26401 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dPWPaX5prcL-uAaqS6=_ehzKeBmAFMdwV6aU34jZ0eHtQ at mail dot gmail dot com> <20130606125511 dot GA28565 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dPs9geCtrWhU1L_0DEfOWOknpzFSLmYs4gbYzGX8Zn5Hg at mail dot gmail dot com> <20130607104613 dot GA6343 at domone dot kolej dot mff dot cuni dot cz> <8761xqru5w dot fsf at spindle dot srvr dot nix> <CAOGi=dMV5jaS2597cksd0mW84UDd06SovsBkL5=WPez-jZWg4g at mail dot gmail dot com>
On Fri, Jun 07, 2013 at 09:37:22PM +0800, Ling Ma wrote:
> Hi Ondra,
> If we prefer backward copy, it will cause false memory dependence
> and impact our performance as we mentioned above.
>
> Today we introduce libmicro-0.4.2 (https://java.net/projects/libmicro/)
> and it can help us to measure performance more precisely.
>
> Based on the result we changed the code and got better performance, as
> compare.html shows (memcpy-avx2.S execution time is on the right, that
> of memcpy_new.s is on the left). Anyone who has a Haswell machine can
> test as below:
> 1) tar xjvf libmicro-memcpy.tar.bz2
> 2) cd libmicro-memcpy
> 3) make clean; make
> 4) ./memcpy-test-avx2-bench &>memcpy-avx2-output (result from memcpy-avx2.S)
> 5) ./memcpy-test-new-bench &>memcpy-new-output (result from memcpy_new.s)
> 6) ./multiview memcpy-new-output memcpy-avx2-output >compare.html
> (memcpy_new.s result is on the left, memcpy-avx2.S result is on the
> right )
> The compare.html shows the comparison result.
> Tomorrow we will try to use vtune, then send out comparison result if
> time is available.
>
That benchmark is wrong in several ways.
First, it does not randomize size in any way. With a fixed size the
branches are always predicted, and as branch prediction can account for
20% of the time, the results you get will be 20% off.
The same applies to alignment: it needs to be randomized, otherwise you
lose part of the performance profile. Setting alignment by a config
variable is pointless as it will only distinguish aligned/unaligned.
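
A minimal sketch of randomized inputs, under stated assumptions: the
helper name random_copy, the bounds MAX_SIZE and MAX_ALIGN, and the use
of rand() are illustrative choices, not taken from libmicro or from the
patch. Each call draws a fresh size and fresh source/destination
offsets, so neither the size branches nor the alignment paths stay fixed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SIZE  4096   /* assumed upper bound on the copy size */
#define MAX_ALIGN 64     /* assumed: offsets cover one cache line */

/* Hypothetical helper: one memcpy call with a random size and random
   source/destination misalignment.  src_buf and dst_buf must each hold
   at least MAX_SIZE + MAX_ALIGN bytes.  Returns the size copied.  */
static size_t
random_copy (char *dst_buf, const char *src_buf)
{
  assert (dst_buf != NULL && src_buf != NULL);

  size_t size = (size_t) rand () % (MAX_SIZE + 1);
  size_t src_off = (size_t) rand () % MAX_ALIGN;
  size_t dst_off = (size_t) rand () % MAX_ALIGN;

  memcpy (dst_buf + dst_off, src_buf + src_off, size);
  return size;
}
```

A uniform size distribution is only one option; a harness could just as
well draw sizes from a distribution recorded from real programs.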
Then we move to aggregation of results.
It tests a single implementation at a time, which is wrong. The runtime
of a process depends on many variables and you introduce bias by doing this.
For example, while you ran
./memcpy-test-avx2-bench
the CPU frequency could be 800MHz;
then in
./memcpy-test-new-bench
the governor can decide to switch to 2.5GHz, making the results above
look about three times worse than they are.
Anything else that you do on the computer can similarly affect these.
The proper way is to test both of them at once and randomize which gets
selected.
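
A rough sketch of such an interleaved comparison follows. It assumes
POSIX clock_gettime is available; the names copy_fn, struct result,
now_ns and compare are hypothetical and not part of libmicro or glibc.
Both candidates run in the same process and one is picked at random for
every call, so a frequency change or other system noise affects both
sides alike. Timing each call separately adds timer overhead that can
dwarf a small copy; a real harness would time batches or use a cycle
counter.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn) (void *, const void *, size_t);

struct result { uint64_t ns; uint64_t calls; };

/* Monotonic clock reading in nanoseconds.  */
static uint64_t
now_ns (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return (uint64_t) ts.tv_sec * UINT64_C (1000000000) + (uint64_t) ts.tv_nsec;
}

/* Time two implementations in one run.  Each iteration picks one of them
   at random and a random size in [0, max_size]; dst and src must each
   hold at least max_size bytes.  Totals are accumulated per candidate.  */
static void
compare (copy_fn impl[2], struct result res[2],
         char *dst, const char *src, size_t max_size, long iterations)
{
  assert (impl[0] != NULL && impl[1] != NULL);

  res[0] = res[1] = (struct result) { 0, 0 };
  for (long i = 0; i < iterations; i++)
    {
      int which = rand () % 2;
      size_t size = (size_t) rand () % (max_size + 1);

      uint64_t start = now_ns ();
      impl[which] (dst, src, size);
      res[which].ns += now_ns () - start;
      res[which].calls++;
    }
}
```

To compare the two routines from the patch, impl[0] and impl[1] would be
set to whatever symbols memcpy_new.s and memcpy-avx2.S export; dividing
res[i].ns by res[i].calls then gives a mean time per call for each.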
Please post a comparison with all those issues fixed.