This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Ling Ma <ling dot ma dot program at gmail dot com>
- Cc: libc-alpha at sourceware dot org, liubov dot dmitrieva at gmail dot com
- Date: Tue, 30 Jul 2013 16:22:55 +0200
- Subject: Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction
- References: <CAOGi=dMfjBWkFOhUh7QjBM=XiJqkP+6sEsVSHgz+=wC9z1+O=w at mail dot gmail dot com> <20130730071521 dot GA8596 at domone dot kolej dot mff dot cuni dot cz> <20130730071717 dot GA8741 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dOCH41BCXY+yN7_w4Ed4DCAHQKJMvJhKUs-pi3EkxHp=g at mail dot gmail dot com> <20130730113445 dot GA4577 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dMPnGq_35r9TmTHkPn6oS-kbjb=eFmFWQL+N9DBMreu-A at mail dot gmail dot com> <CAOGi=dNEnXMAR61U2+Qk_=VQ7v9yi771PDAoKeBjEYSDUYBLHA at mail dot gmail dot com>
On Tue, Jul 30, 2013 at 08:49:58PM +0800, Ling Ma wrote:
> Ljuba, could you please test our patch on Haswell with the gcc.403 benchmark we sent?
> I will also test it to compare the variants without prefetch, with
> prefetchw, and with prefetcht0.
> The gcc.403 benchmark should be more reliable and stringent.
>
Are you sure? In my test cases memset_big and memset_hash we have seen a 30%
performance regression.
The memory access pattern of memset is nearly identical in both cases.
When we run them with your tool, it should find the regression. Otherwise it
does not report data that reflects reality.
Ljuba, could you try testing them? First you need to compile the files:
gcc -O2 memset_big.c -o memset_big
gcc -O2 memset_hash.c -o memset_hash
Then at step 12 in readme.txt also run:
./memset_big
./memset_hash
We want to minimize the running time of programs. The best way to measure it is to
measure how long the program took to complete.
This has the major disadvantage that for deterministic programs you need to
run them for days to reduce noise and get statistically significant
results.
A simplification that I made, and that you also make, is to measure only the time
spent in the function that changed instead of the entire running time. When you can be
sure that your modifications do not change the running time of the rest of the
program much, you get results much faster and for a much wider
range of programs.
This is not the case here, as prefetching changes memory layout, which changes
the running time of the rest of the program, so only the first alternative is advisable.