This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction


On Mon, Jun 10, 2013 at 09:28:30PM +0800, Ling Ma wrote:
> CPU2006 benchmark is very hard to improve so that the above 5%
> improvement for single core may become the goal of next generation
> CPU, and the improvement number is much less for benchmark specjbb. We
> hardly accept above 1% improvement of those industry benchmarks only
> for optimized memcpy_avx2 even though it is the fastest.
> 
memcpy_avx2 is not the fastest; it is about 35% slower, and we already
know that. I wrote this here:

> On Thu, Jun 06, 2013 at 02:55:12PM +0200, Ondřej Bílka wrote:
> > These results show that your patch is 35% slower for gcc see following 
> > line.
> > 
> > Time ratio to fastest:
> > memcpy_glibc: 134.517062% memcpy_new_small: 100.000000% memcpy_new:
> > 101.120206% __memcpy_avx2: 136.926079%

Also, from the 403.gcc page:

"
Benchmark Description

403.gcc is based on gcc Version 3.2. It generates code for an AMD
Opteron processor. The benchmark runs as a compiler with many of its
optimization flags enabled. 
"

It is not clear how this measures memcpy. I do not know what percentage
of the SPEC test's time is spent in memcpy, but in my experience it is at
most a few percent. The noise from the rest of the code is much larger
than that.

You can test this by running the benchmark 100 times and then applying
Student's t-test (https://en.wikipedia.org/wiki/Student%27s_t-test)
to see whether you get statistically significant results, which I doubt.
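
As a rough sketch of that check, the script below applies a pooled
two-sample Student's t-test to the three runs per variant quoted further
down in this thread. The function name student_t and the hard-coded
critical value 2.776 (two-sided, alpha = 0.05, 4 degrees of freedom, from
a standard t table) are assumptions made for illustration; with 100 runs
per variant the degrees of freedom and the critical value would differ.

```python
import math
import statistics

# Scores quoted in this thread (higher is better), three runs per variant.
memcpy_new = [67.63718, 66.899156, 66.982456]
memcpy_avx2 = [66.805236, 67.29362, 67.63718]


def student_t(a, b):
    """Return (t, df) for a pooled-variance two-sample Student's t-test."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    std_err = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (statistics.mean(b) - statistics.mean(a)) / std_err, df


t, df = student_t(memcpy_new, memcpy_avx2)

# Two-sided critical value for alpha = 0.05 and df = 4 (assumed table value).
T_CRIT_DF4 = 2.776
significant = abs(t) > T_CRIT_DF4

print(f"t = {t:.3f}, df = {df}, significant at 5%: {significant}")
```

On these six numbers |t| is far below the critical value, so the
difference between the two variants cannot be distinguished from noise.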


> we presented the results because of 2 reasons:
> 1) Haswell CPU has full capability of handling indirect jump
> instruction in memcpy_avx2 in real-world scenario.
> 2)if we continue to test the benchmark for more times, we will find
> which is better. For example we can test memcpy_avx2, memcpy_new over
> 3 times respectively , if we find which has more times of better
> results, although the difference is very small, the stable results can
> give us the right answer.
> 
> Thanks
> Ling
> 
> 
> 
> 
> 2013/6/10, Andreas Jaeger <aj@suse.com>:
> > On 06/10/2013 08:17 AM, Ling Ma wrote:
> >> Last week, we separated 403.gcc from cpu2006 benchmark and compiled
> >> with additional option -mstringop-strategy=libcall to avoid rep_4byte,
> >> rep_8byte, rep_byte that use rep movs instructions. 403.gcc has plenty
> >> of branch instructions, and is very sensitive for branch prediction
> >> miss rate. Currently we are concerning about whether memcpy_avx2 cause
> >> more branch prediction miss over benefit from it in real world
> >> scenario, so 403.gcc will help us to verify it.
> >>
> >> We tested 403.gcc linked with memcpy_new, 403.gcc linked with
> >> memcpy_avx2 for 3 times respectively:
> >>
> >> 403.gcc for memcpy_new results are below: (bigger and better)
> >> 1) 67.63718
> >> 2) 66.899156
> >> 3) 66.982456
> >>
> >> 403.gcc for memcpy_avx2 results are below:
> >>
> >> 1) 66.805236
> >> 2) 67.29362
> >> 3) 67.63718
> >>
> >> Above comparison results indicate memcpy_avx2 seem to be better,
> >> and we would like to do more experiments.
> >
> >
> > If I take the arithmetic mean of these I get:
> > 67.17293066666666666666 vs 67.24534866666666666666
> >
> > That's far less than 1 percent - so not conclusive at all,
> >
> > Andreas
> > --
> >   Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
> >    SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
> >     GF: Jeff Hawn,Jennifer Guild,Felix Imendörffer,HRB16746 (AG Nürnberg)
> >      GPG fingerprint = 93A3 365E CE47 B889 DF7F  FED1 389A 563C C272 A126
> >

