This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Ling Ma <ling dot ma dot program at gmail dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 30 Jul 2013 09:15:21 +0200
- Subject: Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction
- References: <CAOGi=dMfjBWkFOhUh7QjBM=XiJqkP+6sEsVSHgz+=wC9z1+O=w at mail dot gmail dot com>
On Tue, Jul 16, 2013 at 09:35:39PM +0800, Ling Ma wrote:
> >> + sub $0x80, %rdx
> >> +L(gobble_128_loop):
> >> + prefetcht0 0x1c0(%rdi)
> >> + vmovaps %ymm0, (%rdi)
> >> + prefetcht0 0x280(%rdi)
> > you should not be so aggressive with prefetches when you know how much data
> > you use. This fetches unnecessary data, which can double cache usage and
> > generally slow us down.
> Haswell can issue 2 loads & 1 store in one cycle, so we can use it
> to prefetch our data if the data is not in cache. Even when the data is
> already in L1 cache, this does not hurt performance; our experiments
> also confirmed it.
>
Those experiments proved nothing. The benchmark below shows that
prefetching data that you do not need can degrade performance by 30%. It
is simple to fix, so you should do it.
Otherwise you need to prove that the benefits of prefetching are bigger
than the risk described above.
Whole-program benchmarking is the only way to do that. Measuring only the
time spent in memset is not acceptable here because of scenarios like this:
1. Compute something that occupies 1/2 of the L1 cache.
2. Do a lot of memsets to initialize structures.
   nonprefetching: memset fills the second 1/2 of the L1 cache
   prefetching: memset fills the whole L1 cache, evicting the data from 1.
3. Compute something with the data from 1.
The time spent in step 2 is nearly identical in both cases, yet
when we account for the time spent in steps 1 and 3, the prefetching
variant comes out worse than the nonprefetching one.
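The three steps above can be sketched as a small harness that times the
whole loop rather than memset alone. This is only an illustration, not the
benchmark that produced the numbers in this mail: the names run_scenario and
sum_hot, the 32 KiB L1 size and the 64-byte cache line are all assumptions,
and the memset under test is passed in as a function pointer.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assumptions: 32 KiB L1 data cache, 64-byte lines. Adjust per machine. */
#define L1_SIZE   (32 * 1024)
#define LINE_SIZE 64
#define HOT_SIZE  (L1_SIZE / 2)   /* step 1 working set: half of L1 */
#define FILL_SIZE (L1_SIZE / 2)   /* step 2 memset target: the other half */

typedef void *(*memset_fn)(void *, int, size_t);

/* Steps 1 and 3: touch one byte in every cache line of the hot data. */
static uint64_t sum_hot(const unsigned char *hot)
{
    uint64_t s = 0;
    for (size_t i = 0; i < HOT_SIZE; i += LINE_SIZE)
        s += hot[i];
    return s;
}

/* Runs steps 1-3 `rounds` times with the given memset implementation.
   Returns elapsed seconds for the whole loop (not only the memset calls),
   or -1.0 on allocation failure. The checksum keeps the work observable. */
static double run_scenario(memset_fn set, int rounds, uint64_t *checksum)
{
    assert(rounds > 0);
    unsigned char *hot = malloc(HOT_SIZE);
    unsigned char *fill = malloc(FILL_SIZE);
    if (!hot || !fill) {
        free(hot);
        free(fill);
        return -1.0;
    }
    memset(hot, 1, HOT_SIZE);

    struct timespec t0, t1;
    uint64_t s = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < rounds; r++) {
        s += sum_hot(hot);               /* step 1: data now in L1 */
        set(fill, r & 0xff, FILL_SIZE);  /* step 2: memset under test */
        s += sum_hot(hot);               /* step 3: reuse data from step 1 */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *checksum = s + fill[0];
    free(hot);
    free(fill);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling run_scenario once with each memset variant and a large round count
gives two totals to compare. If a variant prefetches past the end of the
filled region, the cost should show up in step 3, which a memset-only timer
never sees.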
size: 32000
0.29 0.29
0.29 0.29
0.29 0.29
0.29 0.29
0.29 0.29
0.29 0.29
0.29 0.29
0.29 0.29
0.29 0.29
0.29 0.29
size: 256000
0.33 0.33
0.33 0.33
0.33 0.33
0.33 0.33
0.33 0.33
0.33 0.33
0.33 0.33
0.33 0.33
0.33 0.33
0.33 0.33
size: 1024000
0.35 0.36
0.35 0.36
0.35 0.36
0.35 0.36
0.35 0.36
0.35 0.36
0.35 0.36
0.35 0.36
0.35 0.36
0.35 0.36
size: 204800
0.34 0.35
0.34 0.35
0.34 0.35
0.34 0.35
0.35 0.35
0.34 0.35
0.34 0.35
0.34 0.35
0.34 0.35
0.34 0.35
size: 4048000
0.67 0.81
0.67 0.79
0.67 0.80
0.67 0.79
0.67 0.80
0.67 0.80
0.67 0.81
0.67 0.81
0.68 0.81
0.68 0.80
size: 8096000
1.00 1.33
1.00 1.33
0.99 1.33
0.99 1.33
0.99 1.33
0.99 1.33
1.00 1.34
0.99 1.33
0.99 1.33
0.99 1.33
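To relate these figures to the 30% claim, the relative slowdown between the
two columns can be computed directly. The output does not label its columns;
the helper below (a hypothetical name, slowdown_percent) only assumes that
the two numbers on a line are timings of the two variants for the same size.

```c
/* Relative slowdown of `slow` versus `fast`, in percent. */
static double slowdown_percent(double fast, double slow)
{
    return (slow / fast - 1.0) * 100.0;
}
```

For the 8096000 line "1.00 1.33" this gives about 33%, and for a 4048000
line "0.67 0.80" about 19%. At 256000 and below the two columns are equal.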