This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction


On Tue, Jul 30, 2013 at 05:26:09PM +0800, Ling Ma wrote:
> We never found prefetcht1 to be a good instruction for prefetching
> data on Core2, Nehalem, Sandy Bridge, or Haswell. Our experiments
> show prefetchw is best in our cases.

But your code was the following:

+L(gobble_128_loop):
+       prefetcht0      0x1c0(%rdi)
+       vmovaps %ymm0, (%rdi)
+       prefetcht0      0x280(%rdi)
+       vmovaps %ymm0, 0x20(%rdi)
+       vmovaps %ymm0, 0x40(%rdi)
+       vmovaps %ymm0, 0x60(%rdi)
+       lea     0x80(%rdi), %rdi
+       sub     $0x80, %rdx
+       jae     L(gobble_128_loop)

That code uses prefetcht0. (The prefetcht1 in my benchmark was a typo.)
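
In C terms, the difference is just the read/write hint passed to the
compiler's prefetch builtin. A minimal sketch, assuming GCC's
__builtin_prefetch (the set_bytes helper and the scalar inner loop are
illustrative, not the actual memset):

#include <stddef.h>

/* Sketch of the store loop above with a software prefetch ahead of
   the stores.  With GCC, __builtin_prefetch (p, 1, 3) requests a
   write prefetch (prefetchw where the target supports it), while
   __builtin_prefetch (p, 0, 3) requests a read prefetch (prefetcht0).
   The 0x1c0 lookahead distance matches the assembly above.  */
static void
set_bytes (unsigned char *dst, unsigned char c, size_t len)
{
  while (len >= 128)
    {
      __builtin_prefetch (dst + 0x1c0, 1, 3);  /* write hint */
      for (int i = 0; i < 128; i++)            /* stands in for vmovaps */
        dst[i] = c;
      dst += 128;
      len -= 128;
    }
}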

I updated the benchmark (attached) to run your code with and without prefetching. 1)

Ljuba, could you test it on Haswell?

1) With a minor modification of vinserti128 -> vinsertf128; this uses
only AVX, so AVX2 is not needed.

> In your code, memset only handles 256 bytes; in this case we don't
> need to use prefetch because the hardware prefetcher is enough for us
> at small sizes, but it can tell us whether prefetch will hurt
> performance, so we
Has Haswell improved the hardware prefetcher so that it can fetch from
the next page? I changed the layout of the benchmark so that the data
ends at a page boundary.
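
A minimal sketch of that layout, assuming posix_memalign, a 4 KiB page,
and a hypothetical region size len (the helper name is mine, not from
the attached benchmark):

#include <stdlib.h>

#define PAGE 4096

/* Allocate two pages and return a pointer len bytes before the end of
   the first page, so the benchmarked region ends exactly at a page
   boundary.  A hardware prefetcher that stops at page boundaries then
   cannot hide misses in the following page, which is what makes the
   software prefetch (or its absence) measurable.  */
static unsigned char *
buf_at_page_end (size_t len)
{
  void *p;
  if (posix_memalign (&p, PAGE, 2 * PAGE) != 0)
    return NULL;
  return (unsigned char *) p + PAGE - len;
}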

> run it; the result is below, and it indicates prefetchw on Haswell
> is harmless, even though it is redundant code in memset on Haswell.
> 
Your test was invalid, as you compared apples with oranges
(prefetcht0 vs. prefetchw). To see how your code fares, you should
replace it with your implementation with and without prefetch.
It needs to be exactly what you submitted, and if that means
prefetchw, then post a new version.
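
One way to do that without hand-editing the routine twice: guard the
prefetch behind a macro and build the benchmark once with and once
without it (USE_PREFETCH is a hypothetical flag, not something in the
attached benchmark):

#ifdef USE_PREFETCH
/* Must be exactly the instruction the submitted patch uses --
   prefetchw here if that is what gets posted.  */
# define PREFETCH(p) \
  __asm__ volatile ("prefetchw %0" : : "m" (*(const char *) (p)))
#else
# define PREFETCH(p) ((void) 0)
#endif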

> Then we modified memset2 to handle 4096 bytes
> in test.c as below:
> ...
> char ary[SIZE+4096];
> ...
> memset2(ary+(512*((unsigned)rand_r(&seed)))%SIZE,0,4096);
> and ran your code on Haswell as below; the result shows prefetchw
> gets better performance
> and is harmless

With that 'improvement' you defeated the purpose of the benchmark. It
demonstrated increased cache usage by touching only the first half of
the data and having the second half fetched by prefetch.

As you only changed the size, the writes now overlap and there will be
no extra memory usage.

Also, changing it to 4096 decreases the percentage of wasted memory.
Before it was 50% (256 used / 512 fetched); now it's around 11%
(4096 used / 4608 fetched).
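
For clarity, those figures are just the over-fetched bytes divided by
the total fetched bytes; a trivial helper (hypothetical, not part of
the benchmark):

static double
wasted_fraction (unsigned used, unsigned fetched)
{
  return (double) (fetched - used) / fetched;
}
/* wasted_fraction (256, 512)   == 0.500  -> 50% wasted
   wasted_fraction (4096, 4608) ~= 0.111  -> about 11% wasted  */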

Attachment: memset_cache.tar.bz2
Description: Binary data

