Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction


> Please provide a link to the benchmark and how it is run. Without it you
> have only your word, and I could say, for example:
Ling: The attached tarball includes the patch and a readme; we use it to
measure memset for memset_avx2 and memset_sse2. We used the same approach
to modify memcpy.S and then measured memcpy-avx-unaligned.

> Do you think that using memset to write one byte is a likely case?
> If not, then you slowed memset down by always doing an unnecessary check
> that could be done only after we know that n < 4.
Ling: rearranged the checks to follow the same sequence as memcpy, so the
small-size test is off the common path (see the sketch below).
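
A rough sketch of that ordering (illustrative label names, not the actual
patch): the common sizes are dispatched first, and the byte-granularity
path is reached only once n is known to be tiny, as memcpy does:

	cmp	$0x20, %rdx
	jae	L(gobble_32_or_more)	/* common case: no tiny-n test here */
	cmp	$0x4, %rdx
	jb	L(write_0_to_3_bytes)	/* byte stores only once n < 4 is known */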

> By duplicating the bzero and memset code you doubled the icache pressure
> and made branch prediction weaker. In bzero you should only set up the
> arguments and jump to memset.
Ling: the latest version includes only memset.
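
For reference, a minimal sketch of that suggestion: bzero only shuffles its
arguments into memset's calling convention and tail-jumps, so both entry
points share one body (the symbol names here are illustrative):

	.globl	__bzero_avx2
	.type	__bzero_avx2, @function
__bzero_avx2:
	mov	%rsi, %rdx		/* length becomes memset's third argument */
	xor	%esi, %esi		/* fill byte is zero */
	jmp	__memset_avx2		/* tail call: no duplicated code or icache */
	.size	__bzero_avx2, .-__bzero_avx2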


>> +	and	$0xff, %esi
>> +	imul %esi, %ecx
> that is slower than the code generated from the _mm_set1_epi8
> intrinsic. Also, you have SSE4 available, so you can just
> use pshufb
Ling: fixed in the new version.
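
For comparison, the kind of byte broadcast the reviewer means (a sketch,
not the committed code). With AVX2, a single vpbroadcastb replaces the
and/imul sequence:

	vmovd	%esi, %xmm0		/* fill byte in the low lane */
	vpbroadcastb %xmm0, %ymm0	/* replicate it into all 32 bytes */

and with pshufb, an all-zero control mask broadcasts byte 0 across xmm0:

	movd	%esi, %xmm0
	pxor	%xmm1, %xmm1		/* zero control mask */
	pshufb	%xmm1, %xmm0		/* copy byte 0 to every byte */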

>> +	sub	$0x80, %rdx
>> +L(gobble_128_loop):
>> +	prefetcht0 0x1c0(%rdi)
>> +	vmovaps	%ymm0, (%rdi)
>> +	prefetcht0 0x280(%rdi)
> you should not be so aggressive with prefetches when you know how much
> data you will use. This fetches unnecessary data, which can double cache
> usage and generally slow us down.
Ling: Haswell can issue two loads and one store per cycle, so the spare
load ports can be used to prefetch data that is not yet in cache; when the
data is already in L1, the extra prefetches do not hurt performance. Our
experiments also confirmed this.
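
For context, the shape of the loop under discussion, reconstructed as a
sketch from the quoted fragment (setup and tail handling are omitted, and
everything beyond the quoted lines is an assumption):

	sub	$0x80, %rdx
L(gobble_128_loop):
	prefetcht0 0x1c0(%rdi)		/* ~448 bytes ahead of the stores */
	vmovaps	%ymm0, (%rdi)
	prefetcht0 0x280(%rdi)		/* second stream, ~640 bytes ahead */
	vmovaps	%ymm0, 0x20(%rdi)
	vmovaps	%ymm0, 0x40(%rdi)
	vmovaps	%ymm0, 0x60(%rdi)
	lea	0x80(%rdi), %rdi	/* advance 128 bytes per iteration */
	sub	$0x80, %rdx
	jae	L(gobble_128_loop)

The reviewer's concern is the tail: in the final iterations these
prefetches reach up to 0x280 bytes past the last byte memset will write,
pulling in cache lines that are never used.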

Thanks
Ling

Attachment: memset-gcc-403-test.tar.bz2
Description: Binary data

