This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Faster strlen


+  pmovmskb %xmm3, %edx
+  sub %rdi, %rax
+        movq    %rdx, %rcx
+        negq    %rcx
+        andq    %rdx, %rcx

Please use the <tab>instruction<tab> format consistently instead of a
different style on each line.

I also suggest using the L macro for new labels, to improve readability and
to match the style of the other assembler files in glibc.

+  add $16, %rax
+  .p2align 4
+  .align64_loop:

L(align64_loop):

--
Liubov Dmitrieva

2012/10/9 H.J. Lu <hjl.tools@gmail.com>:
> On Sun, Oct 7, 2012 at 10:27 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> Hello, I investigated strlen a bit more and improved the pminub variant.
>>
>> I got up to a 10% speedup by unrolling the main loop. I did not measure any
>> difference when I unrolled the loop further.
>>
>> I also benchmarked Atom and added a variant that is identical to
>> strlen-sse2-pminub except that bsf is replaced by a table lookup.
>>
>> The last addition is an attempt to generate a VEX-encoded strlen. I only need to
>> pass the -mavx flag when compiling strlen_avx.S, but I do not know how.
>>
>> Benchmarks are at the usual place. To fit all functions, consider only random
>> alignment. I also increased the granularity of sampling.
>>
>> http://kam.mff.cuni.cz/~ondra/benchmark_string/
>>
>> Results for this patch are
>> http://kam.mff.cuni.cz/~ondra/benchmark_string/benchmark_strlen_7_10_2012.tar.bz2
>>
>> On Sandy Bridge
>> http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_sandy_bridge/strlen/html/test_r.html
>> there is a phase change around sizes 1500-2000. Do you know what caused it?
>>
>> Another optimization is prefetching. Most of the time the prefetching variant is
>> slower than the non-prefetching one (as large strings are rare).
>> On Sandy Bridge prefetching is free. I need an additional flag for ifunc to
>> indicate that.
>>
>> I disabled prefetching in my patch.
>>
>> On Atom, ironically, strlen-sse2-no-bsf was slower than the pminub variant
>> except for strings less than 16 bytes long.
>>
>> For the exit from the main loop of the no-bsf variant, using bsfq instead of
>> binary search saves 10 cycles. Multiplication plus table lookup is also slow
>> on Atom because 64-bit multiplication is slow.
>>
>> I used the pminub variant with the bsf instruction replaced by my table lookup.
>> This is about 8 cycles faster on Atom.
>>
>> To keep the review easier, I did not reschedule instructions for Atom.
>>
>> The sse2, pminub, no-bsf, and sse4 variants are slower than my patch
>> everywhere, so I am removing them. pminub and no-bsf are used in strcat and
>> will be removed in a separate patch.
>>
>> 2012-10-07  Ondrej Bilka  <neleai@seznam.cz>
>>         * sysdeps/x86_64/strlen.S:
>>           Use unrolled pminub variant by default.
>>         * sysdeps/x86_64/multiarch/strlen_avx.S:
>>           Recode default variant using VEX prefix.
>>         * sysdeps/x86_64/multiarch/strlen_atom.S:
>>           New variant tailored to atom.
>>         * sysdeps/x86_64/strlen.S: Updated function selection.
>>         * sysdeps/x86_64/multiarch/strlen-sse4.S: deleted
>>         * sysdeps/x86_64/multiarch/Makefile: updated
>>
>
> Please rename strlen_atom.S to strlen-no-bsf.S since it
> depends on bit_Slow_BSF, not Atom.
>
> Thanks.
>
> --
> H.J.
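
For readers unfamiliar with the "pminub variant" discussed in the quoted
message, here is a minimal scalar model in C of a loop that handles 64 bytes
per iteration. It is only a sketch of the idea, not the code from the patch:
the function name strlen_min64_model is hypothetical, the byte-wise minimum is
computed with plain C instead of the pminub instruction, and the buffer is
assumed to be padded to a multiple of 64 bytes so that whole chunks can be
read safely.

```c
#include <stddef.h>

/* Hypothetical scalar model of a 64-byte-per-iteration pminub loop.
   Assumption: buf holds a NUL-terminated string and its storage is padded
   to a multiple of 64 bytes, so reading a whole chunk never overruns it.  */
static size_t strlen_min64_model(const unsigned char *buf)
{
    size_t off = 0;

    for (;;) {
        const unsigned char *p = buf + off;
        int found = 0;

        /* Byte-wise minimum of four 16-byte blocks; with SSE2 this would be
           three pminub instructions followed by one compare against zero.  */
        for (int i = 0; i < 16; i++) {
            unsigned char a = p[i] < p[16 + i] ? p[i] : p[16 + i];
            unsigned char b = p[32 + i] < p[48 + i] ? p[32 + i] : p[48 + i];
            unsigned char m = a < b ? a : b;

            /* A zero in the minimum means one of the blocks holds the
               terminator.  */
            found |= (m == 0);
        }

        if (found) {
            /* Exit path: locate the first zero byte inside this chunk.  */
            for (int i = 0; i < 64; i++)
                if (p[i] == 0)
                    return off + (size_t)i;
        }

        off += 64;
    }
}
```

The point of the minimum is that the hot loop needs only one zero test per 64
bytes; the more expensive search for the exact position runs once, on exit.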

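The "multiplication + table lookup" replacement for bsf mentioned in the quoted
message can be sketched in C as follows. This is an illustration under stated
assumptions, not the table from the patch: the de Bruijn constant, the table
layout, and the names first_set_bit and init_bit_index_table are choices made
for this example. The first step, mask & -mask, corresponds to the
movq/negq/andq sequence quoted at the top of this message.

```c
#include <stdint.h>

/* A 64-bit de Bruijn constant. Which constant the patch uses is not shown
   in this thread; this particular value is an assumption of the sketch.  */
#define DEBRUIJN64 UINT64_C(0x03f79d71b4cb0a89)

static unsigned char bit_index_table[64];
static int table_ready;

/* Build the lookup table: shifting the constant left by i puts a unique
   6-bit pattern in the top bits, which we map back to i.  */
static void init_bit_index_table(void)
{
    for (unsigned i = 0; i < 64; i++)
        bit_index_table[(DEBRUIJN64 << i) >> 58] = (unsigned char)i;
    table_ready = 1;
}

/* Index of the lowest set bit of a nonzero mask, without using bsf.  */
static unsigned first_set_bit(uint64_t mask)
{
    if (!table_ready)
        init_bit_index_table();

    /* Isolate the lowest set bit (the movq/negq/andq step).  */
    uint64_t lowest = mask & (0 - mask);

    /* Multiplying by a power of two is a left shift of the constant, so the
       top six bits of the product select the matching table entry.  */
    return bit_index_table[(lowest * DEBRUIJN64) >> 58];
}
```

The 64-bit multiply in the last step is the operation the quoted message
reports as slow on Atom, which is why this lookup does not always beat bsfq.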
