This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Faster strlen
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 9 Oct 2012 08:02:33 -0700
- Subject: Re: [PATCH] Faster strlen
- References: <20121007172752.GA22344@domone.kolej.mff.cuni.cz>
On Sun, Oct 7, 2012 at 10:27 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> Hello, I investigated strlen a bit more and improved the pminub variant.
>
> I got up to a 10% speedup by unrolling the main loop. I did not measure
> any difference when I unrolled the loop further.
>
> I also benchmarked Atom and added a variant which is identical to
> strlen-sse2-pminub except that bsf is replaced by a table lookup.
>
> The last addition is an attempt to generate a VEX-encoded strlen. I only
> need to pass the -mavx flag when compiling strlen_avx.S, but I do not know how.
>
> Benchmarks are at the usual place. To fit all functions, consider only random
> alignment. I also increased the granularity of sampling.
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/
>
> Results for this patch are
> http://kam.mff.cuni.cz/~ondra/benchmark_string/benchmark_strlen_7_10_2012.tar.bz2
>
> On Sandy Bridge
> http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_sandy_bridge/strlen/html/test_r.html
> there is a phase change around sizes 1500-2000. Do you know what caused it?
>
> Another optimization is prefetching. Most of the time the prefetching variant
> is slower than the non-prefetching one (as large strings are rare).
> On Sandy Bridge prefetching is free. I need an additional ifunc flag to
> indicate that.
>
> I disabled prefetching in my patch.
>
> On Atom, ironically, strlen-sse2-no-bsf was slower than the pminub variant
> except for strings less than 16 bytes long.
>
> For the exit from the main loop of the no-bsf variant, using bsfq instead of
> binary search saves 10 cycles. Multiplication plus table lookup is also slow
> on Atom because 64-bit multiplication is slow.
>
> I used the pminub variant with the bsf instruction replaced by my table
> lookup. This is about 8 cycles faster on Atom.
>
> I did not reschedule instructions for Atom, to keep the patch easier to review.
>
> The sse2, pminub, no-bsf and sse4 variants are slower than my patch everywhere,
> so I remove them. pminub and no-bsf are used in strcat and will be removed in
> a separate patch.
>
> 2012-10-07 Ondrej Bilka <neleai@seznam.cz>
> * sysdeps/x86_64/strlen.S:
> Use unrolled pminub variant by default.
> * sysdeps/x86_64/multiarch/strlen_avx.S:
> Recode default variant using VEX prefix.
> * sysdeps/x86_64/multiarch/strlen_atom.S:
> New variant tailored to atom.
> * sysdeps/x86_64/strlen.S: Updated function selection.
> * sysdeps/x86_64/multiarch/strlen-sse4.S: deleted
> * sysdeps/x86_64/multiarch/Makefile: updated
>
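For readers following the thread, the unrolled pminub loop discussed in the
quoted message can be sketched in C with SSE2 intrinsics. This is only an
illustration and not code from the patch: the function name
strlen_pminub_sketch and the helper zero_mask16 are hypothetical, the 64-byte
unroll factor is an assumption, and __builtin_ctzll (a GCC/Clang builtin)
stands in for the bsfq instruction. On targets without SSE2 the sketch simply
falls back to the C library strlen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#if defined(__SSE2__)
/* Hypothetical helper: 16-bit mask with bit i set when byte i of v is zero. */
static uint64_t zero_mask16(__m128i v)
{
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return (uint64_t)(unsigned)_mm_movemask_epi8(eq);
}
#endif

/* Hypothetical name; a sketch of the technique, not the glibc implementation. */
size_t strlen_pminub_sketch(const char *s)
{
    assert(s != NULL);
#if defined(__SSE2__)
    const char *p = s;

    /* Go byte by byte until p is 64-byte aligned. */
    while (((uintptr_t)p & 63) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    for (;;) {
        /* Aligned loads never cross a page boundary. Reading past the
           terminator inside the block is what the assembly does, but it
           is outside strict ISO C. */
        __m128i a = _mm_load_si128((const __m128i *)(p + 0));
        __m128i b = _mm_load_si128((const __m128i *)(p + 16));
        __m128i c = _mm_load_si128((const __m128i *)(p + 32));
        __m128i d = _mm_load_si128((const __m128i *)(p + 48));

        /* pminub: the byte-wise unsigned minimum contains a zero byte
           exactly when at least one of the four inputs does. */
        __m128i m = _mm_min_epu8(_mm_min_epu8(a, b), _mm_min_epu8(c, d));

        if (zero_mask16(m) != 0) {
            /* Loop exit: build a 64-bit mask and find its lowest set bit. */
            uint64_t mask = zero_mask16(a)
                          | (zero_mask16(b) << 16)
                          | (zero_mask16(c) << 32)
                          | (zero_mask16(d) << 48);
            return (size_t)(p - s) + (size_t)__builtin_ctzll(mask);
        }
        p += 64;
    }
#else
    return strlen(s);
#endif
}
```

The main loop performs one zero test per 64 bytes instead of one per 16 bytes,
which is the kind of saving unrolling is meant to buy; the per-vector masks are
only computed once, on the way out of the loop.
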
Please rename strlen_atom.S to strlen-no-bsf.S since it
depends on bit_Slow_BSF, not Atom.
Thanks.
--
H.J.
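
The thread does not show the table lookup that replaces bsf in the proposed
variant. One well-known way to find the lowest set bit without bsf is a
de Bruijn multiply followed by a table lookup; the sketch below assumes that
approach, and the actual table in the patch may differ. The name
lowest_set_bit_lookup is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Position table for the 32-bit de Bruijn constant 0x077CB531. */
static const unsigned char debruijn_pos[32] = {
     0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

/* Hypothetical helper: index of the lowest set bit of a nonzero mask,
   computed without a bit-scan instruction. */
unsigned lowest_set_bit_lookup(uint32_t mask)
{
    assert(mask != 0);
    /* Isolate the lowest set bit, e.g. 0b10100 -> 0b00100. */
    uint32_t isolated = mask & (0u - mask);
    /* Multiplying by the de Bruijn constant shifts a unique 5-bit
       pattern into the top bits, which indexes the table. */
    return debruijn_pos[(uint32_t)(isolated * UINT32_C(0x077CB531)) >> 27];
}
```

A single pmovmskb result is only 16 bits wide, so a 32-bit multiply is enough
here; that would avoid the 64-bit multiplication the quoted message reports as
slow on Atom. Since the lookup only makes sense where bsf is slow, selecting it
on bit_Slow_BSF rather than on the CPU name matches the rename requested above.
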