This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH] Faster strlen

From: "H.J. Lu" <hjl dot tools at gmail dot com>
To: Ondřej Bílka <neleai at seznam dot cz>
Cc: libc-alpha at sourceware dot org
Date: Tue, 9 Oct 2012 08:02:33 -0700
Subject: Re: [PATCH] Faster strlen
References: <20121007172752.GA22344@domone.kolej.mff.cuni.cz>

On Sun, Oct 7, 2012 at 10:27 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> Hello, I investigated strlen a bit more and improved the pminub variant.
>
> I got up to a 10% speedup by unrolling the main loop. I did not measure
> any difference when I unrolled the loop further.
>
> I also benchmarked on Atom and added a variant which is identical to
> strlen-sse2-pminub except that bsf is replaced by a table lookup.
>
> The last addition is an attempt to generate a VEX-encoded strlen. I only
> need to pass the -mavx flag when compiling strlen_avx.S, but I do not
> know how.
>
> Benchmarks are at the usual place. To fit all functions, consider only
> random alignment. I also increased the granularity of sampling.
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/
>
> Results for this patch are
> http://kam.mff.cuni.cz/~ondra/benchmark_string/benchmark_strlen_7_10_2012.tar.bz2
>
> On Sandy Bridge
> http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_sandy_bridge/strlen/html/test_r.html
> there is a phase change around sizes 1500-2000. Do you know what causes it?
>
> Another optimization is prefetching. Most of the time the prefetching
> variant is slower than the non-prefetching one (as large strings are rare).
> On Sandy Bridge prefetching is free. I need an additional ifunc flag to
> indicate that.
>
> I disabled prefetching in my patch.
>
> On Atom, ironically, strlen-sse2-no-bsf was slower than the pminub variant
> except for strings less than 16 bytes long.
>
> For the exit from the main loop of the no-bsf variant, using bsfq instead
> of binary search saves 10 cycles. Multiplication + table lookup is also
> slow on Atom because 64-bit multiplication is slow.
>
> I used the pminub variant with the bsf instruction replaced by my table
> lookup. This is about 8 cycles faster on Atom.
>
> I did not reschedule instructions for Atom, for easier review.
>
> The sse2, pminub, no-bsf, and sse4 variants are slower than my patch
> everywhere, so I am removing them. pminub and no-bsf are used in strcat
> and will be removed in a separate patch.
>
> 2012-10-07  Ondrej Bilka  <neleai@seznam.cz>
>         * sysdeps/x86_64/strlen.S:
>           Use unrolled pminub variant by default.
>         * sysdeps/x86_64/multiarch/strlen_avx.S:
>           Recode default variant using VEX prefix.
>         * sysdeps/x86_64/multiarch/strlen_atom.S:
>           New variant tailored to atom.
>         * sysdeps/x86_64/strlen.S: Updated function selection.
>         * sysdeps/x86_64/multiarch/strlen-sse4.S: deleted
>         * sysdeps/x86_64/multiarch/Makefile: updated
>
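
For readers following the thread, the unrolled pminub idea described in the quoted message can be sketched in C with SSE2 intrinsics. This is only an illustration, not the assembly from the patch: the function name strlen_pminub_sketch, the 4x16-byte unroll factor, and the requirement that the input be 64-byte aligned and zero-padded to a multiple of 64 bytes are all assumptions made to keep the sketch short and safe. It also assumes GCC or Clang for __builtin_ctz, which typically compiles to the bsf/tzcnt step discussed above.

```c
#include <emmintrin.h>  /* SSE2: _mm_min_epu8 (pminub), _mm_cmpeq_epi8, _mm_movemask_epi8 */
#include <stddef.h>

/* Hypothetical sketch. ASSUMPTION: s is 64-byte aligned and the buffer is
   zero-padded to a multiple of 64 bytes, so aligned 64-byte reads never
   run past the end of the allocation. */
size_t strlen_pminub_sketch(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;

    for (;;) {
        /* Unrolled step: load four 16-byte blocks. */
        __m128i a = _mm_load_si128((const __m128i *)(p));
        __m128i b = _mm_load_si128((const __m128i *)(p + 16));
        __m128i c = _mm_load_si128((const __m128i *)(p + 32));
        __m128i d = _mm_load_si128((const __m128i *)(p + 48));

        /* The unsigned byte minimum of all four blocks contains a zero
           byte if and only if at least one block does, so a single
           compare covers 64 bytes. */
        __m128i m = _mm_min_epu8(_mm_min_epu8(a, b), _mm_min_epu8(c, d));

        if (_mm_movemask_epi8(_mm_cmpeq_epi8(m, zero)) != 0) {
            /* Loop exit: find the first block with a zero byte, then the
               position of the lowest set bit in its compare mask. */
            const __m128i blocks[4] = { a, b, c, d };
            for (int i = 0; i < 4; i++) {
                unsigned mask = (unsigned)_mm_movemask_epi8(
                    _mm_cmpeq_epi8(blocks[i], zero));
                if (mask != 0)
                    return (size_t)(p - s) + 16u * (unsigned)i
                           + (unsigned)__builtin_ctz(mask);
            }
        }
        p += 64;
    }
}
```

A real implementation would also have to handle an unaligned start of the string, which this sketch deliberately leaves out.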

Please rename strlen_atom.S to strlen-no-bsf.S since it
depends on bit_Slow_BSF, not Atom.

Thanks.

-- 
H.J.
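
As background on the bsf replacement discussed in this thread: one generic way to find the lowest set bit of a compare mask without a bit-scan instruction is a multiplication followed by a table lookup (a de Bruijn sequence lookup). The sketch below is an assumption-laden illustration, not the lookup used in the patch; the names ctz_table_init and ctz_lookup are hypothetical, and it uses the well-known 32-bit constant 0x077CB531 rather than the 64-bit multiplication mentioned in the quoted message.

```c
#include <stdint.h>

/* 32-bit de Bruijn constant: every 5-bit window of it is distinct. */
#define DEBRUIJN32 0x077CB531u

static unsigned char ctz_table[32];

/* Hypothetical helper: fill the lookup table once before first use.
   For each bit position i, (DEBRUIJN32 << i) >> 27 is a unique index. */
void ctz_table_init(void)
{
    for (unsigned i = 0; i < 32; i++)
        ctz_table[(uint32_t)(DEBRUIJN32 << i) >> 27] = (unsigned char)i;
}

/* Hypothetical helper: index of the lowest set bit of a nonzero mask,
   computed without bsf. The result is meaningless for mask == 0. */
unsigned ctz_lookup(uint32_t mask)
{
    uint32_t lowest = mask & (0u - mask);        /* isolate lowest set bit */
    return ctz_table[(uint32_t)(lowest * DEBRUIJN32) >> 27];
}
```

After calling ctz_table_init() once, ctz_lookup(mask) can stand in for the bit scan at the loop exit of a strlen variant, where mask is the nonzero result of a pmovmskb-style byte compare.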

Follow-Ups:
- Re: [PATCH] Faster strlen
  - From: Dmitrieva Liubov
- Re: [PATCH] Faster strlen
  - From: Ondřej Bílka

References:
- [PATCH] Faster strlen
  - From: Ondřej Bílka
