This is the mail archive of the
libc-ports@sources.redhat.com
mailing list for the libc-ports project.
Re: [Patch, mips] Faster strcmp for mips
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Steve Ellcey <sellcey at mips dot com>
- Cc: libc-ports at sourceware dot org
- Date: Fri, 15 Nov 2013 20:02:00 +0100
- Subject: Re: [Patch, mips] Faster strcmp for mips
- Authentication-results: sourceware.org; auth=none
- References: <1384464221 dot 2484 dot 86 dot camel at ubuntu-sellcey> <20131114231434 dot GA5331 at domone dot podge> <1384539604 dot 2484 dot 102 dot camel at ubuntu-sellcey>
On Fri, Nov 15, 2013 at 10:20:04AM -0800, Steve Ellcey wrote:
> On Fri, 2013-11-15 at 00:14 +0100, Ondřej Bílka wrote:
>
> > The problem is that aligned accesses may account for only a third of
> > total calls. Also, around 90% of calls need to check only the first 16
> > bytes. These numbers differ by application; on x64 they are the rule
> > rather than the exception. I collected data from running gcc+gnuplot
> > here:
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcmp_profile/results_gcc/result.html
>
> Very interesting, you have obviously spent a lot more time than I have
> on strcmp.
>
> > > This means it could be loading bytes beyond the end of the strings being
> > > compared but it looks like other architecture specific strcmp functions
> > > are also doing this optimization and the newlib version of strcmp also does
> > > this.
> > >
> > Also, the unaligned case can be made faster with a bit of extra effort.
> > How fast are unaligned loads on MIPS? They should be faster than
> > manually assembling the value by shifts from two loads, which in turn
> > should be faster than byte-by-byte checking.
>
> MIPS doesn't have unaligned loads, but it has 'partial' loads where two
> load operations can give you four (or eight) bytes regardless of
> alignment. I may need to look at this some more, though that will
> probably mean going to assembly language instead of C.
>
You could also first try this in C by using shifts.
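For illustration, a minimal C sketch of that shift-based approach (my own hedged example, not from the patch; it assumes a little-endian target and that, as discussed above, reading a few bytes past the end of the string within an aligned word is acceptable):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 8 bytes from a possibly unaligned pointer
   using two aligned loads plus shifts.  Assumes a little-endian byte
   order; a big-endian MIPS target would swap the shift directions.
   Like the patch, this may read up to 7 bytes past the given range,
   but never past the containing aligned words.  */
static uint64_t
load64_shifted (const char *p)
{
  uintptr_t misal = (uintptr_t) p % 8;
  const uint64_t *w = (const uint64_t *) (p - misal);
  if (misal == 0)
    return w[0];
  unsigned shift = 8 * misal;
  /* Low bytes come from the first aligned word, the rest from the next.  */
  return (w[0] >> shift) | (w[1] << (64 - shift));
}
```

This would replace a byte-by-byte loop in the unaligned path; whether it beats the LWL/LWR-style partial loads mentioned above would have to be measured.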
>
> > Also here
> > > + bx.v = x; by.v = y;
> > > + if (bx.b.B0 - by.b.B0 != 0 || bx.b.B0 == 0)
> > > + return bx.b.B0 - by.b.B0;
> > > + if (bx.b.B1 - by.b.B1 != 0 || bx.b.B1 == 0)
> > > + return bx.b.B1 - by.b.B1;
> > > + if (bx.b.B2 - by.b.B2 != 0 || bx.b.B2 == 0)
> > > + return bx.b.B2 - by.b.B2;
> > > + if (bx.b.B3 - by.b.B3 != 0 || bx.b.B3 == 0)
> > > + return bx.b.B3 - by.b.B3;
> > > + if (bx.b.B4 - by.b.B4 != 0 || bx.b.B4 == 0)
> > > + return bx.b.B4 - by.b.B4;
> > > + if (bx.b.B5 - by.b.B5 != 0 || bx.b.B5 == 0)
> > > + return bx.b.B5 - by.b.B5;
> > > + if (bx.b.B6 - by.b.B6 != 0 || bx.b.B6 == 0)
> > > + return bx.b.B6 - by.b.B6;
> > > + return bx.b.B7 - by.b.B7;
> >
> > There is an alternative way to find the answer without branching.
> > You need to compute a number whose i-th byte is nonzero if the i-th
> > bytes of x and y differ or are 0. Then find the first nonzero bit,
> > which can be done quickly with the CLZ instruction.
> >
> > This could be implemented as follows:
> >
> > uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
> > uint64_t bitfirst = bitmask ^ (bitmask - 1);
> > int pos = (ffsl(bitfirst) - 1) / 8;
> > return a[pos] - b[pos];
>
> This looks very interesting, I tried just plugging it in but it didn't
> work in some cases. I need to stare at this some more to understand
> exactly how it should work and what cases are failing for me.
>
I wrote that late at night, and my first idea was to use __builtin_clz,
which would yield the following code.
uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
uint64_t bitfirst = bitmask ^ (bitmask - 1);
int pos = (63 - __builtin_clzl(bitfirst)) / 8;
return a[pos] - b[pos];
I decided that using ffsl was shorter, but for some reason I kept
bitfirst there. A correct version is:
uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
int pos = (ffsl(bitmask) - 1) / 8;
return a[pos] - b[pos];
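For completeness, here is a self-contained sketch of that corrected version (my own hedged example: DETECTNULL8 is written out with the classic zero-byte trick, the a/b versus x/y naming is reconciled, and a little-endian target is assumed so that ffsl finds the lowest-addressed byte first):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <strings.h>   /* ffsl */

#define ONES8  0x0101010101010101ULL
#define HIGHS8 0x8080808080808080ULL

/* Classic zero-byte detector: sets the high bit of every byte of X
   that is zero.  False positives can only appear above the first zero
   byte, which is harmless here since only the first set bit is used.  */
#define DETECTNULL8(x) (((x) - ONES8) & ~(x) & HIGHS8)

/* Given two 8-byte blocks known to contain a difference or a NUL,
   return the strcmp-style result.  Assumes a little-endian target so
   that ffsl maps to the lowest-addressed byte.  */
static int
final_cmp (const unsigned char *a, const unsigned char *b)
{
  uint64_t x, y;
  memcpy (&x, a, 8);
  memcpy (&y, b, 8);
  uint64_t bitmask = DETECTNULL8 (x) | (x ^ y);
  int pos = (ffsl (bitmask) - 1) / 8;
  return a[pos] - b[pos];
}
```

On a big-endian MIPS, ffsl would have to be replaced by a count-leading-zeros based index, along the lines of the __builtin_clzl variant above.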