This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] Faster strlen

From: Ondřej Bílka <neleai at seznam dot cz>
To: Andi Kleen <andi at firstfloor dot org>
Cc: libc-alpha at sourceware dot org
Date: Fri, 12 Oct 2012 00:24:21 +0200
Subject: Re: [PATCH] Faster strlen
References: <20121007172752.GA22344@domone.kolej.mff.cuni.cz><m2mwzvu52k.fsf@firstfloor.org><20121009150620.GA11196@domone.kolej.mff.cuni.cz><20121009153216.GU16230@one.firstfloor.org>

I added a test for icache. I also tried to fill the BTB with a chain of
branches like this:

.align 64
a1: sub $1, %esi
a1: testl $512, %esi
ja a2
jmp b2
...
b512: ret

The Atom benchmark is here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/atom/strlen/html/test.html

and the full results are archived here (commit 247bba4):
http://kam.mff.cuni.cz/~ondra/benchmark_string/benchmark_strlen_11_10_2012.tar.bz2

I also added an unaligned load version that uses the same idea as my
strchr. On fx10 and Sandy Bridge it is faster; on Nehalem it is sometimes
faster, sometimes slower:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_nehalem/strlen/html/test.html
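
The message does not spell out what the unaligned load version does. As a
rough sketch only, here is one common way to start strlen with an
unaligned 16-byte load, assuming SSE2, 4096-byte pages and the GCC/Clang
builtin __builtin_ctz. The name strlen_unaligned_sketch is hypothetical
and this is not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical sketch, not the patch: strlen whose first check is one
   unaligned 16-byte load, taken only when it cannot cross a page.  */
static size_t strlen_unaligned_sketch(const char *s)
{
    assert(s != NULL);
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;

    if (((uintptr_t)s & 4095) <= 4096 - 16) {
        /* The 16 bytes starting at s lie in one page (assumed 4096 bytes). */
        __m128i chunk = _mm_loadu_si128((const __m128i *)s);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            return (size_t)__builtin_ctz(mask);
        /* Next 16-byte boundary; every byte before it was already checked. */
        p = (const char *)(((uintptr_t)s + 16) & ~(uintptr_t)15);
    } else {
        /* Too close to the page end: go bytewise up to the boundary. */
        while ((uintptr_t)p & 15) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }
    }

    /* Aligned 16-byte loads never cross a page. They may read past the
       terminator, which is the usual trade-off in such routines.  */
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
#else
    /* Portable fallback when SSE2 is not available. */
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
#endif
}
```

The page check is what makes the unaligned first load safe; everything
after it uses aligned loads.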

On Tue, Oct 09, 2012 at 05:32:16PM +0200, Andi Kleen wrote:
> > > I have doubts that table lookups are a good idea if it blows away
> > > the working set in L1 for the application.
> > It does not have this problem. It does lookup only for powers of 2 which 
> > fits 11 cache lines.
> 
> 11 cache lines is a lot of L1 cache.
> 
> It also depends on the layout, often L1s have a low associativity,
> so you're more likely to throw out valuable application data
This is exclusively for Atom, so the relevant data, taken from
http://7-cpu.com/cpu/Atom.htm, are as follows:
L1 Data cache = 24 KB, 6-way set associative, 64-byte line size. Write back.

L1 Instruction cache = 32 KB. 64 B/line, 8-WAY. 
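
To relate these figures to the associativity concern, the number of sets
follows from size / (line size * ways). A small sketch of that
arithmetic; the helper names are made up for illustration, and plain
modulo indexing of sets is an assumption, not something the cited page
states.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sets in a set-associative cache. */
static size_t cache_sets(size_t size_bytes, size_t line_bytes, size_t ways)
{
    assert(line_bytes != 0 && ways != 0);
    return size_bytes / (line_bytes * ways);
}

/* Set selected by an address, assuming simple modulo indexing. */
static size_t cache_set_index(size_t address, size_t line_bytes, size_t sets)
{
    assert(line_bytes != 0 && sets != 0);
    return (address / line_bytes) % sets;
}
```

With the numbers above, cache_sets(24 * 1024, 64, 6) and
cache_sets(32 * 1024, 64, 8) both give 64 sets.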

The table could be shared by all string functions. It is also the only
way I know of that is faster on Atom than bsf, which takes 16 cycles.
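
For readers following the thread, one layout that matches the "11 cache
lines" figure quoted above is a byte table indexed by the isolated lowest
set bit of a 16-bit mask. This is an assumption; the actual table from
the patch is not shown in this message, and all names below are
hypothetical. The sketch also assumes the table starts on a 64-byte
boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table: only the 16 power-of-two indices are ever read. */
static uint8_t first_bit_table[32769];

static void init_first_bit_table(void)
{
    for (unsigned bit = 0; bit < 16; bit++)
        first_bit_table[1u << bit] = (uint8_t)bit;
}

/* Index of the lowest set bit of a nonzero 16-bit mask, without bsf. */
static unsigned first_set_bit(uint16_t mask)
{
    assert(mask != 0);
    /* mask & -mask keeps only the lowest set bit, a power of two. */
    uint16_t lowest = (uint16_t)(mask & (uint16_t)-mask);
    return first_bit_table[lowest];
}

/* Distinct 64-byte lines holding the 16 entries that are read,
   counted by table offset (table assumed 64-byte aligned).  */
static unsigned table_lines_touched(void)
{
    unsigned count = 0;
    long last_line = -1;
    for (unsigned bit = 0; bit < 16; bit++) {
        long line = (long)((1u << bit) / 64);
        if (line != last_line) {
            count++;
            last_line = line;
        }
    }
    return count;
}
```

Indices 1 to 32 share the first line and each of the ten larger powers of
two falls in its own line, so table_lines_touched() returns 11.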
> 
> > 
> > However it has problem that atom L2 cache has slow latency. When I
> > add access 8 random reads between calls then performance becomes
> > same as pminub. 
> 
> But pminub has the advantage that it doesn't force out the cache lines
> of the application. It may well win in the real world.
That depends on how often a string function is used in a loop. Adding a
performance counter and running a trapped library would probably provide
the answer, if I could persuade a sufficient number of users to run it.
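
The perturbation mentioned in the quoted text (8 random reads between
calls) could be sketched roughly as below. The 1 MiB buffer size, the
xorshift generator and every name here are assumptions for illustration;
the real benchmark harness is in the archive linked above and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed buffer size; chosen only for illustration. */
#define POLLUTE_SIZE (1u << 20)

static unsigned char pollute_buffer[POLLUTE_SIZE];
static uint32_t rng_state = 2463534242u;

/* xorshift32: a cheap generator, good enough for picking addresses. */
static uint32_t next_random(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/* Read `reads` random bytes; volatile keeps the loads from being dropped. */
static unsigned touch_random_bytes(unsigned reads)
{
    volatile const unsigned char *buf = pollute_buffer;
    unsigned sum = 0;
    for (unsigned i = 0; i < reads; i++)
        sum += buf[next_random() % POLLUTE_SIZE];
    return sum;
}

/* One measured step: disturb the cache, then call the function under test. */
static size_t strlen_after_random_reads(const char *s, unsigned reads,
                                        unsigned *sink)
{
    assert(s != NULL && sink != NULL);
    *sink += touch_random_bytes(reads);
    return strlen(s);
}
```

A timing loop would call strlen_after_random_reads(s, 8, &sink)
repeatedly and compare against the same loop with reads set to 0.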
> 
> Testing icache (running some large dummy code in the tester) would be
> also good.
> 
> -Andi
