This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [RFC] Faster memset.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Carlos O'Donell <carlos at redhat dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Wed, 10 Apr 2013 08:25:10 +0200
- Subject: Re: [RFC] Faster memset.
- References: <20130323145420 dot GA18058 at domone dot kolej dot mff dot cuni dot cz> <20130326172514 dot GA14436 at domone dot kolej dot mff dot cuni dot cz> <5164A07F dot 9090606 at redhat dot com>
On Tue, Apr 09, 2013 at 07:13:03PM -0400, Carlos O'Donell wrote:
> > On 03/26/2013 01:25 PM, Ondřej Bílka wrote:
> > > On Sat, Mar 23, 2013 at 03:54:20PM +0100, Ondřej Bílka wrote:
> >> Hello,
> >> I looked at how memset is implemented, and since it uses computed
> >> jumps, which are expensive, I decided to write a different implementation.
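The computed-jump dispatch being criticized here can be sketched in C. This is a hedged illustration, not the actual glibc assembly: the real implementation is hand-written x86 code, and `memset_small` below is an invented name. A `switch` over the size is typically lowered by the compiler to an indirect jump through a table, which is exactly the construct that is costly when the table entry or jump target is not in cache.

```c
#include <stddef.h>

/* Illustrative sketch only: a size-dispatched memset for small sizes.
   The switch is usually compiled to an indirect jump via a jump table,
   mirroring the computed jumps in the current glibc implementation. */
static void *memset_small(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    switch (n) {                      /* indirect jump through a table */
    case 4: p[3] = (unsigned char)c;  /* fall through */
    case 3: p[2] = (unsigned char)c;  /* fall through */
    case 2: p[1] = (unsigned char)c;  /* fall through */
    case 1: p[0] = (unsigned char)c;  /* fall through */
    case 0: break;
    default:                          /* larger sizes: a plain loop here */
        while (n--)
            *p++ = (unsigned char)c;
    }
    return dst;
}
```

A straight-line sequence of compare-and-branch tests, by contrast, keeps the control flow visible to the branch predictor, which is the direction the proposed rewrite takes.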
> > snip
> >>
> >> For behaviour on unit tests (for real programs I also need to
> >> handle calls from the dynamic linker) see the following:
> >> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile.html
> >>
> > I collected some data so far on how gcc uses memset. See
> > http://kam.mff.cuni.cz/~ondra/memset_dryrun.tar.bz2
> > I still do not know which implementation is faster on Intel processors.
> >
> > Run make, then ./show to see the gcc workload. If you want to compute
> > statistics, the best way is to use a modified replay.c file.
> >
> > It helps that consecutive memsets are called with the same size in 36%
> > of cases, and with the size of one of the previous two calls in 71%.
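Statistics like these can be computed from a recorded trace of memset sizes. The real numbers above came from a modified replay.c run over the gcc workload; the helper below and its sample trace are illustrative assumptions, not code from the patch.

```c
#include <stddef.h>

/* Illustrative: count how often a memset size matches one of the
   previous `lookback` calls in a trace of recorded sizes.  With
   lookback == 1 this gives the "same as previous call" ratio; with
   lookback == 2, the "same as one of the previous two" ratio. */
static size_t count_same_as_prev(const size_t *trace, size_t n,
                                 size_t lookback)
{
    size_t hits = 0;
    for (size_t i = 1; i < n; i++)
        for (size_t k = 1; k <= lookback && k <= i; k++)
            if (trace[i] == trace[i - k]) {
                hits++;
                break;  /* count each call at most once */
            }
    return hits;
}
```

Such repetition matters because a branch predictor can learn a path that is taken with the same size over and over, whereas an indirect computed jump to a cold target cannot benefit in the same way.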
> >
> > On AMD, the ./benchmark script, which runs the current implementation
> > and mine, shows mine is faster, and I am reasonably sure it is faster
> > in practice.
> >
> > For Intel, ./benchmark is faster with the current implementation. The
> > problem is that it does not take into account cache behaviour that
> > happened in the meantime.
> > In the previous test my implementation gains mostly when the current
> > implementation's computed jump target is not in cache, and this
> > benchmark underestimates that factor.
>
> So a win for one and a loss for the other.
>
> How much of a win and how much of a loss?
>
The profiling I did supports the theory that cache cost dominates and that
my implementation is faster. The results are here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile/result.html
Results are slower on the random test; I am not sure why.
Ondra