This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC] Faster memset.


On 03/26/2013 01:25 PM, Ondřej Bílka wrote:
> On Sat, Mar 23, 2013 at 03:54:20PM +0100, Ondřej Bílka wrote:
>> Hello,
>> I looked at how memset is implemented, and since it uses computed
>> jumps, which are expensive, I decided to write a different implementation.
> snip
>>
>> For behaviour on unit tests (for real programs I also need to
>> handle calls from the dynamic linker), see the following:
>> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile.html
>>
> I collected some data so far on how gcc uses memset. See
> http://kam.mff.cuni.cz/~ondra/memset_dryrun.tar.bz2
> I do not yet know which implementation is faster on Intel processors.
> 
> Run make, then ./show to see the gcc workload. If you want to compute
> statistics, the best way is to use a modified replay.c file.
> 
> It helps that consecutive memsets are called with the same size as the
> previous call in 36% of cases, and with the size of one of the previous
> two calls in 71% of cases.
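The repeat-size statistic above could be computed from a trace of memset sizes along these lines (an illustrative sketch only; the function name and trace array are hypothetical, and the real data comes from the replay.c workload files):

```c
#include <assert.h>
#include <stddef.h>

/* For a trace of memset call sizes, count how often a call repeats the
 * previous size, and how often it matches either of the two previous
 * sizes. Hypothetical helper; not from the actual replay.c. */
static void size_repeat_stats(const size_t *sizes, size_t n,
                              size_t *same_as_prev, size_t *same_as_prev2)
{
    *same_as_prev = 0;
    *same_as_prev2 = 0;
    for (size_t i = 1; i < n; i++) {
        if (sizes[i] == sizes[i - 1])
            (*same_as_prev)++;
        /* Match against either of the two most recent sizes. */
        if (sizes[i] == sizes[i - 1] ||
            (i >= 2 && sizes[i] == sizes[i - 2]))
            (*same_as_prev2)++;
    }
}
```

Dividing each counter by the number of calls (minus one) gives the percentages quoted above.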
> 
> On AMD, the ./benchmark script, which runs both the current
> implementation and mine, shows mine is faster, and I am reasonably sure
> it is faster in practice.
> 
> For Intel, ./benchmark is faster with the current implementation. The
> problem is that it does not take into account the cache behaviour that
> happened in the meantime.
> In the previous test, my implementation gains mostly when the current
> implementation's computed-jump target is not in cache, and this
> benchmark underestimates that factor.
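The alternative to a computed jump that the thread discusses can be illustrated with a common branch-light technique: covering a range of sizes with a pair of possibly overlapping stores instead of dispatching through a jump table. This is only a sketch of the general idea under assumed names (`small_memset` is hypothetical), not the code under review:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative small-size memset avoiding a jump table: n in [0, 16]
 * is handled by at most two overlapping stores chosen with plain
 * branches rather than a computed jump. Hypothetical sketch. */
static void *small_memset(void *s, int c, size_t n)
{
    unsigned char *p = s;
    /* Replicate the fill byte into all eight lanes of a 64-bit word. */
    uint64_t v = (unsigned char)c;
    v *= 0x0101010101010101ULL;

    if (n >= 8) {
        /* Two 8-byte stores that overlap for 8 < n < 16. */
        memcpy(p, &v, 8);
        memcpy(p + n - 8, &v, 8);
    } else if (n >= 4) {
        /* Two 4-byte stores that overlap for 4 < n < 8. */
        memcpy(p, &v, 4);
        memcpy(p + n - 4, &v, 4);
    } else if (n >= 1) {
        /* 1-3 bytes: first, middle, and last byte cover every case. */
        p[0] = (unsigned char)c;
        p[n / 2] = (unsigned char)c;
        p[n - 1] = (unsigned char)c;
    }
    return s;
}
```

The point of the technique is that the branch pattern is short and predictable, whereas an indirect jump through a size-indexed table can miss in the branch target buffer and the cache, which is the effect the benchmark is said to underestimate.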

So a win for one and a loss for the other.

How much of a win and how much of a loss?

Cheers,
Carlos.

