This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.



Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.


On 09/03/2013 02:57 PM, Ondřej Bílka wrote:
> On Tue, Sep 03, 2013 at 01:52:34PM -0400, Carlos O'Donell wrote:
>> On 09/03/2013 01:37 PM, Ondřej Bílka wrote:
>>>> We have one, it's the glibc microbenchmark, and we want to expand it,
>>>> otherwise when ACME comes with their patch for ARM and breaks performance
>>>> for targets that Linaro cares about I have no way to reject the patch
>>>> objectively :-)
>>>>
>>> Carlos, you are asking for the impossible. When you publish a benchmark,
>>> people will try to maximize the benchmark number. After a certain point
>>> this becomes possible only by employing shady accounting: move part of
>>> the time to a place where it will not be measured by the benchmark (for
>>> example by having a function that is 4KB large; in benchmarks it will
>>> fit into the instruction cache, but that does not happen in reality).
>>
>> What is it that I'm asking that is impossible?
>>
> Having a static set of benchmarks that can say whether an implementation
> is an improvement.

I agree that a static set of benchmarks will not provide us with an
accurate answer for "Is this patch objectively better for performance?"

I don't see that that makes it impossible. The static set of benchmarks
at a given point in time (since we should be adding new benchmarks) may
have some error with respect to "fastest" for a given ISA/device/workload
(which I will shorten as `workload' from now on). Therefore I would not 
say it's impossible.

It's probably impossible for the error between the estimator and reality
to be close to zero, though. That's just expected.

> We are shooting at a moving target: architectures change, the code we
> write reaches users with considerable delay, and the factors we used for
> the decision will have changed in the meantime.

What's wrong with a moving target?

I have spoken with CPU designers and I've been told that they do whole 
system profiling in order to assist in making instruction-to-microcode
decoding decisions. Therefore what we select as optimal sequences is 
also fed forward into *new* architecture designs.

> Once an implementation reaches a certain quality, the question of what is
> better becomes dependent on the program used. Until we can decide from
> profile feedback we will lose a few percent by having to use a single
> implementation.

I agree. The eventual goal of the project is to have some kind of
whole system benchmarking that allows users to feed in their profiles
and allows us as developers to see what users are doing with our library.

Just as CPU designers feed in a whole distribution of applications,
look at the probability of instruction selection, and tweak
instruction-to-microcode mappings.

I am willing to accept a certain error in the process as long as I know
we are headed in the right direction. If we all disagree about the
direction we are going in then we should talk about it.

I see:

microbenchmarks -> whole system benchmarks -> profile driven optimizations

With the latter driving the set of tunables we expose in the library,
and possibly even the functions selected by the IFUNC resolvers at
program startup.
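
To make the IFUNC point concrete, here is a rough sketch of the dispatch
mechanism I mean; the variant names and the has_neon() check are made up
for illustration, and the real multiarch resolvers in sysdeps are more
involved:

/* Sketch of GCC's ifunc attribute: the resolver runs once at
   relocation time and its result is patched into the GOT entry.
   Everything named my_* or has_neon() below is hypothetical.  */

#include <stddef.h>
#include <string.h>

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* Stand-ins for per-CPU variants.  */
static void *
my_memcpy_generic (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

static void *
my_memcpy_neon (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);  /* would be the NEON version */
}

static int
has_neon (void)
{
  return 0;  /* hypothetical: would read the HWCAP bits */
}

/* The resolver is exactly the place where tunables or profile data
   could influence which variant a process ends up calling.  */
static memcpy_fn
my_memcpy_resolver (void)
{
  return has_neon () ? my_memcpy_neon : my_memcpy_generic;
}

void *my_memcpy (void *, const void *, size_t)
  __attribute__ ((ifunc ("my_memcpy_resolver")));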
  
>>> Taking care of the common factors that can cause that is about ten times
>>> more complex than whole system benchmarking; the analysis will be quite
>>> difficult, as you will get twenty numbers and you will need to decide
>>> which ones could make a real impact and which won't.
>>
>> Sorry, could you clarify this a bit more, exactly what is ten times
>> more complex?
>>
> Having a benchmark suite that will catch all the relevant factors that can
> affect performance. Some are hard to quantify; for those we need to know
> how an average program stresses resources.

I agree.

I would be happy to accept a patch that:
* Shows the benchmark numbers.
* Explains relevant factors not caught by the benchmark that affect
  performance, what they are, and why the patch should go in.

My goal is to increase the quality of the written rationales for
performance-related submissions.

> Take instruction cache usage: a function will occupy cache lines, and we
> can accurately measure the probability and cost of cache misses inside the
> function. What is hard to estimate is how this will affect the rest of the
> program. For that we would need to know the average probability that a
> cache line will be referenced in the future.

Good example.
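
To make your caveat concrete, here is a toy hot-loop harness (not the
glibc benchtests code, just a sketch with arbitrary sizes and counts).
Because the measured routine runs back-to-back, its text never leaves
the L1 instruction cache, which is exactly the cost your example says a
benchmark like this cannot see:

/* Toy hot-loop timing harness; buffer size and iteration count are
   arbitrary.  The same code path runs back-to-back, so a large
   unrolled memcpy stays resident in the L1 I-cache and the refetch
   cost it would pay after eviction in a real program is not measured.  */

#include <stdio.h>
#include <string.h>
#include <time.h>

#define ITERS 1000000
#define LEN   4096

static char src[LEN], dst[LEN];

int
main (void)
{
  struct timespec start, end;

  clock_gettime (CLOCK_MONOTONIC, &start);
  for (long i = 0; i < ITERS; i++)
    {
      memcpy (dst, src, LEN);
      /* Compiler barrier so the copies are not optimized away.  */
      asm volatile ("" ::: "memory");
    }
  clock_gettime (CLOCK_MONOTONIC, &end);

  double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
  printf ("%.1f ns per call (hot I-cache)\n", ns / ITERS);
  return 0;
}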

>> If we have N tests and they produce N numbers, for a given target,
>> for a given device, for a given workload, there is a set of importance
>> weights on N that should give you some kind of relevance.
>>
> You are jumping to the case where we will have these weights. The
> problematic part is getting them.

I agree.

It's hard to know the weights without having an intuitive understanding
of the applications you're running on your system and what's relevant
for their performance.

Perhaps my example of a weighted average is too abstract to use today.
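
For reference, the kind of scoring I had in mind is nothing more elaborate
than the sketch below; the benchmark names, timings, and workload weights
are all invented for illustration. The hard part is the weights column,
which is exactly what we lack without profile data from real workloads:

/* Toy weighted-score combiner for N benchmark results.  All numbers
   below are invented; a lower score would mean "better for this
   workload".  */

#include <stddef.h>
#include <stdio.h>

struct bench_result
{
  const char *name;
  double ns_per_call;  /* measured time */
  double weight;       /* importance for this workload; sums to 1 */
};

int
main (void)
{
  const struct bench_result results[] = {
    { "memcpy-64B-aligned",     8.0, 0.50 },
    { "memcpy-4KB-unaligned",  95.0, 0.30 },
    { "memcpy-1MB-cold",     2100.0, 0.20 },
  };
  double score = 0.0;

  for (size_t i = 0; i < sizeof results / sizeof results[0]; i++)
    score += results[i].weight * results[i].ns_per_call;

  printf ("weighted score: %.1f ns\n", score);
  return 0;
}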

>> We should be able to come up with some kind of framework from which
>> we can clearly say "this patch is better than this other patch". Even
>> if it is not automated, it should be possible to reason from the results,
>> and to have that reasoning recorded as a discussion on this list.
>>
> What is possible is to say that some patch is significantly worse based
> on some criteria. There is a lot of gray area where the decision is unclear.

:-)

Cheers,
Carlos.

