This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Remove statistically unsound benchmarks


I forgot problem number three: statistical significance.
This is normally not a problem with benchmarks, as repeating a test 1000
times is relatively cheap.
However, these benchmarks try only 32 times for all sizes. Given the
variance of the clock, this may not be enough to rule out random effects
for small sizes.
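
As a rough illustration (my own sketch, not code from the benchtests; the
sample values are made up just so it runs), you can compute the standard
error of the mean from the 32 samples and see how wide the remaining
uncertainty is:

/* compile with gcc and -lm for sqrt */
#include <math.h>
#include <stdio.h>

/* Given the timings collected for one size, print the mean and its
   standard error.  If the mean +- 2*stderr intervals of two
   implementations overlap, 32 samples are not enough to call the
   difference real.  */
static void
summarize (const double *t, int n)
{
  double sum = 0.0, var = 0.0;
  int i;

  for (i = 0; i < n; i++)
    sum += t[i];
  double mean = sum / n;

  for (i = 0; i < n; i++)
    var += (t[i] - mean) * (t[i] - mean);
  var /= n - 1;                         /* sample variance */

  double se = sqrt (var / n);           /* standard error of the mean */
  printf ("mean %.2f, 95%% interval roughly %.2f .. %.2f\n",
          mean, mean - 2 * se, mean + 2 * se);
}

int
main (void)
{
  /* made-up timings in TSC ticks, only to make the sketch runnable */
  double t[32] = { 126, 144, 126, 126, 162, 126, 144, 126,
                   126, 126, 144, 126, 180, 126, 126, 144,
                   126, 126, 126, 144, 126, 126, 162, 126,
                   126, 144, 126, 126, 126, 144, 126, 126 };

  summarize (t, 32);
  return 0;
}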

On Fri, Apr 19, 2013 at 11:16:18PM +1000, Allan McRae wrote:
> On 19/04/13 22:46, Ondřej Bílka wrote:
> > Hi,
> > I found that the testcase measures time. However, its measurement is flawed
> > in several ways, described below.
> > 
> > Problem number one: minimum performance does not correspond to average
> > performance.
> > You can have implementations A and B where A has minimum 3 and average
> > 100 while B has minimum 10 and average 20 (even in ideal conditions).
> > Picking the variant with the smaller minimum is obviously flawed because
> > of this bias.
> > 
> > If we set that problem aside, the second problem is lack of robustness.
> > See http://en.wikipedia.org/wiki/Sample_maximum_and_minimum
> > In our conditions the minimum will pick up more random noise than signal.
> 
> Do these arguments actually apply to what is being tested here?
> 
> I find it difficult (although I do not think entirely impossible) to
> generate a case where method A gives a lower minimum but higher mean
> than method B.  I suppose if method A had substantially higher variance
> in (e.g.) cache misses... 
Easy: implementation A has a branch that is predicted 50% of the time; B's
branch is predicted 90% of the time, but B needs one extra cycle of
computation.

A more realistic example, and the situation we want to avoid, is that more
branching means higher variance.
For simplicity assume that a branch misprediction costs 20 cycles.
Implementation A contains a sequence of 4 branches that each have a 50%
chance of being mispredicted. B contains only 1 branch, but its extra
computation takes 15 cycles.

In 1/16 of cases A suffers no misprediction at all, so its minimum is 15
cycles lower than B's.
However, the mean for A is 4*20/2 = 40 cycles, while for B it is
15 + 20/2 = 25 cycles.
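
You can check this with a quick simulation of the two cost models (just an
illustration of the arithmetic above; the 50% misprediction rate and the
20-cycle penalty are the assumed numbers from the example, not
measurements):

#include <stdio.h>
#include <stdlib.h>

#define RUNS 1000000
#define MISS_COST 20

/* A: four branches, each mispredicted half of the time.  */
static int
cost_a (void)
{
  int c = 0, i;
  for (i = 0; i < 4; i++)
    if (rand () & 1)
      c += MISS_COST;
  return c;
}

/* B: 15 fixed extra cycles plus one branch mispredicted half of the time.  */
static int
cost_b (void)
{
  return 15 + ((rand () & 1) ? MISS_COST : 0);
}

int
main (void)
{
  long sum_a = 0, sum_b = 0;
  int min_a = 1 << 30, min_b = 1 << 30, i;

  for (i = 0; i < RUNS; i++)
    {
      int a = cost_a (), b = cost_b ();
      sum_a += a;
      sum_b += b;
      if (a < min_a)
        min_a = a;
      if (b < min_b)
        min_b = b;
    }

  /* A wins on the minimum (0 vs 15) but loses on the mean (~40 vs ~25).  */
  printf ("A: min %d, mean %.1f\n", min_a, (double) sum_a / RUNS);
  printf ("B: min %d, mean %.1f\n", min_b, (double) sum_b / RUNS);
  return 0;
}

Picking by minimum selects A; picking by mean selects B.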

A large part of optimization work goes into ensuring that branch
prediction is good in the average case, so even the possibility of a bias
introduced by the minimum is unacceptable.


> But that does not invalidate the minimum; it rather says it could be
> supplemented with a mean (or better, a median to avoid outlier effects).
> 
Well, as a first step we could report both the minimum and the mean. Then
you can ask which carries more weight, the mean or the minimum. Then you
can ask why you should report data that you know is misleading at all.
And you cannot use the median, because the distributions are not Gaussian.

The simplest example is a 16-byte load which crosses a cache line with
probability 1/4. According to the median you would never consider the
contribution of the cache-crossing behaviour. Situations that split into a
likely and a less likely case that need to be treated separately are quite
common.
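
To put hypothetical numbers on it (made up only to show the effect): if an
aligned load takes 4 cycles and a line-crossing load takes 24, the mean is
0.75*4 + 0.25*24 = 9 cycles, so the slow path contributes 5 cycles per call
on average. The median of the samples is just 4 cycles, exactly as if the
cache-crossing case did not exist.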

> The properties of the minimum of a sample from a distribution you
> mention all assume a continuous variate - i.e. the distribution has
> tails.  In the cases here, there is a hard limit for the minimum.  So
> the variance of the minimum is not as unstable.  In fact, going to the
> wikipedia page you linked:

Well, there is also technical problem number four: the clock.
Finding a good, unbiased clock is a hard problem.
You must handle its limited precision. On x64 the timestamp counter
measures cycles in multiples of the bus clock ratio (18 on my machine, if
I remember correctly).
This causes the minimum and the median to always be multiples of 18.

However, when you take the mean, the standard technique of adding white
noise (dithering) lets you measure with single-cycle granularity.
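
Here is a small simulation of that effect (my own sketch; the 18-cycle
tick and the 100-cycle "true" cost are assumptions for illustration, and
uniform jitter stands in for the white noise):

#include <stdio.h>
#include <stdlib.h>

#define TICK 18          /* the clock only advances in multiples of this */
#define TRUE_COST 100    /* the cost we are trying to measure */
#define SAMPLES 100000

int
main (void)
{
  long sum = 0;
  int min = 1 << 30, i;

  for (i = 0; i < SAMPLES; i++)
    {
      /* white noise uniform in [0, TICK) added before quantisation */
      int raw = TRUE_COST + rand () % TICK;
      int reading = (raw / TICK) * TICK;   /* what the coarse clock reports */

      sum += reading;
      if (reading < min)
        min = reading;
    }

  /* The minimum stays stuck on a multiple of 18 (90 here), while the mean
     converges to the true 100-cycle cost.  */
  printf ("min %d, mean %.2f\n", min, (double) sum / SAMPLES);
  return 0;
}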

> "However, with clean data or in theoretical settings, they can sometimes
> prove very good estimators, particularly for platykurtic distributions,
> where for small data sets the mid-range is the most efficient estimator."
> 
> Which, yes, I know is not exactly about the minimum, but it still
> applies somewhat...
> 


> > A mean http://en.wikipedia.org/wiki/Sample_mean_and_sample_covariance
> > should be used instead.
> > However, you may need to first filter out outliers (context switches, minor page faults...)
> > 
> > The second problem is that measurements are done in a tight loop. Even
> > if the measurements were done correctly, repeatedly calling a function
> > with the same arguments is also problematic.
> > 
> > It estimates performance when a user repeatedly calls the function with
> > the same arguments, and little else.
> > 
> > In experiments you must replicate real-world conditions as much as
> > possible. By neglecting factors you introduce more and more errors into
> > the estimates.
> > 
> > Here, randomized tests capture more factors and will almost always be
> > closer to real-world performance. If you have to choose whether to trust
> > a randomized or an idealized benchmark, the randomized one is the better
> > bet.
> > Why should we keep, or even read, this data when we know it is useless?
> 
> I do not agree that the generated data is useless.  It shows the speed
> for a specific case.  

Well, by that logic we could have a faster qsort that runs in linear time:
the data may already be sorted, so we check for that and are fast in the
best case.
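
To spell the analogy out (a throwaway sketch, not a proposal): the wrapper
below really is linear on already-sorted input, and a benchmark that only
ever feeds it sorted arrays would happily report linear-time sorting.

#include <stdio.h>
#include <stdlib.h>

static int
cmp_int (const void *a, const void *b)
{
  return *(const int *) a - *(const int *) b;
}

/* qsort with a best-case shortcut: O(n) when the input is already sorted.  */
static void
sort_with_best_case (int *a, size_t n)
{
  size_t i;

  for (i = 1; i < n; i++)
    if (a[i - 1] > a[i])
      break;
  if (i == n)
    return;                  /* already sorted, nothing left to do */
  qsort (a, n, sizeof (int), cmp_int);
}

int
main (void)
{
  int v[] = { 1, 2, 3, 4, 5 };     /* the "benchmark" input: already sorted */

  sort_with_best_case (v, 5);
  printf ("%d %d %d %d %d\n", v[0], v[1], v[2], v[3], v[4]);
  return 0;
}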

> If enough cases are tested, it will cover the
> range of calls in realistic usage.  So the key is to pick a good set of
> parameters that reflect all paths through the function calls.  This also
> requires knowledge of which paths are more important to optimise due to
> usage, but you are going to need that to generate "randomised" tests anyway.
> 

> > Another argument against tight loops is the following:
> > Assume you picked a set of (say 5) benchmarks. If you pick benchmarks
> > that cover the cases encountered in the real world and are as close to
> > them as possible, then maximizing performance on them will likely
> > improve performance in the real world.
> > However, if you pick benchmarks that are distant and do not really cover
> > the entire use-case space, then maximizing results on those benchmarks
> > will result in a random change in the real world.
> 
> As above, picking good parameters to test is the key.
> 
> > Given these flaws, any results that use them are unscientific and we
> > should remove them.
> 
> I disagree.  Yes, better benchmarks could probably be made, but the
> current ones are not wrong in and of themselves and are still better than
> nothing.  And I have seen substantial real-world improvements made on the
> basis of their output so far.
> 
Like sse4 implementations?

Ondra

