This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] randomize benchtests


On Fri, 2013-05-17 at 19:25 +0200, Ondřej Bílka wrote:
> On Fri, May 17, 2013 at 04:58:32PM +0200, Torvald Riegel wrote:
> > On Fri, 2013-05-17 at 16:05 +0200, Ondřej Bílka wrote:
> > > On Fri, May 17, 2013 at 02:44:24PM +0200, Torvald Riegel wrote:
> > > > On Fri, 2013-05-17 at 13:47 +0200, Ondřej Bílka wrote:
> > > > > On Fri, May 17, 2013 at 01:16:01PM +0200, Torvald Riegel wrote:
> > > > > > On Fri, 2013-05-17 at 12:44 +0200, Ondřej Bílka wrote:
> > > > > > > On Fri, May 17, 2013 at 12:24:30PM +0200, Torvald Riegel wrote:
> > > > > > > > On Mon, 2013-04-22 at 14:56 +0200, Ondřej Bílka wrote:
> > > > > > > > > On Mon, Apr 22, 2013 at 05:44:14PM +0530, Siddhesh Poyarekar wrote:
> > > > > > > > > > On 22 April 2013 17:30, Ondřej Bílka <neleai@seznam.cz> wrote:
> snip
> > > > > This only adds noise, which can be controlled by a sufficient
> > > > > number of samples.
> > > > > 
> > > > > Reproducibility? These tests are not reproducible, nor are they
> > > > > designed to be reproducible.
> > > > 
> > > > They should be, though not necessarily at the lowest level.  If they
> > > > weren't reproducible, in the sense of being completely random, you
> > > > couldn't derive any qualitative statement -- which is what we
> > > > ultimately want to do.
> > > 
> > > You must distinguish between http://en.wikipedia.org/wiki/Noise
> > > and http://en.wikipedia.org/wiki/Bias. The expected value of the former
> > > does not depend on the selected implementation, whereas that of the
> > > latter does.
> > 
> > This is unrelated to what I said.
> > 
> It is related (see below).
> 
> > > What should be reproducible are the ratios between implementations in a
> > > single test (see below). That is the thing that matters.
> > 
> > That's *one thing* that we can try to make reproducible.  What matters
> > in the end is that we find out whether there was a performance
> > regression, meaning that our current implementation doesn't have the
> > performance properties anymore that it once had (eg, it's now slower
> > than an alternative).  Our performance tests need to give us
> > reproducible results in the sense that we can rely on them showing
> > performance regressions.
> > 
> To test for regressions you need to compare with alternatives. Testing all
> alternatives together is better, as it avoids a lot of possible errors.
> 
> > > When you compare different
> > > runs you introduce bias if you are not careful.
> > 
> > When comparing results from different machines, we *may* be comparing
> > apples and oranges, but we're not necessarily doing so; this really just
> > depends on whether the differences in the setups actually make a
> > difference for the question we want to answer.
> >
> I said that they are worthless for drawing conclusions. You cannot decide
> that something is faster just because you have two different measurements.
> To see what happens you need more granular benchmarking.
> 
> > Where did I suggest to compare results from different machines *as-is*
> > without considering the differences between the machines?  But at the
> > same time, we can't expect to get accurate measurements all the time, so
> > we need to deal with imprecise data.
> > 
> > > and as you said:
> > > > Even if there is noise that we can't control, this doesn't mean it's
> > > > fine to add further noise without care (and not calibrating/measuring
> > > 
> > > 
> > > > 
> > > > > Runs on the same machine are affected by many environmental
> > > > > effects, and anything other than a randomized comparison of
> > > > > implementations in the same run has bias that makes the data worthless.
> > > > 
> > > > Even if there is noise that we can't control, this doesn't mean it's
> > > > fine to add further noise without care (and not calibrating/measuring
> > > > against a loop with just the rand_r would be just that).
> > > > 
> > > Adding noise is perfectly fine. You estimate the variance and, based on
> > > this information, you choose a number of iterations such that the
> > > measurement error it causes is, in 99% of cases, within 1% of the mean.
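
A minimal sketch of that calibration step, assuming a hypothetical
pilot_run() timing hook and the 1%/99% targets named above; this is not
part of the actual benchtests.  Take a pilot sample, estimate the mean
and variance, then size the full run so the error of the sample mean
stays within the target:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical pilot measurement: one timed iteration of the routine
   under test, in nanoseconds.  Stubbed out here as 100ns +/- 5ns.  */
static double
pilot_run (void)
{
  return 100.0 + (rand () % 11) - 5;
}

int
main (void)
{
  enum { PILOT = 100 };
  double sum = 0.0, sumsq = 0.0;

  for (int i = 0; i < PILOT; i++)
    {
      double t = pilot_run ();
      sum += t;
      sumsq += t * t;
    }

  double mean = sum / PILOT;
  double var = (sumsq - PILOT * mean * mean) / (PILOT - 1);

  /* Want P(|sample mean - true mean| < 0.01 * mean) >= 0.99.
     Normal approximation: n >= (z * sigma / (0.01 * mean))^2,
     with z = 2.576 for a two-sided 99% interval.  */
  double z = 2.576;
  double target = 0.01 * mean;
  double n = z * z * var / (target * target);

  printf ("pilot mean %.1f, variance %.1f => need at least %.0f iterations\n",
          mean, var, ceil (n));
  return 0;
}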
> > > 
> > > You probably meant bias
> > 
> > Whether something is noise or bias depends on the question you're asking.
> >
> Noise and bias are technical terms with fixed meanings, and you must
> distinguish between them.

I believe you didn't understand what I said.  I was not disputing that
the two concepts are distinct; I was saying that which one applies
depends on the question you're asking, that is, on what your model and
assumptions are.  Just look at the noise explanation you cited: "random
unwanted data without meaning" depends exactly on which properties you
assign a meaning to and which ones you ignore.  In your example, if you
are measuring the jitter of a set of measurements, then adding random()
is not noise.

> When you write benchmark
> 
> time = exact measurement;
> if (implementation ==  mine)
>   time /= 2;
> 
> then it is biased but not noisy.
> 
> When you write
> 
> time = exact measurement + random();
> 
> then it is noisy but not biased.
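
To make the distinction concrete, a hedged sketch (the TRUE_COST constant
and the two model functions are illustrative, not measurements): averaging
many samples drives a zero-mean noise term towards zero, but leaves a
constant bias untouched, which is why the two have to be handled
differently.

#include <stdio.h>
#include <stdlib.h>

#define SAMPLES   100000
#define TRUE_COST 100.0     /* hypothetical true cost of the routine */

/* Zero-mean noise in [-5, +5): averages away with enough samples.  */
static double
noisy (void)
{
  return TRUE_COST + ((double) rand () / RAND_MAX) * 10.0 - 5.0;
}

/* Constant bias of +5: no amount of averaging removes it.  */
static double
biased (void)
{
  return TRUE_COST + 5.0;
}

int
main (void)
{
  double noisy_sum = 0.0, biased_sum = 0.0;
  for (int i = 0; i < SAMPLES; i++)
    {
      noisy_sum += noisy ();
      biased_sum += biased ();
    }
  printf ("noisy mean:  %.2f (true cost %.1f)\n", noisy_sum / SAMPLES, TRUE_COST);
  printf ("biased mean: %.2f (true cost %.1f)\n", biased_sum / SAMPLES, TRUE_COST);
  return 0;
}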
> 
> From wikipedia:
> 
> NOISE:
> 
> In signal processing or computing noise can be considered random
> unwanted data without meaning; that is, data that is not being used to
> transmit a signal, but is simply produced as an unwanted by-product of
> other activities. "Signal-to-noise ratio" is sometimes used to refer to
> the ratio of useful to irrelevant information in an exchange.
> 
> BIAS:
>  
> In statistics, there are several types of bias: Selection bias, where
> there is an error in choosing the individuals or groups to take part in
> a scientific study. It includes sampling bias, in which some members of
> the population are more likely to be included than others. Systematic bias 
> or systemic bias are external influences that may affect the accuracy of 
> statistical measurements.
> 
> > > and this is the reason why we do not try to
> > > compare different runs.
> > 
> > But in practice, you'll have to, to some extent, if we want to take
> > user-provided measurements into account.
> > 
> For user-provided measurements, the first question is to verify that they
> measure the correct metric. There are several pitfalls you must avoid.
> 
> > > > Even if we have different runs on different machines, we can look for
> > > > regressions among similar machines.  Noise doesn't make the data per se
> > > > worthless.  And randomizing certain parameters doesn't necessarily
> > > Did you try running the test twice?
> > 
> > ??? I can't see any relevant link between that sentence of yours and
> > what we're actually discussing here.
> >
> You need to know how big these random factors are. Run the test twice and
> you will see.

I do know, and yet there's still no link to what we were discussing.

> > > > remove any bias, because to do that you need to control all the
> > > > parameters with your randomization.  And even if you do that, if you
> > > The fact that you cannot eliminate something completely does not mean
> > > that you should give up. The goal is to manage
> > > http://en.wikipedia.org/wiki/Systematic_error and keep it within
> > > reasonable bounds.
> > 
> > How does that conflict with what I said?
> > 
> > > Branch mispredictions and cache effects are major issues that can cause
> > > bias, and randomization is crucial.
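
A sketch of the kind of randomized in-run comparison being advocated (the
impl_a/impl_b stand-ins and the simple rand()-based ordering are
illustrative assumptions, not the actual benchtest harness): interleave the
candidates in a random order within one run, so that drift in the
environment (frequency scaling, interrupts, cache state) hits both roughly
equally, and compare the accumulated totals as a ratio.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative stand-ins for two implementations under comparison.  */
static void impl_a (void) { /* ... candidate A ... */ }
static void impl_b (void) { /* ... candidate B ... */ }

static double
timed_call (void (*fn) (void))
{
  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  fn ();
  clock_gettime (CLOCK_MONOTONIC, &t1);
  return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int
main (void)
{
  enum { ROUNDS = 10000 };
  double total[2] = { 0.0, 0.0 };
  void (*impl[2]) (void) = { impl_a, impl_b };

  srand (time (NULL));
  for (int i = 0; i < ROUNDS; i++)
    {
      /* Random order each round, so slow environmental drift is spread
         evenly over both implementations instead of favouring one.  */
      int first = rand () & 1;
      total[first] += timed_call (impl[first]);
      total[1 - first] += timed_call (impl[1 - first]);
    }

  printf ("A/B ratio: %.3f\n", total[0] / total[1]);
  return 0;
}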
> > 
> > The point I made is that randomization doesn't necessarily avoid the
> > issue, so it is not a silver bullet.
> > 
> > > Other factors are just random and they do not favor one implementation
> > > over another in a significant way.
> > 
> > I believe we need to be careful here to not be dogmatic.  First because
> > it makes the discussions harder.  Second because we're talking about
> > making estimations about black boxes; there's no point in saying "X is
> > the only way to measure this" or "Only Y matters here", because we can't
> > expect to 100% know what's going on -- it will always involve a set of
> > assumptions, tests for those, and so on.  Thus being rather open-minded
> > than dogmatic is helpful.
> >
> There is difference between being scientific and dogmatic.

Sure, and you can *try* to follow a scientific approach and still be
dogmatic at the same time.  So don't be dogmatic, and be open to
re-evaluating what you thought was the "truth"; maybe we didn't
understand all aspects of it right away...

I'm not convinced anymore that we can have a meaningful, constructive
discussion.  Thus I won't continue to take part in this conversation.

Torvald

> There are
> good reasons why you should do things in a certain way; unless you know
> that reason, doing something else is usually a mistake.
> 
> The best way to measure something depends on what you want to measure.
> So what do you think the important properties are?
> 
> What we estimate are not entirely black boxes. We have relatively
> accurate models of how processors work.
> 
>  
> > > > use, say, a uniform distribution for a certain input param, but most of
> > > > our users are actually interested in just a subset of the input range most
> > > I do exactly that, see my dryrun/profiler. I do not know why I am
> > > wasting time trying to improve this.
> > > > of the time, then even your randomization isn't helpful.
> > > > 
> > > > To give an example: It's fine if we get measurements for machines that
> > > > don't control their CPU frequencies tightly, but it's not fine to throw
> > > > away this information (as you indicated by dismissing the idea of a
> > > > warning that someone else brought up).
> > > Where did I write that I dismissed it? I only said that there are
> > > more important factors to consider. If you want to write a patch that
> > > warns and sets the CPU frequency, fine.
> > 
> > The answer to this suggestion by Petr Baudis started with:
> > "Warning has problem that it will get lost in wall of text. 
> > 
> > Only factor that matters is performance ratio between implementations."
> > 
> > Which sounds pretty dismissive to me.  If it wasn't meant like that,
> > blame translation...
> 


