This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.



Re: [PATCH] ARM: NEON detected memcpy.


On 09/04/13 13:58, Carlos O'Donell wrote:
On 04/09/2013 05:04 AM, Richard Earnshaw wrote:
On 03/04/13 16:08, Joseph S. Myers wrote:
I was previously told by people at ARM that NEON memcpy wasn't a good idea
in practice because of increased power consumption, context-switch costs,
etc., from using NEON in processes that otherwise didn't use it, even if it
appeared superficially beneficial in benchmarks.

What really matters is the system power increase versus the performance
gain, plus whatever you might be able to save by finishing sooner.  If a
10% improvement to memcpy performance comes at a 12% increase in CPU
power, then that might seem like a net loss.  But if the CPU is only
50% of the system power, then the increase in system power is just
half of that (i.e. 6%), while the performance improvement is still
10%.  Note that these percentages are just examples to make the
figures easier here; I've no idea what the real numbers are, and they
will be highly dependent on the other components in the system: a
back-lit display, in particular, will use a significant amount of power.
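
To make the arithmetic explicit, here is a back-of-the-envelope sketch in C.
All three inputs are the illustrative figures from the paragraph above, not
measurements, and the energy line assumes the whole speed-up turns into
reduced run time for the copy while everything else stays the same.

/* Back-of-the-envelope sketch: turns the example figures above into
   numbers.  All three inputs are illustrative, not measurements.  */
#include <stdio.h>

int
main (void)
{
  double speedup = 0.10;       /* memcpy gets 10% faster.      */
  double cpu_power_up = 0.12;  /* CPU power rises by 12%.      */
  double cpu_share = 0.50;     /* CPU is 50% of system power.  */

  /* The system-level power increase is diluted by the CPU's share
     of the total.  */
  double sys_power_up = cpu_power_up * cpu_share;

  /* Energy is roughly power * time, so finishing sooner claws some
     of the extra power back.  This only covers the interval while
     the copy is actually running.  */
  double energy_delta = (1.0 + sys_power_up) / (1.0 + speedup) - 1.0;

  printf ("system power while copying: +%.1f%%\n", 100.0 * sys_power_up);
  printf ("energy spent on the copy:   %+.1f%%\n", 100.0 * energy_delta);
  return 0;
}

With these inputs it prints a 6% rise in system power during the copy but
roughly 3.6% less energy spent on the copy itself, which is the trade-off
being described.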

It's also necessary to think about how the Neon unit in the processor
is managed.  Is it power gated or simply clock gated?  Power-gated
regions are likely to have long power-up times (relative to normal
CPU operations), but clock-gated regions are typically
available instantaneously.

Finally, you need to consider whether the unit is likely to be
already in use.  With the increasing trend towards the hard-float
ABI, VFP (and Neon) are much more widely used in code now than they
were, so the other potential cost of using Neon (lazy context
switching) is also likely to be a non-issue, in a way that it would
not be if the unit were almost never touched.
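
As an aside, the lazy-enable cost is easy enough to poke at from user space.
The probe below is only a sketch (it is not from glibc or from the patch):
on a kernel that enables VFP/Neon lazily, the first Neon access in a process
takes an undefined-instruction trap and shows up as a spike, while later
accesses are cheap.  On a hard-float system the C runtime has often touched
VFP well before main(), so every access already looks warm, which is exactly
the point above.  Build with something like -O0 -mfpu=neon (and -lrt on
older glibc for clock_gettime); the results depend heavily on the kernel
configuration and on timer resolution.

/* Rough probe: time the first Neon access in this process against
   later ones.  On a lazy-enable kernel the first access includes the
   trap that turns the unit on; if startup code already used VFP/Neon,
   every iteration will look the same.  */
#include <arm_neon.h>
#include <stdio.h>
#include <time.h>

static long long
now_ns (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int
main (void)
{
  uint8_t src[16] = { 0 };
  uint8_t dst[16];
  int i;

  for (i = 0; i < 4; i++)
    {
      long long t0 = now_ns ();
      uint8x16_t v = vld1q_u8 (src);   /* Neon load.   */
      vst1q_u8 (dst, v);               /* Neon store.  */
      long long t1 = now_ns ();
      printf ("neon access %d: %lld ns\n", i, t1 - t0);
    }
  return 0;
}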

My expectation here is that downstream integrators run the
glibc microbenchmarks, or their own benchmarks, measure power,
and engage the community to discuss alternate runtime tunings
for their systems.

The project lacks any generalized whole-system benchmarking,
but my opinion is that microbenchmarks are the best "first step"
towards achieving measurable performance goals (since whole-system
benchmarking is much more complicated).

At present the only policy we have as a community is that faster
is always better.
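
For a concrete idea of what such a microbenchmark boils down to, here is a
minimal hand-rolled memcpy throughput loop.  It is only a sketch, not
glibc's own microbenchmark suite, and as the comment notes it reuses the
same buffers throughout, so it can only ever measure the warm-cache case.

/* Hand-rolled sketch of a memcpy throughput microbenchmark; not
   glibc's own suite.  It reuses the same buffers on every iteration,
   so it only measures the warm-cache case.  */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SIZE  4096
#define ITERS 1000000

int
main (void)
{
  char *src = malloc (SIZE);
  char *dst = malloc (SIZE);
  struct timespec t0, t1;
  double secs;
  int i;

  if (src == NULL || dst == NULL)
    return 1;
  memset (src, 'x', SIZE);

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (i = 0; i < ITERS; i++)
    memcpy (dst, src, SIZE);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  /* Print a byte of dst so the copies cannot be optimized away.  */
  printf ("%c %.2f MB/s\n", dst[0], (double) SIZE * ITERS / secs / 1e6);
  return 0;
}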


You still have to be careful how you measure 'faster'.  Repeatedly running the same fragment of code under the same boundary conditions will only ever give you the 'warm caches' number (instruction, data and branch-target).  If, in real life, the code is mostly called cold (or, in the case of the branch-target cache, with different boundary conditions), that number is unlikely to be very meaningful.
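
To make that concrete, one rough way to avoid measuring only the warm-cache
case is to walk a source pool much larger than the last-level cache and to
vary the destination offset, so successive calls do not keep hitting the
same lines.  Again this is only a sketch; the pool size and iteration count
are made-up figures that would need tuning for a real system.

/* Sketch of a "colder" measurement: cycle through a pool much larger
   than the last-level cache and shift the destination offset so that
   successive calls do not keep hitting the same lines.  The sizes
   here are made-up figures.  */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define POOL  (64 * 1024 * 1024)
#define SIZE  4096
#define ITERS 100000

int
main (void)
{
  char *pool = malloc (POOL);
  char *dst = malloc (SIZE + 64);
  struct timespec t0, t1;
  double secs;
  size_t off = 0;
  int i;

  if (pool == NULL || dst == NULL)
    return 1;
  memset (pool, 'x', POOL);

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (i = 0; i < ITERS; i++)
    {
      memcpy (dst + (i % 16), pool + off, SIZE);
      off += SIZE;
      if (off + SIZE > POOL)
        off = 0;
    }
  clock_gettime (CLOCK_MONOTONIC, &t1);

  secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf ("%c %.2f MB/s\n", dst[0], (double) SIZE * ITERS / secs / 1e6);
  return 0;
}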

R.


