This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.



Re: [PATCHv2] ARM: NEON optimized implementation of memcpy.


On Tuesday 14 July 2009 20:39:27 ext Daniel Jacobowitz wrote:
> On Tue, Jul 14, 2009 at 08:17:22PM +0300, Siarhei Siamashka wrote:
> > > We also have a NEON memcpy at CodeSourcery (and performance
> > > improvements to non-NEON memcpy), as well as versions of some other
> > > string functions, adapted to glibc, that ARM recently contributed to
> > > newlib, but those are also waiting on copyright assignments from ARM. 
> > > I haven't compared the performance of the two implementations.
> >
> > Do you have this code available for general public somewhere already? I
> > can benchmark your implementations of these functions and provide some
> > feedback.
>
> Sure - if you grab our latest Lite Edition tools from the web site
> you'll get this code.  Either source or binary package.
>
>   http://www.codesourcery.com/sgpp/lite/arm

The memcpy implementation in that package is written in C, probably in the
hope that the compiler can generate good code for it. I highly doubt that
this is going to happen any time soon, so hand-written assembly code will
always be better.

> > It looks like __aeabi_memcpy* may need a separate implementation anyway.
> > Any extra hops are bad for the performance. Though saving and restoring
> > NEON registers should not add too much overhead.
>
> Yes, it ought to get a separate implementation; we haven't done this
> yet because GCC doesn't generate calls to them.

That's good to know, because the way they are now, they can only lose
performance. Is anything else using these __aeabi_memcpy* functions at the
moment?

These functions probably don't even belong in glibc, and would need to be
statically linked into each application for best performance, because
inter-library calls go through the ".plt" section and waste some extra CPU
cycles jumping through code like this:

    8474:       e28fc600        add     ip, pc, #0      ; 0x0
    8478:       e28cca09        add     ip, ip, #36864  ; 0x9000
    847c:       e5bcf18c        ldr     pc, [ip, #396]!

Of course, applying the same logic to all the other glibc functions would
not be very reasonable :-)
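
To make the stub above a bit more concrete, here is a small C sketch of the
address arithmetic it performs. It assumes the stub runs in ARM state, where
reading pc yields the address of the current instruction plus 8. The helper
name plt_target_slot is made up for illustration and is not part of glibc or
any toolchain.

```c
#include <stdint.h>

/* In ARM state, reading pc yields the current instruction address + 8. */
#define ARM_PC_READ_OFFSET 8u

/*
 * Hypothetical helper: compute the address that the final
 * "ldr pc, [ip, #off]!" of a three-instruction PLT stub loads from.
 *   stub_addr: address of the first instruction of the stub
 *   imm1:      immediate of "add ip, pc, #imm1"
 *   imm2:      immediate of "add ip, ip, #imm2"
 *   ldr_off:   offset of "ldr pc, [ip, #ldr_off]!"
 */
uint32_t plt_target_slot(uint32_t stub_addr, uint32_t imm1,
                         uint32_t imm2, uint32_t ldr_off)
{
    uint32_t ip = stub_addr + ARM_PC_READ_OFFSET + imm1; /* add ip, pc, #imm1 */
    ip += imm2;                                          /* add ip, ip, #imm2 */
    return ip + ldr_off;                                 /* ldr pc, [ip, #off]! */
}
```

With the values from the listing, plt_target_slot(0x8474, 0x0, 0x9000, 396)
gives 0x11608, which would be the table slot holding the real address of the
function. So every call pays for two adds plus a dependent load before
reaching the callee.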

There must be some reason why these __aeabi_memcpy* functions exist in the
first place. Probably somebody thought that handling very small copies was
performance critical. I don't know whether that is actually justified in
practice.

> The NEON restriction is a bit weird. 

I can see a good reason behind this. Suppose the compiler wants to copy some
data from one place to another in the middle of a large function that does
complex floating point calculations and uses lots of registers. Having to
ensure that no VFP/NEON registers get corrupted as a side effect of this
memcpy call is an extra burden on the compiler. A special calling convention
for this function makes everything easier and faster.

> This function is supposed to be optimized for large transfers, where NEON is
> most likely to be useful. 

It's much better if the function is optimized for transfers of any size. And
as it turns out, this is not particularly challenging to achieve.
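
As a rough illustration of what handling every size in one function can look
like, here is a minimal C sketch. It is not the NEON code from the patch: the
name sketch_copy_any_size and the 32-byte threshold are assumptions picked
for the example, and the 16-byte inner loop only stands in for what would be
a NEON load/store pair in assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical routine, for illustration only. */
void *sketch_copy_any_size(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Small copies skip all setup and fall through to the byte loop. */
    if (n >= 32) {
        /* Head: copy bytes until the destination is 16-byte aligned. */
        size_t head = (size_t)(-(uintptr_t)d & 15u);
        n -= head;
        while (head--)
            *d++ = *s++;

        /* Bulk: 16 bytes per iteration (placeholder for a NEON block). */
        while (n >= 16) {
            for (int i = 0; i < 16; i++)
                d[i] = s[i];
            d += 16;
            s += 16;
            n -= 16;
        }
    }

    /* Tail, and the whole copy when n was small to begin with. */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

The point of the structure is that a small copy costs one compare before it
reaches the byte loop, while a large copy spends nearly all of its time in
the aligned bulk loop.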

-- 
Best regards,
Siarhei Siamashka

