This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.



Re: [PATCH] Optimize MIPS memcpy


On 3/09/2012, at 9:12 PM, Andrew T Pinski wrote:

> Forgot to CC libc-ports@ .
> On Sat, 2012-09-01 at 18:15 +1200, Maxim Kuvyrkov wrote:
>> This patch improves MIPS assembly implementations of memcpy.  Two optimizations are added: prefetching of data for subsequent iterations of memcpy loop and pipelined expansion of unaligned memcpy.  These optimizations speed up MIPS memcpy by about 10%.
>> 
>> The prefetching part is straightforward: it adds prefetching of a cache line (32 bytes) one iteration ahead for the unaligned case and two iterations ahead for the aligned case.  The rationale is that a prefetch takes about as long to acquire data as 1 iteration of the unaligned loop or 2 iterations of the aligned loop.  Values for these parameters were tuned on a modern MIPS processor.
>> 
> 
> This might hurt Octeon as the cache line size there is 128 bytes.  Can
> you say which modern MIPS processor this has been tuned on?  And
> is there a way to not hard code 32 in the assembly but put it in a
> macro instead?

This was implemented with NetLogic XLR/XLP in mind.

The description I wrote above was not completely accurate with regard to why we assume a 32-byte prefetch (as I mentioned, this patch was developed almost 3 years ago).  For 32-bit ABIs one iteration of the loop processes 32 bytes of data -- that's how much fits into the 8 available registers at once.  Therefore we prefetch in 32-byte blocks and issue 1 prefetch instruction per iteration (well, 2 prefetches actually -- one for read and one for write).  It is possible to issue prefetch instructions only every Nth iteration, but the overhead of doing so would likely be greater than the benefit.
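
For illustration only (this is not code from the patch), the 32-bit loop shape described above can be sketched in C.  The sketch assumes GCC's or Clang's __builtin_prefetch as a stand-in for the MIPS "pref" instruction; the names copy_blocks32, BLOCK_BYTES and AHEAD_BLOCKS are hypothetical, and the prefetch distance of 2 blocks is the aligned-case value quoted earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants for this sketch, not taken from the patch. */
#define BLOCK_BYTES  32   /* bytes copied per iteration: 8 x 32-bit words */
#define AHEAD_BLOCKS 2    /* prefetch distance, in iterations */

/* Copy nbytes between word-aligned buffers; nbytes must be a multiple
   of BLOCK_BYTES.  A real memcpy would also handle heads and tails. */
void copy_blocks32(uint32_t *dst, const uint32_t *src, size_t nbytes)
{
    assert(nbytes % BLOCK_BYTES == 0);

    for (size_t off = 0; off < nbytes; off += BLOCK_BYTES) {
        /* One read prefetch and one write prefetch per iteration.
           The addresses are formed with integer arithmetic and are
           never dereferenced, so running past the buffer is harmless. */
        uintptr_t ahead = off + (uintptr_t)AHEAD_BLOCKS * BLOCK_BYTES;
        __builtin_prefetch((const void *)((uintptr_t)src + ahead), 0);
        __builtin_prefetch((const void *)((uintptr_t)dst + ahead), 1);

        const uint32_t *s = src + off / sizeof(uint32_t);
        uint32_t *d = dst + off / sizeof(uint32_t);

        /* Load all eight words first, then store them, mirroring a
           loop body that fills 8 registers before writing back. */
        uint32_t r0 = s[0], r1 = s[1], r2 = s[2], r3 = s[3];
        uint32_t r4 = s[4], r5 = s[5], r6 = s[6], r7 = s[7];
        d[0] = r0; d[1] = r1; d[2] = r2; d[3] = r3;
        d[4] = r4; d[5] = r5; d[6] = r6; d[7] = r7;
    }
}
```

Skipping the prefetch on some iterations would need an extra counter and branch in this loop, which is the overhead referred to above.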

For 64-bit ABIs we process 64 bytes per iteration, so we could deal with just a single 64-byte-or-wider prefetch per iteration.  As it happens, XLR/XLP prefetch 32 bytes at a time, so the current implementation issues 2 prefetches per iteration.

It is feasible to use 2 macros for the 64-bit implementation: PREFETCH32 and PREFETCH64.  XLR/XLP would define both of these macros to "pref", while Octeon would define PREFETCH64 to "pref" and PREFETCH32 to "nop", thus issuing a single prefetch per iteration.
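
A rough C analogue of the two-macro scheme, as a sketch under stated assumptions rather than the actual assembly: __builtin_prefetch (GCC/Clang) stands in for "pref", and CACHE_LINE_BYTES and copy_blocks64 are hypothetical names introduced here.  Only PREFETCH32 and PREFETCH64 come from the proposal above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build-time knob: 32 models XLR/XLP, 64 or more models a
   target with wider lines such as Octeon. */
#ifndef CACHE_LINE_BYTES
#define CACHE_LINE_BYTES 32
#endif

/* PREFETCH64 is always a real prefetch. */
#define PREFETCH64(addr, rw) __builtin_prefetch((addr), (rw))

/* PREFETCH32 covers the second half of a 64-byte block and becomes a
   no-op when one prefetch already covers the whole block. */
#if CACHE_LINE_BYTES >= 64
#define PREFETCH32(addr, rw) ((void)0)
#else
#define PREFETCH32(addr, rw) __builtin_prefetch((addr), (rw))
#endif

/* Copy nbytes (a multiple of 64) between 8-byte-aligned buffers,
   prefetching the next 64-byte block for read and for write. */
void copy_blocks64(uint64_t *dst, const uint64_t *src, size_t nbytes)
{
    assert(nbytes % 64 == 0);

    for (size_t off = 0; off < nbytes; off += 64) {
        /* Integer arithmetic only; the addresses are never dereferenced. */
        uintptr_t next_src = (uintptr_t)src + off + 64;
        uintptr_t next_dst = (uintptr_t)dst + off + 64;

        PREFETCH64((const void *)next_src, 0);
        PREFETCH32((const void *)(next_src + 32), 0);
        PREFETCH64((const void *)next_dst, 1);
        PREFETCH32((const void *)(next_dst + 32), 1);

        const uint64_t *s = src + off / sizeof(uint64_t);
        uint64_t *d = dst + off / sizeof(uint64_t);
        for (int i = 0; i < 8; i++)
            d[i] = s[i];
    }
}
```

Building with -DCACHE_LINE_BYTES=64 drops the two PREFETCH32 calls, leaving one read and one write prefetch per iteration.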

However, I doubt that the above improvement is worth the increased complexity of the memcpy implementation.  I would expect most modern CPUs to quickly discard extraneous prefetch instructions.  And the most we can reasonably save here is 1 read and 1 write prefetch instruction per iteration of the 64-bit memcpy.

Andrew, if you still think that issuing as few prefetches as possible would provide a significant performance improvement on Octeon, would you please compare performance between the two approaches (removing the second prefetch from the 64-bit implementation is a trivial change) and get back to the list with the results?

Thank you,

--
Maxim Kuvyrkov
Mentor Graphics

