This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] aarch64: optimize the unaligned case of memcmp
On Mon, Jun 26, 2017 at 2:00 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Sebastian Pop wrote:
>> On 06/23/2017 04:28 PM, Wilco Dijkstra wrote:
>>
>> > Where is the setup of limit_wd and limit???
>>
>> You are right, my patch was not quite correct: I was missing the
>> initialization of limit_wd, like so:
>>
>> lsr limit_wd, limit, #3
>>
>> limit is the number of bytes to be compared passed in as a parameter to
>> memcmp.
>
> You're still missing the setting of limit. Your current version will do the
> words up to limit - (limit & 7), and then do byte by byte using the original
> value of limit, so it's going well outside its bounds...
You are right, I was missing the "and limit, limit, #7". With that added,
the performance differs from that of the byte-by-byte memcmp, but it is
still lower than with aligning src1:
Benchmark                            Time        CPU  Iterations
------------------------------------------------------------------------------
BM_string_memcmp_unaligned/8      1288 ns    1288 ns      540945   5.92208MB/s
BM_string_memcmp_unaligned/16     1303 ns    1303 ns      537143   11.7123MB/s
BM_string_memcmp_unaligned/20      341 ns     341 ns     2064228   55.9994MB/s
BM_string_memcmp_unaligned/30      405 ns     405 ns     1726750   70.5799MB/s
BM_string_memcmp_unaligned/42      405 ns     405 ns     1728170   98.8833MB/s
BM_string_memcmp_unaligned/55      563 ns     562 ns     1239350   93.2568MB/s
BM_string_memcmp_unaligned/60      539 ns     539 ns     1298194   106.109MB/s
BM_string_memcmp_unaligned/64     2378 ns    2378 ns      359461   25.6695MB/s
And for larger data sets the performance is still lower than when aligning src1:
Benchmark                              Time          CPU  Iterations
----------------------------------------------------------------------------------
BM_string_memcmp_unaligned/8        1288 ns      1288 ns      543230   5.9221MB/s
BM_string_memcmp_unaligned/64       2377 ns      2377 ns      359351   25.6742MB/s
BM_string_memcmp_unaligned/512      6444 ns      6444 ns      184103   75.7774MB/s
BM_string_memcmp_unaligned/1024     4869 ns      4868 ns      143785   200.599MB/s
BM_string_memcmp_unaligned/8k      33090 ns     33089 ns       21279   236.107MB/s
BM_string_memcmp_unaligned/16k     66748 ns     66738 ns       10436   234.123MB/s
BM_string_memcmp_unaligned/32k    131781 ns    131775 ns        5106   237.147MB/s
BM_string_memcmp_unaligned/64k    291907 ns    291860 ns        2334   214.143MB/s
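
For reference, the limit_wd / limit setup discussed above can be sketched in
C. This is only an illustration of the bookkeeping, not the assembly from the
patch: the function names memcmp_sketch and cmp_bytes are hypothetical, and
the word loads go through memcpy so the sketch stays valid for unaligned
pointers. The two commented lines correspond to the "lsr" and "and"
instructions quoted in the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte-by-byte compare of n bytes. */
static int cmp_bytes(const unsigned char *p1, const unsigned char *p2, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p1[i] != p2[i])
            return p1[i] < p2[i] ? -1 : 1;
    return 0;
}

/* Hypothetical sketch, not the glibc code: compare 8-byte words first,
   then finish the remaining (limit & 7) bytes one at a time. */
static int memcmp_sketch(const void *s1, const void *s2, size_t limit)
{
    const unsigned char *p1 = s1;
    const unsigned char *p2 = s2;

    size_t limit_wd = limit >> 3;   /* lsr limit_wd, limit, #3 */
    limit &= 7;                     /* and limit, limit, #7 */

    for (; limit_wd > 0; limit_wd--) {
        uint64_t w1, w2;
        memcpy(&w1, p1, sizeof w1); /* unaligned-safe 8-byte loads */
        memcpy(&w2, p2, sizeof w2);
        if (w1 != w2)
            return cmp_bytes(p1, p2, 8); /* locate the differing byte */
        p1 += 8;
        p2 += 8;
    }

    /* Tail: only the leftover bytes. Without the "and" above, this loop
       would run for the original limit and read past the buffers. */
    return cmp_bytes(p1, p2, limit);
}
```

With limit = 10, for example, the word loop runs once (limit_wd = 1) and the
tail compares the remaining 2 bytes, so exactly 10 bytes are examined.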