This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] Rename __memcmp_sse4_2 to __memcmp_sse4_1.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>
- Cc: "H.J. Lu" <hjl dot tools at gmail dot com>, Matt Turner <mattst88 at gmail dot com>, Andreas Jaeger <aj at suse dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Fri, 12 Jul 2013 19:05:16 +0200
- Subject: Re: [PATCH] Rename __memcmp_sse4_2 to __memcmp_sse4_1.
- References: <51DCE51F dot 7000001 at suse dot com> <CAMe9rOqb3_DnhSh0jPh9=suJo5c+WjegxfDh1+1go6pY+7+PLA at mail dot gmail dot com> <CAEdQ38Go4UY=k==nYT_6S86-tsOoxOO=Wn=8_pNk+LkkxSxU_Q at mail dot gmail dot com> <CAMe9rOpgaNgGSdoM5rXdhLT-TqVEJjGMyHgKRP=t+2LrSTpFAA at mail dot gmail dot com> <CAEdQ38FBeyuJpQ1eSHnM5w=8MHD3cfFjgWekkXnRFHO+Aathnw at mail dot gmail dot com> <CAMe9rOompuMMzQm+RX=ejoPMX0uWmXarvSZa_fp-Fi1p_-8o1Q at mail dot gmail dot com> <CAHjhQ91+RSKU=1F4vQ1XrJ=1j1wAv6HuQJh_s9BzcBOOTP8BDg at mail dot gmail dot com> <20130712030150 dot GA7461 at domone dot PAOCY> <CAHjhQ92CdsOemOAj+k_8gwxmJH5dsmdyNdDepWufrff4AuW1UQ at mail dot gmail dot com> <20130712162050 dot GA12414 at domone dot PAOCY>
On Fri, Jul 12, 2013 at 06:20:50PM +0200, Ondřej Bílka wrote:
> On Fri, Jul 12, 2013 at 10:12:34AM +0400, Liubov Dmitrieva wrote:
> > Do you mean AMD? For Intel there is no machine without SSE4_1 where
> > the sse2 unaligned version is faster than ssse3.
> >
> Good to know.
>
> I looked at the sources and found that memcmp is horribly misoptimized, as usual.
>
> As the difference is found within the first 16 bytes in 70% of cases
> and within the first 64 in 99%, the loop case is cold.
>
> This is not much of a problem when n > 48, since the initial unaligned
> comparison handles differences in the first 16 bytes effectively.
>
> However, otherwise there are lots of jumps to pick a code path based on
> size, which is inefficient.
>
> The code also answered what I thought was a roadblock, and why I did not
> try to optimize memcmp earlier: n is authoritative, so we may segfault
> when there is unallocated memory after the first difference within the
> range specified by n (see the sketch after the quote).
>
> I will prepare a patch with a faster memcmp.
>
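To illustrate the authoritative-n point above, a minimal sketch (my
example, not from the sources): the caller must guarantee that all n
bytes of both buffers are readable, so an implementation may load past
the first difference without checking.

#include <string.h>

int
example (void)
{
  /* The arrays already differ at byte 0, but both are fully
     readable, so memcmp may load all 16 bytes of each at once.
     Passing n = 16 with fewer readable bytes would be the
     caller's bug, not memcmp's.  */
  char a[16] = "XAAAAAAAAAAAAAA";
  char b[16] = "YAAAAAAAAAAAAAA";
  return memcmp (a, b, 16);
}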
For the first 16 bytes the best I can come up with is the following:
#include <stdint.h>
#include <emmintrin.h>

typedef __m128i tp_vector;
#define LOADU(x) _mm_loadu_si128 ((const tp_vector *) (x))
#define LT _mm_cmplt_epi8  /* note: signed byte compare */
#define get_mask(x) ((uint64_t) _mm_movemask_epi8 (x))
/* All bits up to and including the lowest set bit of x.  */
#define first_bit(x) ((x) ^ ((x) - 1))

tp_vector va = LOADU (a);
tp_vector vb = LOADU (b);
uint64_t ltm = get_mask (LT (va, vb));
uint64_t gtm = get_mask (LT (vb, va));
/* Bit 16 is a sentinel so first_bit is defined even for an empty mask.  */
uint64_t lt = first_bit (ltm | (1 << 16));
uint64_t gt = first_bit (gtm | (1 << 16));
if (ltm | gtm)
  return lt - gt; // maybe swapped.
It finds the first byte that is smaller and the first byte that is
bigger. From each mask it then builds a bit pattern, and the difference
of the two patterns comes out positive or negative depending on which
of those bytes comes first.
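A worked example (made-up values, not from the patch): suppose a and b
first differ at byte 4, with a[4] < b[4].  Then:

lt_mask = 0x00010                              /* bit 4: a[4] < b[4] */
gt_mask = 0x00000                              /* no byte has a[i] > b[i] */
lt = first_bit (0x00010 | 0x10000) = 0x0001f   /* bits 0..4 set */
gt = first_bit (0x00000 | 0x10000) = 0x1ffff   /* bits 0..16 set */
lt - gt = 0x0001f - 0x1ffff < 0                /* memcmp result: a < b */

If a[4] > b[4] instead, the roles swap and the difference is positive.
Compiled, the sequence looks like this (comments mine):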
movdqu  (%rsi), %xmm0        # vb = 16 bytes of b
movdqu  (%rdi), %xmm1        # va = 16 bytes of a
movdqa  %xmm0, %xmm2
pcmpgtb %xmm1, %xmm2         # xmm2 = bytewise b > a
pcmpgtb %xmm0, %xmm1         # xmm1 = bytewise a > b
pmovmskb %xmm2, %edx         # edx = lt mask
pmovmskb %xmm1, %eax         # eax = gt mask
movl    %eax, %ecx
orl     %edx, %ecx           # any differing byte?
je      .L3                  # none: continue past the first 16 bytes
orl     $65536, %edx         # add sentinel bit 16 to lt mask
movl    %eax, %ecx
leal    -1(%rdx), %eax
orl     $65536, %ecx         # add sentinel bit 16 to gt mask
xorl    %edx, %eax           # eax = first_bit (lt)
leal    -1(%rcx), %edx
xorl    %ecx, %edx           # edx = first_bit (gt)
subl    %edx, %eax           # return first_bit (lt) - first_bit (gt)
ret
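For completeness, a self-contained version of the same idea that
compiles on its own (my wrapper and naming, not the actual patch).
One caveat: pcmpgtb/_mm_cmplt_epi8 compare signed bytes while memcmp
compares unsigned ones, so a real implementation would have to
compensate, e.g. by flipping the sign bit of every byte first.

#include <stdint.h>
#include <emmintrin.h>

/* Compare the first 16 bytes of a and b, memcmp-style; both
   buffers must have at least 16 readable bytes.  */
static int
memcmp16_sketch (const void *a, const void *b)
{
  __m128i va = _mm_loadu_si128 ((const __m128i *) a);
  __m128i vb = _mm_loadu_si128 ((const __m128i *) b);
  uint32_t ltm = (uint32_t) _mm_movemask_epi8 (_mm_cmplt_epi8 (va, vb));
  uint32_t gtm = (uint32_t) _mm_movemask_epi8 (_mm_cmplt_epi8 (vb, va));
  if (ltm | gtm)
    {
      /* Sentinel bit 16 keeps the lowest-bit trick well defined
         even when one of the masks is empty.  */
      uint32_t lt = (ltm | (1u << 16)) ^ ((ltm | (1u << 16)) - 1);
      uint32_t gt = (gtm | (1u << 16)) ^ ((gtm | (1u << 16)) - 1);
      return (int) lt - (int) gt;
    }
  return 0; /* first 16 bytes are equal */
}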
Comments?