This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

RE: New x86-64 memcpy


Jakub, 

> 1) as the l1/l2 cache sizes and prefetchw flag are only used 
> in libc.so
> version, there is no point to have those vars (why were they 
> 8 byte rather
> than 4 byte btw?) in _rtld_global, they can very well be hidden inside
> of libc.so, therefore they can be accessed like:
> movl _x86_64_l1_cache_size_half(%rip), %r8d
> which is certainly faster than loading its address from GOT and then
> using second memory load read the actual value.  The values can be
> initialized in a static routine with constructor attribute.

That sounds reasonable.  However, is having the constructor attribute soon enough?

They had 8 bytes each in order to allow direct comparisons with the count in a register without having to load the value.  Even if in memcpy they can be used as 4-byte variables, I have other routines that would benefit from them being 8 bytes long.

> 2) even for Intel CPUs it is possible to determine L1 data cache size
> and glibc's sysconf (_SC_LEVEL1_DCACHE_SIZE) already knows 
> how to do it

I'll take a look at leveraging it.

> 3) the function didn't have cfi directives, eventhough it changes %rsp
> and saves/restores call saved registers

I guess that using the red zone is better.  As the routine has several exit points to improve performance, after each one new CFI directives would have to be added, which complicates maintaining the code.

> 4) various formatting issues (spaces instead of tabs etc.)

I thought that it had tabs only...  I'll make sure it does.

> 5) glibc i?86/x86_64 assembly style uses explicit instruction suffixes

OK

> So, attached are the two patches combined with the above 
> things changed.
> Initially I thought cacheinfo.c could just call
> __sysconf (_SC_LEVEL1_DCACHE_SIZE) and __sysconf 
> (_SC_LEVEL2_CACHE_SIZE),
> unfortunately that doesn't work because the test ld.so (the 
> one to determine
> which objects are needed to compile from libc_pic.a as rtld-*.os)
> doesn't link then - the real sysconf just brings with it too much
> from libc_pic.a.  Perhaps even better would be to unify the cacheinfo
> detection between i386 and x86_64 (basically have one common
> cacheinfo.h with most of the routines, but using cpuid inline routine)
> and then separate i386 and x86_64 cacheinfo.c including it 
> and defining
> its own version of cpuid inline (on x86_64 we don't need to 
> dance around
> %ebx), i386 cacheinfo.c would include detection whether cpuid insn
> can be used at all and x86_64 cacheinfo.c would include these new
> _x86_64_* variables and constructor.

I'll investigate this.
 
> BTW, why do you use push/pop instead of just saving/restoring 
> the values
> from red zone?  That would mean at least simpler unwind info.

I agree.

> Also, for mempcpy, IMHO it is a bad idea to compute result 
> value early,
> I believe in all code paths the right return value is 
> available in %rdi
> register, so the pushq/popq %rax would be unneeded for 
> mempcpy and instead
> before each rep; retq you'd add #if MEMPCPY_P movq %rdi, %rax #endif.

I'll double-check that RDI has the expected value always.  Otherwise, I'll just use an entry in the red zone.

> Looking at test-memcpy numbers (which I admit is certainly not a good
> benchmark), I don't see very visible win on quadcore Core2 though:

Well, timing the test is not a good way to check it out, as it probably spends a lot of time in printf.  See the attached chart showing the relative performance between your vanilla and patched results.
 
> I will certainly retry tonight on Athlon64 X2 when I get
> physically to it.  In any case e.g. SPEC numbers would
> be interesting too.

SPEC gets more interesting with new versions memcpy and memset.  memcpy alone doesn't budge it much.

Thanks,

-- 
_______________________________________________________
Evandro Menezes               AMD            Austin, TX

Attachment: memcpy-core2-vanilla-memcpy-core2-patched-ratio.png
Description: memcpy-core2-vanilla-memcpy-core2-patched-ratio.png


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]