This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
On Fri, Feb 16, 2007 at 05:38:46PM -0600, Menezes, Evandro wrote:
> I implemented a new version of memcpy for x86-64 that provides an
> overall performance improvement over the current one on both AMD and
> Intel processors.
>
> It has several algorithms tuned for specific block size ranges,
> considering the sizes of the cache subsystems, for instance making use
> of repeated string instructions, software prefetching and streaming
> stores.
>
> As it uses several algorithms depending on the block size, the code is
> fairly long.  But given that ld.so doesn't really need as many
> algorithms, at build time a specialized version for ld.so has only a
> handful of worthy algorithms.
>
> In addition to the source-code patches, I also attached the resulting
> data obtained on a 2.4GHz Athlon64 with DDR2-800 RAM and on a 3GHz
> Core2 with DDR2-533.  The file memcpy-opteron-old.txt has the original
> output of string/test-memcpy on the Athlon64 system and the file
> memcpy-opteron-new.txt the output using the new routine.  The files
> memcpy-core2-old.txt and memcpy-core2-new.txt contain the same
> results, but on the Core2 system.

I see a few issues:

1) As the l1/l2 cache sizes and the prefetchw flag are only used in the
   libc.so version, there is no point in having those variables in
   _rtld_global (and why were they 8 bytes rather than 4 bytes, btw?).
   They can very well be hidden inside of libc.so and therefore
   accessed like:

	movl _x86_64_l1_cache_size_half(%rip), %r8d

   which is certainly faster than loading their address from the GOT
   and then using a second memory load to read the actual value.  The
   values can be initialized in a static routine with the constructor
   attribute (see the sketch below).

2) Even for Intel CPUs it is possible to determine the L1 data cache
   size, and glibc's sysconf (_SC_LEVEL1_DCACHE_SIZE) already knows how
   to do it (short example below).

3) The function didn't have cfi directives, even though it changes %rsp
   and saves/restores call-saved registers (cfi sketch below).

4) Various formatting issues (spaces instead of tabs etc.).

5) glibc i?86/x86_64 assembly style uses explicit instruction suffixes.

So, attached are the two patches combined with the above things
changed.

Initially I thought cacheinfo.c could just call
__sysconf (_SC_LEVEL1_DCACHE_SIZE) and __sysconf (_SC_LEVEL2_CACHE_SIZE);
unfortunately that doesn't work, because the test ld.so (the one used
to determine which objects need to be compiled from libc_pic.a as
rtld-*.os) then doesn't link - the real sysconf just drags too much of
libc_pic.a in with it.

Perhaps even better would be to unify the cacheinfo detection between
i386 and x86_64: basically have one common cacheinfo.h with most of the
routines, but using a cpuid inline routine, and then separate i386 and
x86_64 cacheinfo.c files including it, each defining its own version of
the cpuid inline (on x86_64 we don't need to dance around %ebx; both
flavors are sketched below).  The i386 cacheinfo.c would include
detection of whether the cpuid insn can be used at all, and the x86_64
cacheinfo.c would include these new _x86_64_* variables and the
constructor.

BTW, why do you use push/pop instead of just saving/restoring the
values through the red zone (sketched below)?  That would mean at least
simpler unwind info.

Also, for mempcpy, IMHO it is a bad idea to compute the result value
early.  I believe that in all code paths the right return value is
available in the %rdi register, so the pushq/popq %rax pair would be
unneeded for mempcpy, and instead before each rep; retq you'd add:

#if MEMPCPY_P
	movq	%rdi, %rax
#endif
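To make 1) concrete, this is roughly the shape I have in mind for the
x86_64 cacheinfo.c - an untested sketch, where init_cacheinfo, the
default values and the AMD-only cpuid leaves are purely illustrative
(the real code would share the Intel descriptor walking with sysconf):

#ifndef attribute_hidden
# define attribute_hidden __attribute__ ((visibility ("hidden")))
#endif

static inline void
cpuid (unsigned int leaf, unsigned int *eax, unsigned int *ebx,
       unsigned int *ecx, unsigned int *edx)
{
  /* On x86_64 %ebx needs no special treatment.  */
  __asm__ ("cpuid"
	   : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
	   : "0" (leaf));
}

/* attribute_hidden means a single rip-relative load,
   movl _x86_64_l1_cache_size_half(%rip), %r8d,
   reads the value - no GOT indirection.  And int (4 bytes) is all
   that movl needs.  */
int _x86_64_l1_cache_size_half attribute_hidden = 32 * 1024 / 2;
int _x86_64_l2_cache_size_half attribute_hidden = 512 * 1024 / 2;
int _x86_64_prefetchw attribute_hidden;

static void __attribute__ ((constructor))
init_cacheinfo (void)
{
  unsigned int eax, ebx, ecx, edx;

  /* A real version would first verify the maximum supported extended
     leaf with cpuid (0x80000000, ...) and only trust the AMD leaves
     on AMD CPUs.  */
  cpuid (0x80000005, &eax, &ebx, &ecx, &edx);
  _x86_64_l1_cache_size_half = ((ecx >> 24) * 1024) / 2;  /* L1D KB.  */

  cpuid (0x80000006, &eax, &ebx, &ecx, &edx);
  _x86_64_l2_cache_size_half = ((ecx >> 16) * 1024) / 2;  /* L2 KB.  */

  cpuid (0x80000001, &eax, &ebx, &ecx, &edx);
  _x86_64_prefetchw = (ecx >> 8) & 1;  /* 3DNow! prefetch bit.  */
}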
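For reference, the interface mentioned in 2) is simply (sizes are
returned in bytes):

#include <unistd.h>

long l1d = sysconf (_SC_LEVEL1_DCACHE_SIZE);
long l2  = sysconf (_SC_LEVEL2_CACHE_SIZE);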
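The cfi directives from 3) are just the usual annotations next to each
%rsp adjustment and register save/restore - ENTRY/END already supply
.cfi_startproc/.cfi_endproc, so only the body needs markers.  A sketch
(glibc's sysdep.h also has cfi_* wrapper macros for these):

	pushq	%r12
	.cfi_adjust_cfa_offset 8
	.cfi_rel_offset %r12, 0
	...
	popq	%r12
	.cfi_adjust_cfa_offset -8
	.cfi_restore %r12
	rep
	retq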
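And the i386 flavor of the cpuid inline for the unified cacheinfo.h -
under PIC, %ebx is the GOT pointer, so it has to be swapped out around
the insn (the usual xchg trick; again just a sketch):

static inline void
cpuid (unsigned int leaf, unsigned int *eax, unsigned int *ebx,
       unsigned int *ecx, unsigned int *edx)
{
  /* Preserve %ebx (the PIC register) around cpuid instead of letting
     the compiler allocate it as an output.  */
  __asm__ ("xchgl %%ebx, %1\n\t"
	   "cpuid\n\t"
	   "xchgl %%ebx, %1"
	   : "=a" (*eax), "=&r" (*ebx), "=c" (*ecx), "=d" (*edx)
	   : "0" (leaf));
}

On top of that, the i386 cacheinfo.c would first check the EFLAGS ID
bit (bit 21) to see whether the cpuid insn exists at all.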
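As for the red zone: memcpy is a leaf function, so the 128 bytes below
%rsp are guaranteed untouched by the x86_64 ABI and the saves can be
plain moves.  %rsp never changes, hence no .cfi_adjust_cfa_offset pairs
at all:

	movq	%r12, -8(%rsp)
	movq	%r13, -16(%rsp)
	...
	movq	-16(%rsp), %r13
	movq	-8(%rsp), %r12
	rep
	retq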
Looking at the test-memcpy numbers (which I admit is certainly not a
good benchmark), I don't see a very visible win on a quadcore Core2
though:

$ ~/timing elf/ld.so --library-path vanilla/ string/test-memcpy --direct > /dev/null
Strip out best and worst realtime result
minimum: 0.858424000 sec real / 0.000017988 sec CPU
maximum: 0.885605000 sec real / 0.000041098 sec CPU
average: 0.862428714 sec real / 0.000019401 sec CPU
stdev  : 0.002703822 sec real / 0.000001518 sec CPU
$ ~/timing elf/ld.so --library-path . string/test-memcpy --direct > /dev/null
Strip out best and worst realtime result
minimum: 0.857600000 sec real / 0.000017456 sec CPU
maximum: 1.162678000 sec real / 0.000036033 sec CPU
average: 0.859858000 sec real / 0.000019178 sec CPU
stdev  : 0.001414669 sec real / 0.000001500 sec CPU
$ ~/timing elf/ld.so --library-path vanilla/ string/test-memcpy --direct > /dev/null
Strip out best and worst realtime result
minimum: 0.858311000 sec real / 0.000017796 sec CPU
maximum: 0.905352000 sec real / 0.000038400 sec CPU
average: 0.861902142 sec real / 0.000019158 sec CPU
stdev  : 0.002512279 sec real / 0.000000852 sec CPU
$ ~/timing elf/ld.so --library-path . string/test-memcpy --direct > /dev/null
Strip out best and worst realtime result
minimum: 0.857419000 sec real / 0.000018074 sec CPU
maximum: 0.870351000 sec real / 0.000032102 sec CPU
average: 0.861215571 sec real / 0.000019397 sec CPU
stdev  : 0.002651920 sec real / 0.000001001 sec CPU
$ ~/timing elf/ld.so --library-path vanilla/ string/test-memcpy --direct > /dev/null
Strip out best and worst realtime result
minimum: 0.858271000 sec real / 0.000017894 sec CPU
maximum: 0.866028000 sec real / 0.000038928 sec CPU
average: 0.862063750 sec real / 0.000019215 sec CPU
stdev  : 0.002647184 sec real / 0.000000988 sec CPU
$ ~/timing elf/ld.so --library-path . string/test-memcpy --direct > /dev/null
Strip out best and worst realtime result
minimum: 0.857654000 sec real / 0.000018043 sec CPU
maximum: 1.393263000 sec real / 0.000036258 sec CPU
average: 0.860786428 sec real / 0.000019350 sec CPU
stdev  : 0.002447096 sec real / 0.000000892 sec CPU

(vanilla/ is the unpatched libc, . the patched one.)

I will certainly retry tonight on an Athlon64 X2 when I get physically
to it.  In any case, e.g. SPEC numbers would be interesting too.

	Jakub
Attachment:
P
Description: Text document
Attachment:
test-memcpy.vanilla
Description: Text document
Attachment:
test-memcpy.patched
Description: Text document