This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Hi H.J, Ulrich,

>> I also plan to post results on the AMD Barcelona processor soon.
>> I plan to fix the issues pointed out by Ulrich in AMD's previous
>> submission and add an AMD path that addresses the performance issues
>> noted above.

I have tested the memset posted by H.J on AMD's Barcelona processor. I have also cleaned up the memset tuned for AMD Barcelona as per the review done by Ulrich, as submitted at http://sources.redhat.com/ml/libc-alpha/2007-08/msg00054.html (attached patch 001-memset-amd.diff), and bootstrapped it on AMD64.

I compared the performance of H.J's memset, AMD's memset and the original memset currently in glibc on Barcelona. The comparative performance data is attached in memset_perf_data_comp.txt.

In order to come up with a blended memset for x86-64, it would be useful to discuss the performance on AMD and Intel hardware and agree on the design decisions of the common code path.

H.J's memset uses an integer jmp table for blocks of 2 bytes to 144 bytes. This is a common blended code path for all x86-64 processors. Looking at the performance in this range, here are some of the key observations from memset_perf_data_comp.txt:

- At 1 byte, AMD's memset is ~23% slower than H.J's or the original.
- Between 2B and 43B, H.J's memset and AMD's memset are at par.
- Between 64B and 128B, H.J's memset is 8% to 21% slower at most blocks.

For a block larger than 144 bytes, H.J's memset aligns the block to 16 bytes and handles the prologue with another integer jmp table. The prologue is a common blended code path for all x86-64 processors. After being aligned, blocks larger than 144 bytes can follow an SSE code path or an integer code path, based on what the sysconfig indicates for a given x86-64 processor. Currently any AMD processor follows the integer code path, and that is the AMD recommended path for memset. Any block larger than 144 bytes will also reuse the 2 byte to 144 byte jmp table for the epilogue, irrespective of the x86-64 processor.
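To make the structure under discussion concrete, here is a rough C sketch of the dispatch described above. It is only an illustration, not H.J's code: the real implementation is assembly, and every name here (sketch_memset, set_small, set_bulk_aligned) is hypothetical. Only the 144-byte threshold and the 16-byte alignment come from the description. The sketch also folds the 0-byte and 1-byte cases into the small path, and it assumes the bulk loop leaves a tail of under 16 bytes for the epilogue, which the description does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names; only the two constants come from the description. */
enum { SMALL_MAX = 144, ALIGNMENT = 16 };

/* Stand-in for the integer jump table that handles small blocks.
   The real code has one table entry per size; a byte loop is used here. */
static void set_small(unsigned char *p, int c, size_t n)
{
    while (n--)
        *p++ = (unsigned char)c;
}

/* Stand-in for the processor-specific bulk path (SSE or integer stores).
   It expects an aligned start and a length that is a multiple of 16. */
static void set_bulk_aligned(unsigned char *p, int c, size_t n)
{
    assert((uintptr_t)p % ALIGNMENT == 0);
    assert(n % ALIGNMENT == 0);
    memset(p, c, n);
}

void *sketch_memset(void *dst, int c, size_t n)
{
    unsigned char *p = dst;

    /* Common path: small blocks go straight to the jump table. */
    if (n <= SMALL_MAX) {
        set_small(p, c, n);
        return dst;
    }

    /* Prologue (common path): 0..15 bytes to reach a 16-byte boundary. */
    size_t head = (ALIGNMENT - (uintptr_t)p % ALIGNMENT) % ALIGNMENT;
    set_small(p, c, head);
    p += head;
    n -= head;

    /* Bulk (processor-specific path): whole 16-byte chunks. */
    size_t bulk = n - n % ALIGNMENT;
    set_bulk_aligned(p, c, bulk);

    /* Epilogue (common path): the tail reuses the small-block code. */
    set_small(p + bulk, c, n - bulk);
    return dst;
}
```

Written this way, the only part that differs between processors is set_bulk_aligned; the small path, prologue and epilogue are the pieces that would have to perform well on both AMD and Intel hardware.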
So the alignment, prologue and epilogue code are common blended code paths.

On the other hand, AMD's memset aligns the block only if it is at least 512 bytes, and aligns it to 8 bytes. For blocks larger than 144 bytes, AMD plans to do some analysis to understand whether the early 16 byte alignment and/or the prologue and/or epilogue code are contributing to any slowdown in those blocks, or whether AMD's memset needs to be improved.

H.J: Can you clarify how the 144 byte boundary was chosen to end the integer jmp table and align blocks?

For the non-SSE2 code path beyond 144 bytes, we would like to integrate the code used in AMD's memset (including any improvements we make) that gives us a measurable speedup on Barcelona. For example, the use of rep stos between 8KB and 48KB. Another improvement for us is at blocks larger than the largest cache size (L2, or L3 if available) when NOT_IN_GLIBC is defined, or half the largest cache size when NOT_IN_GLIBC is not defined. In this range, the sub block that is smaller than the full or half cache size is set with rep stos and the remaining sub block is set with movnti.

I would appreciate any feedback from the list.

Thanks,
Harsha
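For readers following the size ranges in the message above, the AMD-side choices can be summarised as a small planning function in C. This is a sketch under stated assumptions, not the code in 001-memset-amd.diff: all names are hypothetical, the 8KB and 48KB bounds are treated as inclusive, and the rep stos portion is assumed to be exactly the threshold size with movnti covering the rest. Sizes that the message does not describe are reported as OTHER_PATH rather than guessed.

```c
#include <stddef.h>

/* Hypothetical labels for the store strategies named in the message. */
enum strategy { OTHER_PATH, REP_STOS, REP_STOS_THEN_MOVNTI };

struct plan {
    enum strategy how;
    size_t rep_stos_bytes;  /* bytes set with rep stos */
    size_t movnti_bytes;    /* bytes set with non-temporal movnti stores */
};

/* Full largest-cache size when NOT_IN_GLIBC is defined, half otherwise.
   The flag is passed as an argument here purely for illustration. */
static size_t movnti_threshold(size_t largest_cache_bytes, int not_in_glibc)
{
    return not_in_glibc ? largest_cache_bytes : largest_cache_bytes / 2;
}

static struct plan plan_memset(size_t n, size_t largest_cache_bytes,
                               int not_in_glibc)
{
    struct plan p = { OTHER_PATH, 0, 0 };
    size_t limit = movnti_threshold(largest_cache_bytes, not_in_glibc);

    if (n > limit) {
        /* Assumption: a threshold-sized sub block uses rep stos and
           everything beyond it bypasses the cache with movnti. */
        p.how = REP_STOS_THEN_MOVNTI;
        p.rep_stos_bytes = limit;
        p.movnti_bytes = n - limit;
    } else if (n >= 8 * 1024 && n <= 48 * 1024) {
        /* Assumption: both bounds of the 8KB..48KB range are inclusive. */
        p.how = REP_STOS;
        p.rep_stos_bytes = n;
    }
    return p;
}
```

For instance, with a 512KB largest cache and NOT_IN_GLIBC not defined, the threshold is 256KB, so a 1MB block would be planned as 256KB of rep stos followed by 768KB of movnti.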
Attachment:
001-memset-amd.diff
Description: 001-memset-amd.diff
Attachment:
memset_perf_data_comp.txt
Description: memset_perf_data_comp.txt