Re: [PATCH] correction for PPC __compare_and_swap


> From: dmehaffy@us.ibm.com
> cc: brianmc1@us.ibm.com
> Date: Wed, 9 May 2001 16:53:50 -0500
> 
> While this patch technically does not fix an error in PPC
> __compare_and_swap, it does fix a performance problem with lock/unlock in
> pthreads.  This was uncovered while we were fixing bugs in the kernel and
> libraries with Java running on a 4-way SMP Power 3 system.  The change,
> in combination with other work, did make Java function correctly, so the
> original failure may have been a timing issue, but it is easy to argue
> that the change should be accepted on its performance merits alone.

Hmmm.  OK, so it's not a correctness change, it's a performance
change.  I can go for that.

> It is well known that syncs should be avoided whenever possible on PPC, as
> these result in broadcasts to all processors to make sure memory is
> coherent everywhere.  This is very expensive.  The proposed change gets
> rid of two syncs as written.  The isync is local to the processor, causes
> no bus traffic between processors, and is all that is necessary in this
> case.
> 
> Kaz was on the right track in suggesting that the compare and swap should
> not use sync or isync at all but be just an atomic primitive; the calling
> routines should be responsible for doing the sync or isync before or
> after, as appropriate.

Yes, I agree with that too.
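
To make that split concrete, here's a sketch of what the separation
might look like: a bare CAS with no barriers at all, plus thin
acquire/release wrappers.  The names (atomic_cas, cas_acquire,
cas_release) are mine, purely for illustration, and this assumes the
usual GCC inline-asm conventions rather than anything in glibc:

/* Bare compare-and-swap: no ordering guarantees of its own.  */
static inline int
atomic_cas (long int *p, long int oldval, long int newval)
{
  int ret;
  __asm__ __volatile__ (
	"0:	lwarx	%0,0,%1\n"	/* load-reserve *p */
	"	xor.	%0,%3,%0\n"	/* compare with oldval */
	"	bne	1f\n"		/* mismatch: fail */
	"	stwcx.	%2,0,%1\n"	/* try to store newval */
	"	bne-	0b\n"		/* lost reservation: retry */
	"1:"
	: "=&r" (ret)
	: "r" (p), "r" (newval), "r" (oldval)
	: "cr0", "memory");
  return ret == 0;
}

/* Acquire: callers taking a lock add a local isync afterwards.  */
static inline int
cas_acquire (long int *p, long int oldval, long int newval)
{
  int ret = atomic_cas (p, oldval, newval);
  __asm__ __volatile__ ("isync" : : : "memory");
  return ret;
}

/* Release: callers dropping a lock add a sync beforehand.  */
static inline int
cas_release (long int *p, long int oldval, long int newval)
{
  __asm__ __volatile__ ("sync" : : : "memory");
  return atomic_cas (p, oldval, newval);
}

That way each caller pays only for the ordering it actually needs.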

>  Because the routine is used for both lock and unlock, it is
> actually better to write them as separate routines, inlining the compare
> and swap; this also gets rid of a lwarx/stwcx. pair, which is itself
> somewhat expensive.  The optimum way to write the routines is as follows
> (in pseudo code):
> 
> unlock:  sync
>          stw 0, lock_location
>          blr

Yes.

> In the unlock case the sync is all that is necessary to make all changes
> protected by the lock globally visible.  Note that no lwarx or stwcx. is
> needed.
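
Rendered as GCC inline asm, that unlock path might look like this
sketch (spin_unlock is an illustrative name, and I'm assuming the lock
word is simply 0 when free):

static inline void
spin_unlock (volatile long int *lock)
{
  /* One sync makes every store from the critical section globally
     visible before the lock is seen as free.  */
  __asm__ __volatile__ ("sync" : : : "memory");
  *lock = 0;	/* Plain store; no lwarx/stwcx. needed to release.  */
}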
> 
> lock:
> 1:   lwarx   r5, lock_location
>      cmpwi   r5, 0
>      bne     2f
>      stwcx.  1, lock_location
>      bne     1b
>      isync
>      blr
> 2:   need to indicate the lock is already locked (you could spin here
>      if you want, or put the caller on a sleep queue)
>      blr
> 
> In the lock case no sync is required at the beginning, because all changes
> made by the previous holder of the lock are already visible at this point.
> The isync is necessary to make sure the processor discards any speculative
> loads done before the lock was actually acquired.

Now I see.  The isync is being used as a cheaper, purely local sync.
That makes sense.
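
For concreteness, the lock side might come out like the sketch below
(spin_trylock is an illustrative name; I'm assuming the lock word holds
0 when free and 1 when held):

static inline int
spin_trylock (volatile long int *lock)
{
  long int old;
  __asm__ __volatile__ (
	"0:	lwarx	%0,0,%1\n"	/* load-reserve the lock word */
	"	cmpwi	%0,0\n"		/* already held? */
	"	bne	1f\n"		/* yes: fail, skip the isync */
	"	stwcx.	%2,0,%1\n"	/* try to store 1 */
	"	bne-	0b\n"		/* reservation lost: retry */
	"	isync\n"		/* local barrier: drop speculative loads */
	"1:"
	: "=&r" (old)
	: "r" (lock), "r" (1L)
	: "cr0", "memory");
  return old == 0;	/* nonzero means we now hold the lock */
}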

> Only the processor that is acquiring the lock needs to discard its
> speculative loads, because it is the one taking the lock; the other
> processors don't care, so there is no need for a sync, which would
> broadcast to all processors and make them invalidate their data.
> 
> This is the implementation we have been using in AIX since the first
> introduction of PowerPC SMPs.  It has been found to give the shortest
> path length and cycle count for implementing locks.  In this form there
> is only one sync and one isync per lock/unlock pair, compared to the 4
> syncs (which are very heavyweight) in the current code.  It also
> eliminates a lwarx/stwcx. pair, which causes bus transactions that are
> expensive because of the reservation protocol.

I can't work out how to get the library to stop using compare_and_swap
to release locks...  Any ideas?  I don't want to touch the
machine-independent part if I can help it.

> I would be happy to supply a formal patch that implements locks this way
> for PPC (though it is very simple to implement from the pseudo code).
> Note also that this is exactly how the sample lock code in the PowerPC
> Architecture book (page 254) and in PowerPC Microprocessor Family: The
> Programming Environments (page E4) suggests locks be implemented for
> optimal performance.

I'd be very happy if people could test this patch on an SMP system and
see how it does.

-- 
- Geoffrey Keating <geoffk@geoffk.org>

===File ~/patches/cygnus/glibc-linuxthreads-lesssync.patch===
2001-05-09  Geoff Keating  <geoffk@redhat.com>

	* sysdeps/powerpc/pt-machine.h 
	(HAS_COMPARE_AND_SWAP_WITH_RELEASE_SEMANTICS): Define.
	(__compare_and_swap): Remove memory barriers.
	(__compare_and_swap_with_release_semantics): New function.

Index: linuxthreads/sysdeps/powerpc/pt-machine.h
===================================================================
RCS file: /cvs/glibc/libc/linuxthreads/sysdeps/powerpc/pt-machine.h,v
retrieving revision 1.7
diff -p -u -u -p -r1.7 pt-machine.h
--- pt-machine.h	2000/07/22 02:24:23	1.7
+++ pt-machine.h	2001/05/10 00:06:21
@@ -1,6 +1,6 @@
 /* Machine-dependent pthreads configuration and inline functions.
    powerpc version.
-   Copyright (C) 1996, 1997, 1998, 2000 Free Software Foundation, Inc.
+   Copyright (C) 1996, 1997, 1998, 2000, 2001 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -38,20 +38,14 @@ register char * stack_pointer __asm__ ("
 /* Compare-and-swap for semaphores. */
 /* note that test-and-set(x) is the same as !compare-and-swap(x, 0, 1) */
 
-#define HAS_COMPARE_AND_SWAP
+#define HAS_COMPARE_AND_SWAP_WITH_RELEASE_SEMANTICS
 #define IMPLEMENT_TAS_WITH_CAS
 
-#if BROKEN_PPC_ASM_CR0
-static
-#else
-PT_EI
-#endif
-int
+PT_EI int
 __compare_and_swap (long int *p, long int oldval, long int newval)
 {
   int ret;
 
-  MEMORY_BARRIER ();
   __asm__ __volatile__ (
 	   "0:    lwarx %0,0,%1 ;"
 	   "      xor. %0,%3,%0;"
@@ -62,6 +56,26 @@ __compare_and_swap (long int *p, long in
 	: "=&r"(ret)
 	: "r"(p), "r"(newval), "r"(oldval)
 	: "cr0", "memory");
+  __asm__ __volatile__ ("isync" : : : "memory");
+  return ret == 0;
+}
+
+PT_EI int
+__compare_and_swap_with_release_semantics (long int *p, 
+					   long int oldval, long int newval)
+{
+  int ret;
+
   MEMORY_BARRIER ();
+  __asm__ __volatile__ (
+	   "0:    lwarx %0,0,%1 ;"
+	   "      xor. %0,%3,%0;"
+	   "      bne 1f;"
+	   "      stwcx. %2,0,%1;"
+	   "      bne- 0b;"
+	   "1:    "
+	: "=&r"(ret)
+	: "r"(p), "r"(newval), "r"(oldval)
+	: "cr0", "memory");
   return ret == 0;
 }
============================================================
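
As for the open question above (how to get the library to stop using a
fully synchronized compare_and_swap when releasing locks), one
possibility would be a dispatch in the machine-independent code along
these lines; the macro name is hypothetical, not something glibc has
adopted:

/* Hypothetical machine-independent fallback: use the release-semantics
   CAS where the port provides one, otherwise the plain CAS.  */
#ifdef HAS_COMPARE_AND_SWAP_WITH_RELEASE_SEMANTICS
# define compare_and_swap_release(p, o, n) \
    __compare_and_swap_with_release_semantics ((p), (o), (n))
#else
# define compare_and_swap_release(p, o, n) \
    __compare_and_swap ((p), (o), (n))
#endif

Unlock paths would then call compare_and_swap_release and pay for a
single sync on PPC, while other ports would see no change.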

