This is the mail archive of the gsl-discuss@sourceware.org mailing list for the GSL project.



Re: [row/column majority] gslclapack


On Thu, 30 Mar 2006, Z F wrote:

> The difference is how a[i,j] are interpreted:
>
>   a[i,j] => a[i*Nj + j]
>   a[i,j] => a[j*Ni + i]
>
> One is called row major, the other column major (sorry, I do not know
> which is which), but nothing changes as long as the access is done via
> this indexing.
>
> Can someone tell me why one is faster than the other?  Or more
> convenient?

Sorry, I was out of town for a few days.  In case nobody else answered:
the short answer is "to make sure that they match in your algorithms".
Neither ordering is faster in itself; what matters is that your loops
(and any library routines you call) traverse the data in the order it is
actually laid out in memory.
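
Concretely, a minimal sketch (the macro names ROWMAJ/COLMAJ are mine,
purely illustrative, not anything from GSL) -- the two conventions are
just two formulas for flattening the same pair of indices onto the same
buffer:

  /* same buffer, two flattening conventions */
  #define ROWMAJ(a, i, j, Nj) ((a)[(size_t)(i)*(Nj) + (j)])  /* C-style */
  #define COLMAJ(a, i, j, Ni) ((a)[(size_t)(j)*(Ni) + (i)])  /* Fortran-style */

The speed difference only appears when a loop's innermost index is (or
is not) the one that moves through memory with stride 1.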

In C they are referenced a[i][j]...[m][n], usually after being defined
and allocated via something like this (I'm not checking my code
carefully here, so there may be an error or two):

  int i, j;               /* imax, jmax assumed already set */
  double **a;

  a = (double **) malloc(imax*sizeof(double *));
  for (i = 0; i < imax; i++) {
    a[i] = (double *) malloc(jmax*sizeof(double));
    for (j = 0; j < jmax; j++) a[i][j] = i*jmax + j;
  }

To free a it is NOT ENOUGH to just free a -- one has to go through and
free all the row blocks whose addresses are stored in a[i], THEN free a.
Otherwise you leak memory.  You iterate the process to do ***a or ****a
(three or four dimensional double precision tensors).  You could also
define or use a complex or quaternionic or whatever struct and malloc
the highest index times sizeof(the struct) -- the code is quite flexible
wrt data typing and objects.
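
For example, a sketch of the matching cleanup for the allocation loop
above:

  /* free each row first, then the block of row pointers itself */
  for (i = 0; i < imax; i++) free(a[i]);
  free(a);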

This is common enough, but not necessarily very good code.  Two things
one can do to make it "better" code are a) add or subtract an offset so
that the indices don't have to start at 0 and run to imax-1 (Fortran
indices by default run from 1 to imax); and b) allocate the entire
matrix as a single block of memory.  In the code above, the rows are NOT
guaranteed to be contiguous -- each malloc is free to put its block
wherever it likes, there is typically bookkeeping and alignment padding
between blocks, and any other allocation that happens in the middle of
the loop can introduce an arbitrarily large gap in what you "think" is a
single contiguous block of memory.  Needless to say, this will make all
sorts of matrix code (and library calls that expect contiguous storage)
unhappy every time the addressing invisibly crosses one of those
boundaries.  Some allocators MIGHT happen to hand you adjacent blocks,
but I wouldn't count on it.

The offset part is trivial.  The contiguous block part is managed by
introducing one more pointer:

  int i, j;               /* imax, jmax assumed already set */
  double **a;
  double *adata;

  adata = (double *) malloc(imax*jmax*sizeof(double));
  /*
   * adata is now a CONTIGUOUS block of memory long enough to hold the
   * entire matrix.  We now have to pack the ADDRESSES of the appropriate
   * offsets into **a.
   */
  a = (double **) malloc(imax*sizeof(double *));
  for (i = 0; i < imax; i++) {
    a[i] = &adata[jmax*i];
    for (j = 0; j < jmax; j++) a[i][j] = i*jmax + j;
  }

Where I'm being sloppy about casts (the argument of malloc should be a
size_t, for example, IIRC).  For a 2x2 matrix this should result in the
following (simulated memory layout as address:(type) or
address:contents, with hexadecimal addresses, 8-byte doubles and 4-byte
pointers):

adata = [0000:(double)][0008:(double)][0010:(double)][0018:(double)]
a     = [0020:(double *)][0024:(double *)]

Then

a     = [0020:0000][0024:0010]

(a[0] and a[1] now contain the addresses 0000 and 0010 inside adata)

and finally

adata = [0000:(double)0][0008:(double)1][0010:(double)2][0018:(double)3]

where (note well) we ADDRESSED the vector via a[i][j]!  This saves one
from messing with offset arithmetic.  Note again that to free a without
leaks you have to both free a and free adata, although you can do lots
of other neato stuff with this idea -- maintain different "windows" into
adata, or free a WITHOUT freeing adata.  Some of this is of course very
dangerous coding practice:-)
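
For instance, one version of the "window" idea (the names b, isub, jsub,
rowoff and coloff are mine, purely illustrative):

  /* a second set of row pointers aimed at an isub x jsub sub-block of
     adata, starting at row rowoff, column coloff of the original matrix */
  double **b = (double **) malloc(isub*sizeof(double *));
  for (i = 0; i < isub; i++) b[i] = &adata[(rowoff + i)*jmax + coloff];
  /* b[i][j] is now a[rowoff+i][coloff+j]; later free(b) -- but NOT adata */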

As to why one has to be careful: with the row-major layout above, in
your actual code you want to do operations like:

 for(i = 0;i<imax;i++){
   b[i] = 0.0;
   for(j = 0;j<jmax;j++){
     b[i] += a[i][j]*c[j];
   }
 }

and try NOT to do:

 for(i = 0;i<imax;i++){
   b[i] = 0.0;
   for(j = 0;j<jmax;j++){
     b[i] += a[j][i]*c[j];
   }
 }

It may seem silly, but the first loop walks a[i][j] through memory with
stride 1 (given the row-major layout above) while the second jumps a
whole row of jmax doubles per step, so the first stays in cache and the
second thrashes it -- that is where the speed difference in your
question comes from.  You also want your C assumptions in allocating and
laying out **a to match any fortran-derived routines for doing linear
algebra.  Fortran lays out a(i,j) column major -- the FIRST index is the
contiguous one -- which is the opposite of C's row-major a[i][j], so an
array handed to a fortran-derived routine effectively appears transposed
unless you allow for it.  You also need "at least" unit offsets and
strictly rectangular matrices.  Beyond that, there are even more
restrictions and constraints when trying to use fortran routines in C
code -- most of the times I've tried this I've given up before figuring
it out for sure because it is a PITA (I haven't used fortran -- by
choice -- for close to twenty years...;-)
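
A tiny self-contained check of that layout claim (plain C; no fortran is
actually called, the program just reads the same buffer both ways):

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
    int imax = 2, jmax = 3, i, j;
    double *adata = (double *) malloc(imax*jmax*sizeof(double));

    /* fill with the C/row-major convention: a[i][j] at adata[i*jmax + j] */
    for (i = 0; i < imax; i++)
      for (j = 0; j < jmax; j++)
        adata[i*jmax + j] = 10*i + j;

    /* the SAME buffer read with the fortran/column-major convention and
       leading dimension jmax: A(r,c) lives at adata[(c-1)*jmax + (r-1)],
       so C's a[i][j] shows up as fortran's A(j+1,i+1) -- transposed */
    for (j = 0; j < jmax; j++)
      for (i = 0; i < imax; i++)
        printf("A(%d,%d) = %g = a[%d][%d]\n",
               j+1, i+1, adata[i*jmax + j], i, j);

    free(adata);
    return 0;
  }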

IMO, the ability to do malloc/pointer magic like this is both one of C's
real strengths and one of its "numerical" weaknesses.  It is a strength
because you can do some very cool things and retain a lot of control
over just what the system is doing with memory and in just what order.
It is left as an exercise how to (for example) allocate a single vector
from which one can pack addresses into a triangular y[l][m] complex
array indexed by l = 0..lmax, m = -l..l (angular momentum indices), with
no wasted space and fully automatic indexing -- no actual computation of
offsets except during the original allocation/packing (a sketch follows
below) -- or to pack up a triangular array of pretty much arbitrary
structs.  Also, the vector adata can be passed quite freely to e.g. ODE
solvers, and you can still use array forms (no offset arithmetic) in the
derivative evaluators by passing in the address of **a as one of the ODE
parameters.  a[i][j][k][l][m][n] REALLY beats the hell out of trying to
do offset arithmetic with explicit e.g. imax, ioffset, jmax, joffset,
kmax, koffset ... nmax, noffset (and the indices) for a six dimensional
array.  Compilers usually generate fast code for a[][]... indexing as
well, whereas hand-rolled offset arithmetic has to be written (and
optimized) carefully to be as fast.
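
Here is a sketch of that triangular (l,m) exercise, using plain doubles
instead of a complex struct (the variable names are mine):

  int l;  /* lmax assumed already set */
  /* (lmax+1)^2 elements total; row l starts at offset l*l, and pointing
     y[l] at the middle of that row makes y[l][m] valid for m = -l..l */
  double *ydata = (double *) malloc((size_t)(lmax+1)*(lmax+1)*sizeof(double));
  double **y    = (double **) malloc((size_t)(lmax+1)*sizeof(double *));
  for (l = 0; l <= lmax; l++) y[l] = &ydata[l*l + l];
  /* y[l][-l] ... y[l][l] now index the packed row; ydata itself is still
     one contiguous vector, and cleanup is free(y); free(ydata); */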

It is a weakness because the compiler cannot KNOW how to optimize
a[i][j]-based operations in terms of e.g. loop unrolling and using
registers the way it can in fortran -- the rows behind a double ** could
in principle alias each other or live anywhere in memory.  Fortran's
rigid array structure and memory layout permit compiler optimizations
that are usually somewhat stronger than those a C compiler can apply.
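
One common way to hand a C compiler fortran-like assumptions is to keep
the storage flat and promise no aliasing with C99 restrict (a sketch,
nothing GSL-specific):

  /* row-major matrix-vector product; restrict tells the compiler that
     the three arrays do not overlap, and the inner loop is unit stride */
  void matvec(int imax, int jmax, const double *restrict a,
              const double *restrict c, double *restrict b)
  {
    int i, j;
    for (i = 0; i < imax; i++) {
      double s = 0.0;
      for (j = 0; j < jmax; j++) s += a[i*jmax + j]*c[j];
      b[i] = s;
    }
  }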

Anyway, this is how I THINK it goes.  I'm always happy to be corrected
by a real expert.

rgb


> Thanks
>
> Lazar



--
Robert G. Brown                          http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525  email: rgb@phy.duke.edu


