This is the mail archive of the guile@cygnus.com mailing list for the Guile project.


Re: SIGSEGV in scm_gc_mark ()

To: Jan Nieuwenhuizen <janneke@gnu.org>
Subject: Re: SIGSEGV in scm_gc_mark ()
From: Jim Blandy <jimb@red-bean.com>
Date: 19 Jul 1999 02:38:07 -0500
Cc: guile@gnu.org, "ir. Wendy" <hanwen@cs.uu.nl>
References: <199906280814.KAA02760@appel.flower>


> We are struggling with a weird bug in LilyPond/Guile that seems
> to occur only on (Linux)-powerpc.  It seems that a cell goes corrupt,
> which is then found during a mark/sweep pass.  The same bug has been
> present in the last fifteen patch levels of LilyPond.  In one particular 
> version (pl 39), I'm getting the 'rogue pointer in heap' error.
> There were quite some changes to Lily's code, both C++ and embedded Scheme.
> 
> One other oddity; I've never encountered flawed output or other signs
> of corruption.  By default, all LilyPond output is generated through the 
> evaluation of scheme objects; one by one, during runtime.  But, Lily 
> has the option not to evaluate these objects, and write them to a 
> script instead.  When choosing this alternative output option, the
> resulting scripts run fine, and produce the correct output.
> 
> Any ideas of how to tackle this bug?

I've forwarded this reply to guile@cygnus.com; there are lots of
people there with experience tracking down these sorts of problems.

One approach would be to use the garbage collector as a heap
validator, and do an n-ary search to find out exactly when the cell is
corrupted.  Change scm_igc so that, after doing a garbage collection,
it truncates the free list to some small number of cells, controlled
by a global variable, say scm_debug_alloc_count.  Then you will get a
garbage collection after every `scm_debug_alloc_count' allocations.
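
A minimal sketch of the truncation step, for illustration only.  The
`struct cell' layout and the `truncate_free_list' helper are
hypothetical stand-ins, not Guile's actual free-list representation;
only the name scm_debug_alloc_count comes from the suggestion above.

```c
#include <stddef.h>

/* Hypothetical stand-in for a heap cell on the free list; Guile's real
   cells and free-list layout differ. */
struct cell {
    struct cell *next;
};

/* Assumed debugging knob, named as in the text above.  It would be set
   from the debugger before or during a run. */
size_t scm_debug_alloc_count = 1000;

/* Keep at most `limit` cells on the free list and unlink the rest.
   Returns the number of cells kept.  The unlinked cells are not freed
   here; in a mark/sweep collector they are unreachable and would
   normally be reclaimed by the next sweep. */
size_t truncate_free_list(struct cell **freelist, size_t limit)
{
    size_t kept = 0;
    struct cell **link = freelist;

    /* Walk forward until we run out of cells or reach the limit. */
    while (*link != NULL && kept < limit) {
        link = &(*link)->next;
        kept++;
    }
    *link = NULL;   /* cut the list here */
    return kept;
}
```

Under these assumptions, the end of scm_igc would call something like
truncate_free_list (&freelist, scm_debug_alloc_count), so the allocator
runs dry and triggers the next collection after that many allocations.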

Run the program under GDB, set scm_debug_alloc_count to 1000 or so,
and set a breakpoint with an `ignore' count (with the `ignore'
command) of a million or so on scm_igc.  When the program crashes in
scm_igc, use `info break' to check the remaining ignore count;
subtracting it from the original count tells you how many times the
function has been called.

Start the program again, and set the ignore count to run up to the
last successful call to scm_igc.  Now set scm_debug_alloc_count to a
smaller value, so GCs will happen more often, and see how many
further calls to scm_igc succeed.  Repeat the process with smaller and
smaller values of scm_debug_alloc_count, until you know the two calls
to SCM_NEWCELL between which the corruption occurs.  Then start
looking at your code.
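
The bookkeeping for each round of the search can be sketched as
follows.  This assumes the program allocates deterministically from
run to run and that a collection happens exactly every
scm_debug_alloc_count allocations; the `alloc_window' type and
`corruption_window' function are illustrative names, not part of
Guile.

```c
/* Range of allocations in which the corruption must have happened:
   after allocation `lo`, and no later than allocation `hi`. */
struct alloc_window {
    unsigned long lo;
    unsigned long hi;
};

/* `start` is the allocation count at which this round began forcing
   GCs (0 for the first round, the previous window's `lo` afterwards).
   `interval` is the value of scm_debug_alloc_count for this round.
   `successful_gcs` is how many scm_igc calls completed in this round
   before the one that crashed. */
struct alloc_window corruption_window(unsigned long start,
                                      unsigned long interval,
                                      unsigned long successful_gcs)
{
    struct alloc_window w;

    w.lo = start + successful_gcs * interval;  /* last clean GC */
    w.hi = w.lo + interval;                    /* GC that crashed */
    return w;
}
```

For example, with an interval of 1000 and 41 successful collections,
the window is allocations 41000 to 42000; rerunning from 41000 with an
interval of 100 and seeing 7 successful collections narrows it to 41700
to 41800.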


In general, it would be nice to automate this whole process by having
the GC say, when it notices an error, "the heap was corrupted sometime
between the NNNNth and MMMMth cell allocation," and then further have
an environment variable that forces a GC after a certain number of
allocations, by counting the free list.  Then it would be pretty easy
to do these binary searches for heap corruption.  That would be a nice
patch to have, if someone wanted to write it.
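
A rough sketch of what such a patch might track.  Every name here is
hypothetical, and it keeps an explicit allocation counter rather than
counting the free list as suggested above; the heap check itself is
left to the collector and passed in as a flag.

```c
#include <stdio.h>
#include <stdlib.h>

static unsigned long alloc_count;      /* cells allocated so far */
static unsigned long last_good_count;  /* alloc_count at the last clean GC */
static unsigned long gc_interval;      /* 0 means no forced GCs */

/* Parse the forced-GC interval from a string, e.g. the value of an
   environment variable obtained with getenv().  NULL or an unparsable
   value disables forced GCs. */
void debug_gc_configure(const char *value)
{
    gc_interval = value ? strtoul(value, NULL, 10) : 0;
}

/* Call on every cell allocation.  Returns 1 when a GC should be forced
   because `gc_interval` allocations have happened since the last clean
   collection. */
int debug_note_alloc(void)
{
    alloc_count++;
    return gc_interval != 0 && alloc_count - last_good_count >= gc_interval;
}

/* Call after each GC pass with the result of the heap check.  Returns 1
   if the heap was clean; otherwise writes a report naming the window of
   allocations and returns 0. */
int debug_gc_finished(int heap_ok, char *report, size_t report_len)
{
    if (heap_ok) {
        last_good_count = alloc_count;
        return 1;
    }
    snprintf(report, report_len,
             "the heap was corrupted sometime between cell allocations "
             "%lu and %lu", last_good_count, alloc_count);
    return 0;
}
```

The allocator would call debug_note_alloc from the cell-allocation path
and run a collection whenever it returns 1; the collector would call
debug_gc_finished once the mark phase has either completed or detected
a bad cell.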

Follow-Ups:
- Re: SIGSEGV in scm_gc_mark () [RESOLVED by upgrading]
  - From: Jan Nieuwenhuizen <janneke@gnu.org>
