This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



RE: Conservative GC and glibc data structures


> There are clearly three kinds of regions of interest here:
> 
> A: Tls "roots" allocated either at the base of thread stacks other than
> the main one.

I assume "either" should be removed here, and then I know what you mean.

> Based on what you say, it sounds like all type A regions can currently
> be found with the aid of pthread_getattr_np()?  

Correct.  You need to know the direction of stack growth for the machine
and use pthread_attr_getguardsize along with pthread_attr_getstack to
calculate the bounds of the accessible stack for that thread.  (You can
also use pthread_getattr_np to get the bounds of the initial thread's
stack.  In the case of the initial thread, this does not include the thread
descriptor where TLS roots lie.)
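
For concreteness, a minimal sketch of that calculation, assuming a
downward-growing stack and that the guard area sits at the low end of the
block reported by pthread_getattr_np (error checking omitted):

  #define _GNU_SOURCE
  #include <pthread.h>

  static void
  accessible_stack_bounds (void **lo, void **hi)
  {
    pthread_attr_t attr;
    void *stackaddr;
    size_t stacksize, guardsize;

    pthread_getattr_np (pthread_self (), &attr);
    pthread_attr_getstack (&attr, &stackaddr, &stacksize);
    pthread_attr_getguardsize (&attr, &guardsize);
    pthread_attr_destroy (&attr);

    /* Skip the guard area; the rest of the block is the accessible stack
       a collector may scan for this thread.  */
    *lo = (char *) stackaddr + guardsize;
    *hi = (char *) stackaddr + stacksize;
  }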

> Currently we intercept thread creation and just use the stack pointer at
> that point.  We can possibly also get the actual stack base by expanding
> to the next unmapped page, but that's ugly, and has some debugging
> issues.

Indeed.  The actual stack base (as from pthread_getattr_np) is what you
need if you want to cover the contents of the thread descriptor, which is
the root holding TLS data and pthread_getspecific data (as well as some
libc-internal data that doesn't contain user pointers).

> B: Anything allocated before main starts.  This includes tls for the
> main stack.  (I gather that some of these are allocated through a
> malloc() defined in the dynamic loader that can't be intercepted, but is
> not used after main starts?)

I would not divide it quite this way.  There are a few interesting separate
stages of startup in this regard.  

First, the dynamic linker does initial loading of shared libraries.  In
this stage, it uses a trivial internal malloc that allocates space that is
never freed (it's used 99.44% for data that never dies).  That allocator
uses the remainder of the last page of ld.so's own .bss segment
(i.e. outside what the phdrs would lead you to expect, though you can discern where it
is from the phdrs and the page size), and then allocates additional
individual pages with mmap as needed; it does not track its allocations in
any way, so there is no particular record anywhere of which pages it mmap'd
(and they will be located randomly by the kernel).  This is how the initial
thread's thread descriptor is allocated.  The only other allocations in
this stage are for the dynamic linker's internal data structures, such as
the struct link_map for each shared library loaded via DT_NEEDED to satisfy
the main executable's dependencies (among other things).  These data
structures will be pointed to by roots in the dynamic linker's own
.data/.bss, but some may be linked lists pointing to elements in mmap'd
pages you don't know are valid pointers, and the elements in those pages
might point to elements allocated later (via dlopen et al) using the normal
malloc path.
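
If it helps to visualize that first allocator, here is a sketch (using only
existing interfaces) of discerning where the slack beyond ld.so's .bss lies;
picking out ld.so by matching "ld-" in its dl_iterate_phdr name is an
assumption made for the example, and the extra pages it later gets from mmap
cannot be found this way at all:

  #define _GNU_SOURCE
  #include <link.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static int
  find_ld_so_slack (struct dl_phdr_info *info, size_t size, void *data)
  {
    /* Assumption for this sketch: the dynamic linker's entry is the one
       whose name contains "ld-".  */
    if (strstr (info->dlpi_name, "ld-") == NULL)
      return 0;

    ElfW(Addr) end = 0;
    for (int i = 0; i < info->dlpi_phnum; ++i)
      if (info->dlpi_phdr[i].p_type == PT_LOAD)
        {
          ElfW(Addr) seg_end = (info->dlpi_addr
                                + info->dlpi_phdr[i].p_vaddr
                                + info->dlpi_phdr[i].p_memsz);
          if (seg_end > end)
            end = seg_end;
        }

    long pagesz = sysconf (_SC_PAGESIZE);
    ElfW(Addr) page_end = (end + pagesz - 1) & ~(ElfW(Addr)) (pagesz - 1);

    /* [end, page_end) is the tail of ld.so's last page that the bootstrap
       allocator consumes before it starts mmap'ing anonymous pages.  */
    printf ("ld.so slack: %#lx..%#lx\n",
            (unsigned long) end, (unsigned long) page_end);
    return 1;
  }

  int
  main (void)
  {
    dl_iterate_phdr (find_ld_so_slack, NULL);
    return 0;
  }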

After that initial stage, the dynamic linker and everything internal to
libc is using the exported malloc and friends for all its allocations.  If
you've linked in a replacement malloc (and memalign etc.) then it's using
those.  In the normal case, it's using the normal libc malloc and getting
the same behavior from it that an application calling malloc gets.

Next, shared library initializers run.  libc's own initializer runs first.
This is where C++ constructors for C++ shared libraries run.
Then the executable's own C++ constructors run.
Then main starts.

> C: Further regions referenced by A and allocated through malloc().

The allocations in dynamic linker startup are always special.  Everything
else is just malloc, and is the same as malloc will be later on.  If you
are intercepting malloc at link time, then you've intercepted it.  If not,
you haven't and it is like other malloc calls you weren't told about.  Any
malloc in initializers is no different from the malloc in libc when you
call fopen.  The internal data structure fopen creates can point to user-allocated
memory if you call setvbuf et al.  Similarly, a user pointer passed to
on_exit may be stored in memory malloc'd by libc.  There are other
examples.  All such pointers will either be in libc.so's .data/.bss (or
some library's) or will be reachable from there via malloc'd memory (except
per-thread storage of the various sorts).
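
A tiny example of the setvbuf case, just to make the hazard concrete (imagine
the buffer came from the collector's allocator rather than malloc):

  #include <stdio.h>
  #include <stdlib.h>

  int
  main (void)
  {
    FILE *fp = tmpfile ();
    if (fp == NULL)
      return 1;

    char *buf = malloc (BUFSIZ);   /* imagine a GC-allocated buffer */
    setvbuf (fp, buf, _IOFBF, BUFSIZ);
    buf = NULL;                    /* no variable of ours points at it now */

    /* The only remaining reference to the buffer lives inside the FILE,
       i.e. in memory libc malloc'd, which a collector that does not scan
       the malloc heap will never examine.  */
    fputs ("hello", fp);
    fclose (fp);
    return 0;
  }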

> That's known to be inadequate for model 2, at least in the multithreaded
> case.  I believe some dynamic loader data structures get collected.

If e.g. dlopen is used, then memory you don't know about will contain the
only pointers to normally-malloc'd memory for the dynamic linker's data
structures concerning the dlopen'd libraries.

> My impression is that type C regions are used to hold __thread tls data
> that's introduced by dlopen'ed libraries, and pthread_getspecific data
> if a top-level buffer overflows?  Thus we could at least guarantee to
> scan __thread data introduced by the main executable even if we miss
> those?  That would be better than nothing, but not ideal.

Maybe.  The TLS ABIs specify several models of TLS access, and it's a
compile/link-time choice what model to use.  For the model of access
normally used by an executable itself (i.e. what the compiler normally
produces without -fPIC), the nature of the ABI dictates that TLS data be in
the thread descriptor (thread stack or initial thread's special descriptor,
as detailed above).  When libc uses TLS data of its own, it's also compiled
using a model like this, and it is the recommended model for libraries that
expect to be linked directly and don't need to be usable via dlopen
(because it performs better).  The "dynamic" TLS access models put no
constraints on where the TLS data might be stored.  Everything else may be
malloc'd (sometimes implicitly at the time of access), like other internal
data structures are when you call into libc or libdl.  In practice, glibc
uses space in the thread descriptor for everything it knows about at
startup time (i.e. the main executable and its dependencies), and maybe
more (it leaves a little slack so that dlopen'd libraries with small
enough TLS segments can win).  That is the expected optimal way to go about
it, but all you are really guaranteed is that TLS segments of objects with
DF_STATIC_TLS are part of the thread descriptor.
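
For what it's worth, the model is something code can also ask for explicitly;
a minimal illustration using GCC's tls_model attribute (equivalent to the
-ftls-model option):

  /* Forces the static (initial-exec) model: the linker marks the object
     DF_STATIC_TLS and the variable lands in the thread descriptor's
     static TLS block.  */
  __thread int fast_counter __attribute__ ((tls_model ("initial-exec")));

  /* The dynamic model: the runtime is free to place (and lazily
     allocate) this object's TLS block wherever it likes.  */
  __thread int flexible_counter __attribute__ ((tls_model ("global-dynamic")));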

Similarly, pthread_getspecific data is either in the thread descriptor or
malloc'd and pointed to by the thread descriptor.  In practice, it malloc's
chunks of 32 entries at a time and the first chunk is directly in the
thread descriptor rather than malloc'd.  But that is internal happenstance
subject to change without notice, and they could all be malloc'd.  

> 2) Reliable symbol interception (see below, though this may be the wrong
> mailing list for that, and I may have misunderstood the issue),

Symbol interception is a bigger subject and a complex one.  I would not
like this discussion of finding heap to get bogged down in those issues.
(We are the right people to talk to about that, we just hate thinking about
it.)  I'll talk about that under separate cover.

> 1) A way to identify the main thread tls region in B,
>
> 3) A way to iterate over the malloc'ed tls regions C.

I want to continue to belabor exactly what this is or isn't getting you.
As you can understand, we are loath to add interfaces because we will be
stuck with them forever.  I'd hate to add something that covers part of the
need, but has to be changed or augmented later.  Once we add anything,
we'll keep it at least for binary compatibility even if we later remove it
from the usable API or change it.  If present demands are modest, I want to
consider the full set of future needs we can now discern.

Just to be doubly sure we are clear, when I say TLS I am talking
specifically about ELF TLS (where __thread variables go).  There are some
other kinds of per-thread data.  There are void * values stored with
pthread_setspecific.  When a thread exits (calls pthread_exit or returns
from its function) its return value is a user-supplied void * left for
pthread_join to collect; that might very well be a user-allocated pointer
that exists nowhere else until someone completes a call to pthread_join.
There are some libc-internal data structures (the sorts of things libc
would use __thread for, but some are stored specially instead of using ELF
TLS space); those that exist now do not contain any user-supplied pointers,
though they may contain normally-malloc'd pointers.  For all these things,
the thread descriptor either contains them, or points to normally-malloc'd
storage that does.
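
The pthread_join case in particular is easy to trip over; a short example of
the window where the returned pointer exists only in libc's per-thread
storage (build with -pthread):

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static void *
  worker (void *arg)
  {
    (void) arg;
    char *result = strdup ("computed result");   /* imagine a GC-allocated block */
    return result;   /* from here until pthread_join, only libc holds the pointer */
  }

  int
  main (void)
  {
    pthread_t t;
    void *res;

    pthread_create (&t, NULL, worker, NULL);
    /* ... arbitrarily long gap with no user-visible root for the result ... */
    pthread_join (t, &res);
    printf ("%s\n", (char *) res);
    free (res);
    return 0;
  }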

It sounds like you are trying to cover the case where you are not
intercepting malloc (model 1), and all your users' roots are found in data
segments (reported by dl_iterate_phdr) or in TLS segments (__thread data).
That is perhaps a reasonable compromise to make with your users.  If they
pass GC-allocated pointers to be stored by libc (on_exit, setvbuf, etc.),
or use pthread_setspecific, then they must also keep copies of those
pointers reachable from data/TLS roots.  An interface just to get TLS
information is attractive in that it's already coherent, public information
that fits in naturally with existing interfaces.

From the TLS point of view, A, B, C, initial thread or other thread, it's
all the same.  The main executable and each shared library, whether loaded
at startup or with dlopen (and the dynamic linker itself) are all ELF
objects.  Each one either has a PT_TLS segment or it doesn't.  You see them
with dl_iterate_phdr.  For each particular thread, each individual TLS
segment at a given moment is allocated or it's not.  It would be simple for
dl_iterate_phdr and/or dlinfo to tell you the address of the object's TLS
block in the current thread, or that it hasn't allocated it, or that the
object has no PT_TLS.  dl_iterate_phdr is somewhat heavy-weight and takes a
lock so that all dl_iterate_phdr calls are serialized.  dlinfo is cheaper
and nonserializing to call in each thread, but has to be called explicitly
with each object handle; so you'd have one thread call dl_iterate_phdr and
collect a list of objects with TLS segments, and then set off each thread
to call dlinfo on each object in the list.  If it is substantially better
for you to make calls from only one thread to get information about the
other threads, an interface like that may need some more thought.
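
The first half of that scheme needs nothing new; a sketch of one thread
collecting the objects that have a PT_TLS segment with dl_iterate_phdr (the
per-thread address lookup is the part that would want the dlinfo-style
extension, so it is not shown):

  #define _GNU_SOURCE
  #include <link.h>
  #include <stdio.h>

  static int
  note_tls_modules (struct dl_phdr_info *info, size_t size, void *data)
  {
    for (int i = 0; i < info->dlpi_phnum; ++i)
      if (info->dlpi_phdr[i].p_type == PT_TLS)
        {
          printf ("%s: PT_TLS segment of %zu bytes\n",
                  info->dlpi_name[0] != '\0' ? info->dlpi_name : "(main executable)",
                  (size_t) info->dlpi_phdr[i].p_memsz);
          break;
        }
    return 0;   /* keep iterating over all loaded objects */
  }

  int
  main (void)
  {
    dl_iterate_phdr (note_tls_modules, NULL);
    return 0;
  }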

In a certain way, this is a fundamentally different approach.  The neat
thing about conservative GC, what made me decide all those years ago that
you were a very cool guy, is that it's conservative.  You don't have to
think about it--if it might be a root, it's a root!  Back in the days of
innocence, data segment and stack got everything that was.  No special
contracts about where I have to store it to make it a root, just any place
I store things will be a root because you know to look at every basic form
of storage I've got (all two of them).  Leaving aside my making trouble
with the dynamic linker's magic mindless malloc, the modern world of
threads just adds a per-thread root chunk thingie (the per-thread version of
the data segment, aka the thread descriptor), and a per-thread stack.  If you get
those, then the hard-core conservative approach is again available.

When not intercepting malloc, not knowing the bounds of the thread
descriptor (for the initial thread) means you can't see pointers passed to
pthread_setspecific and the like.  But, when not intercepting malloc, even
with the thread descriptor you wouldn't see pointers passed to
pthread_setspecific for the 33rd key anyway, nor pointers passed to the nth
on_exit call, or to setvbuf, or fopencookie, and so on.  We could perhaps
provide something to enumerate existing pthread_key_t's, but we can't
enumerate user pointers stored by on_exit, setvbuf, et al.  You can use a
thread TLS segment enumerator, and say to users, roots are "your variables",
be they global, static, auto, or __thread, in your executable or in extra
shared libraries.

When intercepting malloc, just knowing the thread descriptors (along with
stacks, data segments, and all malloc calls) leads you to everything that
is.  So you don't need to think about TLS specifically.  You just always
reach it, along with pthread_setspecific and setvbuf and everything else.
The one exception is the dynamic linker data structures allocated in early
startup.  These are never roots leading to user-allocated pointers, only to
normally malloc'd blocks used for more dynamic linker data structures.

There is a different approach to the dynamic linker's allocations than
considering them known heap regions expected to be reachable from roots in
data segments.  Rather than looking to its prior allocations for roots, we
can look at the intercepted malloc allocations that it does.  The
interposed malloc et al functions can use __builtin_return_address (0) and
compare the caller's PC to the known text range of the dynamic linker.
(You can get that at startup with dl_iterate_phdr.)  Only code inside the
dynamic linker will allocate pointers to be stored in the dynamic linker
data structures that might be in heap regions you don't know about.  When
the call originates in the dynamic linker, record the pointer being
returned in a special list of roots.  The interposed free function does the
same check, and when called from the dynamic linker, rather than doing
nothing, it removes that pointer from the list of roots.  Or, just use an
entirely different noncollecting allocator for those calls (perhaps the
original libc malloc/free, if you can cooperate with its sbrk use).  In the
case of the dynamic linker, I think it would be safe to skip looking at
those blocks for other collected pointers (they will point only to things
allocated by more malloc calls from the dynamic linker itself).  In
principle you could do the same for any shared library that you've decided
uses free fastidiously, if you wanted to for some reason (in the general
case you would want those allocations to be scanned for collected
pointers).  (I can imagine a leak-debugging mode where you might do this
for a given library and then report pointers found nowhere but on the
special list of roots, citing the free calls the library failed to make.)
Perhaps all in all this is too much hassle and you'd really just like to
enumerate those special heap regions to treat them like more data segments.
But it's one way to address this problem, and it can be done with existing
interfaces.
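
A sketch of what those interposed functions might look like; the root-table
helpers are invented for illustration, ld.so is again identified by name as an
assumption, and __libc_malloc/__libc_free (glibc's exported but undocumented
internal entry points) stand in for wherever the non-ld.so path would really
go (a collector would use its own allocator there):

  #define _GNU_SOURCE
  #include <link.h>
  #include <stddef.h>
  #include <string.h>

  extern void *__libc_malloc (size_t);
  extern void __libc_free (void *);

  /* Text range of the dynamic linker, found once via dl_iterate_phdr.  */
  static const char *ld_text_start, *ld_text_end;

  static int
  find_ld_text (struct dl_phdr_info *info, size_t size, void *data)
  {
    if (strstr (info->dlpi_name, "ld-") == NULL)   /* assumption: matches ld.so */
      return 0;
    for (int i = 0; i < info->dlpi_phnum; ++i)
      if (info->dlpi_phdr[i].p_type == PT_LOAD
          && (info->dlpi_phdr[i].p_flags & PF_X) != 0)
        {
          ld_text_start = (const char *) (info->dlpi_addr
                                          + info->dlpi_phdr[i].p_vaddr);
          ld_text_end = ld_text_start + info->dlpi_phdr[i].p_memsz;
        }
    return 1;
  }

  static int
  called_from_ld_so (const void *pc)
  {
    if (ld_text_start == NULL)
      dl_iterate_phdr (find_ld_text, NULL);
    return (const char *) pc >= ld_text_start && (const char *) pc < ld_text_end;
  }

  /* Toy root table standing in for the collector's real root registry.  */
  static void *ld_roots[1024];

  static void
  record_root (void *p)
  {
    for (size_t i = 0; i < sizeof ld_roots / sizeof ld_roots[0]; ++i)
      if (ld_roots[i] == NULL)
        {
          ld_roots[i] = p;
          return;
        }
  }

  static void
  forget_root (void *p)
  {
    for (size_t i = 0; i < sizeof ld_roots / sizeof ld_roots[0]; ++i)
      if (ld_roots[i] == p)
        ld_roots[i] = NULL;
  }

  void *
  malloc (size_t n)
  {
    void *p = __libc_malloc (n);   /* a real collector allocates its own way */
    if (p != NULL && called_from_ld_so (__builtin_return_address (0)))
      record_root (p);             /* dynamic linker data: keep it as a root */
    return p;
  }

  void
  free (void *p)
  {
    if (p != NULL && called_from_ld_so (__builtin_return_address (0)))
      forget_root (p);             /* ld.so freed it; stop treating it as a root */
    __libc_free (p);
  }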

There is nothing especially difficult or costly about keeping track of the
pages allocated in dynamic linker startup.  It just seems like an odd
internal detail to expose an interface for you to enumerate them.  Yet, if
you have that, plus data segments, and stacks (and malloc), then you get to
everything bar none and you don't even need to know about thread
descriptors as distinct from part of a thread stack block or part of the
enumerated early-allocated heap.  So I am torn.  One way would be just to
provide an interface to get the bounds of a thread's descriptor block.
That works cleanly for the initial thread and other threads alike, and you
needn't do any calculations with stack and guard sizes to properly encompass
the unseen base of your thread's stack block; you can just go from the stack
pointer you know in your thread start function.  It covers all manner of
per-thread data, but leaves you to your own devices (such as those above)
for the dynamic linker's roots for its blocks allocated with your malloc.
The other way would be to just provide a way to see all the dynamic
linker's heap pages, so you can treat them like data segments.  That means
you don't have to know about thread descriptors at all, there is just heap
and more heap and stack blocks, and you have to do the pthread_getattr_np
machinations to find the complete bounds of stack blocks.  That choice is
not entirely clear, and neither proposition is shiningly clean.
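
Purely to make the two choices concrete, hypothetical declarations (neither
exists in glibc, and the names are invented here) might look like:

  #include <pthread.h>
  #include <stddef.h>

  /* Option 1 (hypothetical): report the bounds of a thread's descriptor
     block, for the initial thread or any other.  */
  int pthread_getdescriptor_np (pthread_t thread, void **base, size_t *size);

  /* Option 2 (hypothetical): enumerate the pages the dynamic linker's
     bootstrap allocator handed out, so they can be treated like more
     data segments.  */
  int dl_iterate_bootstrap_heap (int (*callback) (void *start, size_t length,
                                                  void *data),
                                 void *data);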

In summary: The model 1 use with the "your variables are roots" contract is
clearly worthwhile independent of a plan for a perfect model 2.  Ways to
enumerate TLS segment information fit naturally into existing interfaces
(dl_iterate_phdr, dlinfo), whose particulars I mentioned above.  Fancier
interfaces (info about non-self threads) are conceivable if you have a case
for that.


Thanks,
Roland

