This is the mail archive of the gdb@sources.redhat.com mailing list for the GDB project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

denser dwarf macro information



[Meant to post this to both the GCC and GDB lists, and get GDB folks'
comments too.  Please CC: gcc@gcc.gnu.org on replies.]

Could I have interested GCC folks' comments and suggestions on the
proposal below?  I've run it by the Dwarf committee, and incorporated
their suggestions.  The general reaction seemed positive.

Dan Berlin has said he'd be willing to work on the GCC side of a
prototype implementation.  I'd be happy to do the GDB side.  We'd do
it as a vendor extension at first, and once we'd gotten some
experience with it, we could get it incorporated into the Dwarf spec.



$Id: die-macros,v 1.8 2002/04/09 22:20:55 jimb Exp $

This is a proposal for a change to the way Dwarf represents
preprocessor macro information that:

- stores macro information more compactly than is possible with the
  representation currently in the Dwarf spec (for the two programs
  we've gathered data on, we expect to see reductions in the size of
  the macro data of roughly 5:1 and 3.6:1), and

- allows for compact vendor extensions (which is critical to allow
  vendors to experiment effectively with new representations for very
  common constructs).

This proposal has four parts:

- First, we show some data which suggest that substantial savings are
  possible over the existing representation.

- Next, we explain why the existing macro representation's mechanisms
  for vendor extensions are inadequate to support experimentation with
  denser mechanisms.

- Then we propose a new representation which we suspect does not share
  these shortcomings; this suspicion needs to be verified by experiments.

- Finally, we suggest some minor variations in the representation
  which trade consistency with the rest of the Dwarf spec for space.
  The experiments should show the impact of the variations, and help
  us choose the right balance.


The Existing Macro Information Representation Inhibits Helpful
Compression Strategies
==============================================================

To evaluate the space costs of including macro debugging information
in an executable, we compiled two programs, the GNU Debugger (GDB),
and the Rsync remote file copying program, requesting that Dwarf macro
information be generated and included in the final executable.

                                                   GDB            Rsync
size of source code:                                   1.3Mloc         22kloc
total size of .debug_macinfo section:           14 424 605b     2 300 292b
number of DW_MACINFO_define and                    351 260         85 936
DW_MACINFO_undef entries:
total size of those entries' string operands:   13 311 062b     2 061 973b
total size of string operands, after               247 138b        55 124b
removing duplicate entries:
% duplication in string data:                           98%            97%

Clearly, the .debug_macinfo section wastes a great deal of space
storing many copies of the same strings.  This duplication arises
because header files contain many macro definitions, and are often
included into more than one compilation unit.  Thus, each compilation
unit contributes a separate copy of its header files' macro
information to the final executable file.

Some linkers (including the GNU linker) can remove duplicate strings
from an executable's .debug_str section, and adjust DW_FORM_strp
references to point to the remaining unique copies.  If we could
replace the string operands of DW_MACINFO_define and DW_MACINFO_undef
entries with offsets into into the .debug_str section, moving the
actual definitions and names there, and then allow the linker to
remove duplications, how much space would we save?

                                                   GDB            Rsync
size of data in .debug_macinfo other             1 113 543b       238 319b
than strings (total size - string size):
size of one four-byte .debug_str offset          1 405 040b       343 744b
per define/undef entry (4 * # of entries):
total size of string operands, after               247 138b        55 124b
removing duplicate entries:
total of the above:                              2 765 721b       637 187b
overall compression:                                 ~ 5:1        ~ 3.6:1

The compression ratio is significant for both programs, but obviously
varies substantially from one program to the next.  We suspect that
larger programs will have more duplication in their .debug_macinfo
sections, simply because adding more compilation units to a program
brings in more copies of common headers' macros.


Why We Can't Use The Existing Macro Vendor Extension Mechanism
To Try This Out
==============================================================

We would like to implement the compression technique proposed above in
the framework of the existing macro info representation, but certain
weaknesses of the macro info's vendor extension mechanism make this
impossible.

A DW_MACINFO_vendor_ext entry cannot easily hold a reference into
another section, like the .debug_str section.  A DW_MACINFO_vendor_ext
entry takes two operands: a uleb128 value, and a null-terminated
string.  The offset into the .debug_str section will need to be
relocated, since the linker rearranges that section's contents.  But
most linkers cannot relocated uleb128 values, so the offset cannot be
the entry's first operand.  And since the offset may contain bytes
whose value is zero, it cannot be stored in the string operand.

Furthermore, the DW_MACINFO_vendor_ext's string operand should begin
with some unique string a consumer can use to distinguish one vendor's
extensions from another's.  However, entries representing definitions
and undefinitions are very common, so even a small identifying string
per entry would consume large amounts of space.  If GCC were to
represent definitions and undefinitions using DW_MACINFO_vendor_ext
entries whose string operands began with the three characters "GNU",
those identifying bytes would occupy 1.0Mb in GDB, or 257kb in rsync,
increasing the size of the compressed sections proposed above by 38%
and 40% respectively.  It seems wasteful to spend this much space on
bytes which carry no real information.

In general, we think this shows up a general weakness in the Dwarf
macro info representation's vendor extension mechanism: it is
generally unsuitable for entries which occur very frequently.


A Denser Representation
=======================

Instead of the special-purpose encoding for macro information
described in the extant "Macro Information" section, this proposal
uses DIEs to represent macro information.  This confers the following
advantages over the special-purpose encoding:

- DIEs are more flexible: their characteristics are described as a
  list of attribute/form/value triplets, not as a fixed series of
  values with unchangeable representations.

- DIEs are more compact: using the DW_FORM_strp form for #define and
  #undef entries and then removing duplicated strings from the
  .debug_str section allows significant space savings over the
  DW_MACINFO_define and DW_MACINFO_undef entry forms, which store
  macro names and definitions in-line.

- DIEs are more extensible: vendors can introduce custom tags and
  attributes that are just as dense as the standard data, but a
  consumer can still skip a DIE if it doesn't recognize its tag, or
  skip an attribute if it doesn't recognize its name.  The
  DW_MACINFO_vendor_ext form is usually substantially larger than the
  other macro info entries, because the string must begin with some
  distinctive text, to avoid confusion with other vendor's extensions.

- DIEs are treelike: this allows a slightly more natural
  representation of the #inclusion tree than paired
  DW_MACINFO_start_file and DW_MACINFO_end_file entries.

- DIEs are unoriginal: reading them requires mechanisms a Dwarf reader
  already has, not special-purpose code.

Editorial comments and questions are in [[double brackets]].

[[My hard copy of the Dwarf draft is out of date, so I've included
section titles with the numbers.]]

In section 3.1, "Compilation Unit Entries", add a sentence to the
end of the first paragraph for DW_AT_macro_info:

  (This attribute is deprecated in favor of DW_TAG_macro_info DIEs,
  which provide the same information in a more flexible, more compact
  representation; see section N.M, "DIE-based Macro Information".)

Add a new section to the end of chapter 3, titled "DIE-based Macro
Information":

  (The representation for macro information described here supercedes
  the representation described in section 6.3, "Macro Information".
  It provides the same content in a more flexible, compact
  representation.)

  <rationale italics>
  Some languages, such as C and C++, provide a way to define and
  invoke macros which expand into strings of tokens.  Macro expansion
  logically takes place before parsing the program into functions,
  statements, expressions, and the like; most Dwarf information
  describes the latter constructs, and thus reflects the program after
  macro expansion, rather than as the programmer wrote it.

  Since there is no necessary relationship between the boundaries of
  macro invocations and those of syntactic constructs in the expanded
  program, it is difficult to fit macros into the tree of DIEs
  representing those syntactic constructs.  For this reason, Dwarf
  describes the program's macros in their own, independent tree of
  DIEs.  This provides enough information to allow a debugger to find
  the macro definitions in scope at any given point in the source
  code.
  </rationale italics>

  A compilation unit may have one child DIE with the tag
  DW_TAG_macro_info, describing the preprocessor macros in scope at
  each point in the compilation unit's source files.  The macro
  information is represented as a tree of DIEs, reflecting the tree of
  preprocessor file inclusion directives (`#include' or `#import', for
  example) responsible for assembling the compilation unit's complete
  source code.

  <subsection>
  <title>Macro File Entries</title>

  Dwarf represents each source file contributing to a compilation
  unit, whether it is the ``base'' source file submitted to the
  compiler, or a file included in the compilation by an `#include'
  directive, as a DIE with the tag DW_TAG_macro_file.  The DIE for the
  base source file is a direct child of the DW_TAG_macro_info DIE.
  The DIE for an included source file is a child of the DIE
  representing the including file.

  A DW_TAG_macro_file DIE may have a DW_AT_file_number attribute
  giving the name of the source file it represents.  The attribute's
  value, a constant, is a file number in the statement information
  table for the compilation unit of which the DW_TAG_macro_info DIE is
  a child.  If the source file is not a named file on disk (source
  code submitted via the standard input stream on a UNIX system, for
  example), the attribute's value should select a filename consisting
  of the empty string.

  If a DW_TAG_macro_file DIE is not a direct child of the
  DW_TAG_macro_info DIE, it may have a DW_AT_include_line attribute, a
  constant, whose value is the number of the source file line in the
  including file that contains the inclusion directive responsible for
  bringing this file into the compilation unit.  (Like the line
  numbers in the Dwarf line number information, the first line of the
  file is numbered 1.)

  If a given source file is included into a compilation unit more than
  once, there will be a separate DW_TAG_macro_file DIE for each
  inclusion.  A consumer can distinguish the various inclusions of a
  given file by their parent DIEs and their DW_AT_include_line
  attributes (if present).

  The children of a DW_TAG_macro_file die --- macro definition and
  undefinition DIEs --- must appear in order of increasing
  DW_AT_include_line and DW_AT_decl_line attribute values.  That is,
  they must appear in the tree in the same order that they appear in
  the original program.

  <subsection>
  <title>Macro Definition Entries</title>

  Dwarf represents each macro definition with a DIE with either the
  tag DW_TAG_object_macro or the tag DW_TAG_function_macro.

  A DIE representing a macro definition in a source file is a child of
  the DIE representing that source file.  Such a DIE may have a
  DW_AT_decl_line attribute, giving the number of the source line on
  which the definition starts.

  DIEs representing macros that are pre-defined by the compiler (to
  identify the compiler, operating system or processor architecture,
  for example), appear directly as children of the DW_TAG_macro_info
  DIE.  DIEs representing macros defined by the user outside any
  source file (on the command line, for example) also appear directly
  as children of the DW_TAG_macro_info DIE.  Such definitions and
  undefinitions appear in the order in which they take effect:
  superceding directives follow those they supercede.

  A definition's DIE may have a DW_AT_name attribute, a string, which
  gives the macro's name, and a DW_AT_replacement_text attribute, also
  a string, which gives the text an invocation of the macro should be
  replaced by.

  If the macro is an object-like macro, its DIE has the tag
  DW_TAG_object_macro.

  If the macro is a function-like macro, its DIE has the tag
  DW_TAG_function_macro.  The DIE may have zero or more child DIEs,
  each with the tag DW_TAG_formal_parameter, describing the formal
  parameters from left to right.  Each parameter DIE may have a
  DW_AT_name attribute, a string, giving the parameter's name.  If the
  macro takes a variable number of arguments, its DIE has a child DIE
  with the tag DW_TAG_unspecified_parameters, following the parameter
  DIEs, if any.  This DIE has no attributes or children.

  <subsection>
  <title>Macro Undefinition Entries</title>

  Dwarf represents the removal of a macro definition (for example, via
  an `#undef' preprocessor directive) with a DIE with the tag
  DW_TAG_macro_undef.  This DIE is a child of the DIE representing the
  file in which the undefinition occurs.  It may have a
  DW_AT_decl_line attribute, giving the number of the source line on
  which the undefinition occurs.

  An undefinition's DIE may have a DW_AT_name attribute, a string,
  giving the name of the macro whose definition's scope ends at the
  given line number.


Add a new paragraph to the top of section 6.3, "Macro Information":

  (The representation described in this section is superceded by the
  one described in section N.M, "DIE-based Macro Information", which
  provides the same information in a more flexible, more compact
  representation.)

Add a new paragraph to section 7.5.4, "Attribute Encodings", after the
first paragraph:

  Some linkers are able to remove duplicate strings from the
  .debug_str section.  Thus, using DW_FORM_strp to represent strings
  which may occur many times in the final executable can provide a
  substantial space savings.  For example, header files often contain
  many macro definitions, and are included into many different
  compilation units.  This means that the same strings often appear
  many times as macro names or replacement lists in the linked
  executable's Dwarf information.  In this case, using DW_FORM_strp
  and allowing the linker to remove duplicates from the .debug_str
  section ensures that the text of each header file's definitions
  appears only once in the final executable's debugging information.


Areas For Experimentation
========================= 

In the text above, we break the macro's name, parameters, and
replacement list out into separate attributes or DIEs.  This is more
consistent with the way the rest of the .debug_info section works.
However, doing so may make DW_FORM_strp duplicate compression less
effective: rather than replacing the macro's name, parameters, and
replacement list all with a single 4-byte (or 8-byte) offset, we'd
emit a .debug_str offset for each of those components.  For strings
shorter than an offset (say, one-character macro parameter names), the
Dwarf producer would select DW_FORM_string instead, but emitting any
text at all is still a waste.

The alternative would be to give the DW_TAG_{function,object}_macro
tags a single DW_AT_definition attribute, with the same syntax a
DW_MACINFO_define entry's string operand has now.  This might give
better compression ratios.

A third possibility would be to use three attributes: DW_AT_name (the
name of the macro), DW_AT_parameters (a comma-separated list of
parameter names), and DW_AT_definition (the replacement string).  This
may (or may not) result in better compression as it would allow the
commoning up of common parameter entries.

It's probably worthwhile experimenting with these alternatives, to see
how much difference the choice makes.  If it's not a significant cost,
then the form in the proposal above seems most consistent with the
rest of the Dwarf specification.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]