This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.


Re: Questions regarding address relaxation on IA-64

From: Jim Wilson <wilson at specifix dot com>
To: Alexander Monakov <monoid at ispras dot ru>
Cc: binutils at sourceware dot org
Date: Thu, 22 Mar 2007 13:31:39 -0700
Subject: Re: Questions regarding address relaxation on IA-64
References: <op.tplo50buqmwd4b@localhost>

On Thu, 2007-03-22 at 20:21 +0300, Alexander Monakov wrote:
> It seems that on IA64 addresses of global variables are loaded with two  
> instructions: "addl rXX = r1, <offset>" and "ld8 rXX = [rXX]", with the  
> latter being later changed to "nop" by the linker. This causes the  
> following questions:

These are the R_IA_64_LTOFF22X and R_IA_64_LDXMOV relocations, used for
link-time rewriting of the code to optimize global variable references.

You may want to reference the Itanium Processor-specific Application
Binary Interface (psABI) and the Itanium Software Conventions and
Runtime Architecture Guide (SCRA).  Both are available from the
developer.intel.com web site, along with other places.  The first one
talks about relocations, and the second one talks about coding
conventions.  In this case, this is the code sequence emitted to access
external data, e.g. a global variable.

>   * Is the purpose of the "ld8" instruction to load the correct offset if  
> it does not fit into "addl" immediate operand?

In pic code, a global variable is accessed indirectly via the GOT.  So
the addl instruction computes the address of the GOT entry that holds
the address of the global variable, and then the ld8 loads that address
into a register.

As an optimization, at link time, if we detect that the global variable
address is within range of the GP register, then we can compute the
address directly with the addl instruction, and the ld8 is no longer
needed.  This results in faster code by eliminating a load.  Because
deleting an instruction is hard, we replace the load with a nop.  This
optimization could potentially be performed on any target, but as far as
I know only the IA-64 target does this.
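
As a rough illustration of the decision described above, here is a
minimal Python sketch.  It assumes the direct form is usable when the
symbol resolves locally and its distance from GP fits the signed 22-bit
immediate of addl.  The helper names and the resolved_locally flag are
hypothetical; this models the decision only, not the actual binutils
code or instruction encoding.

```python
# Signed 22-bit immediate range of the IA-64 "addl" instruction.
IMM22_MIN = -(1 << 21)
IMM22_MAX = (1 << 21) - 1


def fits_imm22(value):
    """True if value can be encoded as addl's signed 22-bit immediate."""
    return IMM22_MIN <= value <= IMM22_MAX


def relax_global_access(gp, got_entry_addr, symbol_addr, resolved_locally):
    """Return the two-instruction sequence as it might look after linking.

    Hypothetical helper: all addresses are plain integers.
    """
    direct_offset = symbol_addr - gp
    if resolved_locally and fits_imm22(direct_offset):
        # The variable is close enough to GP: compute its address directly
        # and replace the load, which is no longer needed, with a nop.
        return [f"addl rXX = r1, {direct_offset}", "nop"]
    # Otherwise keep the indirect access through the GOT entry.
    return [f"addl rXX = r1, {got_entry_addr - gp}", "ld8 rXX = [rXX]"]
```

Calling relax_global_access with a symbol 1024 bytes above GP gives the
addl/nop pair; a symbol 2 MiB or more above GP keeps the addl/ld8 pair.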

>   * Is it possible to use "movl rXX = <offset>" (move long immediate, in  
> MLX bundle) + "addl rXX = r1, rXX" for the same purpose?

That would no longer be position independent (PIC), and hence would
violate the Itanium ABI which requires all code to be PIC.  It would
also not function in a shared library, which can only work when code is
PIC.

>   * Is it possible to tell compiler and linker that offsets will be small  
> enough so that only "addl rXX = r1, <offset>" will be needed (and if it is  
> not possible, why)?

The offset between the GP value and the global variable address won't be
known until link time, as we don't know which object file will define
the global variable, and we don't know whether it will be linked in
early or late on the command line.  The position of the defining object
file on the link command line may affect whether it is close enough to
the GP value.  We also don't know whether it will even be linked in; it
might be in a shared library, for instance.  We also won't know the size
of the GOT and other sections put in the same segment as the GOT until
link time.  There are probably also other factors I can't think of
immediately.

Since we have to decide at compile time whether to emit the addl/ld8, we
have no choice but to emit both insns, and let the linker optimize.
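
To make the link-order point concrete, here is a small Python sketch of
a toy layout.  It assumes data from each object is simply placed in
command-line order and that GP is fixed at address 0.  Both are
simplifications (the real linker chooses the GP value and lays out many
sections), and the object sizes and symbol names are made up.

```python
# Signed 22-bit immediate range of the IA-64 "addl" instruction.
IMM22_MIN = -(1 << 21)
IMM22_MAX = (1 << 21) - 1


def symbol_offsets_from_gp(link_order, gp, data_base):
    """Lay out each object's data in link order; map symbol -> offset from GP.

    link_order: list of (data_size, {symbol_name: offset_within_object}).
    """
    offsets = {}
    cursor = data_base
    for data_size, symbols in link_order:
        for name, offset_in_object in symbols.items():
            offsets[name] = cursor + offset_in_object - gp
        cursor += data_size  # the next object's data follows this one
    return offsets


def in_addl_range(offset):
    """True if the offset from GP fits addl's signed 22-bit immediate."""
    return IMM22_MIN <= offset <= IMM22_MAX


# Two hypothetical objects: a small one, and one with 4 MiB of data.
small = (64, {"counter": 8})
big = (4 << 20, {"table": 0})

early = symbol_offsets_from_gp([small, big], gp=0, data_base=0)
late = symbol_offsets_from_gp([big, small], gp=0, data_base=0)
```

With the small object linked first, "counter" is 8 bytes from GP and in
range.  With it linked after the big object, the same variable ends up
more than 4 MiB away and is out of range, yet the compiler saw identical
source in both cases.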

Note that if a variable is defined static in the same module that we are
compiling, then we can know a little about where the variable will end
up.  We can and do emit a different code sequence in this case.  See the
discussion of "own" data in the SCRA.

> I have noticed that with -mno-pic GCC generates "movl rXX = <address>"  
> (MLX bundle). This causes a couple of questions, too:

-mno-pic violates the Itanium ABI, and can't be used for application
code.  This exists only for use by the kernel and some low level
drivers.  Or maybe it is EFI (the bios) that uses it.  I don't remember
exactly.

>    * Is it possible to use "mov rXX = <offset-or-address>" (short immediate  
> form) + "ld8 rXX = [rXX]", with ld8 being changed to "nop" by linker if  
> necessary?

Introducing a load instruction will make code slower which is
undesirable in general.  Beyond that, there is the problem that this
works only if you can put the global variable address someplace
convenient to load it from.  If you have non-pic code, then you don't
have a GP reg or got, which means there isn't anyplace convenient to
store the address.

>    * Why is mov+ld8 preferred in PIC code, and movl - in non-PIC code?

It is addl+ld8 in PIC code, not mov+ld8.  This is a fundamental property
of PIC code.  PIC means position-independent code.  PIC code can be
loaded anyplace in memory without requiring additional relocations (an
oversimplification, but details aren't important now).  PIC code works
by having one special value, the GP reg, which gets initialized at load
time.  The GP points at the GOT (global offset table), and the GOT
contains the address of every global variable used in the code.  We can
now access any global variable without any code changes, no matter where
the code is loaded in memory, by using the addl+ld8 sequence.

In non-PIC code, we don't need the got.  We just load the address of the
variable directly via a movl instruction.
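
The difference can be simulated in a few lines of Python.  This is only
a toy model: memory is a dict, the module layout (GOT at base + 0x1000,
a variable named counter at base + 0x2000) is invented for the example,
and the function names are hypothetical.

```python
def load_image(base):
    """Simulate loading one module at `base`; return (memory, gp).

    Hypothetical layout: the GOT sits at base + 0x1000 with a single slot
    for the variable `counter`, which itself lives at base + 0x2000.
    """
    gp = base + 0x1000
    counter_addr = base + 0x2000
    memory = {
        gp + 0: counter_addr,  # GOT slot, filled in at load time
        counter_addr: 42,      # the variable's value
    }
    return memory, gp


def pic_address_of_counter(memory, gp):
    """Model of the PIC sequence; the code itself never changes."""
    got_slot = gp + 0          # addl rXX = r1, <offset of GOT slot>
    return memory[got_slot]    # ld8  rXX = [rXX]


# Address baked into the code when it was linked for base 0.
LINK_TIME_COUNTER_ADDR = 0x2000


def non_pic_address_of_counter():
    """Model of the non-PIC sequence: movl rXX = <address>."""
    return LINK_TIME_COUNTER_ADDR
```

At any load base, pic_address_of_counter follows GP to the GOT slot and
finds the variable.  The non-PIC constant is only right when the module
is loaded at the base it was linked for.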

Please see the psABI and SCRA.

> On Itanium2, 8-byte loads can issue from memory ports 0 and 1 only, so our  
> scheduler places stop bits after each pair of ld8s to avoid stalls due to  
> resource oversubscription.

Does emitting these extra stop bits gain us anything?  Either way, the
hardware is going to stall, whether it figures it out on its own, or
whether we tell it to stall.  If there is no penalty for letting the
hardware stall on its own, then maybe we should.

> What can you suggest to solve this problem? Maybe linker should be taught  
> to delete stop bit following a bundle, if it relaxed the bundle so that it  
> consists of nops only, and there is a stop bit preceding this bundle?

Sounds reasonable.  We still have the resource oversubscription
problem, as there are only so many nops we can execute before the
hardware stalls, so we have to be careful about how many stop bits we
delete.  The linker currently doesn't know anything about resource
constraints or templates, as it doesn't have to, so it would be
complicated to do anything clever there.  But this is only if we want to
avoid resource stalls.  If we don't care about resource stalls due to
too many nops, we could just delete now unnecessary stop bits and not
worry about it.
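
The simple variant might look roughly like the Python sketch below.  It
uses a deliberately simplified model in which each bundle carries one
stop_after flag; on real hardware the stops are encoded in the bundle's
template field, so the linker would have to pick a different template
instead.  The Bundle type and function names are hypothetical, and no
resource constraints are checked.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bundle:
    slots: List[str]   # three instruction mnemonics
    stop_after: bool   # simplified: is there a stop at the end of the bundle?


def is_all_nops(bundle):
    """True if every slot in the bundle holds some form of nop."""
    return all(slot.startswith("nop") for slot in bundle.slots)


def drop_redundant_stops(bundles):
    """Clear the stop after an all-nop bundle when a stop already precedes it.

    Works in place, left to right, so a stop that was just cleared no longer
    counts as the preceding stop for the next bundle.
    """
    for prev, cur in zip(bundles, bundles[1:]):
        if prev.stop_after and cur.stop_after and is_all_nops(cur):
            cur.stop_after = False
    return bundles
```

The stop in front of the nop bundle still separates the instruction
groups on either side, so no dependency is violated by removing the one
behind it.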

If we want to get more involved, we could try to delete entire bundles
that end up as nops.  The reason why the IA-64 linker relaxation doesn't
try to delete instructions is because we have to worry about keeping the
bundles correct.  It is too much of a hassle to try to reorganize
bundles at link time.  But if linker relaxation gives us an entire
bundle of nops, then there would be no problem with deleting an entire
bundle, and that would avoid resource stalls with issuing useless nops.
It would take a bit of work to write the code though.  The IA-64 linker
relaxation stuff is already a bit complicated.
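
A very rough Python sketch of the bundle-deletion idea follows, mainly
to show where the extra work comes from: once a 16-byte bundle is
removed, everything after it moves, so every address that pointed into
the section has to be remapped.  The function name and data layout are
hypothetical, and the sketch ignores stop bits, relocations, and unwind
information, all of which real code would have to handle.

```python
BUNDLE_SIZE = 16  # an IA-64 bundle is 128 bits


def delete_nop_bundles(bundles):
    """Drop bundles whose three slots are all nops.

    bundles: list of 3-tuples of instruction mnemonics.
    Returns (kept_bundles, address_map), where address_map maps each old
    bundle offset to its new offset.  A deleted bundle maps to the offset
    of the next surviving bundle.
    """
    kept = []
    address_map = {}
    for index, slots in enumerate(bundles):
        old_offset = index * BUNDLE_SIZE
        # New offset is wherever the next kept bundle will be placed.
        address_map[old_offset] = len(kept) * BUNDLE_SIZE
        if not all(slot.startswith("nop") for slot in slots):
            kept.append(slots)
    return kept, address_map
```

Branch targets and any relocations against the section would then be
rewritten through address_map, which is the part that makes this more
than a few lines in practice.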

See elfNN_ia64_relax_ldxmov in binutils src/bfd/elfNN-ia64.c.
-- 
Jim Wilson, GNU Tools Support, http://www.specifix.com
