
Re: [RFC] Allow linker scripts to specify multiple output regions for an output section?


On 22.02.17 15:28, Thomas Preudhomme wrote:
> There has been some interest in the past in having syntactic support for
> specifying mapping of an output section to multiple memory regions in the
> GNU LD scripting language (eg.
> https://sourceware.org/bugzilla/show_bug.cgi?id=14299). I would like to
> propose a scheme here and welcome any feedback.

TL;DR: Detailed response begins after 6 paragraphs.

OK, in the absence of prior discussion, I'll just think aloud as I
correlate the proposal with my experience in three decades developing
embedded systems. Unfortunately, the one time an MMU was involved, that
part was already done before I joined, but memory holes are all black.

The closest scenario I recall is one with disparate physical memories,
both on and off chip. I simply added a MEMORY region for each such
block, e.g. flash, 16-bit SRAM, 8-bit SRAM, a couple of small ones for
specific memory-mapped system chips with bunches of config registers,
and maybe an FPGA in the mix. Add comments for device names and the
waitstate generator values, and the script serves as central
documentation too.
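
A minimal sketch of that style of script (the device names, addresses,
sizes and wait-state notes below are all invented placeholders, not the
real hardware):

MEMORY
{
  /* on-chip NOR flash, 2 wait states (hypothetical part)      */
  flash   (rx)  : ORIGIN = 0x00000000, LENGTH = 2M
  /* external 16-bit SRAM, 1 wait state                        */
  sram16  (rwx) : ORIGIN = 0x80000000, LENGTH = 512K
  /* external 8-bit battery-backed SRAM, 3 wait states         */
  sram8   (rw)  : ORIGIN = 0x90000000, LENGTH = 32K
  /* memory-mapped config registers on a system support chip   */
  sysregs (rw)  : ORIGIN = 0xA0000000, LENGTH = 4K
}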

With that one-to-one region mapping, there was never any conflict over
where stuff should be located, and none of the regions were
interchangeable. It is as described by "some on-chip memory and some
off-chip memory, but at non-contiguous addresses" in the above link.
And where we had both 8- and 16-bit SRAMs, it was most definitely
consistent with "a region of on-chip SRAM which performs better for
code, and the remainder performs better for data", except that using
the wrong one was fatal rather than merely inferior.

One issue I've encountered is detecting region overflow when multiple
output sections contribute to a region's content, but existing syntax
supports that, e.g.:

MEMORY
{
  flash   (rx)  : ORIGIN = 0, LENGTH = 32K
  ram    (rw!x) : ORIGIN = 0x800060, LENGTH = 2K
  eeprom (rw!x) : ORIGIN = 0x810000, LENGTH = 1K
}

. = ASSERT (_etext + SIZEOF (.data) <= LENGTH (flash),
            "Error: .text + .data collectively overflow the flash memory.");

But the need to flow across memory holes never eventuated in practice,
as a modest chunk of on-chip RAM could always be used for e.g. sdata,
leaving no need for flowing. All other regions were always incompatible,
making flowing impossible.

...
> If LMA is specified, the image(startup code etc.) most likely handles
> the copying from load address to output section VMA.

Yes, it does. And in the generic init code I've encountered, it has just
been a single copy loop for e.g. .data, performing a contiguous block
copy (plus a clear loop for .bss). (And when I've written it, that was
true too.)

> Multiple segment spec means the output section can be part of more
> than one segment and ‘fillexp’ simply fills the output section loaded
> with the fill value.

Trans-hole flowing would also require a runtime copy loop for each
non-contiguous block, or a table-driven multi-block copier, with the
run-time table somehow initialised from the linker script. (I can
imagine using variables defined in the linker script, and the .RPT
assembler directive - maybe.)
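
One way the table could come straight from the script, without new
syntax, is a copy-table convention similar to what some bare-metal
runtimes already use: the script emits {load address, run address,
size} records and the startup code walks them. A minimal sketch,
assuming a 32-bit target, hand-split output sections .data_raml and
.data_ramu, a flash region, and __copy_table_* symbols that the (not
shown) init code would consume - all names here are invented for
illustration:

  .copy.table :
  {
    __copy_table_start__ = .;
    /* one record per non-contiguous block */
    LONG (LOADADDR (.data_raml)) LONG (ADDR (.data_raml)) LONG (SIZEOF (.data_raml))
    LONG (LOADADDR (.data_ramu)) LONG (ADDR (.data_ramu)) LONG (SIZEOF (.data_ramu))
    __copy_table_end__ = .;
  } > flash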

> Now, this does not have a method to specify output section spanning multiple
> memory regions. For example, if there are 2 RAM regions RAML and RAMU and
> the user wants an output section to first fill RAML and then when RAML is
> full, i.e. when the remaining space in RAML cannot accommodate a full input
> section, start filling RAMU, the user has to split the sections into
> multiple output sections. If we extend this syntax to specify multiple
> output regions, we can make the linker map the output section to multiple
> regions by filling the output region with input sections in the order
> specified in the ‘output-section-command’ and when its full (meaning when
> the remaining gap in a region cannot accommodate one full input section, it
> starts from the next output region.

This seems to be an alternative view of the problem of asking ld to
flow code around holes in a region, something it still can't do, IIRC.
I state it that way because two non-contiguous memory regions over
which code (or data) may be interchangeably flowed are identical to a
single region with a hole.

The proposal does seem to be a way to think about addressing that issue:

> Eg.
> 
> MEMORY
> 
> {
>   RAML (rwx) : ORIGIN = 0x1FFF0000, LENGTH = 0x00010000
>   RAMU (rwx) : ORIGIN = 0x20000000, LENGTH = 0x00040000
>   RAMZ (rwx) : ORIGIN = 0x20040000, LENGTH = 0x00040000	
> }
> 
> SECTIONS
> {
>   .text 0x1000 : { *(.text) _etext = . ; }
>   .mdata  :
>   AT ( ADDR (.text) + SIZEOF (.text) )
>   { _data = . ; *(.data) *(.data.*); _edata = . ; } > RAML, RAMU, RAMZ
> }

Without the need for new syntax or complex init code generators,
having gcc flow code across up to 5 pages of flash plus .lowtext and a
floating .hightext was compatible with the linker script and tests shown
here:

http://lists.nongnu.org/archive/html/avr-gcc-list/2012-12/msg00044.html

While details have faded from wet RAM, ISTR that holes were
manufacturable by not populating any of the 5 pages, which gcc sees as
named spaces. The gcc stuff was done in the AVR back end, IIRC, while an
implementation in ld would be generic.

> Illustration:
> 
> Consider an example where we have the following input .data sections:
> 
> .data: size 0x0000FFF0
> .data.a : size 0x000000F0
> .data.b : size 0x00003000
> .data.c : size 0x00000200
> 
> With the above scheme, this will be mapped in the following way to RAML,RAMU
> and RAMZ:
> 
> RAML : (0x1FFF0000 - 0x1FFFFFF0): .data
>        (0x1FFFFFF0 - 0x1FFFFFFF): *** GAP ***

Would GAP use ALIGNMENT, or introduce a new parameter?

How would the target-specific relocations required to break code across
the hole be handled by ld? E.g. break a small AVR code loop (with 6-bit
relative addressing range) and you'll need a LJMP to bridge the hole,
and another with reversed loop conditionality to close the loop.
Multiply that task by all the possible relocs, and again by all the
possible CPU targets, and it's never-ending work for a software team for
life.

It seems more 

> RAMU : (0x20000000 - 0x200000F0): .data.a
>        (0x200000F0 - 0x200030F0): .data.b
>        (0x200030F0 - 0x200032F0): .data.c
> 
> 
> It will not affect the specification in terms of the other attributes, but
> one (LMA):
> 
> * Output section VMA: No change - this just specifies where the output
> section will start.
> 
> * type: No change - this is for the output section as a whole - output
> memory regions will not change it.
> 
> * LMA: The output section can still be loaded from one LMA and mapped to
> output VMA - the only change here is that the loader will need to map the
> output sections to VMA with the same pattern as the multiple output region
> matching code above. Can a loader do that? Can ad-hoc loaders do this? Or do
> all loaders assume that regions are continguous when output section is
> mapped to VMAs?

Contiguous. Hole-flowing is what you're proposing to implement: both the
linker-internal component (target-specific relocations) and the generic
(e.g. table-driven) multi-block copy loop synthesiser for custom init
code generation. How that would integrate with existing init code in
various implementations, I have no idea.

If LMA can also be flowed around a hole, then runtime init code must be
able to handle not only non-contiguous delivery, but gapped pick-up. Has
the complexity of simultaneously handling different gaps in both been
considered?

...
> For orthogonality and consistency, we would want to apply the multiple
> region feature to overlays too. The semantics will not be different from the
> algorithm mentioned above. The only caveat is that the overlay
> manager/loader will need to handle the swapping in and out of sections that
> run from the VMA consistently with the mapping algo described above. Do we
> want this for overlays too?

Expanding the complexity of a single-problem solution to cover other
situations seems courageous, unless it naturally falls out of the
narrower solution. Since overlays are used e.g. when RAM size or CPU
instruction addressing range is constrained but there's ample flash,
the likelihood of holes in either is limited, I suspect.

Specifying discrete output sections with VMAs placed around the physical
holes is another way to dodge them. They can all be allocated to a
global encompassing memory region. Flowing is performed manually by
assigning suitable code chunks to preferred input sections. Automating
that, as intimated above, is non-trivial.
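
A sketch of that manual dodge (the addresses, the hole, and the
.lowtext/.hightext input section names are invented for illustration):

MEMORY
{
  /* one encompassing region; 0x20010000 - 0x2001FFFF is a physical hole */
  ram (rwx) : ORIGIN = 0x20000000, LENGTH = 256K
}

SECTIONS
{
  .lowtext  0x20000000 : { *(.lowtext)  } > ram
  /* nothing is placed in the hole */
  .hightext 0x20020000 : { *(.hightext) } > ram
}

The flowing then happens at the source level, by choosing which
functions land in .lowtext or .hightext.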

Caveat: Above thoughts have flowed without aid of caffeine, and are
        recollections from old battles.

Erik

