This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[Proposed patch] Huge performance regression in ld -r since binutils >= 2.21


Hi,

While migrating all the Amadeus C++ code base from binutils 2.20 to
binutils 2.25.1, we discovered a strong performance regression in partial
linking, starting from binutils 2.21. The commit that introduced the
regression is:

commit 0672748ac053cc4f8159af0d21ca88ae8b3778e6
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 30 18:27:32 2010 +0000

    Remove relocation against discarded sections for relocatable link.

Mailing list archive link:
https://sourceware.org/ml/binutils/2010-04/msg00479.html


The purpose of the patch was to completely remove debug relocation that
were referring to discarded section, to avoid filling the resulting file
with empty relocs. However, to do that, relocs are removed, one by one,
from a contiguous array, using memmove. When you are handling a debug
section having millions of relocs, and when many of them should be
discarded, this results in millions of memmove at the end of the
relocation array.

This is the scenario that happens in our case. We have some complex C++
applications and for some componentS we package them in .o files generated
with "ld -r". We use partial linking at several levels: we have module1.o
(which embeds all .o from module 1, with ld -r), module2.o, and from these
two we build module3.o (using ld -r of module1.o + module2.o + others
module3-specific .o files). Since all these modules reference the same
common base classes, where some methods are inlined, a lot of .text.method
are inlined by g++, but then discarded by ld because many .o files embed
the same sections. So the above memmove is called many times.

When we create the first level module1.o and module2.o, everything works
fine. ld -r last less than 10s. However, that creates huges .o files,
having huge debug sections with millions of relocs. Later when we build
module3.o from module1.o and module2.o, since the memmove happens on huge
relocs array, ld end up consuming 100% CPU for minutes. With binutils
2.25.1 one single link operation lasts more than 5 minutes, compared to
10s with ld 2.20. With gold we haven't any issue, whatever the version.

I cannot provide you our proprietary code, but I have created a
reproducer. The attached Makefile (just run with "make -j N") will
generate A0.o ... A150.o, link the 50 first .o into first.o, then 50
following ones in second.o, then link the 50 last ones + first.o +
second.o in all.o. The latest link step takes tens of minutes instead of
tens of seconds.

I have a proposal of patch to fix that, see the patch attached. Note that
I know it only works for x86/x64 and that it will break other targets, I
just want to make sure you agree on the idea of the fix. To fix this
performance issue, I choose to have two iterators:
 - next_rel: that points to the next relocation that we have handled and
validated
 - rel: that points to the current relocation that we are currently
studying for relocation.
When a relocation must be discarded, rather memmove'ing the whole array,
we just increment rel but not next_rel. At each end of loop, if next_rel
don't match rel (iterators are not synced anymore because one or more
reloc was discarded), then we copy the content of rel into the content of
next_rel. SO in worst case, we copy all the element of the array only
once.

This patch was tested without regression on x64. It generates
byte-for-byte identical binaries. Linking time is back to tens of seconds
again.

Here is the proposed patch. Please note that I don't really like the usage
of "goto" and the fact that we have to modify all "continue" statements.
If you have better suggestions please raise them, I'll adapt my patch.

Cheers,
Romain

>From 50ff8e7d9c8d3f979c69a65ea131b8d7182f082f Mon Sep 17 00:00:00 2001
From: Romain Geissler <romain.geissler@amadeus.com>
Date: Wed, 4 Nov 2015 16:44:51 +0100
Subject: [PATCH] Avoid massive memmove when skipping a debug relocation with
 ld -r

---
 bfd/elf-bfd.h      | 23 ++++++++++------------
 bfd/elf32-i386.c   | 56 ++++++++++++++++++++++++++++++++++++------------------
 bfd/elf64-x86-64.c | 52 +++++++++++++++++++++++++++++++++-----------------
 3 files changed, 82 insertions(+), 49 deletions(-)

diff --git a/bfd/elf-bfd.h b/bfd/elf-bfd.h
index 6a87d71..8fb8f16 100644
--- a/bfd/elf-bfd.h
+++ b/bfd/elf-bfd.h
@@ -2491,10 +2491,9 @@ extern asection _bfd_elf_large_com_section;
    link, we remove such relocations.  Otherwise, we just want the
    section contents zeroed and avoid any special processing.  */
 #define RELOC_AGAINST_DISCARDED_SECTION(info, input_bfd, input_section,	\
-					rel, count, relend,		\
+					rel, next_rel, relend,		\
 					howto, index, contents)		\
   {									\
-    int i_;								\
     _bfd_clear_contents (howto, input_bfd, input_section,		\
 			 contents + rel[index].r_offset);		\
 									\
@@ -2514,22 +2513,20 @@ extern asection _bfd_elf_large_com_section;
 	    rel_hdr = _bfd_elf_single_rel_hdr (input_section);		\
 	    rel_hdr->sh_size -= rel_hdr->sh_entsize;			\
 									\
-	    memmove (rel, rel + count,					\
-		     (relend - rel - count) * sizeof (*rel));		\
-									\
 	    input_section->reloc_count--;				\
-	    relend -= count;						\
-	    rel--;							\
+        next_rel--;							\
 	    continue;							\
 	  }								\
       }									\
 									\
-    for (i_ = 0; i_ < count; i_++)					\
-      {									\
-	rel[i_].r_info = 0;						\
-	rel[i_].r_addend = 0;						\
-      }									\
-    rel += count - 1;							\
+    rel->r_info = 0;						\
+    rel->r_addend = 0;						\
+									\
+    if (next_rel != rel)						\
+    {									\
+      *next_rel = *rel;						\
+    }									\
+									\
     continue;								\
   }

diff --git a/bfd/elf32-i386.c b/bfd/elf32-i386.c
index 69c0b54..ef5ee15 100644
--- a/bfd/elf32-i386.c
+++ b/bfd/elf32-i386.c
@@ -3151,6 +3151,7 @@ elf_i386_relocate_section (bfd *output_bfd,
   bfd_vma *local_got_offsets;
   bfd_vma *local_tlsdesc_gotents;
   Elf_Internal_Rela *rel;
+  Elf_Internal_Rela *next_rel;
   Elf_Internal_Rela *relend;
   bfd_boolean is_vxworks_tls;
   unsigned plt_entry_size;
@@ -3176,8 +3177,9 @@ elf_i386_relocate_section (bfd *output_bfd,
   plt_entry_size = GET_PLT_ENTRY_SIZE (output_bfd);

   rel = relocs;
+  next_rel = rel;
   relend = relocs + input_section->reloc_count;
-  for (; rel < relend; rel++)
+  for (; rel < relend; rel++, next_rel++)
     {
       unsigned int r_type;
       reloc_howto_type *howto;
@@ -3196,7 +3198,7 @@ elf_i386_relocate_section (bfd *output_bfd,
       r_type = ELF32_R_TYPE (rel->r_info);
       if (r_type == R_386_GNU_VTINHERIT
 	  || r_type == R_386_GNU_VTENTRY)
-	continue;
+        goto copy_rel_and_continue;

       if ((indx = r_type) >= R_386_standard
 	  && ((indx = r_type - R_386_ext_offset) - R_386_standard
@@ -3323,10 +3325,10 @@ elf_i386_relocate_section (bfd *output_bfd,

       if (sec != NULL && discarded_section (sec))
 	RELOC_AGAINST_DISCARDED_SECTION (info, input_bfd, input_section,
-					 rel, 1, relend, howto, 0, contents);
+					 rel, next_rel, relend, howto, 0, contents);

       if (info->relocatable)
-	continue;
+        goto copy_rel_and_continue;

       /* Since STT_GNU_IFUNC symbol must go through PLT, we handle
 	 it here if it is defined in a non-shared object.  */
@@ -3419,7 +3421,7 @@ elf_i386_relocate_section (bfd *output_bfd,
 		     we need to include the symbol value so that it
 		     becomes an addend for the dynamic reloc.  For an
 		     internal symbol, we have updated addend.  */
-		  continue;
+          goto copy_rel_and_continue;
 		}
 	      /* FALLTHROUGH */
 	    case R_386_PC32:
@@ -3759,7 +3761,7 @@ elf_i386_relocate_section (bfd *output_bfd,
 		 need to include the symbol value so that it becomes
 		 an addend for the dynamic reloc.  */
 	      if (! relocate)
-		continue;
+            goto copy_rel_and_continue;
 	    }
 	  break;

@@ -3833,8 +3835,7 @@ elf_i386_relocate_section (bfd *output_bfd,
 		  bfd_put_32 (output_bfd, elf_i386_tpoff (info, relocation),
 			      contents + roff);
 		  /* Skip R_386_PC32/R_386_PLT32.  */
-		  rel++;
-		  continue;
+          goto skip_next_rel_and_continue;
 		}
 	      else if (ELF32_R_TYPE (rel->r_info) == R_386_TLS_GOTDESC)
 		{
@@ -3859,7 +3860,7 @@ elf_i386_relocate_section (bfd *output_bfd,
 			     contents + roff - 1);
 		  bfd_put_32 (output_bfd, -elf_i386_tpoff (info, relocation),
 			      contents + roff);
-		  continue;
+          goto copy_rel_and_continue;
 		}
 	      else if (ELF32_R_TYPE (rel->r_info) == R_386_TLS_DESC_CALL)
 		{
@@ -3874,7 +3875,7 @@ elf_i386_relocate_section (bfd *output_bfd,
 		  roff = rel->r_offset;
 		  bfd_put_8 (output_bfd, 0x66, contents + roff);
 		  bfd_put_8 (output_bfd, 0x90, contents + roff + 1);
-		  continue;
+          goto copy_rel_and_continue;
 		}
 	      else if (ELF32_R_TYPE (rel->r_info) == R_386_TLS_IE)
 		{
@@ -3927,7 +3928,7 @@ elf_i386_relocate_section (bfd *output_bfd,
 		    }
 		  bfd_put_32 (output_bfd, -elf_i386_tpoff (info, relocation),
 			      contents + rel->r_offset);
-		  continue;
+          goto copy_rel_and_continue;
 		}
 	      else
 		{
@@ -3976,7 +3977,7 @@ elf_i386_relocate_section (bfd *output_bfd,
 		  else
 		    bfd_put_32 (output_bfd, elf_i386_tpoff (info, relocation),
 				contents + rel->r_offset);
-		  continue;
+          goto copy_rel_and_continue;
 		}
 	    }

@@ -4172,8 +4173,7 @@ elf_i386_relocate_section (bfd *output_bfd,
 			  - htab->elf.sgotplt->output_offset,
 			  contents + roff + 8);
 	      /* Skip R_386_PLT32.  */
-	      rel++;
-	      continue;
+          goto skip_next_rel_and_continue;
 	    }
 	  else if (ELF32_R_TYPE (rel->r_info) == R_386_TLS_GOTDESC)
 	    {
@@ -4212,7 +4212,7 @@ elf_i386_relocate_section (bfd *output_bfd,
 			  - htab->elf.sgotplt->output_section->vma
 			  - htab->elf.sgotplt->output_offset,
 			  contents + roff);
-	      continue;
+          goto copy_rel_and_continue;
 	    }
 	  else if (ELF32_R_TYPE (rel->r_info) == R_386_TLS_DESC_CALL)
 	    {
@@ -4245,7 +4245,7 @@ elf_i386_relocate_section (bfd *output_bfd,
 		  bfd_put_8 (output_bfd, 0xd8, contents + roff + 1);
 		}

-	      continue;
+          goto copy_rel_and_continue;
 	    }
 	  else
 	    BFD_ASSERT (FALSE);
@@ -4269,8 +4269,7 @@ elf_i386_relocate_section (bfd *output_bfd,
 	      memcpy (contents + rel->r_offset - 2,
 		      "\x65\xa1\0\0\0\0\x90\x8d\x74\x26", 11);
 	      /* Skip R_386_PC32/R_386_PLT32.  */
-	      rel++;
-	      continue;
+          goto skip_next_rel_and_continue;
 	    }

 	  if (htab->elf.sgot == NULL)
@@ -4335,7 +4334,7 @@ elf_i386_relocate_section (bfd *output_bfd,
 		abort ();
 	      elf_append_rel (output_bfd, sreloc, &outrel);
 	      if (indx)
-		continue;
+          goto copy_rel_and_continue;
 	      else if (r_type == R_386_TLS_LE_32)
 		relocation = elf_i386_dtpoff_base (info) - relocation;
 	      else
@@ -4410,6 +4409,25 @@ check_relocation_error:
 	      return FALSE;
 	    }
 	}
+
+copy_rel_and_continue:
+      if (next_rel != rel)
+      {
+        *next_rel = *rel;
+      }
+
+      continue;
+
+skip_next_rel_and_continue:
+      if (next_rel != rel)
+      {
+        *next_rel = *rel;
+      }
+
+      rel++;
+      next_rel++;
+
+      continue;
     }

   return TRUE;
diff --git a/bfd/elf64-x86-64.c b/bfd/elf64-x86-64.c
index 18983e8..9dd5f13 100644
--- a/bfd/elf64-x86-64.c
+++ b/bfd/elf64-x86-64.c
@@ -3439,6 +3439,7 @@ elf_x86_64_relocate_section (bfd *output_bfd,
   bfd_vma *local_got_offsets;
   bfd_vma *local_tlsdesc_gotents;
   Elf_Internal_Rela *rel;
+  Elf_Internal_Rela *next_rel;
   Elf_Internal_Rela *relend;
   const unsigned int plt_entry_size = GET_PLT_ENTRY_SIZE (info->output_bfd);

@@ -3455,8 +3456,9 @@ elf_x86_64_relocate_section (bfd *output_bfd,
   elf_x86_64_set_tls_module_base (info);

   rel = relocs;
+  next_rel = rel;
   relend = relocs + input_section->reloc_count;
-  for (; rel < relend; rel++)
+  for (; rel < relend; rel++, next_rel++)
     {
       unsigned int r_type;
       reloc_howto_type *howto;
@@ -3476,7 +3478,7 @@ elf_x86_64_relocate_section (bfd *output_bfd,
       r_type = ELF32_R_TYPE (rel->r_info);
       if (r_type == (int) R_X86_64_GNU_VTINHERIT
 	  || r_type == (int) R_X86_64_GNU_VTENTRY)
-	continue;
+        goto copy_rel_and_continue;

       if (r_type >= (int) R_X86_64_standard)
 	{
@@ -3535,10 +3537,10 @@ elf_x86_64_relocate_section (bfd *output_bfd,

       if (sec != NULL && discarded_section (sec))
 	RELOC_AGAINST_DISCARDED_SECTION (info, input_bfd, input_section,
-					 rel, 1, relend, howto, 0, contents);
+					 rel, next_rel, relend, howto, 0, contents);

       if (info->relocatable)
-	continue;
+        goto copy_rel_and_continue;

       if (rel->r_addend == 0 && !ABI_64_P (output_bfd))
 	{
@@ -3682,7 +3684,7 @@ elf_x86_64_relocate_section (bfd *output_bfd,
 		     we need to include the symbol value so that it
 		     becomes an addend for the dynamic reloc.  For an
 		     internal symbol, we have updated addend.  */
-		  continue;
+          goto copy_rel_and_continue;
 		}
 	      /* FALLTHROUGH */
 	    case R_X86_64_PC32:
@@ -4223,7 +4225,7 @@ direct:
 		 need to include the symbol value so that it becomes
 		 an addend for the dynamic reloc.  */
 	      if (! relocate)
-		continue;
+            goto copy_rel_and_continue;
 	    }

 	  break;
@@ -4295,8 +4297,7 @@ direct:
 			      elf_x86_64_tpoff (info, relocation),
 			      contents + roff + 8 + largepic);
 		  /* Skip R_X86_64_PC32/R_X86_64_PLT32/R_X86_64_PLTOFF64.  */
-		  rel++;
-		  continue;
+          goto skip_next_rel_and_continue;
 		}
 	      else if (ELF32_R_TYPE (rel->r_info) == R_X86_64_GOTPC32_TLSDESC)
 		{
@@ -4319,7 +4320,7 @@ direct:
 		  bfd_put_32 (output_bfd,
 			      elf_x86_64_tpoff (info, relocation),
 			      contents + roff);
-		  continue;
+		  goto copy_rel_and_continue;
 		}
 	      else if (ELF32_R_TYPE (rel->r_info) == R_X86_64_TLSDESC_CALL)
 		{
@@ -4330,7 +4331,7 @@ direct:
 		     xchg %ax,%ax.  */
 		  bfd_put_8 (output_bfd, 0x66, contents + roff);
 		  bfd_put_8 (output_bfd, 0x90, contents + roff + 1);
-		  continue;
+		  goto copy_rel_and_continue;
 		}
 	      else if (ELF32_R_TYPE (rel->r_info) == R_X86_64_GOTTPOFF)
 		{
@@ -4405,7 +4406,7 @@ direct:
 		  bfd_put_32 (output_bfd,
 			      elf_x86_64_tpoff (info, relocation),
 			      contents + roff);
-		  continue;
+		  goto copy_rel_and_continue;
 		}
 	      else
 		BFD_ASSERT (FALSE);
@@ -4577,8 +4578,7 @@ direct:
 		  bfd_put_32 (output_bfd, relocation,
 			      contents + roff + 8 + largepic);
 		  /* Skip R_X86_64_PLT32/R_X86_64_PLTOFF64.  */
-		  rel++;
-		  continue;
+          goto skip_next_rel_and_continue;
 		}
 	      else if (ELF32_R_TYPE (rel->r_info) == R_X86_64_GOTPC32_TLSDESC)
 		{
@@ -4603,7 +4603,7 @@ direct:
 			      - input_section->output_offset
 			      - 4,
 			      contents + roff);
-		  continue;
+		  goto copy_rel_and_continue;
 		}
 	      else if (ELF32_R_TYPE (rel->r_info) == R_X86_64_TLSDESC_CALL)
 		{
@@ -4616,7 +4616,7 @@ direct:

 		  bfd_put_8 (output_bfd, 0x66, contents + roff);
 		  bfd_put_8 (output_bfd, 0x90, contents + roff + 1);
-		  continue;
+		  goto copy_rel_and_continue;
 		}
 	      else
 		BFD_ASSERT (FALSE);
@@ -4661,8 +4661,7 @@ direct:
 		memcpy (contents + rel->r_offset - 3,
 			"\x0f\x1f\x40\x00\x64\x8b\x04\x25\0\0\0", 12);
 	      /* Skip R_X86_64_PC32/R_X86_64_PLT32/R_X86_64_PLTOFF64.  */
-	      rel++;
-	      continue;
+          goto skip_next_rel_and_continue;
 	    }

 	  if (htab->elf.sgot == NULL)
@@ -4777,6 +4776,25 @@ check_relocation_error:
 	      return FALSE;
 	    }
 	}
+
+copy_rel_and_continue:
+      if (next_rel != rel)
+      {
+        *next_rel = *rel;
+      }
+
+      continue;
+
+skip_next_rel_and_continue:
+      if (next_rel != rel)
+      {
+        *next_rel = *rel;
+      }
+
+      rel++;
+      next_rel++;
+
+      continue;
     }

   return TRUE;
-- 
2.3.0



Reproducer Makefile: just run "make -j N" and look for the last relocation
time

CXXFLAGS=-std=gnu++11 -O2 -ggdb3

MAX=150
MAX_ONE_THIRD=$(shell echo $$((${MAX} / 3)))
MAX_ONE_THIRD_PLUS_ONE=$(shell echo $$((${MAX_ONE_THIRD} + 1)))
MAX_TWO_THIRD=$(shell echo $$((2 * ${MAX} / 3)))
MAX_TWO_THIRD_PLUS_ONE=$(shell echo $$((${MAX_TWO_THIRD} + 1)))

define A0_H_FILE_CONTENT
#ifndef HEADER_GUARD_A0
#define HEADER_GUARD_A0

#include <string>

struct A0
{
    A0() {}

    ~A0()
    {
        relocatedCall2();
    }

    A0(const std::string& value)
        :_value(value)
    {}

    std::string inlinedCall() const
    {
        return _value + relocatedCall1();
    }

    std::string relocatedCall1() const;

    void relocatedCall2();

    std::string _value;
};

#endif
endef

define A0_CPP_FILE_CONTENT
#include "A0.h"

std::string A0::relocatedCall1() const
{
    return "relocatedCall1";
}

void A0::relocatedCall2() {}
endef

#$1 = number N
#$2 = N - 1
define AXXX_H_FILE_CONTENT
#ifndef HEADER_GUARD_A$1
#define HEADER_GUARD_A$1

#include <string>
#include <map>
#include "A$2.h"

struct A$1
{
    A$1() {}

    A$1(const std::string& key, const A$2& value)
    {
        A$2 oldValueByInlinedCopy = _map[key];
        _map[key] = value;
        _map[oldValueByInlinedCopy.inlinedCall()] = oldValueByInlinedCopy;
    }

    A$1(const std::map<std::string, A$2>& map)
        :_map(map)
    {}

    std::string inlinedCall()
    {
        return "inlinedCall" + relocatedCall();
    }

    std::map<std::string, A$2> _map;

    std::string relocatedCall() const;
};

#endif
endef

#$1 = number N
define AXXX_CPP_FILE_CONTENT
#include "A$1.h"
#include "A0.h"

std::string A$1::relocatedCall() const
{
    std::string s;

    for (auto pairByInlinedCopy : _map)
    {
        s += pairByInlinedCopy.first + " = {" + pairByInlinedCopy.second.inlinedCall() + "}, ";
    }

    A0 inlinedA0("A$1");

    return s + inlinedA0.inlinedCall();
}
endef

all:all.o

first.o second.o all.o:
	time ${LD} -r -o $@ $^

first.o:$(shell echo A{0..${MAX_ONE_THIRD}}.o)
second.o:$(shell echo A{${MAX_ONE_THIRD_PLUS_ONE}..${MAX_TWO_THIRD}}.o)
all.o:$(shell echo A{${MAX_TWO_THIRD_PLUS_ONE}..${MAX}}.o) first.o second.o

$(shell echo A{0..${MAX}}.o):|$(shell echo A{0..${MAX}}.{h,cpp})
	$(COMPILE.cc) $(OUTPUT_OPTION) ${@:%.o=%.cpp}

$(shell echo A{0..${MAX}}.{h,cpp}):
	cat>$@<<<"$${FILE_CONTENT}"

A0.h:export FILE_CONTENT=${A0_H_FILE_CONTENT}
A0.cpp:export FILE_CONTENT=${A0_CPP_FILE_CONTENT}
MINUS_ONE_SEQUENCE:=$(shell echo {0..${MAX}})
MINUS_ONE=$(word $1,${MINUS_ONE_SEQUENCE})
A%.h:export FILE_CONTENT=$(call AXXX_H_FILE_CONTENT,${*:A%=%},$(call MINUS_ONE,${*:A%=%}))
A%.cpp:export FILE_CONTENT=$(call AXXX_CPP_FILE_CONTENT,${*:A%=%})

clean:
	${RM} *.h *.cpp *.o

soft-clean:
	${RM} all.o first.o second.o

.PHONY:all clean soft-clean
.PRECIOUS:$(shell echo A{0..${MAX}}.{h,cpp,o})


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]