This is the mail archive of the crossgcc@sourceware.org mailing list for the crossgcc project.

See the CrossGCC FAQ for lots more information.



ARM floating point differences


I posted about this problem to gcc-help, and then to gnuarm, but I haven't gotten many responses lately on gnuarm, so I thought I'd try here. The original messages are appended at the bottom.

Basically, an older set of tools I built generates much faster floating point code. A newer set of tools I built does not generate such fast FP code, and I'd like to figure out how to rebuild it so that it does.


I've compared some of the floating point code in the disassembly of our code. In one example, __addsf3, the code from one toolchain is markedly different from the other. I've included the disassembly below.
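
For reference, the listings below are plain disassembly of the linked image; something like this should reproduce them from the ELF (the output file names are just my own labels):

$ xscale-elf-objdump -d h.elf > h-new.dis    # new (slow) toolchain
$ arm-elf-objdump -d h.elf > h-old.dis       # old (fast) toolchain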

Clearly, the floating point code in the fast case is highly optimized.
It doesn't use the stack, it doesn't branch to other routines, etc.

Is there a configuration option I missed when I built the toolchain?
Or something else? The "slow" toolchain was built from the same or more recent versions of the tools. In practice, we probably won't use any floating point code, but it makes me wonder what other code lacks optimization.
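
One thing I'm wondering is whether I should have passed explicit CPU/float options when configuring the new tools. For example, something like this (a guess at what might matter, not what I actually ran):

$ ../combined/configure --target=xscale-elf --with-newlib \
      --with-cpu=xscale --with-float=soft \
      --disable-nls --disable-newlib-supplied-syscalls \
      --prefix=/usr/local/gcc-xscale-elf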


Any help would be greatly appreciated.

TIA,
Rick


The slow code generated is:


80101a30 <__addsf3>:
80101a30:	e92d4030 	stmdb	sp!, {r4, r5, lr}
80101a34:	e24dd038 	sub	sp, sp, #56	; 0x38
80101a38:	e28d5020 	add	r5, sp, #32	; 0x20
80101a3c:	e58d0034 	str	r0, [sp, #52]
80101a40:	e58d1030 	str	r1, [sp, #48]
80101a44:	e28d0034 	add	r0, sp, #52	; 0x34
80101a48:	e1a01005 	mov	r1, r5
80101a4c:	e28d4010 	add	r4, sp, #16	; 0x10
80101a50:	eb0001b1 	bl	8010211c <__unpack_f>
80101a54:	e28d0030 	add	r0, sp, #48	; 0x30
80101a58:	e1a01004 	mov	r1, r4
80101a5c:	eb0001ae 	bl	8010211c <__unpack_f>
80101a60:	e1a01004 	mov	r1, r4
80101a64:	e1a0200d 	mov	r2, sp
80101a68:	e1a00005 	mov	r0, r5
80101a6c:	ebffff55 	bl	801017c8 <_fpadd_parts>
80101a70:	eb00014e 	bl	80101fb0 <__pack_f>
80101a74:	e28dd038 	add	sp, sp, #56	; 0x38
80101a78:	e8bd8030 	ldmia	sp!, {r4, r5, pc}


While the fast code (despite being much longer) is:


80101750 <__addsf3>:
80101750:	e1b02080 	lsls	r2, r0, #1
80101754:	11b03081 	lslsne	r3, r1, #1
80101758:	11320003 	teqne	r2, r3
8010175c:	11f0cc42 	mvnsne	ip, r2, asr #24
80101760:	11f0cc43 	mvnsne	ip, r3, asr #24
80101764:	0a00003c 	beq	8010185c <__addsf3+0x10c>
80101768:	e1a02c22 	lsr	r2, r2, #24
8010176c:	e0723c23 	rsbs	r3, r2, r3, lsr #24
80101770:	c0822003 	addgt	r2, r2, r3
80101774:	c0201001 	eorgt	r1, r0, r1
80101778:	c0210000 	eorgt	r0, r1, r0
8010177c:	c0201001 	eorgt	r1, r0, r1
80101780:	b2633000 	rsblt	r3, r3, #0	; 0x0
80101784:	e3530019 	cmp	r3, #25	; 0x19
80101788:	812fff1e 	bxhi	lr
8010178c:	e3100102 	tst	r0, #-2147483648	; 0x80000000
80101790:	e3800502 	orr	r0, r0, #8388608	; 0x800000
80101794:	e3c004ff 	bic	r0, r0, #-16777216	; 0xff000000
80101798:	12600000 	rsbne	r0, r0, #0	; 0x0
8010179c:	e3110102 	tst	r1, #-2147483648	; 0x80000000
801017a0:	e3811502 	orr	r1, r1, #8388608	; 0x800000
801017a4:	e3c114ff 	bic	r1, r1, #-16777216	; 0xff000000
801017a8:	12611000 	rsbne	r1, r1, #0	; 0x0
801017ac:	e1320003 	teq	r2, r3
801017b0:	0a000023 	beq	80101844 <__addsf3+0xf4>
801017b4:	e2422001 	sub	r2, r2, #1	; 0x1
801017b8:	e0900351 	adds	r0, r0, r1, asr r3
801017bc:	e2633020 	rsb	r3, r3, #32	; 0x20
801017c0:	e1a01311 	lsl	r1, r1, r3
801017c4:	e2003102 	and	r3, r0, #-2147483648	; 0x80000000
801017c8:	5a000001 	bpl	801017d4 <__addsf3+0x84>
801017cc:	e2711000 	rsbs	r1, r1, #0	; 0x0
801017d0:	e2e00000 	rsc	r0, r0, #0	; 0x0
801017d4:	e3500502 	cmp	r0, #8388608	; 0x800000
801017d8:	3a00000b 	bcc	8010180c <__addsf3+0xbc>
801017dc:	e3500401 	cmp	r0, #16777216	; 0x1000000
801017e0:	3a000004 	bcc	801017f8 <__addsf3+0xa8>
801017e4:	e1b000a0 	lsrs	r0, r0, #1
801017e8:	e1a01061 	rrx	r1, r1
801017ec:	e2822001 	add	r2, r2, #1	; 0x1
801017f0:	e35200fe 	cmp	r2, #254	; 0xfe
801017f4:	2a00002d 	bcs	801018b0 <__addsf3+0x160>
801017f8:	e3510102 	cmp	r1, #-2147483648	; 0x80000000
801017fc:	e0a00b82 	adc	r0, r0, r2, lsl #23
80101800:	03c00001 	biceq	r0, r0, #1	; 0x1
80101804:	e1800003 	orr	r0, r0, r3
80101808:	e12fff1e 	bx	lr
8010180c:	e1b01081 	lsls	r1, r1, #1
80101810:	e0a00000 	adc	r0, r0, r0
80101814:	e3100502 	tst	r0, #8388608	; 0x800000
80101818:	e2422001 	sub	r2, r2, #1	; 0x1
8010181c:	1afffff5 	bne	801017f8 <__addsf3+0xa8>
80101820:	e16fcf10 	clz	ip, r0
80101824:	e24cc008 	sub	ip, ip, #8	; 0x8
80101828:	e052200c 	subs	r2, r2, ip
8010182c:	e1a00c10 	lsl	r0, r0, ip
80101830:	a0800b82 	addge	r0, r0, r2, lsl #23
80101834:	b2622000 	rsblt	r2, r2, #0	; 0x0
80101838:	a1800003 	orrge	r0, r0, r3
8010183c:	b1830230 	orrlt	r0, r3, r0, lsr r2
80101840:	e12fff1e 	bx	lr
80101844:	e3320000 	teq	r2, #0	; 0x0
80101848:	e2211502 	eor	r1, r1, #8388608	; 0x800000
8010184c:	02200502 	eoreq	r0, r0, #8388608	; 0x800000
80101850:	02822001 	addeq	r2, r2, #1	; 0x1
80101854:	12433001 	subne	r3, r3, #1	; 0x1
80101858:	eaffffd5 	b	801017b4 <__addsf3+0x64>
8010185c:	e1a03081 	lsl	r3, r1, #1
80101860:	e1f0cc42 	mvns	ip, r2, asr #24
80101864:	11f0cc43 	mvnsne	ip, r3, asr #24
80101868:	0a000013 	beq	801018bc <__addsf3+0x16c>
8010186c:	e1320003 	teq	r2, r3
80101870:	0a000002 	beq	80101880 <__addsf3+0x130>
80101874:	e3320000 	teq	r2, #0	; 0x0
80101878:	01a00001 	moveq	r0, r1
8010187c:	e12fff1e 	bx	lr
80101880:	e1300001 	teq	r0, r1
80101884:	13a00000 	movne	r0, #0	; 0x0
80101888:	112fff1e 	bxne	lr
8010188c:	e31204ff 	tst	r2, #-16777216	; 0xff000000
80101890:	1a000002 	bne	801018a0 <__addsf3+0x150>
80101894:	e1b00080 	lsls	r0, r0, #1
80101898:	23800102 	orrcs	r0, r0, #-2147483648	; 0x80000000
8010189c:	e12fff1e 	bx	lr
801018a0:	e2922402 	adds	r2, r2, #33554432	; 0x2000000
801018a4:	32800502 	addcc	r0, r0, #8388608	; 0x800000
801018a8:	312fff1e 	bxcc	lr
801018ac:	e2003102 	and	r3, r0, #-2147483648	; 0x80000000
801018b0:	e383047f 	orr	r0, r3, #2130706432	; 0x7f000000
801018b4:	e3800502 	orr	r0, r0, #8388608	; 0x800000
801018b8:	e12fff1e 	bx	lr
801018bc:	e1f02c42 	mvns	r2, r2, asr #24
801018c0:	11a00001 	movne	r0, r1
801018c4:	01f03c43 	mvnseq	r3, r3, asr #24
801018c8:	11a01000 	movne	r1, r0
801018cc:	e1b02480 	lsls	r2, r0, #9
801018d0:	01b03481 	lslseq	r3, r1, #9
801018d4:	01300001 	teqeq	r0, r1
801018d8:	13800501 	orrne	r0, r0, #4194304	; 0x400000
801018dc:	e12fff1e 	bx	lr




A little more information: there seems to be a difference in the floating point flags of the resulting binaries (which would go a long way toward explaining what I'm seeing). The ELF built with the more recent tools results in this:

$ xscale-elf-readelf -h h.elf
ELF Header:
 Magic:   7f 45 4c 46 01 01 01 61 00 00 00 00 00 00 00 00
 Class:                             ELF32
 Data:                              2's complement, little endian
 Version:                           1 (current)
 OS/ABI:                            ARM
 ABI Version:                       0
 Type:                              EXEC (Executable file)
 Machine:                           ARM
 Version:                           0x1
 Entry point address:               0x80100000
 Start of program headers:          52 (bytes into file)
 Start of section headers:          448508 (bytes into file)
 Flags:                             0x602, has entry point, GNU EABI, software FP, VFP
 Size of this header:               52 (bytes)
 Size of program headers:           32 (bytes)
 Number of program headers:         1
 Size of section headers:           40 (bytes)
 Number of section headers:         25
 Section header string table index: 22


The ELF built with the older (faster) tools results in this:


$ arm-elf-readelf -h h.elf
ELF Header:
 Magic:   7f 45 4c 46 01 01 01 61 00 00 00 00 00 00 00 00
 Class:                             ELF32
 Data:                              2's complement, little endian
 Version:                           1 (current)
 OS/ABI:                            ARM
 ABI Version:                       0
 Type:                              EXEC (Executable file)
 Machine:                           ARM
 Version:                           0x1
 Entry point address:               0x80100000
 Start of program headers:          52 (bytes into file)
 Start of section headers:          411484 (bytes into file)
 Flags:                             0x402, has entry point, GNU EABI, VFP
 Size of this header:               52 (bytes)
 Size of program headers:           32 (bytes)
 Number of program headers:         1
 Size of section headers:           40 (bytes)
 Number of section headers:         26
 Section header string table index: 23

The relevant change is in the Flags field: the new tools set
"software FP", while the old tools don't.

Now, the processor doesn't have hardware floating point, yet the code
runs in both cases, so some kind of software floating point code is
being emitted.
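
In case it's useful, something like this should show which soft-float helpers got linked into each image (the symbol names in the grep are just the ones visible in the listings above):

$ xscale-elf-nm h.elf | grep -e addsf3 -e fpadd_parts -e pack_f
$ arm-elf-nm h.elf | grep -e addsf3 -e fpadd_parts -e pack_f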

TIA,
Rick

(original post below)


I've been building tools targeting the Marvell Xscale processor a
lot lately. A set of tools I built a few months ago seems to generate
much faster code on our target hardware than tools I built more
recently. There were some significant differences in the way the
tools were built, but it doesn't seem like that's enough to explain
the difference. Unfortunately, I don't remember exactly how I built
the older toolchain, so I'm hoping someone can help me determine
what it was by looking at the build result.

Old tools:

$ arm-elf-gcc -v
Using built-in specs.
Target: arm-elf
Configured with: ../configure --prefix=/usr/local/arm3 --target=arm-elf --with-newlib --with-cpu=xscale --enable-languages=c,c++
Thread model: single
gcc version 4.2.1

$ arm-elf-ld --version
GNU ld (GNU Binutils) 2.18

How do I tell what version of newlib is installed (I think it's 1.15)?
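
My guess is that grepping the installed headers for _NEWLIB_VERSION would tell me; the include path below is inferred from the --prefix shown above:

$ grep -rn _NEWLIB_VERSION /usr/local/arm3/arm-elf/include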

The old tools were built using a multistep process, where I first built
binutils, then gcc, then newlib (I don't recall if I did a stage 1 GCC
build first, but somehow I got it all working). A rough sketch of what
I remember doing is below.
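
From memory, the sequence was roughly the following, with the prefix and flags reconstructed from the gcc -v output above, so treat it as approximate:

# binutils first
$ mkdir build-binutils && cd build-binutils
$ ../binutils-2.18/configure --prefix=/usr/local/arm3 --target=arm-elf
$ make && make install
$ cd ..

# then gcc
$ mkdir build-gcc && cd build-gcc
$ ../gcc-4.2.1/configure --prefix=/usr/local/arm3 --target=arm-elf \
      --with-newlib --with-cpu=xscale --enable-languages=c,c++
$ make && make install    # possibly "make all-gcc && make install-gcc" first -- I don't recall
$ cd ..

# then newlib against the new compiler
$ mkdir build-newlib && cd build-newlib
$ ../newlib-1.15.0/configure --prefix=/usr/local/arm3 --target=arm-elf
$ make && make install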


The latest tools are slightly different, and were built with a combined tree build:

gcc-4.2.2
binutils-2.17
newlib-1.15

$ xscale-elf-gcc -v
Using built-in specs.
Target: xscale-elf
Configured with: ../combined/configure --target=xscale-elf --disable-nls --with-newlib --prefix=/usr/local/gcc-xscale-elf --disable-newlib-supplied-syscalls
Thread model: single
gcc version 4.2.2
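
I don't remember the exact steps I used to assemble the combined tree, but it was roughly the usual recipe of merging the binutils and gcc sources and linking newlib/libgloss in; something like this (directory names and details approximate):

$ mkdir combined
$ cp -r binutils-2.17/* combined/
$ cp -r gcc-4.2.2/* combined/        # gcc's copies of shared files overwrite binutils'
$ ln -s $PWD/newlib-1.15/newlib combined/newlib
$ ln -s $PWD/newlib-1.15/libgloss combined/libgloss
$ mkdir build && cd build
$ ../combined/configure --target=xscale-elf --disable-nls --with-newlib \
      --prefix=/usr/local/gcc-xscale-elf --disable-newlib-supplied-syscalls
$ make && make install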



I'm sorry I can't provide better information, but I'd really like to
figure this out. The code doesn't call into the standard C library,
but it does make use of a lot of floating point code. Is it possible
that this code is better with the other tools (either built with more
optimization, or just generally different)? I don't know; I'm just
speculating. It is C++ code (bouncing balls on a screen, where the
balls are object instances).

Thanks for any help!

--
Rick




