[RFC] arm: use built-in byte swap function
Nicolas Pitre
nico at fluxnic.net
Tue Feb 19 22:17:52 EST 2013
On Tue, 19 Feb 2013, Kim Phillips wrote:
> On Fri, 8 Feb 2013 22:16:47 -0500
> Nicolas Pitre <nico at fluxnic.net> wrote:
>
> > Not only that, but in many cases the results are wildly different given
> > the same config:
> >
> > > imx_v6_v7_defconfig: 7637605 7636935 -670
> > > lart_defconfig: 2922550 2926600 4050
> > > mxs_defconfig: 11071139 11070893 -246
> >
> > The mxs_defconfig became much better while lart_defconfig regressed a
> > lot.
> >
> > > Haven't looked at why.
> >
> > Would be a good idea since this is rather weird and gcc could benefit
> > from your findings.
>
> The following is next-20130207 built with Linaro gcc 4.7.1 [1], and
> before and after the diff at the bottom of this email (and with
> normalized linux version string sizes):
>
> lart_defconfig: 2752106 120864 56444 2929414 2cb306 vmlinux
> lart_defconfig: 2756092 120864 56444 2933400 2cc298 vmlinux #builtin-bswap
>
> mxs_defconfig: 5229115 280572 5569648 11079335 a90ea7 vmlinux
> mxs_defconfig: 5228969 280552 5569648 11079169 a90e01 vmlinux #builtin-bswap
>
> imx_v6_v7_defconfig: 6935025 356172 360648 7651845 74c205 vmlinux
> imx_v6_v7_defconfig: 6934091 356180 360648 7650919 74be67 vmlinux #builtin-bswap
>
>
> so builtin-bswap improved mxs and imx_v6_v7 but in lart, it _added_
> 3986 bytes to .text -> not good.
>
> Getting a closer look at lart, bloat-o-meter says the code actually
> shrunk:
>
> add/remove: 7/1 grow/shrink: 11/19 up/down: 298/-356 (-58)
> function old new delta
> inet_abc_len - 96 +96
> __bswapdi2 - 52 +52
> __bswapsi2 - 32 +32
> icmp_unreach 472 492 +20
> xfrm_selector_match 988 1000 +12
> fib_table_insert 2176 2188 +12
> __kstrtab___bswapsi2 - 11 +11
> __kstrtab___bswapdi2 - 11 +11
> __ksymtab___bswapsi2 - 8 +8
> __ksymtab___bswapdi2 - 8 +8
> vermagic 51 57 +6
> linux_banner 230 236 +6
> xfrm_replay_check_esn 320 324 +4
> xfrm_replay_check_bmp 200 204 +4
> xfrm_replay_check 152 156 +4
> static.tcp_parse_aligned_timestamp 80 84 +4
> fib_table_delete 708 712 +4
> cookie_v4_check 1316 1320 +4
> tcp_tso_segment 728 724 -4
> tcp_options_write 724 720 -4
> ip_rt_ioctl 1152 1148 -4
> fib_trie_seq_show 724 720 -4
> crc32_be 448 444 -4
> xfrm_stateonly_find 640 632 -8
> tcp_finish_connect 276 268 -8
> static.tcp_v4_send_ack 480 472 -8
> __xfrm_state_lookup 356 348 -8
> __xfrm_state_bump_genids 436 428 -8
> __find_acq_core 1256 1248 -8
> cookie_v4_init_sequence 272 260 -12
> __xfrm_state_insert 616 600 -16
> sys_swapon 2500 2480 -20
> xfrm_state_find 2420 2396 -24
> xfrm_hash_resize 1620 1596 -24
> fib_route_seq_show 560 536 -24
> fib_table_dump 704 676 -28
> devinet_ioctl 1856 1796 -60
> static.inet_abc_len 80 - -80
>
> Comparing System.maps, .rodata starts at the same address:
>
> c020a000 R __start_rodata
> c020a000 R __start_rodata #builtin-bswap
>
> however, changes including the __bswap[sd]i2 implementation push
> the .rodata section size just over the 4KiB alignment boundary
> specified in arm/kernel/vmlinux.lds:
>
> no builtin_bswap:
>
> c028ffc4 R __stop___modver
> c0290000 R __end_rodata
> c0290000 R __start___ex_table
>
> with builtin_bswap:
>
> c0290068 R __stop___modver
> c0291000 R __end_rodata
> c0291000 R __start___ex_table
>
> So, AFAICT, that's why we see a total increase in .text for lart,
> and, looking at both numbers being a little less than 4KiB, I
> suspect the same with whatever happened with mxs above.
OK. At least we do have a plausible explanation now. The actual code
being smaller should compensate for section alignment loss.
> ok, so to avoid recursion, I've enforced a -O2 on bswapsdi2.o.
Not only to avoid recursion, but also the horrible assembly output from -Os.
> Here's the new diff:
>
> changes from last diff:
> - enforce -O2 for bswapsdi2.o
> - fix building out-of-source tree
>
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 4265a26..5e8b735 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -58,6 +58,7 @@ config ARM
> select CLONE_BACKWARDS
> select OLD_SIGSUSPEND3
> select OLD_SIGACTION
> + select ARCH_USE_BUILTIN_BSWAP
> help
> The ARM series is a line of low-power-consumption RISC chip designs
> licensed by ARM Ltd and targeted at embedded applications and
> diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile
> index c9865f6..8ef97c4 100644
> --- a/arch/arm/boot/compressed/Makefile
> +++ b/arch/arm/boot/compressed/Makefile
> @@ -111,12 +111,12 @@ endif
>
> targets := vmlinux vmlinux.lds \
> piggy.$(suffix_y) piggy.$(suffix_y).o \
> - lib1funcs.o lib1funcs.S ashldi3.o ashldi3.S \
> + lib1funcs.o lib1funcs.S ashldi3.o ashldi3.S bswapsdi2.o \
> font.o font.c head.o misc.o $(OBJS)
>
> # Make sure files are removed during clean
> extra-y += piggy.gzip piggy.lzo piggy.lzma piggy.xzkern \
> - lib1funcs.S ashldi3.S $(libfdt) $(libfdt_hdrs)
> + lib1funcs.S ashldi3.S bswapsdi2.o $(libfdt) $(libfdt_hdrs)
>
> ifeq ($(CONFIG_FUNCTION_TRACER),y)
> ORIG_CFLAGS := $(KBUILD_CFLAGS)
> @@ -158,6 +158,12 @@ ashldi3 = $(obj)/ashldi3.o
> $(obj)/ashldi3.S: $(srctree)/arch/$(SRCARCH)/lib/ashldi3.S
> $(call cmd,shipped)
>
> +# For __bswapsi2, __bswapdi2
> +bswapsdi2 = $(obj)/bswapsdi2.o
> +
> +$(obj)/bswapsdi2.o: $(obj)/../../../../arch/$(SRCARCH)/lib/bswapsdi2.o
> + $(call cmd,shipped)
> +
I don't think you can get away with this. The decompressor code is
compiled with -fpic and the main kernel is not. Most toolchains mark
object files with flags to prevent incompatible objects from being
linked together (normally PIC and non-PIC objects are not compatible,
even if in this very simple case that would not matter). Maybe you are
able to link zImage successfully simply because no references to
__bswap* needed to be resolved, and therefore the linker didn't need to
search for or include that object?
> diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> index af72969..dbee639 100644
> --- a/arch/arm/lib/Makefile
> +++ b/arch/arm/lib/Makefile
> @@ -13,7 +13,7 @@ lib-y := backtrace.o changebit.o csumipv6.o csumpartial.o \
> ashldi3.o ashrdi3.o lshrdi3.o muldi3.o \
> ucmpdi2.o lib1funcs.o div64.o \
> io-readsb.o io-writesb.o io-readsl.o io-writesl.o \
> - call_with_stack.o
> + call_with_stack.o bswapsdi2.o
>
> mmu-y := clear_user.o copy_page.o getuser.o putuser.o
>
> @@ -45,3 +45,5 @@ lib-$(CONFIG_ARCH_SHARK) += io-shark.o
>
> $(obj)/csumpartialcopy.o: $(obj)/csumpartialcopygeneric.S
> $(obj)/csumpartialcopyuser.o: $(obj)/csumpartialcopygeneric.S
> +
> +CFLAGS_bswapsdi2.o = -O2
Please insert a small comment to explain why this is done. Adding some
more elaborate explanation in the commit log would be good too.
Nicolas