[RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels

Fangrui Song maskray at google.com
Wed Apr 27 23:57:42 PDT 2022


On 2022-04-28, Ard Biesheuvel wrote:
>On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray at google.com> wrote:
>>
>> On 2022-04-27, Ard Biesheuvel wrote:
>> >We currently use ordinary, position dependent code generation for the
>> >core kernel, which happens to default to the 'small' code model on both
>> >GCC and Clang. This is the code model that relies on ADRP/ADD or
>> >ADRP/LDR pairs for symbol references, which are PC-relative with a range
>> >of -/+ 4 GiB, and therefore happen to be position independent in
>> >practice.
>> >
>> >This means that the fact that we can link the relocatable KASLR kernel
>> >using the -pie linker flag (which generates the runtime relocations and
>> >inserts them into the binary) is somewhat of a coincidence, and not
>> >something which is explicitly supported by the toolchains.
>>
>> Agree. The current -fno-PIE + -shared -Bsymbolic combo works as a
>> coincidence, not guaranteed by the toolchain.
>>
>> -shared needs -fpic object files. -shared -Bsymbolic is very similar to
>> -pie and therefore works with -fpie object files, but the usage is not
>> recommended from the toolchain perspective.
>>
>
>So are you suggesting we should also switch from -shared -Bsymbolic
>to -pie if we can? I don't remember the details, but IIRC ld.bfd
>didn't set the ELF binary type correctly, but perhaps this has now
>been fixed.

Yes, -shared -Bsymbolic => -pie, but that can be done later.

Regarding e_type (ET_DYN), I think it is unlikely that there was a bug.
-pie was added to binutils in 2003: it is close to -shared but doesn't
allow its definitions to be preempted/interposed. Code older than that
may have used -shared -Bsymbolic because -pie was not yet available.
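
If someone wants to double check the e_type of a linked image, something
like the following would do. This is only a sketch: it assumes a Linux
host with <elf.h> and an image in host byte order, and the helper name
elf64_is_et_dyn is made up for illustration.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if buf starts with an ELF64 header whose
 * e_type is ET_DYN, 0 otherwise. Assumes the header is in host byte order.
 */
int elf64_is_et_dyn(const unsigned char *buf, size_t len)
{
	Elf64_Ehdr ehdr;

	if (len < sizeof(ehdr))
		return 0;

	/* Copy out of the buffer to avoid unaligned accesses. */
	memcpy(&ehdr, buf, sizeof(ehdr));

	if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0)
		return 0;
	if (ehdr.e_ident[EI_CLASS] != ELFCLASS64)
		return 0;

	return ehdr.e_type == ET_DYN;
}
```

Both -shared and -pie outputs are expected to pass this check; an
ET_EXEC image would not.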

>> >The reason we have not used -fpie for code generation so far (which is
>> >the compiler flag that should be used to generate code that is to be
>> >linked with -pie) is that by default, it generates code based on
>> >assumptions that only hold for shared libraries and PIE executables,
>> >i.e., that gathering all relocatable quantities into a Global Offset
>> >Table (GOT) is desirable because it reduces the CoW footprint, and
>> >because it permits ELF symbol preemption (which lets an executable
>> >override symbols defined in a shared library, in a way that forces the
>> >shared library to update all of its internal references as well).
>> >Ironically, this means we end up with many more absolute references that
>> >all need to be fixed up at boot.
>>
>> This is not about symbol preemption (when the executable and a shared
>> object define the same symbol, which one wins). An executable using a GOT
>> entry that is resolved to a shared object => this is regular relocation
>> resolution and there is no preemption.
>>
>> It is that the compiler prefers code generation which can avoid text
>> relocations / copy relocations / canonical PLT entries
>> (https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected#summary).
>>
>
>Fair enough. So the compiler cannot generate relative references to
>undefined external symbols since it doesn't know at codegen time
>whether the symbol reference will be satisfied by the executable
>itself or by a shared library, and in the latter case, the relative
>distance to the symbol is not known at build time, and so a runtime
>relocation is required.

Right.

>But how about references to symbols with
>external visibility that are defined in the same compilation unit? I
>don't quite understand why those references need to go via the GOT as
>well.

If you mean references to a non-local STV_DEFAULT (default visibility) definition =>

* -fpic: use GOT because the definition may be replaced by another at run time.
   Conservatively use a GOT-generating code sequence to allow potential symbol
   preemption (interposition). The linker may optimize out the GOT (x86-64
   GOTPCRELX, recent ld.lld for aarch64, powerpc64 TOC-indirect to
   TOC-relative optimization).
* -fpie or -fno-pie: the definition cannot be replaced. GOT is unneeded.

-fpie is an optimization on top of -fpic: (a) non-local STV_DEFAULT
definitions can be assumed non-interposable (b) (irrelevant to the
kernel) TLS can use more optimized models.
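
To make (a) concrete, here is a small sketch. The names counter and bump
are made up, and the instruction sequences in the comment are what one
would typically expect on aarch64, not verified compiler output.

```c
/* A non-local STV_DEFAULT (default visibility) definition. */
int counter = 41;

/*
 * -fpic: counter may be interposed at run time, so the compiler
 *        conservatively goes through the GOT, typically:
 *            adrp x0, :got:counter
 *            ldr  x0, [x0, :got_lo12:counter]
 *
 * -fpie / -fno-pie: the definition cannot be replaced, so a direct
 *        PC-relative reference is enough, typically:
 *            adrp x0, counter
 *            add  x0, x0, :lo12:counter
 */
int bump(void)
{
	return ++counter;
}
```

The C semantics are identical in both modes; only the way the address of
counter is materialized differs.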

>> >Fortunately, we can convince the compiler to handle this in a way that
>> >is a bit more suitable for freestanding binaries such as the kernel, by
>> >setting the 'hidden' visibility #pragma, which informs the compiler that
>> >symbol preemption or CoW footprint are of no concern to us, and so
>> >PC-relative references that are resolved at link time are perfectly
>> >fine.
>>
>> Agree
>>
>
>The only unfortunate thing is that -fvisibility=hidden does not give
>us the behavior we want, and we are forced to use the #pragma instead.

Right. For a very long time there had been no option controlling the
access mode for undefined symbols (-fvisibility= is for defined
symbols).

I added -fdirect-access-external-data to Clang; it supports
many architectures (x86, aarch64, arm, riscv, ...).
GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).

The use of `#pragma GCC visibility push(hidden)` looks good as a
portable solution.
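
A minimal sketch of what the #pragma buys us, assuming GCC or Clang. The
names boot_flag and read_boot_flag are hypothetical, and the definition
is kept in the same file only so that the example links on its own.

```c
/*
 * Everything declared between push and pop gets STV_HIDDEN. This includes
 * undefined (extern) declarations, which -fvisibility=hidden does not cover.
 */
#pragma GCC visibility push(hidden)
extern int boot_flag;		/* hypothetical external symbol */
#pragma GCC visibility pop

int read_boot_flag(void)
{
	/*
	 * With -fpie, a hidden symbol can be referenced PC-relatively
	 * instead of through a GOT entry.
	 */
	return boot_flag;
}

/* Normally defined in another translation unit. */
int boot_flag = 7;
```

Without the #pragma, a -fpie compiler has to assume that boot_flag may
live in another component and would typically emit a GOT load for it.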

>> >So let's enable this #pragma and build with -fpie when building a
>> >relocatable kernel. This also means that all constant data items that
>> >carry statically initialized pointer variables are now emitted into the
>> >.data.rel.ro* sections, so move these into .rodata where they belong.
>>
>> LGTM, except: is ".rodata" a typo? The patch doesn't reference .rodata
>>
>
>I am referring to the .rodata pseudo-segment that we have in the
>kernel, which runs from _etext to __inittext_begin.

OK

>> >Code size impact (GCC):
>> >
>> >Before:
>> >
>> >      text       data        bss      total filename
>> >  16712396   18659064     534556   35906016 vmlinux
>> >
>> >After:
>> >
>> >      text       data        bss      total filename
>> >  16804400   18612876     534556   35951832 vmlinux
>> >
>> >Code size impact (Clang):
>> >
>> >Before:
>> >
>> >      text       data        bss      total filename
>> >  17194584   13335060     535268   31064912 vmlinux
>> >
>> >After:
>> >
>> >      text       data        bss      total filename
>> >  17194536   13310032     535268   31039836 vmlinux

The size difference for Clang matches my expectation :)
I am somewhat surprised that data is smaller, though...

I wonder why GCC bloats the code so much...
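
Spelling out the deltas from the quoted size tables (a throwaway sketch;
the struct and function names are made up, the numbers are copied from
the tables above):

```c
/* One row of size(1) output, in bytes. */
struct vmlinux_size {
	long text;
	long data;
	long bss;
	long total;
};

const struct vmlinux_size gcc_before   = { 16712396, 18659064, 534556, 35906016 };
const struct vmlinux_size gcc_after    = { 16804400, 18612876, 534556, 35951832 };
const struct vmlinux_size clang_before = { 17194584, 13335060, 535268, 31064912 };
const struct vmlinux_size clang_after  = { 17194536, 13310032, 535268, 31039836 };

/* Returns after - before for every column. */
struct vmlinux_size size_delta(struct vmlinux_size before,
			       struct vmlinux_size after)
{
	struct vmlinux_size d = {
		after.text - before.text,
		after.data - before.data,
		after.bss - before.bss,
		after.total - before.total,
	};
	return d;
}
```

GCC:   text +92004, data -46188, bss 0, total +45816
Clang: text    -48, data -25028, bss 0, total -25076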

>> >Signed-off-by: Ard Biesheuvel <ardb at kernel.org>
>> >---
>> > arch/arm64/Makefile             | 4 ++++
>> > arch/arm64/kernel/vmlinux.lds.S | 9 ++++-----
>> > 2 files changed, 8 insertions(+), 5 deletions(-)
>> >
>> >diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
>> >index 2f1de88651e6..94b6c51f5de6 100644
>> >--- a/arch/arm64/Makefile
>> >+++ b/arch/arm64/Makefile
>> >@@ -18,6 +18,10 @@ ifeq ($(CONFIG_RELOCATABLE), y)
>> > # with the relocation offsets always being zero.
>> > LDFLAGS_vmlinux               += -shared -Bsymbolic -z notext \
>> >                       $(call ld-option, --no-apply-dynamic-relocs)
>> >+
>> >+# Generate position independent code without relying on a Global Offset Table
>> >+KBUILD_CFLAGS_KERNEL   += -fpie -include $(srctree)/include/linux/hidden.h
>> >+
>> > endif
>> >
>> > ifeq ($(CONFIG_ARM64_ERRATUM_843419),y)
>> >diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
>> >index edaf0faf766f..b1e071ac1acf 100644
>> >--- a/arch/arm64/kernel/vmlinux.lds.S
>> >+++ b/arch/arm64/kernel/vmlinux.lds.S
>> >@@ -174,8 +174,6 @@ SECTIONS
>> >                       KEXEC_TEXT
>> >                       TRAMP_TEXT
>> >                       *(.gnu.warning)
>> >-              . = ALIGN(16);
>> >-              *(.got)                 /* Global offset table          */
>> >       }
>> >
>> >       /*
>> >@@ -192,6 +190,8 @@ SECTIONS
>> >       /* everything from this point to __init_begin will be marked RO NX */
>> >       RO_DATA(PAGE_SIZE)
>> >
>> >+      .data.rel.ro : ALIGN(8) { *(.got) *(.data.rel.ro*) }
>> >+
>> >       HYPERVISOR_DATA_SECTIONS
>> >
>> >       idmap_pg_dir = .;
>> >@@ -273,6 +273,8 @@ SECTIONS
>> >       _sdata = .;
>> >       RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
>> >
>> >+      .data.rel : ALIGN(8) { *(.data.rel*) }
>> >+
>> >       /*
>> >        * Data written with the MMU off but read with the MMU on requires
>> >        * cache lines to be invalidated, discarding up to a Cache Writeback
>> >@@ -320,9 +322,6 @@ SECTIONS
>> >               *(.plt) *(.plt.*) *(.iplt) *(.igot .igot.plt)
>> >       }
>> >       ASSERT(SIZEOF(.plt) == 0, "Unexpected run-time procedure linkages detected!")
>> >-
>> >-      .data.rel.ro : { *(.data.rel.ro) }
>> >-      ASSERT(SIZEOF(.data.rel.ro) == 0, "Unexpected RELRO detected!")
>> > }
>> >
>> > #include "image-vars.h"
>> >--
>> >2.30.2



More information about the linux-arm-kernel mailing list