[RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels

Thu Apr 28 09:03:55 PDT 2022

On Thu, 28 Apr 2022 at 08:57, Fangrui Song <maskray at google.com> wrote:
>
> On 2022-04-28, Ard Biesheuvel wrote:
> >On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray at google.com> wrote:
> >>
> >> On 2022-04-27, Ard Biesheuvel wrote:
> >> >We currently use ordinary, position dependent code generation for the
> >> >core kernel, which happens to default to the 'small' code model on both
> >> >GCC and Clang. This is the code model that relies on ADRP/ADD or
> >> >ADRP/LDR pairs for symbol references, which are PC-relative with a range
> >> >of -/+ 4 GiB, and therefore happen to be position independent in
> >> >practice.
> >> >
> >> >This means that the fact that we can link the relocatable KASLR kernel
> >> >using the -pie linker flag (which generates the runtime relocations and
> >> >inserts them into the binary) is somewhat of a coincidence, and not
> >> >something which is explicitly supported by the toolchains.
> >>
> >> Agree. The current -fno-PIE + -shared -Bsymbolic combo works by
> >> coincidence; it is not guaranteed by the toolchain.
> >>
> >> -shared needs -fpic object files. -shared -Bsymbolic is very similar to
> >> -pie and therefore works with -fpie object files, but the usage is not
> >> recommended from the toolchain perspective.
> >>
> >
> >So are you suggesting we should also switch from -shared -Bsymbolic
> >to -pie if we can? I don't remember the details, but IIRC ld.bfd
> >didn't set the ELF binary type correctly, but perhaps this has now
> >been fixed.
>
> Yes, -shared -Bsymbolic => -pie, but that can be done later.
>
> As for e_type (ET_DYN), I think it is unlikely that there was a bug.
> -pie was added by binutils in 2003: it's close to -shared but doesn't
> allow its definitions to be preempted/interposed. Code earlier than that
> might use -shared -Bsymbolic before -pie was available.
>
> >> >The reason we have not used -fpie for code generation so far (which is
> >> >the compiler flag that should be used to generate code that is to be
> >> >linked with -pie) is that by default, it generates code based on
> >> >assumptions that only hold for shared libraries and PIE executables,
> >> >i.e., that gathering all relocatable quantities into a Global Offset
> >> >Table (GOT) is desirable because it reduces the CoW footprint, and
> >> >because it permits ELF symbol preemption (which lets an executable
> >> >override symbols defined in a shared library, in a way that forces the
> >> >shared library to update all of its internal references as well).
> >> >Ironically, this means we end up with many more absolute references that
> >> >all need to be fixed up at boot.
> >>
> >> This is not about symbol preemption (when the executable and a shared
> >> object define the same symbol, which one wins). An executable using a
> >> GOT entry which will be resolved to a shared object => this is regular
> >> relocation resolving and there is no preemption.
> >>
> >> It is that the compiler prefers code generation which can avoid text
> >> relocations / copy relocations / canonical PLT entries
> >> (https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected#summary).
> >>
> >
> >Fair enough. So the compiler cannot generate relative references to
> >undefined external symbols since it doesn't know at codegen time
> >whether the symbol reference will be satisfied by the executable
> >itself or by a shared library, and in the latter case, the relative
> >distance to the symbol is not known at build time, and so a runtime
> >relocation is required.
>
> Right.
>
> >But how about references to symbols with
> >external visibility that are defined in the same compilation unit? I
> >don't quite understand why those references need to go via the GOT as
> >well.
>
> If you mean references to a non-local STV_DEFAULT (default visibility) definition =>
>
> * -fpic: use GOT because the definition may be replaced by another at run time.
>    Conservatively use a GOT-generating code sequence to allow potential symbol
>    preemption (interposition). The linker may optimize out the GOT (x86-64
>    GOTPCRELX, recent ld.lld for aarch64, powerpc64 TOC-indirect to
>    TOC-relative optimization).
> * -fpie or -fno-pie: the definition cannot be replaced. GOT is unneeded.
>
> -fpie is an optimization on top of -fpic: (a) non-local STV_DEFAULT
> definitions can be assumed non-interposable (b) (irrelevant to the
> kernel) TLS can use more optimized models.
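
To make the distinction above concrete, here is a minimal sketch of the
case in question: a non-local, default-visibility definition referenced
from the same compilation unit. The names are hypothetical and not taken
from the patch, and the codegen notes in the comments describe typical
compiler behavior rather than a guarantee.

```c
/* Hypothetical non-local definition with default (STV_DEFAULT) visibility. */
int shared_counter = 41;

int read_shared_counter(void)
{
    /*
     * -fpic: the compiler typically loads the address from the GOT,
     *        because shared_counter might be interposed at run time.
     * -fpie or -fno-pie: the definition cannot be replaced, so a direct
     *        PC-relative reference (ADRP/LDR on arm64) is enough.
     */
    return shared_counter + 1;
}
```

Comparing the output of `-S` with -fpic and with -fpie should show the
difference, assuming the toolchain behaves as described above.
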
>
> >> >Fortunately, we can convince the compiler to handle this in a way that
> >> >is a bit more suitable for freestanding binaries such as the kernel, by
> >> >setting the 'hidden' visibility #pragma, which informs the compiler that
> >> >symbol preemption or CoW footprint are of no concern to us, and so
> >> >PC-relative references that are resolved at link time are perfectly
> >> >fine.
> >>
> >> Agree
> >>
> >
> >The only unfortunate thing is that -fvisibility=hidden does not give
> >us the behavior we want, and we are forced to use the #pragma instead.
>
> Right. For a very long time there had been no option controlling the
> access mode for undefined symbols (-fvisibility= is for defined
> symbols).
>
> I added -fdirect-access-external-data to Clang which supports
> many architectures (x86, aarch64, arm, riscv, ...).
> GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).
>
> The use of `#pragma GCC visibility push(hidden)` looks good as a
> portable solution.
>

OK
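
For illustration, a minimal sketch of the #pragma approach, assuming a
hypothetical symbol name (this is not code from the patch). In a real
build the definition would normally live in a different compilation
unit; it is placed here only so the example is self-contained.

```c
/*
 * Declarations seen while the pragma is active get hidden visibility,
 * including declarations of symbols defined elsewhere. This is what
 * -fvisibility=hidden does not cover: it only affects definitions.
 */
#pragma GCC visibility push(hidden)
extern unsigned long hypothetical_boot_flags;
#pragma GCC visibility pop

/* Stand-in definition; it inherits the hidden visibility declared above. */
unsigned long hypothetical_boot_flags = 0x5UL;

unsigned long read_boot_flags(void)
{
    /* With -fpie, a hidden symbol can be referenced PC-relatively, no GOT. */
    return hypothetical_boot_flags;
}
```
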

> >> >So let's enable this #pragma and build with -fpie when building a
> >> >relocatable kernel. This also means that all constant data items that
> >> >carry statically initialized pointer variables are now emitted into the
> >> >.data.rel.ro* sections, so move these into .rodata where they belong.
> >>
> >> LGTM, except: is ".rodata" a typo? The patch doesn't reference .rodata
> >>
> >
> >I am referring to the .rodata pseudo-segment that we have in the
> >kernel, which runs from _etext to __inittext_begin.
>
> OK
>
> >> >Code size impact (GCC):
> >> >
> >> >Before:
> >> >
> >> >      text       data        bss      total filename
> >> >  16712396   18659064     534556   35906016 vmlinux
> >> >
> >> >After:
> >> >
> >> >      text       data        bss      total filename
> >> >  16804400   18612876     534556   35951832 vmlinux
> >> >
> >> >Code size impact (Clang):
> >> >
> >> >Before:
> >> >
> >> >      text       data        bss      total filename
> >> >  17194584   13335060     535268   31064912 vmlinux
> >> >
> >> >After:
> >> >
> >> >      text       data        bss      total filename
> >> >  17194536   13310032     535268   31039836 vmlinux
>
> The size difference for Clang matches my expectation :)
> I am somewhat surprised that data is smaller, though...
>
> I wonder why GCC bloats the code so much...
>

This is caused by the use of RELA format instead of RELR.
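
As a rough sketch of why the relocation format matters for size: an
ELF64 RELA entry is 24 bytes (offset, info, addend), whereas RELR packs
relative relocations into 8-byte words, one address word followed by
bitmap words that each cover up to 63 subsequent 8-byte slots. The
helper names below are hypothetical, and the RELR figure assumes the
best case of one fully contiguous run of relocated pointers.

```c
#include <stddef.h>

#define RELA_ENTRY_BYTES   24u  /* sizeof(Elf64_Rela) */
#define RELR_WORD_BYTES     8u  /* one address or bitmap word */
#define RELR_BITS_PER_WORD 63u  /* slots covered by one bitmap word */

/* Bytes needed to describe n relative relocations in RELA format. */
size_t rela_bytes(size_t n)
{
    return n * RELA_ENTRY_BYTES;
}

/* Bytes needed in RELR format when the n slots are contiguous. */
size_t relr_bytes_contiguous(size_t n)
{
    if (n == 0)
        return 0;
    /* The address word covers the first slot; bitmaps cover the rest. */
    size_t bitmap_words =
        (n - 1 + RELR_BITS_PER_WORD - 1) / RELR_BITS_PER_WORD;
    return (1 + bitmap_words) * RELR_WORD_BYTES;
}
```

For 64 contiguous pointers this gives 1536 bytes of RELA against 16
bytes of RELR. Real kernels will land somewhere in between, since
relocated slots are not all contiguous.
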