[PATCH v2] arm64: implement support for static call trampolines

Thu Oct 29 07:50:26 EDT 2020

Hi Ard,

On Wed, Oct 28, 2020 at 07:41:14PM +0100, Ard Biesheuvel wrote:
> Implement arm64 support for the 'unoptimized' static call variety,
> which routes all calls through a single trampoline that is patched
> to perform a tail call to the selected function.

Given the complexity and subtlety here, do we actually need this?

If there's some core feature that depends on this (or wants to), it'd be
nice to call that out in the commit message.

> Since static call targets may be located in modules loaded out of
> direct branching range, we need to be able to fall back to issuing
> a ADRP/ADD pair to load the branch target into R16 and use a BR
> instruction. As this involves patching more than a single B or NOP
> instruction (for which the architecture makes special provisions
> in terms of the synchronization needed), we may need to run the
> full blown instruction patching logic that uses stop_machine(). It
> also means that once we've patched in a ADRP/ADD pair once, we are
> quite restricted in the patching we can code subsequently, and we
> may end up using an indirect call after all (note that 

Noted. I guess we

[...]

> v2:
> This turned nasty really quickly when I realized that any sleeping task
> could have been interrupted right in the middle of the ADRP/ADD pair
> that we emit for static call targets that are out of immediate branching
> range.

> +/*
> + * The static call trampoline consists of one of the following sequences:
> + *
> + *      (A)           (B)           (C)           (D)           (E)
> + * 00: BTI  C        BTI  C        BTI  C        BTI  C        BTI  C
> + * 04: B    fn       NOP           NOP           NOP           NOP
> + * 08: RET           RET           ADRP X16, fn  ADRP X16, fn  ADRP X16, fn
> + * 0c: NOP           NOP           ADD  X16, fn  ADD  X16, fn  ADD  X16, fn
> + * 10:                             BR   X16      RET           NOP
> + * 14:                                                         ADRP X16, &fn
> + * 18:                                                         LDR  X16, [X16, &fn]
> + * 1c:                                                         BR   X16

Are these all padded with trailing NOPs or UDFs? I assume the same space is
statically reserved.

> + *
> + * The architecture permits us to patch B instructions into NOPs or vice versa
> + * directly, but patching any other instruction sequence requires careful
> + * synchronization. Since branch targets may be out of range for ordinary
> + * immediate branch instructions, we may have to fall back to ADRP/ADD/BR
> + * sequences in some cases, which complicates things considerably; since any
> + * sleeping tasks may have been preempted right in the middle of any of these
> + * sequences, we have to carefully transform one into the other, and ensure
> + * that it is safe to resume execution at any point in the sequence for tasks
> + * that have already executed part of it.
> + *
> + * So the rules are:
> + * - we start out with (A) or (B)
> + * - a branch within immediate range can always be patched in at offset 0x4;
> + * - sequence (A) can be turned into (B) for NULL branch targets;
> + * - a branch outside of immediate range can be patched using (C), but only if
> + *   . the sequence being updated is (A) or (B), or
> + *   . the branch target address modulo 4k results in the same ADD opcode
> + *     (which could occur when patching the same far target a second time)
> + * - once we have patched in (C) we cannot go back to (A) or (B), so patching
> + *   in a NULL target now requires sequence (D);
> + * - if we cannot patch a far target using (C), we fall back to sequence (E),
> + *   which loads the function pointer from memory.

Cases C-E all use an indirect branch, which goes against one of the
arguments for having static calls (the assumption that CPUs won't
mis-predict direct branches). Similarly case E is a literal pool with
more steps.

That means that for us, static calls would only be an opportunistic
optimization rather than a hardening feature. Do they actually save us
much, or could we get by with an inline literal pool in the trampoline?

It'd be much easier to reason consistently if the trampoline were
always:

| 	BTI C
| 	LDR X16, _literal // pc-relative load
| 	BR X16
| _literal:
| 	< patch a 64-bit value here atomically >

... and I would strongly prefer that to having multiple sequences that
could all be live -- I'm really not keen on the complexity and subtlety
that entails.

[...]

> + * Note that sequence (E) is only used when switching between multiple far
> + * targets, and that it is not a terminal degraded state.

I think what you're saying here is that you can go from (E) to (C) if
switching to a new far branch that's in ADRP+ADD range, but it can be
misread to mean (E) is a transient step while patching a branch rather
than some branches only being possible to encode with (E).

Thanks,
Mark.