[PATCH v2] arm64: implement support for static call trampolines

Ard Biesheuvel ardb at kernel.org
Thu Oct 29 07:59:43 EDT 2020


Hi Mark,

On Thu, 29 Oct 2020 at 12:50, Mark Rutland <mark.rutland at arm.com> wrote:
>
> Hi Ard,
>
> On Wed, Oct 28, 2020 at 07:41:14PM +0100, Ard Biesheuvel wrote:
> > Implement arm64 support for the 'unoptimized' static call variety,
> > which routes all calls through a single trampoline that is patched
> > to perform a tail call to the selected function.
>
> Given the complexity and subtlety here, do we actually need this?
>

The more time I spend on this, the more I lean towards 'no' :-)

> If there's some core feature that depends on this (or wants to), it'd be
> nice to call that out in the commit message.
>

Agreed. Perhaps I should have put RFC in the subject, because this is
intended more as a call for discussion (and v1 had little success in
that regard).

> > Since static call targets may be located in modules loaded out of
> > direct branching range, we need to be able to fall back to issuing
> > an ADRP/ADD pair to load the branch target into X16 and use a BR
> > instruction. As this involves patching more than a single B or NOP
> > instruction (for which the architecture makes special provisions
> > in terms of the synchronization needed), we may need to run the
> > full blown instruction patching logic that uses stop_machine(). It
> > also means that once we've patched in an ADRP/ADD pair, we are
> > quite restricted in the patching we can do subsequently, and we
> > may end up using an indirect call after all (note that
>
> Noted. I guess we
>
> [...]
>

?

> > v2:
> > This turned nasty really quickly when I realized that any sleeping task
> > could have been interrupted right in the middle of the ADRP/ADD pair
> > that we emit for static call targets that are out of immediate branching
> > range.
>
> > +/*
> > + * The static call trampoline consists of one of the following sequences:
> > + *
> > + *      (A)           (B)           (C)           (D)           (E)
> > + * 00: BTI  C        BTI  C        BTI  C        BTI  C        BTI  C
> > + * 04: B    fn       NOP           NOP           NOP           NOP
> > + * 08: RET           RET           ADRP X16, fn  ADRP X16, fn  ADRP X16, fn
> > + * 0c: NOP           NOP           ADD  X16, fn  ADD  X16, fn  ADD  X16, fn
> > + * 10:                             BR   X16      RET           NOP
> > + * 14:                                                         ADRP X16, &fn
> > + * 18:                                                         LDR  X16, [X16, &fn]
> > + * 1c:                                                         BR   X16
>
> Are these all padded with trailing NOPs or UDFs? I assume the same space is
> statically reserved.
>

Yes, it is a fixed-size 32-byte area. I only included the relevant
instructions here; the empty slots are don't-care (D/C) in the context
of each individual sequence.
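
To make the range constraint concrete: B takes a 26-bit signed
immediate scaled by 4, so the direct-branch fast path at offset 0x4
only applies within +/- 128 MiB of the trampoline. Roughly (a
user-space sketch, not the kernel's actual helper):

```c
#include <stdbool.h>
#include <stdint.h>

/* B/BL carry a 26-bit signed immediate scaled by 4, i.e. +/- 128 MiB. */
#define BRANCH_RANGE ((int64_t)1 << 27)

static bool branch_in_range(uint64_t pc, uint64_t target)
{
	int64_t offset = (int64_t)(target - pc);

	return offset >= -BRANCH_RANGE && offset < BRANCH_RANGE;
}
```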


> > + *
> > + * The architecture permits us to patch B instructions into NOPs or vice versa
> > + * directly, but patching any other instruction sequence requires careful
> > + * synchronization. Since branch targets may be out of range for ordinary
> > + * immediate branch instructions, we may have to fall back to ADRP/ADD/BR
> > + * sequences in some cases, which complicates things considerably; since any
> > + * sleeping tasks may have been preempted right in the middle of any of these
> > + * sequences, we have to carefully transform one into the other, and ensure
> > + * that it is safe to resume execution at any point in the sequence for tasks
> > + * that have already executed part of it.
> > + *
> > + * So the rules are:
> > + * - we start out with (A) or (B)
> > + * - a branch within immediate range can always be patched in at offset 0x4;
> > + * - sequence (A) can be turned into (B) for NULL branch targets;
> > + * - a branch outside of immediate range can be patched using (C), but only if
> > + *   . the sequence being updated is (A) or (B), or
> > + *   . the branch target address modulo 4k results in the same ADD opcode
> > + *     (which could occur when patching the same far target a second time)
> > + * - once we have patched in (C) we cannot go back to (A) or (B), so patching
> > + *   in a NULL target now requires sequence (D);
> > + * - if we cannot patch a far target using (C), we fall back to sequence (E),
> > + *   which loads the function pointer from memory.
>
> Cases C-E all use an indirect branch, which goes against one of the
> arguments for having static calls (the assumption that CPUs won't
> mis-predict direct branches). Similarly case E is a literal pool with
> more steps.
>
> That means that for us, static calls would only be an opportunistic
> optimization rather than a hardening feature. Do they actually save us
> much, or could we get by with an inline literal pool in the trampoline?
>

Another assumption this is based on is that a literal load is more
costly than an ADRP/ADD pair.
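
(To be precise about the 'same ADD opcode' rule above: ADRP yields the
target's 4 KiB page and ADD supplies its low 12 bits, so two far
targets that are congruent modulo 4 KiB encode to the identical ADD
instruction. A sketch, with the encoding written out by hand:)

```c
#include <stdint.h>

/*
 * ADD X16, X16, #lo12 (64-bit ADD immediate, no shift):
 * base opcode 0x91000210 (Rd = Rn = X16), imm12 in bits [21:10].
 * Targets that agree modulo 4 KiB produce the same instruction word,
 * which is what permits reusing sequence (C) for a second far target.
 */
static uint32_t add_lo12_insn(uint64_t target)
{
	return 0x91000210u | (uint32_t)((target & 0xfff) << 10);
}
```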

> It'd be much easier to reason consistently if the trampoline were
> always:
>
> |       BTI C
> |       LDR X16, _literal // pc-relative load
> |       BR X16
> | _literal:
> |       < patch a 64-bit value here atomically >
>
> ... and I would strongly prefer that to having multiple sequences that
> could all be live -- I'm really not keen on the complexity and subtlety
> that entails.
>

I don't see this having any benefit over an ADRP/LDR pair that accesses
the static call key struct directly.
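
Either way, the appeal of the literal-pool variant is that retargeting
reduces to a single naturally aligned 64-bit store, which concurrent
callers observe atomically. In C11-atomics terms (a user-space sketch;
in the real thing the literal lives inside the trampoline):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the 64-bit literal slot inside the trampoline. */
static _Atomic uint64_t literal;

/*
 * Patching the call target is one aligned 64-bit store; a concurrent
 * caller loads either the old or the new pointer, never a torn value.
 */
static void static_call_set_target(uint64_t target)
{
	atomic_store_explicit(&literal, target, memory_order_release);
}

static uint64_t static_call_get_target(void)
{
	return atomic_load_explicit(&literal, memory_order_acquire);
}
```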

> [...]
>
> > + * Note that sequence (E) is only used when switching between multiple far
> > + * targets, and that it is not a terminal degraded state.
>
> I think what you're saying here is that you can go from (E) to (C) if
> switching to a new far branch that's in ADRP+ADD range, but it can be
> misread to mean (E) is a transient step while patching a branch rather
> than some branches only being possible to encode with (E).
>

Indeed.
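
i.e. the legal patching steps form a small state machine; sketched
below (my naming, not the patch's, and ignoring the modulo-4K
ADD-reuse condition that gates some transitions into (C)):

```c
#include <stdbool.h>

enum seq { SEQ_A, SEQ_B, SEQ_C, SEQ_D, SEQ_E };

/*
 * (A) and (B) can be patched into anything; once (C), (D) or (E) has
 * been emitted, (A) and (B) are unreachable, but (C), (D) and (E)
 * remain mutually reachable, including (E) -> (C).
 */
static bool transition_ok(enum seq from, enum seq to)
{
	if (from == SEQ_A || from == SEQ_B)
		return true;

	return to != SEQ_A && to != SEQ_B;
}
```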


