[PATCH v2] arm64: implement support for static call trampolines

Thu Oct 29 09:21:22 EDT 2020

On Thu, Oct 29, 2020 at 12:59:43PM +0100, Ard Biesheuvel wrote:
> On Thu, 29 Oct 2020 at 12:50, Mark Rutland <mark.rutland at arm.com> wrote:
> > On Wed, Oct 28, 2020 at 07:41:14PM +0100, Ard Biesheuvel wrote:
> > > Implement arm64 support for the 'unoptimized' static call variety,
> > > which routes all calls through a single trampoline that is patched
> > > to perform a tail call to the selected function.
> >
> > Given the complexity and subtlety here, do we actually need this?
> 
> The more time I spend on this, the more I lean towards 'no' :-)

That would make this all simpler! :)

[...]

> > > Since static call targets may be located in modules loaded out of
> > > direct branching range, we need to be able to fall back to issuing
> > > a ADRP/ADD pair to load the branch target into R16 and use a BR
> > > instruction. As this involves patching more than a single B or NOP
> > > instruction (for which the architecture makes special provisions
> > > in terms of the synchronization needed), we may need to run the
> > > full blown instruction patching logic that uses stop_machine(). It
> > > also means that once we've patched in a ADRP/ADD pair once, we are
> > > quite restricted in the patching we can code subsequently, and we
> > > may end up using an indirect call after all (note that
> >
> > Noted. I guess we
> >
> > [...]
> >
> 
> ?

Sorry; I was playing on the commit message ending on "(note that", as I
wasn't sure where that was going.

> > > + *
> > > + * The architecture permits us to patch B instructions into NOPs or vice versa
> > > + * directly, but patching any other instruction sequence requires careful
> > > + * synchronization. Since branch targets may be out of range for ordinary
> > > + * immediate branch instructions, we may have to fall back to ADRP/ADD/BR
> > > + * sequences in some cases, which complicates things considerably; since any
> > > + * sleeping tasks may have been preempted right in the middle of any of these
> > > + * sequences, we have to carefully transform one into the other, and ensure
> > > + * that it is safe to resume execution at any point in the sequence for tasks
> > > + * that have already executed part of it.
> > > + *
> > > + * So the rules are:
> > > + * - we start out with (A) or (B)
> > > + * - a branch within immediate range can always be patched in at offset 0x4;
> > > + * - sequence (A) can be turned into (B) for NULL branch targets;
> > > + * - a branch outside of immediate range can be patched using (C), but only if
> > > + *   . the sequence being updated is (A) or (B), or
> > > + *   . the branch target address modulo 4k results in the same ADD opcode
> > > + *     (which could occur when patching the same far target a second time)
> > > + * - once we have patched in (C) we cannot go back to (A) or (B), so patching
> > > + *   in a NULL target now requires sequence (D);
> > > + * - if we cannot patch a far target using (C), we fall back to sequence (E),
> > > + *   which loads the function pointer from memory.
> >
> > Cases C-E all use an indirect branch, which goes against one of the
> > arguments for having static calls (the assumption that CPUs won't
> > mis-predict direct branches). Similarly case E is a literal pool with
> > more steps.
> >
> > That means that for us, static calls would only be an opportunistic
> > optimization rather than a hardening feature. Do they actually save us
> > much, or could we get by with an inline literal pool in the trampoline?
> 
> Another assumption this is based on is that a literal load is more
> costly than a ADRP/ADD.

Agreed. I think in practice it's going to depend on the surrounding
context and microarchitecture. If the result is being fed into a BR, I'd
expect no difference on a big OoO core, and even for a small in-order
core it'll depend on how/when the core can forward the result relative
to predicting the branch.

> > It'd be much easier to reason consistently if the trampoline were
> > always:
> >
> > |       BTI C
> > |       LDR X16, _literal // pc-relative load
> > |       BR X16
> > | _literal:
> > |       < patch a 64-bit value here atomically >
> >
> > ... and I would strongly prefer that to having multiple sequences that
> > could all be live -- I'm really not keen on the complexity and subtlety
> > that entails.
> 
> I don't see this having any benefit over a ADRP/LDR pair that accesses
> the static call key struct directly.

Even better!

Thanks,
Mark.