[PATCH] kasan: arm64: support specialized outlined tag mismatch checks

Peter Collingbourne pcc at google.com
Fri Nov 20 17:59:42 EST 2020


On Fri, Nov 13, 2020 at 3:22 AM Mark Rutland <mark.rutland at arm.com> wrote:
>
> Hi Peter,
>
> Thanks for this; the additional detail was very helpful.
>
> The high-level idea looks fine to me, but I have some qualms with the
> ABI, and I think we should rejig the compiler side rather than taking
> this as-is. It also looks like there may be an issue with PLTs. I've
> tried to explain those in detail below.
>
> On Thu, Nov 05, 2020 at 05:12:56PM -0800, Peter Collingbourne wrote:
> > On Tue, Nov 3, 2020 at 7:28 AM Mark Rutland <mark.rutland at arm.com> wrote:
> > >
> > > Hi Peter,
> > >
> > > On Thu, Oct 29, 2020 at 01:59:44PM -0700, Peter Collingbourne wrote:
> > > > Outlined checks require a new runtime function with a custom calling
> > > > convention. Add this function to arch/arm64/lib.
> > >
> > > [...]
> > >
> > > > +/*
> > > > + * Report a tag mismatch detected by tag-based KASAN.
> > > > + *
> > > > + * This function has a custom calling convention in order to minimize the sizes
> > > > + * of the compiler-generated thunks that call it. All registers except for x16
> > > > + * and x17 must be preserved. This includes x0 and x1 which are used by the
> > > > + * caller to pass arguments, and x29 (the frame pointer register). In order to
> > > > + * allow these registers to be restored the caller spills them to stack. The 256
> > > > + * bytes of stack space allocated by the caller must be deallocated on return.
> > >
> > > This is a bit difficult to follow.
> > >
> > > Am I right in understanding that the caller has pre-allocated 256 bytes
> > > on the stack by decrementing the SP, yet it's up to the callee to undo
> > > that? That seems bizarre to me, and it would be much easier (and more
> > > generally useful) for the caller to leave it to the trampoline to
> > > allocate/free its own space.
> >
> > Yes, that's how it works. The goal was to reduce code size costs,
> > since otherwise you would need setup and teardown code in each
> > trampoline together with unwind info which will not necessarily be
> > deduplicated by the linker. It's probably easiest to think of the
> > trampoline as the "head" of the tag mismatch handling function, while
> > the runtime part is the "tail". From that perspective everything is
> > balanced.
>
> I appreciate the general idea, but there are some details that I think
> aren't quite right in the ABI, and I'd rather we rejig that before
> settling on it, especially as IIUC this was added specifically to speed
> up kernel usage, and I'd like to make sure it works well in-context.
>
> Firstly, reserving 256 bytes in the compiler-generated trampoline is
> arbitrary, and I'd rather that trampoline only reserved the space it
> needs (currently 32 bytes for a stack record and x0/x1), even if that
> means the runtime trampoline has to shuffle x0/x1.
>
> Secondly, I don't believe that the compiler-generated trampoline needs to
> stack x29/x30. It makes a tail-call to __hwasan_tag_mismatch, which
> generally suggests it *shouldn't* create a frame record. It doesn't
> clobber the original x29/x30, so a frame record isn't necessary (even
> for reliable stacktrace), and it doesn't plumb the new record into x29,
> so it could leave all that work to the runtime trampoline and save an
> instruction.
>
> I think that the stacking in the compiler generated trampoline should be
> limited to:
>
>         stp x0, x1, [sp, #-16]!
>
> ... which is much like how mfentry works on most architectures, and
> similar to how patchable-function-entry works on arm64.

Hi Mark,

I think all of your suggestions may be considered useful improvements
to the ABI, and if we were redesigning it from scratch then I think it
would look a lot more like what you've proposed. However, as discussed
out-of-band, the ABI has already been in use in userspace for over a
year, and I don't think any of these changes rises to the level of
justifying a new ABI, in particular because they only affect the
uncommon case where we've actually detected an error.
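
To make the division of labour concrete, here is a rough sketch of how
the two halves balance out under the existing convention (illustrative
only: the real runtime spills the full register state into the
caller's frame before calling into C, and the offsets other than the
x0/x1 and x29/x30 slots used by the trampoline quoted below are made
up):

	// "head": emitted by the compiler in each check function
	stp	x0, x1, [sp, #-256]!	// allocate the whole 256-byte frame
	stp	x29, x30, [sp, #232]
	b	__hwasan_tag_mismatch

	// "tail": the runtime, in outline
	// ... spill the remaining registers into the caller's frame ...
	bl	kasan_tag_mismatch	// ordinary AAPCS call into C
	// ... reload the registers ...
	ldp	x29, x30, [sp, #232]
	ldp	x0, x1, [sp]
	add	sp, sp, #256		// give back the frame the caller allocated
	ret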

> > For reference here is an example of a trampoline function for the kernel:
> >
> > __hwasan_check_x1_67043328:
> > 	sbfx	x16, x1, #4, #52
> > 	ldrb	w16, [x9, x16]
> > 	cmp	x16, x1, lsr #56
> > 	b.ne	.Ltmp5
> > .Ltmp6:
> > 	ret
> > .Ltmp5:
> > 	lsr	x16, x1, #56
> > 	cmp	x16, #255
> > 	b.eq	.Ltmp6
> > 	stp	x0, x1, [sp, #-256]!
> > 	stp	x29, x30, [sp, #232]
> > 	mov	x0, x1
> > 	mov	x1, #0
> > 	b	__hwasan_tag_mismatch
>
> Thanks for this; it was helpful in clarifying my understanding.
>
> I note this clobbers x16 unconditionally. Is x17 free for use too?

Yes, and indeed it must be free for use because the linker is allowed
to clobber x16 and x17 across the call via a range extension thunk
(that said, I don't know what ld.bfd does, but ld.lld just clobbers
x16).
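
For illustration, the kind of veneer in question looks something like
this (roughly what ld.lld emits for an ADRP-based range extension
thunk; the exact shape and symbol name are linker-dependent):

__AArch64ADRPThunk___hwasan_tag_mismatch:
	adrp	x16, __hwasan_tag_mismatch
	add	x16, x16, :lo12:__hwasan_tag_mismatch
	br	x16

so anything using this calling convention has to treat x16 (and, to be
safe, x17) as potentially clobbered at the call boundary.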

> > > > + *
> > > > + * This function takes care of transitioning to the standard AAPCS calling
> > > > + * convention and calls the C function kasan_tag_mismatch to report the error.
> > > > + *
> > > > + * Parameters:
> > > > + *   x0 - the fault address
> > > > + *   x1 - an encoded description of the faulting access
> > > > + */
> > > > +SYM_FUNC_START(__hwasan_tag_mismatch)
> > >
> > > For non-AAPCS code you should use SYM_CODE_*() rather than SYM_FUNC_*(),
> > > and place any BTI instructions necessary manually.
> >
> > Okay. Since the trampolines may appear in modules I guess it's
> > possible for the function to be called via a module PLT so a BTI
> > instruction is required.
>
> I think so, but I am not completely certain (there are special rules for
> branches from PLTs using x16/x17).

As also pointed out by Ard, a "bti c" is required here, so I've added one in v2.
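
Combined with the SYM_CODE_*() change suggested above, the entry ends
up looking roughly like this (a sketch of the shape, not the exact v2
hunk):

SYM_CODE_START(__hwasan_tag_mismatch)
	bti	c
	/* ... register saves, call into kasan_tag_mismatch, restores ... */
SYM_CODE_END(__hwasan_tag_mismatch)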

> > If I am reading the code in arch/arm64/kernel/module-plts.c correctly
> > it looks like the module PLTs may only clobber x16, correct? That
> > works for me.
>
> PLTs can clobber both x16 and x17. Do callers of the compiler-generated
> trampolines account for those being clobbered? I see from the example
> above that the trampolines do not.

Yes, the call to the check function is modelled by the compiler as
clobbering x16 and x17 (and x30 of course).
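
In other words, at the call site the compiler treats the check much
like this (illustrative; the check function name is the one from the
example above, the surrounding code is hypothetical):

	// x1 holds the pointer being checked and is preserved across the call;
	// x16, x17 and x30 are treated as clobbered, so nothing live stays there.
	bl	__hwasan_check_x1_67043328
	ldr	x2, [x1]	// the instrumented access then proceeds as usual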

> [...]
>
> > > > +#ifndef CONFIG_SHADOW_CALL_STACK
> > > > +     str     x18, [sp, #128]
> > > > +#endif
> > >
> > > This doesn't look right to me. The shadow call stack should remain
> > > balanced across a call, so why do you need to save x18? any manipulation
> > > within kasan_tag_mismatch should be balanced.
> > >
> > > Did you mean to save the return address to the shadow stack, rather
> > > than saving the shadow stack pointer?
> >
> > This is to handle the case where SCS is disabled. In this case, x18
> > becomes a temporary register for AAPCS functions, but this calling
> > convention requires it to be saved.
>
> My bad -- I had misread the `ifndef` as an `ifdef`, which was why I was
> confused as to why you'd save x18 when SCS was enabled, which is not
> what's going on.

Ack.
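
(To spell it out for the archives: without SCS, x18 is just another
temporary under AAPCS, but this calling convention still requires it
to be preserved, so the runtime brackets the C call with something
like the following. The #128 slot matches the hunk quoted above; the
restore is assumed to be the obvious counterpart.)

#ifndef CONFIG_SHADOW_CALL_STACK
	str	x18, [sp, #128]		/* x18 is only a temporary without SCS */
#endif
	bl	kasan_tag_mismatch
#ifndef CONFIG_SHADOW_CALL_STACK
	ldr	x18, [sp, #128]
#endif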

Peter


