[PATCH] arm64: add prctl(PR_PAC_SET_ENABLED_KEYS)

Thu Sep 3 00:31:55 EDT 2020

On Tue, Sep 1, 2020 at 9:02 AM Dave Martin <Dave.Martin at arm.com> wrote:
>
> On Wed, Aug 26, 2020 at 07:55:51PM -0700, Peter Collingbourne wrote:
> > On Wed, Aug 26, 2020 at 9:37 AM Dave Martin <Dave.Martin at arm.com> wrote:
> > >
> > > On Mon, Aug 24, 2020 at 05:23:27PM -0700, Peter Collingbourne wrote:
> > > > On Mon, Aug 24, 2020 at 7:49 AM Dave Martin <Dave.Martin at arm.com> wrote:
> > > > >
> > > > > On Wed, Aug 19, 2020 at 02:25:45PM -0700, Peter Collingbourne wrote:
> > > > > > On Wed, Aug 19, 2020 at 3:18 AM Dave Martin <Dave.Martin at arm.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:
>
> [...]
>
> > > > > > > > diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > index 52dead2a8640..d121fa5fed5f 100644
> > > > > > > > --- a/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > @@ -31,6 +31,14 @@ alternative_else_nop_endif
> > > > > > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
> > > > > > > >       msr_s   SYS_APDBKEYLO_EL1, \tmp2
> > > > > > > >       msr_s   SYS_APDBKEYHI_EL1, \tmp3
> > > > > > > > +
> > > > > > > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > > > > > +     cbz     \tmp2, .Laddr_auth_skip_\@
> > > > > > >
> > > > > > > I wonder whether it would make sense to simple store the thread's base
> > > > > > > SCTLR value (containing the ENxx bits), rather than storing the ENxx
> > > > > > > bits separately.  There may be reasons outside this snippet why this
> > > > > > > isn't such a good idea though -- I haven't gone digging so far.
> > > > > >
> > > > > > As far as I know (as I learned [1] from the MTE patch series), it can
> > > > > > be expensive to access SCTLR_EL1, therefore I arrange to only access
> > > > > > SCTLR_EL1 in the hopefully-uncommon case where a process has disabled
> > > > > > keys. Detecting the "process has disabled keys" case is quite simple
> > > > > > if we only store the disabled keys mask here, not so much if we store
> > > > > > the full value of SCTLR_EL1.
> > > > >
> > > > > My thought was that we would still only write SCTLR_EL1 if needed, but
> > > > > we would do the write-if-needed across the whole register in one go.
> > > > > This would be easier to extend if we have to twiddle additional
> > > > > SCTLR_EL1 bits in the future.  If the key enable bits are the only
> > > > > affected bits for now then we could of course defer this factoring until
> > > > > later.  (I vaguely remember a similar discussion in the past, but
> > > > > possibly it was about the pauth keys anyway, rather than something
> > > > > else.)
> > > >
> > > > We will also need a per-task SCTLR_EL1.TCF0 setting for MTE. We could
> > > > potentially combine the two, so that both
> > > > prctl(PR_PAC_SET_ENABLED_KEYS) and prctl(PR_SET_TAGGED_ADDR_CTRL) end
> > > > up affecting a common SCTLR_EL1 field in the task_struct, and we do:
> > > >
> > > > On kernel entry:
> > > > - read SCTLR_EL1 from task_struct (or from the system register if we
> > > > decide that's cheap enough)
> > > > - if EnIA clear, set it, and write the value to the system register.
> > > >
> > > > On kernel exit:
> > > > - read system register SCTLR_EL1
> > > > - read SCTLR_EL1 from task_struct
> > > > - if they are different, write task's SCTLR_EL1 to the system register.
> > > >
> > > > But this would require at least an unconditional read of SCTLR_EL1 per
> > > > kernel exit. The comment that I referenced says that "accesses" to the
> > > > register are expensive. I was operating under the assumption that both
> > > > reads and writes are expensive, but if it's actually only writes that
> > > > are expensive, we may be able to get away with the above.
> > >
> > > The kernel is in full control over what's in SCTLR_EL1, so perhaps we
> > > could cache what we wrote percpu, if the overhead of reading it is
> > > problematic.
> >
> > Maybe, but that sounds like the kind of complexity that I'd like to avoid.
>
> Ack, this is more something to explore if the other approaches turn out
> to be more costly.
>
> > I went ahead and did some performance measurements using the following
> > program, which issues 10M invalid syscalls and exits. This should give
> > us something as close as possible to a measurement of a kernel entry
> > and exit on its own.
> >
> > .globl _start
> > _start:
> > movz x1, :abs_g1:10000000
> > movk x1, :abs_g0_nc:10000000
> >
> > .Lloop:
> > mov x8, #65536 // invalid
> > svc #0
> > sub x1, x1, #1
> > cbnz x1, .Lloop
> >
> > mov x0, #0
> > mov x8, #0x5d // exit
> > svc #0
> >
> > On a Pixel 4 (which according to Wikipedia has a CPU based on
> > Cortex-A76/A55) I measured the median of 1000 runs execution time at
> > 0.554511s, implying an entry/exit cost of 55.5ns. If I make this
> > modification to the kernel (Pixel 4 uses kernel version 4.14 so this
> > won't apply cleanly to modern kernels; this is in the same place as
> > the ptrauth_keys_install_user macro invocation in a modern kernel):
> >
> > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > index 303447c765f7..b7c72d767f3e 100644
> > --- a/arch/arm64/kernel/entry.S
> > +++ b/arch/arm64/kernel/entry.S
> > @@ -340,6 +340,7 @@ alternative_else_nop_endif
> >  #endif
> >  3:
> >         apply_ssbd 0, 5f, x0, x1
> > +       mrs x0, sctlr_el1
> >  5:
> >         .endif
> >
> > The median execution time goes to 0.565604s, implying a cost of 1.1ns
> > to read the system register. I also measured the cost of reading from
> > a percpu variable at kernel exit like so:
> >
> > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > index 303447c765f7..c5a89785f650 100644
> > --- a/arch/arm64/kernel/entry.S
> > +++ b/arch/arm64/kernel/entry.S
> > @@ -340,6 +340,7 @@ alternative_else_nop_endif
> >  #endif
> >  3:
> >         apply_ssbd 0, 5f, x0, x1
> > +       ldr_this_cpu x0, sctlr_el1_cache, x1
> >  5:
> >         .endif
> >
> > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > index 0b6ca5479f1f..ac9c9a6179d4 100644
> > --- a/arch/arm64/kernel/traps.c
> > +++ b/arch/arm64/kernel/traps.c
> > @@ -52,6 +52,8 @@
> >  #include <asm/system_misc.h>
> >  #include <asm/sysreg.h>
> >
> > +DEFINE_PER_CPU(unsigned long, sctlr_el1_cache);
> > +
> >  static const char *handler[]= {
> >         "Synchronous Abort",
> >         "IRQ",
> >
> > With that the median execution time goes to 0.600186s, implying a cost
> > of 4.5ns. So at least on existing hardware, it looks like reading
> > SCTLR_EL1 directly is probably our cheapest option if we're going to
> > read *something*. Therefore I'll look at implementing this with
> > unconditional reads from SCTLR_EL1 in v2.
>
> Thanks for the investigation!
>
> I'm not too surprised by this outcome.  There is a possibility that
> reading SCTLR_EL1 sucks badly on some implementations, but on others it
> may be little more than a register rename.  We should probably wait for
> evidence before panicking about it.
>
> So long as this SCTLR_EL1 switching is completely disabled via
> alternatives on hardware to which it isn't applicable, I expect that
> should be a reasonable compromise.

One disadvantage of disabling SCTLR_EL1 switching on inapplicable
hardware is that it complicates the mechanism for enabling and
disabling deprecated instruction support (see
arch/arm64/kernel/armv8_deprecated.c), which is currently implemented
by using on_each_cpu to set or clear certain bits in SCTLR_EL1 on each
CPU. This is incompatible with an entry/exit that assumes full control
over SCTLR_EL1.

I would propose the following instruction sequences in entry.S, which
are justified by further experiments run on a Pixel 4 showing that:
1) an unconditional SCTLR_EL1 write on kernel entry/exit costs 3.0ns,
whereas conditional writes cost 4.5ns
2) an SCTLR_EL1 read on kernel entry is 0.3ns faster than a load from
task_struct
3) an unconditional SCTLR_EL1 write on kernel exit costs 0.7ns.

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 55af8b504b65..9e0acd5e7527 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -186,6 +186,12 @@ alternative_cb_end

        ptrauth_keys_install_kernel tsk, x20, x22, x23

+alternative_if ARM64_HAS_ADDRESS_AUTH
+       mrs     x22, sctlr_el1
+       orr     x22, x22, SCTLR_ELx_ENIA
+       msr     sctlr_el1, x22
+alternative_else_nop_endif
+
        scs_load tsk, x20
        .else
        add     x21, sp, #S_FRAME_SIZE
@@ -297,6 +303,9 @@ alternative_else_nop_endif
        /* No kernel C function calls after this as user keys are set. */
        ptrauth_keys_install_user tsk, x0, x1, x2

+       ldr     x0, [tsk, THREAD_SCTLR]
+       msr     sctlr_el1, x0
+
        apply_ssbd 0, x0, x1
        .endif


With these instruction sequences I would propose to reimplement
toggling deprecated instruction support as follows:
1) Stop the machine using stop_machine.
2) While the machine is stopped, adjust the deprecated instruction
support feature bits in each task_struct.

This will make toggling deprecated instruction support more expensive,
but that seems like the right tradeoff since I imagine that these
features are rarely toggled, and making it more efficient to toggle
them would mean more instructions in the entry/exit hot path.

This seems like it will work well if SCTLR_EL1 is set on kernel exit.
The deprecated feature bits are CP15BEN and SED, and each of them
affects features that are only implemented at EL0, so it seems like we
can wait until kernel exit to update SCTLR_EL1. But if we don't set
SCTLR_EL1 on exit on certain hardware, we will need to have both
mechanisms, use the existing mechanism on hardware that doesn't set
SCTLR_EL1 and the new mechanism on hardware that does set SCTLR_EL1.
Which I wouldn't mind too much, but I'm not sure that it's worth it to
save 0.7ns (I'm not claiming that my numbers are universal, just that
we should probably favor simpler approaches if we don't have evidence
that they are significantly more expensive than more complex ones).

Peter

> [...]
>
> > > > > > > Do we need a way to query the enabled keys?
> > > > > > >
> > > > > > > We could either have a _GET_ prctl (the common approach), or have this
> > > > > > > prctl return the mask of enabled keys (avoids the extra prctl, but
> > > > > > > weirder).
> > > > > > >
> > > > > > > As above, we might
> > > > > > >
> > > > > > > Things like CRIU may need a GET in order to save/restore this setting
> > > > > > > properly.
> > > > > >
> > > > > > Maybe it makes sense for there to be a GET prctl() to support CRIU.
> > > > > > But it would need to be defined carefully because CRIU would
> > > > > > presumably need to know what value to pass as the "keys" argument when
> > > > > > it calls SET to restore the state. It can't just hardcode a value
> > > > > > because that may harm extensibility if new keys are added.
> > > > > >
> > > > > > If we require the kernel to start processes with all keys enabled
> > > > > > (including any keys that we may introduce in the future), then it
> > > > > > seems like if CRIU knows which keys were disabled, then it can restore
> > > > > > the state by issuing a syscall like this:
> > > > > >
> > > > > > prctl(PR_PAC_SET_ENABLED_KEYS, disabled_keys_mask, 0)
> > > > > >
> > > > > > which would imply that instead of providing PR_PAC_GET_ENABLED_KEYS we
> > > > > > instead provide PR_PAC_GET_DISABLED_KEYS, which CRIU would call when
> > > > > > saving the state to prepare the disabled_keys_mask argument passed
> > > > > > here. Then for consistency we can make the SET prctl() be
> > > > > > PR_PAC_SET_DISABLED_KEYS, and CRIU can then do:
> > > > > >
> > > > > > prctl(PR_PAC_SET_DISABLED_KEYS, disabled_keys_mask, disabled_keys_mask)
> > > > > >
> > > > > > Does that make sense?
> > > > >
> > > > > Possibly, though it's nicer if the GET returns the same value suppiled
> > > > > to the SET, rather than the complement of it.
> > > >
> > > > Maybe you misread my proposal? The part about the complement is
> > > > addressed by the sentence beginning "Then for consistency..." So we
> > > > would end up with PR_PAC_SET_DISABLED_KEYS and
> > > > PR_PAC_GET_DISABLED_KEYS, and they would both take/return a mask of
> > > > *disabled* keys.
> > >
> > > Oh, I see.  Yes, while I prefer the two are consistent with each other,
> > > we can flip the sense of both here if desired.
> > >
> > > >
> > > > > If SET ignores set bits in arg3 that don't correspond to set bits in the
> > > > > mask arg2, then
> > > > >
> > > > >         u64 current_keys = prctl(PR_PAC_GET_ENABLED_KEYS);
> > > > >
> > > > >         prctl(PR_PAC_SET_ENABLED_KEYS, ~0UL, current_keys);
> > > > >
> > > > > should work.
> > > >
> > > > I'd prefer not to allow the SET prctl to accept all-ones as the first
> > > > argument, as this could lead to a scenario where sometimes people pass
> > > > all-ones as the first argument and sometimes they don't, which could
> > > > make it hard to add new keys in the future (since the addition of a
> > > > new key could cause existing callers to affect the new key
> > > > arbitrarily). Instead, the first argument should be required to
> > > > specify only known keys, in order to enforce that the caller is
> > > > explicit about which keys they intend to affect.
> > >
> > > That sounds fair.
> > >
> > > > > There's a final option, which is to expose this config through ptrace
> > > > > instead for save/restore purposes.  From previous discussions with the
> > > > > CRIU folks, I recall that they are trying to move away from doing setup
> > > > > from within the new process where possible.
> > > > >
> > > > > There's no reason not to have both though -- there's precedent for that,
> > > > > such as for PR_SVE_{GET,SET}_VL and the NT_ARM_SVE regset.  MTE may move
> > > > > in a similar direction too IIUC.
> > > > >
> > > > > Having a GET remains useful for in-process debugging and diagnostics,
> > > > > and it's extremely straightforward to add in the kernel.  So from my
> > > > > side I'd vote to have it anyway...
> > > >
> > > > Okay. I see that the MTE series is adding NT_ARM_TAGGED_ADDR_CTRL as
> > > > well, and I suppose that I could add a
> > > > NT_ARM_PAC_(whatever)ABLED_KEYS. Since this wouldn't be used in normal
> > > > userspace applications (and it doesn't look like ptrace APIs such as
> > > > NT_ARM_PACA_KEYS will accommodate new keys being added so CRIU would
> > > > need to be made aware of new keys anyway), we might as well just have
> > > > all the APIs deal with the set of enabled keys.
> > > >
> > > > And if the ptrace is enough for CRIU, I'd mildly prefer not to add a
> > > > GET prctl, since it's just more uapi surface that needs to be
> > > > maintained forever. For in-process use cases it seems redundant with
> > > > e.g. issuing a pacia instruction and checking whether it is a no-op,
> > >
> > > Probing instructions to see whether they generate signals is a cardinal
> > > sin and we make huge efforts to ensure that userspace doesn't need to do
> > > that.  As for the no-op check, I suppose that works, but it's nasty if
> > > the generic "check all the required featues" boilerplate on program
> > > entry relies on intrinsics or can't be written in C at all.  This is
> > > just a way of discouraging people from checking at all IMHO.
> > >
> > >
> > > The implementation of that prctl would most likely be a one-line wrapper
> > > that calls the same internal function used to populate the regset for
> > > ptrace, so I think you might be worrying a bit too much about creating
> > > ABI here.  This is more providing a second interface for same thing,
> > > because no single interface is usable in all situations.
> >
> > Okay. Since I don't feel too strongly about it, I'll add the GET prctl in v2.
>
> Thanks
>
> ---Dave