[PATCH] arm64: add prctl(PR_PAC_SET_ENABLED_KEYS)

Dave Martin Dave.Martin at arm.com
Tue Sep 1 12:02:31 EDT 2020


On Wed, Aug 26, 2020 at 07:55:51PM -0700, Peter Collingbourne wrote:
> On Wed, Aug 26, 2020 at 9:37 AM Dave Martin <Dave.Martin at arm.com> wrote:
> >
> > On Mon, Aug 24, 2020 at 05:23:27PM -0700, Peter Collingbourne wrote:
> > > On Mon, Aug 24, 2020 at 7:49 AM Dave Martin <Dave.Martin at arm.com> wrote:
> > > >
> > > > On Wed, Aug 19, 2020 at 02:25:45PM -0700, Peter Collingbourne wrote:
> > > > > On Wed, Aug 19, 2020 at 3:18 AM Dave Martin <Dave.Martin at arm.com> wrote:
> > > > > >
> > > > > > On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:

[...]

> > > > > > > diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > index 52dead2a8640..d121fa5fed5f 100644
> > > > > > > --- a/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > @@ -31,6 +31,14 @@ alternative_else_nop_endif
> > > > > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
> > > > > > >       msr_s   SYS_APDBKEYLO_EL1, \tmp2
> > > > > > >       msr_s   SYS_APDBKEYHI_EL1, \tmp3
> > > > > > > +
> > > > > > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > > > > +     cbz     \tmp2, .Laddr_auth_skip_\@
> > > > > >
> > > > > > I wonder whether it would make sense to simple store the thread's base
> > > > > > SCTLR value (containing the ENxx bits), rather than storing the ENxx
> > > > > > bits separately.  There may be reasons outside this snippet why this
> > > > > > isn't such a good idea though -- I haven't gone digging so far.
> > > > >
> > > > > As far as I know (as I learned [1] from the MTE patch series), it can
> > > > > be expensive to access SCTLR_EL1, therefore I arrange to only access
> > > > > SCTLR_EL1 in the hopefully-uncommon case where a process has disabled
> > > > > keys. Detecting the "process has disabled keys" case is quite simple
> > > > > if we only store the disabled keys mask here, not so much if we store
> > > > > the full value of SCTLR_EL1.
> > > >
> > > > My thought was that we would still only write SCTLR_EL1 if needed, but
> > > > we would do the write-if-needed across the whole register in one go.
> > > > This would be easier to extend if we have to twiddle additional
> > > > SCTLR_EL1 bits in the future.  If the key enable bits are the only
> > > > affected bits for now then we could of course defer this factoring until
> > > > later.  (I vaguely remember a similar discussion in the past, but
> > > > possibly it was about the pauth keys anyway, rather than something
> > > > else.)
> > >
> > > We will also need a per-task SCTLR_EL1.TCF0 setting for MTE. We could
> > > potentially combine the two, so that both
> > > prctl(PR_PAC_SET_ENABLED_KEYS) and prctl(PR_SET_TAGGED_ADDR_CTRL) end
> > > up affecting a common SCTLR_EL1 field in the task_struct, and we do:
> > >
> > > On kernel entry:
> > > - read SCTLR_EL1 from task_struct (or from the system register if we
> > > decide that's cheap enough)
> > > - if EnIA clear, set it, and write the value to the system register.
> > >
> > > On kernel exit:
> > > - read system register SCTLR_EL1
> > > - read SCTLR_EL1 from task_struct
> > > - if they are different, write task's SCTLR_EL1 to the system register.
> > >
> > > But this would require at least an unconditional read of SCTLR_EL1 per
> > > kernel exit. The comment that I referenced says that "accesses" to the
> > > register are expensive. I was operating under the assumption that both
> > > reads and writes are expensive, but if it's actually only writes that
> > > are expensive, we may be able to get away with the above.
> >
> > The kernel is in full control over what's in SCTLR_EL1, so perhaps we
> > could cache what we wrote percpu, if the overhead of reading it is
> > problematic.
> 
> Maybe, but that sounds like the kind of complexity that I'd like to avoid.

Ack, this is more something to explore if the other approaches turn out
to be more costly.

> I went ahead and did some performance measurements using the following
> program, which issues 10M invalid syscalls and exits. This should give
> us something as close as possible to a measurement of a kernel entry
> and exit on its own.
> 
> .globl _start
> _start:
> movz x1, :abs_g1:10000000
> movk x1, :abs_g0_nc:10000000
> 
> .Lloop:
> mov x8, #65536 // invalid
> svc #0
> sub x1, x1, #1
> cbnz x1, .Lloop
> 
> mov x0, #0
> mov x8, #0x5d // exit
> svc #0
> 
> On a Pixel 4 (which according to Wikipedia has a CPU based on
> Cortex-A76/A55) I measured the median of 1000 runs execution time at
> 0.554511s, implying an entry/exit cost of 55.5ns. If I make this
> modification to the kernel (Pixel 4 uses kernel version 4.14 so this
> won't apply cleanly to modern kernels; this is in the same place as
> the ptrauth_keys_install_user macro invocation in a modern kernel):
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 303447c765f7..b7c72d767f3e 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -340,6 +340,7 @@ alternative_else_nop_endif
>  #endif
>  3:
>         apply_ssbd 0, 5f, x0, x1
> +       mrs x0, sctlr_el1
>  5:
>         .endif
> 
> The median execution time goes to 0.565604s, implying a cost of 1.1ns
> to read the system register. I also measured the cost of reading from
> a percpu variable at kernel exit like so:
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 303447c765f7..c5a89785f650 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -340,6 +340,7 @@ alternative_else_nop_endif
>  #endif
>  3:
>         apply_ssbd 0, 5f, x0, x1
> +       ldr_this_cpu x0, sctlr_el1_cache, x1
>  5:
>         .endif
> 
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index 0b6ca5479f1f..ac9c9a6179d4 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -52,6 +52,8 @@
>  #include <asm/system_misc.h>
>  #include <asm/sysreg.h>
> 
> +DEFINE_PER_CPU(unsigned long, sctlr_el1_cache);
> +
>  static const char *handler[]= {
>         "Synchronous Abort",
>         "IRQ",
> 
> With that the median execution time goes to 0.600186s, implying a cost
> of 4.5ns. So at least on existing hardware, it looks like reading
> SCTLR_EL1 directly is probably our cheapest option if we're going to
> read *something*. Therefore I'll look at implementing this with
> unconditional reads from SCTLR_EL1 in v2.

Thanks for the investigation!

I'm not too surprised by this outcome.  There is a possibility that
reading SCTLR_EL1 sucks badly on some implementations, but on others it
may be little more than a register rename.  We should probably wait for
evidence before panicking about it.

So long as this SCTLR_EL1 switching is completely disabled via
alternatives on hardware to which it isn't applicable, I expect that
should be a reasonable compromise.

[...]

> > > > > > Do we need a way to query the enabled keys?
> > > > > >
> > > > > > We could either have a _GET_ prctl (the common approach), or have this
> > > > > > prctl return the mask of enabled keys (avoids the extra prctl, but
> > > > > > weirder).
> > > > > >
> > > > > > As above, we might
> > > > > >
> > > > > > Things like CRIU may need a GET in order to save/restore this setting
> > > > > > properly.
> > > > >
> > > > > Maybe it makes sense for there to be a GET prctl() to support CRIU.
> > > > > But it would need to be defined carefully because CRIU would
> > > > > presumably need to know what value to pass as the "keys" argument when
> > > > > it calls SET to restore the state. It can't just hardcode a value
> > > > > because that may harm extensibility if new keys are added.
> > > > >
> > > > > If we require the kernel to start processes with all keys enabled
> > > > > (including any keys that we may introduce in the future), then it
> > > > > seems like if CRIU knows which keys were disabled, then it can restore
> > > > > the state by issuing a syscall like this:
> > > > >
> > > > > prctl(PR_PAC_SET_ENABLED_KEYS, disabled_keys_mask, 0)
> > > > >
> > > > > which would imply that instead of providing PR_PAC_GET_ENABLED_KEYS we
> > > > > instead provide PR_PAC_GET_DISABLED_KEYS, which CRIU would call when
> > > > > saving the state to prepare the disabled_keys_mask argument passed
> > > > > here. Then for consistency we can make the SET prctl() be
> > > > > PR_PAC_SET_DISABLED_KEYS, and CRIU can then do:
> > > > >
> > > > > prctl(PR_PAC_SET_DISABLED_KEYS, disabled_keys_mask, disabled_keys_mask)
> > > > >
> > > > > Does that make sense?
> > > >
> > > > Possibly, though it's nicer if the GET returns the same value suppiled
> > > > to the SET, rather than the complement of it.
> > >
> > > Maybe you misread my proposal? The part about the complement is
> > > addressed by the sentence beginning "Then for consistency..." So we
> > > would end up with PR_PAC_SET_DISABLED_KEYS and
> > > PR_PAC_GET_DISABLED_KEYS, and they would both take/return a mask of
> > > *disabled* keys.
> >
> > Oh, I see.  Yes, while I prefer the two are consistent with each other,
> > we can flip the sense of both here if desired.
> >
> > >
> > > > If SET ignores set bits in arg3 that don't correspond to set bits in the
> > > > mask arg2, then
> > > >
> > > >         u64 current_keys = prctl(PR_PAC_GET_ENABLED_KEYS);
> > > >
> > > >         prctl(PR_PAC_SET_ENABLED_KEYS, ~0UL, current_keys);
> > > >
> > > > should work.
> > >
> > > I'd prefer not to allow the SET prctl to accept all-ones as the first
> > > argument, as this could lead to a scenario where sometimes people pass
> > > all-ones as the first argument and sometimes they don't, which could
> > > make it hard to add new keys in the future (since the addition of a
> > > new key could cause existing callers to affect the new key
> > > arbitrarily). Instead, the first argument should be required to
> > > specify only known keys, in order to enforce that the caller is
> > > explicit about which keys they intend to affect.
> >
> > That sounds fair.
> >
> > > > There's a final option, which is to expose this config through ptrace
> > > > instead for save/restore purposes.  From previous discussions with the
> > > > CRIU folks, I recall that they are trying to move away from doing setup
> > > > from within the new process where possible.
> > > >
> > > > There's no reason not to have both though -- there's precedent for that,
> > > > such as for PR_SVE_{GET,SET}_VL and the NT_ARM_SVE regset.  MTE may move
> > > > in a similar direction too IIUC.
> > > >
> > > > Having a GET remains useful for in-process debugging and diagnostics,
> > > > and it's extremely straightforward to add in the kernel.  So from my
> > > > side I'd vote to have it anyway...
> > >
> > > Okay. I see that the MTE series is adding NT_ARM_TAGGED_ADDR_CTRL as
> > > well, and I suppose that I could add a
> > > NT_ARM_PAC_(whatever)ABLED_KEYS. Since this wouldn't be used in normal
> > > userspace applications (and it doesn't look like ptrace APIs such as
> > > NT_ARM_PACA_KEYS will accommodate new keys being added so CRIU would
> > > need to be made aware of new keys anyway), we might as well just have
> > > all the APIs deal with the set of enabled keys.
> > >
> > > And if the ptrace is enough for CRIU, I'd mildly prefer not to add a
> > > GET prctl, since it's just more uapi surface that needs to be
> > > maintained forever. For in-process use cases it seems redundant with
> > > e.g. issuing a pacia instruction and checking whether it is a no-op,
> >
> > Probing instructions to see whether they generate signals is a cardinal
> > sin and we make huge efforts to ensure that userspace doesn't need to do
> > that.  As for the no-op check, I suppose that works, but it's nasty if
> > the generic "check all the required featues" boilerplate on program
> > entry relies on intrinsics or can't be written in C at all.  This is
> > just a way of discouraging people from checking at all IMHO.
> >
> >
> > The implementation of that prctl would most likely be a one-line wrapper
> > that calls the same internal function used to populate the regset for
> > ptrace, so I think you might be worrying a bit too much about creating
> > ABI here.  This is more providing a second interface for same thing,
> > because no single interface is usable in all situations.
> 
> Okay. Since I don't feel too strongly about it, I'll add the GET prctl in v2.

Thanks

---Dave



More information about the linux-arm-kernel mailing list