[PATCH] arm64: mte: optimize asynchronous tag check fault flag check

Peter Collingbourne pcc at google.com
Wed Nov 18 11:53:16 EST 2020


On Wed, Nov 18, 2020 at 3:24 AM Catalin Marinas <catalin.marinas at arm.com> wrote:
>
> On Tue, Nov 17, 2020 at 07:20:51PM -0800, Peter Collingbourne wrote:
> > We don't need to check for MTE support before checking the flag
> > because it can only be set if the hardware supports MTE. As a result
> > we can unconditionally check the flag bit, which is expected to be in
> > a register, so the check can be done in a single instruction instead
> > of first needing to load the hwcaps.
> >
> > On a DragonBoard 845c with a kernel built with CONFIG_ARM64_MTE=y with
> > the powersave governor this reduces the cost of a kernel entry/exit
> > (invalid syscall) from 465.1ns to 463.8ns.
>
> That's less than 0.3%, could be noise as well.

The numbers are quite stable following the methodology of [1] (one
thing that I forgot to mention there is running adb shell stop before
taking measurements). Out of 100 samples, about 75 were in the faster
cluster, ranging from 464.9966ns to 465.1515ns before and from
463.7776ns to 463.9312ns after.
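For reference, the measurement boils down to timing a large batch of
invalid syscalls from userspace and dividing. A minimal sketch (not
the exact harness from [1]; the iteration count, clock source and use
of syscall(-1) are my assumptions here) would be something like:

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	struct timespec start, end;
	const long iters = 10000000;
	long i;
	double ns;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++)
		syscall(-1);	/* invalid syscall: exercises entry/exit only */
	clock_gettime(CLOCK_MONOTONIC, &end);

	ns = (end.tv_sec - start.tv_sec) * 1e9 +
	     (end.tv_nsec - start.tv_nsec);
	printf("%.4f ns per syscall\n", ns / iters);
	return 0;
}

Stopping background services (the adb shell stop above) is presumably
what keeps the sample clusters this tight.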

> > diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
> > index e4c0dadf0d92..f533e83fd722 100644
> > --- a/arch/arm64/kernel/syscall.c
> > +++ b/arch/arm64/kernel/syscall.c
> > @@ -123,7 +123,7 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
> >       local_daif_restore(DAIF_PROCCTX);
> >       user_exit();
> >
> > -     if (system_supports_mte() && (flags & _TIF_MTE_ASYNC_FAULT)) {
> > +     if (IS_ENABLED(CONFIG_ARM64_MTE) && (flags & _TIF_MTE_ASYNC_FAULT)) {
>
> system_supports_mte() is actually a static label (well, it expands to
> two). So we get a combination of nop/branch jumping over the flags
> check. Maybe the difference you are seeing is caused by the extra
> instructions changing the cache alignment or confusing the branch
> predictor.

Okay, I wasn't aware of that mechanism. From the disassembly of
syscall.o it did appear that we were loading from hwcap every time,
but looking at the compiler's -S output there are __jump_table
entries, so presumably the code does get patched at runtime as you
describe. There is still a conditional branch on the presumably
static result of the hwcap check, which may also be where part of the
performance loss is coming from.
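For anyone following along, my understanding of the mechanism is
roughly the following (simplified from memory of
arch/arm64/include/asm/cpufeature.h around this kernel version; the
exact code may differ):

/*
 * Each capability has a static key, so once the capabilities are
 * finalised the test compiles down to a patched nop/branch rather
 * than a load from the hwcaps at runtime.
 */
static __always_inline bool __cpus_have_const_cap(int num)
{
	if (num >= ARM64_NCAPS)
		return false;
	return static_branch_unlikely(&cpu_hwcap_keys[num]);
}

static __always_inline bool cpus_have_const_cap(int num)
{
	/* A second static key guards the not-yet-finalised fallback. */
	if (system_capabilities_finalized())
		return __cpus_have_const_cap(num);
	else
		return cpus_have_cap(num);
}

so the hwcap load I saw in the .o presumably only survives in the
fallback path taken before the keys are patched.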

> I wonder whether changing __cpus_have_const_cap() to use
> static_branch_likely() has any effect (given that we expect most
> features to be enabled in future CPUs, such change would probably make
> sense).
>
> That said, I'm happy to take this patch, most likely it replaces some
> static branch with tbnz on this path. Given that MTE is on in defconfig,
> do we actually need the IS_ENABLED() check? If we remove it, it would be
> more consistent with the similar check in do_notify_resume().

If we don't mind the additional conditional branch on
CONFIG_ARM64_MTE=n kernels, we don't need the IS_ENABLED() check.
I'll remove it.
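So the v2 hunk would simply become something like:

-	if (IS_ENABLED(CONFIG_ARM64_MTE) && (flags & _TIF_MTE_ASYNC_FAULT)) {
+	if (flags & _TIF_MTE_ASYNC_FAULT) {

relying on the flag never being set on CONFIG_ARM64_MTE=n kernels,
and matching the corresponding check in do_notify_resume().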

Peter

[1] https://lore.kernel.org/linux-arm-kernel/CAMn1gO7HCJiXEzDvBb-=T7znATqyaxE_t-zezqKL0dyXRCG-nQ@mail.gmail.com/


