[PATCH v2] arm64: mte: switch GCR_EL1 on task switch rather than entry/exit

Thu Jul 8 18:50:09 PDT 2021

On Mon, Jul 5, 2021 at 5:52 AM Catalin Marinas <catalin.marinas at arm.com> wrote:
>
> On Fri, Jul 02, 2021 at 12:45:18PM -0700, Peter Collingbourne wrote:
> > Accessing GCR_EL1 and issuing an ISB can be expensive on some
> > microarchitectures. To avoid taking this performance hit on every
> > kernel entry/exit, switch GCR_EL1 on task switch rather than
> > entry/exit. This is essentially a revert of commit bad1e1c663e0
> > ("arm64: mte: switch GCR_EL1 in kernel entry and exit").
>
> As per the discussion in v1, we can avoid an ISB, though we are still
> left with the GCR_EL1 access. I'm surprised that access to a non
> self-synchronising register is that expensive but I suspect the
> benchmark is just timing a dummy syscall. I'm not asking for numbers but
> I'd like to make sure we don't optimise for unrealistic use-cases. Is
> something like a geekbench score affected for example?

FWIW, I was using this test program:
https://patchwork.kernel.org/project/linux-arm-kernel/patch/20200801011152.39838-1-pcc@google.com/#23572981

Since it's an invalid syscall it's a good way to measure the effect of
changes to entry/exit in isolation, but it does mean that we need to
be careful when also making changes elsewhere in the kernel, as will
become apparent in a moment.
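
To give a rough idea of what such a measurement looks like, here is a
minimal sketch. It is not the linked program: the function names are
made up for illustration, and it assumes a Linux system where syscall
number -1 is rejected with ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Issue one invalid syscall and return the resulting errno. */
int invalid_syscall_errno(void)
{
	errno = 0;
	syscall(-1);
	return errno;
}

/*
 * Hypothetical helper: average wall-clock nanoseconds per invalid
 * syscall. The kernel does almost no work for these calls, so the
 * result is dominated by the entry/exit path.
 */
double ns_per_invalid_syscall(long iters)
{
	struct timespec start, end;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++)
		syscall(-1);
	clock_gettime(CLOCK_MONOTONIC, &end);

	return ((end.tv_sec - start.tv_sec) * 1e9 +
		(end.tv_nsec - start.tv_nsec)) / (double)iters;
}
```

Calling ns_per_invalid_syscall() with a large iteration count before
and after an entry/exit change gives a per-syscall delta, but by
construction it says nothing about costs incurred elsewhere in the
kernel.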

> While we can get rid of the IRG in the kernel, at some point we may want
> to use ADDG as generated by the compiler. That too is affected by the
> GCR_EL1.Exclude mask.
>
> > This requires changing how we generate random tags for HW tag-based
> > KASAN, since at this point IRG would use the user's exclusion mask,
> > which may not be suitable for kernel use. In this patch I chose to take
> > the modulus of CNTVCT_EL0, however alternative approaches are possible.
>
> So a few successive mte_get_mem_tag() will give the same result if the
> counter hasn't changed. Even if ARMv8.6 requires a 1GHz timer frequency,
> I think an implementation is allowed to count in bigger increments.

Yes, I observed that Apple M1 for example counts in increments of 16.
Taking the modulus of the timer would happen to work as long as the
increment is small enough (since it would mean that the timer would
likely have incremented by the time we need to make another
allocation) and a power of 2 (to ensure that we permute through all of
the possible tag values), which I would expect to be the case on most
microarchitectures.
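
The power-of-2 condition can be checked with a small user-space
sketch. It assumes the tag is taken as the counter modulo an odd number
of usable tags; the value 15 used in the note below is only an
illustration, and the function name is hypothetical rather than
anything from the patch.

```c
#include <stdint.h>

/*
 * Count how many distinct values (counter % modulus) takes when the
 * counter advances by a fixed step. Both parameters are illustrative;
 * modulus must be in the range 1..64.
 */
unsigned int distinct_tags(uint64_t step, unsigned int modulus)
{
	uint64_t seen = 0;	/* bitmap of residues observed so far */
	uint64_t counter = 0;
	unsigned int i, count = 0;

	/* The residue sequence repeats after at most 'modulus' steps. */
	for (i = 0; i < modulus; i++) {
		uint64_t bit = 1ULL << (counter % modulus);

		if (!(seen & bit)) {
			seen |= bit;
			count++;
		}
		counter += step;
	}
	return count;
}
```

With an odd modulus such as 15, a step of 16 is coprime to the modulus,
so distinct_tags(16, 15) covers all 15 residues. A step that shares a
factor with the modulus, such as 3, only reaches a subset of them.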

However, I developed an in-kernel allocator microbenchmark which
revealed a more important issue with this patch: on most cores,
switching from IRG to reading the timer costs more than is gained by
moving from the single ISB patch to the GCR on task switch patch. This
means that if KASAN is enabled, a single allocation would wipe out the
performance improvement from avoiding touching GCR on entry/exit. I
also tried a number of alternative approaches and they were also too
expensive. So now I am less inclined to push for an approach that
avoids touching GCR on entry/exit.

> BTW, can you also modify mte_set_kernel_gcr to only do a write to the
> GCR_EL1 register rather than a read-modify-write?

Yes, this helps a bit. In v3 I now do this as well as using a single ISB.

Peter