[PATCH v2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
Will Deacon
will at kernel.org
Sun Jun 14 04:04:44 PDT 2026
On Sat, May 23, 2026 at 07:17:10PM +0530, Linu Cherian wrote:
> From: Ryan Roberts <ryan.roberts at arm.com>
>
> There are 3 variants of tlb flush that invalidate user mappings:
> flush_tlb_mm(), flush_tlb_page() and __flush_tlb_range(). All of these
> would previously unconditionally broadcast their tlbis to all cpus in
> the inner shareable domain.
>
> But this is a waste of effort if we can prove that the mm for which we
> are flushing the mappings has only ever been active on the local cpu. In
> that case, it is safe to avoid the broadcast and simply invalidate the
> current cpu.
>
> So let's track in mm_context_t::active_cpu either the mm has never been
> active on any cpu, has been active on more than 1 cpu, or has been
> active on precisely 1 cpu - and in that case, which one. We update this
> when switching context, being careful to ensure that it gets updated
> *before* installing the mm's pgtables. On the reader side, we ensure we
> read *after* the previous write(s) to the pgtable(s) that necessitated
> the tlb flush have completed. This guarrantees that if a cpu that is
> doing a tlb flush sees it's own id in active_cpu, then the old pgtable
> entry cannot have been seen by any other cpu and we can flush only the
> local cpu.
>
> Signed-off-by: Ryan Roberts <ryan.roberts at arm.com>
> Reviewed-by: Catalin Marinas <catalin.marinas at arm.com>
> Tested-by: Huang Ying <ying.huang at linux.alibaba.com>
> [linu.cherian at arm.com: Adapted for v7.1 flush tlb API changes]
> Signed-off-by: Linu Cherian <linu.cherian at arm.com>
> ---
> Changelog from RFC v1:
> - Adapted for v7.1 flush tlb API changes
> No changes in core logic
> - Collected Rb and Tb tags
> - lat_mmap benchmark showed dsb(ishst) performs better than dsb(ish),
> hence retained dsb(ishst) in flush_tlb_user_pre
>
>
> Testing with 7.1-rc4 :
> +-----------------------+---------------------------------------------------+-------------+
> | Benchmark | Result Class | Improvement|
> +=======================+===================================================+=============+
> | perf/syscall | fork (ops/sec) | (I) 3.25% |
> +-----------------------+---------------------------------------------------+-------------+
> | pts/memtier-benchmark | Protocol: Redis Clients: 100 Ratio: 1:5 (Ops/sec) | (I) 2.70% |
> | | Protocol: Redis Clients: 100 Ratio: 5:1 (Ops/sec) | (I) 2.13% |
> +-----------------------+---------------------------------------------------+-------------+
I think we need a much more comprehensive set of benchmarks before we can
begin to consider a change like this.
> arch/arm64/include/asm/mmu.h | 12 +++
> arch/arm64/include/asm/mmu_context.h | 2 +
> arch/arm64/include/asm/tlbflush.h | 127 +++++++++++++++++++++------
> arch/arm64/mm/context.c | 30 ++++++-
> 4 files changed, 141 insertions(+), 30 deletions(-)
Doesn't this break BTM/SVM with the SMMU? I think that's a non-starter
even if you can provide some more compelling numbers.
> +static inline bool flush_tlb_user_pre(struct mm_struct *mm, tlbf_t flags)
> +{
> + unsigned int self, active;
> + bool local;
> +
> + migrate_disable();
> +
> + if (flags & TLBF_NOBROADCAST) {
> + dsb(nshst);
> + return true;
> + }
Why does the NOBROADCAST case need migration disabled? It didn't before...
> +
> + self = smp_processor_id();
> +
> + /*
> + * The load of mm->context.active_cpu must not be reordered before the
> + * store to the pgtable that necessitated this flush. This ensures that
> + * if the value read is our cpu id, then no other cpu can have seen the
> + * old pgtable value and therefore does not need this old value to be
> + * flushed from its tlb. But we don't want to upgrade the dsb(ishst),
> + * needed to make the pgtable updates visible to the walker, to a
> + * dsb(ish) by default. So speculatively load without a barrier and if
> + * it indicates our cpu id, then upgrade the barrier and re-load.
> + */
> + active = READ_ONCE(mm->context.active_cpu);
> + if (active == self) {
> + dsb(ish);
> + active = READ_ONCE(mm->context.active_cpu);
> + } else {
> + dsb(ishst);
> + }
Why can't you just do:
dsb(ishst);
active = READ_ONCE(mm->context.active_cpu);
?
> +
> + local = active == self;
> + if (!local)
> + migrate_enable();
> +
> + return local;
> +}
> +
> +static inline void flush_tlb_user_post(bool local)
> +{
> + if (local)
> + migrate_enable();
> +}
I was under the impression that disabling/enabling migration was an
expensive thing to do, so I'd really want to see some more numbers to
justify this (including from inside a VM) and allow us to consider the
trade-offs properly. It's also not at all clear to me that it's safe
from such a low-level TLB invalidation helper.
> +
> /*
> * TLB Invalidation
> * ================
> @@ -408,12 +482,20 @@ static inline void flush_tlb_all(void)
> static inline void flush_tlb_mm(struct mm_struct *mm)
> {
> unsigned long asid;
> + bool local;
>
> - dsb(ishst);
> + local = flush_tlb_user_pre(mm, TLBF_NONE);
> asid = __TLBI_VADDR(0, ASID(mm));
> - __tlbi(aside1is, asid);
> - __tlbi_user(aside1is, asid);
> - __tlbi_sync_s1ish(mm);
> + if (local) {
> + __tlbi(aside1, asid);
> + __tlbi_user(aside1, asid);
> + dsb(nsh);
> + } else {
> + __tlbi(aside1is, asid);
> + __tlbi_user(aside1is, asid);
> + __tlbi_sync_s1ish(mm);
> + }
> + flush_tlb_user_post(local);
I think you've changed this since Ryan's original patch, but why are you
only calling __tlbi_sync_s1ish() for the !local case? Doesn't that break
the erratum workaround when running as a VM if the vCPU is migrated?
> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
> index 0f4a28b87469..f34ed78393e0 100644
> --- a/arch/arm64/mm/context.c
> +++ b/arch/arm64/mm/context.c
> @@ -214,9 +214,10 @@ static u64 new_context(struct mm_struct *mm)
>
> void check_and_switch_context(struct mm_struct *mm)
> {
> - unsigned long flags;
> - unsigned int cpu;
> + unsigned int cpu = smp_processor_id();
> u64 asid, old_active_asid;
> + unsigned int active;
> + unsigned long flags;
>
> if (system_supports_cnp())
> cpu_set_reserved_ttbr0();
> @@ -251,7 +252,6 @@ void check_and_switch_context(struct mm_struct *mm)
> atomic64_set(&mm->context.id, asid);
> }
>
> - cpu = smp_processor_id();
> if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
> local_flush_tlb_all();
>
> @@ -262,6 +262,30 @@ void check_and_switch_context(struct mm_struct *mm)
>
> arm64_apply_bp_hardening();
>
> + /*
> + * Update mm->context.active_cpu in such a manner that we avoid cmpxchg
> + * and dsb unless we definitely need it. If initially ACTIVE_CPU_NONE
> + * then we are the first cpu to run so set it to our id. If initially
> + * any id other than ours, we are the second cpu to run so set it to
> + * ACTIVE_CPU_MULTIPLE. If we update the value then we must issue
> + * dsb(ishst) to ensure stores to mm->context.active_cpu are ordered
> + * against the TTBR0 write in cpu_switch_mm()/uaccess_enable(); the
> + * store must be visible to another cpu before this cpu could have
> + * populated any TLB entries based on the pgtables that will be
> + * installed.
> + */
> + active = READ_ONCE(mm->context.active_cpu);
> + if (active != cpu && active != ACTIVE_CPU_MULTIPLE) {
> + if (active == ACTIVE_CPU_NONE)
> + active = cmpxchg_relaxed(&mm->context.active_cpu,
> + ACTIVE_CPU_NONE, cpu);
> +
> + if (active != ACTIVE_CPU_NONE)
> + WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_MULTIPLE);
> +
> + dsb(ishst);
> + }
> +
Can you simplify the 'if' condition here?
if (active == ACTIVE_CPU_NONE) {
if (!try_cmpxchg_relaxed(...))
WRITE_ONCE(...);
dsb(ishst);
}
(as an aside, maybe we should implement arch_try_cmpxchg{,_relaxed} so
we could drop the READ_ONCE() here as well?)
Will
More information about the linux-arm-kernel
mailing list