[PATCH v2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu

Sun Jun 14 04:04:44 PDT 2026

On Sat, May 23, 2026 at 07:17:10PM +0530, Linu Cherian wrote:
> From: Ryan Roberts <ryan.roberts at arm.com>
> 
> There are 3 variants of tlb flush that invalidate user mappings:
> flush_tlb_mm(), flush_tlb_page() and __flush_tlb_range(). All of these
> would previously unconditionally broadcast their tlbis to all cpus in
> the inner shareable domain.
> 
> But this is a waste of effort if we can prove that the mm for which we
> are flushing the mappings has only ever been active on the local cpu. In
> that case, it is safe to avoid the broadcast and simply invalidate the
> current cpu.
> 
> So let's track in mm_context_t::active_cpu either the mm has never been
> active on any cpu, has been active on more than 1 cpu, or has been
> active on precisely 1 cpu - and in that case, which one. We update this
> when switching context, being careful to ensure that it gets updated
> *before* installing the mm's pgtables. On the reader side, we ensure we
> read *after* the previous write(s) to the pgtable(s) that necessitated
> the tlb flush have completed. This guarrantees that if a cpu that is
> doing a tlb flush sees it's own id in active_cpu, then the old pgtable
> entry cannot have been seen by any other cpu and we can flush only the
> local cpu.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts at arm.com>
> Reviewed-by: Catalin Marinas <catalin.marinas at arm.com>
> Tested-by: Huang Ying <ying.huang at linux.alibaba.com>
> [linu.cherian at arm.com: Adapted for v7.1 flush tlb API changes]
> Signed-off-by: Linu Cherian <linu.cherian at arm.com>
> ---
> Changelog from RFC v1:
> - Adapted for v7.1 flush tlb API changes
>   No changes in core logic
> - Collected Rb and Tb tags
> - lat_mmap benchmark showed dsb(ishst) performs better than dsb(ish),
>   hence retained dsb(ishst) in flush_tlb_user_pre	
> 
> 
> Testing with 7.1-rc4 :
> +-----------------------+---------------------------------------------------+-------------+
> | Benchmark             | Result Class                                      |  Improvement|  
> +=======================+===================================================+=============+
> | perf/syscall          | fork (ops/sec)                                    |   (I) 3.25% |
> +-----------------------+---------------------------------------------------+-------------+
> | pts/memtier-benchmark | Protocol: Redis Clients: 100 Ratio: 1:5 (Ops/sec) |   (I) 2.70% |
> | 			| Protocol: Redis Clients: 100 Ratio: 5:1 (Ops/sec) |   (I) 2.13% |
> +-----------------------+---------------------------------------------------+-------------+

I think we need a much more comprehensive set of benchmarks before we can
begin to consider a change like this.

>  arch/arm64/include/asm/mmu.h         |  12 +++
>  arch/arm64/include/asm/mmu_context.h |   2 +
>  arch/arm64/include/asm/tlbflush.h    | 127 +++++++++++++++++++++------
>  arch/arm64/mm/context.c              |  30 ++++++-
>  4 files changed, 141 insertions(+), 30 deletions(-)

Doesn't this break BTM/SVM with the SMMU? I think that's a non-starter
even if you can provide some more compelling numbers.

> +static inline bool flush_tlb_user_pre(struct mm_struct *mm, tlbf_t flags)
> +{
> +	unsigned int self, active;
> +	bool local;
> +
> +	migrate_disable();
> +
> +	if (flags & TLBF_NOBROADCAST) {
> +		dsb(nshst);
> +		return true;
> +	}

Why does the NOBROADCAST case need migration disabled? It didn't before...

> +
> +	self = smp_processor_id();
> +
> +	/*
> +	 * The load of mm->context.active_cpu must not be reordered before the
> +	 * store to the pgtable that necessitated this flush. This ensures that
> +	 * if the value read is our cpu id, then no other cpu can have seen the
> +	 * old pgtable value and therefore does not need this old value to be
> +	 * flushed from its tlb. But we don't want to upgrade the dsb(ishst),
> +	 * needed to make the pgtable updates visible to the walker, to a
> +	 * dsb(ish) by default. So speculatively load without a barrier and if
> +	 * it indicates our cpu id, then upgrade the barrier and re-load.
> +	 */
> +	active = READ_ONCE(mm->context.active_cpu);
> +	if (active == self) {
> +		dsb(ish);
> +		active = READ_ONCE(mm->context.active_cpu);
> +	} else {
> +		dsb(ishst);
> +	}

Why can't you just do:

	dsb(ishst);
	active = READ_ONCE(mm->context.active_cpu);

?

> +
> +	local = active == self;
> +	if (!local)
> +		migrate_enable();
> +
> +	return local;
> +}
> +
> +static inline void flush_tlb_user_post(bool local)
> +{
> +	if (local)
> +		migrate_enable();
> +}

I was under the impression that disabling/enabling migration was an
expensive thing to do, so I'd really want to see some more numbers to
justify this (including from inside a VM) and allow us to consider the
trade-offs properly. It's also not at all clear to me that it's safe
from such a low-level TLB invalidation helper.

> +
>  /*
>   *	TLB Invalidation
>   *	================
> @@ -408,12 +482,20 @@ static inline void flush_tlb_all(void)
>  static inline void flush_tlb_mm(struct mm_struct *mm)
>  {
>  	unsigned long asid;
> +	bool local;
>  
> -	dsb(ishst);
> +	local = flush_tlb_user_pre(mm, TLBF_NONE);
>  	asid = __TLBI_VADDR(0, ASID(mm));
> -	__tlbi(aside1is, asid);
> -	__tlbi_user(aside1is, asid);
> -	__tlbi_sync_s1ish(mm);
> +	if (local) {
> +		__tlbi(aside1, asid);
> +		__tlbi_user(aside1, asid);
> +		dsb(nsh);
> +	} else {
> +		__tlbi(aside1is, asid);
> +		__tlbi_user(aside1is, asid);
> +		__tlbi_sync_s1ish(mm);
> +	}
> +	flush_tlb_user_post(local);

I think you've changed this since Ryan's original patch, but why are you
only calling __tlbi_sync_s1ish() for the !local case? Doesn't that break
the erratum workaround when running as a VM if the vCPU is migrated?

> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
> index 0f4a28b87469..f34ed78393e0 100644
> --- a/arch/arm64/mm/context.c
> +++ b/arch/arm64/mm/context.c
> @@ -214,9 +214,10 @@ static u64 new_context(struct mm_struct *mm)
>  
>  void check_and_switch_context(struct mm_struct *mm)
>  {
> -	unsigned long flags;
> -	unsigned int cpu;
> +	unsigned int cpu = smp_processor_id();
>  	u64 asid, old_active_asid;
> +	unsigned int active;
> +	unsigned long flags;
>  
>  	if (system_supports_cnp())
>  		cpu_set_reserved_ttbr0();
> @@ -251,7 +252,6 @@ void check_and_switch_context(struct mm_struct *mm)
>  		atomic64_set(&mm->context.id, asid);
>  	}
>  
> -	cpu = smp_processor_id();
>  	if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
>  		local_flush_tlb_all();
>  
> @@ -262,6 +262,30 @@ void check_and_switch_context(struct mm_struct *mm)
>  
>  	arm64_apply_bp_hardening();
>  
> +	/*
> +	 * Update mm->context.active_cpu in such a manner that we avoid cmpxchg
> +	 * and dsb unless we definitely need it. If initially ACTIVE_CPU_NONE
> +	 * then we are the first cpu to run so set it to our id. If initially
> +	 * any id other than ours, we are the second cpu to run so set it to
> +	 * ACTIVE_CPU_MULTIPLE. If we update the value then we must issue
> +	 * dsb(ishst) to ensure stores to mm->context.active_cpu are ordered
> +	 * against the TTBR0 write in cpu_switch_mm()/uaccess_enable(); the
> +	 * store must be visible to another cpu before this cpu could have
> +	 * populated any TLB entries based on the pgtables that will be
> +	 * installed.
> +	 */
> +	active = READ_ONCE(mm->context.active_cpu);
> +	if (active != cpu && active != ACTIVE_CPU_MULTIPLE) {
> +		if (active == ACTIVE_CPU_NONE)
> +			active = cmpxchg_relaxed(&mm->context.active_cpu,
> +						 ACTIVE_CPU_NONE, cpu);
> +
> +		if (active != ACTIVE_CPU_NONE)
> +			WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_MULTIPLE);
> +
> +		dsb(ishst);
> +	}
> +

Can you simplify the 'if' condition here?

	if (active == ACTIVE_CPU_NONE) {
		if (!try_cmpxchg_relaxed(...))
			WRITE_ONCE(...);

		dsb(ishst);
	}

(as an aside, maybe we should implement arch_try_cmpxchg{,_relaxed} so
 we could drop the READ_ONCE() here as well?)

Will