[PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid

Ryan Roberts ryan.roberts at arm.com
Mon Jun 2 07:00:12 PDT 2025


On 02/06/2025 13:00, Will Deacon wrote:
> On Fri, May 30, 2025 at 04:23:47PM +0100, Ryan Roberts wrote:
>> Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with
>> a parallel reclaim leaving stale TLB entries") describes a race that,
>> prior to the commit, could occur between reclaim and operations such as
>> mprotect() when using reclaim's tlbbatch mechanism. See that commit for
>> details but the summary is:
>>
>> """
>> Nadav Amit identified a theoretical race between page reclaim and
>> mprotect due to TLB flushes being batched outside of the PTL being held.
>>
>> He described the race as follows:
>>
>> 	CPU0				CPU1
>> 	----				----
>> 					user accesses memory using RW PTE
>> 					[PTE now cached in TLB]
>> 	try_to_unmap_one()
>> 	==> ptep_get_and_clear()
>> 	==> set_tlb_ubc_flush_pending()
>> 					mprotect(addr, PROT_READ)
>> 					==> change_pte_range()
>> 					==> [ PTE non-present - no flush ]
>>
>> 					user writes using cached RW PTE
>> 	...
>>
>> 	try_to_unmap_flush()
>> """
>>
>> The solution was to insert flush_tlb_batched_pending() in mprotect() and
>> friends to explicitly drain any pending reclaim TLB flushes. In the
>> modern version of this solution, arch_flush_tlb_batched_pending() is
>> called to do that synchronisation.
>>
>> arm64's tlbbatch implementation simply issues TLBIs at queue-time
>> (arch_tlbbatch_add_pending()), eliding the trailing dsb(ish). The
>> trailing dsb(ish) is finally issued in arch_tlbbatch_flush() at the end
>> of the batch to wait for all the issued TLBIs to complete.
>>
>> Now, the Arm ARM states:
>>
>> """
>> The completion of the TLB maintenance instruction is guaranteed only by
>> the execution of a DSB by the observer that performed the TLB
>> maintenance instruction. The execution of a DSB by a different observer
>> does not have this effect, even if the DSB is known to be executed after
>> the TLB maintenance instruction is observed by that different observer.
>> """
>>
>> arch_tlbbatch_add_pending() and arch_tlbbatch_flush() conform to this
>> requirement because they are called from the same task (either kswapd or
>> the caller of madvise(MADV_PAGEOUT)), so either they run on the same CPU
>> or, if the task was migrated, __switch_to() contains an extra dsb(ish).
>>
>> HOWEVER, arm64's arch_flush_tlb_batched_pending() is also implemented as
>> a dsb(ish). But this may be running on a CPU remote from the one that
>> issued the outstanding TLBIs. So there is no architectural guarantee of
>> synchronisation. Therefore we are still vulnerable to the theoretical
>> race described in Commit 3ea277194daa ("mm, mprotect: flush TLB if
>> potentially racing with a parallel reclaim leaving stale TLB entries").
>>
>> Fix this by flushing the entire mm in arch_flush_tlb_batched_pending().
>> This aligns with what the other arches that implement the tlbbatch
>> feature do.
>>
>> Fixes: 43b3dfdd0455 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration")
> 
> Barry -- it would be great if you could re-run some of the benchmarks
> from that commit with this fix applied.

Worth rerunning if possible, but I would guess that those benchmarks will still
show a similar improvement because they are measuring the cost of doing the
TLB flushing. With the fix, though, there is an extra cost that those benchmarks
probably won't measure: subsequent work within the target mm will have no VAs
cached in the TLB, so the miss rate will be much higher.
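
For anyone following along without the sources to hand, the shape of the arm64
helpers described in the quoted message is roughly the following (a simplified
sketch paraphrased from memory, not the literal tree contents; the internals of
__flush_tlb_page_nosync() are elided):

	static inline void
	arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
				  struct mm_struct *mm, unsigned long uaddr)
	{
		/* Issue the TLBI for this page now, without the trailing DSB. */
		__flush_tlb_page_nosync(mm, uaddr);
	}

	static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
	{
		/*
		 * Wait for all the TLBIs issued above to complete. This runs in
		 * the same task that issued them, so the completion guarantee
		 * from the Arm ARM quote above applies.
		 */
		dsb(ish);
	}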

> 
>> Signed-off-by: Ryan Roberts <ryan.roberts at arm.com>
>> ---
>>  arch/arm64/include/asm/tlbflush.h | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index eba1a98657f1..7d564c2a126f 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -323,13 +323,14 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>  }
>>
>>  /*
>> - * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
>> - * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
>> - * flush_tlb_batched_pending().
>> + * If mprotect/munmap/etc occurs during TLB batched flushing, we need to ensure
>> + * all the previously issued TLBIs targeting mm have completed. But since we
>> + * can be executing on a remote CPU, a DSB cannot guarantee this like it can
>> + * for arch_tlbbatch_flush(). Our only option is to flush the entire mm.
>>   */
>>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
>>  {
>> -	dsb(ish);
>> +	flush_tlb_mm(mm);
>>  }
> 
> Thanks, Ryan. I'll pick this as a fix, but perhaps the core code should
> do this given that all the architectures selecting
> ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH now have an identical implementation
> of arch_flush_tlb_batched_pending()?

Ha, yes... in fact it looks like that's what it did prior to commit db6c1f6f236d
("mm/tlbbatch: introduce arch_flush_tlb_batched_pending()").

I'll do that tidy up once this fix appears in mm-unstable.
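
Roughly what I have in mind is for the core code to call flush_tlb_mm() directly
and drop the per-arch hook, something like the below (untested sketch,
paraphrased from memory, so the pending/flushed bookkeeping may not match the
tree exactly):

	/* mm/rmap.c */
	void flush_tlb_batched_pending(struct mm_struct *mm)
	{
		int batch = atomic_read(&mm->tlb_flush_batched);
		int pending = batch & TLB_FLUSH_BATCH_PENDING_MASK;
		int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;

		if (pending != flushed) {
			/* Every arch now wants the same thing: flush the whole mm. */
			flush_tlb_mm(mm);
			/*
			 * If new flushes became pending while we were flushing,
			 * leave mm->tlb_flush_batched as-is so they aren't lost.
			 */
			atomic_cmpxchg(&mm->tlb_flush_batched, batch,
				       pending | (pending << TLB_FLUSH_BATCH_FLUSHED_SHIFT));
		}
	}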

Thanks,
Ryan


> 
> Will



