[RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

David Hildenbrand david at redhat.com
Wed Apr 10 13:09:27 PDT 2024


[...]

Skipping the ptdesc stuff we discussed offline, to not get distracted. :)

This stuff is killing me, sorry for the lengthy reply ...

> 
> So I've been looking at all this again, and getting myself even more confused.
> 
> I believe there are 2 classes of ptep_get_lockless() caller:
> 
> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
> 2) everyone else

Likely only completely lockless page table walkers where we *cannot* 
recheck under PTL is special. Essentially where we disable interrupts 
and rely on TLB flushes to sync against concurrent changes.

Let's take a look where ptep_get_lockless() comes from:

commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3
Author: Peter Zijlstra <peterz at infradead.org>
Date:   Fri Nov 13 11:41:40 2020 +0100

     mm/gup: Provide gup_get_pte() more generic

     In order to write another lockless page-table walker, we need
     gup_get_pte() exposed. While doing that, rename it to
     ptep_get_lockless() to match the existing ptep_get() naming.


GUP-fast, when we were still relying on TLB flushes to sync against 
GUP-fast.

"With get_user_pages_fast(), we walk down the pagetables without taking 
any locks.  For this we would like to load the pointers atomically, but 
sometimes that is not possible (e.g. without expensive cmpxchg8b on 
x86_32 PAE).  What we do have is the guarantee that a PTE will only 
either go from not present to present, or present to not present or both 
-- it will not switch to a completely different present page without a 
TLB flush in between; something hat we are blocking by holding 
interrupts off."

Later, we added support for GUP-fast that introduced the !TLB variant:

commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
Author: Steve Capper <steve.capper at linaro.org>
Date:   Thu Oct 9 15:29:14 2014 -0700

     mm: introduce a general RCU get_user_pages_fast()

With the pattern

/*
  * In the line below we are assuming that the pte can be read
  * atomically. If this is not the case for your architecture,
  * please wrap this in a helper function!
  *
  * for an example see gup_get_pte in arch/x86/mm/gup.c
  */
pte_t pte = ACCESS_ONCE(*ptep);
...
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
...


Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte = 
gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read 
was required to be sane, this the lengthy comment above 
ptep_get_lockless() that talks about TLB flushes.

The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is 
still full of details about TLB flushes syncing against GUP-fast. But as 
you note, we use it even in contexts where we don't disable interrupts 
and the TLB flush can't help.

We don't disable interrupts during page faults ... so most of the things 
described in ptep_get_lockless() don't really apply.

That's also the reason why ...
> 
>                  vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> -               vmf->orig_pte = *vmf->pte;
> +               vmf->orig_pte = ptep_get_lockless(vmf->pte);
>                  vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
> 
> -               /*
> -                * some architectures can have larger ptes than wordsize,
> -                * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
> -                * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
> -                * accesses.  The code below just needs a consistent view
> -                * for the ifs and we later double check anyway with the
> -                * ptl lock held. So here a barrier will do.
> -                */
> -               barrier();
>                  if (pte_none(vmf->orig_pte)) {

... that code was in place. We would possibly read garbage PTEs, but 
would recheck *under PTL* (where the PTE can no longer change) that what 
we read wasn't garbage and didn't change.

> 
> --
> 
> (2) All the other users require that a subset of the pte fields are
> self-consistent; specifically they don't care about access, dirty, uffd-wp or
> soft-dirty. arm64 can guarrantee that all the other bits are self-consistent
> just by calling ptep_get().

Let's focus on access+dirty for now ;)

> 
> --
> 
> So, I'm making the bold claim that it was never neccessary to specialize
> pte_get_lockless() on arm64, and I *think* we could just delete it so that
> ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
> aim without needing to introduce "norecency" variants.
> 
> Additionally I propose documenting ptep_get_lockless() to describe the set of
> fields that are guarranteed to be self-consistent and the remaining fields which
> are self-consistent only with best-effort.
> 
> Could it be this easy? My head is hurting...

I think what has to happen is:

(1) pte_get_lockless() must return the same value as ptep_get() as long 
as there are no races. No removal/addition of access/dirty bits etc.

(2) Lockless page table walkers that later verify under the PTL can 
handle serious "garbage PTEs". This is our page fault handler.

(3) Lockless page table walkers that cannot verify under PTL cannot 
handle arbitrary garbage PTEs. This is GUP-fast. Two options:

(3a) pte_get_lockless() can atomically read the PTE: We re-check later 
if the atomically-read PTE is still unchanged (without PTL). No IPI for 
TLB flushes required. This is the common case. HW might concurrently set 
access/dirty bits, so we can race with that. But we don't read garbage.

(3b) pte_get_lockless() cannot atomically read the PTE: We need special 
magic to read somewhat-sane PTEs and need IPIs during TLB flushes to 
sync against serious PTE changes (e.g., present -> present). This is 
weird x86-PAE.


If ptep_get() on arm64 can do (1), (2) and (3a), we might be good.

My head is hurting ...

-- 
Cheers,

David / dhildenb




More information about the linux-arm-kernel mailing list