[RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

Tue Apr 9 09:35:38 PDT 2024

On 08/04/2024 09:36, David Hildenbrand wrote:
> On 03.04.24 14:59, Ryan Roberts wrote:
>> On 27/03/2024 09:34, David Hildenbrand wrote:
>>> On 26.03.24 18:51, Ryan Roberts wrote:
>>>> On 26/03/2024 17:39, David Hildenbrand wrote:
>>>>> On 26.03.24 18:32, Ryan Roberts wrote:
>>>>>> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>>>>>>
>>>>>>>>>>> Likely, we just want to read "the real deal" on both sides of the
>>>>>>>>>>> pte_same()
>>>>>>>>>>> handling.
>>>>>>>>>>
>>>>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if
>>>>>>>>>> we do
>>>>>>>>>> that, we still have to keep pte_get_lockless() around for this case.
>>>>>>>>>> In an
>>>>>>>>>> ideal
>>>>>>>>>> world we would convert everything over to
>>>>>>>>>> ptep_get_lockless_norecency() and
>>>>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>>>>>>
>>>>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really
>>>>>>>>> affect
>>>>>>>>> any
>>>>>>>>> architecture.
>>>>>>>>>
>>>>>>>>> I do wonder if pte_same_norecency() should be defined per architecture
>>>>>>>>> and the
>>>>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>>>>>>> architectures.
>>>>>>>>
>>>>>>>> Wouldn't that break it's semantics? The "norecency" of
>>>>>>>> ptep_get_lockless_norecency() means "recency information in the returned
>>>>>>>> pte
>>>>>>>> may
>>>>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore
>>>>>>>> the
>>>>>>>> access and dirty bits when you do the comparison".
>>>>>>>
>>>>>>> My idea was that ptep_get_lockless_norecency() would return the actual
>>>>>>> result on
>>>>>>> these architectures. So e.g., on x86, there would be no actual change in
>>>>>>> generated code.
>>>>>>
>>>>>> I think this is a bad plan... You'll end up with subtle differences between
>>>>>> architectures.
>>>>>>
>>>>>>>
>>>>>>> But yes, the documentation of these functions would have to be improved.
>>>>>>>
>>>>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>>>>>>> dirty/accessed bits to more easily find any actual issues where the bits
>>>>>>> still
>>>>>>> matter ...
>>>>>>
>>>>>> I did a version that took that approach. Decided it was not as good as
>>>>>> this way
>>>>>> though. Now for the life of me, I can't remember my reasoning.
>>>>>
>>>>> Maybe because there are some code paths that check accessed/dirty without
>>>>> "correctness" implications? For example, if the PTE is already dirty, no
>>>>> need to
>>>>> set it dirty etc?
>>>>
>>>> I think I decided I was penalizing the architectures that don't care because
>>>> all
>>>> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
>>>> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
>>>> that I could feed its result into pte_same().
>>>
>>> True. With ptep_get_norecency() you're also penalizing other architectures.
>>> Therefore my original thought about making the behavior arch-specific, but the
>>> arch has to make sure to get the combination of
>>> ptep_get_lockless_norecency()+ptep_same_norecency() is right.
>>>
>>> So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must
>>> make sure to also ignore them in ptep_same_norecency(), and must be able to
>>> handle access/dirty bit changes differently.
>>>
>>> Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed
>>> accessed or dirty". Only the former would end up ignoring stuff here, the latter
>>> would not.
>>>
>>> But again, just some random thoughts how this affects other architectures and
>>> how we could avoid it. The issue I describe in patch #3 would be gone if
>>> ptep_same_norecency() would just do a ptep_same() check on other architectures
>>> -- and would make it easier to sell :)
>>>
>>
>> I've been thinking some more about this. I think your proposal is the following:
>>
>>
>> // ARM64
>> ptep_get_lockless_norecency()
>> {
>>     - returned access/dirty may be incorrect
>>     - returned access/dirty may be differently incorrect between 2 calls
>> }
>> pte_same_norecency()
>> {
>>     - ignore access/dirty when doing comparison
>> }
>> ptep_set_access_flags(ptep, pte)
>> {
>>     - must not assume access/dirty in pte are "more permissive" than
>>       access/dirty in *ptep
>>     - must only set access/dirty in *ptep, never clear
>> }
>>
>>
>> // Other arches: no change to generated code
>> ptep_get_lockless_norecency()
>> {
>>     return ptep_get_lockless();
>> }
>> pte_same_norecency()
>> {
>>     return pte_same();
>> }
>> ptep_set_access_flags(ptep, pte)
>> {
>>     - may assume access/dirty in pte are "more permissive" than access/dirty
>>       in *ptep
>>     - if no HW access/dirty updates, "*ptep = pte" always results in "more
>>       permissive" change
>> }
>>
>> An arch either specializes all 3 or none of them.
>>
>> This would allow us to get rid of ptep_get_lockless().
>>
>> And it addresses the bug you found with ptep_set_access_flags().
>>
>>
>> BUT, I still have a nagging feeling that there are likely to be other similar
>> problems caused by ignoring access/dirty during pte_same_norecency(). I can't
>> convince myself that its definitely all safe and robust.
> 
> Right, we'd have to identify the other possible cases and document what an arch
> + common code must stick to to make it work.
> 
> Some rules would be: if an arch implements ptep_get_lockless_norecency():
> 
> (1) Passing the result from ptep_get_lockless_norecency() to pte_same()
>     is wrong.
> (2) Checking pte_young()/pte_old/pte_dirty()/pte_clean() after
>     ptep_get_lockless_norecency() is very likely wrong.
> 
> 
>>
>> So I'm leaning towards dropping patch 3 and therefore keeping
>> ptep_get_lockless() around.
>>
>> Let me know if you have any insight that might help me change my mind :)
> 
> I'm wondering if it would help if we could find a better name (or concept) for
> "norecency" here, that expresses that only on some archs we'd have that fuzzy
> handling.
> 
> Keeping ptep_get_lockless() around for now sounds like the best alternative.

Separately to this I've been playing with an idea to add support for uffd-wp and
soft-dirty SW PTE bits for arm64; it boils down to keeping the SW bits in
separate storage, linked from the ptdesc. And we have some constant HW PTE bits
that we can remove and replace with those SW bits so we can keep the pte_t the
same size and abstract it all with ptep_get() and set_ptes().

It was all looking straightforward until I got to ptep_get_lockless(). Now that
there are 2 separate locations for PTE bits, I can't read it all atomically.

So I've been looking at all this again, and getting myself even more confused.

I believe there are 2 classes of ptep_get_lockless() caller:

1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
2) everyone else

--

(1) doesn't really care if orig_pte is consistent or not. It just wants to read
a value, do some speculative work based on that value, then lock the PTL, and
check the value hasn't changed. If it has changed, it backs out. So we don't
actually need any "lockless" guarrantees here; we could just use ptep_get().

In fact, prior to Hugh's commit 26e1a0c3277d ("mm: use pmdp_get_lockless()
without surplus barrier()"), we had this:

                vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
-               vmf->orig_pte = *vmf->pte;
+               vmf->orig_pte = ptep_get_lockless(vmf->pte);
                vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;

-               /*
-                * some architectures can have larger ptes than wordsize,
-                * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
-                * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
-                * accesses.  The code below just needs a consistent view
-                * for the ifs and we later double check anyway with the
-                * ptl lock held. So here a barrier will do.
-                */
-               barrier();
                if (pte_none(vmf->orig_pte)) {

--

(2) All the other users require that a subset of the pte fields are
self-consistent; specifically they don't care about access, dirty, uffd-wp or
soft-dirty. arm64 can guarrantee that all the other bits are self-consistent
just by calling ptep_get().

--

So, I'm making the bold claim that it was never neccessary to specialize
pte_get_lockless() on arm64, and I *think* we could just delete it so that
ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
aim without needing to introduce "norecency" variants.

Additionally I propose documenting ptep_get_lockless() to describe the set of
fields that are guarranteed to be self-consistent and the remaining fields which
are self-consistent only with best-effort.

Could it be this easy? My head is hurting...

Thanks,
Ryan