[PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()

Mon Feb 12 08:27:59 PST 2024

On 12.02.24 16:47, Ryan Roberts wrote:
> On 12/02/2024 13:43, David Hildenbrand wrote:
>> On 02.02.24 09:07, Ryan Roberts wrote:
>>> Some architectures (e.g. arm64) can tell from looking at a pte whether
>>> some follow-on ptes also map contiguous physical memory with the same
>>> pgprot (for arm64, these are contpte mappings).
>>>
>>> Take advantage of this knowledge to optimize folio_pte_batch() so that
>>> it can skip these ptes when scanning to create a batch. By default, if
>>> an arch does not opt-in, pte_batch_hint() returns a compile-time 1, so
>>> the changes are optimized out and the behaviour is as before.
>>>
>>> arm64 will opt-in to providing this hint in the next patch, which will
>>> greatly reduce the cost of ptep_get() when scanning a range of contptes.
>>>
>>> Tested-by: John Hubbard <jhubbard at nvidia.com>
>>> Signed-off-by: Ryan Roberts <ryan.roberts at arm.com>
>>> ---
>>>    include/linux/pgtable.h | 18 ++++++++++++++++++
>>>    mm/memory.c             | 20 +++++++++++++-------
>>>    2 files changed, 31 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index 50f32cccbd92..cba31f177d27 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
>>>    #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>    #endif
>>>
>>> +#ifndef pte_batch_hint
>>> +/**
>>> + * pte_batch_hint - Number of pages that can be added to batch without scanning.
>>> + * @ptep: Page table pointer for the entry.
>>> + * @pte: Page table entry.
>>> + *
>>> + * Some architectures know that a set of contiguous ptes all map the same
>>> + * contiguous memory with the same permissions. In this case, it can provide a
>>> + * hint to aid pte batching without the core code needing to scan every pte.
>>
>> I think we might want to document here the expectation regarding
>> dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
>> FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
>> always when batching, because the dirty bit may target any pte part of the
>> cont-pte group either way.
>>
>> Maybe something like:
>>
>> "
>> An architecture implementation may only ignore the PTE accessed and dirty bits.
>> Further, it may only ignore the dirty bit if that bit is already not
>> maintained with precision per PTE inside the hinted batch, and ptep_get()
>> would already have to collect it from various PTEs.
>> "
> 
> I'm proposing to simplify this to:
> 
> "
> An architecture implementation may ignore the PTE accessed state. Further, the
> dirty state must apply atomically to all the PTEs described by the hint.
> "
> 
> Which I think more accurately describes the requirement. Shout if you disagree.

I'm not 100% sure if the "must apply atomically" is clear without all of 
the cont-pte details and ptep_get(). But I fail to describe it in a 
better way.
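
For anyone reading along without the cont-pte background, a user-space toy
model may make the wording more concrete. This is only a sketch, not kernel
code: every toy_* name and the group size of 4 are made up for illustration,
and toy_ptep_get() assumes (as discussed above for arm64) that the dirty state
is collected from all entries of a contiguous group. Tables are assumed to be
group-aligned and a whole number of groups long whenever they contain cont
entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these types or helpers are the kernel's. */
#define TOY_CONT_PTES 4 /* assumed group size, chosen for illustration */

struct toy_pte {
	unsigned long pfn;
	bool cont;  /* entry belongs to a contiguous group */
	bool dirty; /* raw per-entry dirty bit */
};

/* Models ptep_get(): a cont entry is dirty if ANY entry in its group is. */
static struct toy_pte toy_ptep_get(const struct toy_pte *table, size_t idx)
{
	struct toy_pte pte = table[idx];

	if (pte.cont) {
		size_t start = idx - (idx % TOY_CONT_PTES);

		for (size_t i = start; i < start + TOY_CONT_PTES; i++)
			if (table[i].dirty)
				pte.dirty = true;
	}
	return pte;
}

/* Models pte_batch_hint(): entries left in the group, or 1 without a hint. */
static size_t toy_pte_batch_hint(size_t idx, struct toy_pte pte)
{
	if (!pte.cont)
		return 1;
	return TOY_CONT_PTES - (idx % TOY_CONT_PTES);
}

/*
 * Counts entries from idx that map consecutive pfns with the same dirty
 * state, reading only one entry per hinted run. *reads reports how many
 * times toy_ptep_get() was called.
 */
static size_t toy_batch(const struct toy_pte *table, size_t idx,
			size_t max_nr, size_t *reads)
{
	assert(max_nr > 0);

	struct toy_pte first = toy_ptep_get(table, idx);
	size_t nr = toy_pte_batch_hint(idx, first);

	*reads = 1;
	while (nr < max_nr) {
		struct toy_pte pte = toy_ptep_get(table, idx + nr);

		(*reads)++;
		if (pte.pfn != first.pfn + nr || pte.dirty != first.dirty)
			break;
		/* Skip the rest of this entry's group without reading it. */
		nr += toy_pte_batch_hint(idx + nr, pte);
	}
	return nr < max_nr ? nr : max_nr;
}

/* Two cont groups, each with the raw dirty bit set on just one entry. */
static size_t toy_demo(size_t *reads)
{
	struct toy_pte table[2 * TOY_CONT_PTES];

	for (size_t i = 0; i < 2 * TOY_CONT_PTES; i++)
		table[i] = (struct toy_pte){ .pfn = 100 + i, .cont = true,
					     .dirty = false };
	table[2].dirty = true;
	table[5].dirty = true;
	return toy_batch(table, 0, 2 * TOY_CONT_PTES, reads);
}
```

In this model toy_demo() batches all 8 entries with only 2 reads. Skipping is
safe only because the dirty state reported for one entry of a group already
holds for every entry the hint covers, which is how I read "must apply
atomically".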

It's all better compared to what we had before, so LGTM :)

-- 
Cheers,

David / dhildenb