[PATCH] arm64: mm: drop tlb flush operation when clearing the access bit

Baolin Wang baolin.wang at linux.alibaba.com
Thu Oct 26 05:32:42 PDT 2023



On 10/26/2023 2:01 PM, Anshuman Khandual wrote:
> 
> 
> On 10/26/23 11:24, Barry Song wrote:
>> On Thu, Oct 26, 2023 at 12:55 PM Anshuman Khandual
>> <anshuman.khandual at arm.com> wrote:
>>>
>>>
>>>
>>> On 10/24/23 18:26, Baolin Wang wrote:
>>>> Now ptep_clear_flush_young() is only called by folio_referenced() to
>>>> check if the folio was referenced, and now it will call a tlb flush on
>>>> ARM64 architecture. However the tlb flush can be expensive on ARM64
>>>> servers, especially for the systems with a large CPU numbers.
>>>
>>> TLB flush would be expensive on *any* platform with large CPU numbers ?

Perhaps yes, but I have not measured it on other platforms.

>>>> Similar to the x86 architecture, below comments also apply equally to
>>>> ARM64 architecture. So we can drop the tlb flush operation in
>>>> ptep_clear_flush_young() on ARM64 architecture to improve the performance.
>>>> "
>>>> /* Clearing the accessed bit without a TLB flush
>>>>   * doesn't cause data corruption. [ It could cause incorrect
>>>>   * page aging and the (mistaken) reclaim of hot pages, but the
>>>>   * chance of that should be relatively low. ]
>>>>   *
>>>>   * So as a performance optimization don't flush the TLB when
>>>>   * clearing the accessed bit, it will eventually be flushed by
>>>>   * a context switch or a VM operation anyway. [ In the rare
>>>>   * event of it not getting flushed for a long time the delay
>>>>   * shouldn't really matter because there's no real memory
>>>>   * pressure for swapout to react to. ]
>>>>   */
>>>
>>> If always true, this sounds generic enough for all platforms, why only
>>> x86 and arm64 ?

I am not sure this is always true for every architecture.

>>>> "
>>>> Running the thpscale to show some obvious improvements for compaction
>>>> latency with this patch:
>>>>                               base                   patched
>>>> Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
>>>> Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
>>>> Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
>>>> Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
>>>> Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
>>>> Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
>>>> Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
>>>> Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
>>>> Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
>>>>                         base       patched
>>>> Duration User         167.78      161.03
>>>> Duration System      1836.66     1673.01
>>>> Duration Elapsed     2074.58     2059.75
>>>
>>> Could you please point to the test repo you are running ?

The test is based on the v6.5 kernel.
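
For reading the thpscale table quoted above: the starred percentages are consistent with a relative reduction in mean latency, (base - patched) / base. The helper below is only a sketch for checking that arithmetic; the name improvement_pct is hypothetical and it is not part of the benchmark tooling.

```c
#include <assert.h>

/*
 * Relative reduction in percent. A positive result means the patched
 * kernel showed a lower (better) mean latency than the base kernel.
 * Hypothetical helper, used only to re-derive the quoted percentages.
 */
static double improvement_pct(double base, double patched)
{
	assert(base > 0.0);
	return (base - patched) / base * 100.0;
}
```

Feeding in the fault-both-3 row (2566.22 and 2228.45) gives about 13.16, matching the table. By the same formula the "Duration System" row (1836.66 to 1673.01) is a reduction of roughly 8.9%.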

>>>> Barry Song submitted a similar patch [1] before, that replaces the
>>>> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
>>>> folio_referenced_one(). However, I'm not sure if removing the tlb flush
>>>> operation is applicable to every architecture in kernel, so dropping
>>>> the tlb flush for ARM64 seems a sensible change.
>>>
>>> The reasoning provided here sounds generic when true, hence there seems
>>> to be no justification to keep it limited just for arm64 and x86. Also

Right, but I cannot be sure that this will not break other architectures.

>>> what about pmdp_clear_flush_young_notify() when THP is enabled. Should
>>> that also not do a TLB flush after clearing access bit ? Although arm64

Yes, I think so.

>>> does not enable __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH, rather depends on
>>> the generic pmdp_clear_flush_young() which also does a TLB flush via
>>> flush_pmd_tlb_range() while clearing the access bit.
>>>
>>>>
>>>> Note: I am okay for both approach, if someone can help to ensure that
>>>> all architectures do not need the tlb flush when clearing the accessed
>>>> bit, then I also think Barry's patch is better (hope Barry can resend
>>>> his patch).
>>>
>>> This paragraph belongs after the '----' below and not part of the commit
>>> message.

OK.

>>>> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>>>> Signed-off-by: Baolin Wang <baolin.wang at linux.alibaba.com>
>>>> ---
>>>>   arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>>>>   1 file changed, 16 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 0bd18de9fd97..2979d796ba9d 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>   static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>                                         unsigned long address, pte_t *ptep)
>>>>   {
>>>> -     int young = ptep_test_and_clear_young(vma, address, ptep);
>>>> -
>>>> -     if (young) {
>>>> -             /*
>>>> -              * We can elide the trailing DSB here since the worst that can
>>>> -              * happen is that a CPU continues to use the young entry in its
>>>> -              * TLB and we mistakenly reclaim the associated page. The
>>>> -              * window for such an event is bounded by the next
>>>> -              * context-switch, which provides a DSB to complete the TLB
>>>> -              * invalidation.
>>>> -              */
>>>> -             flush_tlb_page_nosync(vma, address);
>>>> -     }
>>>> -
>>>> -     return young;
>>>> +     /*
>>>> +      * This comment is borrowed from x86, but applies equally to ARM64:
>>>> +      *
>>>> +      * Clearing the accessed bit without a TLB flush doesn't cause
>>>> +      * data corruption. [ It could cause incorrect page aging and
>>>> +      * the (mistaken) reclaim of hot pages, but the chance of that
>>>> +      * should be relatively low. ]
>>>> +      *
>>>> +      * So as a performance optimization don't flush the TLB when
>>>> +      * clearing the accessed bit, it will eventually be flushed by
>>>> +      * a context switch or a VM operation anyway. [ In the rare
>>>> +      * event of it not getting flushed for a long time the delay
>>>> +      * shouldn't really matter because there's no real memory
>>>> +      * pressure for swapout to react to. ]
>>>> +      */
>>>> +     return ptep_test_and_clear_young(vma, address, ptep);
>>>>   }
>>>>
>>>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>
>>> There are three distinct concerns here
>>>
>>> 1) What are the chances of this misleading existing hot page reclaim process
>>> 2) How secondary MMU such as SMMU adapt to change in mappings without a flush
>>> 3) Could this break the architecture rule requiring a TLB flush after access
>>>     bit clear on a page table entry
>>
>> In terms of all of above concerns,  though 2 is different, which is an
>> issue between cpu and non-cpu,
>> i feel kernel has actually dropped tlb flush at least for mglru, there
>> is no flush in lru_gen_look_around(),
>>
>> static bool folio_referenced_one(struct folio *folio,
>>                  struct vm_area_struct *vma, unsigned long address, void *arg)
>> {
>>          ...
>>
>>                  if (pvmw.pte) {
>>                          if (lru_gen_enabled() &&
>>                              pte_young(ptep_get(pvmw.pte))) {
>>                                  lru_gen_look_around(&pvmw);
>>                                  referenced++;
>>                          }
>>
>>                          if (ptep_clear_flush_young_notify(vma, address,
>>                                                  pvmw.pte))
>>                                  referenced++;
>>                  }
>>
>>          return true;
>> }
>>
>> and so is in walk_pte_range() of vmscan.  linux has been surviving with
>> all above concerns for a while, believing it or not :-)
> 
> Although the first two concerns could be worked upon in the SW, kernel surviving
> after breaking arch rules explicitly is not a correct state to be in IMHO.

I am not sure what "not a correct state" means here. At least we
(Alibaba) have not seen this cause any issues so far when using
MGLRU on x86 and ARM64 platforms.
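
To make the trade-off in this thread concrete, here is a toy userspace model of the behaviour the borrowed x86 comment describes. It is a sketch under simplifying assumptions, not kernel code: there is one PTE and one CPU, the access flag is only set on a TLB miss, and every toy_* name is hypothetical. toy_later_invalidate() stands in for whatever later event drops the stale entry (the comment names a context switch or a VM operation).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy state: one page table entry and one CPU's cached copy of it. */
struct toy_mmu {
	bool pte_young;	/* access flag in the in-memory PTE */
	bool tlb_valid;	/* CPU holds a cached translation */
};

/* The CPU touches the page; only a TLB miss updates the access flag. */
static void toy_access(struct toy_mmu *m)
{
	assert(m != NULL);
	if (!m->tlb_valid) {
		m->pte_young = true;
		m->tlb_valid = true;
	}
}

/* Test and clear the access flag, optionally dropping the TLB entry. */
static bool toy_clear_young(struct toy_mmu *m, bool flush)
{
	bool young = m->pte_young;

	m->pte_young = false;
	if (young && flush)
		m->tlb_valid = false;
	return young;
}

/* Some later, unrelated event invalidates the cached translation. */
static void toy_later_invalidate(struct toy_mmu *m)
{
	m->tlb_valid = false;
}

/*
 * A hot page is touched, scanned, touched again and scanned again.
 * Returns whether the second scan still sees the page as referenced.
 */
static bool toy_second_scan_sees_reference(bool flush, bool invalidated_between)
{
	struct toy_mmu m = { .pte_young = false, .tlb_valid = false };

	toy_access(&m);
	(void)toy_clear_young(&m, flush);	/* first scan */
	if (invalidated_between)
		toy_later_invalidate(&m);
	toy_access(&m);
	return toy_clear_young(&m, flush);	/* second scan */
}
```

In this model the second scan misses the reference only when there is no flush and nothing else invalidates the entry in between. That is the "incorrect page aging" case: the page may be reclaimed by mistake, but no data is corrupted. The model says nothing about secondary MMUs or about what the architecture requires, which are the other two concerns raised above.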


