[RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order

Ryan Roberts ryan.roberts at arm.com
Fri Apr 14 09:06:49 PDT 2023


On 14/04/2023 16:37, Kirill A. Shutemov wrote:
> On Fri, Apr 14, 2023 at 03:38:35PM +0100, Ryan Roberts wrote:
>> On 14/04/2023 15:09, Kirill A. Shutemov wrote:
>>> On Fri, Apr 14, 2023 at 02:02:51PM +0100, Ryan Roberts wrote:
>>>> For variable-order anonymous folios, we want to tune the order that we
>>>> prefer to allocate based on the vma. Add the routines to manage that
>>>> heuristic.
>>>>
>>>> TODO: Currently we always use the global maximum. Add per-vma logic!
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts at arm.com>
>>>> ---
>>>>  include/linux/mm.h | 5 +++++
>>>>  mm/memory.c        | 8 ++++++++
>>>>  2 files changed, 13 insertions(+)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index cdb8c6031d0f..cc8d0b239116 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3674,4 +3674,9 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>>>>  }
>>>>  #endif
>>>>
>>>> +/*
>>>> + * TODO: Should this be set per-architecture?
>>>> + */
>>>> +#define ANON_FOLIO_ORDER_MAX	4
>>>> +
>>>
>>> I think it has to be derived from a size in bytes, not specified
>>> directly as a page order. For 4K pages, order 4 is 64K, but for 64K
>>> pages it is 1M.
>>>
>>
>> Yes, I see where you are coming from. What's your feel for what a sensible
>> upper bound in bytes is?
>>
>> My difficulty is that I would like to be able to use this allocation mechanism
>> to enable using the "contiguous bit" on arm64; that's a set of contiguous PTEs
>> that are mapped to physically contiguous memory, and the HW can use that hint to
>> coalesce the TLB entries.
>>
>> For 4KB pages, the contig size is 64KB (order-4), so that works nicely. But for
>> 16KB and 64KB pages, it's 2MB (order-7 and order-5 respectively). Do you think
>> allocating 2MB pages here is going to lead to too much memory wastage?
> 
> I think it boils down to the specifics of the microarchitecture.
> 
> We can justify 2M PMD-mapped THP in many cases. But PMD-mapped THP not
> only reduces TLB pressure (the contiguous bit does that too, I believe);
> it also saves one more memory access on the page table walk.
> 
> It may or may not matter for the processor. It has to be evaluated.

I think you are saying that if the performance uplift is good enough, then some
extra memory wastage can be justified?

The point I'm thinking about is that for 4K pages, we need to allocate 64K
blocks to use the contig bit. Roughly, I guess that means going from an average
of 2K wastage per anon VMA to 32K. Perhaps you can get away with that for a
decent perf uplift.

But for 64K pages, we need to allocate 2M blocks to use the contig bit. That
takes the average wastage from 32K to 1M, which feels harder to justify.
Perhaps here we should make the decision based on MADV_HUGEPAGE?
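The order and wastage arithmetic above follows directly from the block/page
ratios; here is a quick stand-alone sketch (order_for_size() is an
illustrative helper, not anything from the patch):

```c
#include <assert.h>

/* Smallest order such that (page_size << order) >= block_size. */
static unsigned int order_for_size(unsigned long block, unsigned long page)
{
	unsigned int order = 0;

	while ((page << order) < block)
		order++;
	return order;
}

/*
 * arm64 contiguous-bit spans:
 *   4K pages,  64K span -> order-4 (avg tail wastage ~32K per anon VMA)
 *   16K pages,  2M span -> order-7
 *   64K pages,  2M span -> order-5 (avg tail wastage ~1M per anon VMA)
 */
```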

So perhaps we actually want two values: one for when MADV_HUGEPAGE is not set
on the VMA, and one for when it is? (With 64K pages, I'm guessing there are
many cases where we won't PMD-map THPs; the PMD size there is 512MB.)
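One hedged way the two-value idea could look, sketched outside the kernel
(the stub struct, the flag bit, and the helper name are all stand-ins, not
the kernel's definitions):

```c
#include <assert.h>

struct vma_stub {
	unsigned long vm_flags;
};

#define VM_HUGEPAGE_STUB	(1UL << 0)	/* stand-in for VM_HUGEPAGE */

/*
 * Pick the max anon folio order: a conservative 64K-derived default,
 * unless the VMA opted in via MADV_HUGEPAGE, in which case allow up to
 * the arch's contiguous-bit span (e.g. 2M for 64K pages on arm64).
 */
static unsigned int max_anon_order(const struct vma_stub *vma,
				   unsigned long page_size,
				   unsigned long contig_size)
{
	unsigned long limit = (vma->vm_flags & VM_HUGEPAGE_STUB) ?
			      contig_size : 64UL << 10;
	unsigned int order = 0;

	if (limit <= page_size)
		return 0;
	while ((page_size << (order + 1)) <= limit)
		order++;
	return order;
}
```

With 4K pages the default already gives order-4 (64K); with 64K pages it
gives order-0 unless MADV_HUGEPAGE is set, in which case the 2M span gives
order-5.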

> 
> Maybe moving it to per-arch is the right way. With the default in generic
> code being ilog2(SZ_64K >> PAGE_SHIFT) or something.

Yes, I agree that sounds like a good starting point for the !MADV_HUGEPAGE case.




More information about the linux-arm-kernel mailing list