[PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

Thu Aug 31 01:09:18 PDT 2023

On 31.08.23 10:02, Yin, Fengwei wrote:
> 
> 
> On 8/31/2023 3:57 PM, David Hildenbrand wrote:
>> On 31.08.23 03:40, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts at arm.com> writes:
>>>
>>>> On 15/08/2023 22:32, Huang, Ying wrote:
>>>>> Hi, Ryan,
>>>>>
>>>>> Ryan Roberts <ryan.roberts at arm.com> writes:
>>>>>
>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>> counting, rmap management, lru list management) is also significantly
>>>>>> reduced since those ops now become per-folio.
>>>>>>
>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>> which defaults to disabled for now; the long-term aim is for this to
>>>>>> default to enabled, but there are some risks around internal
>>>>>> fragmentation that need to be better understood first.
>>>>>>
>>>>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>>>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>>>>> where fallback (>) is performed for various reasons, such as the
>>>>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>>>>
>>>>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>>>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>>
>>>>> IMHO, we should use the following semantics as you have suggested
>>>>> before.
>>>>>
>>>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>> no hint         | S         | S           | LAF>S         | THP>LAF>S
>>>>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>>
>>>>> Or even,
>>>>>
>>>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>> no hint         | S         | S           | S             | THP>LAF>S
>>>>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>>
>>>>> From the implementation point of view, a PTE-mapped PMD-sized THP is
>>>>> almost no different from LAF (which is just a smaller-sized THP).  It
>>>>> will be confusing to distinguish them from the interface point of view.
>>>>>
>>>>> So, IMHO, the real difference is the policy.  For example, prefer
>>>>> PMD-sized THP, prefer small sized THP, or fully auto.  The sysfs
>>>>> interface is used to specify system global policy.  In the long term, it
>>>>> can be something like below,
>>>>>
>>>>> never:      S               # disable all THP
>>>>> madvise:                    # never by default, control via madvise()
>>>>> always:     THP>LAF>S       # prefer PMD-sized THP in fact
>>>>> small:      LAF>S           # prefer small sized THP
>>>>> auto:                       # use in-kernel heuristics for THP size
>>>>>
>>>>> But we may not be ready to add new policies now.  So, before the new
>>>>> policies are ready, we can add a debugfs interface to override the
>>>>> original policy in /sys/kernel/mm/transparent_hugepage/enabled.  After
>>>>> we have tuned enough workloads, collected enough data, we can add new
>>>>> policies to the sysfs interface.
>>>>
>>>> I think we can all imagine many policy options. But we don't really have much
>>>> evidence yet for what is best. The policy I'm currently using is intended to
>>>> give some flexibility for testing (use LAF without THP by setting sysfs=never,
>>>> use THP without LAF by compiling without LAF) without adding any new knobs at
>>>> all. Given that, surely we can defer these decisions until we have more data?
>>>>
>>>> In the absence of data, your proposed solution sounds very sensible to me. But
>>>> for the purposes of scaling up perf testing, I don't think it's essential given
>>>> the current policy will also produce the same options.
>>>>
>>>> If we were going to add a debugfs knob, I think the higher priority would be a
>>>> knob to specify the folio order (but again, I would rather avoid it if possible).
>>>
>>> I totally understand we need some way to control PMD-sized THP and LAF
>>> to tune the workload, and nobody likes debugfs knobs.
>>>
>>> My concern about the interface is that we have no way to disable LAF
>>> system-wide without rebuilding the kernel.  In the future, should we add
>>> a new policy to /sys/kernel/mm/transparent_hugepage/enabled to be
>>> stricter than "never"?  "really_never"?
>>
>> Let's talk about that in a bi-weekly MM session. (I proposed it as a topic for next week).
> 
> The time slot of the meeting is not friendly to our timezone. Like
> it's 1 or 2 AM. Yes. I know it's very hard to find a good time slot
> for US, EU and Asia. :(.

:/

Yeah, even for me in Germany it's usually already around 6-7pm.

> 
> So maybe we still need to discuss it through mail?

I don't think we'll be done discussing that in one session. One of the
main goals is to get some input from the wider MM community.

-- 
Cheers,

David / dhildenb
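
For readers following the thread, the allocation policy table from the
quoted v5 patch description (the first table above, not Ying's proposed
alternatives) can be written as a small lookup. This is only an
illustrative sketch: the function name, the Hint enum and the string
return values are hypothetical and do not come from the patch, which
implements the policy in kernel C.

```python
from enum import Enum


class Hint(Enum):
    """Per-VMA madvise() hint; names are illustrative only."""
    NONE = "no hint"
    HUGEPAGE = "MADV_HUGEPAGE"
    NOHUGEPAGE = "MADV_NOHUGEPAGE"


def fallback_chain(prctl_enabled, sysfs, hint):
    """Return the fallback order from the v5 patch description table.

    prctl_enabled: False corresponds to the prctl=dis column.
    sysfs: value of /sys/kernel/mm/transparent_hugepage/enabled
           ("never", "madvise" or "always").
    hint: a Hint member.

    The result uses the table's notation: THP (PMD-order), LAF (large
    anonymous folio), S (single page), with ">" meaning "falls back to".
    """
    if sysfs not in ("never", "madvise", "always"):
        raise ValueError("unknown sysfs setting: %r" % (sysfs,))

    # prctl=dis column and the MADV_NOHUGEPAGE row: single pages only.
    if not prctl_enabled or hint is Hint.NOHUGEPAGE:
        return "S"

    # PMD-order THP is tried first when sysfs=always, or when
    # sysfs=madvise and the VMA carries MADV_HUGEPAGE.
    if sysfs == "always" or (sysfs == "madvise" and hint is Hint.HUGEPAGE):
        return "THP>LAF>S"

    # Everything else, including sysfs=never, still attempts LAF. This
    # is the case Ying points out cannot be disabled system-wide
    # without rebuilding the kernel.
    return "LAF>S"
```

Under this sketch, fallback_chain(True, "never", Hint.NONE) yields
"LAF>S", which is the cell the two alternative tables would change to
"S".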