[PATCH v6 0/9] variable-order, large folios for anonymous memory

Fri Oct 27 05:29:47 PDT 2023

On 27.10.23 14:27, Ryan Roberts wrote:
> On 26/10/2023 16:19, David Hildenbrand wrote:
>> [...]
>>
>>>>> Hi,
>>>>>
>>>>> I wanted to remind people in the THP cabal meeting, but that either
>>>>> didn't happen or Zoom decided not to let me join :)
>>>
>>> I didn't make it yesterday either - was having to juggle child care.
>>
>> I think it didn't happen, or started quite late (>20 min).
>>
>>>
>>>>>
>>>>>>
>>>>>> It's been a week since the mm alignment meeting discussion we had around
>>>>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>>>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>>>>> fatal flaws in it :).
>>>>>
>>>>> After saying in the call probably 10 times that people should comment
>>>>> here if there are reasonable alternatives worth discussing, call me
>>>>> "optimistic" as well; but, it's only been a week and people might still
>>>>> be thinking about this.
>>>>>
>>>>> There were two things discussed in the call:
>>>>>
>>>>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>>>>      in the call, this (a) might not be needed right now in an initial
>>>>>      version; (b) the kernel might be able to handle that (or many cases)
>>>>>      automatically, TBD. Adding lists now would kind-of set the semantics
>>>>>      of that interface in stone. As you describe below, the approach
>>>>>      discussed here could easily be extended to cover priorities, if need
>>>>>      be.
>>>>
>>>> I want to expand on this: the argument that "if you could allocate a
>>>> higher order you should use it" is too simplistic. There are many
>>>> reasons in addition to the one above that we want to "fall back" to
>>>> lower orders, e.g., those higher orders are not on PCP or from the
>>>> local node. When we consider the sequence of orders to try, user
>>>> preference is just one of the parameters to the cost function. The
>>>> bottom line is that I think we should all agree that there needs to be
>>>> a cost function down the road, whatever it looks like. Otherwise I
>>>> don't know how we can make "auto" happen.
>>
>> I agree that there needs to be a cost function and, as pagecache showed, that's
>> independent of initial enablement.
>>
>>>
>>> I don't dispute that this sounds like it could be beneficial, but I see it as
>>> research to happen further down the road (as you say), and we don't know what
>>> that research might conclude. Also, I think the scope of this is bigger than
>>> anonymous memory - you would also likely want to look at the policy for page
>>> cache folio order too, since today that's based solely on readahead. So I see it
>>> as an optimization that is somewhat orthogonal to small-sized THP.
>>
>> Exactly my thoughts.
>>
>> The important thing is that we should plan ahead that we still have the option
>> to let the admin configure if we cannot make this work automatically in the kernel.
>>
>> What we'll need, nobody knows. Maybe it's a per-size priority, maybe it's a
>> single global toggle.
>>
>>>
>>> The proposed interface does not imply any preference order - it only states
>>> which sizes the user wants the kernel to select from, so I think there is lots
>>> of freedom to change this down the track if the kernel wants to start using the
>>> buddy allocator's state as a signal to make its decisions.
>>
>> Yes.
>>
>> [..]
>>
>>>>> Jup, same opinion here. But again, I'm very happy to hear other
>>>>> alternatives and why they are better.
>>>>
>>>> I'm not against David's proposal but I want to hear a lot more about
>>>> "lots of flexibility for growth" before I'm fully convinced.
>>>
>>> My point was that in an abstract sense, there are properties a user may wish to
>>> apply individually to a size, which is catered for by having a per-size
>>> directory into which we can add more files if/when requirements for new per-size
>>> properties arise. There are also properties that may be applied globally, for
>>> which we have the top-level transparent_hugepage directory where properties can
>>> be extended or added.
>>
>> Exactly, well said.
>>
>>>
>>> For your case around tighter integration with the buddy allocator, I could
>>> imagine a per-size file allowing the user to specify if the kernel should allow
>>> splitting a higher order to make a THP of that size (I'm not suggesting that's a
>>> good idea, I'm just pointing out that this sort of thing is possible with the
>>> interface). And we have discussed how the global enabled property could be
>>> extended to support "auto" [1].
>>>
>>> But perhaps what we really need are lots more ideas for future directions for
>>> small-sized THP to allow us to evaluate this interface more widely.
>>
>> David R. motivated a future size-aware setting of the defrag option. As
>> discussed we might want something similar to shmem_enable. What will happen with
>> khugepaged, nobody knows yet :)
>>
>> I could imagine exposing per-size boolean read-only properties like
>> "native-hw-size" (PMD, cont-pte). But these things require much more thought.
> 
> FWIW, the reason I opted for the "recommend" special case in the v5 posting was
> because that felt like an easy thing to also add to the command line in future.
> Having a separate file, native-hw-size, that the user has to read then enable
> through another file is not very command-line friendly, if you want the
> hw-preferred size(s) enabled from boot.

Jup. I strongly suspect distributions will just have their setup scripts
handle such things, though.
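
To make that concrete, here is a rough sketch of what such a setup script
could look like. Everything in it is an assumption for illustration only: the
per-size directory naming ("hugepages-<size>kB"), the read-only
"native-hw-size" file (just the idea floated above, nothing that exists), and
the value written to the per-size "enabled" file. The root directory is a
parameter so the logic can be tried against a scratch directory.

```python
from pathlib import Path


def enable_native_sizes(thp_root, value="always"):
    """Enable every THP size that the (hypothetical) per-size
    'native-hw-size' file reports as hardware-native.

    Assumed layout (illustrative, not a settled ABI):
        <thp_root>/hugepages-<size>kB/native-hw-size   read-only, "0" or "1"
        <thp_root>/hugepages-<size>kB/enabled          writable policy knob
    Returns the names of the per-size directories that were enabled.
    """
    enabled = []
    for size_dir in sorted(Path(thp_root).glob("hugepages-*kB")):
        flag = size_dir / "native-hw-size"  # hypothetical property file
        knob = size_dir / "enabled"
        # Skip sizes that do not expose both files.
        if not flag.is_file() or not knob.is_file():
            continue
        if flag.read_text().strip() == "1":
            knob.write_text(value + "\n")
            enabled.append(size_dir.name)
    return enabled
```

A distribution would run something like this once during early boot, pointed
at the top-level transparent_hugepage directory. It also shows the two-step
"read one file, write another" dance that is awkward to express on the kernel
command line.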

> 
> Maybe the wider observation is "how does the proposed interface translate to the
> kernel command line if needed in future?".

I guess in the distant future, "auto" is what we want.

-- 
Cheers,

David / dhildenb