[RESEND PATCH v7 03/10] mm: thp: Introduce per-size thp sysfs interface
John Hubbard
jhubbard at nvidia.com
Wed Nov 29 11:40:57 PST 2023
On 11/29/23 03:05, Ryan Roberts wrote:
> On 29/11/2023 03:42, John Hubbard wrote:
>> On 11/22/23 08:29, Ryan Roberts wrote:
...
>>> +As well as PMD-sized THP described above, it is also possible to
>>> +configure the system to allocate "small-sized THP" to back anonymous
>>
>> Here's one of the places to change to the new name, which lately is
>> "multi-size THP", or mTHP or m_thp for short. (I've typed "multi-size"
>> instead of "multi-sized", because the 'd' doesn't add significantly to
>> the meaning, and if in doubt, shorter is better.)
>
> I was thinking of some light changes to the start of this paragraph, something like:
>
> Modern kernels support "multi-size THP" (mTHP), which introduces the ability to
> allocate memory in blocks that are bigger than a base page but smaller than
> traditional PMD-size (as described above), in increments of a power-of-2 number
> of pages. mTHP can back anonymous
>
Very nice.
...
>> By default, PMD-sized hugepages have enabled="inherited" and all other
>> hugepage sizes have enabled="never".
>
> That all sounds good; will update.
>
> I wonder if "inherit" is even better than "inherited" though?
Yes, I think so. "inherit" was actually my first idea, and after
thinking about it, I just made it worse by adding the "ed". haha. :)
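For what it's worth, here's a minimal sketch of how I read the "inherit"
semantics (illustrative names only, not code from the patch): a per-size
control set to "inherit" simply resolves to whatever the top-level
transparent_hugepage/enabled control says.

enum thp_policy {
	THP_POLICY_NEVER,
	THP_POLICY_MADVISE,
	THP_POLICY_ALWAYS,
	THP_POLICY_INHERIT,	/* defer to transparent_hugepage/enabled */
};

/* Resolve a per-size setting against the global control. */
static enum thp_policy thp_resolve_policy(enum thp_policy per_size,
					  enum thp_policy global)
{
	return per_size == THP_POLICY_INHERIT ? global : per_size;
}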
...
>>> Khugepaged controls
>>> -------------------
>>>
>>> +.. note::
>>> + khugepaged currently only searches for opportunities to collapse to
>>> + PMD-sized THP and no attempt is made to collapse to small-sized
>>> + THP.
>>> +
>
> "small-sized THP" used here too; I propose to change it to "... collapse to
> other THP sizes."
OK, looks good.
>
>>> khugepaged runs usually at low frequency so while one may not want to
>>> invoke defrag algorithms synchronously during the page faults, it
>>> should be worth invoking defrag at least in khugepaged. However it's
>>> @@ -282,10 +321,11 @@ force
>>> Need of application restart
>>> ===========================
>>>
>>> -The transparent_hugepage/enabled values and tmpfs mount option only affect
>>> -future behavior. So to make them effective you need to restart any
>>> -application that could have been using hugepages. This also applies to the
>>> -regions registered in khugepaged.
>>> +The transparent_hugepage/enabled and
>>> +transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
>>> +option only affect future behavior. So to make them effective you need
>>> +to restart any application that could have been using hugepages. This
>>> +also applies to the regions registered in khugepaged.
>>>
>>> Monitoring usage
>>> ================
>>> @@ -308,6 +348,10 @@ frequently will incur overhead.
>>> There are a number of counters in ``/proc/vmstat`` that may be used to
>>> monitor how successfully the system is providing huge pages for use.
>>>
>>> +.. note::
>>> + Currently the below counters only record events relating to
>>> + PMD-sized THP. Events relating to small-sized THP are not included.
>>
>> Here's another spot to rename to "multi-size THP".
>
> I propose to change it to "Events relating to other THP sizes..."
>
> I'm also going to move this note to just under the "Monitoring Usage" heading
> since its current position doesn't make it clear that it includes
> "AnonHugePages". I'll also make it clear in the text that mentions AnonHugePages
> counter that it "only applies to traditional PMD-sized THP for historical
> reasons" and that "AnonHugePages should have been called AnonHugePmdMapped" as
> David suggested in the other thread.
OK.
Are we entirely dropping the AnonHugePtePages that was there in v6? I'm
looking for some way to monitor this. I may have missed it because I haven't
gone through all of v7 yet.
...
>>> -"THPeligible" indicates whether the mapping is eligible for allocating THP
>>> -pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
>>> -It just shows the current status.
>>> +"THPeligible" indicates whether the mapping is eligible for allocating
>>> +naturally aligned THP pages of any currently enabled size. 1 if true, 0
>>> +otherwise. It just shows the current status.
>>
>> "It just shows the current status"...as opposed to what? I'm missing an
>> educational opportunity here--not sure what this means. :)
>
> I have no idea either - it seems superfluous. But that sentence was there
> already. I can remove it if you like?
>
Yes, let's remove it, please.
...
>>> +/*
>>> + * Mask of all large folio orders supported for anonymous THP.
>>> + */
>>> +#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
>>> +
>>> +/*
>>> + * Mask of all large folio orders supported for file THP.
>>> + */
>>> +#define THP_ORDERS_ALL_FILE (BIT(PMD_ORDER) | BIT(PUD_ORDER))
>>
>> Is there something about file THP that implies PUD_ORDER? This
>> deserves an explanatory comment, I think.
>
> Sorry, I'm not really sure what you are asking? Today's kernel supports
> PUD-mapping file-backed memory (mostly DAX I believe). I'm not sure what comment
> you want me to add, other than "file-backed memory can support PUD-mapping", but
> that's self-evident from the define, isn't it?
>
Well, it's sort of evident, but confusing to me anyway, because it
combines THP and PUD. Which seems to imply that we have PUD-based THPs
supported, but only for file-backed mappings. Which is weird and needs
some explanation, right?
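To make the bit-per-order encoding concrete for myself (example values
only, assuming x86_64 with 4K pages, where PMD_ORDER is 9 and PUD_ORDER
is 18):

/*
 * Example only, assuming 4K pages on x86_64 (PMD_ORDER == 9,
 * PUD_ORDER == 18):
 *
 *   THP_ORDERS_ALL_ANON == BIT(9)           == 0x00200
 *   THP_ORDERS_ALL_FILE == BIT(9) | BIT(18) == 0x40200
 *
 * Testing whether a particular order is in one of the masks is then
 * just a bit test:
 */
static inline bool thp_order_supported(unsigned long mask, int order)
{
	return mask & (1UL << order);
}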
>>
>>> +
>>> +/*
>>> + * Mask of all large folio orders supported for THP.
>>> + */
>>> +#define THP_ORDERS_ALL (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE)
>>> +
>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> #define HPAGE_PMD_SHIFT PMD_SHIFT
>>> #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
>>> @@ -78,13 +93,18 @@ extern struct kobj_attribute shmem_enabled_attr;
>>>
>>> extern unsigned long transparent_hugepage_flags;
>>>
>>> -#define hugepage_flags_enabled() \
>>> - (transparent_hugepage_flags & \
>>> - ((1<<TRANSPARENT_HUGEPAGE_FLAG) | \
>>> - (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)))
>>> -#define hugepage_flags_always() \
>>> - (transparent_hugepage_flags & \
>>> - (1<<TRANSPARENT_HUGEPAGE_FLAG))
>>
>> Are macros actually required? If not, static inlines would be nicer.
>
> I agree static inlines are nicer. Here I'm removing existing macros though, and
Oh, I see that I replied in the wrong part of the email, since that's the
*removal* part. oops. :)
...
>>> -static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
>>> - unsigned long addr)
>>> -{
>>> - unsigned long haddr;
>>> -
>>> - /* Don't have to check pgoff for anonymous vma */
>>> - if (!vma_is_anonymous(vma)) {
>>> - if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
>>> - HPAGE_PMD_NR))
>>> - return false;
>>> +static inline unsigned long transhuge_vma_suitable(struct vm_area_struct *vma,
>>> + unsigned long addr, unsigned long orders)
>>
>> Changing this from a bool to a mask of orders is quite significant, and
>> both the function name and the function-level comment documentation need
>> to also be adjusted. Here, perhaps one of these names would work:
>>
>> transhuge_vma_suitable_orders()
>
> This is my preference, but it's getting a bit long. How about:
>
> thp_vma_suitable_orders()
>
> I could then call the other one:
>
> thp_vma_allowable_orders()
>
> So we have a clearly related pair?
Oh yes, that works nicely.
>
>
>> transhuge_vma_orders()
>>
>>> +{
>>> + int order;
>>> +
>>> + /*
>>> + * Iterate over orders, highest to lowest, removing orders that don't
>>> + * meet alignment requirements from the set. Exit loop at first order
>>> + * that meets requirements, since all lower orders must also meet
>>> + * requirements.
>>
>> Should we add a little note here, to the effect of, "this is unilaterally
>> taking over a certain amount of allocation-like policy, by deciding how
>> to search for folios of a certain size"?
>>
>> Or is this The Only Way To Do It, after all? I know we had some discussion
>> about it, and intuitively it feels reasonable, but wanted to poke at it
>> one last time anyway.
>
> This function isn't trying to implement policy, it's just filtering out orders
> that don't fit and therefore definitely cannot be used. The result is a set of
> orders that could be used, and it's the policy maker's job to decide which one.
> Currently that policy is "biggest one that fits && does not overlap other
> non-none ptes && folio successfully allocated". That's implemented in the next
> patch and would potentially be swapped out in the future for something more
> exotic/optimal.
>
> So I don't think we need any extra comments here.
Ack.
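For the archive, here's roughly how I understand the filtering (a
simplified sketch, not the patch's code; it only shows the anonymous
"does the naturally aligned block fit in the VMA" part):

static unsigned long filter_suitable_orders(struct vm_area_struct *vma,
					    unsigned long addr,
					    unsigned long orders)
{
	while (orders) {
		int order = fls_long(orders) - 1;	/* highest remaining */
		unsigned long hpage_size = PAGE_SIZE << order;
		unsigned long haddr = ALIGN_DOWN(addr, hpage_size);

		/*
		 * If the naturally aligned block for this order fits inside
		 * the VMA, every lower order fits too, so stop filtering.
		 */
		if (haddr >= vma->vm_start && haddr + hpage_size <= vma->vm_end)
			break;

		orders &= ~(1UL << order);	/* too big for this VMA */
	}

	return orders;
}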
...
>>> -bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
>>> - bool smaps, bool in_pf, bool enforce_sysfs)
>>> +#define hugepage_global_enabled() \
>>> + (transparent_hugepage_flags & \
>>> + ((1<<TRANSPARENT_HUGEPAGE_FLAG) | \
>>> + (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)))
>>> +
>>> +#define hugepage_global_always() \
>>> + (transparent_hugepage_flags & \
>>> + (1<<TRANSPARENT_HUGEPAGE_FLAG))
>>> +
>>
>> Here again, I'd like to request that we avoid macros, as I don't think
>> they are strictly required.
>
> Yeah I agree. I did them this way, because they already existed and I was just
> moving them from the header to here. But I'll change to static inline.
>
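Great. In case it's useful, the static inline form is just a direct
transcription of the two macros above (sketch only):

static inline bool hugepage_global_enabled(void)
{
	return transparent_hugepage_flags &
	       ((1 << TRANSPARENT_HUGEPAGE_FLAG) |
		(1 << TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
}

static inline bool hugepage_global_always(void)
{
	return transparent_hugepage_flags &
	       (1 << TRANSPARENT_HUGEPAGE_FLAG);
}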
>>
>>> +bool hugepage_flags_enabled(void)
>>> {
>>> + /*
>>> + * We cover both the anon and the file-backed case here; we must return
>>> + * true if globally enabled, even when all anon sizes are set to never.
>>> + * So we don't need to look at huge_anon_orders_global.
>>> + */
>>> + return hugepage_global_enabled() ||
>>> + huge_anon_orders_always ||
>>> + huge_anon_orders_madvise;
>>> +}
>>> +
>>> +/**
>>> + * hugepage_vma_check - determine which hugepage orders can be applied to vma
>>> + * @vma: the vm area to check
>>> + * @vm_flags: use these vm_flags instead of vma->vm_flags
>>> + * @smaps: whether answer will be used for smaps file
>>> + * @in_pf: whether answer will be used by page fault handler
>>> + * @enforce_sysfs: whether sysfs config should be taken into account
>>> + * @orders: bitfield of all orders to consider
>>> + *
>>> + * Calculates the intersection of the requested hugepage orders and the allowed
>>> + * hugepage orders for the provided vma. Permitted orders are encoded as a set
>>> + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
>>> + * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
>>> + *
>>> + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
>>> + * orders are allowed.
>>> + */
>>> +unsigned long hugepage_vma_check(struct vm_area_struct *vma,
>>> + unsigned long vm_flags, bool smaps, bool in_pf,
>>> + bool enforce_sysfs, unsigned long orders)
>>
>> Here again, a bool return type has been changed to a bitfield. Let's
>> also update the function name. Maybe one of these:
>>
>> hugepage_vma_orders()
>> hugepage_vma_allowable_orders()
>
> thp_vma_allowable_orders()?
>
Even better, yes.
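And just to double-check my reading of the returned bitfield, a caller
would walk it from highest to lowest order, something like this
(illustrative sketch only; thp_vma_allowable_orders() is the proposed
new name, and try_one_order() is a hypothetical stand-in for "allocate
and map a folio of this order"):

static int fault_with_best_order(struct vm_area_struct *vma,
				 unsigned long addr)
{
	unsigned long orders;
	int order;

	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
					  false /* smaps */,
					  true  /* in_pf */,
					  true  /* enforce_sysfs */,
					  THP_ORDERS_ALL_ANON);

	while (orders) {
		order = fls_long(orders) - 1;	/* highest remaining order */
		if (try_one_order(vma, addr, order))
			return 0;
		orders &= ~(1UL << order);	/* try the next size down */
	}

	return -ENOMEM;				/* no permitted order worked */
}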
thanks,
--
John Hubbard
NVIDIA