[PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
Dev Jain
dev.jain at arm.com
Wed Aug 6 02:37:49 PDT 2025
On 06/08/25 2:51 pm, David Hildenbrand wrote:
> On 06.08.25 11:12, Lorenzo Stoakes wrote:
>> On Wed, Aug 06, 2025 at 10:08:33AM +0200, David Hildenbrand wrote:
>>> On 18.07.25 11:02, Dev Jain wrote:
>>>> Signed-off-by: Dev Jain <dev.jain at arm.com>
>>>
>>>
>>> I wanted to review this, but looks like it's already upstream and I
>>> suspect it's buggy (see the upstream report I cc'ed you on).
>>>
>>> [...]
>>>
>>>> +
>>>> +/*
>>>> + * This function is a result of trying our very best to retain the
>>>> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
>>>> + * if the vma is a private vma, and we cannot determine whether to change
>>>> + * the pte to writable just from the vma and the pte, we then need to look
>>>> + * at the actual page pointed to by the pte. Unfortunately, if we have a
>>>> + * batch of ptes pointing to consecutive pages of the same anon large folio,
>>>> + * the anon-exclusivity (or the negation) of the first page does not guarantee
>>>> + * the anon-exclusivity (or the negation) of the other pages corresponding to
>>>> + * the pte batch; hence in this case it is incorrect to decide to change or
>>>> + * not change the ptes to writable just by using information from the first
>>>> + * pte of the batch. Therefore, we must individually check all pages and
>>>> + * retrieve sub-batches.
>>>> + */
>>>> +static void commit_anon_folio_batch(struct vm_area_struct *vma,
>>>> +		struct folio *folio, unsigned long addr, pte_t *ptep,
>>>> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>>>> +{
>>>> +	struct page *first_page = folio_page(folio, 0);
>>>
>>> Who says that we have the first page of the folio mapped into the
>>> first PTE of the batch?
>>
>> Yikes, missed this, sorry. Got too tied up in the algorithm here.
>>
>> You mean in _this_ PTE of the batch, right? As we're invoking these on
>> each part of the PTE table.
>>
>> I mean I guess we can simply do:
>>
>> struct page *first_page = pte_page(ptent);
>>
>> Right?
>
> Yes, but we should forward the result from vm_normal_page(), which does
> exactly that for you, and increment the page accordingly as required,
> just like with the pte we are processing.
Makes sense. I guess I will have to change the signature of prot_numa_skip()
to pass a double pointer to a page instead of a folio, derive the folio in
the caller, and pass both the folio and the page down to
set_write_prot_commit_flush_ptes().
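
For commit_anon_folio_batch() itself, I am thinking of something along
these lines (untested sketch; it reuses the sub-batching helper from this
patch and assumes the caller forwards the page it already obtained from
vm_normal_page(), instead of us computing folio_page(folio, 0)):

static void commit_anon_folio_batch(struct vm_area_struct *vma,
		struct folio *folio, struct page *first_page, unsigned long addr,
		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
		struct mmu_gather *tlb)
{
	bool expected_anon_exclusive;
	int sub_batch_idx = 0;
	int len;

	while (nr_ptes) {
		/* Sub-batch on the anon-exclusivity of the page this PTE maps. */
		expected_anon_exclusive = PageAnonExclusive(first_page);
		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
					first_page, expected_anon_exclusive);
		/* Only anon-exclusive pages may be mapped writable here. */
		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
				       sub_batch_idx, expected_anon_exclusive, tlb);
		sub_batch_idx += len;
		nr_ptes -= len;
		/* Pages of the batch are consecutive within the folio. */
		first_page += len;
	}
}

Since the pages of a batch are consecutive, advancing first_page by the
sub-batch length keeps the page in sync with the PTE index, so we never
assume the batch starts at the folio's first page.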
>
> ...
>
>>>
>>>> +	else
>>>> +		prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
>>>> +			nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
>>>
>>> Semi-broken indentation.
>>
>> You mean because of the else and then the two lines after?
>
> 	prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
> 			       nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
>
> Is what I would have expected.
>
>
> I think a smart man once said that if you need more than one line per
> statement in an if/else clause, a set of {} can aid readability. But I
> don't particularly care :)
>
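
Ack, will fix. When respinning I will align the continuation under the
opening parenthesis and, per your suggestion, wrap the branch in braces,
roughly:

	} else {
		prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
				       nr_ptes, /* idx = */ 0,
				       /* set_write = */ false, tlb);
	}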