[PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
Dev Jain
dev.jain at arm.com
Wed Aug 6 02:37:49 PDT 2025
On 06/08/25 2:51 pm, David Hildenbrand wrote:
> On 06.08.25 11:12, Lorenzo Stoakes wrote:
>> On Wed, Aug 06, 2025 at 10:08:33AM +0200, David Hildenbrand wrote:
>>> On 18.07.25 11:02, Dev Jain wrote:
>>>> Signed-off-by: Dev Jain <dev.jain at arm.com>
>>>
>>>
>>> I wanted to review this, but looks like it's already upstream and I
>>> suspect it's buggy (see the upstream report I cc'ed you on).
>>>
>>> [...]
>>>
>>>> +
>>>> +/*
>>>> + * This function is a result of trying our very best to retain the
>>>> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
>>>> + * if the vma is a private vma, and we cannot determine whether to change
>>>> + * the pte to writable just from the vma and the pte, we then need to look
>>>> + * at the actual page pointed to by the pte. Unfortunately, if we have a
>>>> + * batch of ptes pointing to consecutive pages of the same anon large folio,
>>>> + * the anon-exclusivity (or the negation) of the first page does not guarantee
>>>> + * the anon-exclusivity (or the negation) of the other pages corresponding to
>>>> + * the pte batch; hence in this case it is incorrect to decide to change or
>>>> + * not change the ptes to writable just by using information from the first
>>>> + * pte of the batch. Therefore, we must individually check all pages and
>>>> + * retrieve sub-batches.
>>>> + */
>>>> +static void commit_anon_folio_batch(struct vm_area_struct *vma,
>>>> +		struct folio *folio, unsigned long addr, pte_t *ptep,
>>>> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>>>> +{
>>>> +	struct page *first_page = folio_page(folio, 0);
>>>
>>> Who says that we have the first page of the folio mapped into the
>>> first PTE of the batch?
>>
>> Yikes, missed this, sorry. Got too tied up in the algorithm here.
>>
>> You mean in _this_ PTE of the batch, right? As we're invoking these on
>> each part of the PTE table.
>>
>> I mean I guess we can simply do:
>>
>> struct page *first_page = pte_page(ptent);
>>
>> Right?
>
> Yes, but we should forward the result from vm_normal_page(), which does
> exactly that for you, and increment the page accordingly as required,
> just like with the pte we are processing.
Makes sense. I guess I will have to change the signature of prot_numa_skip()
to pass a double pointer to a page instead of a folio, derive the folio in
the caller, and pass both the folio and the page down to
set_write_prot_commit_flush_ptes().
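
For commit_anon_folio_batch() itself, I am thinking of something along
these lines (untested sketch; it reuses the sub-batching helper from this
patch and assumes the caller forwards the page it already obtained from
vm_normal_page(), instead of us computing folio_page(folio, 0)):

static void commit_anon_folio_batch(struct vm_area_struct *vma,
		struct folio *folio, struct page *first_page, unsigned long addr,
		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
		struct mmu_gather *tlb)
{
	bool expected_anon_exclusive;
	int sub_batch_idx = 0;
	int len;

	while (nr_ptes) {
		/* Sub-batch on the anon-exclusivity of the page this PTE maps. */
		expected_anon_exclusive = PageAnonExclusive(first_page);
		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
					first_page, expected_anon_exclusive);
		/* Only anon-exclusive pages may be mapped writable here. */
		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
				       sub_batch_idx, expected_anon_exclusive, tlb);
		sub_batch_idx += len;
		nr_ptes -= len;
		/* Pages of the batch are consecutive within the folio. */
		first_page += len;
	}
}

Since the pages of a batch are consecutive, advancing first_page by the
sub-batch length keeps the page in sync with the PTE index, so we never
assume the batch starts at the folio's first page.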
>
> ...
>
>>>
>>>> +	else
>>>> +		prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
>>>> +			nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
>>>
>>> Semi-broken indentation.
>>
>> You mean because of the else and then the two lines after?
>
> 	prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
> 			       nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
>
> Is what I would have expected.
>
>
> I think a smart man once said that if you need more than one line per
> statement in an if/else clause, a set of {} can aid readability. But I
> don't particularly care :)
>
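
Ack, will fix. When respinning I will align the continuation under the
opening parenthesis and, per your suggestion, wrap the branch in braces,
roughly:

	} else {
		prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
				       nr_ptes, /* idx = */ 0,
				       /* set_write = */ false, tlb);
	}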