[PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
Lorenzo Stoakes
lorenzo.stoakes at oracle.com
Tue Jul 1 03:21:57 PDT 2025
On Tue, Jul 01, 2025 at 10:53:08AM +0100, Ryan Roberts wrote:
> On 01/07/2025 09:51, Lorenzo Stoakes wrote:
> > On Tue, Jul 01, 2025 at 09:30:51AM +0100, Ryan Roberts wrote:
> >>>> In an ideal world we would flatten and just have mprotect_folio_pte_batch()
> >>>> return the batch size considering all the relevant PTE bits AND the
> >>>> AnonExclusive bit on the pages. IIRC one of Dev's earlier versions modified the
> >>>> core folio_pte_batch() function to also look at the AnonExclusive bit, but I
> >>>> really disliked changing that core function (I think others did too?).
> >>>
> >>> Yeah let's not change the core function.
> >>>
> >>> My suggestion is to have mprotect_folio_pte_batch() do this.
> >>>
> >>>>
> >>>> So barring that approach, we are really only left with the batch and sub-batch
> >>>> approach - although, yes, it could be abstracted more. We could maintain a
> >>>> context struct that persists across all calls to mprotect_folio_pte_batch() and
> >>>> it can use that to keep its state to remember if we are in the middle of a
> >>>> sub-batch and decide either to call folio_pte_batch() to get a new batch, or
> >>>> call anon_exclusive_batch() to get the next sub-batch within the current batch.
> >>>> But that started to feel overly abstracted to me.
> >>>
> >>> Having this nested batch/sub-batch loop really feels worse. You just get lost in
> >>> the complexity here very easily.
> >>>
> >>> But i"m also not sure we need to maintain _that_ much state?
> >>>
> >>> We're already looping over all of the PTEs here, so abstracting _the entire
> >>> loop_ and all the sub-batch stuff into another function (sensibly
> >>> mprotect_folio_pte_batch() I think), so it handles this for you, makes a
> >>> ton of sense.
> >>
> >> So effectively turn mprotect_folio_pte_batch() into an iterator; have a struct
> >> and a function to init the struct for the number of ptes we want to iterate
> >> over, then a per-iteration function that progressively returns batches?
> >
> > Is that really necessary though?
> >
> > Idea is that mprotect_folio_pte_batch() returns the nr_ptes _taking into account
> > the PAE stuff_.
>
> The issue is the efficiency. Assuming we want to keep the PTE scan contained
> within the core folio_pte_batch() function and we _don't_ want to add PAE
> awareness to that function, then we have 2 separate, independent loops; one for
> PTE scanning and the other for PAE scanning. If the first loop scans through and
> returns 512, but then the PAE scan returns 1, we return 1. If we don't remember
> for the next time that we already determined we have a PTE batch of 512 (now
> 511) then we will rescan the 511 PTEs and potentially return 1 again due to PAE.
> Then 510, then 509...
Hm really?
The idea is mprotect_folio_pte_batch() would look ahead and determine the
PAE/non-PAE sub-batch and return this nr_pages. It'd check 'this page is PAE, so
when's the next page that is not/hit max_nr_pages?'
So that'd be 1 in the first case.
Then you loop around and go again, and this time it'd check 'this page is !PAE,
so when's the next page that is/hit max_nr_pages?' and then it'd return 511.
A better example, I think, is if we had it, for the sake of argument, return 16,
16, 480.
Then we scan ahead 16, set nr_ptes = 16, process 16 PTEs. Then the same again,
then the same again only for 480 PTEs.
Each time we set nr_ptes = the sub-batch size.
So I don't think we'll see O(n^2) here?
It would be like:
	do {
		/* now returns sub-batch count */
		nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
						   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);

		... rest of logic remains roughly the same ...
	} while (...);
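To make that concrete, the helper itself could look something along the lines of
the below. Rough sketch only, entirely untested: I'm hand-waving the parameter
list, the folio_pte_batch() call and the PageAnonExclusive() walk rather than
quoting your patch:

	static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
					    pte_t *ptep, pte_t pte, int max_nr_ptes,
					    fpb_t flags)
	{
		bool first_pae;
		long idx;
		int nr_ptes, i;

		if (!folio || !folio_test_large(folio) || max_nr_ptes == 1)
			return 1;

		/*
		 * Batch on the usual PTE bits first. I'm hand-waving the exact
		 * folio_pte_batch() parameters here; the point is only that the
		 * PTE scan stays in the core helper, untouched.
		 */
		nr_ptes = folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes,
					  flags, NULL, NULL, NULL);

		if (!folio_test_anon(folio))
			return nr_ptes;

		/*
		 * Clamp to the run of pages sharing the same AnonExclusive
		 * state, so each call hands back one PAE-uniform sub-batch and
		 * the caller never rescans PTEs it has already covered.
		 */
		idx = folio_page_idx(folio, pte_page(pte));
		first_pae = PageAnonExclusive(folio_page(folio, idx));
		for (i = 1; i < nr_ptes; i++) {
			if (PageAnonExclusive(folio_page(folio, idx + i)) != first_pae)
				break;
		}

		return i;
	}

The key point being that folio_pte_batch() stays untouched and all the PAE
awareness lives in the mprotect-specific wrapper.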
I may be being naive here in some way?
>
> That feels inefficient to me. So I'd much rather just remember that we have a
> batch of 512, then split into sub batches as needed for PAE compliance. Then we
> only scan each PTE once and each struct page once.
>
> But to achieve this, we either need to merge the 2 loops or we need to carry
> state across function calls (i.e. like an iterator).
>
> >
> > Would this break anything?
>
> It's not about breaking anything, it's about scanning efficiently. Perhaps you
> don't think it's worth worrying about in practice?
The question re: breaking was: if we re-do things like getting oldpte, ptent,
etc. on each sub-batch, does _that_ break anything?
The current implementation is not acceptable on the basis that it adds a horrible
level of complexity. That function is already terrible, and with an inner loop
added for this batch special-casing, with _sub-batches_ to account for PAE,
nobody is going to understand what's going on.
	do {
		if (...) {
			while (...) {
				help!!!
We can do better, and I'm going to go further and say - this series _has_ to do
better, because I can't accept that, however we do it.
I want the efficiency gainz as much as you guys, but I'm convinced we can do it
without causing eye-bleeding confusion...
>
> >
> > We might need to pass a flag to say 'don't account for this' for prot numa case.
>
> Yep, another bool ;-)
Yeah... but we can't sensibly add a flag for this, so the flag idea doesn't fly
for that either... :>)
I mean, I don't think we actually need that flag: just let it skip the sub-batch,
then check again on the next one. That, I reckon, is a small overhead.
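I.e. in the caller it'd just be something like this (again a rough sketch;
prot_numa_skip() is a made-up stand-in for the existing prot_numa checks, not a
real helper):

	do {
		oldpte = ptep_get(pte);
		...
		nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
						   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);

		/*
		 * No flag needed: if the prot_numa checks say skip this folio,
		 * we just skip this sub-batch and re-check on the next one.
		 */
		if (prot_numa && prot_numa_skip(folio))
			continue;

		... change protection for nr_ptes entries ...
	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);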
>
> >
> >>
> >> Then we just have a simple loop here that gets the next batch and processes it?
>
Another advantage of doing things this way is we can actually add a comment
explaining the sub-batch size.
Right now there's just absolutely _nothing_ and it's entirely unclear what on
earth is going on.