[RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory

David Hildenbrand david at redhat.com
Mon Apr 17 08:44:25 PDT 2023


>>>
>>>>
>>>> Further, we have to be a bit careful regarding replacing ranges that are backed
>>>> by different anon pages (for example, due to fork() deciding to copy some
>>>> sub-pages of a PTE-mapped folio instead of sharing all sub-pages).
>>>
>>> I don't understand this statement; do you mean "different anon _folios_"? I am
>>> scanning the page table to expand the region that I reuse/copy and as part of
>>> that scan, make sure that I only cover a single folio. So I think I conform here
>>> - the scan would give up once it gets to the hole.
>>
>> During fork(), what could happen (temporary detection of pinned page resulting
>> in a copy) is something weird like:
>>
>> PTE 0: subpage0 of anon page #1 (maybe shared)
>> PTE 1: subpage1 of anon page #1 (maybe shared)
>> PTE 2: anon page #2 (exclusive)
>> PTE 3: subpage2 of anon page #1 (maybe shared)
> 
> Hmm... I can see how this could happen if you mremap PTE2 to PTE3, then mmap
> something new in PTE2. But I don't see how it happens at fork. For PTE3, did you
> mean subpage _3_?
>

Yes, fat fingers :) Thanks for paying attention!

The above could be optimized by processing all consecutive PTEs at
once: that is, we would check only once whether the page may be
pinned, and then either copy all PTEs or share all PTEs. Such a
layout is unlikely to happen in practice, I guess, though.
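
Just to illustrate what I mean (not a proposal for the actual patch),
something along the lines of the sketch below. The helper name and
calling convention are made up, vm_normal_page() is assumed to hand
back a subpage of an anon folio here, and all of the rmap/refcount
bookkeeping the real copy path has to do is left out:

static int copy_present_ptes_batch(struct vm_area_struct *dst_vma,
				   struct vm_area_struct *src_vma,
				   pte_t *dst_pte, pte_t *src_pte,
				   unsigned long addr, int nr)
{
	struct page *page = vm_normal_page(src_vma, addr, ptep_get(src_pte));
	struct folio *folio = page_folio(page);
	int i;

	/* Decide once for the whole run instead of once per PTE. */
	if (folio_maybe_dma_pinned(folio))
		return -EAGAIN;	/* caller copies the whole run into new pages */

	/*
	 * Share all nr PTEs: write-protect the parent and map the same
	 * subpages write-protected into the child. The real code would
	 * also have to duplicate the anon rmap, take folio references
	 * and clear PageAnonExclusive on each subpage.
	 */
	for (i = 0; i < nr; i++, addr += PAGE_SIZE) {
		pte_t pte = pte_wrprotect(ptep_get(src_pte + i));

		set_pte_at(src_vma->vm_mm, addr, src_pte + i, pte);
		set_pte_at(dst_vma->vm_mm, addr, dst_pte + i, pte);
	}
	return 0;
}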


>>
>> Of course, any combination of above.
>>
>> Further, with mremap() we might get completely crazy layouts, randomly mapping
>> sub-pages of anon pages, mixed with other sub-pages or base-page folios.
>>
>> Maybe it's all handled already by your code, just pointing out which kind of
>> mess we might get :)
> 
> Yep, this is already handled; the scan to expand the range ensures that all the
> PTEs map to the expected contiguous pages in the same folio.

Okay, great.
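
To make sure we mean the same thing: I picture that scan looking
roughly like the sketch below. The helper name is made up and the real
series may well structure the checks differently; the point is just
that the run ends at the first PTE that does not map the next
consecutive subpage of the same folio, which also covers the
mremap()-style layouts above:

static int nr_ptes_cont_mapped(struct folio *folio, struct page *page,
			       pte_t *ptep, int max_nr)
{
	unsigned long pfn = page_to_pfn(page);
	int i;

	for (i = 0; i < max_nr; i++, ptep++, pfn++) {
		pte_t pte = ptep_get(ptep);

		/* Stop at holes, swap/migration entries, foreign pages. */
		if (!pte_present(pte) || pte_pfn(pte) != pfn)
			break;

		/* Never cross into a neighbouring folio. */
		if (page_folio(pfn_to_page(pfn)) != folio)
			break;
	}
	return i;
}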

> 
>>
>>>
>>>>
>>>>
>>>> So what should be safe is replacing all sub-pages of a folio that are marked
>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
>>>> worth the complexity. For THP we were happy so far to *not* optimize this,
>>>> implying that maybe we shouldn't worry about optimizing the fork() case for now
>>>> that heavily.
>>>
>>> I don't have the exact numbers to hand, but I'm pretty sure I remember that
>>> enabling large copies contributed a measurable amount to the performance
>>> improvement. (Certainly, the zero-page copy case is a big contributor.)
>>> I don't have access to the HW at the moment but can rerun later with and
>>> without to double check.
>>
>> In which test exactly? Some micro-benchmark?
> 
> The kernel compile benchmark that I quoted numbers for in the cover letter. I
> have some trace points (not part of the submitted series) that tell me how many
> mappings of each order we get for each code path. I'm pretty sure I remember all
> of these 4 code paths contributing non-negligible amounts.

Interesting! It would be great to see whether there is still an actual 
difference with patch #10 applied but without the other COW replacement.

-- 
Thanks,

David / dhildenb
