[LSF/MM/BPF TOPIC] Per-process page size
David Hildenbrand (Arm)
david at kernel.org
Wed Feb 18 01:15:14 PST 2026
On 2/18/26 09:58, Dev Jain wrote:
>
> On 18/02/26 2:09 pm, Dev Jain wrote:
>> On 17/02/26 8:52 pm, Matthew Wilcox wrote:
>>> Please don't use the term "enlighten". That's used to describe
>>> something or other with hypervisors. Come up with a new term or use one
>>> that already exists.
>> Sure.
>>
>>> That's going to be messy. I don't have a good idea for solving this
>>> problem, but the page cache really isn't set up to change minimum folio
>>> order while the inode is in use.
>> Holding mapping->invalidate_lock, bumping mapping->min_folio_order, and
>> dropping and re-reading the range suffers from a race: filemap_fault operating
>> on some other partially populated 64K range will observe in filemap_get_folio
>> that nothing is in the pagecache. It will then read the updated min_order
>> in __filemap_get_folio and use filemap_add_folio to add a 64K folio, but since
>> the 64K range is partially populated, we get stuck in an infinite loop due to -EEXIST.
>>
>> So I figured that deleting the entire pagecache is simpler. We will also bail
>> out early in __filemap_add_folio if the folio order the caller asks us to
>> create is less than mapping_min_folio_order. Eventually the caller is going
>> to read the correct min order. This algorithm avoids the race above, however...
>>
>> my assumption here was that we are synchronized on mapping->invalidate_lock.
>> The kerneldoc above read_cache_folio() and some other comments convinced me
>> of that, but I just checked with a VM_WARN_ON(!rwsem_is_locked()) in
>> __filemap_add_folio and this doesn't seem to be the case for all code paths...
>> If the algorithm sounds reasonable, I wonder what is the correct synchronization
>> mechanism here.
>
> I may have been vague here... to avoid the race I described above, we must
> ensure that, after all folios have been dropped from the pagecache and min
> order has been bumped up, no other code path remembers the old order and
> partially populates a 64K range. For this we need synchronization.
And I don't think you can reliably do that when other processes might be
using the files concurrently.
It's best to start like Ryan suggested: lift min_order on these
systems for now and leave dynamic switching of the min order as
future work.
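For illustration, here is a toy model (in Python, with made-up names; not the actual kernel API) of the -EEXIST livelock Dev describes: once min_order is bumped to 4 (64K folios on a 4K base page size), a fault that must insert a 16-page folio into a range that is still partially populated by an old order-0 page fails on every attempt, while an empty range succeeds immediately.

```python
# Toy model of the race: a pagecache as index -> folio-id slots,
# an add path that (like __filemap_add_folio) fails with -EEXIST if
# any slot in the folio's range is already taken, and a fault path
# that retries the insert. All names are illustrative.

EEXIST = 17

class PageCache:
    def __init__(self):
        self.slots = {}  # page index -> folio id

    def add_folio(self, index, order, folio_id):
        nr = 1 << order
        # Fail if any slot in [index, index + nr) is occupied.
        if any(i in self.slots for i in range(index, index + nr)):
            return -EEXIST
        for i in range(index, index + nr):
            self.slots[i] = folio_id
        return 0

def fault(cache, index, min_order, max_retries=5):
    """Retry inserting a min_order folio covering `index`.

    Returns the number of retries used on success, or None if every
    attempt hit -EEXIST (the infinite loop, capped here for the demo).
    """
    base = index & ~((1 << min_order) - 1)  # align to folio boundary
    for attempt in range(max_retries):
        if cache.add_folio(base, min_order, folio_id=index) == 0:
            return attempt
    return None

cache = PageCache()
# A leftover order-0 page at index 3, inserted before min_order was bumped:
cache.add_folio(index=3, order=0, folio_id=99)
# min_order is now 4, but index 3 partially populates the 0..15 range:
assert fault(cache, index=7, min_order=4) is None   # livelock
# An untouched 64K range succeeds on the first attempt:
assert fault(cache, index=16, min_order=4) == 0
```

The demo also shows why dropping the whole pagecache alone is not enough: the stale order-0 entry only has to survive (or be re-created by a racing path that remembers the old order) for the new-order insert to spin.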
--
Cheers,
David
More information about the linux-arm-kernel mailing list