[PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance

Sun May 3 12:55:43 PDT 2026

On Mon, May 4, 2026 at 2:17 AM Jan Kara <jack at suse.cz> wrote:
>
> On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy at infradead.org> wrote:
> > > >
> > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > both the hardware and the software stack (bio/request queues and the
> > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > for an unpredictable amount of time.
> > > >
> > > > But does that actually happen?  I find it hard to believe that thread A
> > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > it still seems really unlikely to me.
> > >
> > > It doesn’t have to involve unmapping or applying mprotect to
> > > the entire VMA—just a portion of it is sufficient.
> >
> > Yes, but that still fails to answer "does this actually happen".  How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
>
> I fully agree with you we should verify whether the retry code still brings
> in real-world advantage today with VMA locks. After all the retry logic has
> been introduced in 2010. That being said if there are realistic loads where
> one thread needs VMA write lock while another thread is faulting the VMA,
> then the latencies can be indeed extreme. For example things like cgroup IO
> throttling happen on the IO path and thus can throttle IO of a low-priority
> thread for a long time.

I’m quite sure that swap-in and operations that modify a
VMA can run concurrently, and that this is fairly common.
For example, a Java GC may use mprotect or userfaultfd on
a small portion of a large Java heap while other portions
are still in do_swap_page().
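
To make the pattern concrete, here is a minimal userspace
sketch. It is not taken from any real GC; the names
struct heap, toucher() and race_mprotect_vs_faults() are
made up for illustration. One thread keeps faulting pages
in the upper half of an anonymous mapping while another
flips protection on a single page in the lower half of the
same mapping. It only generates minor anonymous faults,
not real swap-in through do_swap_page(), so it shows the
shape of the concurrency rather than the I/O latency:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for a large managed heap. */
struct heap {
	unsigned char *base;	/* start of the anonymous mapping */
	size_t len;		/* total length in bytes */
	size_t page;		/* page size */
	atomic_int stop;	/* set to 1 to end the toucher thread */
};

/* Mutator stand-in: keeps faulting pages in the upper half. */
static void *toucher(void *arg)
{
	struct heap *h = arg;

	while (!atomic_load(&h->stop)) {
		for (size_t off = h->len / 2; off < h->len; off += h->page)
			h->base[off]++;
		/* Drop the pages so the next pass faults again. */
		madvise(h->base + h->len / 2, h->len / 2, MADV_DONTNEED);
	}
	return NULL;
}

/*
 * GC stand-in: flips protection on the first page while the toucher
 * is faulting elsewhere in the same mapping. len is assumed to be a
 * multiple of twice the page size. Returns the number of rounds in
 * which both mprotect() calls succeeded, or -1 on setup failure.
 */
int race_mprotect_vs_faults(size_t len, int rounds)
{
	struct heap h = { .len = len, .page = (size_t)sysconf(_SC_PAGESIZE) };
	pthread_t tid;
	int ok = 0;

	h.base = mmap(NULL, len, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (h.base == MAP_FAILED)
		return -1;
	atomic_init(&h.stop, 0);
	if (pthread_create(&tid, NULL, toucher, &h) != 0) {
		munmap(h.base, len);
		return -1;
	}
	for (int i = 0; i < rounds; i++) {
		if (mprotect(h.base, h.page, PROT_READ) == 0 &&
		    mprotect(h.base, h.page, PROT_READ | PROT_WRITE) == 0)
			ok++;
	}
	atomic_store(&h.stop, 1);
	pthread_join(tid, NULL);
	munmap(h.base, len);
	return ok;
}
```

Calling race_mprotect_vs_faults(64 << 20, 1000) from a
small main() and building with -pthread is enough to run
it. The two threads never touch the same pages, yet each
mprotect() changes the VMA that the other thread is
faulting in.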

If we start exploring different approaches for anon and
file VMAs, I agree I can check on an Android phone whether
there is a real, serious case where a file VMA is being
modified while a page fault on it is in progress.

Please note that, as an Android developer, I am particularly
cautious about priority inversion. A recent issue causing
severe priority inversion is zram attempting to support
preemption[1]. When a task performing compression or
decompression is migrated to another CPU and then preempted
by other tasks, high-priority tasks waiting on the mutex may
be significantly delayed, impacting user experience.

>
> BTW I'm not sure I quite understand Barry's priority inversion problem
> since I'd expect all threads of a task to generally be treated with the
> same priority...

That is not the case. Maybe these slides[2] and this
project[3] can give you a hint. They aim to standardize
things on Linux by learning from Apple's OS. Basically,
tasks are classified into five types:

USER_INTERACTIVE: Requires immediate response.
USER_INITIATED: Tolerates a short delay, but must still respond quickly.
UTILITY: Tolerates long delays, but not prolonged ones.
BACKGROUND: Doesn’t mind prolonged delays.
DEFAULT: System default behavior.

[1] https://lore.kernel.org/linux-mm/20250303022425.285971-3-senozhatsky@chromium.org/
[2] https://lpc.events/event/19/contributions/2089/attachments/1797/3877/Userspace%20Assisted%20Scheduling%20via%20Sched%20QoS.pdf
[3] https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/

Thanks
Barry