[PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
Barry Song
baohua at kernel.org
Mon May 4 07:15:52 PDT 2026
On Mon, May 4, 2026 at 9:04 PM Jan Kara <jack at suse.cz> wrote:
>
> On Mon 04-05-26 03:55:43, Barry Song wrote:
> > On Mon, May 4, 2026 at 2:17 AM Jan Kara <jack at suse.cz> wrote:
> > > On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy at infradead.org> wrote:
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen". How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > >
> > > I fully agree with you that we should verify whether the retry code still
> > > brings a real-world advantage today with VMA locks. After all, the retry
> > > logic was introduced in 2010. That being said, if there are realistic loads
> > > where one thread needs the VMA write lock while another thread is faulting
> > > in the VMA, then the latencies can indeed be extreme. For example, things
> > > like cgroup IO throttling happen on the IO path and can thus throttle the
> > > IO of a low-priority thread for a long time.
> >
> > I’m quite sure that swap-in and VMA writes can occur
> > concurrently, and this is fairly common. For example,
> > Java GC may use mprotect or userfaultfd on a small
> > portion of a large Java heap while other portions are
> > still under do_swap_page().
>
> OK, makes sense.
>
> > If we start exploring different approaches for anon and
> > file, I agree I can revisit this on an Android phone if
> > there is a real, serious case where a file VMA can be
> > written and a page fault occurs at the same time.
> >
> > Please note that, as an Android developer, I am particularly
> > cautious about priority inversion. A recent issue causing
> > severe priority inversion is zram attempting to support
> > preemption[1]. When a task performing compression or
> > decompression is migrated to another CPU and then preempted
> > by other tasks, high-priority tasks waiting on the mutex may
> > be significantly delayed, impacting user experience.
>
> Well, container people are concerned about priority inversion as well. But
> usually that involves coarse locks (such as global filesystem locks), whereas
> the VMA lock is specific to a task (and a VMA), so the opportunity for
> priority inversion looks more limited there. But the example with Java, where
> the GC thread can presumably have higher priority than ordinary Java threads,
> is an interesting one.
A major difference with Android apps is that each thread can
affect the user experience to a different degree. And it is not
simply a matter of whether a VMA writer has higher or lower
priority than the page-fault (PF) thread performing I/O.
For example, thread A handles a PF; thread B attempts to
modify the VMA in which the PF occurs; and thread C tries to
modify another VMA (requiring mmap_lock in write mode) or to
iterate over VMAs (requiring mmap_lock in read mode).
Regardless of thread B’s priority, it holds mmap_lock in write
mode while waiting for the VMA lock, because the usual pattern
for a VMA writer is:

	mmap_write_lock()
	vma_start_write()

As a result, thread C can be blocked even if it has higher
priority and operates on a different VMA.
In essence, when a PF and a VMA write occur concurrently,
high-priority threads may be blocked even if they operate on
different VMAs, not necessarily the same one.
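
To make the blocking chain concrete, below is a rough userspace
model (not kernel code and not part of this series): mmap_lock and
VMA1's per-VMA lock are stood in by pthread rwlocks, the fault is
assumed to hold the per-VMA read lock across its I/O, and the thread
names follow the A/B/C example above.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for mmap_lock and VMA1's per-VMA lock; names are labels only. */
static pthread_rwlock_t mmap_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_rwlock_t vma1_lock = PTHREAD_RWLOCK_INITIALIZER;

static void *thread_a(void *arg)		/* PF in VMA1 */
{
	pthread_rwlock_rdlock(&vma1_lock);	/* per-VMA read lock held across... */
	sleep(2);				/* ...a slow swap-in I/O */
	pthread_rwlock_unlock(&vma1_lock);
	return NULL;
}

static void *thread_b(void *arg)		/* modifies part of VMA1 */
{
	pthread_rwlock_wrlock(&mmap_lock);	/* mmap_write_lock() */
	pthread_rwlock_wrlock(&vma1_lock);	/* vma_start_write(): waits for A */
	pthread_rwlock_unlock(&vma1_lock);
	pthread_rwlock_unlock(&mmap_lock);
	return NULL;
}

static void *thread_c(void *arg)		/* high priority, touches VMA2 only */
{
	sleep(1);				/* arrives after B owns mmap_lock */
	pthread_rwlock_wrlock(&mmap_lock);	/* blocked behind B for ~1 second */
	printf("thread C finally got mmap_lock\n");
	pthread_rwlock_unlock(&mmap_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b, c;

	pthread_create(&a, NULL, thread_a, NULL);
	usleep(100 * 1000);			/* let A start its "I/O" first */
	pthread_create(&b, NULL, thread_b, NULL);
	pthread_create(&c, NULL, thread_c, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	pthread_join(c, NULL);
	return 0;
}

Built with cc -pthread, thread C sits in pthread_rwlock_wrlock() for
roughly a second even though it never touches VMA1, which is exactly
the chain described above.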
Thanks
Barry