[PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
Yang Shi
shy828301 at gmail.com
Tue May 19 13:53:03 PDT 2026
On Tue, May 19, 2026 at 11:50 AM Yang Shi <shy828301 at gmail.com> wrote:
>
> On Tue, May 19, 2026 at 4:07 AM Barry Song <baohua at kernel.org> wrote:
> >
> > On Tue, May 19, 2026 at 5:21 AM Yang Shi <shy828301 at gmail.com> wrote:
> > >
> > > On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua at kernel.org> wrote:
> > > >
> > > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy at infradead.org> wrote:
> > > > >
> > > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy at infradead.org> wrote:
> > > > > > >
> > > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > > for an unpredictable amount of time.
> > > > > > >
> > > > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > > > it still seems really unlikely to me.
> > > > > >
> > > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > > the entire VMA—just a portion of it is sufficient.
> > > > >
> > > > > Yes, but that still fails to answer "does this actually happen". How much
> > > > > performance is all this complexity in the page fault handler buying us?
> > > > > If you don't answer this question, I'm just going to go in and rip it
> > > > > all out.
> > > > >
> > > >
> > > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > > waiting for answers),
> > > >
> > > > As promised during LSF/MM/BPF, we conducted thorough
> > > > testing on Android phones to determine whether performing
> > > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > > I wanted to give a quick update on this question.
> > > >
> > > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > > applications on Android devices with I/O performed under
> > > > the VMA lock in `filemap_fault()`. We found that:
> > > >
> > > > 1. There are very few cases where unmap() is blocked by
> > > > page faults. I assume this is due to buggy user code
> > > > or poor synchronization between reads and unmap().
> > > > So I assume it is not a problem.
> > > >
> > > > 2. We observed many cases where `vma_start_write()`
> > > > is blocked by page-fault I/O in some applications.
> > > > The blocking occurs in the `dup_mmap()` path during
> > > > fork().
> > > >
> > > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > > the parent process when forking"), we now always hold
> > > > `vma_write_lock()` for each VMA. Note that the
> > > > `mmap_lock` write lock is also held, which could lead to
> > > > chained waiting if page-fault I/O is performed without
> > > > releasing the VMA lock.
> > > >
> > > > My gut feeling is that Suren's commit may be overshooting,
> > > > so my rough idea is that we might want to do something like
> > > > the following (we haven't tested it yet and it might be
> > > > wrong):
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > > > *mm, struct mm_struct *oldmm)
> > > > for_each_vma(vmi, mpnt) {
> > > > struct file *file;
> > > >
> > > > - retval = vma_start_write_killable(mpnt);
> > > > + /*
> > > > + * For anonymous or writable private VMAs, prevent
> > > > + * concurrent CoW faults.
> > > > + */
> > > > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > > + (mpnt->vm_flags & VM_WRITE)))
> > > > + retval = vma_start_write_killable(mpnt);
> > > > if (retval < 0)
> > > > goto loop_out;
> > > > if (mpnt->vm_flags & VM_DONTCOPY) {
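
For readers who want to see which mappings the condition in the diff above
would still write-lock, here is a minimal user-space sketch of the predicate
alone. It is not kernel code: struct model_vma, the MODEL_VM_* bits and
model_needs_fork_write_lock() are hypothetical names made up for this
illustration, and only the boolean condition is taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits for this model only (assumption, not kernel headers). */
#define MODEL_VM_WRITE  0x2UL
#define MODEL_VM_SHARED 0x8UL

/* Hypothetical, stripped-down stand-in for struct vm_area_struct. */
struct model_vma {
	const void *vm_file;    /* NULL for an anonymous mapping */
	unsigned long vm_flags; /* combination of MODEL_VM_* bits */
};

/*
 * Mirrors the condition proposed in the diff: take the VMA write lock
 * for anonymous VMAs and for private writable file-backed VMAs.
 */
bool model_needs_fork_write_lock(const struct model_vma *vma)
{
	bool anonymous = (vma->vm_file == NULL);
	bool private_writable = !(vma->vm_flags & MODEL_VM_SHARED) &&
				(vma->vm_flags & MODEL_VM_WRITE) != 0;

	return anonymous || private_writable;
}

/* Example: a read-only file mapping (e.g. program text) would be skipped. */
void model_example(void)
{
	static int dummy_file;
	struct model_vma text = { &dummy_file, 0 };

	assert(!model_needs_fork_write_lock(&text));
}
```

Under this model, read-only file mappings and shared writable file mappings
are the cases where fork() would no longer wait for in-flight page-fault I/O.
Whether skipping the lock there is actually safe is exactly what the rest of
this thread discusses.
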
> > >
> > > Maybe a little bit off topic. This is an interesting idea. It seems
> > > possible we don't have to take vma write lock unconditionally. IIUC
> > > the write lock is mainly used to serialize against page fault and
> > > madvise, right? I got a crazy idea off the top of my head. We may be
> > > able to just take vma write lock iff vma->anon_vma is not NULL.
> > >
> > > First of all, the write mmap_lock is held, so the vma can't go away or be
> > > changed under us.
> > >
> > > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > > fault happened or no cow happened, so there is no page table to copy,
> > > this is also what copy_page_range() does currently. So we can shrink
> > > the critical section to:
> > >
> > >         if (vma->anon_vma) {
> > >                 vma_start_write_killable(src_vma);
> > >                 anon_vma_fork(dst_vma, src_vma);
> > >                 copy_page_range(dst_vma, src_vma);
> > >         }
> > >
> > > But page fault can happen before write mmap_lock is taken, when we
> > > check vma->anon_vma, it is possible it has not been set up yet. But it
> > > seems to be equivalent to page fault after fork and won't break the
> > > semantic.
> >
> > Re-reading Suren's commit log for fb49c455323ff8
> > ("fork: lock VMAs of the parent process when forking"),
> > it seems that vma_start_write() is used to protect
> > against a race where anon_vma changes from NULL to
> > non-NULL during fork. In that scenario, we hold the
> > mmap_lock write lock, but not vma_start_write(), so a
> > concurrent anon_vma_prepare() could still install an
> > anon_vma.
> >
> > " A concurrent page fault on a page newly marked read-only by the page
> > copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
> > source vma, defeating the anon_vma_clone() that wasn't done because the
> > parent vma originally didn't have an anon_vma, but we now might end up
> > copying a pte entry for a page that has one.
> > "
> >
> > If that is the case, then your change does not work.
> >
> > Nowadays, nobody calls anon_vma_prepare(vma) directly.
> > Instead, vmf_anon_prepare() is used, and we always
> > require the mmap_lock read lock before calling
> > __anon_vma_prepare(). As a result, anon_vma cannot
> > transition from NULL to non-NULL during fork.
> >
> > So the original race condition has effectively
> > disappeared.
>
> anon_vma_prepare() still has some use cases too, but it seems to
> require taking mmap_lock for read as well, if I read the code correctly.
>
> >
> > You also mentioned the madvise() case. If I understand
> > correctly, madvise() should take mmap_lock before
> > modifying anon_vma. Only some parts of madvise() can
> > support per-VMA locking. Therefore, we probably do not
> > need:
> >
> >         if (vma->anon_vma) {
> >                 vma_start_write_killable(src_vma);
> >                 ...
> >         }
>
> I think we still need write vma lock to serialize anon_vma fork
> otherwise we may see:
>
> CPU 0                             CPU 1
> fork                              page fault
> src vma has no anon_vma
> skip vma fork
>
>                                   allocate anon_vma for src vma
> vma_needs_copy() sees anon_vma
> copy page
>
> Then we may end up with no anon_vma for the dst vma, but with pages mapped in it.

Sorry, this should not happen because creating anon_vma in page fault
needs to take mmap_lock.

Thanks,
Yang
>
> Thanks,
> Yang
>
> >
> > >
> > > Anyway, just a crazy idea, I may miss some corner cases.
> >
> > To me, it seems that we could remove vma_start_write()
> > entirely now. Or is that an even crazier idea?
>
>
> >
> > Thanks
> > Barry