[RFC PATCH 0/6] Remove XA_ZERO from error recovery of dup_mmap()
Lorenzo Stoakes
lorenzo.stoakes at oracle.com
Mon Aug 18 08:47:58 PDT 2025
On Mon, Aug 18, 2025 at 11:44:16AM +0200, David Hildenbrand wrote:
> On 15.08.25 21:10, Liam R. Howlett wrote:
> > Before you read on, please take a moment to acknowledge that David
> > Hildenbrand asked for this, so I'm blaming mostly him :)
>
> :)
>
> >
> > It is possible that the dup_mmap() call fails on allocating or setting
> > up a vma after the maple tree of the oldmm is copied. Today, that
> > failure point is marked by inserting an XA_ZERO entry over the failure
> > point so that the exact location does not need to be communicated
> > through to exit_mmap().
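
(For anyone following along, the current scheme is roughly the below, a
condensed sketch of the mainline kernel/fork.c error path rather than the
verbatim code:)

    /* dup_mmap() failure path, condensed: */
    if (retval && mpnt) {
            /*
             * The maple tree was fully duplicated by __mt_dup() up front,
             * so overwrite the first VMA that failed to copy with
             * XA_ZERO_ENTRY; exit_mmap() stops freeing VMAs at the marker.
             */
            mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
            mas_store(&vmi.mas, XA_ZERO_ENTRY);
            set_bit(MMF_OOM_SKIP, &mm->flags);
    }
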
> >
> > However, a race exists in the tear down process because the dup_mmap()
> > drops the mmap lock before exit_mmap() can remove the partially set up
> > vma tree. This means that other tasks may get to the mm tree and find
> > the invalid vma pointer (since it's an XA_ZERO entry), even though the
> > mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
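
To make the hazard concrete (a hypothetical walker, not any specific
in-tree caller; do_something() is made up):

    mmap_read_lock(mm);            /* succeeds once dup_mmap() drops it */
    vma = vma_lookup(mm, addr);
    /*
     * vma can be XA_ZERO_ENTRY here: a tagged internal pointer, neither
     * NULL nor a real VMA, so any dereference faults, even though
     * MMF_OOM_SKIP and MMF_UNSTABLE are set on this mm.
     */
    if (vma && (vma->vm_flags & VM_WRITE))
            do_something(vma);
    mmap_read_unlock(mm);
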
> >
> > To remove the race fully, the tree must be cleaned up before dropping
> > the lock. This is accomplished by extracting the vma cleanup in
> > exit_mmap() and changing the required functions to pass through the vma
> > search limit.
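
From the description, presumably the dup_mmap() failure path ends up
shaped something like the below (helper name is hypothetical, this is just
to sketch the idea):

    /* dup_mmap() failure path, with the VMA cleanup pulled out of
     * exit_mmap() and run before the mmap lock is dropped: */
    if (retval) {
            /* Free only what was actually set up, i.e. up to 'limit'. */
            tear_down_vmas(mm, &vmi, mpnt, limit);  /* hypothetical name */
            mtree_destroy(&mm->mm_mt);              /* leave an empty tree */
    }
    mmap_write_unlock(mm);
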
> >
> > This does run the risk of increasing the possibility of finding no vmas
> > (which is already possible!) in code that isn't careful.
>
> Right, it would also happen if __mt_dup() fails I guess.
>
> >
> > The passing of so many limits and variables was such a mess when the
> > dup_mmap() tree copying was introduced that it was avoided in favour of
> > the XA_ZERO entry marker, but since the swap case was the second time
> > we've hit cases of walking an almost-dead mm, here's the alternative to
> > checking MMF_UNSTABLE before wandering into other mm structs.
>
> Changes look fairly small and reasonable, so I really like this.
>
> I agree with Jann that doing a partial teardown might be even better, but
> code-wise I suspect it might end up with a lot more churn and weird
> allocation-corner-cases to handle.

I've yet to review the series and see exactly what's proposed, but on gut
instinct (and based on past experience with the munmap gather/reattach
stuff), this kind of partial teardown tends to end up a nightmare of
weird-stuff-you-didn't-think-about.

So I'm instinctively against this.

However I'll take a proper look through this series shortly and hopefully
have more intelligent things to say...

An aside - I was working on a crazy anon idea over the weekend (I know, I
know) and noticed that the mm life cycle is just weird. For instance, I
observed apparent duplicate calls of __mmdrop() (I think my unwinding just
broke), the delayed mmdrop is strange, and the whole area seems rife with
complexity.

So I'm glad David talked you into doing this ;) This particular edge case
was always strange, and the fact we have now hit it twice (this time more
seriously, as it's due to a fatal signal, which is much more likely to
arise than an OOM scenario with too-small-to-fail allocations) suggests it
really does need fixing properly.

BTW where are we with the hotfix for the swapoff case [0]? I think we
agreed on setting MMF_UNSTABLE there and using that to decide not to
proceed in unuse_mm(), right?
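
I.e. something like the below (a sketch of what I understood the agreed
shape to be, not the actual patch):

    static int unuse_mm(struct mm_struct *mm, unsigned int type)
    {
            ...
            mmap_read_lock(mm);
            if (test_bit(MMF_UNSTABLE, &mm->flags)) {
                    /* dup_mmap() failed part-way; don't walk this mm. */
                    mmap_read_unlock(mm);
                    return 0;
            }
            ...
    }
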
Cheers, Lorenzo
[0]: https://lore.kernel.org/all/20250808092156.1918973-1-quic_charante@quicinc.com/