[PATCH v3 3/3] iova: defer maple tree erase on GFP_ATOMIC failure

Fri Jun 12 11:44:06 PDT 2026

+Cc maple-tree list.

On 26/06/12 03:03PM, Jason Gunthorpe wrote:
> On Fri, Jun 12, 2026 at 01:23:58PM -0400, Rik van Riel wrote:
> > On Fri, 2026-06-12 at 13:48 -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 12, 2026 at 12:02:55PM -0400, Rik van Riel wrote:
> > > > 
> > > > The mas_erase() function calls mas_nomem(mas, GFP_KERNEL),
> > > > which is not safe to call while holding a spinlock.
> > > 
> > > Oh, the kdoc doesn't say that, it doesn't return any error code if it
> > > can't allocate memory, and not a single caller checks for erase
> > > failures.

I should fix that.

> > > 
> > > I assumed internally it "somehow worked out" even though there are
> > > allocations in the callchains..
> > > 
> > > This is probably a better question for Liam? Can mtree_erase actually
> > > fail ENOMEM? Is it safe to call it in an atomic context?
> > 
> > Yes, it can fail.
> 
> Currently it never returns a failure to the caller. Look at mas_erase():
> 
> 	entry = mas_state_walk(mas);
> 	if (!entry)
> 		return NULL;
> [..]
> 	if (mas_is_err(mas))
> 		goto out;
> [..]
> out:
> 	mas_destroy(mas);
> 	return entry;
> 
> There is no propogation of ENOMEM, it returns success. No caller
> checks for any error here either.

At one point this was considered to be impossible to fail, and it is
documented to return the entry or null.

> 
> So I think the intention is that it cannot fail, yet it does have the
> memory allocations and busted failure path. Hence asking Liam what it
> should be, and what about an atomic context.
> 
> Perhaps this might be relying on the modern kernels "small allocations
> never fail", meaning mas_erase never fails, but then you can't
> call it from an atomic context..

It _can_ fail, but right now that error will not be propagated to the
caller.  The caller could infer the failure due to the return of NULL...
if that's what would happen, so there's very much an issue on failure
that I need to investigate and fix.

It is possible that it fails with -ENOMEM.  And the spinlock can be
dropped if you are not using an external lock and the gfp flag allows
blocking.  On failure to allocate and the lock is dropped, we retry the
operation from the start in case there was a race with another writer.

I think I should probably change this to return -ENOMEM, fix the docs on
it, and probably audit the callers (most use external locks or are
early-boot-so-don't-worry-about-it).  Any issue here should also be
caught by lockdep pretty quickly.

> 
> In any case, it does look like you can't use mas_erase from an atomic
> context anyhow so your prior option with the mas_store_gfp() and
> failure handling seems reasonable.

mas_store_gfp() works if you know the range of your entry.  You could
also write the XA_ZERO_ENTRY over the entry so that there is no internal
node changes - just a value swap.  If you do this, you have to be
careful when reading things back when using the mas_ interface.

I think, in your case, hitting an XA_ZERO_ENTRY would be necessary to
indicate that we cannot reuse this particular location until it is
correctly dealt with?  Or is the maple tree the only reason it is
considered unusable?

One thing to remember is that each write can cause allocations to occur,
so if you have a list of items being overwritten then you are causing
the tree to do each write and potentially rebalancing (as you shrink the
data beyond the lower limit of the node).

One way around that is to write XA_ZERO_ENTRY over each one as you
deal with your entry.  Then, when you are done you do a single
mas_store_gfp() of NULL over the whole range.  It will be a larger tree
operation, but smaller than the incremental steps.

Thanks,
Liam