[BUG] trigger BUG_ON in mas_store_prealloc when low memory

Wed Jul 19 12:22:04 PDT 2023

* Liam R. Howlett <Liam.Howlett at Oracle.com> [230719 14:51]:
> * John Hsu (許永翰) <John.Hsu at mediatek.com> [230712 23:26]:
> > On Mon, 2023-07-10 at 10:24 -0400, Liam R. Howlett wrote:
> 
> ...
> 
> > > > > > > > As you mentioned, mas_preallocate() should allocate enough
> > > > > node,
> > > > > > > but there is such functions mas_node_count() in
> > > > > mas_store_prealloc().
> > > > > > > > In mas_node_count() checks whether the *mas* has enough
> > > nodes,
> > > > > and
> > > > > > > allocate memory for node if there was no enough nodes in mas.
> > > > > > > 
> > > > > > > Right, we call mas_node_count() so that both code paths are
> > > used
> > > > > for
> > > > > > > preallocations and regular mas_store()/mas_store_gfp().  It
> > > > > shouldn't
> > > > > > > take a significant amount of time to verify there is enough
> > > > > nodes.
> > > > > > 
> > > > > > Yap..., it didn't take a significant amount of time to verify
> > > > > whether
> > > > > > there is enough nodes. The problem is why the flow in
> > > > > mas_node_count
> > > > > > will alloc nodes if there was no enough nodes in mas?
> > > > > 
> > > > > What I meant is that both methods use the same call path because
> > > > > there
> > > > > is not a reason to duplicate the path.  After mas_preallocate()
> > > has
> > > > > allocated the nodes needed, the call to check if there is enough
> > > > > nodes
> > > > > will be quick.
> > > > 
> > > > So whether the purpose of mas_preallocate() is decreasing the lock
> > > > retention time?
> > > 
> > > It could be, but in this case it's the locking order.  We have to
> > > pre-allocate and fail early if we are out of memory, because we
> > > _cannot_
> > > use GFP_KERNEL where we call mas_store_prealloc().  mas_preallocate()
> > > will use GFP_KERENL though.  We cannot use GFP_KERNEL during the
> > > store
> > > because reclaim may sleep and we cannot sleep holding the locks we
> > > need
> > > to hold at the time of the store operation in __vma_adjust().
> > > 
> > Yap, if the type of lock is spinlock, the flow shouldn't sleep in the
> > critical sections. mmap_lock is implmented by rw_semaphore(mutex). Is
> > there any other lock in this section?
> 
> Yes, the i_mmap_lock_write(), the anon_vma_lock_write(),
> flush_dcache_mmap_lock().
> 
> > > > 
> > > > > > 
> > > > > > > > I think that if mas_preallocate() allocate enough node, why
> > > we
> > > > > > > check the node count and allocate nodes if there was no
> > > enough
> > > > > nodes
> > > > > > > in mas in mas_node_count()?
> > > > > > > 
> > > > > > > We check for the above reason.
> > > > > > > 
> > > > > > 
> > > > > > OK..., this is one of the root cause of this BUG.
> > > > > 
> > > > > The root cause is that there was not enough memory for a store
> > > > > operation.  Regardless of if we check the allocations in the
> > > > > mas_store_prealloc() path or not, this would fail.  If we remove
> > > the
> > > > > check for nodes within this path, then we would have to BUG_ON()
> > > when
> > > > > we
> > > > > run out of nodes to use or have a null pointer dereference BUG
> > > > > anyways.
> > > > > 
> > > > Yap, the root cause is oom. The BUG_ON() for the situations that
> > > the
> > > > maple tree struct cannot be maintained because of the lack of
> > > memory is
> > > > necessary. But the the buddy system in linux kernel can reclaim
> > > memory
> > > > when the system is under the low memory status. If we use
> > > GFP_KERNEL
> > > > after trying GFP_NOWAIT to allocate node, maybe we can get enough
> > > > memory when the second try with GFP_KERNEL. 
> > > 
> > > Right, but the GFP_KERNEL cannot be used when holding certain locks
> > > because it may sleep.
> > > 
> > > > > > 
> > > > > > > > 
> > > > > > > > We have seen that there may be some maple_tree operations
> > > in
> > > > > > > merge_vma...
> > > > > > > 
> > > > > > > If merge_vma() does anything, then there was an operation to
> > > the
> > > > > > > maple
> > > > > > > tree.
> > > > > > > 
> > > > > > > > 
> > > > > > > > Moreover, would maple_tree provides an API for assigning
> > > user's
> > > > > gfp
> > > > > > > flag for allocating node?
> > > > > > > 
> > > > > > > mas_preallocate() and mas_store_gfp() has gfp flags as an
> > > > > > > argument.  In
> > > > > > > your call stack, it will be called in __vma_adjust() as such:
> > > > > > > 
> > > > > > > if (mas_preallocate(&mas, vma, GFP_KERNEL))
> > > > > > > return -ENOMEM;
> > > > > > > 
> > > > > > > line 715 in v6.1.25
> > > > > > > 
> > > > > > > > In rb_tree, we allocate vma_area_struct (rb_node is in this
> > > > > > > struct.) with GFP_KERNEL, and maple_tree allocate node with
> > > > > > > GFP_NOWAIT and __GFP_NOWARN.
> > > > > > > 
> > > > > > > We use GFP_KERNEL as I explained above for the VMA tree.
> > > > > > 
> > > > > > Got it! But the mas_node_count() always use GFP_NOWAIT and
> > > > > __GFP_NOWARN
> > > > > > in inserting tree flow. Do you consider the performance of
> > > > > maintaining
> > > > > > the structure of maple_tree?
> > > > > 
> > > > > Sorry, I don't understand what you mean by 'consider the
> > > performance
> > > > > of
> > > > > maintaining the structure of maple_tree'.
> > > > > 
> > > > As I mentioned above, GFP_NOWAIT will not allow buddy system for
> > > > reclaiming memory, so "Do you consider the performance of
> > > maintaining
> > > > the structure of maple_tree" means that: whether the
> > > mas_node_count()
> > > > path is not allowed to reclaim or compact memory for the
> > > performance.
> > > 
> > > Ah, no.  This is not for performance.  It was initially on the road
> > > map
> > > for performance, but it was needed for the complicated locking in the
> > > MM
> > > code.
> > > 
> > > rb tree embedded the nodes in the VMA which is allocated outside this
> > > critical section and so it could use GFP_KERNEL.
> > > 
> > As I know, following is rb_tree flow in 5.15.186:
> > 
> > ...
> > mmap_write_lock_killable(mm)
> > ...
> > do_mmap()
> > ...
> > mmap_region()
> > ...
> > vm_area_alloc(mm)
> > ...
> > mmap_write_unlock(mm)
> > 
> > vm_area_alloc is in the mmap_lock hoding period.
> > It seems that the flow would sleep here in rb_tree flow.
> > If I miss anything, please tell me, thanks!
> 
> Before the mmap_write_unlock(mm) in the above sequence,  the
> i_mmap_lock_write(), anon_vma_lock_write(), and/or the
> flush_dcache_mmap_lock() may be taken.  Check __vma_adjust().
> 
> The insertion into the tree needs to hold some subset of these locks.
> The rb-tree insert did not allocate within these locks, but the maple
> tree would need to allocate within these locks to insert into the tree.
> This is why the preallocation exists and why it is necessary.
> 
> > 
> > 
> > > > > > 
> > > > > > > It also will drop the lock and retry with GFP_KERNEL on
> > > failure
> > > > > > > when not using the external lock.  The mmap_lock is
> > > configured as
> > > > > an
> > > > > > > external lock.
> > > > > > > 
> > > > > > > > Allocation will not wait for reclaiming and compacting when
> > > > > there
> > > > > > > is no enough available memory.
> > > > > > > > Is there any concern for this design?
> > > > > > > 
> > > > > > > This has been addressed above, but let me know if I missed
> > > > > anything
> > > > > > > here.
> > > > > > > 
> > > > > > 
> > > > > > I think that the mas_node_count() has higher rate of triggering
> > > > > > BUG_ON() when allocating nodes with GFP_NOWAIT and
> > > __GFP_NOWARN. If
> > > > > > mas_node_count() use GFP_KERNEL as mas_preallocate() in the
> > > mmap.c,
> > > > > the
> > > > > > allocation fail rate may be lower than use GFP_NOWAIT.
> > > > > 
> > > > > Which BUG_ON() are you referring to?
> > > > > 
> > > > > If I was to separate the code path for mas_store_prealloc() and
> > > > > mas_store_gfp(), then a BUG_ON() would still need to exist and
> > > still
> > > > > would have been triggered..  We are in a place in the code where
> > > we
> > > > > should never sleep and we don't have enough memory allocated to
> > > do
> > > > > what
> > > > > was necessary.
> > > > > 
> > > > Yap. There is no reason to seprate mas_store_prealloc() and
> > > > mas_store_gfp. Is it possible to retry to allocate mas_node with
> > > > GFP_KERNEL (wait for system reclaim and compact) instead of
> > > triggering
> > > > BUG_ON once the GFP_NOWAIT allocation failed?
> > > 
> > > Unfortunately not, no.  In some cases it may actually work out that
> > > the
> > > VMA may not need the locks in question, but it cannot be generalized
> > > for
> > > __vma_adjust().  Where I am able, I use the mas_store_gfp() calls so
> > > we
> > > can let reclaim and compact run, but it's not possible here.
> > > 
> > We have used GFP_KERNEL as alloc flag in mas_node_count flow about 2
> > months. The mentioned problem we mentioned in the first mail didn't
> > reproduce under high stress stability test.
> > 
> > I have seen the patch provided by you. I will verify this patch in our
> > stability test. But there is a problem, if maple_tree behavior is
> > unexpected (e.g. redundant write bug this time). We can only review the
> > whole mm flow to find out the wrong behavior. Do we have an efficient
> > debug method for maple tree?
> 
> There is a test suite for the maple tree found in
> tools/testing/radix-tree, but it does not test the mm code and
> integration.
> 
> There are other mm tests, but they will not test the OOM scenario you
> hit - to the best of my knowledge.
> 
> There are also config options to debug the tree operations, but they do
> not detect the redundant write issues.  Perhaps I can look at adding
> support for detecting redundant writes,  but that will not be backported
> to a stable kernel.

I just wanted to clarify what I said here:

I'm looking at adding this detection within CONFIG_DEBUG_MAPLE_TREE.
This is how I narrowed down the issue in this particular case, but it's
far from production quality.  Even if or when it is production quality,
it won't be sent for inclusion into the stable kernel release.

Thanks,
Liam