Maple Tree Work 2023/12

Mon Dec 4 13:05:22 PST 2023

Hello,

This months update of the maple tree development and work items follows.
I send out an updated list monthly with peoples names against work items
to avoid work duplication, so please don't work on something with a name
next to the work item.

If you want to work on something then please respond to this email so
everyone knows, and I can add your name to the item.  Feel free to ask
questions to clarify the feature or discuss directions.

Likewise, if you cannot work on anything anymore then let me know so
others can.

If you have any ideas, then please email the list.  We can discuss it
and I can add it to the list.

Please sign up to the maple tree mailing list [1] to get all updates.

Completed:
 - Fork & Dup tree + Delete DONT_COPY - Peng Zhang 2023/07/12
 	This is to optimize dup_mmap() in kernel/fork.c, but other
	users may want faster duplications of a tree.
	This should be faster than building a tree in bulk mode.  The
	idea is to just copy each node and replace the pointers in,
	probably, a BFS order.  Once the leaves are reached, the VMAs
	will be replaced by the copies made in fork, unless DONT_COPY is
	set, in which case the VMA will be deleted from the copied tree.
	DONT_COPY is not common and since the tree isn't visible, all
	nodes should be available for reuse (no RCU worries). v7 was
	sent to the mailing list [2] and has been reviewed.
	Thanks for the feature!

- mas_for_each caching of end of node - Liam R. Howlett 2023/10/01
	Adding metadata for the end of node was a big performance win.
	Extend this by caching the end of the node while iterating (to
	start).  Later this could be expanded to other iterators. v1 has
	been sent to the mailing list [3].

Features:
 - mas_store_gfp() align with mas_prealloc() - Sidhartha Kumar
   2023/09/10
	Right now mas_store_gfp() is going to allocate once it figures
	out what to do and it does the worst case allocations.  This is
	inefficient and can easily be done better by doing more of what
	mas_prealloc()/mas_store_prealloc() does - or at least will be
	doing once the 'Better preallocation calculations' lands.

 - Store type (enum for store type?) - Sidhartha Kumar 2023/09/10
	Extending the "Tracking tree status on walks down", the
	information can then be used to determine what operation is
	needed during a mas_prealloct(), etc.  The operation can be
	stored in the maple state to continue on a mas_store_prealloc().
	Obviously, this would work out well if we have mas_store_gfp()
	using the same functionality as mentioned above.  This has been
	combined with mas_store_gfp() alignment work.

 - Tracking tree status on walks down
 	Track in the maple state when the last minimum nodes, space for
	2, or space for 3 slots occurred (height in the tree).  This
	will allow for a better estimate on how many nodes are needed
	for a write.  This can be complicated on next/prev walking, etc.
	It has to be done in read mode since some walk may have been
	done before deciding to write - see mmap_region(), for example.
	This may not be necessary once Sid has landed his work, so we
	should hold off on this.

 - Full count/flag & Dense nodes - Liam R. Howlett 2023/11/01
 	There is a bit that exists in the pointer reserved to indicate
	there is no NULLs in the leaves below.  This feature is mainly
	for trees that are non-alloc trees.  We can then know there is
	at least one singleton that can be stored below this entry.
	This is coupled with restoring Dense nodes as a potential node
	type.  The tricky part is deciding on when to switch to dense
	nodes/back from dense nodes (all entries to dense nodes must be
	singletons!).  See mte_set_full(), mte_clear_full(),
	mte_has_null() which use MAPLE_ENODE_NULL.

 - Remove BIG_NODE - Liam R. Howlett 2023/11/01 (pulled into Dense work)
 	maple tree nodes are rebalanced, split, and updated using a
	maple_big_node struct.  This structure is over twice the size of
	regular nodes to ensure there is enough space to contain the
	data.  Ideally the same feats can be accomplished by using
	regular sized nodes.

 - Push reuse parent
 	During an insert, new nodes can be "pushed" left or right -
	see mas_push_data().  On a left push, it may be possible to
	reuse the node by making the write like an 'append'.  This may
	also be possible to be done during a split operation when the
	write extends the last slot. This needs examining to ensure RCU
	safety.  Take note of the cached end of the node, as it may be
	tricky.

 - Contiguous iterator
 	Some users want to iterate across a range of items, but only
	contiguous items.  Sometimes a gap within the range is
	considered an error while other times they are not.  Good
	targets for this feature are mlock() and mprotect() system
	calls.  The goal is to provide an easy to use interface that
	reduces the complexity of the external code that keeps track of
	the contiguous nature of the iteration.  Status may be one way
	to do this item without a lot of added complexity.

 - Test & Fix destroy_rebalance
 	Destroy_rebalance needs better testing when it occurs in rcu
	mode.  Currently it is not used in this mode, but it should be
	supported.  This part of the code needs better test coverage and
	fix any necessary issues that arise with rcu readers during
	updates.

  - More userspace testing - Liam R. Howlett 2023/09/12
  	Add userspace fuzzer testing to the test suite.  There was a
	patch sent out to create a basic clang fuzzer.  Using something
	along those lines, write a fuzzer to test the maple tree.
	Development has paused on this for now.  If you have interest in
	taking over the task, please let me know.

  - Live locking with delays for testing
  	Add some sort of functional delay operation to work with testing
	to reduce potential locking issues, especially with CPU
	contention.

  - Add min split span to parents
  	When nodes are split, the parent node splits by number of
	entries.  It may be better to split by range as it would help
	avoid splitting more frequently on inserts.

 - Search Marks
 	To complete the features that xarray support, the maple tree
	needs to support search marks.  Marks will replace the gap
	tracking in a new node type complete with searching for a
	specific tag from the top-level to the leaves; just like the
	gaps that exist in the allocation nodes.  It will probably need
	two new node types (range and dense). Dense with 29 pointers,
	range64 with 15 pointers.

 - Pruning trees under memory pressure
	For the page cache, we support retaining workingset information in
	the tree after the folio has been swapped out ("shadow entries").
	The problem with this is that trees full of stale workingset
	information can occupy a large amount of memory.  For the radix
	tree, this was solved by putting nodes which contained only
	shadow entries on the 'shadow_nodes' list in mm/workingset.c.
	We might take a different approach for maple tree, such as
	putting entire trees that contain many shadow entries on an LRU
	list and pruning the entire tree of its shadow entries (or some
	high percentage of its shadow entries) under memory pressure.

 - Tree Jitter Reduction (probably worth holding off until dense nodes
   and big node structure removal)
 	Peng's rework on duplicating the tree (forking changes) caused a
	significant increase in brk2 benchmark scores [4].  The analysis
	of the increase appears to point to expanding and contracting
	nodes for every loop in the older fork code.  Although this
	method of duplcating the tree is no longer used, this potential
	issue probably still exists.  Analyze exactly what is going on
	(probably writing a gap to a node end), and rework rebalance and
	split calculations to avoid such a situation.  Perhaps by
	splitting nodes to avoid starting with a gap and not to push
	left if there is only room for a single entry and cause a node
	to start with a NULL; favour splitting instead.

- Better gap searching
	The gap searching is not performing as well as it should.  One
	of the issues is the window being searched often runs to the end
	of the tree, which often has a huge gap at the start and end -
	but not all of those addresses are usable.  The starting address
	in the VMA space can also be changed.  But to start, it might be
	worth re-examining the algorithm of gap searching so that the
	wrong path can be avoided, but the entire area needs to be
	re-examined.  This area has broken s390 and 32b compatibility,
	so pay attention to the various arch testing and previous
	regressions.

Larger Features:
 - 64 bit stores on 32 bit arch
 	A number of users want to use the maple tree, but cannot because
	they need a storage medium that supports 64 bit stores on 32 bit
	arch.

 - wr_mas with alloc instead of mas
 	Internally, we have a write maple state, but the maple state
	currently holds the allocations.  I'm not sure if this is worth
	it, and it will probably need "Tracking tree status on walks
	down" so that preallocations/allocations can be accurate.  Other
	considerations are how mas_prealloc() would work with this
	change.  This may not be necessary, depending on the per-cpu
	array cache.

 - Big Dense Node Type
 	There are various node types that could be added to the maple
	tree.  One of the most useful is probably a 'big dense' with the
	node being a 4k block of singletons.  This would have to come
	after the dense node type as dense enables more users and helps
	scope this work.

 - Judy Array
	The Judy Array is another fancy data structure.  Mathieu
	Desnoyers has enquired and spoken to us about incorporating judy
	arrays as a node type or other features he has in the judy
	array.  This is in the early thought stages.

Please note that any new features will need test cases added to ensure
they function correctly.

[1] https://lists.infradead.org/mailman/listinfo/maple-tree
[2] https://lore.kernel.org/linux-mm/20231101171629.3612299-1-Liam.Howlett@oracle.com/
[3] https://lore.kernel.org/linux-mm/20231027033845.90608-11-zhangpeng.00@bytedance.com/
[4] https://lore.kernel.org/linux-mm/202311282145.ff13737b-oliver.sang@intel.com/