Maple Tree Work 2023/12
Liam R. Howlett
Liam.Howlett at Oracle.com
Mon Dec 4 13:05:22 PST 2023
Hello,
This months update of the maple tree development and work items follows.
I send out an updated list monthly with peoples names against work items
to avoid work duplication, so please don't work on something with a name
next to the work item.
If you want to work on something then please respond to this email so
everyone knows, and I can add your name to the item. Feel free to ask
questions to clarify the feature or discuss directions.
Likewise, if you cannot work on anything anymore then let me know so
others can.
If you have any ideas, then please email the list. We can discuss it
and I can add it to the list.
Please sign up to the maple tree mailing list [1] to get all updates.
Completed:
- Fork & Dup tree + Delete DONT_COPY - Peng Zhang 2023/07/12
This is to optimize dup_mmap() in kernel/fork.c, but other
users may want faster duplications of a tree.
This should be faster than building a tree in bulk mode. The
idea is to just copy each node and replace the pointers in,
probably, a BFS order. Once the leaves are reached, the VMAs
will be replaced by the copies made in fork, unless DONT_COPY is
set, in which case the VMA will be deleted from the copied tree.
DONT_COPY is not common and since the tree isn't visible, all
nodes should be available for reuse (no RCU worries). v7 was
sent to the mailing list [2] and has been reviewed.
Thanks for the feature!
- mas_for_each caching of end of node - Liam R. Howlett 2023/10/01
Adding metadata for the end of node was a big performance win.
Extend this by caching the end of the node while iterating (to
start). Later this could be expanded to other iterators. v1 has
been sent to the mailing list [3].
Features:
- mas_store_gfp() align with mas_prealloc() - Sidhartha Kumar
2023/09/10
Right now mas_store_gfp() is going to allocate once it figures
out what to do and it does the worst case allocations. This is
inefficient and can easily be done better by doing more of what
mas_prealloc()/mas_store_prealloc() does - or at least will be
doing once the 'Better preallocation calculations' lands.
- Store type (enum for store type?) - Sidhartha Kumar 2023/09/10
Extending the "Tracking tree status on walks down", the
information can then be used to determine what operation is
needed during a mas_prealloct(), etc. The operation can be
stored in the maple state to continue on a mas_store_prealloc().
Obviously, this would work out well if we have mas_store_gfp()
using the same functionality as mentioned above. This has been
combined with mas_store_gfp() alignment work.
- Tracking tree status on walks down
Track in the maple state when the last minimum nodes, space for
2, or space for 3 slots occurred (height in the tree). This
will allow for a better estimate on how many nodes are needed
for a write. This can be complicated on next/prev walking, etc.
It has to be done in read mode since some walk may have been
done before deciding to write - see mmap_region(), for example.
This may not be necessary once Sid has landed his work, so we
should hold off on this.
- Full count/flag & Dense nodes - Liam R. Howlett 2023/11/01
There is a bit that exists in the pointer reserved to indicate
there is no NULLs in the leaves below. This feature is mainly
for trees that are non-alloc trees. We can then know there is
at least one singleton that can be stored below this entry.
This is coupled with restoring Dense nodes as a potential node
type. The tricky part is deciding on when to switch to dense
nodes/back from dense nodes (all entries to dense nodes must be
singletons!). See mte_set_full(), mte_clear_full(),
mte_has_null() which use MAPLE_ENODE_NULL.
- Remove BIG_NODE - Liam R. Howlett 2023/11/01 (pulled into Dense work)
maple tree nodes are rebalanced, split, and updated using a
maple_big_node struct. This structure is over twice the size of
regular nodes to ensure there is enough space to contain the
data. Ideally the same feats can be accomplished by using
regular sized nodes.
- Push reuse parent
During an insert, new nodes can be "pushed" left or right -
see mas_push_data(). On a left push, it may be possible to
reuse the node by making the write like an 'append'. This may
also be possible to be done during a split operation when the
write extends the last slot. This needs examining to ensure RCU
safety. Take note of the cached end of the node, as it may be
tricky.
- Contiguous iterator
Some users want to iterate across a range of items, but only
contiguous items. Sometimes a gap within the range is
considered an error while other times they are not. Good
targets for this feature are mlock() and mprotect() system
calls. The goal is to provide an easy to use interface that
reduces the complexity of the external code that keeps track of
the contiguous nature of the iteration. Status may be one way
to do this item without a lot of added complexity.
- Test & Fix destroy_rebalance
Destroy_rebalance needs better testing when it occurs in rcu
mode. Currently it is not used in this mode, but it should be
supported. This part of the code needs better test coverage and
fix any necessary issues that arise with rcu readers during
updates.
- More userspace testing - Liam R. Howlett 2023/09/12
Add userspace fuzzer testing to the test suite. There was a
patch sent out to create a basic clang fuzzer. Using something
along those lines, write a fuzzer to test the maple tree.
Development has paused on this for now. If you have interest in
taking over the task, please let me know.
- Live locking with delays for testing
Add some sort of functional delay operation to work with testing
to reduce potential locking issues, especially with CPU
contention.
- Add min split span to parents
When nodes are split, the parent node splits by number of
entries. It may be better to split by range as it would help
avoid splitting more frequently on inserts.
- Search Marks
To complete the features that xarray support, the maple tree
needs to support search marks. Marks will replace the gap
tracking in a new node type complete with searching for a
specific tag from the top-level to the leaves; just like the
gaps that exist in the allocation nodes. It will probably need
two new node types (range and dense). Dense with 29 pointers,
range64 with 15 pointers.
- Pruning trees under memory pressure
For the page cache, we support retaining workingset information in
the tree after the folio has been swapped out ("shadow entries").
The problem with this is that trees full of stale workingset
information can occupy a large amount of memory. For the radix
tree, this was solved by putting nodes which contained only
shadow entries on the 'shadow_nodes' list in mm/workingset.c.
We might take a different approach for maple tree, such as
putting entire trees that contain many shadow entries on an LRU
list and pruning the entire tree of its shadow entries (or some
high percentage of its shadow entries) under memory pressure.
- Tree Jitter Reduction (probably worth holding off until dense nodes
and big node structure removal)
Peng's rework on duplicating the tree (forking changes) caused a
significant increase in brk2 benchmark scores [4]. The analysis
of the increase appears to point to expanding and contracting
nodes for every loop in the older fork code. Although this
method of duplcating the tree is no longer used, this potential
issue probably still exists. Analyze exactly what is going on
(probably writing a gap to a node end), and rework rebalance and
split calculations to avoid such a situation. Perhaps by
splitting nodes to avoid starting with a gap and not to push
left if there is only room for a single entry and cause a node
to start with a NULL; favour splitting instead.
- Better gap searching
The gap searching is not performing as well as it should. One
of the issues is the window being searched often runs to the end
of the tree, which often has a huge gap at the start and end -
but not all of those addresses are usable. The starting address
in the VMA space can also be changed. But to start, it might be
worth re-examining the algorithm of gap searching so that the
wrong path can be avoided, but the entire area needs to be
re-examined. This area has broken s390 and 32b compatibility,
so pay attention to the various arch testing and previous
regressions.
Larger Features:
- 64 bit stores on 32 bit arch
A number of users want to use the maple tree, but cannot because
they need a storage medium that supports 64 bit stores on 32 bit
arch.
- wr_mas with alloc instead of mas
Internally, we have a write maple state, but the maple state
currently holds the allocations. I'm not sure if this is worth
it, and it will probably need "Tracking tree status on walks
down" so that preallocations/allocations can be accurate. Other
considerations are how mas_prealloc() would work with this
change. This may not be necessary, depending on the per-cpu
array cache.
- Big Dense Node Type
There are various node types that could be added to the maple
tree. One of the most useful is probably a 'big dense' with the
node being a 4k block of singletons. This would have to come
after the dense node type as dense enables more users and helps
scope this work.
- Judy Array
The Judy Array is another fancy data structure. Mathieu
Desnoyers has enquired and spoken to us about incorporating judy
arrays as a node type or other features he has in the judy
array. This is in the early thought stages.
Please note that any new features will need test cases added to ensure
they function correctly.
[1] https://lists.infradead.org/mailman/listinfo/maple-tree
[2] https://lore.kernel.org/linux-mm/20231101171629.3612299-1-Liam.Howlett@oracle.com/
[3] https://lore.kernel.org/linux-mm/20231027033845.90608-11-zhangpeng.00@bytedance.com/
[4] https://lore.kernel.org/linux-mm/202311282145.ff13737b-oliver.sang@intel.com/
More information about the maple-tree
mailing list