[PROJECT] Is combining allocation and lookup a good idea?

Liam R. Howlett Liam.Howlett at Oracle.com
Tue Nov 14 18:37:40 PST 2023


* Matthew Wilcox <willy at infradead.org> [231114 12:57]:
> The rbtree combined allocation and lookup.  We followed that lead with
> the maple tree and have "allocating nodes" which contain 10 slots. 10
> gaps and 9 pivots.  Now I'm wondering if we made the right decision.
> 
> If we split "find a free range of address space" into a separate data
> structure, we get 16 slots and 15 pivots.  Assuming that a tree node
> is 70% full, an allocating tree can store 7 * 7 * 11 = 539 VMAs in a
> height-3 tree.

4 height tree, right?  root -> 7 -> 7 -> 11 (entries).

> If we have a tree of regular nodes, we can store 1331
> in a height-3 tree -- over twice as many!

In practice, I don't think we see 70% usage on non-leaf nodes.
Splitting occurs when the leaves are full/over-full, and the same goes
for parents.  So we end up with a potential of the leaves being 50%
full, but the parents are significantly more dense - data wise.  We also
push left/right at each level.

To get to your example, we would have had to expand to splitting many
times and then removing a large portion of those VMAs - that rarely
happens from what I've seen.  Not only that but below 6, we'd rebalance
so that 7 value is close to the worst you could have picked here.

Realistically, we see 3 high trees (not 4 like above), which have a root
of 2-4 with close to full children.

Certainly, the gap-tracking eats some of the space - but generally we
have lots of room in the root, so that's free.

> 
> We should definitely be optimising for lookup, not allocation, so keeping
> the allocation information in the tree alongside the lookup information
> is probably a really bad idea.  I have lots of bad ideas.

I'm not so sure.  Our write time is longer than what we are replacing
and our read time is shorter.  I fear this will slow down the write time
(which is currently our pain point) in hopes of reducing one read
dereference.  But we mostly see level 3 trees and I'm not even sure we'd
see that many decreases in heigh in the normal VMA range?

> 
> We'd obviously want to keep the two data structures under the same lock,
> but they needn't both be a maple tree.  There are definitely downsides
> to the current scheme; for example , if you mmap() all of memory, then
> munmap() alternate 64MB chunks, mmap() specifying a size of 64M and
> alignment of 64M will not find any of the open spaces because we look
> for a 128MB gap so that we can guarantee alignment.

Yes, this is annoying but mostly for 32b people.  I'd like to fix it,
but that calculation is tricky to get right.

> 
> I suspect the obvious first place to start is a second maple tree
> which contains (start, length) descriptions of the gaps, but honestly
> designing the right data structure for this could be a project in itself.
> Just separating the allocation from the lookup is a big enough project,
> I think.

Having 2 trees will double the nodes required to track the data, and
allocations are also a rather painful point right now.  We'd have to
pre-allocate a lot more nodes here.  We would also increase the cache
usage by ~2x, or at least 50%?

We still have the full bit, remember.  We could utilize the bit when
searching for a gap to avoid walking down full sub-trees.  It might be
better all around if we just use the non-alloc tree and the start
location to find gaps and check them on iteration.

Another advantage to the gap tree that's offered significant gains is
the metadata stored out-of-band.  This allowed for finding the end of a
node and the largest gap quickly.  We will obviously lose this on full
nodes if we decide to proceed this route.  Non-full nodes utilize the
last slot for this (although this is slower).

What if we mix R64 and AR64 nodes in the same tree.  Searching for
gaps would either use the size on ar64 or the full bit on r64 nodes.
We'd switch to R64 when the AR64 nodes are full.  This would reduce the
height when possible, but also allow for some support of size searching
- we could fall back to walking the tree at lower levels by using the
full bit to skip nodes that have no NULLs.

This method would make gap searches slower, but make less allocations on
writes and also decrease the dereferences.



More information about the maple-tree mailing list