[PROJECT] Is combining allocation and lookup a good idea?

Liam R. Howlett Liam.Howlett at Oracle.com
Wed Nov 15 05:10:29 PST 2023


* Peng Zhang <zhangpeng.00 at bytedance.com> [231114 22:20]:
> 
> 
> 在 2023/11/15 01:57, Matthew Wilcox 写道:
> > The rbtree combined allocation and lookup.  We followed that lead with
> > the maple tree and have "allocating nodes" which contain 10 slots. 10
> > gaps and 9 pivots.  Now I'm wondering if we made the right decision.
> > 
> > If we split "find a free range of address space" into a separate data
> > structure, we get 16 slots and 15 pivots.  Assuming that a tree node
> > is 70% full, an allocating tree can store 7 * 7 * 11 = 539 VMAs in a
> > height-3 tree.  If we have a tree of regular nodes, we can store 1331
> > in a height-3 tree -- over twice as many!
> I have thought about this before. I have a question: do we really need
> to store the gap of each subtree in non-leaf nodes? Perhaps we don't need
> this gap information in non-leaf nodes. Instead, we can simply store the
> maximum gap in the metadata, and when we need the specific maximum gap of
> a particular subtree, we can enter that child node and retrieve its metadata.
> I'm not sure if I'm missing something.

I'm not sure how much you know about the current interface, but it is
usually searched within a window.  So, if a child node has a maximum and
is within that window (say 0x5000 - oo while searching down from
0x100000), then the location of that maximum may fall outside of the
window.  In this case, we end up having to check all the children within
that sub-tree anyways.  This wouldn't be very cache-efficient - although
I'm not sure how cache efficient the other options are either..

We would also have to, essentially, double our metadata.  And we lose
the space for the metadata when the node is near-full or full.  We
picked the node size to be cache-aligned to 4 cache lines, so we're not
expanding a node (I hope).

(r64/ar64 = range 64/ allocation range 64 nodes)

But, maybe this is good enough?  It would slow down on full trees like
my idea of using a mix of ar64 and r64 nodes within the same tree
depending on fullness.  I like the ar64/r64 mix because we usually have
less dense nodes at the top (root and the next level) and those two
levels will hugely reduce the search space.

Doing one doesn't rule out the other here, though.  Before we need 11
slots, we could use ar64 and then promote to r64 with more metadata
until we need more than 14(?) slots.

> > 
> > We should definitely be optimising for lookup, not allocation, so keeping
> > the allocation information in the tree alongside the lookup information
> > is probably a really bad idea.  I have lots of bad ideas.
> > 
> > We'd obviously want to keep the two data structures under the same lock,
> > but they needn't both be a maple tree.  There are definitely downsides
> > to the current scheme; for example , if you mmap() all of memory, then
> > munmap() alternate 64MB chunks, mmap() specifying a size of 64M and
> > alignment of 64M will not find any of the open spaces because we look
> > for a 128MB gap so that we can guarantee alignment.
> I have an idea: perhaps the gap in the maple tree could be represented
> by a 64-bit bitmap, where each bit indicates the power of 2 that the
> subtree can allocate as a range. This can address the issue you mentioned.
> When I started working a year ago, I implemented a version of IOVA
> allocation using a red-black tree, as referenced in patch [1]. It addressed
> the address alignment issue. Unfortunately, it didn't receive support from
> the IOVA maintainer, but I believe it was a good solution for addressing
> address alignment.

This is quite interesting. The searching we deal with is either up/down
within a min/max, with a mask and a size.  It looks like what you have
there can handle all of this for a single gap?  We would have (at most)
8 gaps in a node with r64 node in the left-most position, usually the
max would be 7.

Our current search mirrors what the rbtree did and just doubles the size
to ensure what we find will fit the alignment.

> 
> [1] https://lore.kernel.org/lkml/20220818121703.20977-1-zhangpeng.00@bytedance.com/
> > 
> > I suspect the obvious first place to start is a second maple tree
> > which contains (start, length) descriptions of the gaps, but honestly
> > designing the right data structure for this could be a project in itself.
> > Just separating the allocation from the lookup is a big enough project,
> > I think.
> > 

I really question the two tree idea.  I am concerned because of our
allocating requirements.  Even with Vlastimil's per-cpu cache and the
dipping into reserves, I think this will remain an issue.

I also believe that the gap tracking serves very little purpose after
the initial flurry of a task setup when writes reduce a lot.  I'm just
wondering how much that extra dereference is worth - will we be trading
not just code complexity, but also instructions and (when writing) cache
space?

Thanks,
Liam



More information about the maple-tree mailing list