[PROJECT] Is combining allocation and lookup a good idea?

Liam R. Howlett Liam.Howlett at Oracle.com
Fri Nov 17 07:33:13 PST 2023


* Matthew Wilcox <willy at infradead.org> [231115 09:47]:
> On Tue, Nov 14, 2023 at 09:37:40PM -0500, Liam R. Howlett wrote:
> > * Matthew Wilcox <willy at infradead.org> [231114 12:57]:
> > > The rbtree combined allocation and lookup.  We followed that lead with
> > > the maple tree and have "allocating nodes" which contain 10 slots,
> > > 10 gaps, and 9 pivots.  Now I'm wondering if we made the right decision.
> > > 
> > > If we split "find a free range of address space" into a separate data
> > > structure, we get 16 slots and 15 pivots.  Assuming that a tree node
> > > is 70% full, an allocating tree can store 7 * 7 * 11 = 539 VMAs in a
> > > height-3 tree.
> > 
> > A 4-high tree, right?  root -> 7 -> 7 -> 11 (entries).
> 
> ... however we call it ;-)
> 
> > > If we have a tree of regular nodes, we can store 1331
> > > in a height-3 tree -- over twice as many!
> > 
> > In practice, I don't think we see 70% usage on non-leaf nodes.
> > Splitting occurs when the leaves are full/over-full, and the same goes
> > for parents.  So we can end up with leaves that are only 50% full while
> > the parents are significantly more dense, data-wise.  We also push
> > left/right at each level.
> 
> OK, but this calculation would be similar whether we use ar64 or r64
> nodes.  Let's say the leaves are 60% full and the parents are 90% full.
> With an ar64 tree, we get 9 * 9 * 10 = 810.  With a purely r64 tree,
> we get 14 * 14 * 10 = 1960.

I'm just not sold on how much increasing the branching factor from 10 to
16 will benefit the readers.  Looking at my bloated firefox tab mess, I
see ~5500 VMAs.

r64: root (3 nodes) * 14 * 14 * 10 = 5880
ar64: root (7 nodes) * 9 * 9 * 10 = 5670
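
(To keep the back-of-the-envelope math in one place, a tiny sketch; the
fanouts and fill factors are the assumptions from this thread, not
measurements:)

/* Rough tree capacity: root fanout * internal fanout per level *
 * leaf fanout.  Fanouts here already bake in the assumed fill
 * (90% internal, ~10-entry leaves).
 */
static unsigned long capacity(unsigned long root, unsigned long internal,
			      unsigned int levels, unsigned long leaf)
{
	unsigned long n = root * leaf;

	while (levels--)
		n *= internal;
	return n;
}

/* r64:  capacity(3, 14, 2, 10) == 5880 */
/* ar64: capacity(7, 9, 2, 10)  == 5670 */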

So we're going to end up tracking the gaps outside of the nodes for no
reason.  And the write tree would need more allocations.

What this does is increase the step function of how many dereferences
are needed vs the number of items stored.
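
(The step function, stated as code; a sketch assuming a uniform
effective fanout b:)

/* Dereferences per lookup == height: the smallest h with b^h >= n.
 * At ~5500 entries, effective fanouts of 9 (ar64) and 14 (r64) both
 * land on h == 4, which is the point above.
 */
static unsigned int height_for(unsigned long n, unsigned long b)
{
	unsigned int h = 1;
	unsigned long cap = b;

	while (cap < n) {
		cap *= b;
		h++;
	}
	return h;
}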

Would it be better to change the size of the nodes, or to look at
changing that at the same time?  We will already be using more nodes for
the write tree...  Another way to look at it is that you are going to
increase the node usage by making this write tree.  Using your numbers
above, we'd use about 10% more, I guess?

> 
> 
> > Realistically, we see 3-high trees (not 4 like above), which have a
> > root of 2-4 entries with close-to-full children.
> 
> The processes which have two elements at root might be able to shrink
> the tree by one level.  Assuming each leaf is 60% full, a 2-level ar64
> tree can have 100 VMAs (10 leaves, each leaf with 10 VMAs) before adding
> an extra layer.  A 2-level r64 tree can have 16 leaves, each with 10
> VMAs for a total of 160 VMAs.  Yes, in the range between 160 and 810
> there's no difference in the height of the tree.
> 
> > > We should definitely be optimising for lookup, not allocation, so keeping
> > > the allocation information in the tree alongside the lookup information
> > > is probably a really bad idea.  I have lots of bad ideas.
> > 
> > I'm not so sure.  Our write time is longer than what we are replacing
> > and our read time is shorter.  I fear this will slow down the write time
> > (which is currently our pain point) in hopes of reducing one read
> > dereference.  But we mostly see level-3 trees and I'm not even sure
> > we'd see that many decreases in height in the normal VMA range?
> 
> I think you're over-reacting to the pain points we see.  Nobody's filing
> bugs saying "the performance of page faults improved AND I DON'T LIKE IT".

Right, people are saying "we are seeing a regression in this
micro-benchmark that doesn't matter", or "why is my task start time
sucking a lot?"  That last one is my concern.

I also have people asking how to work around or pre-allocate for certain
use cases.  I think this will make it worse.

> 
> > > I suspect the obvious first place to start is a second maple tree
> > > which contains (start, length) descriptions of the gaps, but honestly
> > > designing the right data structure for this could be a project in itself.
> > > Just separating the allocation from the lookup is a big enough project,
> > > I think.
> > 
> > Having 2 trees will double the nodes required to track the data, and
> > allocations are also a rather painful point right now.  We'd have to
> > pre-allocate a lot more nodes here.  We would also increase the cache
> > usage by ~2x, or at least 50%?
> 
> Ooh, no.  You'd effectively collapse each contiguous run of mapped
> addresses into a single entry.
> 
> $ cat /proc/self/maps
> 559b9e0ce000-559b9e0d0000 r--p 00000000 fe:01 805306718                  /usr/bin/cat
> 559b9e0d0000-559b9e0d5000 r-xp 00002000 fe:01 805306718                  /usr/bin/cat
> 559b9e0d5000-559b9e0d8000 r--p 00007000 fe:01 805306718                  /usr/bin/cat
> 559b9e0d8000-559b9e0d9000 r--p 00009000 fe:01 805306718                  /usr/bin/cat
> 559b9e0d9000-559b9e0da000 rw-p 0000a000 fe:01 805306718                  /usr/bin/cat
> 559b9f943000-559b9f964000 rw-p 00000000 00:00 0                          [heap]
> 7ffb60000000-7ffb60564000 r--p 00000000 fe:01 6352021                    /usr/lib/locale/locale-archive
> 7ffb606bc000-7ffb606bf000 rw-p 00000000 00:00 0
> 7ffb606bf000-7ffb606e5000 r--p 00000000 fe:01 548883                     /usr/lib/x86_64-linux-gnu/libc.so.6
> 7ffb606e5000-7ffb6083a000 r-xp 00026000 fe:01 548883                     /usr/lib/x86_64-linux-gnu/libc.so.6
> 7ffb6083a000-7ffb6088e000 r--p 0017b000 fe:01 548883                     /usr/lib/x86_64-linux-gnu/libc.so.6
> 7ffb6088e000-7ffb60892000 r--p 001cf000 fe:01 548883                     /usr/lib/x86_64-linux-gnu/libc.so.6
> 7ffb60892000-7ffb60894000 rw-p 001d3000 fe:01 548883                     /usr/lib/x86_64-linux-gnu/libc.so.6
> 7ffb60894000-7ffb608a1000 rw-p 00000000 00:00 0
> 7ffb608a7000-7ffb608cb000 rw-p 00000000 00:00 0
> 7ffb608cb000-7ffb608cc000 r--p 00000000 fe:01 235950                     /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
> 7ffb608cc000-7ffb608f1000 r-xp 00001000 fe:01 235950                     /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
> 7ffb608f1000-7ffb608fb000 r--p 00026000 fe:01 235950                     /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
> 7ffb608fb000-7ffb608fd000 r--p 00030000 fe:01 235950                     /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
> 7ffb608fd000-7ffb608ff000 rw-p 00032000 fe:01 235950                     /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
> 7ffece153000-7ffece174000 rw-p 00000000 00:00 0                          [stack]
> 7ffece1d8000-7ffece1dc000 r--p 00000000 00:00 0                          [vvar]
> 7ffece1dc000-7ffece1de000 r-xp 00000000 00:00 0                          [vdso]
> 
> Instead of having 23 entries, we'd have:
> 
> 0-559b9e0ce000
> 559b9e0da000-559b9f943000
> 559b9f964000-7ffb60000000
> 7ffb60564000-7ffb606bc000
> 7ffb608a1000-7ffb608a7000
> 7ffb608ff000-7ffece153000
> 7ffece174000-7ffece1d8000
> 7ffece1de000-oo
> 
> just 8 entries.
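
(The collapse is mechanical; a hypothetical sketch turning the sorted
[start, end) mappings above into the complementary 8-entry gap list:)

#include <limits.h>

struct range { unsigned long start, end; };

/* Walk sorted, non-overlapping mappings and emit the gaps between
 * them; adjacent mappings produce no gap.  Returns the number of
 * gaps written.
 */
static int gaps_from_maps(const struct range *map, int nr,
			  struct range *gap)
{
	unsigned long last_end = 0;
	int i, out = 0;

	for (i = 0; i < nr; i++) {
		if (map[i].start > last_end) {
			gap[out].start = last_end;
			gap[out].end = map[i].start;
			out++;
		}
		last_end = map[i].end;
	}
	/* The final gap runs to the end of the address space ("oo"). */
	gap[out].start = last_end;
	gap[out].end = ULONG_MAX;
	return out + 1;
}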

This is a bad example for your argument.  With 23 entries, we would have
3 nodes for the read side (two leaves plus a root) and 1 node for the
write side, since the 8 gap entries fit in a single node.

Today we'd have 3 nodes for both read and write.

Going from 3 nodes to 4, you've made cat 33% worse on memory for the
maple tree, and we know how the internet loves cats ;)

But I see what you are saying.  I think these smaller tasks (probably
also gcc?) would see increased memory usage for no gain, but larger
tasks would not see such a scale-up.

But, if we had the maple tree here pointing to the leaf node with the
gap, then we could quickly find the leaf to modify.

However, when we are unmapping rather than using the gaps (way more
common), we'd have a much more painful update which would pull more
nodes into the CPU cache.  On the positive side, we wouldn't need to
propagate the gap change up the tree (which doesn't need allocations,
as the gaps are only read/used under the write lock).
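
(A simplified sketch of that propagation; this is not the kernel's
actual code, and the node layout and fanout are stand-ins:)

/* Each allocating node caches the largest gap below each child, so
 * a gap change in a leaf may need to be reflected in every ancestor.
 * Pointer chasing, but no allocation: the cached gaps are only
 * read/written under the write lock.
 */
struct node {
	struct node *parent;
	unsigned char p_slot;		/* our slot in the parent */
	unsigned long gaps[10];		/* max gap under each child */
};

static void propagate_gap(struct node *leaf, unsigned long new_max)
{
	struct node *n = leaf->parent;
	unsigned char slot = leaf->p_slot;
	int i;

	while (n && n->gaps[slot] != new_max) {
		n->gaps[slot] = new_max;
		/* Recompute this node's max and keep walking up. */
		new_max = 0;
		for (i = 0; i < 10; i++)
			if (n->gaps[i] > new_max)
				new_max = n->gaps[i];
		slot = n->p_slot;
		n = n->parent;
	}
}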

> > We still have the full bit, remember.  We could utilize the bit when
> > searching for a gap to avoid walking down full sub-trees.  It might be
> > better all around if we just use the non-alloc tree and the start
> > location to find gaps and check them on iteration.
> 
> I'm hoping we get to a better algorithm for finding a gap than either
> "over allocate" or "scan the entire thing".  There's plenty of
> literature on writing good memory allocators.
> 
> > What if we mix R64 and AR64 nodes in the same tree?  Searching for
> > gaps would either use the size on ar64 or the full bit on r64 nodes.
> > We'd switch to R64 when the AR64 nodes are full.  This would reduce the
> > height when possible, but also allow for some support of size searching
> > - we could fall back to walking the tree at lower levels by using the
> > full bit to skip nodes that have no NULLs.
> 
> Seems more complex, but would have the same advantage in terms of
> reducing height ...
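
(A hypothetical sketch of that mixed-node descent: skip children whose
subtrees are full, and use the cached max gap where the parent is an
ar64 node; the structures are stand-ins:)

struct node;

struct child {
	struct node *node;
	bool full;		/* subtree has no NULLs (r64 info) */
	bool has_gap_info;	/* parent is an ar64 node */
	unsigned long max_gap;	/* largest gap below, if ar64 */
};

/* Pick the first child worth descending into for a gap of at least
 * "size".
 */
static struct node *pick_child(struct child *c, int nr,
			       unsigned long size)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (c[i].full)
			continue;	/* no free space below */
		if (c[i].has_gap_info && c[i].max_gap < size)
			continue;	/* largest gap too small */
		return c[i].node;
	}
	return NULL;
}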

Well, we have to back out of descents already, since a node may overlap
the area while its gap does not overlap the area we are searching.  So
at least this complexity is required with the existing layout.  I think
it would not exist with your two-tree model, and that's nice.

The removal of a VMA would become a mess, though.  We would write a NULL
to the write tree, but that may completely remove a VMA, or split a VMA
(punching a hole), which would replace one of the two VMA pointers, or
just shift a boundary, which would mean nothing structurally.  At what
point do we push this out of the data structure itself and into the
vma_iterator (an already out-of-date name)?
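
(To make those cases concrete, a hypothetical sketch classifying what
storing NULL over [start, end) does to the mapped span it lands in:)

struct span { unsigned long s, e; };	/* one collapsed mapped range */

enum unmap_case { REMOVE_SPAN, PUNCH_HOLE, SHIFT_BOUNDARY };

/* Assumes [start, end) intersects the span sp. */
static enum unmap_case classify(const struct span *sp,
				unsigned long start, unsigned long end)
{
	if (start <= sp->s && end >= sp->e)
		return REMOVE_SPAN;	/* whole entry disappears */
	if (start > sp->s && end < sp->e)
		return PUNCH_HOLE;	/* one entry becomes two */
	return SHIFT_BOUNDARY;		/* entry shrinks at one end */
}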

Thanks,
Liam



