[PROJECT] Is combining allocation and lookup a good idea?

Wed Nov 15 06:47:20 PST 2023

On Tue, Nov 14, 2023 at 09:37:40PM -0500, Liam R. Howlett wrote:
> * Matthew Wilcox <willy at infradead.org> [231114 12:57]:
> > The rbtree combined allocation and lookup.  We followed that lead with
> > the maple tree and have "allocating nodes" which contain 10 slots. 10
> > gaps and 9 pivots.  Now I'm wondering if we made the right decision.
> > 
> > If we split "find a free range of address space" into a separate data
> > structure, we get 16 slots and 15 pivots.  Assuming that a tree node
> > is 70% full, an allocating tree can store 7 * 7 * 11 = 539 VMAs in a
> > height-3 tree.
> 
> 4 height tree, right?  root -> 7 -> 7 -> 11 (entries).

... however we call it ;-)

> > If we have a tree of regular nodes, we can store 1331
> > in a height-3 tree -- over twice as many!
> 
> In practice, I don't think we see 70% usage on non-leaf nodes.
> Splitting occurs when the leaves are full/over-full, and the same goes
> for parents.  So we end up with a potential of the leaves being 50%
> full, but the parents are significantly more dense - data wise.  We also
> push left/right at each level.

OK, but this calculation would be similar for both using ar64 and r64
nodes.  Let's say the leaves are 60% full and the parents are 90% full.
With an ar64 tree, we get 9 * 9 * 10 = 810.  With a purely r64 tree,
we get 14 * 14 * 10 = 1960.

> Realistically, we see 3 high trees (not 4 like above), which have a root
> of 2-4 with close to full children.

The processes which have two elements at root might be able to shrink
the tree by one level.  Assuming each leaf is 60% full, a 2-level ar64
tree can have 100 VMAs (10 leaves, each leaf with 10 VMAs) before adding
an extra layer.  A 2-level r64 tree can have 16 leaves, each with 10
VMAs for a total of 160 VMAs.  Yes, in the range between 160 and 810
there's no difference in the height of the tree.

> > We should definitely be optimising for lookup, not allocation, so keeping
> > the allocation information in the tree alongside the lookup information
> > is probably a really bad idea.  I have lots of bad ideas.
> 
> I'm not so sure.  Our write time is longer than what we are replacing
> and our read time is shorter.  I fear this will slow down the write time
> (which is currently our pain point) in hopes of reducing one read
> dereference.  But we mostly see level 3 trees and I'm not even sure we'd
> see that many decreases in heigh in the normal VMA range?

I think you're over-reacting to the pain points we see.  Nobody's filing
bugs saying "the performance of page faults improved AND I DON'T LIKE IT".

> > I suspect the obvious first place to start is a second maple tree
> > which contains (start, length) descriptions of the gaps, but honestly
> > designing the right data structure for this could be a project in itself.
> > Just separating the allocation from the lookup is a big enough project,
> > I think.
> 
> Having 2 trees will double the nodes required to track the data, and
> allocations are also a rather painful point right now.  We'd have to
> pre-allocate a lot more nodes here.  We would also increase the cache
> usage by ~2x, or at least 50%?

Ooh, no.  You'd effectively collapse all the mapped addresses into a
single entry.

$ cat /proc/self/maps
559b9e0ce000-559b9e0d0000 r--p 00000000 fe:01 805306718                  /usr/bin/cat
559b9e0d0000-559b9e0d5000 r-xp 00002000 fe:01 805306718                  /usr/bin/cat
559b9e0d5000-559b9e0d8000 r--p 00007000 fe:01 805306718                  /usr/bin/cat
559b9e0d8000-559b9e0d9000 r--p 00009000 fe:01 805306718                  /usr/bin/cat
559b9e0d9000-559b9e0da000 rw-p 0000a000 fe:01 805306718                  /usr/bin/cat
559b9f943000-559b9f964000 rw-p 00000000 00:00 0                          [heap]
7ffb60000000-7ffb60564000 r--p 00000000 fe:01 6352021                    /usr/lib/locale/locale-archive
7ffb606bc000-7ffb606bf000 rw-p 00000000 00:00 0
7ffb606bf000-7ffb606e5000 r--p 00000000 fe:01 548883                     /usr/lib/x86_64-linux-gnu/libc.so.6
7ffb606e5000-7ffb6083a000 r-xp 00026000 fe:01 548883                     /usr/lib/x86_64-linux-gnu/libc.so.6
7ffb6083a000-7ffb6088e000 r--p 0017b000 fe:01 548883                     /usr/lib/x86_64-linux-gnu/libc.so.6
7ffb6088e000-7ffb60892000 r--p 001cf000 fe:01 548883                     /usr/lib/x86_64-linux-gnu/libc.so.6
7ffb60892000-7ffb60894000 rw-p 001d3000 fe:01 548883                     /usr/lib/x86_64-linux-gnu/libc.so.6
7ffb60894000-7ffb608a1000 rw-p 00000000 00:00 0
7ffb608a7000-7ffb608cb000 rw-p 00000000 00:00 0
7ffb608cb000-7ffb608cc000 r--p 00000000 fe:01 235950                     /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7ffb608cc000-7ffb608f1000 r-xp 00001000 fe:01 235950                     /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7ffb608f1000-7ffb608fb000 r--p 00026000 fe:01 235950                     /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7ffb608fb000-7ffb608fd000 r--p 00030000 fe:01 235950                     /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7ffb608fd000-7ffb608ff000 rw-p 00032000 fe:01 235950                     /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7ffece153000-7ffece174000 rw-p 00000000 00:00 0                          [stack]
7ffece1d8000-7ffece1dc000 r--p 00000000 00:00 0                          [vvar]
7ffece1dc000-7ffece1de000 r-xp 00000000 00:00 0                          [vdso]

Instead of having 23 entries, we'd have:

0-559b9e0ce000
559b9e0da000-559b9f943000
559b9f964000-7ffb60000000
7ffb60564000-7ffb606bc000
7ffb608a1000-7ffb608a7000
7ffb608ff000-7ffece153000
7ffece174000-7ffece1d8000
7ffece1de000-oo

just 8 entries.

> We still have the full bit, remember.  We could utilize the bit when
> searching for a gap to avoid walking down full sub-trees.  It might be
> better all around if we just use the non-alloc tree and the start
> location to find gaps and check them on iteration.

I'm hoping we get to a better algorithm for finding a gap than either
"over allocate" or "scan the entire thing".  There's plenty of
literature on writing good memory allocators.

> What if we mix R64 and AR64 nodes in the same tree.  Searching for
> gaps would either use the size on ar64 or the full bit on r64 nodes.
> We'd switch to R64 when the AR64 nodes are full.  This would reduce the
> height when possible, but also allow for some support of size searching
> - we could fall back to walking the tree at lower levels by using the
> full bit to skip nodes that have no NULLs.

Seems more complex, but would have the same advantage in terms of
reducing height ...