[PATCH RFC 0/6] SLUB percpu sheaves
Vlastimil Babka
vbabka at suse.cz
Tue Nov 12 08:38:44 PST 2024
Hi,
This is an RFC to add an opt-in percpu array-based caching layer to SLUB.
The name "sheaf" was invented by Matthew so we don't call it magazine
like the original Bonwick paper. The per-NUMA-node cache of sheaves is
thus called "barn".
This may seem similar to the arrays in SLAB, but the main differences
are:
- opt-in, not used for every cache
- does not distinguish NUMA locality, thus no "alien" arrays that would
need periodic flushing
- improves kfree_rcu() handling
- API for obtaining a preallocated sheaf that can be used for guaranteed
and efficient allocations in a restricted context, when upper bound is
known but rarely reached
The motivation comes mainly from the ongoing work related to VMA
scalability and the related maple tree operations. This is why maple
tree node and vma caches are sheaf-enabled in the RFC. Performance benefits
were measured by Suren in preliminary non-public versions.
A sheaf-enabled cache has the following expected advantages:
- Cheaper fast paths. For allocations, instead of a local double cmpxchg,
with Patch 5 it's preempt_disable() and no atomic operations. The same
applies to freeing, which is normally a local double cmpxchg only for
short-term allocations (so the same slab is still active on the same cpu
when freeing the object) and a more costly locked double cmpxchg otherwise.
The downside is lack of NUMA locality guarantees for the allocated
objects.
I hope this scheme will also allow (non-guaranteed) slab allocations
in contexts where it's impossible today and is instead achieved by
building caches on top of slab, e.g. the BPF allocator.
- kfree_rcu() batching. kfree_rcu() will put objects into a separate
percpu sheaf and only submit the whole sheaf to call_rcu() when full.
After the grace period, the sheaf can be used for allocations, which
is more efficient than handling individual slab objects (even with the
batching done by kfree_rcu() implementation itself). In case only some
cpus are allowed to handle rcu callbacks, the sheaf can still be made
available to other cpus on the same node via the shared barn.
Both maple_node and vma caches can benefit from this.
- Preallocation support. A prefilled sheaf can be borrowed for a
short-term operation that is not allowed to block and may need to
allocate some objects. If an upper bound (worst case) for the number of
allocations is known, but far fewer allocations are actually needed
on average, borrowing and returning a sheaf is much more efficient than
a bulk allocation for the worst case followed by a bulk free of the
many unused objects. Maple tree write operations should benefit from
this.
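To make the fast-path claim above concrete, here is a minimal userspace
sketch of the idea, not the code from the patches. It assumes a sheaf is
just a fixed-size array of object pointers, models one cpu's cache as a
plain struct, and uses malloc()/free() as a stand-in for the regular slab
paths. SHEAF_CAPACITY, struct cpu_cache, cache_alloc() and cache_free()
are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed capacity; in the RFC the capacity is a per-cache setting. */
#define SHEAF_CAPACITY 4

/* A sheaf: an array of pointers to free objects plus a fill count. */
struct sheaf {
	size_t size;
	void *objects[SHEAF_CAPACITY];
};

/* Stand-in for one cpu's state; real code would use percpu data. */
struct cpu_cache {
	struct sheaf main;
	size_t obj_size;
};

/* Allocation: pop from the sheaf if possible, otherwise take the slow path. */
static void *cache_alloc(struct cpu_cache *c)
{
	if (c->main.size > 0)
		return c->main.objects[--c->main.size];
	return malloc(c->obj_size);	/* stand-in for the regular slab path */
}

/* Free: push into the sheaf if there is room, otherwise take the slow path. */
static void cache_free(struct cpu_cache *c, void *obj)
{
	assert(obj != NULL);
	if (c->main.size < SHEAF_CAPACITY) {
		c->main.objects[c->main.size++] = obj;
		return;
	}
	free(obj);			/* stand-in for the regular slab path */
}
```

In this sketch both fast paths are an array index update with no atomic
operations, which is the property the first bullet describes. Swapping
full and empty sheaves with the barn is left out.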
Patch 1 implements the basic sheaf functionality, using
local_lock_irqsave() for percpu sheaf locking.
Patch 2 adds the kfree_rcu() support.
Patches 3 and 4 enable sheaves for maple tree nodes and vma's.
Patch 5 replaces the local_lock_irqsave() locking with a cheaper scheme
inspired by online conversations with Mateusz Guzik and Jann Horn. In
the past I have tried to copy the scheme from page allocator's pcplists
that also avoids disabling irqs by using a trylock for operations that
might be attempted from an irq handler context. But spin locks used for
pcplists are more costly than a simple flag with only compiler barriers.
On the other hand it's not possible to take the lock from a different
cpu (except for hotplug handling when the actual local cpu cannot race
with us), but we don't need that remote locking for sheaves.
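As a rough illustration of what "a simple flag with only compiler
barriers" could look like, here is a userspace sketch. It is my reading
of the description above, not the code in Patch 5. It assumes the caller
already runs with preemption disabled, so the only possible contender is
an irq handler on the same cpu, and that handler always releases the flag
before returning. struct sheaf_lock, sheaf_trylock() and sheaf_unlock()
are hypothetical names, and barrier() is defined here the way GCC and
Clang allow.

```c
#include <assert.h>
#include <stdbool.h>

/* Compiler-only barrier (GCC/Clang syntax); emits no CPU instruction. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical per-cpu lock: a plain flag, never touched by other cpus. */
struct sheaf_lock {
	volatile bool busy;
};

/*
 * Try to take the flag. Failure means we are in an irq handler that
 * interrupted a holder on this cpu, so the caller must use a fallback path.
 */
static bool sheaf_trylock(struct sheaf_lock *l)
{
	if (l->busy)
		return false;
	l->busy = true;
	barrier();	/* keep protected accesses after the flag is set */
	return true;
}

static void sheaf_unlock(struct sheaf_lock *l)
{
	assert(l->busy);
	barrier();	/* keep protected accesses before the flag is cleared */
	l->busy = false;
}
```

An irq arriving between the check and the store is harmless under these
assumptions, because the handler takes and releases the flag completely
before the interrupted code continues. The same reasoning shows why the
flag cannot be taken from a different cpu: nothing here orders memory
between cpus.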
Patch 6 implements borrowing a prefilled sheaf, with maple tree being the
anticipated user once converted to use it by someone more knowledgeable
than myself.
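The borrow/return pattern can be sketched in userspace as follows. This
is an illustration under assumptions, not the API added by Patch 6: the
barn is reduced to a single spare slot, malloc()/free() stand in for the
slab, and SHEAF_CAPACITY, struct barn, sheaf_prefill(), sheaf_alloc() and
sheaf_return() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SHEAF_CAPACITY 8	/* hypothetical; per-cache setting in the RFC */

struct sheaf {
	size_t size;
	void *objects[SHEAF_CAPACITY];
};

/* Hypothetical barn reduced to one slot for a spare sheaf. */
struct barn {
	struct sheaf *spare;
};

/* Return a borrowed sheaf; unused objects stay cached inside it. */
static void sheaf_return(struct barn *b, struct sheaf *s)
{
	if (!b->spare) {
		b->spare = s;
		return;
	}
	while (s->size > 0)
		free(s->objects[--s->size]);
	free(s);
}

/* Borrow a sheaf holding at least count objects; may fail, and may block in a real kernel. */
static struct sheaf *sheaf_prefill(struct barn *b, size_t obj_size, size_t count)
{
	struct sheaf *s;

	if (count > SHEAF_CAPACITY)
		return NULL;	/* requests above the capacity fail */

	s = b->spare;
	b->spare = NULL;
	if (!s) {
		s = calloc(1, sizeof(*s));
		if (!s)
			return NULL;
	}
	while (s->size < count) {	/* top up only what is missing */
		void *obj = malloc(obj_size);

		if (!obj) {
			sheaf_return(b, s);
			return NULL;
		}
		s->objects[s->size++] = obj;
	}
	return s;
}

/* Allocation from a borrowed sheaf; cannot fail while objects remain. */
static void *sheaf_alloc(struct sheaf *s)
{
	assert(s->size > 0);
	return s->objects[--s->size];
}
```

A caller would prefill for the worst case before entering the restricted
context, call sheaf_alloc() as needed, and return the sheaf afterwards.
When few objects were consumed, the next prefill only has to top up the
difference instead of repeating a bulk allocation and bulk free.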
(RFC) LIMITATIONS:
- with slub_debug enabled, objects in sheaves are considered allocated
so allocation/free stacktraces may become imprecise and checking of
e.g. redzone violations may be delayed
- kfree_rcu() via sheaf is only hooked to tree rcu, not tiny rcu. Also
in case we fail to allocate a sheaf, and fall back to the existing
implementation, it may use kfree_bulk() where destructors are not
hooked. It's however possible we won't need the destructor support
for now at all if vma_lock is moved to vma itself [1] and if it's
possible to free anon_name and numa balancing tracking immediately
and not after a grace period.
- in case a prefilled sheaf is requested with more objects than the
cache's sheaf_capacity, it will fail. This should be possible to
handle by allocating a bigger sheaf and then freeing it when returned,
to avoid mixing up different sizes. Inefficient, but acceptable if
very rare.
[1] https://lore.kernel.org/all/20241111205506.3404479-1-surenb@google.com/
Vlastimil
git branch: https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v1r5
---
Vlastimil Babka (6):
mm/slub: add opt-in caching layer of percpu sheaves
mm/slub: add sheaf support for batching kfree_rcu() operations
maple_tree: use percpu sheaves for maple_node_cache
mm, vma: use sheaves for vm_area_struct cache
mm, slub: cheaper locking for percpu sheaves
mm, slub: sheaf prefilling for guaranteed allocations
include/linux/slab.h | 60 +++
kernel/fork.c | 27 +-
kernel/rcu/tree.c | 8 +-
lib/maple_tree.c | 11 +-
mm/slab.h | 27 +
mm/slab_common.c | 8 +-
mm/slub.c | 1427 ++++++++++++++++++++++++++++++++++++++++++++++++--
7 files changed, 1503 insertions(+), 65 deletions(-)
---
base-commit: 2d5404caa8c7bb5c4e0435f94b28834ae5456623
change-id: 20231128-slub-percpu-caches-9441892011d7
Best regards,
--
Vlastimil Babka <vbabka at suse.cz>