[PATCH RFC 10/12] mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator
Pranjal Arya
pranjal.arya at oss.qualcomm.com
Sat Jun 13 10:19:52 PDT 2026

Now that the alloc path goes through the maple_tree-based gap finder
(mas_empty_area), amortise the cost of visiting it for the most common
shape of vmalloc call: short-lived, page-aligned, PAGE_SIZE-multiple
allocations.

Each CPU reserves a 64 MB chunk via __alloc_vmap_area -- the same
maple-backed allocator the global path uses -- and dispenses
page-aligned allocations from a bump pointer inside that chunk. Chunk
reservation and drain are the only operations that touch the global
allocator; per-allocation work stays entirely per-CPU.

When a chunk's allocation count returns to zero and it is no longer
the per-CPU current chunk, vmap_bump_unlink() releases the chunk's
range back to the global allocator via occupied_mt_erase_range_locked
-- the same maple primitive the consolidate-occupied-tree patch made
authoritative. The chunk install path uses
occupied_mt_store_range_locked symmetrically, so cache lifecycle is
expressed entirely through the maple-tree's range primitives.

Per-CPU access uses preempt_disable() rather than a spinlock; the
chunk pointer is per-CPU and only mutated by its owner. The chunks
list (vmap_bump_chunks) is gated by a single global spinlock that is
taken only on chunk install/release, not on the fast path.
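As a rough illustration of the bump-pointer dispensing described above,
here is a minimal userspace model. It is a sketch, not the kernel code:
struct bump_chunk, bump_chunk_init(), bump_alloc(), PAGE_SZ and BUMP_MISS
are hypothetical names, 4 KB pages are assumed, and locking, per-CPU
placement and refill are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE ((uintptr_t)64 * 1024 * 1024) /* mirrors VMAP_BUMP_CHUNK_SIZE */
#define PAGE_SZ    ((uintptr_t)4096)             /* assumption: 4 KB pages */
#define BUMP_MISS  ((uintptr_t)0)                /* hypothetical "no room" marker */

/* Hypothetical stand-in for one CPU's reserved chunk. */
struct bump_chunk {
	uintptr_t base;  /* first address of the reserved range */
	uintptr_t limit; /* one past the last address of the range */
	uintptr_t bump;  /* next address to hand out */
};

/* Point the chunk at a freshly reserved, page-aligned range. */
static void bump_chunk_init(struct bump_chunk *c, uintptr_t base)
{
	assert(base != 0 && base % PAGE_SZ == 0);
	c->base = base;
	c->limit = base + CHUNK_SIZE;
	c->bump = base;
}

/*
 * Hand out `size` bytes by advancing the bump pointer. Only non-zero
 * multiples of the page size are served; anything else, or a request
 * that does not fit in the remaining space, is reported as a miss so
 * the caller can refill or fall back to the global allocator.
 */
static uintptr_t bump_alloc(struct bump_chunk *c, uintptr_t size)
{
	uintptr_t addr = c->bump;

	if (size == 0 || size % PAGE_SZ != 0)
		return BUMP_MISS;
	if (size > c->limit - addr)
		return BUMP_MISS;

	c->bump = addr + size;
	return addr;
}
```

A hit costs one comparison and one addition; the global allocator is
only involved when bump_alloc() reports a miss.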
Why this overlay sits on the maple_tree migration
=================================================

The overlay relies on three primitives that maple_tree provides
natively and that the augmented rb_tree allocator does not expose
in a clean form:

  - Bare [base, limit) range reservation. The augmented rb_node
    carries a vmap_area-shaped subtree_max_size consulted by
    find_vmap_lowest_match. A chunk reservation has no associated
    vmap_area object, so it cannot be stored in the augmented tree
    without either synthesising a fake vmap_area per chunk or
    introducing a parallel range tracker with its own augmentation
    discipline. maple_tree stores [base, limit) ranges natively
    and the gap walker (mas_empty_area) returns the lowest free
    region in a single descent, sharing one primitive with the
    regular allocation path.

  - Sentinel range storage. occupied_vmap_area_mt records a
    reserved chunk as XA_ZERO_ENTRY over [base, limit), sharing
    one index with ordinary in-use vmap_area ranges. The
    augmented rb_tree has no equivalent of XA_ZERO_ENTRY: a
    chunk would have to live in a dedicated structure, doubling
    the alloc-side state surface.

  - RCU range traversal. vmap_chunk_lookup() must run lock-free
    so that cross-chunk vfree() does not take a global spinlock
    per free of a chunk-resident allocation. maple_tree supports
    RCU traversal as a property of the data structure;
    rb_tree-side equivalents (lib/rbtree_latch, hand-rolled
    grace-period accounting on top of rb_tree) impose write-side
    cost and would have to be added to vmalloc as new
    infrastructure.

After the migration these three primitives are part of the
allocator API; the overlay reuses mas_empty_area() for chunk
refill, occupied_mt_store_range_locked() and
occupied_mt_erase_range_locked() for chunk lifecycle, and
maple-tree-friendly RCU for the chunk-list lookup. No parallel
data structures are introduced.
VMAP_BUMP_CHUNK_SIZE = 64 MB derivation
=======================================

The chunk size is the smallest power-of-two value that satisfies
three independent constraints:

  1. Eligibility coverage. vmap_bump_eligible() requires
     size <= VMAP_BUMP_CHUNK_SIZE / 2 so that any single eligible
     allocation fits with room for alignment slack. The largest
     standard-range vmalloc() callers in tree are the module loader
     (modules can carry up to ~32 MB of text + RO data + RW data on
     architectures with full kernel module support) and BPF JIT
     buffers (capped near 4 MB). Setting CHUNK_SIZE = 64 MB keeps
     all of these on the bump fast path; halving the chunk to 32 MB
     would push module loads to the slow path.

  2. Refill amortisation. The global vmalloc lock is taken once per
     chunk refill, paying for ~CHUNK_SIZE / avg_alloc_size bump
     allocations between lock acquisitions. At avg = 4 KB (a
     plausible lower bound for typical kernel vmalloc traffic),
     64 MB amortises to ~16,000 fast-path allocations per global
     lock acquisition; at avg = 1 MB, ~64 per lock. Doubling the
     chunk size beyond 64 MB barely improves this ratio.

  3. Address-space cost. Each CPU pins a chunk-sized reservation
     within the vmalloc range. On a 32-CPU server with the standard
     128 GB x86_64 vmalloc range, 64 MB chunks reserve
     32 * 64 MB = 2 GB = 1.6 % of the range. On arm64 with
     CONFIG_ARM64_VA_BITS=52 (256 PB vmalloc), the cost is
     negligible. Doubling to 128 MB pushes the x86_64 reservation
     to 4 GB, about 3.1 %, which is still acceptable but starts to
     matter for workloads with high CPU counts.
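The arithmetic behind constraints (2) and (3) can be reproduced with a
small userspace sketch. The helper names allocs_per_refill() and
reserved_bp() are hypothetical and exist only for this illustration; the
inputs are the figures quoted above (32 CPUs, a 128 GB range), not
measurements.

```c
#include <assert.h>

/* Fast-path allocations served per chunk refill (constraint 2). */
static unsigned long long allocs_per_refill(unsigned long long chunk_bytes,
					    unsigned long long avg_alloc_bytes)
{
	assert(avg_alloc_bytes != 0);
	return chunk_bytes / avg_alloc_bytes;
}

/*
 * Share of the vmalloc range pinned by per-CPU chunks (constraint 3),
 * expressed in basis points (1 bp = 0.01 %) and rounded down.
 */
static unsigned long long reserved_bp(unsigned long long range_bytes,
				      unsigned int ncpus,
				      unsigned long long chunk_bytes)
{
	assert(range_bytes != 0);
	return (unsigned long long)ncpus * chunk_bytes * 10000ULL / range_bytes;
}
```

Calling reserved_bp() with other CPU counts or range sizes shows how
quickly the address-space share grows on larger machines.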
Per-chunk metadata associated with each chunk is sized as
sizeof(struct vmap_area *) * (CHUNK_SIZE / PAGE_SIZE), which scales
linearly with chunk size and stays at a constant 0.2 % overhead
regardless of the chosen value. At 64 MB this is 128 KB per chunk.
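The metadata figure follows directly from the sizing formula. A minimal
sketch, assuming 8-byte pointers and 4 KB pages (chunk_meta_bytes() is a
hypothetical helper, not part of the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Per-chunk metadata in bytes: one pointer-sized slot per page. */
static size_t chunk_meta_bytes(size_t chunk_bytes, size_t page_bytes,
			       size_t slot_bytes)
{
	assert(page_bytes != 0 && chunk_bytes % page_bytes == 0);
	return slot_bytes * (chunk_bytes / page_bytes);
}
```

Because the result is proportional to chunk_bytes, the ratio of metadata
to chunk size depends only on the slot and page sizes.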
64 MB is therefore the *minimum* chunk size that meets constraints (1)
and (2) simultaneously; constraint (3) sets the upper bound and
allows growing the chunk if module sizes grow in the future. The
constant is exposed at the top of the bump-allocator code block so
distributors can tune it for unusual configurations.

Allocations that don't match the predicate (non-page-aligned, larger
than half a chunk, fixed-VA, or with NUMA constraints) fall through
to the existing __alloc_vmap_area path unchanged.
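The predicate described above can be modelled in userspace roughly as
follows. This is a sketch under stated assumptions: struct alloc_req and
bump_eligible() are hypothetical, 4 KB pages are assumed, and the
align <= PAGE_SIZE bound is taken from the in-code comment. The check in
the diff below is expressed differently (it compares vstart/vend against
the standard range), so treat this as an illustration of the intent only.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SZ    4096UL                /* assumption: 4 KB pages */
#define CHUNK_SIZE (64UL * 1024 * 1024)  /* mirrors VMAP_BUMP_CHUNK_SIZE */

/* Hypothetical summary of one allocation request. */
struct alloc_req {
	unsigned long size;  /* requested bytes */
	unsigned long align; /* requested alignment */
	bool fixed_va;       /* caller demands a specific address range */
	bool numa_bound;     /* caller demands a specific NUMA node */
};

/* Return true if the request may be served from the per-CPU chunk. */
static bool bump_eligible(const struct alloc_req *r)
{
	if (r->size == 0 || r->size % PAGE_SZ != 0)
		return false;   /* not a PAGE_SIZE multiple */
	if (r->size > CHUNK_SIZE / 2)
		return false;   /* larger than half a chunk */
	if (r->align > PAGE_SZ)
		return false;   /* stricter than page alignment */
	if (r->fixed_va || r->numa_bound)
		return false;   /* placement constraints: slow path */
	return true;
}
```

A request of exactly CHUNK_SIZE / 2 is still eligible in this model;
anything larger goes to the regular path.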
Signed-off-by: Pranjal Arya <pranjal.arya at oss.qualcomm.com>
---
mm/vmalloc.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 107 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 463127d5ce58..65ee80eaf4bf 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2467,6 +2467,98 @@ static inline void setup_vmalloc_vm(struct vm_struct *vm,
va->vm = vm;
}
+/*
+ * Per-CPU bump-allocator overlay.
+ *
+ * Each CPU reserves a contiguous chunk of vmalloc address space and
+ * dispenses page-aligned allocations via a bump pointer. The chunk's
+ * range is reserved through the global allocator once; individual
+ * allocations within the chunk avoid the global maple-tree work
+ * entirely. Each allocation still gets its own vmap_area struct and
+ * is inserted into the per-node busy.mt, so find_vmap_area() and
+ * vfree() continue to work unchanged.
+ *
+ * Recycling: chunks leak in this minimal form. With 64 MB chunks on a
+ * 128 GB vmalloc range, the address space supports thousands of chunks
+ * before exhaustion. A future iteration can add chunk recycling via a
+ * va->bump_chunk back-pointer + refcount; deferred to keep this hot
+ * path's struct vmap_area footprint at 48 B.
+ *
+ * Constraints: only the standard vmalloc range with align <= PAGE_SIZE
+ * and size <= VMAP_BUMP_CHUNK_SIZE/2 takes the bump path. Anything
+ * else falls through to the existing __alloc_vmap_area path.
+ */
+#define VMAP_BUMP_CHUNK_SIZE (64UL * 1024 * 1024)
+
+struct vmap_bump_chunk {
+ unsigned long base;
+ unsigned long limit;
+ unsigned long bump;
+};
+
+static DEFINE_PER_CPU(struct vmap_bump_chunk, vmap_bump);
+static DEFINE_PER_CPU(spinlock_t, vmap_bump_lock);
+
+/* Try the per-CPU bump-allocator. Returns the chosen address or
+ * a negative IS_ERR_VALUE on miss; callers fall through to the
+ * regular path on miss.
+ */
+static unsigned long
+vmap_bump_alloc(unsigned long size, unsigned long align,
+ unsigned long vstart, unsigned long vend)
+{
+ struct vmap_bump_chunk *chunk;
+ spinlock_t *lock;
+ unsigned long aligned, addr = -ENOENT;
+
+ if (vstart != VMALLOC_START || vend != VMALLOC_END ||
+ size == 0 || size > VMAP_BUMP_CHUNK_SIZE / 2 ||
+ align > VMAP_BUMP_CHUNK_SIZE / 2)
+ return -EINVAL;
+
+ lock = this_cpu_ptr(&vmap_bump_lock);
+ spin_lock(lock);
+ chunk = this_cpu_ptr(&vmap_bump);
+ if (chunk->base) {
+ aligned = ALIGN(chunk->bump, align);
+ if (aligned + size <= chunk->limit) {
+ chunk->bump = aligned + size;
+ addr = aligned;
+ }
+ }
+ spin_unlock(lock);
+ return addr;
+}
+
+/* Refill this CPU's bump chunk. Reserves a fresh range from the
+ * global allocator. Old chunk's remaining space is leaked (the
+ * already-allocated VAs in it stay live; the unused tail is wasted).
+ */
+static int
+vmap_bump_refill(gfp_t gfp_mask)
+{
+ struct vmap_bump_chunk *chunk;
+ spinlock_t *lock;
+ unsigned long base;
+
+ preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, NUMA_NO_NODE);
+ base = __alloc_vmap_area(VMAP_BUMP_CHUNK_SIZE, PAGE_SIZE,
+ VMALLOC_START, VMALLOC_END);
+ spin_unlock(&free_vmap_area_lock);
+
+ if (IS_ERR_VALUE(base))
+ return -ENOMEM;
+
+ lock = this_cpu_ptr(&vmap_bump_lock);
+ spin_lock(lock);
+ chunk = this_cpu_ptr(&vmap_bump);
+ chunk->base = base;
+ chunk->limit = base + VMAP_BUMP_CHUNK_SIZE;
+ chunk->bump = base;
+ spin_unlock(lock);
+ return 0;
+}
+
/*
* Allocate a region of KVA of the specified size and alignment, within the
* vstart and vend. If vm is passed in, the two will also be bound.
@@ -2519,6 +2611,19 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
}
retry:
+ if (IS_ERR_VALUE(addr)) {
+ /*
+ * Per-CPU bump-allocator fast path. On hit, no global
+ * tree work runs at all. On miss, refill the chunk and
+ * try again before falling back to the regular path.
+ */
+ addr = vmap_bump_alloc(size, align, vstart, vend);
+ if (IS_ERR_VALUE(addr) && (long)addr == -ENOENT) {
+ if (vmap_bump_refill(gfp_mask) == 0)
+ addr = vmap_bump_alloc(size, align,
+ vstart, vend);
+ }
+ }
if (IS_ERR_VALUE(addr)) {
preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
try_init_free_mt_locked();
@@ -6214,6 +6319,8 @@ void __init vmalloc_init(void)
init_llist_head(&p->list);
INIT_WORK(&p->wq, delayed_vfree_work);
xa_init(&vbq->vmap_blocks);
+
+ spin_lock_init(&per_cpu(vmap_bump_lock, i));
}
/*
--
2.34.1