[PATCHSET v4 sched_ext/for-7.2] bpf/arena: Direct kernel-side access
Tejun Heo
tj at kernel.org
Fri May 22 10:22:11 PDT 2026
Hello,
This makes BPF arena memory directly dereferenceable from kernel code
(struct_ops callbacks, kfuncs). Each arena gets a per-arena scratch page
that an arch fault hook installs into empty PTEs on kernel-side faults,
after KFENCE. The faulting instruction retries and the violation is reported
through the program's BPF stream.
v4:
- Patch 1: note that the strict-zero cmpxchg is narrower than pte_none() in
inline comments on both x86 and arm64. (Andrea)
- Patch 2: stub bpf_arena_handle_page_fault() for !CONFIG_BPF_SYSCALL via a
new include/linux/bpf_defs.h. (lkp)
- Patch 7: scx_arena_alloc() retries via a loop instead of a single retry on
pool growth. (Andrea)
- Picked up Reviewed-by tags from Emil and Andrea.
v3: https://lore.kernel.org/r/20260520235052.4180316-1-tj@kernel.org
v2: https://lore.kernel.org/r/20260517211232.1670594-1-tj@kernel.org
v1 (RFC): https://lore.kernel.org/r/20260427105109.2554518-1-tj@kernel.org
Motivation
----------
sched_ext's ops_cid.set_cmask() hands the BPF scheduler a struct scx_cmask
*. The kernel translates a kernel cpumask to a cmask, but it had no way to
write into the arena, so the cmask lived in kernel memory and was passed as
a trusted pointer. BPF cmask helpers all operate on arena cmasks though, so
the BPF side had to word-by-word probe-read the kernel cmask into an arena
cmask via cmask_copy_from_kernel() before any helper could touch it. It
works, but is clumsy.
The shape isn't unique to set_cmask. Sub-scheduler support is on the way and
more sched_ext callbacks will want to pass structured data to BPF. Anywhere
a kfunc or struct_ops callback wants to hand a struct to a BPF program,
arena residence is the natural answer.
Approach
--------
Each arena gets a per-arena scratch page. Arenas stay sparsely mapped as
today - PTEs are populated only for allocated pages. A new arch fault hook
(bpf_arena_handle_page_fault) is wired into x86 page_fault_oops() and arm64
__do_kernel_fault(), after KFENCE. When a kernel-side access faults inside
an arena's kern_vm range, the helper walks the stack to find the BPF program
responsible, range-checks the fault address against prog->aux->arena, and
atomically installs the scratch page into the empty PTE via the new
ptep_try_set() wrapper. The kernel instruction retries and reads/writes the
scratch page. Free paths and map destruction treat scratch as non-owned.
Real allocation refuses to overwrite scratch (apply_range_set_cb returns
-EBUSY). A scratched address stays dead until map destroy, since its
presence means the BPF program has already malfunctioned.
The mechanism is default behavior - no UAPI flag.
What this preserves
-------------------
All the debugging properties of today's sparse-PTE design are preserved:
* BPF programs still fault on unmapped arena accesses. The fault semantics
(instruction retry with rdst = 0) and the violation report through
bpf_streams are unchanged for prog-side accesses.
* The first kernel-side touch of an unmapped address is reported via
bpf_streams the same way as a prog-side fault, with the stack walk
attributing it to the originating prog.
* User-side fault on a never-scratched address still lazy-allocates a real
page (or returns SIGSEGV under BPF_F_SEGV_ON_FAULT). User-side fault on a
scratched address SIGSEGVs.
What changes for the kernel-side caller is just that an unmapped deref no
longer oopses - it retries through the scratch page and emits a violation
report. The same shape today's BPF instruction faults have.
Patches 1-2 (atomic PTE install + arena scratch-page recovery)
--------------------------------------------------------------
mm: Add ptep_try_set() for lockless empty-slot installs
bpf: Recover arena kernel faults with scratch page
Patches 3-5 (helpers used by struct_ops registration)
-----------------------------------------------------
bpf: Add sleepable variant of bpf_arena_alloc_pages for kernel callers
bpf: Add bpf_struct_ops_for_each_prog()
bpf/arena: Add bpf_arena_map_kern_vm_start() and bpf_prog_arena()
Patches 6-8 (sched_ext: arena auto-discovery, allocator, set_cmask)
-------------------------------------------------------------------
sched_ext: Require an arena for cid-form schedulers
sched_ext: Sub-allocator over kernel-claimed BPF arena pages
sched_ext: Convert ops.set_cmask() to arena-resident cmask
Patch 6 reads each member prog's prog->aux->arena via bpf_prog_arena() and
requires the cid-form struct_ops to reference exactly one arena. Patch 7
builds a gen_pool sub-allocator inside that arena. Patch 8 converts
set_cmask() to write into arena memory. BPF dereferences via __arena like
any other arena struct, no probe-reads.
Base
----
sched_ext/for-7.2 (f31e89a8f583)
Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git arena-direct-v4
Documentation/bpf/kfuncs.rst | 14 +++
arch/arm64/include/asm/pgtable.h | 12 ++
arch/arm64/mm/fault.c | 10 +-
arch/x86/include/asm/pgtable.h | 12 ++
arch/x86/mm/fault.c | 12 +-
include/linux/bpf.h | 14 +++
include/linux/bpf_defs.h | 19 +++
include/linux/pgtable.h | 25 ++++
kernel/bpf/arena.c | 216 +++++++++++++++++++++++++++-------
kernel/bpf/bpf_struct_ops.c | 36 ++++++
kernel/bpf/core.c | 5 +
kernel/sched/build_policy.c | 4 +
kernel/sched/ext.c | 135 ++++++++++++++++++++-
kernel/sched/ext_arena.c | 126 ++++++++++++++++++++
kernel/sched/ext_arena.h | 18 +++
kernel/sched/ext_cid.c | 20 +---
kernel/sched/ext_internal.h | 23 +++-
tools/sched_ext/include/scx/cid.bpf.h | 52 --------
tools/sched_ext/scx_qmap.bpf.c | 5 +-
19 files changed, 630 insertions(+), 128 deletions(-)
Thanks.
--
tejun
More information about the linux-arm-kernel
mailing list