[PATCHSET v2 sched_ext/for-7.2] bpf/arena: Direct kernel-side access
Tejun Heo
tj at kernel.org
Sun May 17 14:12:24 PDT 2026
Hello,
This makes BPF arena memory directly dereferenceable from kernel code
(struct_ops callbacks, kfuncs). Each arena gets a per-arena scratch page
that an arch fault hook installs into empty PTEs on kernel-side faults,
after KFENCE. The faulting instruction retries and the violation is reported
through the program's BPF stream.
v2 changes since RFC v1 (20260427105109.2554518-1-tj at kernel.org):
Switched from pre-filling the arena recovery range with a fallback page to
leaving PTEs sparse and installing the scratch page from an arch fault hook
on demand. Pre-fill erased the fault and lost error attribution; on-demand
keeps the fault visible and reports the violation through the program's
bpf_stream. Plus misc fixes.
Motivation
----------
sched_ext's ops_cid.set_cmask() hands the BPF scheduler a struct scx_cmask
*. The kernel translates a kernel cpumask to a cmask, but it had no way to
write into the arena, so the cmask lived in kernel memory and was passed as
a trusted pointer. BPF cmask helpers all operate on arena cmasks though, so
the BPF side had to word-by-word probe-read the kernel cmask into an arena
cmask via cmask_copy_from_kernel() before any helper could touch it. It
works, but is clumsy.
The shape isn't unique to set_cmask. Sub-scheduler support is on the way and
more sched_ext callbacks will want to pass structured data to BPF. Anywhere
a kfunc or struct_ops callback wants to hand a struct to a BPF program,
arena residence is the natural answer.
Approach
--------
Each arena gets a per-arena scratch page. Arenas stay sparsely mapped as
today - PTEs are populated only for allocated pages. A new arch fault hook
(bpf_arena_handle_page_fault) is wired into x86 page_fault_oops() and arm64
__do_kernel_fault(), after KFENCE. When a kernel-side access faults inside
an arena's kern_vm range, the helper walks the stack to find the BPF program
responsible, range- checks the fault address against prog->aux->arena, and
atomically installs the scratch page into the empty PTE via the new
ptep_try_install() wrapper. The kernel instruction retries and reads/writes
the scratch page. Free paths and map destruction treat scratch as non-owned.
Real allocation refuses to overwrite scratch (apply_range_set_cb returns
-EBUSY); a scratched address stays dead until map destroy, since its
presence means the BPF program has already malfunctioned.
The mechanism is default behavior - no UAPI flag.
What this preserves
-------------------
All the debugging properties of today's sparse-PTE design are
preserved:
* BPF programs still fault on unmapped arena accesses. The fault semantics
(instruction retry with rdst = 0) and the violation report through
bpf_streams are unchanged for prog-side accesses.
* The first kernel-side touch of an unmapped address is reported via
bpf_streams the same way as a prog-side fault, with the stack walk
attributing it to the originating prog.
* User-side fault on a never-scratched address still lazy-allocates a real
page (or returns SIGSEGV under BPF_F_SEGV_ON_FAULT). User- side fault on a
scratched address SIGSEGVs.
What changes for the kernel-side caller is just that an unmapped deref no
longer oopses - it retries through the scratch page and emits a violation
report. The same shape today's BPF instruction faults have.
Patches 1-2 (atomic PTE install + arena scratch-page recovery)
--------------------------------------------------------------
mm: Add ptep_try_install() for lockless empty-slot installs
bpf: Recover arena kernel faults with scratch page
Patches 3-5 (helpers used by struct_ops registration)
-----------------------------------------------------
bpf: Add sleepable variant of bpf_arena_alloc_pages for kernel callers
bpf: Add bpf_struct_ops_for_each_prog()
bpf/arena: Add bpf_arena_map_kern_vm_start() and bpf_prog_arena()
Patches 6-8 (sched_ext: arena auto-discovery, allocator, set_cmask)
-------------------------------------------------------------------
sched_ext: Require an arena for cid-form schedulers
sched_ext: Sub-allocator over kernel-claimed BPF arena pages
sched_ext: Convert ops.set_cmask() to arena-resident cmask
Patch 6 reads each member prog's prog->aux->arena via bpf_prog_arena() and
requires the cid-form struct_ops to reference exactly one arena. Patch 7
builds a gen_pool sub-allocator inside that arena. Patch 8 converts
set_cmask() to write into arena memory; BPF dereferences via __arena like
any other arena struct, no probe-reads.
Base
----
sched_ext/for-7.2 (c9017d335aab) with cmask-prep-v2.1 applied:
https://lore.kernel.org/r/20260517183614.1191534-1-tj@kernel.org
Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git arena-direct-v2
Documentation/bpf/kfuncs.rst | 14 +++
arch/arm64/include/asm/pgtable.h | 8 ++
arch/arm64/mm/fault.c | 10 +-
arch/x86/include/asm/pgtable.h | 8 ++
arch/x86/mm/fault.c | 12 +-
include/linux/bpf.h | 14 +++
include/linux/bpf_defs.h | 11 ++
include/linux/pgtable.h | 16 +++
kernel/bpf/arena.c | 209 ++++++++++++++++++++++++++++------
kernel/bpf/bpf_struct_ops.c | 36 ++++++
kernel/bpf/core.c | 5 +
kernel/sched/build_policy.c | 4 +
kernel/sched/ext.c | 135 +++++++++++++++++++++-
kernel/sched/ext_arena.c | 127 +++++++++++++++++++++
kernel/sched/ext_arena.h | 18 +++
kernel/sched/ext_cid.c | 20 +---
kernel/sched/ext_internal.h | 23 +++-
tools/sched_ext/include/scx/cid.bpf.h | 52 ---------
tools/sched_ext/scx_qmap.bpf.c | 5 +-
19 files changed, 604 insertions(+), 123 deletions(-)
Thanks.
--
tejun
More information about the linux-arm-kernel
mailing list