[PATCH 2/8] bpf: Recover arena kernel faults with scratch page
Emil Tsalapatis
emil at etsalapatis.com
Wed May 20 20:16:57 PDT 2026
On Wed May 20, 2026 at 7:50 PM EDT, Tejun Heo wrote:
> From: Kumar Kartikeya Dwivedi <memxor at gmail.com>
>
> BPF arena usage is becoming more prevalent, but kernel <-> BPF communication
> over arena memory is awkward today. Data has to be staged through a trusted
> kernel pointer with extra code and copying on the BPF side. While reads
> through arena pointers can use a fault-safe helper, writes don't have a good
> solution. The in-line alternative would need instruction emulation or asm
> fixup labels.
>
> Enable direct kernel-side reads and writes within GUARD_SZ / 2 of any
> handed-in arena pointer, without bounds checking. A per-arena scratch page
> is installed by the arch fault path into empty arena kernel PTEs - x86 from
> page_fault_oops() for not-present faults, arm64 from __do_kernel_fault() for
> translation faults, both after the existing exception-table and KFENCE
> handling. The faulting instruction retries and the access is also reported
> through the program's BPF stream, preserving error reporting.
>
> bpf_prog_find_from_stack() resolves the current BPF program (and its arena)
> from the kernel stack - no new bpf_run_ctx state is added. Recovery covers
> the 4 GiB arena plus the upper half-guard (GUARD_SZ / 2). The lower
> half-guard is excluded because well-behaved kfuncs only access forward from
> arena pointers. The kfunc-author contract - access at most GUARD_SZ / 2 past
> a handed-in pointer - is documented in Documentation/bpf/kfuncs.rst.
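The recovery window described above can be checked with a small userspace sketch. It assumes a 4 KiB page size and GUARD_SZ = 64 KiB (so GUARD_SZ / 2 is the 32 KiB quoted in the kfuncs.rst hunk). The sketch_* names are hypothetical and only mirror the bounds test in bpf_arena_handle_page_fault(); this is not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration: 4 GiB arena, 64 KiB guard, 4 KiB pages. */
#define SKETCH_SZ_4G     (1ULL << 32)
#define SKETCH_GUARD_SZ  (64ULL * 1024)
#define SKETCH_PAGE_MASK (~((uint64_t)4096 - 1))

/*
 * Returns true when the faulting address falls in the recoverable window:
 * [kbase, kbase + 4 GiB + GUARD_SZ / 2). Addresses below kbase (the lower
 * half-guard) are rejected and would go to the regular oops path.
 * kbase is assumed to be page aligned.
 */
static bool sketch_in_recovery_range(uint64_t kbase, uint64_t addr)
{
	uint64_t page_addr = addr & SKETCH_PAGE_MASK;

	return page_addr >= kbase &&
	       page_addr < kbase + SKETCH_SZ_4G + SKETCH_GUARD_SZ / 2;
}
```

With a hypothetical kbase of 0x100000000000, the last recoverable byte is kbase + 4 GiB + 32 KiB - 1. One byte further, or any byte below kbase, is left to the normal kernel fault handling.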
>
> The install is lock-free via ptep_try_set(). On race-loss the winning
> installer's PTE is already valid, so the access retry succeeds. The arena
> clear path uses ptep_get_and_clear() so installer and clearer race through
> atomic accessors. No flush_tlb_kernel_range() afterwards. Stale "not mapped"
> entries just cause one extra re-fault, cheaper than a global IPI on every
> install.
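The install/clear race can be modelled in userspace with C11 atomics. This is a sketch under stated assumptions: a PTE is reduced to a single 64-bit word, 0 stands for an empty entry, and sketch_try_set() / sketch_get_and_clear() are hypothetical stand-ins for ptep_try_set() and ptep_get_and_clear(), whose real implementations are arch-specific and not shown in this patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef _Atomic uint64_t sketch_pte_t;
#define SKETCH_PTE_NONE 0ULL

/* Install new_val only if the entry is still empty; true if we won. */
static bool sketch_try_set(sketch_pte_t *pte, uint64_t new_val)
{
	uint64_t expected = SKETCH_PTE_NONE;

	return atomic_compare_exchange_strong(pte, &expected, new_val);
}

/* Atomically take the old entry and leave the slot empty. */
static uint64_t sketch_get_and_clear(sketch_pte_t *pte)
{
	return atomic_exchange(pte, SKETCH_PTE_NONE);
}

/* Walks through win, loss, clear, and re-install; returns the final entry. */
static uint64_t sketch_race_demo(void)
{
	sketch_pte_t pte;
	bool won, lost;
	uint64_t old;

	atomic_init(&pte, SKETCH_PTE_NONE);

	won = sketch_try_set(&pte, 0x1000);   /* first installer wins */
	lost = !sketch_try_set(&pte, 0x2000); /* second sees a valid entry */
	assert(won && lost);

	old = sketch_get_and_clear(&pte);     /* clearer observes the winner */
	assert(old == 0x1000);

	/* The slot is empty again, so a later fault can install scratch. */
	assert(sketch_try_set(&pte, 0x2000));
	return atomic_load(&pte);
}
```

The loser of the compare-exchange does nothing further: the entry it found is already valid, so the retried access succeeds either way. Because the clearer uses an exchange rather than a read followed by a separate clear, it can never drop an entry it did not observe.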
>
> Scratch exists only to keep the kernel from oopsing on an in-line arena
> access. Its presence at a PTE means the BPF program has already
> malfunctioned, and the violation is reported through the program's BPF
> stream. The only requirement for behavior on a scratched PTE is that the
> kernel doesn't crash. In particular, any user-side access through such a PTE
> may segfault. The shared scratch page is freed once during map destruction.
>
> BPF instruction faults continue to use the existing JIT exception-table
> path. This patch changes only the kernel-text fault path. No UAPI flag is
> added. The new behavior is the default.
>
> v2: Use ptep_get_and_clear() in apply_range_clear_cb(). (David)
>
> Suggested-by: Alexei Starovoitov <ast at kernel.org>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor at gmail.com>
> Signed-off-by: Tejun Heo <tj at kernel.org>
> Cc: David Hildenbrand <david at kernel.org>
> ---
Reviewed-by: Emil Tsalapatis <emil at etsalapatis.com>
> Documentation/bpf/kfuncs.rst | 14 +++
> arch/arm64/mm/fault.c | 10 +-
> arch/x86/mm/fault.c | 12 ++-
> include/linux/bpf.h | 1 +
> include/linux/bpf_defs.h | 11 +++
> kernel/bpf/arena.c | 177 +++++++++++++++++++++++++++--------
> kernel/bpf/core.c | 5 +
> 7 files changed, 183 insertions(+), 47 deletions(-)
> create mode 100644 include/linux/bpf_defs.h
>
> diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
> index 75e6c078e0e7..6d497e720998 100644
> --- a/Documentation/bpf/kfuncs.rst
> +++ b/Documentation/bpf/kfuncs.rst
> @@ -462,6 +462,20 @@ In order to accommodate such requirements, the verifier will enforce strict
> PTR_TO_BTF_ID type matching if two types have the exact same name, with one
> being suffixed with ``___init``.
>
> +2.8 Accessing arena memory through kfunc arguments
> +--------------------------------------------------
> +
> +A read or write at any address inside an arena does not oops the kernel.
> +Unallocated arena pages are lazily backed by a scratch page and the
> +access is reported through the program's BPF stream as an error. Only
> +the BPF program's correctness is affected; the kernel itself remains
> +intact.
> +
> +The arena is followed by a ``GUARD_SZ / 2`` (32 KiB) guard region that
> +is also covered by this recovery. A kfunc handed an arena pointer may
> +therefore access up to ``GUARD_SZ / 2`` past it without bounds-checking
> +against the arena. Larger accesses must verify the range explicitly.
> +
> .. _BPF_kfunc_lifecycle_expectations:
>
> 3. kfunc lifecycle expectations
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 920a8b244d59..0d58d667fcd8 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -9,6 +9,7 @@
>
> #include <linux/acpi.h>
> #include <linux/bitfield.h>
> +#include <linux/bpf_defs.h>
> #include <linux/extable.h>
> #include <linux/kfence.h>
> #include <linux/signal.h>
> @@ -416,9 +417,12 @@ static void __do_kernel_fault(unsigned long addr, unsigned long esr,
> } else if (addr < PAGE_SIZE) {
> msg = "NULL pointer dereference";
> } else {
> - if (esr_fsc_is_translation_fault(esr) &&
> - kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
> - return;
> + if (esr_fsc_is_translation_fault(esr)) {
> + if (kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
> + return;
> + if (bpf_arena_handle_page_fault(addr, esr & ESR_ELx_WNR, regs->pc))
> + return;
> + }
>
> msg = "paging request";
> }
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f0e77e084482..b0f103ddbd23 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -8,6 +8,7 @@
> #include <linux/sched/task_stack.h> /* task_stack_*(), ... */
> #include <linux/kdebug.h> /* oops_begin/end, ... */
> #include <linux/memblock.h> /* max_low_pfn */
> +#include <linux/bpf_defs.h> /* bpf_arena_handle_page_fault */
> #include <linux/kfence.h> /* kfence_handle_page_fault */
> #include <linux/kprobes.h> /* NOKPROBE_SYMBOL, ... */
> #include <linux/mmiotrace.h> /* kmmio_handler, ... */
> @@ -688,10 +689,13 @@ page_fault_oops(struct pt_regs *regs, unsigned long error_code,
> if (IS_ENABLED(CONFIG_EFI))
> efi_crash_gracefully_on_page_fault(address);
>
> - /* Only not-present faults should be handled by KFENCE. */
> - if (!(error_code & X86_PF_PROT) &&
> - kfence_handle_page_fault(address, error_code & X86_PF_WRITE, regs))
> - return;
> + /* Only not-present faults should be handled by KFENCE or BPF arena. */
> + if (!(error_code & X86_PF_PROT)) {
> + if (kfence_handle_page_fault(address, error_code & X86_PF_WRITE, regs))
> + return;
> + if (bpf_arena_handle_page_fault(address, error_code & X86_PF_WRITE, regs->ip))
> + return;
> + }
>
> oops:
> /*
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 0136a108d083..831996c411cf 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -6,6 +6,7 @@
>
> #include <uapi/linux/bpf.h>
> #include <uapi/linux/filter.h>
> +#include <linux/bpf_defs.h>
>
> #include <crypto/sha2.h>
> #include <linux/workqueue.h>
> diff --git a/include/linux/bpf_defs.h b/include/linux/bpf_defs.h
> new file mode 100644
> index 000000000000..d98e033b8c0b
> --- /dev/null
> +++ b/include/linux/bpf_defs.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Subset of bpf.h declarations, split out so files that need only these
> + * declarations can avoid bpf.h's full include cost.
> + */
> +#ifndef _LINUX_BPF_DEFS_H
> +#define _LINUX_BPF_DEFS_H
> +
> +bool bpf_arena_handle_page_fault(unsigned long addr, bool is_write, unsigned long fault_ip);
> +
> +#endif /* _LINUX_BPF_DEFS_H */
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 08d008cc471e..1c0b87ecc817 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -53,6 +53,7 @@ struct bpf_arena {
> u64 user_vm_start;
> u64 user_vm_end;
> struct vm_struct *kern_vm;
> + struct page *scratch_page;
> struct range_tree rt;
> /* protects rt */
> rqspinlock_t spinlock;
> @@ -118,6 +119,11 @@ struct apply_range_data {
> int i;
> };
>
> +struct clear_range_data {
> + struct llist_head *free_pages;
> + struct page *scratch_page;
> +};
> +
> static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
> {
> struct apply_range_data *d = data;
> @@ -144,33 +150,59 @@ static void flush_vmap_cache(unsigned long start, unsigned long size)
> flush_cache_vmap(start, start + size);
> }
>
> -static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *free_pages)
> +static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
> {
> + struct clear_range_data *d = data;
> pte_t old_pte;
> struct page *page;
>
> - /* sanity check */
> - old_pte = ptep_get(pte);
> + /*
> + * Pairs with ptep_try_set() in the kernel-fault scratch installer.
> + * Both sides must be atomic.
> + */
> + old_pte = ptep_get_and_clear(&init_mm, addr, pte);
> if (pte_none(old_pte) || !pte_present(old_pte))
> - return 0; /* nothing to do */
> + return 0;
>
> page = pte_page(old_pte);
> if (WARN_ON_ONCE(!page))
> return -EINVAL;
>
> - pte_clear(&init_mm, addr, pte);
> + /*
> + * Skip the per-arena scratch page. A kernel fault on an unallocated uaddr
> + * scratches its PTE. A later bpf_arena_free_pages() over that range walks
> + * here. Without the skip, scratch_page would be freed.
> + */
> + if (page == d->scratch_page)
> + return 0;
> +
> + __llist_add(&page->pcp_llist, d->free_pages);
> + return 0;
> +}
>
> - /* Add page to the list so it is freed later */
> - if (free_pages)
> - __llist_add(&page->pcp_llist, free_pages);
> +static int apply_range_set_scratch_cb(pte_t *pte, unsigned long addr, void *data)
> +{
> + struct page *scratch_page = data;
>
> + if (!pte_none(ptep_get(pte)))
> + return 0;
> + /*
> + * Best-effort install. ptep_try_set() returns false only if another
> + * installer (real allocation or concurrent fault) won the cmpxchg.
> + * Their PTE is already valid, so the access retry succeeds.
> + *
> + * No flush_tlb_kernel_range() needed. Stale "not mapped" entries just
> + * cause one extra re-fault through this same path.
> + */
> + ptep_try_set(pte, mk_pte(scratch_page, PAGE_KERNEL));
> return 0;
> }
>
> static int populate_pgtable_except_pte(struct bpf_arena *arena)
> {
> + /* Populate intermediates for the recovery range (4 GiB + upper half-guard). */
> return apply_to_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> - KERN_VM_SZ - GUARD_SZ, apply_range_set_cb, NULL);
> + SZ_4G + GUARD_SZ / 2, apply_range_set_cb, NULL);
> }
>
> static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
> @@ -221,22 +253,29 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
> init_irq_work(&arena->free_irq, arena_free_irq);
> INIT_WORK(&arena->free_work, arena_free_worker);
> bpf_map_init_from_attr(&arena->map, attr);
> +
> + err = bpf_map_alloc_pages(&arena->map, NUMA_NO_NODE, 1, &arena->scratch_page);
> + if (err)
> + goto err_free_arena;
> +
> range_tree_init(&arena->rt);
> err = range_tree_set(&arena->rt, 0, attr->max_entries);
> - if (err) {
> - bpf_map_area_free(arena);
> - goto err;
> - }
> + if (err)
> + goto err_free_scratch;
> mutex_init(&arena->lock);
> raw_res_spin_lock_init(&arena->spinlock);
> err = populate_pgtable_except_pte(arena);
> - if (err) {
> - range_tree_destroy(&arena->rt);
> - bpf_map_area_free(arena);
> - goto err;
> - }
> + if (err)
> + goto err_destroy_rt;
>
> return &arena->map;
> +
> +err_destroy_rt:
> + range_tree_destroy(&arena->rt);
> +err_free_scratch:
> + __free_page(arena->scratch_page);
> +err_free_arena:
> + bpf_map_area_free(arena);
> err:
> free_vm_area(kern_vm);
> return ERR_PTR(err);
> @@ -244,6 +283,7 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
>
> static int existing_page_cb(pte_t *ptep, unsigned long addr, void *data)
> {
> + struct bpf_arena *arena = data;
> struct page *page;
> pte_t pte;
>
> @@ -251,6 +291,12 @@ static int existing_page_cb(pte_t *ptep, unsigned long addr, void *data)
> if (!pte_present(pte)) /* sanity check */
> return 0;
> page = pte_page(pte);
> + /*
> + * Skip the scratch page. The walk is page-table-driven, not range-tree-driven,
> + * so it can visit scratch PTEs at uaddrs the BPF program never allocated.
> + */
> + if (page == arena->scratch_page)
> + return 0;
> /*
> * We do not update pte here:
> * 1. Nobody should be accessing bpf_arena's range outside of a kernel bug
> @@ -286,9 +332,10 @@ static void arena_map_free(struct bpf_map *map)
> * free those pages.
> */
> apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> - KERN_VM_SZ - GUARD_SZ, existing_page_cb, NULL);
> + SZ_4G + GUARD_SZ / 2, existing_page_cb, arena);
> free_vm_area(arena->kern_vm);
> range_tree_destroy(&arena->rt);
> + __free_page(arena->scratch_page);
> bpf_map_area_free(arena);
> }
>
> @@ -374,33 +421,37 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> return VM_FAULT_RETRY;
>
> page = vmalloc_to_page((void *)kaddr);
> - if (page)
> + if (page) {
> + if (page == arena->scratch_page)
> + /* BPF triggered scratch here; don't lazy-alloc over it */
> + goto out_sigsegv;
> /* already have a page vmap-ed */
> goto out;
> + }
>
> bpf_map_memcg_enter(&arena->map, &old_memcg, &new_memcg);
>
> if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
> /* User space requested to segfault when page is not allocated by bpf prog */
> - goto out_unlock_sigsegv;
> + goto out_sigsegv_memcg;
>
> ret = range_tree_clear(&arena->rt, vmf->pgoff, 1);
> if (ret)
> - goto out_unlock_sigsegv;
> + goto out_sigsegv_memcg;
>
> struct apply_range_data data = { .pages = &page, .i = 0 };
> /* Account into memcg of the process that created bpf_arena */
> ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
> if (ret) {
> range_tree_set(&arena->rt, vmf->pgoff, 1);
> - goto out_unlock_sigsegv;
> + goto out_sigsegv_memcg;
> }
>
> ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
> if (ret) {
> range_tree_set(&arena->rt, vmf->pgoff, 1);
> free_pages_nolock(page, 0);
> - goto out_unlock_sigsegv;
> + goto out_sigsegv_memcg;
> }
> flush_vmap_cache(kaddr, PAGE_SIZE);
> bpf_map_memcg_exit(old_memcg, new_memcg);
> @@ -409,8 +460,9 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
> vmf->page = page;
> return 0;
> -out_unlock_sigsegv:
> +out_sigsegv_memcg:
> bpf_map_memcg_exit(old_memcg, new_memcg);
> +out_sigsegv:
> raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
> return VM_FAULT_SIGSEGV;
> }
> @@ -668,6 +720,7 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt,
> struct llist_head free_pages;
> struct llist_node *pos, *t;
> struct arena_free_span *s;
> + struct clear_range_data cdata;
> unsigned long flags;
> int ret = 0;
>
> @@ -696,9 +749,11 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt,
> range_tree_set(&arena->rt, pgoff, page_cnt);
>
> init_llist_head(&free_pages);
> + cdata.free_pages = &free_pages;
> + cdata.scratch_page = arena->scratch_page;
> /* clear ptes and collect struct pages */
> apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
> - apply_range_clear_cb, &free_pages);
> + apply_range_clear_cb, &cdata);
>
> /* drop the lock to do the tlb flush and zap pages */
> raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
> @@ -788,6 +843,7 @@ static void arena_free_worker(struct work_struct *work)
> struct arena_free_span *s;
> u64 arena_vm_start, user_vm_start;
> struct llist_head free_pages;
> + struct clear_range_data cdata;
> struct page *page;
> unsigned long full_uaddr;
> long kaddr, page_cnt, pgoff;
> @@ -801,6 +857,8 @@ static void arena_free_worker(struct work_struct *work)
> bpf_map_memcg_enter(&arena->map, &old_memcg, &new_memcg);
>
> init_llist_head(&free_pages);
> + cdata.free_pages = &free_pages;
> + cdata.scratch_page = arena->scratch_page;
> arena_vm_start = bpf_arena_get_kern_vm_start(arena);
> user_vm_start = bpf_arena_get_user_vm_start(arena);
>
> @@ -813,7 +871,7 @@ static void arena_free_worker(struct work_struct *work)
>
> /* clear ptes and collect pages in free_pages llist */
> apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
> - apply_range_clear_cb, &free_pages);
> + apply_range_clear_cb, &cdata);
>
> range_tree_set(&arena->rt, pgoff, page_cnt);
> }
> @@ -928,23 +986,12 @@ static int __init kfunc_init(void)
> }
> late_initcall(kfunc_init);
>
> -void bpf_prog_report_arena_violation(bool write, unsigned long addr, unsigned long fault_ip)
> +static void __bpf_prog_report_arena_violation(struct bpf_prog *prog, bool write,
> + unsigned long addr, unsigned long fault_ip)
> {
> struct bpf_stream_stage ss;
> - struct bpf_prog *prog;
> u64 user_vm_start;
>
> - /*
> - * The RCU read lock is held to safely traverse the latch tree, but we
> - * don't need its protection when accessing the prog, since it will not
> - * disappear while we are handling the fault.
> - */
> - rcu_read_lock();
> - prog = bpf_prog_ksym_find(fault_ip);
> - rcu_read_unlock();
> - if (!prog)
> - return;
> -
> /* Use main prog for stream access */
> prog = prog->aux->main_prog_aux->prog;
>
> @@ -957,3 +1004,53 @@ void bpf_prog_report_arena_violation(bool write, unsigned long addr, unsigned lo
> bpf_stream_dump_stack(ss);
> }));
> }
> +
> +bool bpf_arena_handle_page_fault(unsigned long addr, bool is_write, unsigned long fault_ip)
> +{
> + struct bpf_arena *arena;
> + struct bpf_prog *prog;
> + unsigned long kbase;
> + unsigned long page_addr = addr & PAGE_MASK;
> +
> + prog = bpf_prog_find_from_stack();
> + if (!prog)
> + return false;
> +
> + arena = prog->aux->arena;
> + /* a prog not using arena may be on stack, so arena can be NULL */
> + if (!arena)
> + return false;
> +
> + kbase = bpf_arena_get_kern_vm_start(arena);
> +
> + /*
> + * Recovery covers the 4 GiB mappable band plus the upper half-guard.
> + * Lower guard is unreachable from kfuncs; an address there indicates
> + * a different bug class - leave it to the regular kernel oops path.
> + */
> + if (page_addr < kbase || page_addr >= kbase + SZ_4G + GUARD_SZ / 2)
> + return false;
> +
> + apply_to_page_range(&init_mm, page_addr, PAGE_SIZE,
> + apply_range_set_scratch_cb, arena->scratch_page);
> + flush_vmap_cache(page_addr, PAGE_SIZE);
> + __bpf_prog_report_arena_violation(prog, is_write, page_addr - kbase, fault_ip);
> + return true;
> +}
> +
> +void bpf_prog_report_arena_violation(bool write, unsigned long addr, unsigned long fault_ip)
> +{
> + struct bpf_prog *prog;
> +
> + /*
> + * The RCU read lock is held to safely traverse the latch tree, but we
> + * don't need its protection when accessing the prog, since it will not
> + * disappear while we are handling the fault.
> + */
> + rcu_read_lock();
> + prog = bpf_prog_ksym_find(fault_ip);
> + rcu_read_unlock();
> + if (!prog)
> + return;
> + __bpf_prog_report_arena_violation(prog, write, addr, fault_ip);
> +}
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 066b86e7233c..fa368d8920d9 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -3290,6 +3290,11 @@ __weak u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
> {
> return 0;
> }
> +__weak bool bpf_arena_handle_page_fault(unsigned long addr, bool is_write,
> + unsigned long fault_ip)
> +{
> + return false;
> +}
>
> #ifdef CONFIG_BPF_SYSCALL
> static int __init bpf_global_ma_init(void)