[PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
bot+bpf-ci at kernel.org
Wed Dec 17 08:56:40 PST 2025
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 2da986136d26..654fb94bf60c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
[ ... ]
> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
>
> static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> {
> +#ifdef CONFIG_ARM64
> + u8 rctx = interrupt_context_level();
> + u8 *active = this_cpu_ptr(prog->active);
> +
> + active[rctx]++;
> + barrier();
> + if (get_unaligned_le32(active) != BIT(rctx * 8))
> + return false;
> +
> + return true;
> +#else
> return this_cpu_inc_return(*(prog->active)) == 1;
> +#endif
> }
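For context on what the check encodes: as I read the patch, prog->active on
arm64 becomes one u8 counter per interrupt context packed into a 32-bit word,
and get_unaligned_le32(active) == BIT(rctx * 8) means "only my context's
counter is non-zero, and it is exactly 1". A quick worked example, assuming
BPF_NR_CONTEXTS is 4 (matching the 32-bit read) and that
interrupt_context_level() returns 0/1/2/3 for task/softirq/hardirq/NMI:

	/* softirq entry, nothing else running this prog on the CPU:
	 *   active[] = { 0, 1, 0, 0 }
	 *   get_unaligned_le32(active) == 0x00000100 == BIT(1 * 8) -> true
	 * softirq entry while a task-context run is in flight:
	 *   active[] = { 1, 1, 0, 0 }
	 *   get_unaligned_le32(active) == 0x00000101 != BIT(1 * 8) -> false
	 */

The apparent intent is that each byte is only modified from one context level,
so the plain increments cannot be corrupted by interrupts, while the combined
32-bit read still rejects any nesting, same as the old single counter.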
Can preemption between the increment and the check cause a counter leak on
CONFIG_PREEMPT_RCU kernels? The trampoline calls this function under
rcu_read_lock_dont_migrate(), which disables migration but still allows
preemption on PREEMPT_RCU (documented at include/linux/rcupdate.h:856).
Consider this scenario on an ARM64 system with PREEMPT_RCU:
1. Thread A increments active[0] to 1
2. Preemption occurs before Thread A reaches the check
3. Thread B on same CPU increments active[0] to 2
4. Thread B checks: sees 2 != BIT(0), returns false
5. Thread A resumes, checks: sees 2 != BIT(0), returns false
6. Both threads return false, neither runs BPF
7. Neither calls bpf_prog_put_recursion_context() (see
__bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
8. Counter permanently stuck at 2, so every later task-context run of this
program on that CPU fails the recursion check
The old atomic code did not have this problem because this_cpu_inc_return()
performs the increment and the read as a single operation, so Thread A is
guaranteed to observe the value 1 before Thread B can run. With the split
non-atomic sequence, Thread A increments but has not yet read the combined
value, so Thread B's increment makes both threads observe 2 and both fail.
Is there a guarantee that same-context preemption cannot occur in the BPF
execution path on ARM64, or does this need additional synchronization?
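If such a guarantee does not exist, one way to close the window between the
increment and the check, while still avoiding the atomic RMW this patch is
trying to eliminate, might be to disable preemption around the pair. A rough,
untested sketch of that idea, reusing the names from the patch:

static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
{
	u8 rctx = interrupt_context_level();
	u8 *active;
	bool ret;

	preempt_disable();
	active = this_cpu_ptr(prog->active);
	/* per-context increment, now not interleavable with another task */
	active[rctx]++;
	barrier();
	/* success only if this context's byte is 1 and all others are 0 */
	ret = get_unaligned_le32(active) == BIT(rctx * 8);
	preempt_enable();

	return ret;
}

That only closes the increment/check window, of course; if a failed
bpf_prog_get_recursion_context() really is never paired with a put, the
loser's increment would still need to be unwound somewhere.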
> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
>
> static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
> {
> +#ifdef CONFIG_ARM64
> + u8 rctx = interrupt_context_level();
> + u8 *active = this_cpu_ptr(prog->active);
> +
> + barrier();
> + active[rctx]--;
> +#else
> this_cpu_dec(*(prog->active));
> +#endif
> }
The barrier() is placed before the decrement. Does this provide the needed
ordering guarantee, or should there also be a barrier after the decrement
to ensure visibility?
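For comparison, if compiler ordering on both sides of the decrement is what is
wanted (the counters are per-CPU, so as far as I can tell this is about
ordering against later code on the same CPU, e.g. an interrupt, rather than
SMP visibility), the symmetric form would presumably look like:

static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
{
	u8 rctx = interrupt_context_level();
	u8 *active = this_cpu_ptr(prog->active);

	barrier();	/* keep the protected section before the release */
	active[rctx]--;
	barrier();	/* and the decrement before whatever follows */
}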
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index c66316e32563..b5063acfcf92 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
[ ... ]
> @@ -112,7 +112,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
> vfree(fp);
> return NULL;
> }
> - fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
> + fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
> + bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
The allocation changes to u8[BPF_NR_CONTEXTS] for all architectures, but the
non-ARM64 code path still uses *(prog->active), which now dereferences only
the first byte. This shrinks the counter from int (typically 4 bytes, range
~2^31) to u8 (1 byte, range 0-255). While 256 recursion levels is plenty in
practice, is the narrower counter intentional for non-ARM64 platforms, or
should they keep a full int-sized counter?
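If the wider counter should be kept elsewhere, the allocation could presumably
stay conditional alongside the helpers; a hypothetical sketch only (the type of
prog->active and the get/put helpers would need matching #ifdefs):

#ifdef CONFIG_ARM64
	fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
					bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
#else
	fp->active = __alloc_percpu_gfp(sizeof(int), __alignof__(int),
					bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
#endif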
---
AI reviewed your patch. Please fix the bug or reply by email explaining why it is not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20310266487