[PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
Puranjay Mohan
puranjay12 at gmail.com
Wed Dec 17 10:56:30 PST 2025
On Wed, Dec 17, 2025 at 6:46 PM Alexei Starovoitov
<alexei.starovoitov at gmail.com> wrote:
>
> On Wed, Dec 17, 2025 at 10:13 AM Puranjay Mohan <puranjay at kernel.org> wrote:
> >
> > On Wed, Dec 17, 2025 at 4:56 PM <bot+bpf-ci at kernel.org> wrote:
> > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 2da986136d26..654fb94bf60c 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > >
> > > [ ... ]
> > >
> > > > @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> > > >
> > > > static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> > > > {
> > > > +#ifdef CONFIG_ARM64
> > > > + u8 rctx = interrupt_context_level();
> > > > + u8 *active = this_cpu_ptr(prog->active);
> > > > +
> > > > + active[rctx]++;
> > > > + barrier();
> > > > + if (get_unaligned_le32(active) != BIT(rctx * 8))
> > > > + return false;
> > > > +
> > > > + return true;
> > > > +#else
> > > > return this_cpu_inc_return(*(prog->active)) == 1;
> > > > +#endif
> > > > }
> > >
> > > Can preemption between the increment and check cause a counter leak on
> > > CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
> > > rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
> > > (documented at include/linux/rcupdate.h:856).
> > >
> > > Consider this scenario on an ARM64 system with PREEMPT_RCU:
> > >
> > > 1. Thread A increments active[0] to 1
> > > 2. Preemption occurs before Thread A reaches the check
> > > 3. Thread B on same CPU increments active[0] to 2
> > > 4. Thread B checks: sees 2 != BIT(0), returns false
> > > 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
> > > 6. Both threads return false, neither runs BPF
> > > 7. Neither calls bpf_prog_put_recursion_context() (see
> > > __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
> > > 8. Counter permanently stuck at 2, all future BPF on this CPU fails
> >
> > Step 7 is incorrect. Looking at the JIT-generated code, the exit
> > function is ALWAYS called, regardless of whether the enter function
> > returns 0 or a start time:
> >
> > // x86 JIT at arch/x86/net/bpf_jit_comp.c:2998-3050
> > call bpf_trampoline_enter() // Line 2998
> > test rax, rax // Line 3006
> > je skip_exec // Conditional jump
> > ... BPF program execution ... // Lines 3011-3023
> > skip_exec: // Line 3037 (jump lands here)
> > call bpf_trampoline_exit() // Line 3049 - ALWAYS executed
> >
> > The bpf_trampoline_exit() call is after the skip_exec label, so it
> > executes in both cases.
> >
> > What Actually Happens:
> >
> > Initial state: active[0] = 0
> >
> > Thread A (normal context, rctx=0):
> > 1. active[0]++ → active[0] = 1
> > 2. Preempted before barrier()
> >
> > Thread B (scheduled on same CPU, normal context, rctx=0):
> > 3. active[0]++ → active[0] = 2
> > 4. barrier()
> > 5. get_unaligned_le32(active) → reads 0x00000002
> > 6. Check: 0x00000002 != BIT(0) = 0x00000001 → returns false
> > 7. __bpf_prog_enter_recur returns 0
> > 8. JIT checks return value, skips BPF execution
> > 9. JIT ALWAYS calls __bpf_prog_exit_recur (see
> > arch/arm64/net/bpf_jit_comp.c:2362)
> > 10. bpf_prog_put_recursion_context(prog) executes
> > 11. barrier(), active[0]-- → active[0] = 1
> >
> > Thread A resumes:
> > 12. barrier()
> > 13. get_unaligned_le32(active) → reads 0x00000001 (Thread B already
> > decremented!)
> > 14. Check: 0x00000001 == BIT(0) = 0x00000001 → returns true ✓
> > 15. __bpf_prog_enter_recur returns start_time
> > 16. BPF program executes
> > 17. __bpf_prog_exit_recur called
> > 18. bpf_prog_put_recursion_context(prog) executes
> > 19. barrier(), active[0]-- → active[0] = 0 ✓
> >
> > Final State
> >
> > - Counter returns to 0 ✓
> > - No leak ✓
> > - Thread B detected interference and aborted ✓
> > - Thread A executed successfully ✓
> > - Only ONE thread executed the BPF program ✓
> >
> >
> > Now that I think of it, there is another race condition that leads to
> > NEITHER program running:
> >
> > Consider this scenario on an arm64 system with PREEMPT_RCU:
> >
> > 1. Thread A increments active[0] from 0 to 1
> > 2. Thread A is preempted before reaching barrier()
> > 3. Thread B (same CPU, same context) increments active[0] from 1 to 2
> > 4. Thread B executes barrier() and checks: sees 2 != BIT(0), returns false
> > 5. Thread A resumes, executes barrier() and checks: sees 2 != BIT(0),
> > returns false
> > 6. Both threads return false to __bpf_prog_enter_recur()
> > 7. Both skip BPF program execution
> > 8. Both call bpf_prog_put_recursion_context() and decrement: 2->1->0
> > 9. Neither BPF program executes, but the counter correctly returns to 0
> >
> > This means the patch changes the behaviour in case of recursion from
> > "One program gets to run" to "At most one program gets to run", but
> > given the performance benefits, I think we can accept this change.
>
> Agree. It's fine, but we can mitigate it by doing this rctx trick
> only when RCU is not preemptible. That would pretty much mean
> that PREEMPT_RT will use atomics and !RT will use rctx,
> and this 'no prog executes' case will not happen.

The issue also affects sleepable programs: they use
rcu_read_lock_trace() and can end up in the same
'no prog executes' scenario.
What do you think is the best approach for them?
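
To make the two interleavings discussed above easier to follow, here is a
minimal userspace sketch that replays them deterministically. It is only a
model and not kernel code: struct active_model, load_le32(), enter_inc(),
enter_check(), put() and the two interleave_*() helpers are hypothetical
names made up for illustration, and the increment and the check are split
into separate functions so that a "preemption" can be placed between them.
In the second interleaving, Thread B is assumed to be preempted after its
check but before its exit, which is what lets Thread A observe the value 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of one CPU's prog->active: one byte per context level (assumption). */
struct active_model {
	uint8_t active[4];
};

/* Read the four counters as one little-endian word, like get_unaligned_le32(). */
static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* First half of the enter path: bump this context's byte. */
static void enter_inc(struct active_model *m, unsigned int rctx)
{
	assert(rctx < 4);
	m->active[rctx]++;
}

/* Second half: succeed only if our byte is 1 and all other bytes are 0. */
static bool enter_check(const struct active_model *m, unsigned int rctx)
{
	return load_le32(m->active) == (UINT32_C(1) << (rctx * 8));
}

/* Exit path: always runs, whether or not the program was executed. */
static void put(struct active_model *m, unsigned int rctx)
{
	assert(rctx < 4);
	m->active[rctx]--;
}

/* B enters and exits while A sits between its increment and its check. */
static int interleave_b_finishes_first(struct active_model *m)
{
	bool a_runs, b_runs;

	enter_inc(m, 0);             /* A: 0 -> 1, then preempted */
	enter_inc(m, 0);             /* B: 1 -> 2 */
	b_runs = enter_check(m, 0);  /* B sees 2: refused */
	put(m, 0);                   /* B exit: 2 -> 1 */
	a_runs = enter_check(m, 0);  /* A sees 1: allowed */
	put(m, 0);                   /* A exit: 1 -> 0 */
	return a_runs + b_runs;      /* number of tasks that ran the program */
}

/* Both tasks perform their check while the counter is still 2. */
static int interleave_both_check_at_two(struct active_model *m)
{
	bool a_runs, b_runs;

	enter_inc(m, 0);             /* A: 0 -> 1, then preempted */
	enter_inc(m, 0);             /* B: 1 -> 2 */
	b_runs = enter_check(m, 0);  /* B sees 2: refused, then preempted */
	a_runs = enter_check(m, 0);  /* A sees 2: refused */
	put(m, 0);                   /* 2 -> 1 */
	put(m, 0);                   /* 1 -> 0 */
	return a_runs + b_runs;
}
```

In this model the first helper reports that exactly one task ran and the
second reports that none did, and in both cases the counter ends at zero,
which matches the "at most one program gets to run" description.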