[PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics

Puranjay Mohan puranjay12 at gmail.com
Wed Dec 17 14:29:23 PST 2025


On Wed, Dec 17, 2025 at 9:32 PM Yonghong Song <yonghong.song at linux.dev> wrote:
>
>
>
> On 12/17/25 10:56 AM, Puranjay Mohan wrote:
> > On Wed, Dec 17, 2025 at 6:46 PM Alexei Starovoitov
> > <alexei.starovoitov at gmail.com> wrote:
> >> On Wed, Dec 17, 2025 at 10:13 AM Puranjay Mohan <puranjay at kernel.org> wrote:
> >>> On Wed, Dec 17, 2025 at 4:56 PM <bot+bpf-ci at kernel.org> wrote:
> >>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >>>>> index 2da986136d26..654fb94bf60c 100644
> >>>>> --- a/include/linux/bpf.h
> >>>>> +++ b/include/linux/bpf.h
> >>>> [ ... ]
> >>>>
> >>>>> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> >>>>>
> >>>>>   static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> >>>>>   {
> >>>>> +#ifdef CONFIG_ARM64
> >>>>> +     u8 rctx = interrupt_context_level();
> >>>>> +     u8 *active = this_cpu_ptr(prog->active);
> >>>>> +
> >>>>> +     active[rctx]++;
> >>>>> +     barrier();
> >>>>> +     if (get_unaligned_le32(active) != BIT(rctx * 8))
> >>>>> +             return false;
> >>>>> +
> >>>>> +     return true;
> >>>>> +#else
> >>>>>        return this_cpu_inc_return(*(prog->active)) == 1;
> >>>>> +#endif
> >>>>>   }
> >>>> Can preemption between the increment and check cause a counter leak on
> >>>> CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
> >>>> rcu_read_lock_dont_migrate(), which allows preemption on PREEMPT_RCU
> >>>> (documented at include/linux/rcupdate.h:856).
> >>>>
> >>>> Consider this scenario on an ARM64 system with PREEMPT_RCU:
> >>>>
> >>>> 1. Thread A increments active[0] to 1
> >>>> 2. Preemption occurs before Thread A reaches the check
> >>>> 3. Thread B on same CPU increments active[0] to 2
> >>>> 4. Thread B checks: sees 2 != BIT(0), returns false
> >>>> 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
> >>>> 6. Both threads return false, neither runs BPF
> >>>> 7. Neither calls bpf_prog_put_recursion_context() (see
> >>>>     __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
> >>>> 8. Counter permanently stuck at 2, all future BPF on this CPU fails
> >>> Step 7 is incorrect. Looking at the JIT-generated code, the exit
> >>> function is ALWAYS called, regardless of whether the enter function
> >>> returns 0 or a start time:
> >>>
> >>>    // x86 JIT at arch/x86/net/bpf_jit_comp.c:2998-3050
> >>>    call bpf_trampoline_enter()    // Line 2998
> >>>    test rax, rax                   // Line 3006
> >>>    je skip_exec                    // Conditional jump
> >>>    ... BPF program execution ...   // Lines 3011-3023
> >>>    skip_exec:                      // Line 3037 (jump lands here)
> >>>    call bpf_trampoline_exit()      // Line 3049 - ALWAYS executed
> >>>
> >>>    The bpf_trampoline_exit() call is after the skip_exec label, so it
> >>> executes in both cases.
> >>>
> >>> What Actually Happens:
> >>>
> >>>    Initial state: active[0] = 0
> >>>
> >>>    Thread A (normal context, rctx=0):
> >>>    1. active[0]++ → active[0] = 1
> >>>    2. Preempted before barrier()
> >>>
> >>>    Thread B (scheduled on same CPU, normal context, rctx=0):
> >>>    3. active[0]++ → active[0] = 2
> >>>    4. barrier()
> >>>    5. get_unaligned_le32(active) → reads 0x00000002
> >>>    6. Check: 0x00000002 != BIT(0) = 0x00000001 → returns false
> >>>    7. __bpf_prog_enter_recur returns 0
> >>>    8. JIT checks return value, skips BPF execution
> >>>    9. JIT ALWAYS calls __bpf_prog_exit_recur (see
> >>> arch/arm64/net/bpf_jit_comp.c:2362)
> >>>    10. bpf_prog_put_recursion_context(prog) executes
> >>>    11. barrier(), active[0]-- → active[0] = 1
> >>>
> >>>    Thread A resumes:
> >>>    12. barrier()
> >>>    13. get_unaligned_le32(active) → reads 0x00000001 (Thread B already
> >>> decremented!)
> >>>    14. Check: 0x00000001 == BIT(0) = 0x00000001 → returns true ✓
> >>>    15. __bpf_prog_enter_recur returns start_time
> >>>    16. BPF program executes
> >>>    17. __bpf_prog_exit_recur called
> >>>    18. bpf_prog_put_recursion_context(prog) executes
> >>>    19. barrier(), active[0]-- → active[0] = 0 ✓
> >>>
> >>>    Final State:
> >>>
> >>>    - Counter returns to 0 ✓
> >>>    - No leak ✓
> >>>    - Thread B detected interference and aborted ✓
> >>>    - Thread A executed successfully ✓
> >>>    - Only ONE thread executed the BPF program ✓
> >>>
> >>>
> >>> Now that I think of it, there is another race condition that leads to
> >>> NEITHER program running:
> >>>
> >>> Consider this scenario on an arm64 system with PREEMPT_RCU:
> >>>
> >>> 1. Thread A increments active[0] from 0 to 1
> >>> 2. Thread A is preempted before reaching barrier()
> >>> 3. Thread B (same CPU, same context) increments active[0] from 1 to 2
> >>> 4. Thread B executes barrier() and checks: sees 2 != BIT(0), returns false
> >>> 5. Thread A resumes, executes barrier() and checks: sees 2 != BIT(0),
> >>> returns false
> >>> 6. Both threads return false to __bpf_prog_enter_recur()
> >>> 7. Both skip BPF program execution
> >>> 8. Both call bpf_prog_put_recursion_context() and decrement: 2->1->0
> >>> 9. Neither BPF program executes, but the counter correctly returns to 0
> >>>
> >>> This means the patch changes the behaviour in the case of recursion
> >>> from "one program gets to run" to
> >>> "at most one program gets to run", but given the performance benefits,
> >>> I think we can accept this change.
> >> Agree. It's fine, but we can mitigate it by doing this rctx trick
> >> only when RCU is not preemptible. That would pretty much mean
> >> that PREEMPT_RT will use the atomic and !RT will use rctx,
> >> and this 'no prog executes' case will not happen.
> >
> > The issue also affects sleepable programs: they use
> > rcu_read_lock_trace() and can end up in the
> > 'no prog executes' scenario.
> >
> > What do you think is the best approach for them?
>
> For sleepable programs, maybe we can use the original approach, i.e.
>    return this_cpu_inc_return(*(prog->active)) == 1;
> This should solve the 'no prog executes' issue.
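
To recap the mechanism under discussion: as I read the hunk above,
prog->active is four per-context byte counters (task, softirq, hardirq,
NMI, matching interrupt_context_level()) that the check reads back as a
single little-endian u32, so one unaligned 32-bit load covers all context
levels at once. A quick userspace illustration of the check (my own demo,
not kernel code; it assumes a little-endian host, as arm64 kernels are):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the kernel's get_unaligned_le32(); LE host assumed. */
static uint32_t read_le32(const uint8_t *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

int main(void)
{
	/* One byte per interrupt_context_level(): task, softirq, hardirq, NMI. */
	uint8_t active[4] = { 0 };

	active[0]++;	/* task-level entry, rctx = 0 */
	/* 0x00000001 == BIT(0): no other level active, entry allowed. */
	printf("task entry allowed:    %d\n", read_le32(active) == (1u << 0));

	active[2]++;	/* a hardirq tries to enter on top, rctx = 2 */
	/* 0x00010001 != BIT(16): task level still active, entry rejected. */
	printf("hardirq entry allowed: %d\n", read_le32(active) == (1u << 16));
	return 0;
}

Each level only read-modify-writes its own byte, and it can only be
interrupted by a strictly higher level that restores its byte before
returning, which is why the plain increments are safe without atomics;
the remaining hole is task-vs-task preemption, exactly the races above.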


I tried putting preempt_disable()/preempt_enable() around the inc+read in
the entry path and around the dec in the exit path.
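Roughly this (a sketch on top of this patch's helpers; details may differ
from the exact diff I benchmarked):

static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
{
	u8 rctx, *active;
	bool ret;

	/*
	 * No preemption window between the increment and the packed
	 * check, so two tasks on the same CPU can no longer both see
	 * the counter at 2 and both bail out.
	 */
	preempt_disable();
	rctx = interrupt_context_level();
	active = this_cpu_ptr(prog->active);
	active[rctx]++;
	barrier();
	ret = get_unaligned_le32(active) == BIT(rctx * 8);
	preempt_enable();

	/* On failure the JITed exit path still runs and decrements. */
	return ret;
}

static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
{
	preempt_disable();
	barrier();
	this_cpu_ptr(prog->active)[interrupt_context_level()]--;
	preempt_enable();
}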

New data, from a slightly different config than before, so compare the
relative numbers rather than the exact values:

This patch:                           56.524M/s
This patch with preempt_disable():    53.856M/s
bpf-next/master:                      43.067M/s
bpf-next/master without Catalin fix:  51.862M/s


This is still very good and covers the sleepable case as well. So let's do this?

Thanks,
Puranjay


