[PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics

Alexei Starovoitov alexei.starovoitov at gmail.com
Wed Dec 17 10:46:21 PST 2025


On Wed, Dec 17, 2025 at 10:13 AM Puranjay Mohan <puranjay at kernel.org> wrote:
>
> On Wed, Dec 17, 2025 at 4:56 PM <bot+bpf-ci at kernel.org> wrote:
> >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 2da986136d26..654fb94bf60c 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> >
> > [ ... ]
> >
> > > @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> > >
> > >  static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> > >  {
> > > +#ifdef CONFIG_ARM64
> > > +     u8 rctx = interrupt_context_level();
> > > +     u8 *active = this_cpu_ptr(prog->active);
> > > +
> > > +     active[rctx]++;
> > > +     barrier();
> > > +     if (get_unaligned_le32(active) != BIT(rctx * 8))
> > > +             return false;
> > > +
> > > +     return true;
> > > +#else
> > >       return this_cpu_inc_return(*(prog->active)) == 1;
> > > +#endif
> > >  }
> >
> > Can preemption between the increment and check cause a counter leak on
> > CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
> > rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
> > (documented at include/linux/rcupdate.h:856).
> >
> > Consider this scenario on an ARM64 system with PREEMPT_RCU:
> >
> > 1. Thread A increments active[0] to 1
> > 2. Preemption occurs before Thread A reaches the check
> > 3. Thread B on same CPU increments active[0] to 2
> > 4. Thread B checks: sees 2 != BIT(0), returns false
> > 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
> > 6. Both threads return false, neither runs BPF
> > 7. Neither calls bpf_prog_put_recursion_context() (see
> >    __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
> > 8. Counter permanently stuck at 2, all future BPF on this CPU fails
>
> Step 7 is incorrect. Looking at the JIT-generated code, the exit
> function is ALWAYS called, regardless of whether the enter function
> returns 0 or a start time:
>
>   // x86 JIT at arch/x86/net/bpf_jit_comp.c:2998-3050
>   call bpf_trampoline_enter()    // Line 2998
>   test rax, rax                   // Line 3006
>   je skip_exec                    // Conditional jump
>   ... BPF program execution ...   // Lines 3011-3023
>   skip_exec:                      // Line 3037 (jump lands here)
>   call bpf_trampoline_exit()      // Line 3049 - ALWAYS executed
>
>   The bpf_trampoline_exit() call is after the skip_exec label, so it
> executes in both cases.
>
> What Actually Happens:
>
>   Initial state: active[0] = 0
>
>   Thread A (normal context, rctx=0):
>   1. active[0]++ → active[0] = 1
>   2. Preempted before barrier()
>
>   Thread B (scheduled on same CPU, normal context, rctx=0):
>   3. active[0]++ → active[0] = 2
>   4. barrier()
>   5. get_unaligned_le32(active) → reads 0x00000002
>   6. Check: 0x00000002 != BIT(0) = 0x00000001 → returns false
>   7. __bpf_prog_enter_recur returns 0
>   8. JIT checks return value, skips BPF execution
>   9. JIT ALWAYS calls __bpf_prog_exit_recur (see
> arch/arm64/net/bpf_jit_comp.c:2362)
>   10. bpf_prog_put_recursion_context(prog) executes
>   11. barrier(), active[0]-- → active[0] = 1
>
>   Thread A resumes:
>   12. barrier()
>   13. get_unaligned_le32(active) → reads 0x00000001 (Thread B already
> decremented!)
>   14. Check: 0x00000001 == BIT(0) = 0x00000001 → returns true ✓
>   15. __bpf_prog_enter_recur returns start_time
>   16. BPF program executes
>   17. __bpf_prog_exit_recur called
>   18. bpf_prog_put_recursion_context(prog) executes
>   19. barrier(), active[0]-- → active[0] = 0 ✓
>
>   Final State
>
>   - Counter returns to 0 ✓
>   - No leak ✓
>   - Thread B detected interference and aborted ✓
>   - Thread A executed successfully ✓
>   - Only ONE thread executed the BPF program ✓
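For readers who want to reproduce the arithmetic, here is a standalone
userspace model of the byte-packed check (not kernel code; it assumes
BPF_NR_CONTEXTS == 4, a little-endian host, and open-codes what
get_unaligned_le32() does). It replays the interleaving above and shows
the counter draining back to zero:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint8_t active[4];        /* one u8 counter per context level */

static uint32_t read_le32(const uint8_t *p)
{
        /* stand-in for get_unaligned_le32() */
        return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
               (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static bool try_enter(unsigned int rctx)
{
        active[rctx]++;
        /* pass only if our own byte is the sole non-zero byte, set to 1 */
        return read_le32(active) == (1u << (rctx * 8));
}

static void put(unsigned int rctx)
{
        active[rctx]--;
}

int main(void)
{
        /* Thread A increments, then is "preempted" before its check */
        active[0]++;

        /* Thread B does a full enter/exit in the same context */
        bool b_ok = try_enter(0);
        printf("B sees 0x%08x -> %s\n", read_le32(active),
               b_ok ? "run" : "abort");          /* 0x00000002 -> abort */
        put(0);                                  /* paired exit, back to 1 */

        /* Thread A resumes and performs its deferred check */
        bool a_ok = read_le32(active) == (1u << (0 * 8));
        printf("A sees 0x%08x -> %s\n", read_le32(active),
               a_ok ? "run" : "abort");          /* 0x00000001 -> run */
        put(0);                                  /* paired exit, back to 0 */

        printf("final counter: %u\n", active[0]);    /* 0, no leak */
        return 0;
}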
>
>
> Now that I think of it, there is another race condition that leads to
> NEITHER program running:
>
> Consider this scenario on an arm64 system with PREEMPT_RCU:
>
> 1. Thread A increments active[0] from 0 to 1
> 2. Thread A is preempted before reaching barrier()
> 3. Thread B (same CPU, same context) increments active[0] from 1 to 2
> 4. Thread B executes barrier() and checks: sees 2 != BIT(0), returns false
> 5. Thread A resumes, executes barrier() and checks: sees 2 != BIT(0),
> returns false
> 6. Both threads return false to __bpf_prog_enter_recur()
> 7. Both skip BPF program execution
> 8. Both call bpf_prog_put_recursion_context() and decrement: 2->1->0
> 9. Neither BPF program executes, but the counter correctly returns to 0
>
> This means the patch is changing the behaviour in case of recursion
> from "One program gets to run" to
> "At most one program gets to run", but given the performance benefits,
> I think we can accept this change.

Agree. It's fine, but we can mitigate it by doing this rctx trick
only when RCU is not preemptible. That would pretty much mean that
PREEMPT_RT will use atomics and !RT will use rctx, so this
'no prog executes' case will not happen.
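
Roughly (untested sketch; the exact Kconfig gate is debatable, it could
be keyed off CONFIG_PREEMPT_RCU as below or off CONFIG_PREEMPT_RT):

static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
{
#if defined(CONFIG_ARM64) && !defined(CONFIG_PREEMPT_RCU)
        /*
         * With non-preemptible RCU the trampoline's read-side section
         * cannot be preempted, so two tasks cannot interleave here in
         * the same context; only IRQ/NMI entry can nest, and each of
         * those levels owns its own byte.
         */
        u8 rctx = interrupt_context_level();
        u8 *active = this_cpu_ptr(prog->active);

        active[rctx]++;
        barrier();
        if (get_unaligned_le32(active) != BIT(rctx * 8))
                return false;

        return true;
#else
        /* preemptible RCU (incl. PREEMPT_RT): keep the atomic counter */
        return this_cpu_inc_return(*(prog->active)) == 1;
#endif
}

bpf_prog_put_recursion_context() would need the same gate.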

> I am not sure what would change in practice if we allow both BPF
> programs to be skipped under heavy interleaving like the above.
>
> > The old atomic code handled this correctly because this_cpu_inc_return()
> > completes atomically, ensuring Thread A reads the value 1 before Thread B
> > can interfere. With non-atomic operations, Thread A increments but hasn't
> > read the value yet, allowing Thread B to interfere and cause both threads
> > to see the modified value.
> >
> > Is there a guarantee that same-context preemption cannot occur in the BPF
> > execution path on ARM64, or does this need additional synchronization?
> >
> > > @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> > >
> > >  static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
> > >  {
> > > +#ifdef CONFIG_ARM64
> > > +     u8 rctx = interrupt_context_level();
> > > +     u8 *active = this_cpu_ptr(prog->active);
> > > +
> > > +     barrier();
> > > +     active[rctx]--;
> > > +#else
> > >       this_cpu_dec(*(prog->active));
> > > +#endif
> > >  }
> >
> > The barrier() is placed before the decrement. Does this provide the needed
> > ordering guarantee, or should there also be a barrier after the decrement
> > to ensure visibility?
> >
> > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > index c66316e32563..b5063acfcf92 100644
> > > --- a/kernel/bpf/core.c
> > > +++ b/kernel/bpf/core.c
> >
> > [ ... ]
> >
> > > @@ -112,7 +112,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
> > >               vfree(fp);
> > >               return NULL;
> > >       }
> > > -     fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
> > > +     fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
> > > +                                     bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
> >
> > The allocation changes to u8[4] for all architectures, but the non-ARM64
> > code path still uses *(prog->active) which now dereferences only the first
> > byte. This changes the counter from int (typically 4 bytes, range ~2^31)
> > to u8 (1 byte, range 256). While 256 recursion levels is sufficient in
> > practice, is this intentional for non-ARM64 platforms, or should they
> > continue using a full int-sized allocation?
>
> This was intentional so that we don't add another #ifdef here; 256
> recursion levels are more than enough.

256 will be fine for !RT,
so the suggestion above will address this concern as well when 256+
tasks are racing concurrently on this CPU.
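
Spelled out, that could look like the following in
bpf_prog_alloc_no_stats() (illustrative only; prog->active's declared
type would need the matching #ifdef, which is exactly the extra churn
Puranjay wanted to avoid):

#if defined(CONFIG_ARM64) && !defined(CONFIG_PREEMPT_RCU)
        /* four u8 counters, one per interrupt_context_level() */
        fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
                                        bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
#else
        /* atomic path: keep a full int so the counter is not capped at 255 */
        fp->active = alloc_percpu_gfp(int,
                                      bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
#endif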


