[PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics

Puranjay Mohan puranjay at kernel.org
Wed Dec 17 10:13:29 PST 2025


On Wed, Dec 17, 2025 at 4:56 PM <bot+bpf-ci at kernel.org> wrote:
>
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 2da986136d26..654fb94bf60c 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
>
> [ ... ]
>
> > @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> >
> >  static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> >  {
> > +#ifdef CONFIG_ARM64
> > +     u8 rctx = interrupt_context_level();
> > +     u8 *active = this_cpu_ptr(prog->active);
> > +
> > +     active[rctx]++;
> > +     barrier();
> > +     if (get_unaligned_le32(active) != BIT(rctx * 8))
> > +             return false;
> > +
> > +     return true;
> > +#else
> >       return this_cpu_inc_return(*(prog->active)) == 1;
> > +#endif
> >  }
>
> Can preemption between the increment and check cause a counter leak on
> CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
> rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
> (documented at include/linux/rcupdate.h:856).
>
> Consider this scenario on an ARM64 system with PREEMPT_RCU:
>
> 1. Thread A increments active[0] to 1
> 2. Preemption occurs before Thread A reaches the check
> 3. Thread B on same CPU increments active[0] to 2
> 4. Thread B checks: sees 2 != BIT(0), returns false
> 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
> 6. Both threads return false, neither runs BPF
> 7. Neither calls bpf_prog_put_recursion_context() (see
>    __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
> 8. Counter permanently stuck at 2, all future BPF on this CPU fails

Step 7 is incorrect. Looking at the JIT-generated code, the exit
function is ALWAYS called, regardless of whether the enter function
returns 0 or a start time:

  // x86 JIT at arch/x86/net/bpf_jit_comp.c:2998-3050
  call bpf_trampoline_enter()    // Line 2998
  test rax, rax                   // Line 3006
  je skip_exec                    // Conditional jump
  ... BPF program execution ...   // Lines 3011-3023
  skip_exec:                      // Line 3037 (jump lands here)
  call bpf_trampoline_exit()      // Line 3049 - ALWAYS executed

The bpf_trampoline_exit() call is after the skip_exec label, so it
executes in both cases.
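
In rough C terms (my sketch of the generated control flow, not actual
code from the tree), the trampoline does:

  start = __bpf_prog_enter_recur(prog, run_ctx);    // returns 0 when recursion is detected
  if (start)
          ret = prog->bpf_func(args, prog->insnsi); // run the BPF program
  __bpf_prog_exit_recur(prog, start, run_ctx);      // reached on both paths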

What Actually Happens:

Initial state: active[0] = 0

Thread A (normal context, rctx=0):
1. active[0]++ → active[0] = 1
2. Preempted before barrier()

Thread B (scheduled on same CPU, normal context, rctx=0):
3. active[0]++ → active[0] = 2
4. barrier()
5. get_unaligned_le32(active) → reads 0x00000002
6. Check: 0x00000002 != BIT(0) = 0x00000001 → returns false
7. __bpf_prog_enter_recur returns 0
8. JIT checks return value, skips BPF execution
9. JIT ALWAYS calls __bpf_prog_exit_recur (see
   arch/arm64/net/bpf_jit_comp.c:2362)
10. bpf_prog_put_recursion_context(prog) executes
11. barrier(), active[0]-- → active[0] = 1

Thread A resumes:
12. barrier()
13. get_unaligned_le32(active) → reads 0x00000001 (Thread B already
    decremented!)
14. Check: 0x00000001 == BIT(0) = 0x00000001 → returns true ✓
15. __bpf_prog_enter_recur returns start_time
16. BPF program executes
17. __bpf_prog_exit_recur called
18. bpf_prog_put_recursion_context(prog) executes
19. barrier(), active[0]-- → active[0] = 0 ✓

Final state:
- Counter returns to 0 ✓
- No leak ✓
- Thread B detected interference and aborted ✓
- Thread A executed successfully ✓
- Only ONE thread executed the BPF program ✓
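
For reference, this is the invariant the whole-word read is checking
(my reading of the new helper; the byte-to-context mapping assumes
interrupt_context_level() returns 0 for task, 1 for softirq, 2 for
hardirq and 3 for NMI, and the helper name below is made up just to
label the condition):

  /*
   * The BPF_NR_CONTEXTS per-context u8 counters are read back as one
   * little-endian u32:
   *
   *   bits  0..7   active[0]  task
   *   bits  8..15  active[1]  softirq
   *   bits 16..23  active[2]  hardirq
   *   bits 24..31  active[3]  NMI
   *
   * After active[rctx]++, the word equals BIT(rctx * 8) only when our
   * own counter is exactly 1 and every other context's counter is 0,
   * so same-context nesting and cross-context nesting both fail the
   * check.
   */
  static inline bool recursion_ctx_is_clean(const u8 *active, u8 rctx)
  {
          return get_unaligned_le32(active) == BIT(rctx * 8);
  }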


Now that I think of it, there is another race condition that leads to
NEITHER program running:

Consider this scenario on an arm64 system with PREEMPT_RCU:

1. Thread A increments active[0] from 0 to 1
2. Thread A is preempted before reaching barrier()
3. Thread B (same CPU, same context) increments active[0] from 1 to 2
4. Thread B executes barrier() and checks: sees 2 != BIT(0), returns false
5. Thread A resumes, executes barrier() and checks: sees 2 != BIT(0),
   returns false
6. Both threads return false to __bpf_prog_enter_recur()
7. Both skip BPF program execution
8. Both call bpf_prog_put_recursion_context() and decrement: 2->1->0
9. Neither BPF program executes, but the counter correctly returns to 0

This means that, in case of recursion, the patch changes the behaviour
from "one program gets to run" to "at most one program gets to run",
but given the performance benefits, I think we can accept this change.
I am not sure it would make any difference in practice if both BPF
programs end up being skipped under heavy interleaving like the above.
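
To spell out the difference in code terms (a simplified sketch, not
the actual patch):

  // Old: the increment and the read are one atomic RMW, so whichever
  // caller increments first always reads 1 and gets to run; exactly
  // one of two interleaved callers executes the program.
  run = this_cpu_inc_return(*(prog->active)) == 1;

  // New: the increment and the read are separate steps, so two callers
  // preempting each other in between can both see a "dirty" word and
  // both back off; at most one of them executes the program.
  active[rctx]++;
  barrier();
  run = get_unaligned_le32(active) == BIT(rctx * 8);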

> The old atomic code handled this correctly because this_cpu_inc_return()
> completes atomically, ensuring Thread A reads the value 1 before Thread B
> can interfere. With non-atomic operations, Thread A increments but hasn't
> read the value yet, allowing Thread B to interfere and cause both threads
> to see the modified value.
>
> Is there a guarantee that same-context preemption cannot occur in the BPF
> execution path on ARM64, or does this need additional synchronization?
>
> > @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> >
> >  static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
> >  {
> > +#ifdef CONFIG_ARM64
> > +     u8 rctx = interrupt_context_level();
> > +     u8 *active = this_cpu_ptr(prog->active);
> > +
> > +     barrier();
> > +     active[rctx]--;
> > +#else
> >       this_cpu_dec(*(prog->active));
> > +#endif
> >  }
>
> The barrier() is placed before the decrement. Does this provide the needed
> ordering guarantee, or should there also be a barrier after the decrement
> to ensure visibility?
>
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index c66316e32563..b5063acfcf92 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
>
> [ ... ]
>
> > @@ -112,7 +112,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
> >               vfree(fp);
> >               return NULL;
> >       }
> > -     fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
> > +     fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
> > +                                     bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
>
> The allocation changes to u8[4] for all architectures, but the non-ARM64
> code path still uses *(prog->active) which now dereferences only the first
> byte. This changes the counter from int (typically 4 bytes, range ~2^31)
> to u8 (1 byte, range 256). While 256 recursion levels is sufficient in
> practice, is this intentional for non-ARM64 platforms, or should they
> continue using a full int-sized allocation?

This was intentional so that we don't add another #ifdef here; 256
recursion levels are more than enough. But to keep the previous
recursion levels, I could change the non-ARM64 code to do something
like:

this_cpu_dec(*(int __percpu *)(prog->active));

I can do this in the next version.
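
Concretely, that would make the non-ARM64 helpers look roughly like
this (a sketch only; it relies on the u8[BPF_NR_CONTEXTS] allocation
above staying at least int-sized and 8-byte aligned):

  static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
  {
          return this_cpu_inc_return(*(int __percpu *)(prog->active)) == 1;
  }

  static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
  {
          this_cpu_dec(*(int __percpu *)(prog->active));
  }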

Thanks,
Puranjay


