[PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics

Wed Dec 17 11:45:14 PST 2025

On 12/17/25 10:44 AM, Puranjay Mohan wrote:
> On Wed, Dec 17, 2025 at 6:24 PM Yonghong Song <yonghong.song at linux.dev> wrote:
>>
>>
>> On 12/17/25 8:56 AM, bot+bpf-ci at kernel.org wrote:
>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>> index 2da986136d26..654fb94bf60c 100644
>>>> --- a/include/linux/bpf.h
>>>> +++ b/include/linux/bpf.h
>>> [ ... ]
>>>
>>>> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
>>>>
>>>>    static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
>>>>    {
>>>> +#ifdef CONFIG_ARM64
>>>> +    u8 rctx = interrupt_context_level();
>>>> +    u8 *active = this_cpu_ptr(prog->active);
>>>> +
>>>> +    active[rctx]++;
>>>> +    barrier();
>>>> +    if (get_unaligned_le32(active) != BIT(rctx * 8))
>>>> +            return false;
>>>> +
>>>> +    return true;
>>>> +#else
>>>>       return this_cpu_inc_return(*(prog->active)) == 1;
>>>> +#endif
>>>>    }
>>> Can preemption between the increment and check cause a counter leak on
>>> CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
>>> rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
>>> (documented at include/linux/rcupdate.h:856).
>>>
>>> Consider this scenario on an ARM64 system with PREEMPT_RCU:
>>>
>>> 1. Thread A increments active[0] to 1
>>> 2. Preemption occurs before Thread A reaches the check
>>> 3. Thread B on same CPU increments active[0] to 2
>>> 4. Thread B checks: sees 2 != BIT(0), returns false
>>> 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
>>> 6. Both threads return false, neither runs BPF
>>> 7. Neither calls bpf_prog_put_recursion_context() (see
>>>      __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
>>> 8. Counter permanently stuck at 2, all future BPF on this CPU fails
>>>
>>> The old atomic code handled this correctly because this_cpu_inc_return()
>>> completes atomically, ensuring Thread A reads the value 1 before Thread B
>>> can interfere. With non-atomic operations, Thread A increments but hasn't
>>> read the value yet, allowing Thread B to interfere and cause both threads
>>> to see the modified value.
>>>
>>> Is there a guarantee that same-context preemption cannot occur in the BPF
>>> execution path on ARM64, or does this need additional synchronization?
>> AI is correct here. See below for another thread:
>>     https://lore.kernel.org/bpf/20251217093326.1745307-1-chen.dylane@linux.dev/T/#m906fd4502fbbedd4609c586122a393363003312a
>> where preempt_disable is necessary to prevent the above scenario.
> See my other reply, the above scenario presented by AI is wrong
> because step 7 is wrong.

Thanks for the explanation. I missed that bpf_trampoline_exit()
is always executed. In that case, the active[rctx] counter will
always be corrected.
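
To double-check my understanding, here is a minimal userspace sketch (not
kernel code) that replays steps 1-5 of the quoted scenario. The names
active_ctr, load_le32(), enter_inc(), enter_check(), put_ctx() and
replay_preempted_enter() are hypothetical stand-ins. The sketch assumes,
as discussed above, that the exit path performs the decrement even when
the enter check fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU prog->active bytes:
 * one 8-bit counter per interrupt context level. */
struct active_ctr {
	uint8_t level[4];
};

/* Read the four counters as one little-endian 32-bit word,
 * mirroring what get_unaligned_le32() does in the patch. */
static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* First half of the enter path: bump this level's counter. */
static void enter_inc(struct active_ctr *a, unsigned int rctx)
{
	assert(rctx < 4);
	a->level[rctx]++;
}

/* Second half: succeed only if our level holds exactly 1 and
 * every other level holds 0. */
static bool enter_check(const struct active_ctr *a, unsigned int rctx)
{
	return load_le32(a->level) == (UINT32_C(1) << (rctx * 8));
}

/* Exit path: undo the increment. Assumed to run unconditionally. */
static void put_ctx(struct active_ctr *a, unsigned int rctx)
{
	a->level[rctx]--;
}

/* Replay: A increments, is preempted, B increments and checks,
 * A resumes and checks, then both take the exit path.
 * Returns the counter word left behind. */
uint32_t replay_preempted_enter(bool *a_ran, bool *b_ran)
{
	struct active_ctr ctr = { { 0, 0, 0, 0 } };

	enter_inc(&ctr, 0);             /* step 1: Thread A */
	enter_inc(&ctr, 0);             /* step 3: Thread B */
	*b_ran = enter_check(&ctr, 0);  /* step 4: B sees 2 */
	*a_ran = enter_check(&ctr, 0);  /* step 5: A sees 2 */

	put_ctx(&ctr, 0);               /* B exits */
	put_ctx(&ctr, 0);               /* A exits */

	return load_le32(ctr.level);
}
```

In this model both checks fail, so neither thread runs the program, but
the two decrements bring the word back to 0. The counter is not left
stuck at 2, which is the point I had missed.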

>
>> But adding preempt_disable may impact the overall system-level performance.
>>
>> Can this patch improve performance for *all* ARM64 CPU versions?
>> Do you have numbers showing how much the performance improves?
> This should improve performance on all arm64 CPUs because atomics are
> expensive: they have to be atomic across all CPUs.

Good to know. Thanks!

>
> I see a 33% improvement in the fentry trigger benchmark, but I can do
> more benchmarking.
>
>>>> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
>>>>
>>>>    static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
>>>>    {
>>>> +#ifdef CONFIG_ARM64
>>>> +    u8 rctx = interrupt_context_level();
>>>> +    u8 *active = this_cpu_ptr(prog->active);
>>>> +
>>>> +    barrier();
>>>> +    active[rctx]--;
>>>> +#else
>>>>       this_cpu_dec(*(prog->active));
>>>> +#endif
>>>>    }
>> [...]