[PATCH 1/8] mm: Add ptep_try_install() for lockless empty-slot installs

Tue May 19 02:05:54 PDT 2026

On 5/19/26 10:58, Tejun Heo wrote:
> Hello, David.
> 
> On Tue, May 19, 2026 at 10:00:39AM +0200, David Hildenbrand (Arm) wrote:
>> Is that really possible? I'd much rather prefer to trylock and retry, unless
>> that can really result in deadlocks. But I have the feeling that such deadlocks
>> should be impossible here.
> 
> I'm not well versed in either mm or BPF, so the BPF folks will have a
> better take. But here's a scenario that seemed plausible to me:
> 
> 1. A bpf prog calls bpf_arena_alloc_pages() on its arena. The kernel
>    takes arena->spinlock via raw_res_spin_lock_irqsave().
> 2. Under the lock, the alloc path goes through bpf_map_alloc_pages()
>    -> alloc_pages_node(), which fires trace_mm_page_alloc().
> 3. A BPF tracepoint program on mm_page_alloc that shares the arena
>    starts running with the lock still held.
> 4. The tracepoint program calls a kfunc, passing an arena pointer
>    one entry past the array it meant to touch.
> 5. The kfunc dereferences. The kernel-side address is unbacked, so
>    the CPU faults.
> 
> trylock + retry at 5 would A-A deadlock.

Okay, so removing that specific tracepoint (or rather, any tracepoints under the
lock) would solve the problem, right?
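
To check that I read the scenario correctly, here is a minimal user-space
sketch of it. Everything in it is a stand-in I made up for illustration:
arena_lock is a plain atomic_flag rather than the kernel's rqspinlock, and
the function names are hypothetical. It only models the point that the fault
path runs on top of the lock holder, so a trylock + retry loop there can
never succeed.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for arena->spinlock; not the kernel's rqspinlock. */
static atomic_flag arena_lock = ATOMIC_FLAG_INIT;

static bool arena_trylock(void)
{
	/* test_and_set returns the previous value: false means we took it. */
	return !atomic_flag_test_and_set_explicit(&arena_lock,
						  memory_order_acquire);
}

static void arena_unlock(void)
{
	atomic_flag_clear_explicit(&arena_lock, memory_order_release);
}

/* Step 5: the fault path, running nested inside the lock holder. */
static bool fault_path_trylock_retry(int max_retries)
{
	for (int i = 0; i < max_retries; i++) {
		if (arena_trylock()) {
			arena_unlock();
			return true;
		}
	}
	/* The holder is below us on the same stack and cannot release. */
	return false;
}

/* Steps 1-4: the alloc path takes the lock, then the nested program faults. */
static bool alloc_path_with_nested_fault(int max_retries)
{
	bool ok;

	if (!arena_trylock())
		return false;
	ok = fault_path_trylock_retry(max_retries);
	arena_unlock();
	return ok;
}
```

The retry bound only exists so the model terminates; an unbounded retry in
the nested context would spin forever. With nothing nested under the lock,
the same trylock succeeds on the first attempt.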

> 
>> For example, staring at apply_range_set_cb(), what prevents:
>>
>> (1) apply_range_set_cb() finding pte_none(ptep_get(pte))
>> (2) apply_range_set_scratch_cb() succeeding ptep_try_install()
>> (3) apply_range_set_cb() overwriting the pte with set_pte_at()
>>
>> Between (2) and (3) CPUs could access the scratch PTE.
> 
> Scratch only gets installed when BPF passes an unallocated arena
> address to the kernel side, which is itself the violation, reported
> through the program's BPF stream. Behavior at that addr is then
> undefined. For scx, the scheduler should be aborted and torn down.
> 
> The only requirements are that the kernel doesn't oops and the
> violation gets caught. Beyond that, behavior at the address is
> unspecified, and which installer wins the race doesn't matter as
> long as kernel integrity holds.

You'll have inconsistent TLB state.
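
A toy model of the (1)-(3) interleaving I have in mind, as a sketch only. The
PTE encodings, struct toy_mm and all toy_* helpers are invented for
illustration, and the single "tlb" field is a crude stand-in for one CPU's
cached translation. It is not the code from the patch; it only shows a
present -> present change of the entry without any invalidation in between.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef unsigned long pte_val_t;

/* Made-up encodings, not real PTE bits. */
#define PTE_NONE    0UL
#define PTE_SCRATCH 0x1000UL
#define PTE_REAL    0x2000UL

struct toy_mm {
	_Atomic pte_val_t pte;	/* the page table slot */
	pte_val_t tlb;		/* one CPU's cached translation */
};

/* Models a lockless empty-slot install: succeeds only if the slot is none. */
static bool toy_try_install(struct toy_mm *mm, pte_val_t new)
{
	pte_val_t expected = PTE_NONE;

	return atomic_compare_exchange_strong(&mm->pte, &expected, new);
}

/* Models a CPU touching the address: it caches whatever is present. */
static void toy_access(struct toy_mm *mm)
{
	pte_val_t v = atomic_load(&mm->pte);

	if (v != PTE_NONE)
		mm->tlb = v;
}

/* Runs the interleaving; returns true if cached and actual entry differ. */
static bool toy_race(void)
{
	struct toy_mm mm;
	bool saw_none;

	atomic_init(&mm.pte, PTE_NONE);
	mm.tlb = PTE_NONE;

	saw_none = atomic_load(&mm.pte) == PTE_NONE;	/* (1) */
	toy_try_install(&mm, PTE_SCRATCH);		/* (2) */
	toy_access(&mm);				/* between (2) and (3) */
	if (saw_none)
		atomic_store(&mm.pte, PTE_REAL);	/* (3), no flush */

	return mm.tlb != atomic_load(&mm.pte);
}
```

In this model the check in (1) is stale by the time (3) stores, so the CPU
that accessed the scratch entry keeps a translation that no longer matches
the page table.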

I really don't like that approach.

We should really try to just take the lock, and remove any code under the lock
that could trigger such unpleasant deadlocks.

Is that feasible?

-- 
Cheers,

David