[PATCH 1/8] mm: Add ptep_try_install() for lockless empty-slot installs

Tue May 19 02:40:48 PDT 2026

On 5/19/26 11:05, David Hildenbrand (Arm) wrote:
> On 5/19/26 10:58, Tejun Heo wrote:
>> Hello, David.
>>
>> On Tue, May 19, 2026 at 10:00:39AM +0200, David Hildenbrand (Arm) wrote:
>>> Is that really possible? I'd much rather prefer to trylock and retry, unless
>>> that can really result in deadlocks. But I have the feeling that such deadlocks
>>> should be impossible here.
>>
>> I'm not well versed in either mm or BPF, so the BPF folks will have a
>> better take. But here's a scenario that seemed plausible to me:
>>
>> 1. A bpf prog calls bpf_arena_alloc_pages() on its arena. The kernel
>>    takes arena->spinlock via raw_res_spin_lock_irqsave().
>> 2. Under the lock, the alloc path goes through bpf_map_alloc_pages()
>>    -> alloc_pages_node(), which fires trace_mm_page_alloc().
>> 3. A BPF tracepoint program on mm_page_alloc that shares the arena
>>    starts running with the lock still held.
>> 4. The tracepoint program calls a kfunc, passing an arena pointer
>>    one entry past the array it meant to touch.
>> 5. The kfunc dereferences. The kernel-side address is unbacked, so
>>    the CPU faults.
>>
>> trylock + retry at 5 would A-A deadlock.
> 
> Okay, so removing that specific tracepoint (or rather, any tracepoints under the
> lock) would solve the problem, right?
> 
>>
>>> For example, staring at apply_range_set_cb(), what prevents:
>>>
>>> (1) apply_range_set_cb() finding pte_none(ptep_get(pte))
>>> (2) apply_range_set_scratch_cb() succeeding ptep_try_install()
>>> (3) apply_range_set_cb() overwriting the pte with set_pte_at()
>>>
>>> Between (2) and (3) CPUs could access the scratch PTE.
>>
>> Scratch only gets installed when BPF passes an unallocated arena
>> address to the kernel side, which is itself the violation, reported
>> through the program's BPF stream. Behavior at that addr is then
>> undefined. For scx, the scheduler should be aborted and torn down.
>>
>> The only requirements are that the kernel doesn't oops and the
>> violation gets caught. Beyond that, behavior at the address is
>> unspecified, and which installer wins the race doesn't matter as
>> long as kernel integrity holds.
> 
> You'll have inconsistent TLB state.
> 
> I really don't like that approach.
> 
> We should really try to just take the lock, and remove any code under the lock
> that could trigger such unpleasant deadlocks.
> 
> Is that feasible?
> 
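
To make the (1)-(3) ordering quoted above concrete, here is a minimal
userspace sketch. It is not kernel code: toy_pte_t, toy_try_install() and
the TOY_PTE_* values are hypothetical stand-ins, and the compare-and-exchange
behaviour of ptep_try_install() is only assumed from the patch subject
("lockless empty-slot installs"). TLB effects are not modelled; the sketch
only shows that the unconditional store in (3) replaces an entry that was
already visible after (2).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a page table entry. */
typedef _Atomic uint64_t toy_pte_t;

#define TOY_PTE_NONE    UINT64_C(0)       /* empty slot */
#define TOY_PTE_SCRATCH UINT64_C(0x1000)  /* made-up scratch mapping */
#define TOY_PTE_REAL    UINT64_C(0x2000)  /* made-up real mapping */

/*
 * Assumed semantics of a lockless empty-slot install: write val only if
 * the slot is still empty. Returns nonzero on success.
 */
static int toy_try_install(toy_pte_t *pte, uint64_t val)
{
    uint64_t expected = TOY_PTE_NONE;

    return atomic_compare_exchange_strong(pte, &expected, val);
}

/*
 * Replays the interleaving (1) -> (2) -> (3) in a single thread.
 * *seen_between receives what another CPU would read between (2) and (3).
 * Returns the final entry.
 */
static uint64_t racy_interleaving(uint64_t *seen_between)
{
    toy_pte_t pte = TOY_PTE_NONE;
    int was_none;

    /* (1) the "set" path observes an empty slot */
    was_none = (atomic_load(&pte) == TOY_PTE_NONE);

    /* (2) the "scratch" path wins the lockless install */
    (void)toy_try_install(&pte, TOY_PTE_SCRATCH);
    *seen_between = atomic_load(&pte);

    /* (3) the "set" path acts on its stale check and overwrites */
    if (was_none)
        atomic_store(&pte, TOY_PTE_REAL);

    return atomic_load(&pte);
}
```

Calling racy_interleaving() reports the scratch value in *seen_between and
returns the real value, i.e. the entry changed from one present mapping to
another without going through an empty state.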

... or can we run into similar problems with kprobes? (I am obviously no bpf
expert ...)
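
For reference, the re-entry pattern in steps 1-5 quoted above can be sketched
in userspace as follows. This is an illustration only: the atomic_flag is a
hypothetical stand-in for arena->spinlock and does not model
raw_res_spin_lock_irqsave(); the function names are made up. The retry budget
exists solely so the demo terminates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Non-recursive test-and-set lock; returns true if the lock was taken. */
static bool toy_trylock(atomic_flag *lock)
{
    /* test_and_set returns the previous state: false means it was free. */
    return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

static void toy_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * Models the fault path at step 5: trylock + retry.
 * Returns the number of failed attempts (0 if the lock was free).
 * With an unbounded loop, a caller that already holds the lock would
 * spin forever, because nobody else can release it.
 */
static int fault_path_trylock_retry(atomic_flag *lock, int budget)
{
    int failed = 0;

    while (failed < budget) {
        if (toy_trylock(lock)) {
            toy_unlock(lock);
            break;
        }
        failed++;
    }
    return failed;
}

/* Models steps 1-5 in one thread of control. */
static int alloc_path_then_fault(int budget)
{
    atomic_flag lock = ATOMIC_FLAG_INIT;
    bool got;
    int failed;

    got = toy_trylock(&lock);          /* step 1: alloc path takes the lock */
    assert(got);
    (void)got;

    /* steps 2-5: tracepoint prog faults while the lock is still held */
    failed = fault_path_trylock_retry(&lock, budget);

    toy_unlock(&lock);                 /* only reached because of the budget */
    return failed;
}
```

alloc_path_then_fault(n) fails all n attempts, whereas
fault_path_trylock_retry() on a free lock fails none. The same shape applies
to any code that can run under the lock and fault, whatever mechanism
attaches it.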

-- 
Cheers,

David