[PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
Joel Fernandes
joelagnelf at nvidia.com
Wed Feb 19 09:08:50 PST 2025
On 2/19/2025 11:18 AM, Valentin Schneider wrote:
> On 19/02/25 10:05, Joel Fernandes wrote:
>> On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote:
>>> On 17/01/25 16:52, Jann Horn wrote:
>>>> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid at redhat.com> wrote:
>>>>> On 14/01/25 19:16, Jann Horn wrote:
>>>>>> On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid at redhat.com> wrote:
>>>>>>> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>>>>>>> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>>>>>>> flush_tlb_kernel_range() IPIs.
>>>>>>>
>>>>>>> Given that CPUs executing in userspace do not access data in the vmalloc
>>>>>>> range, these IPIs could be deferred until their next kernel entry.
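>>>>>>>
>>>>>>> To make the idea concrete, a rough sketch (illustrative only - the
>>>>>>> userspace predicate and the entry hook below are made-up names, not
>>>>>>> the actual implementation):
>>>>>>>
>>>>>>>         /* One flag per CPU: "you owe yourself a kernel TLB flush". */
>>>>>>>         static DEFINE_PER_CPU(bool, kernel_tlb_flush_pending);
>>>>>>>
>>>>>>>         static void flush_all_ipi(void *unused)
>>>>>>>         {
>>>>>>>                 __flush_tlb_all();
>>>>>>>         }
>>>>>>>
>>>>>>>         static void deferrable_flush_tlb_kernel_range(unsigned long start,
>>>>>>>                                                       unsigned long end)
>>>>>>>         {
>>>>>>>                 int cpu;
>>>>>>>
>>>>>>>                 /* range is ignored in this sketch; the deferred replay
>>>>>>>                  * below flushes everything instead. */
>>>>>>>                 for_each_online_cpu(cpu) {
>>>>>>>                         /* made-up predicate standing in for a
>>>>>>>                          * context-tracking check */
>>>>>>>                         if (cpu_running_in_userspace(cpu))
>>>>>>>                                 per_cpu(kernel_tlb_flush_pending, cpu) = true;
>>>>>>>                         else
>>>>>>>                                 smp_call_function_single(cpu, flush_all_ipi,
>>>>>>>                                                          NULL, 1);
>>>>>>>                 }
>>>>>>>         }
>>>>>>>
>>>>>>>         /*
>>>>>>>          * Run early on kernel entry, before any vmalloc'd data can be
>>>>>>>          * touched; real code also needs ordering against the
>>>>>>>          * context-tracking state.
>>>>>>>          */
>>>>>>>         static void kernel_entry_replay_tlb_flush(void)
>>>>>>>         {
>>>>>>>                 if (this_cpu_xchg(kernel_tlb_flush_pending, false))
>>>>>>>                         __flush_tlb_all();      /* coarse, but safe */
>>>>>>>         }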
>>>>>>>
>>>>>>> Deferral vs early entry danger zone
>>>>>>> ===================================
>>>>>>>
>>>>>>> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>>>>>>> and then accessed in early entry code.
>>>>>>
>>>>>> In other words, it needs a guarantee that no vmalloc allocations that
>>>>>> have been created in the vmalloc region while the CPU was idle can
>>>>>> then be accessed during early entry, right?
>>>>>
>>>>> I'm not sure if that would be a problem (not an mm expert, please do
>>>>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>>>>> deferred anyway.
>>>>
>>>> flush_cache_vmap() is about stuff like flushing data caches on
>>>> architectures with virtually indexed caches; that doesn't do TLB
>>>> maintenance. When you look for its definition on x86 or arm64, you'll
>>>> see that they use the generic implementation which is simply an empty
>>>> inline function.
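>>>>
>>>> For reference, the asm-generic version is (roughly) just:
>>>>
>>>>         /* include/asm-generic/cacheflush.h */
>>>>         #ifndef flush_cache_vmap
>>>>         static inline void flush_cache_vmap(unsigned long start, unsigned long end)
>>>>         {
>>>>         }
>>>>         #endif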
>>>>
>>>>> So after vmapping something, I wouldn't expect isolated CPUs to have
>>>>> invalid TLB entries for the newly vmapped page.
>>>>>
>>>>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>>>>> stale TLB entries can and will remain on isolated CPUs, up until they
>>>>> execute the deferred flush themselves (IOW for the entire duration of the
>>>>> "danger zone").
>>>>>
>>>>> Does that make sense?
>>>>
>>>> The design idea wrt TLB flushes in the vmap code is that you don't do
>>>> TLB flushes when you unmap stuff or when you map stuff, because doing
>>>> TLB flushes across the entire system on every vmap/vunmap would be a
>>>> bit costly; instead you just do batched TLB flushes in between, in
>>>> __purge_vmap_area_lazy().
>>>>
>>>> In other words, the basic idea is that you can keep calling vmap() and
>>>> vunmap() a bunch of times without ever doing TLB flushes until you run
>>>> out of virtual memory in the vmap region; then you do one big TLB
>>>> flush, and afterwards you can reuse the free virtual address space for
>>>> new allocations again.
>>>>
>>>> So if you "defer" that batched TLB flush for CPUs that are not
>>>> currently running in the kernel, I think the consequence is that those
>>>> CPUs may end up with incoherent TLB state after a reallocation of the
>>>> virtual address space.
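>>>>
>>>> Schematically, something like this (heavily simplified from
>>>> mm/vmalloc.c; the helpers marked "made-up" don't exist under these
>>>> names):
>>>>
>>>>         /* vfree() path: queue the area, no TLB flush yet. */
>>>>         static void free_vmap_area_noflush(struct vmap_area *va)
>>>>         {
>>>>                 unsigned long nr_lazy;
>>>>
>>>>                 nr_lazy = atomic_long_add_return(va_size(va) >> PAGE_SHIFT,
>>>>                                                  &vmap_lazy_nr);
>>>>                 add_to_purge_list(va);          /* made-up helper */
>>>>
>>>>                 if (nr_lazy > lazy_max_pages())
>>>>                         schedule_work(&drain_vmap_work);
>>>>         }
>>>>
>>>>         /* Eventually: one big flush covering all lazily-freed ranges. */
>>>>         static bool __purge_vmap_area_lazy(unsigned long start,
>>>>                                            unsigned long end)
>>>>         {
>>>>                 flush_tlb_kernel_range(start, end);
>>>>                 free_purged_areas();            /* made-up helper */
>>>>                 return true;    /* VA space is now reusable */
>>>>         }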
>>>>
>>>
>>> Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
>>> that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
>>> CPU accesses it during early entry.
>>
>> So the issue is:
>>
>> CPU1: unmaps vmalloc page X, which was previously mapped to physical page
>> P1.
>>
>> CPU2: does a whole bunch of vmalloc() and vfree() calls, eventually crossing
>> some lazy threshold and sending out the flush IPIs. It then goes ahead and
>> does an allocation that maps the same virtual page X to physical page P2.
>>
>> CPU3 is isolated and executes some early entry code before receiving said
>> IPIs, which are deferred by Valentin's patches.
>>
>> It does not receive the IPI because it is deferred, and thus an access to
>> page X by early entry code on this CPU results in a UAF access to P1.
>>
>> Is that the issue?
>>
>
> Pretty much so, yeah. That is, *if* there is such a vmalloc'd address access
> in early entry code - testing says there isn't, but I haven't found a way to
> verify this instrumentally.
Ok, thanks for confirming. Maybe there is an address-sanitizer way of verifying
this, but yeah, it is subtle and there could be more than one way of solving it.
Too much 'fun' ;)
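
Spelling the scenario above out as a timeline (a sketch of the race, not
actual trace output):

        CPU1 (HK)            CPU2 (HK)                      CPU3 (isolated)
        vunmap(X)
         [X->P1 freed lazily]
                             many vmap()/vunmap() calls
                             __purge_vmap_area_lazy()
                               flush_tlb_kernel_range()
                                 IPI to CPU3 deferred
                             vmap() reuses X for P2
                                                            early entry touches X
                                                            stale TLB: X -> P1 == UAF
                                                            (deferred flush runs too late)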
- Joel