[RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly

Thomas Gleixner tglx at linutronix.de
Sun May 21 16:10:54 PDT 2023


On Sat, May 20 2023 at 07:46, Baoquan He wrote:
> On 05/19/23 at 08:38pm, Thomas Gleixner wrote:
>> That block would have hundreds of pages left and is the most recently
>> allocated. So the next vb_alloc() has to reallocate one instead of using
>> the one which was allocated just before.
>> 
>> This clearly lacks a free space check so that blocks which have more
>> free space than a certain threshold are not thrown away prematurely.
>> Maybe it wants an age check too, so that blocks which are unused for a
>> long time can be recycled, but that's an orthogonal issue.
>
> You are right, the vmap_block alloc/free does have the issue you pointed
> out here. What I can say in defence is that it should be fine as long as
> VM_FLUSH_RESET_PERMS memory doesn't upset the situation. As we see, the
> lazy flush is only triggered when lazy_max_pages() is met, or
> alloc_vmap_area() can't find an available range. If either of these
> happens, it means we really need to flush and reclaim the unmapped areas
> into the free list/tree since the vmalloc address space has run out.
> Even if a vmap_block has much free space left, it still needs to be
> purged to cope with the emergency.
>
> So, if we pick VM_FLUSH_RESET_PERMS memory out and flush it alone, and

That might be counterproductive depending on the usage scenario,
especially as BPF filtering is gaining traction.

> set a threshold for vmap_block purging, is it better?

The point is that there is no differentiation between:

  1) Purging freed areas, i.e. when vmap_lazy_nr hits the threshold,
     from drain_vmap_area_work()

  2) Reclaiming vmalloc address space from pcpu_get_vm_areas()

  3) _vm_unmap_aliases()

#1 You really don't want to reclaim vmap blocks from the per cpu free
   lists, unless they have only a few free pages left and no active
   mappings.

   There is no reason to zap vmap blocks unconditionally, simply because
   reclaiming all lazy areas drains at least

      32MB * fls(num_online_cpus())

   per invocation, which is plenty, no? (See the sketch after this list
   for where that number comes from.)

#2 You might want to try #1 first and if that does not help then reclaim
   all vmap blocks on the per cpu lists which have no active mappings
   unconditionally.

#3 _vm_unmap_aliases() requires touching everything because the caller has
   no clue which vmap_area used a particular page and the vmap_area lost
   that information too.

   Except for the vfree + VM_FLUSH_RESET_PERMS case, which removes the
   vmap area first and then cares about the flush. That in turn requires
   a full walk of _all_ vmap areas including the one which was just
   added to the purge list.

   I can see the point that this is used to combine outstanding TLB
   flushes and do the housekeeping of purging freed areas, but like #1
   there is no real good reason to zap usable vmap blocks
   unconditionally.
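
For reference, the drain amount quoted in #1 comes from the
lazy_max_pages() heuristic. A minimal sketch, paraphrasing the current
mm/vmalloc.c implementation from memory, so take the exact constants
with a grain of salt:

  /*
   * Lazily freed areas are drained once their page count exceeds this
   * threshold: roughly 32MB worth of pages, scaled by
   * fls(num_online_cpus()).
   */
  static unsigned long lazy_max_pages(void)
  {
          unsigned int log;

          log = fls(num_online_cpus());

          return log * (32UL * 1024 * 1024 / PAGE_SIZE);
  }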

I'm all for consolidating functionality, but that one-size-fits-all
swiss army knife approach does not make sense to me.

>> The other issue I pointed out:
>> 
>> Assume the block has (for simplicity) 255 allocations of 4 pages
>> each, leaving again 4 pages of free space.
>> 
>> 254 allocations are freed, which means there is one remaining
>> mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over
>> time.
>> 
>> Now the last allocation is freed and the block is moved to the
>> purge_vmap_area_list, which then does a full flush of the complete area,
>> i.e. 4MB in that case, while in fact it only needs to flush 2 pages.
>
> It's easy to fix. For vmap_block, I have marked it in va->flags with
> VMAP_RAM|VMAP_BLOCK. When flushing vas on the purge list, we can skip
> the vmap_block vas.

How can you skip that? The last 2 pages in that VA still need to be
flushed, no?

But VA has no information about already done partial flushes via the
vmap_block, so this requires flushing the full VA range unconditionally.
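
For context, the already-flushed vs. still-dirty state only lives in the
vmap_block itself; the vmap_area that ends up on the purge list does not
carry it. Roughly, with fields abridged from mm/vmalloc.c:

  struct vmap_block {
          spinlock_t              lock;
          struct vmap_area        *va;
          unsigned long           free, dirty;
          /* Unmapped but not yet flushed offsets within the block */
          unsigned long           dirty_min, dirty_max;
          struct list_head        free_list;
          struct rcu_head         rcu_head;
  };

Once the VA is handed to the purge list, dirty_min/dirty_max are gone,
hence the unconditional full range flush.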

That made me think about the following scenario:

vm_map_ram()
  vb_alloc()
    // Uses page P
    vb->free -= 1UL << order;
    if (vb->free == 0)
      list_del_rcu(&vb->free_list);

vm_unmap_ram()
  vb_free()
    Does not free the last mapping of the vmap_block
    Clears the page table entry for page P, but does not flush TLB

__free_page(P)

page P gets reused

vm_unmap_aliases()

  Does not find the VA which used page P because the VB is neither on
  the free_list nor is the VA on the purge_list

  Unless _vm_unmap_aliases() finds a large enough range to cover
  page P or ends up with a flush_tlb_all(), the stale TLB(s) persist.

No?

Same problem in this scenario:

addr = vm_map_ram()
  vb_alloc()
    vb->free -= 1UL << order;
    if (vb->free == 0)
      list_del_rcu(&vb->free_list);

set_memory_*(addr)
  vm_unmap_aliases()

    Does not find the VA which contains @addr because the VB is neither
    on the free_list nor is the VA on the purge_list

    Unless _vm_unmap_aliases() finds a large enough range to cover
    @addr or ends up with a flush_tlb_all(), the stale TLB(s) persist.

No?
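
In both scenarios the hole is the same: _vm_unmap_aliases() only walks
the per-CPU free lists, so a block which vb_alloc() removed from its
free_list is never inspected. Roughly, with details elided (structure
paraphrased from mm/vmalloc.c):

  static void _vm_unmap_aliases(unsigned long start, unsigned long end,
                                int flush)
  {
          int cpu;

          for_each_possible_cpu(cpu) {
                  struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
                  struct vmap_block *vb;

                  rcu_read_lock();
                  /*
                   * Only blocks on the free_list are seen here. A fully
                   * allocated block (vb->free == 0) was taken off the
                   * list by vb_alloc() and is skipped, even if it has
                   * unflushed dirty ranges.
                   */
                  list_for_each_entry_rcu(vb, &vbq->free, free_list) {
                          spin_lock(&vb->lock);
                          if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
                                  /* Extend [start, end) by the dirty range, set flush */
                          }
                          spin_unlock(&vb->lock);
                  }
                  rcu_read_unlock();
          }

          /* Purge the lazy list and flush [start, end) if required */
          mutex_lock(&vmap_purge_lock);
          if (!__purge_vmap_area_lazy(start, end) && flush)
                  flush_tlb_kernel_range(start, end);
          mutex_unlock(&vmap_purge_lock);
  }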

> I don't know how you will tackle the per va flush Nadav pointed out,
> so I will not give draft code.

It's not that hard.

An array of ranges which can be memcpy()'ed into that existing x86
percpu flush info thing, which avoids the cache line issue Nadav is
concerned about.

That array would have obviously a limited number of entries as anything
with multiple ranges is most likely going to end up in a full flush
anyway. The drain_vmap_area_work() case definitely does so, as
vmap_lazy_nr is guaranteed to cover at least 32MB/PAGE_SIZE worth of
pages.
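
A minimal sketch of what such a range array could look like. Everything
below is made up for illustration; none of these names exist in
mm/vmalloc.c or the x86 flush code:

  /* Arbitrary cap: more disjoint ranges than this means a full flush */
  #define NR_TLB_FLUSH_RANGES     8

  struct tlb_flush_range {
          unsigned long   start;
          unsigned long   end;
  };

  struct tlb_flush_ranges {
          unsigned int            nr;
          bool                    full;
          struct tlb_flush_range  range[NR_TLB_FLUSH_RANGES];
  };

  static void tlb_flush_ranges_add(struct tlb_flush_ranges *r,
                                   unsigned long start, unsigned long end)
  {
          if (r->full)
                  return;

          if (r->nr == NR_TLB_FLUSH_RANGES) {
                  /* Too many disjoint ranges, a full flush is cheaper */
                  r->full = true;
                  return;
          }

          r->range[r->nr].start = start;
          r->range[r->nr].end   = end;
          r->nr++;
  }

The collected array could then be memcpy()'ed into the per-CPU flush
info on the x86 side instead of issuing one flush per range.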

I just got distracted and did not come around to finish it for various
reasons. One of them is to make sense of this code :)

>> Also these intermediate flushes are inconsistent versus how fully
>> utilized blocks are handled:
>> 
>> vb_alloc()
>>       if (vb->free == 0)
>>           list_del(vb->free_list);
>> 
>> So all allocations which are freed after that point stay unflushed until
>> the last allocation is freed which moves the block to the
>> purge_vmap_area_list, where it gets a full VA range flush.
>
> That may be risky if they stay unflushed until the last allocation is
> freed. We use the vm_map_ram() interface to map passed-in pages into
> the vmalloc area. If vb_free() is called, the sub-region has been
> unmapped and the user may have released the pages. Users of
> vm_unmap_aliases() may be impacted if we don't flush those areas freed
> with vb_free(). In reality, those areas have been unmapped, while there
> are still TLB entries for them. Not very sure about that.
>
> If we can hold off the vmap_block flush until purging it without risk,
> it will save us a lot of trouble.

The point is that the TLBs _must_ be flushed

  1) Before the virtual address space occupied by a vmap_area is
     released and can be reused.

     So one might argue that flushing TLBs early is good to detect use
     after free.

     The DEBUG_PAGEALLOC case flushes right after
     vunmap_range_noflush(), which is the proper thing to do for
     debugging as any UAF will be immediately detected after the
     function returns. (See the snippet after this list.)

     The production case flushes lazily: fully utilized blocks are not
     on the free list and are therefore not flushed until they are
     purged.

     The partial flush via _vm_unmap_aliases() solves a completely
     different problem. See #2/#3 below.

  2) Before a page, whose mapping was torn down but whose TLB entries
     were not yet flushed, can be reused. Also when the mapping
     attributes of a page are changed via set_memory_*()

     That's what vm_unmap_aliases() -> _vm_unmap_aliases() is for. 

  3) When an executable mapping is torn down

     That's the vfree() + VM_FLUSH_RESET_PERMS case which also ends up
     in _vm_unmap_aliases() 
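
The immediate flush in the debug case mentioned in #1 is essentially the
following, paraphrased from vb_free() in mm/vmalloc.c:

          /* Tear down the page table entries, but do not flush the TLB */
          vunmap_range_noflush(addr, addr + size);

          /*
           * With CONFIG_DEBUG_PAGEALLOC the TLB is flushed right away,
           * so a use after free is caught as soon as vb_free() returns.
           * The production case defers the flush.
           */
          if (debug_pagealloc_enabled_static())
                  flush_tlb_kernel_range(addr, addr + size);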

I completely understand that there needs to be a balance between the
high utilization use case and the occasional one which made us look
into this. That's fine, but there needs to be a way to keep the
overhead minimal for both.

/me tries to find some spare cycles ...

Thanks,

        tglx





