[RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly

Thomas Gleixner tglx at linutronix.de
Fri May 19 11:38:22 PDT 2023


On Fri, May 19 2023 at 20:03, Baoquan He wrote:
> After a vb_free() invocation, the va will be purged and put into the
> purge tree/list if the entire vmap_block is dirty. If it is not
> entirely dirty, the vmap_block stays in the percpu vmap_block_queue
> list, as in the two diagrams below:
>
> (1)
>   |-----|------------|-----------|-------|
>   |dirty|still mapped|   dirty   | free  |
>
> (2)
>   |------------------------------|-------|
>   |         dirty                | free  |
>
> In the current _vm_unmap_aliases(), to reclaim those unmapped ranges
> and flush them, it iterates the percpu vbq and calculates the flush
> range from each vmap_block, as in the two cases above. Then it calls
> purge_fragmented_blocks_allcpus() to purge the vmap_block of case (2),
> since no mapping exists in it any more, and puts the purged vmap_block
> va into the purge tree/list. Later, __purge_vmap_area_lazy() calculates
> the flush range from the purge list again. Obviously, this takes the
> vmap_block va of case (2) into account twice.
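
For reference, the flush range computation in question looks like this
in _vm_unmap_aliases() (trimmed sketch of the mainline code this
discussion is against, not a verbatim quote):

	for_each_possible_cpu(cpu) {
		struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
		struct vmap_block *vb;

		rcu_read_lock();
		list_for_each_entry_rcu(vb, &vbq->free, free_list) {
			spin_lock(&vb->lock);
			if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
				unsigned long va_start = vb->va->va_start;
				unsigned long s, e;

				s = va_start + (vb->dirty_min << PAGE_SHIFT);
				e = va_start + (vb->dirty_max << PAGE_SHIFT);

				start = min(s, start);
				end   = max(e, end);

				flush = 1;
			}
			spin_unlock(&vb->lock);
		}
		rcu_read_unlock();
	}

	mutex_lock(&vmap_purge_lock);
	purge_fragmented_blocks_allcpus();
	if (!__purge_vmap_area_lazy(start, end) && flush)
		flush_tlb_kernel_range(start, end);
	mutex_unlock(&vmap_purge_lock);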

Which made me look deeper into purge_fragmented_blocks()

	list_for_each_entry_rcu(vb, &vbq->free, free_list) {

		if (!(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS))
			continue;

		spin_lock(&vb->lock);
		if (vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS) {

That means if an allocation does not find something in the free list
then this can happen:

vaddr = vb_alloc(size)
	vaddr = new_vmap_block(order, gfp_mask);

vb_free(vaddr, size)
	vb->dirty += 1UL << order;

purge_fragmented_blocks()
	purge(most_recently_allocated_block);

vaddr = vb_alloc(size)
	vaddr = new_vmap_block(order, gfp_mask);

How does that make sense?

That block would have hundreds of pages left and is the most recently
allocated one. So the next vb_alloc() has to allocate a new block
instead of using the one which was allocated just before.

This clearly lacks a free space check so that blocks which have more
free space than a certain threshold are not thrown away prematurely.
Maybe it wants an age check too, so that blocks which are unused for a
long time can be recycled, but that's an orthogonal issue.
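
Something like the below perhaps (untested sketch; vb_worth_purging()
and VMAP_PURGE_THRESHOLD are made up for illustration and do not exist
in the current code):

	#define VMAP_PURGE_THRESHOLD	(VMAP_BBMAP_BITS / 4)

	static bool vb_worth_purging(struct vmap_block *vb)
	{
		/* Live mappings left or already fully dirty? Not a candidate. */
		if (vb->free + vb->dirty != VMAP_BBMAP_BITS ||
		    vb->dirty == VMAP_BBMAP_BITS)
			return false;

		/* Keep blocks which still have plenty of usable free space */
		return vb->free < VMAP_PURGE_THRESHOLD;
	}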


That aside, your patch still does not address what I pointed out to you
and what my patch cures:

		 pages     bits    dirtymin            dirtymax
   vb_alloc(A)   255     0 -  254  VMAP_BBMAP_BITS     0
   vb_alloc(B)   255   255 -  509  VMAP_BBMAP_BITS     0
   vb_alloc(C)   255   510 -  764  VMAP_BBMAP_BITS     0
   vb_alloc(D)   255   765 - 1019  VMAP_BBMAP_BITS     0

The block stays on the free list because there are still 4 pages of
free space left, and it stays there until either _all_ free space is
used or _all_ allocated space is freed.

Now the first allocation gets freed:

   vb_free(A)    255     0 - 254  0                    254

From there on _every_ invocation of _vm_unmap_aliases() will see this
range as long as the block is on the free list:

	list_for_each_entry_rcu(vb, &vbq->free, free_list) {
		spin_lock(&vb->lock);
		if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {

because this condition is true. So this flushes the same range over and
over, no?

This flush range grows over time as more allocations are freed, up to
the point where the block vanishes from the free list.
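
The cure in my patch is to reset the per block dirty range once it has
been folded into the flush range. Roughly (simplified sketch, not the
full patch):

			s = va_start + (vb->dirty_min << PAGE_SHIFT);
			e = va_start + (vb->dirty_max << PAGE_SHIFT);

			start = min(s, start);
			end   = max(e, end);

			/* Prevent flushing the same range over and over */
			vb->dirty_min = VMAP_BBMAP_BITS;
			vb->dirty_max = 0;

			flush = 1;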

By resetting vb->dirty_min/max the freed range is only flushed _once_,
no? The resulting flush range might still be excessively large as I
pointed out before:

1) Flush after freeing A

   vb_free(A)    2     0 - 1  0                   1
   Flush                      VMAP_BBMAP_BITS     0    <- correct
   vb_free(C)    2     6 - 7  6                   7
   Flush                      VMAP_BBMAP_BITS     0    <- correct

2) No flush between freeing A and C

   vb_free(A)    2     0 - 1  0                   1
   vb_free(C)    2     6 - 7  0                   7
   Flush                      VMAP_BBMAP_BITS     0    <- overbroad flush

3) No flush between freeing A, C, B

   vb_free(A)    2     0 - 1  0                   1
   vb_free(C)    2     6 - 7  0                   7
   vb_free(B)    2     2 - 5  0                   7
   Flush                      VMAP_BBMAP_BITS     0    <- correct

Obviously case 2 could be

   vb_free(A)    2     0 - 1        0                1
   vb_free(X)    2  1000 - 1001  1000             1001

   so that the resulting flush covers 1002 pages instead of the 4 pages
   which actually need flushing, right? Again, that does not make sense.


The other issue I pointed out:

Assume the block has (for simplicity) 255 allocations of 4 pages each,
and again 4 pages of free space.

254 of those allocations are freed, which means one remains mapped. All
254 freed ranges are flushed via __purge_vmap_area_lazy() over time.

Now the last allocation is freed and the block is moved to the
purge_vmap_area_list, which then does a full flush of the complete
area, i.e. 4MB (1024 pages of 4k) in that case, while in fact it only
needs to flush the 4 pages of that last allocation.
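
For reference, that handover happens at the end of vb_free(), where
free_vmap_block() feeds the block's va into the lazy purge machinery
(trimmed from the current code):

	spin_lock(&vb->lock);

	/* Expand the block's dirty range to cover this free */
	vb->dirty_min = min(vb->dirty_min, offset);
	vb->dirty_max = max(vb->dirty_max, offset + (1UL << order));

	vb->dirty += 1UL << order;
	if (vb->dirty == VMAP_BBMAP_BITS) {
		/* Fully dirty: hand the block over to the purge list */
		BUG_ON(vb->free);
		spin_unlock(&vb->lock);
		free_vmap_block(vb);
	} else
		spin_unlock(&vb->lock);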


Also these intermediate flushes are inconsistent with how fully
utilized blocks are handled:

vb_alloc()
	if (vb->free == 0)
		list_del_rcu(&vb->free_list);

So all allocations which are freed after that point stay unflushed
until the last allocation is freed, which moves the block to the
purge_vmap_area_list, where it gets a full VA range flush.

IOW, for blocks on the free list this cares about unflushed mappings of
freed spaces, but for fully utilized blocks with freed spaces that
obviously does not matter, right?

So either we care about flushing the mappings of freed spaces or we do
not, but caring in one case and ignoring it in the other case is
inconsistent at best.

Thanks,

        tglx



