Excessive TLB flush ranges

Uladzislau Rezki urezki at gmail.com
Tue May 16 08:01:55 PDT 2023


On Tue, May 16, 2023 at 04:38:58PM +0200, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 15:42, Uladzislau Rezki wrote:
> >> _vm_unmap_aliases() collects dirty ranges from per cpu vmap_block_queue
> >> (what ever that is) and hands a start..end range to
> >> __purge_vmap_area_lazy().
> >> 
> >> As I pointed out already, this can also end up being an excessive range
> >> because there is no guarantee that those individual collected ranges are
> >> consecutive. Though I have no idea how to cure that right now.
> >> 
> >> AFAICT this was done to spare flush IPIs, but the mm folks should be
> >> able to explain that properly.
> >> 
> > This is done to prevent generating IPIs. That is why the whole range is
> > calculated once and a flush occurs only once for all lazily registered VAs.
> 
> Sure, but you pretty much enforced flush_tlb_all() by doing that, which
> is not even close to correct.
> 
> This range calculation is only correct when the resulting coalesced
> range is consecutive, but if the resulting coalesced range is huge with
> large holes and only a few pages to flush, then it's actively wrong.
> 
> The architecture has zero chance to decide whether it wants to flush
> single entries or all in one go.
> 
It depends on what is a corner case and what is not. Usually all
allocations are done sequentially; on the other hand, that is not always
true. A good example is module loading/unloading (modules have a special
place in the vmap space). In that scenario we end up quite far in the
vmap space from, for example, the VMALLOC_START point, so it will
require a flush_tlb_all(), yes.
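
For the record, the coalesced flush boils down to something like the
sketch below (simplified from the __purge_vmap_area_lazy() logic; the
"purge_list" name is abbreviated here for illustration):

<sketch>
	struct vmap_area *va;
	unsigned long start = ULONG_MAX, end = 0;

	/* Merge all lazily freed VAs into one [start, end) span. */
	list_for_each_entry(va, &purge_list, list) {
		start = min(start, va->va_start);
		end = max(end, va->va_end);
	}

	/* One flush, one round of IPIs, even if the span has big holes. */
	flush_tlb_kernel_range(start, end);
<sketch>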

>
> There is a world outside of x86, but even on x86 it's borderline silly
> to take the whole TLB out when you can flush 3 TLB entries one by one
> with exactly the same number of IPIs, i.e. _one_. No?
> 
I meant that if we invoke flush_tlb_kernel_range() on each VA's
individual range:

<ARM>
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
	if (tlb_ops_need_broadcast()) {
		struct tlb_args ta;
		ta.ta_start = start;
		ta.ta_end = end;
		/* Broadcast the flush as an IPI to every CPU and wait. */
		on_each_cpu(ipi_flush_tlb_kernel_range, &ta, 1);
	} else
		local_flush_tlb_kernel_range(start, end);
	broadcast_tlb_a15_erratum();
}
<ARM>

we would issue an IPI broadcast and wait once per VA, no?
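
In other words, a hypothetical per-VA loop like the one below would pay
that on_each_cpu() broadcast-and-wait cost once per lazily freed area
(again with an abbreviated "purge_list" name for illustration):

<sketch>
	struct vmap_area *va;

	/* One IPI broadcast (and wait) per VA when broadcast is needed. */
	list_for_each_entry(va, &purge_list, list)
		flush_tlb_kernel_range(va->va_start, va->va_end);
<sketch>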

--
Uladzislau Rezki


