Excessive TLB flush ranges
Uladzislau Rezki
urezki at gmail.com
Fri May 19 10:02:49 PDT 2023
On Fri, May 19, 2023 at 06:32:42PM +0200, Thomas Gleixner wrote:
> On Fri, May 19 2023 at 17:14, Uladzislau Rezki wrote:
> > On Fri, May 19, 2023 at 04:56:53PM +0200, Thomas Gleixner wrote:
> >> > + /* Flush per-VA. */
> >> > + list_for_each_entry(va, &local_purge_list, list)
> >> > + flush_tlb_kernel_range(va->va_start, va->va_end);
> >> >
> >> > - flush_tlb_kernel_range(start, end);
> >> > resched_threshold = lazy_max_pages() << 1;
> >>
> >> That's completely wrong, really.
> >>
> > Absolutely. That is why we do not flush a range per-VA ;-) I provided the
> > data just to show what happens if we do it!
>
> Seriously, you think you need to demonstrate that to me? Did you
> actually read what I wrote?
>
> "I understand why you want to batch and coalesce and rather do a rare
> full tlb flush than sending gazillions of IPIs."
>
Yes, I read it. Since I also mentioned IPIs without providing any data,
I shared the data later, just in case. It was just my observation, nothing more.
> > A per-VA flush works when a system is not capable of doing a full
> > flush and has to flush page by page. In that scenario we should
> > skip the unmapped ranges between the VAs in the purge list.
>
> ARM32 has a full flush as does x86. Just ARM32 does not have a cutoff
> for a full flush in flush_tlb_kernel_range(). That's easily fixable, but
> the underlying problem remains.
>
> The point is that coalescing the VA ranges blindly is also fundamentally
> wrong:
>
>
> start1 = 0x95c8d000 end1 = 0x95c8e000
> start2 = 0xf08a1000 end2 = 0xf08a5000
>
> --> start = 0x95c8d000 end = 0xf08a5000
>
> So this ends up with:
>
> if (end - start > flush_all_threshold)
> ipi_flush_all();
> else
> ipi_flush_range();
>
> So with the above example this ends up with flush_all(), whereas a
> flush_vas() with the list approach I demonstrated (ignoring the
> storage problem, which is fixable) results in
>
> if (total_nr_pages > flush_all_threshold)
> ipi_flush_all();
> else
> ipi_flush_vas();
>
> and that ipi flushes 3 pages instead of taking out the whole TLB, which
> results in a 1% gain on that machine. Not massive, but still.
>
> The blind coalescing is also wrong if the resulting range is not
> gigantic but below the flush_all_threshold. Let's assume a threshold of
> 32 pages.
>
> start1 = 0xf0800000 end1 = 0xf0802000 2 pages
> start2 = 0xf081e000 end2 = 0xf0820000 2 pages
>
> --> start = 0xf0800000 end = 0xf0820000
>
> So because this does not qualify for a full flush and it should not,
> this ends up flushing 32 pages one by one instead of flushing exactly
> four.
>
> IOW, the existing code is fully biased towards full flushes which is
> wrong.
>
> Just because this does not show up in your performance numbers on some
> enterprise workload does not make it more correct.
>
Usually we flush the lazy areas once the lazy_max_pages() threshold is
reached. There are exceptions: when an allocation fails, we drain the
areas (if there are any); the second is the per-CPU allocator; and the
last is vm_reset_perms(), when a "vm" is marked VM_FLUSH_RESET_PERMS.
As for your description, I fully see the problem.
--
Uladzislau Rezki