Excessive TLB flush ranges

Thomas Gleixner tglx at linutronix.de
Tue May 16 01:54:09 PDT 2023


On Tue, May 16 2023 at 16:07, Baoquan He wrote:
> On 05/16/23 at 08:40am, Thomas Gleixner wrote:
>> On Tue, May 16 2023 at 10:26, Baoquan He wrote:
>> > On 05/15/23 at 08:17pm, Uladzislau Rezki wrote:
>> >> For systems which lack a full TLB flush, flushing a long range is
>> >> a problem (it takes time); we could probably flush VAs one by one,
>> >> because currently we calculate a flush range [min:max] and that range
>> >> includes space that might not be mapped at all. Like below:
>> >
>> > It's fine if we only calculate a flush range of [min:max] with VAs. In
>> > vm_reset_perms(), it calculates the flush range from the impacted direct
>> > mapping range, then merges it with the VA's range. That looks really
>> > strange and surprising. If the vm->pages[] come from a lower part of
>> > physical memory, the final merged flush will span a tremendous range.
>> > Wondering why we need to merge the direct map range with the VA range,
>> > then do the flush. Not sure if I misunderstand it.
>> 
>> So what happens on this BPF teardown is:
>> 
>> The vfree(8k) ends up flushing 3 entries. The actual vmalloc part (2) and
>> one extra which is in the direct map. I haven't verified that yet, but I
>> assume it's the alias of one of the vmalloc'ed pages.
>
> That looks like the reason. As Uladzislau pointed out, architectures
> may have a full TLB flush, so they won't get in trouble from the merged
> flush in the calculated [min:max] way, e.g. arm64 and x86's
> flush_tlb_kernel_range(). However, arm32 seems to lack a full TLB flush.

ARM has a full flush, but flush_tlb_kernel_range() does not check for
it.

> If agreed, I can make a draft patch to do the flushes for the direct
> map and the VAs separately, and see if it works.

Of course it works. Already done that.

But you are missing the point. Look at the examples I provided.

The current implementation ends up doing a full flush on x86 just to
flush 3 TLB entries, for the very same reason: the flush range
(start..end) becomes insanely large due to the direct map and vmalloc
parts.

But doing individual flushes for the direct map and vmalloc space is
silly too, because then we end up doing two IPIs instead of one. IPIs
are expensive, and the whole point of coalescing the flushes is to spare
IPIs, no?

So with my hacked-up flush_tlb_kernel_vas() I end up with exactly
_one_ IPI which walks the list and flushes the 3 TLB entries.

Thanks,

        tglx

More information about the linux-arm-kernel mailing list