Excessive TLB flush ranges

Nadav Amit nadav.amit at gmail.com
Wed May 17 15:41:51 PDT 2023



> On May 17, 2023, at 4:47 AM, Thomas Gleixner <tglx at linutronix.de> wrote:
> 
> On Wed, May 17 2023 at 12:31, Thomas Gleixner wrote:
>> On Tue, May 16 2023 at 18:23, Nadav Amit wrote:
>>>> INVLPG is not serializing so the CPU can pull in the next required cache
>>>> line(s) on the VA list during that.
>>> 
>>> Indeed, but ChatGPT says (yes, I see you making fun of me already):
>>> “however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
>>> can cause a pipeline stall because the TLB entry invalidation must be
>>> completed before subsequent instructions that might rely on the TLB can
>>> be executed correctly.”
>>> 
>>> So I am not sure that your claim is exactly correct.
>> 
>> Key is a subsequent instruction which might depend on the to be flushed
>> TLB entry. That's obvious, but I'm having a hard time constructing that
>> dependent instruction in this case.
> 
> But obviously a full TLB flush _is_ guaranteed to stall the pipeline,
> right?

Right. I had a discussion about it with ChatGPT but it started to say BS,
so here is my understanding of the matter.

IIUC, when you flush a TLB entry, the CPU may have trouble figuring out
which memory addresses' translations are affected. The usual RAW
(read-after-write) conflict detection mechanisms are probably useless in
such a case, since even the granularity of the invalidation (e.g.,
4KB/2MB/1GB) might be unknown at decode time. It is not impossible to
obtain this information; I am just doubtful that CPU architects have
optimized this flow.
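
To make the dependent-instruction case concrete, a minimal illustrative
sketch (invlpg_then_load() is a made-up name, and INVLPG is privileged, so
this is not runnable from user space):

/*
 * Illustration only, not kernel code: a load that issues right after an
 * INVLPG. The CPU cannot cheaply prove at decode time that *p is
 * unaffected by the invalidation (it does not even know yet whether a
 * 4KB, 2MB or 1GB translation is being dropped), so it may have to hold
 * the load back until the INVLPG completes.
 */
static inline unsigned long invlpg_then_load(void *flush_addr,
					     const unsigned long *p)
{
	asm volatile("invlpg (%0)" :: "r" (flush_addr) : "memory");
	return *p;	/* the "dependent instruction" in question */
}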

As a result, I think pieces of code such as the following are affected
(taken from flush_tlb_func()):

                while (addr < f->end) {
                        flush_tlb_one_user(addr);
                        addr += 1UL << f->stride_shift;
                }

flush_tlb_one_user() has a memory clobber on the INVLPG. As a result,
f->end and f->stride_shift need to be reread from memory on every
iteration, and those loads can stall behind the invalidation.
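
For reference, the x86 helper behind flush_tlb_one_user() emits the INVLPG
roughly like this (paraphrased, with flush_one() as a stand-in name):

/*
 * Paraphrase, not a verbatim copy of the kernel helper. The "memory"
 * clobber is what forces the compiler to reload f->end and
 * f->stride_shift on every iteration of the loop above.
 */
static inline void flush_one(unsigned long addr)
{
	asm volatile("invlpg (%0)" :: "r" (addr) : "memory");
}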

Since f->end and f->stride_shift already reside in the L1 cache, the
impact is low. Having said that, the fact that flush_tlb_one_user() also
flushes the “PTI” (userspace) mappings does introduce a small overhead
relative to the alternative of having two separate loops, one for the
kernel mappings and one for the userspace mappings, when PTI is enabled.
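
A sketch of that two-loop alternative, assuming hypothetical
flush_kernel_one()/flush_user_one() helpers that split the two halves of
flush_tlb_one_user():

/*
 * Hypothetical restructuring, not the kernel's code: do all of the
 * kernel-mapping INVLPGs in one loop and, only when PTI is enabled,
 * flush the user-mapping copies in a second loop, instead of doing
 * both flushes back to back for each address.
 */
unsigned long addr;

for (addr = f->start; addr < f->end; addr += 1UL << f->stride_shift)
	flush_kernel_one(addr);		/* INVLPG on the kernel mapping */

if (static_cpu_has(X86_FEATURE_PTI))
	for (addr = f->start; addr < f->end; addr += 1UL << f->stride_shift)
		flush_user_one(addr);	/* PTI (userspace) mapping */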

Long story short, I think that prefetching the entries that you want to
flush - assuming they do not fit in a single cacheline - might be needed.
A linked list would therefore not be very friendly for something like
that, since the next node's address only becomes known once the current
node's cacheline has arrived.
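
A minimal sketch of what I mean, assuming the addresses to flush sit in a
flat array (flush_va_array() is made up for illustration):

#define VA_PER_CACHELINE	(64 / sizeof(unsigned long))

/*
 * Illustration only: prefetch the next cacheline worth of addresses
 * while the current batch of INVLPGs is issued. With a linked list this
 * would not work as well, because the next node's address is only known
 * once the current node's cacheline has arrived.
 */
static void flush_va_array(const unsigned long *vas, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		if ((i % VA_PER_CACHELINE) == 0 && i + VA_PER_CACHELINE < nr)
			__builtin_prefetch(&vas[i + VA_PER_CACHELINE]);
		asm volatile("invlpg (%0)" :: "r" (vas[i]) : "memory");
	}
}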
 

