Excessive TLB flush ranges

Tue May 16 17:23:07 PDT 2023

On Tue, May 16 2023 at 21:32, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 10:56, Nadav Amit wrote:
>>> On May 16, 2023, at 7:38 AM, Thomas Gleixner <tglx at linutronix.de> wrote:
>>> 
>>> There is a world outside of x86, but even on x86 it's borderline silly
>>> to take the whole TLB out when you can flush 3 TLB entries one by one
>>> with exactly the same number of IPIs, i.e. _one_. No?
>>
>> I just want to re-raise points that were made in the past, including in
>> the discussion that I sent before and match my experience.
>>
>> Feel free to reject them, but I think you should not ignore them.
>
> I'm not ignoring them and I'm well aware of these issues. No need to
> repeat them over and over. I'm old but not senile yet.

Just to be clear. This works the other way round too.

It makes a whole lot of a difference whether you do 5 IPIs in a row
which all need to get a cache line updated or if you have _one_ which
needs a couple of cache lines updated.

INVLPG is not serializing so the CPU can pull in the next required cache
line(s) on the VA list during that. These cache lines are _not_
contended at that point because _all_ of these data structures are no
longer globally accessible (mis-speculation aside) and therefore not
exclusive (misalignment aside, but you have to prove that this is an
issue).
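
To put numbers on the "5 IPIs vs. _one_" comparison, here is a small
user-space model. It is only a sketch: struct flush_item, struct
flush_stats and all function names are made up for illustration and are
not kernel APIs, model_flush_page() merely counts where real code would
issue a single-page invalidation such as INVLPG, and 4 KiB pages are
assumed.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Hypothetical list entry: one freed virtual address range. */
struct flush_item {
	unsigned long start;	/* inclusive, page aligned */
	unsigned long end;	/* exclusive, page aligned */
};

/* Counters only; nothing here touches a real TLB. */
struct flush_stats {
	unsigned long ipis;
	unsigned long page_flushes;
};

/* Stand-in for a single-page invalidation such as INVLPG. */
void model_flush_page(struct flush_stats *st, unsigned long addr)
{
	(void)addr;
	st->page_flushes++;
}

/* Invalidate every page of one range, one page at a time. */
void model_flush_item(struct flush_stats *st, const struct flush_item *it)
{
	unsigned long addr;

	assert(it->start <= it->end);
	for (addr = it->start; addr < it->end; addr += MODEL_PAGE_SIZE)
		model_flush_page(st, addr);
}

/* Strategy A: one IPI round trip per range. */
void flush_ipi_per_item(struct flush_stats *st,
			const struct flush_item *items, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		st->ipis++;
		model_flush_item(st, &items[i]);
	}
}

/* Strategy B: a single IPI whose handler walks the whole list. */
void flush_ipi_once(struct flush_stats *st,
		    const struct flush_item *items, size_t n)
{
	size_t i;

	st->ipis++;
	for (i = 0; i < n; i++)
		model_flush_item(st, &items[i]);
}
```

Both strategies invalidate the same pages; only the number of IPI round
trips differs. The model says nothing about cache line behaviour, which
still has to be measured.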

So just dismissing this based on 10-year-old experience is not really
helpful, though I'm happy to confirm your points once I have had the
time and opportunity to actually run real tests on it, unless you beat
me to it.

What I can confirm is that it solves a real world problem on !x86
machines for the pathological case at hand:

   On the affected contemporary ARM32 machine, which does not require
   IPIs, the selective flush is way better than:

   - flushing the silly 1.G range one page at a time (which is silly on
     its own as there is no range check)

   - a full TLB flush just for 3 pages, which is the same on x86 albeit
     the flush range is ~64GB there.
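
To illustrate why a span based flush goes wrong for this case, here is a
user-space sketch which compares the number of pages covered by a single
min(start)..max(end) range with the number of pages actually on the
list. struct va_range, span_pages() and listed_pages() are hypothetical
names for this example only, not the vmalloc code, and 4 KiB pages are
assumed.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for one entry on the list of freed VA areas. */
struct va_range {
	unsigned long start;	/* inclusive, page aligned */
	unsigned long end;	/* exclusive, page aligned */
};

/* Pages covered by one flush from the lowest start to the highest end. */
unsigned long span_pages(const struct va_range *r, size_t n)
{
	unsigned long lo, hi;
	size_t i;

	assert(n > 0);
	lo = r[0].start;
	hi = r[0].end;
	for (i = 1; i < n; i++) {
		if (r[i].start < lo)
			lo = r[i].start;
		if (r[i].end > hi)
			hi = r[i].end;
	}
	return (hi - lo) >> MODEL_PAGE_SHIFT;
}

/* Pages which actually need flushing: the sum over the list entries. */
unsigned long listed_pages(const struct va_range *r, size_t n)
{
	unsigned long pages = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(r[i].end >= r[i].start);
		pages += (r[i].end - r[i].start) >> MODEL_PAGE_SHIFT;
	}
	return pages;
}
```

With three single pages spread over a 1 GiB window, listed_pages()
yields 3 while span_pages() yields 262144. A page by page walk of the
span does that many invalidations for 3 stale entries.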

The point is that the generic vmalloc code is making assumptions which
are x86 centric and not even necessarily true on x86.

Whether or not this is beneficial on x86 is a completely separate
debate.

There is also a debate required whether a wholesale "flush on _ALL_
CPUs" is justified when some of those CPUs are completely isolated and
have absolutely no chance to be affected by that. This process bound
seccomp/BPF muck clearly does not justify kicking isolated CPUs out of
their computation in user space just because...

Thanks,

        tglx