[PATCH] ARM: tlb: Prevent flushing insanely large ranges one by one

Robin Murphy robin.murphy at arm.com
Wed May 24 04:05:57 PDT 2023


On 2023-05-24 11:23, Russell King (Oracle) wrote:
> On Wed, May 24, 2023 at 11:18:12AM +0100, Robin Murphy wrote:
>> On 2023-05-24 10:32, Thomas Gleixner wrote:
>>> vmalloc uses lazy TLB flushes for unmapped ranges to avoid excessive TLB
>>> flushing on every unmap. The lazy flushing coalesces unmapped ranges and
>>> invokes flush_tlb_kernel_range() with the combined range.
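
In pseudo-C, the coalescing amounts to flushing the union of all
pending ranges (a simplified illustrative sketch, not the actual
mm/vmalloc.c code):

	/*
	 * Simplified sketch: the combined flush covers the union of the
	 * lazily unmapped ranges, including any gap between them that
	 * was never unmapped at all.
	 */
	static void purge_vmap_ranges(struct vmap_area *areas, int n)
	{
		unsigned long start = ULONG_MAX, end = 0;
		int i;

		for (i = 0; i < n; i++) {
			start = min(start, areas[i].va_start);
			end = max(end, areas[i].va_end);
		}
		flush_tlb_kernel_range(start, end);	/* one combined flush */
	}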
>>>
>>> The coalescing can result in ranges which span the full vmalloc address
>>> range. In the case of flushing an executable mapping in the module address
>>> space, this range is extended to also flush the direct map alias.
>>>
>>> flush_tlb_kernel_range() then walks insanely large ranges; the worst case
>>> observed was ~1.5GB.
>>>
>>> The range is flushed page by page, which takes several milliseconds to
>>> complete in the worst case and obviously affects all processes in the
>>> system. In the observed worst case this almost doubles the runtime of a
>>> realtime task on an isolated CPU relative to its normal worst case,
>>> making it miss its deadline.
>>>
>>> Cure this by sanity-checking the range against a threshold and falling
>>> back to flush_tlb_all() when the range is too large.
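
Conceptually, the check amounts to something like the sketch below;
the threshold variable name is an assumption here, and per the
diffstat the real change lives in arch/arm/kernel/smp_tlb.c:

	void flush_tlb_kernel_range(unsigned long start, unsigned long end)
	{
		/* Too large: one full flush beats walking the range */
		if ((end - start) >> PAGE_SHIFT > tlb_flush_all_threshold) {
			flush_tlb_all();
			return;
		}
		/* ... existing page-by-page flush ... */
	}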
>>>
>>> The default threshold is 32 pages, but for CPUs with CP15 this is evaluated
>>> at boot time via read_cpuid(CPUID_TLBTYPE) and set to half of the TLB
>>> size.
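
For illustration, the boot-time setup could then reduce to something
like this, using the read_cpuid_tlbsize() helper added below (a sketch
only: the variable name is assumed, and the check for CP15 is
simplified to the config option):

	static unsigned int tlb_flush_all_threshold = 32;	/* pages */

	static void __init setup_tlb_flush_threshold(void)
	{
	#ifdef CONFIG_CPU_CP15
		/* Flush everything once a range exceeds half the TLB size */
		tlb_flush_all_threshold = read_cpuid_tlbsize() / 2;
	#endif
	}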
>>>
>>> The vmalloc range coalescing could be improved to provide a list or
>>> array of ranges to flush, which would allow avoiding overbroad flushing,
>>> but that's major surgery and does not solve the problem of genuinely
>>> justified large-range flushes, which can happen due to the lazy flush
>>> mechanics in vmalloc. The lazy flush results in batching which is biased
>>> towards large-range flushes by design.
>>>
>>> Fixes: db64fe02258f ("mm: rewrite vmap layer")
>>> Reported-by: John Ogness <john.ogness at linutronix.de>
>>> Debugged-by: John Ogness <john.ogness at linutronix.de>
>>> Signed-off-by: Thomas Gleixner <tglx at linutronix.de>
>>> Tested-by: John Ogness <john.ogness at linutronix.de>
>>> Link: https://lore.kernel.org/all/87a5y5a6kj.ffs@tglx
>>> ---
>>>    arch/arm/include/asm/cputype.h  |    5 +++++
>>>    arch/arm/include/asm/tlbflush.h |    2 ++
>>>    arch/arm/kernel/setup.c         |   10 ++++++++++
>>>    arch/arm/kernel/smp_tlb.c       |    4 ++++
>>>    4 files changed, 21 insertions(+)
>>>
>>> --- a/arch/arm/include/asm/cputype.h
>>> +++ b/arch/arm/include/asm/cputype.h
>>> @@ -196,6 +196,11 @@ static inline unsigned int __attribute_c
>>>    	return read_cpuid(CPUID_MPUIR);
>>>    }
>>> +static inline unsigned int __attribute_const__ read_cpuid_tlbsize(void)
>>> +{
>>> +	return 64 << ((read_cpuid(CPUID_TLBTYPE) >> 1) & 0x03);
>>> +}
>>
>> This appears to be specific to Cortex-A9 - these bits are
>> implementation-defined, and it looks like on most other Arm Ltd. CPUs
>> they have no meaning at all, e.g. [1][2][3], but they could still hold
>> some wildly unrelated value on other implementations.
> 
> That sucks. I guess we'll need to decode the main CPU ID register and
> have a table, except for Cortex-A9 where we can read the TLB size.

Yes, it seems like Cortex-A9 is the odd one out in having 
configurability here; otherwise the sizes seem to range from 32 entries 
on Cortex-A8 to 1024 entries for Cortex-A17's main TLB, so having just 
a single default value would seem less than optimal.
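
For what it's worth, such a table keyed on the MIDR part number could
look roughly like the sketch below - the part numbers are from the Arm
TRMs, the two entry counts are the ones mentioned above, and the names
and fallback value are illustrative only:

	struct tlb_size_entry {
		unsigned int partnum;	/* MIDR[15:4] */
		unsigned int entries;	/* main TLB entries */
	};

	static const struct tlb_size_entry tlb_sizes[] = {
		{ 0xc08,   32 },	/* Cortex-A8 */
		{ 0xc0e, 1024 },	/* Cortex-A17 */
		/* ... */
	};

	static unsigned int __init guess_tlb_size(void)
	{
		unsigned int part = (read_cpuid_id() >> 4) & 0xfff;
		int i;

		for (i = 0; i < ARRAY_SIZE(tlb_sizes); i++)
			if (tlb_sizes[i].partnum == part)
				return tlb_sizes[i].entries;

		return 64;	/* arbitrary conservative fallback */
	}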

Thanks,
Robin.

> If that's not going to work either, then the MM layer needs to get
> fixed not to be so utterly stupid as to request a TLB flush over an
> insanely large range - or people will just have to put up with
> latency sucking on 32-bit ARM platforms.
> 


