Excessive TLB flush ranges
Uladzislau Rezki
urezki at gmail.com
Mon May 15 11:17:17 PDT 2023
On Mon, May 15, 2023 at 06:43:40PM +0200, Thomas Gleixner wrote:
> Folks!
>
> We're observing massive latencies and slowdowns on ARM32 machines due to
> excessive TLB flush ranges.
>
> Those can be observed when tearing down a process, which has a seccomp
> BPF filter installed. ARM32 uses the vmalloc area for module space.
>
> bpf_prog_free_deferred()
> vfree()
> _vm_unmap_aliases()
> collect_per_cpu_vmap_blocks: start:0x95c8d000 end:0x95c8e000 size:0x1000
> __purge_vmap_area_lazy(start:0x95c8d000, end:0x95c8e000)
>
> va_start:0xf08a1000 va_end:0xf08a5000 size:0x00004000 gap:0x5ac13000 (371731 pages)
> va_start:0xf08a5000 va_end:0xf08a9000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf08a9000 va_end:0xf08ad000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf08ad000 va_end:0xf08b1000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf08b3000 va_end:0xf08b7000 size:0x00004000 gap:0x00002000 ( 2 pages)
> va_start:0xf08b7000 va_end:0xf08bb000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf08bb000 va_end:0xf08bf000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf0a15000 va_end:0xf0a17000 size:0x00002000 gap:0x00156000 ( 342 pages)
>
> flush_tlb_kernel_range(start:0x95c8d000, end:0xf0a17000)
>
> Does 372106 flush operations where only 31 are useful
>
> So for all architectures which lack a mechanism to do a full TLB flush
> in flush_tlb_kernel_range() this takes ages (4-8ms) and slows down
> realtime processes on the other CPUs by a factor of two and larger.
>
> So while ARM32, CSKY, NIOS, PPC (some variants), _should_ arguably have
> a fallback to tlb_flush_all() when the range is too large, there is
> another issue. I've seen a couple of instances where _vm_unmap_aliases()
> collects one page and the actual va list has only 2 pages, which might
> be eventually worth to flush one by one.
>
> I'm not sure whether that's worth it as checking for those gaps might be
> too expensive for the case where a large number of va entries needs to
> be flushed.
>
> We'll experiment with a tlb_flush_all() fallback on that ARM32 system in
> the next days and see how that works out.
>
For systems which lack a full TLB flush, and where flushing a long range
is a problem (it takes time), we can probably flush the VAs one by one,
because currently we calculate a single flush range [min:max], and that
range includes space that might not be mapped at all. Like below:

 VA_1                               VA_2
|....|-------------------------|............|
10   12                        60           68

. mapped;
- not mapped.

So we flush from 10 to 68. Instead, we can probably flush the VA_1 range
and the VA_2 range separately. On modern systems with many CPUs, it could
be a big slowdown.
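
To put numbers on the difference, here is a rough user-space sketch (not
kernel code) that counts how many pages each strategy would touch for the
ranges in the trace quoted above. The struct and function names are made
up for illustration, and 4 KiB pages are assumed, which matches the
"size:0x1000" and "( 2 pages)" figures in the trace:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as the quoted trace suggests. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical stand-in for one vmap area: [start, end) in bytes. */
struct va_range {
	uint32_t start;
	uint32_t end;
};

/*
 * Ranges copied from the quoted trace: the single page collected by
 * _vm_unmap_aliases() followed by the eight lazily freed areas.
 */
static const struct va_range trace[] = {
	{ 0x95c8d000u, 0x95c8e000u },
	{ 0xf08a1000u, 0xf08a5000u },
	{ 0xf08a5000u, 0xf08a9000u },
	{ 0xf08a9000u, 0xf08ad000u },
	{ 0xf08ad000u, 0xf08b1000u },
	{ 0xf08b3000u, 0xf08b7000u },
	{ 0xf08b7000u, 0xf08bb000u },
	{ 0xf08bb000u, 0xf08bf000u },
	{ 0xf0a15000u, 0xf0a17000u },
};
#define TRACE_LEN (sizeof(trace) / sizeof(trace[0]))

/* Pages covered by one flush over [min(start), max(end)). */
static unsigned long merged_flush_pages(const struct va_range *r, size_t n)
{
	uint32_t lo, hi;
	size_t i;

	assert(n > 0);
	lo = r[0].start;
	hi = r[0].end;
	for (i = 1; i < n; i++) {
		if (r[i].start < lo)
			lo = r[i].start;
		if (r[i].end > hi)
			hi = r[i].end;
	}
	return (unsigned long)((hi - lo) >> SKETCH_PAGE_SHIFT);
}

/* Pages covered when every range is flushed on its own. */
static unsigned long per_va_flush_pages(const struct va_range *r, size_t n)
{
	unsigned long pages = 0;
	size_t i;

	for (i = 0; i < n; i++)
		pages += (r[i].end - r[i].start) >> SKETCH_PAGE_SHIFT;
	return pages;
}
```

With the trace data, merged_flush_pages(trace, TRACE_LEN) gives 372106
and per_va_flush_pages(trace, TRACE_LEN) gives 31, which are the two
figures Thomas reported. The sketch only counts pages; it does not model
the cost of issuing several separate flushes instead of one.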
Just some thoughts.
--
Uladzislau Rezki