[RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU

Ryan Roberts ryan.roberts at arm.com
Tue Sep 2 09:56:29 PDT 2025


On 02/09/2025 17:47, Catalin Marinas wrote:
> On Fri, Aug 29, 2025 at 04:35:06PM +0100, Ryan Roberts wrote:
>> Beyond that, the next question is: does it actually improve performance?
>> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
>> do a much better job of sustaining the overall number of "tlb shootdowns per
>> second" after the change:
>>
>> +------------+--------------------------+--------------------------+--------------------------+
>> |            |     Baseline (v6.15)     |        tlbi local        |        Improvement       |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> | nr_threads |     ops/sec |    ops/sec |     ops/sec |    ops/sec |    % change |   % change |
>> |            | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> |          1 |        9109 |       2573 |        8903 |       3653 |         -2% |        42% |
>> |          4 |        8115 |       1299 |        9892 |       1059 |         22% |       -18% |
>> |          8 |        5119 |        477 |       11854 |       1265 |        132% |       165% |
>> |         16 |        4796 |        286 |       14176 |        821 |        196% |       187% |
>> |         32 |        1593 |         38 |       15328 |        474 |        862% |      1147% |
>> |         64 |        1486 |         19 |        8096 |        131 |        445% |       589% |
>> |        128 |        1315 |         16 |        8257 |        145 |        528% |       806% |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>>
>> But looking at real-world benchmarks, I haven't yet found anything where it
>> makes a huge difference. When compiling the kernel, it reduces kernel time by
>> ~2.2%, but overall wall time remains the same. I'd be interested in any
>> suggestions for workloads where this might prove valuable.
> 
> I suspect it's highly dependent on hardware and how it handles the DVM
> messages. There were some old proposals from Fujitsu:
> 
> https://lore.kernel.org/linux-arm-kernel/20190617143255.10462-1-indou.takao@jp.fujitsu.com/
> 
> Christoph Lameter (Ampere) also followed with some refactoring in this
> area to allow a boot-configurable way to do TLBI via IS ops or IPI:
> 
> https://lore.kernel.org/linux-arm-kernel/20231207035703.158053467@gentwo.org/
> 
> (for some reason, the patches did not make it to the list; I have them
> in my inbox if you are interested)
> 
> I don't remember any real-world workload, more like hand-crafted
> mprotect() loops.
> 
> Anyway, I think the approach in your series doesn't have downsides; it's
> fairly clean and addresses some low-hanging fruit. For multi-threaded
> workloads where a flush_tlb_mm() is cheaper than a series of per-page
> TLBIs, I think we can wait for that hardware to be phased out. The TLBI
> range operations should significantly reduce the DVM messages between
> CPUs.

I'll gather some more numbers and try to make a case for merging it then. I
don't really want to add complexity if there is no clear value.
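
For anyone skimming the thread, the gist of the series is a check along the
lines of the sketch below before issuing the invalidation. It is deliberately
simplified and not the actual patches: mm_cpumask() stands in for whatever
per-mm "which CPUs has this run on" tracking ends up being used, and
local_tlbi_page()/broadcast_tlbi_page() are hypothetical helpers for the
non-shareable and inner-shareable invalidation sequences.

#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <linux/smp.h>

/* Hypothetical stand-ins for the two invalidation sequences. */
static void local_tlbi_page(struct mm_struct *mm, unsigned long addr);     /* no DVM broadcast */
static void broadcast_tlbi_page(struct mm_struct *mm, unsigned long addr); /* inner-shareable */

static void sketch_flush_tlb_page(struct mm_struct *mm, unsigned long addr)
{
        int cpu = get_cpu();    /* keep the check and the TLBI on one CPU */

        if (cpumask_equal(mm_cpumask(mm), cpumask_of(cpu))) {
                /*
                 * The mm has only ever been active on this CPU, so no other
                 * CPU can hold TLB entries for it: a local, non-broadcast
                 * invalidation is enough and generates no DVM traffic.
                 */
                local_tlbi_page(mm, addr);
        } else {
                /*
                 * The mm has been (or may still be) active elsewhere, so
                 * fall back to the usual inner-shareable broadcast.
                 */
                broadcast_tlbi_page(mm, addr);
        }

        put_cpu();
}

The interesting part is of course making that "only ever active here" condition
cheap and race-free against concurrent switches to the mm, which the sketch
glosses over.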

Thanks for the review.

Thanks,
Ryan