[RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU

Ryan Roberts ryan.roberts at arm.com
Wed Sep 10 05:42:08 PDT 2025


On 10/09/2025 11:57, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts at arm.com> writes:
> 
>> Hi All,
>>
>> This is an RFC for my implementation of an idea from James Morse to avoid
>> broadcasting TLBIs to remote CPUs if it can be proven that no remote CPU could
>> have ever observed the pgtable entry for the TLB entry that is being
>> invalidated. It turns out that x86 does something similar in principle.
>>
>> The primary feedback I'm looking for is: is this actually correct and safe?
>> James and I both believe it to be, but it would be useful to get further
>> validation.
>>
>> Beyond that, the next question is: does it actually improve performance?
>> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
>> do a much better job of sustaining the overall number of "tlb shootdowns per
>> second" after the change:
>>
>> +------------+--------------------------+--------------------------+--------------------------+
>> |            |     Baseline (v6.15)     |        tlbi local        |        Improvement       |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> | nr_threads |     ops/sec |    ops/sec |     ops/sec |    ops/sec |     ops/sec |    ops/sec |
>> |            | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> |          1 |        9109 |       2573 |        8903 |       3653 |         -2% |        42% |
>> |          4 |        8115 |       1299 |        9892 |       1059 |         22% |       -18% |
>> |          8 |        5119 |        477 |       11854 |       1265 |        132% |       165% |
>> |         16 |        4796 |        286 |       14176 |        821 |        196% |       187% |
>> |         32 |        1593 |         38 |       15328 |        474 |        862% |      1147% |
>> |         64 |        1486 |         19 |        8096 |        131 |        445% |       589% |
>> |        128 |        1315 |         16 |        8257 |        145 |        528% |       806% |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>>
>> But looking at real-world benchmarks, I haven't yet found anything where it
>> makes a huge difference. When compiling the kernel, it reduces kernel time by
>> ~2.2%, but overall wall time remains the same. I'd be interested in any
>> suggestions for workloads where this might prove valuable.
>>
>> All mm selftests have been run and no regressions are observed. Applies on
>> v6.17-rc3.
> 
> I have used redis (a single-threaded in-memory database) to test the
> patchset on an ARM server.  32 redis-server processes are run on NUMA
> node 1 to amplify the overhead of TLBI broadcast, and 32
> memtier-benchmark processes are run on NUMA node 0 accordingly.
> Snapshots are triggered constantly in redis-server, which fork()s,
> saves the in-memory database to disk, and exit()s, so that COW in
> redis-server triggers a large number of TLBIs.  In essence, this
> tests the performance of redis-server during a snapshot.  The test
> time is about 300s.  Test results show that the benchmark score
> improves by ~4.5% with the patchset.
> 
> Feel free to add my
> 
> Tested-by: Huang Ying <ying.huang at linux.alibaba.com>
> 
> in the future versions.

Thanks for this - very useful!

> 
> ---
> Best Regards,
> Huang, Ying




More information about the linux-arm-kernel mailing list