[mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
Catalin Marinas
catalin.marinas at arm.com
Wed Apr 16 05:47:46 PDT 2025
On Tue, Apr 15, 2025 at 04:22:04PM +0800, Xavier wrote:
> Patch V3 has changed the while loop to a for loop according to the suggestions
> of Dev.
For some reason, my email (office365) rejected all these patches (not
even quarantined), I only got the replies. Anyway, I can get them from
the lore archive.
> Meanwhile, to improve efficiency, the definition of local variables has
> been removed. This macro is only used within the current function and there
> will be no additional risks. In order to verify the optimization performance of
> Patch V3, a test function has been designed. By repeatedly calling mlock in a
> loop, the kernel is made to call contpte_ptep_get extensively to test the
> optimization effect of this function.
> The function's execution time and instruction statistics have been traced using
> perf, and the following are the operation results on a certain Qualcomm mobile
> phone chip:
>
> Instruction Statistics - Before Optimization
> # count event_name # count / runtime
> 20,814,352 branch-load-misses # 662.244 K/sec
> 41,894,986,323 branch-loads # 1.333 G/sec
> 1,957,415 iTLB-load-misses # 62.278 K/sec
> 49,872,282,100 iTLB-loads # 1.587 G/sec
> 302,808,096 L1-icache-load-misses # 9.634 M/sec
> 49,872,282,100 L1-icache-loads # 1.587 G/sec
>
> Total test time: 31.485237 seconds.
>
> Instruction Statistics - After Optimization
> # count event_name # count / runtime
> 19,340,524 branch-load-misses # 688.753 K/sec
> 38,510,185,183 branch-loads # 1.371 G/sec
> 1,812,716 iTLB-load-misses # 64.554 K/sec
> 47,673,923,151 iTLB-loads # 1.698 G/sec
> 675,853,661 L1-icache-load-misses # 24.068 M/sec
> 47,673,923,151 L1-icache-loads # 1.698 G/sec
>
> Total test time: 28.108048 seconds.
We'd need to reproduce these numbers on other platforms as well and with
different page sizes. I hope Ryan can do some tests next week.
Purely looking at the patch, I don't like the complexity. I'd rather go
with your v1 if the numbers are fairly similar (even if slightly slower).
However, I don't trust microbenchmarks like calling mlock() in a loop.
It was hand-crafted to dirty the whole buffer (making ptes young+dirty)
before mlock() to make the best out of the rewritten contpte_ptep_get().
Are there any real world workloads that would benefit from such change?
As it stands, I think this patch needs better justification.
Thanks.
--
Catalin
More information about the linux-arm-kernel
mailing list