[mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations

Xavier xavier_qy at 163.com
Wed Apr 16 08:08:33 PDT 2025



Hi Catalin,


At 2025-04-16 20:47:46, "Catalin Marinas" <catalin.marinas at arm.com> wrote:
>On Tue, Apr 15, 2025 at 04:22:04PM +0800, Xavier wrote:
>> Patch V3 has changed the while loop to a for loop according to the suggestions
>> of Dev.
>
>For some reason, my email (office365) rejected all these patches (not
>even quarantined), I only got the replies. Anyway, I can get them from
>the lore archive.
>
>> Meanwhile, to improve efficiency, the definition of local variables has
>> been removed. This macro is only used within the current function and there
>> will be no additional risks. In order to verify the optimization performance of
>> Patch V3, a test function has been designed. By repeatedly calling mlock in a
>> loop, the kernel is made to call contpte_ptep_get extensively to test the
>> optimization effect of this function.
>> The function's execution time and instruction statistics have been traced using
>> perf, and the following are the operation results on a certain Qualcomm mobile
>> phone chip:
>> 
>> Instruction Statistics - Before Optimization
>> #          count  event_name              # count / runtime
>>       20,814,352  branch-load-misses      # 662.244 K/sec
>>   41,894,986,323  branch-loads            # 1.333 G/sec
>>        1,957,415  iTLB-load-misses        # 62.278 K/sec
>>   49,872,282,100  iTLB-loads              # 1.587 G/sec
>>      302,808,096  L1-icache-load-misses   # 9.634 M/sec
>>   49,872,282,100  L1-icache-loads         # 1.587 G/sec
>> 
>> Total test time: 31.485237 seconds.
>> 
>> Instruction Statistics - After Optimization
>> #          count  event_name              # count / runtime
>>       19,340,524  branch-load-misses      # 688.753 K/sec
>>   38,510,185,183  branch-loads            # 1.371 G/sec
>>        1,812,716  iTLB-load-misses        # 64.554 K/sec
>>   47,673,923,151  iTLB-loads              # 1.698 G/sec
>>      675,853,661  L1-icache-load-misses   # 24.068 M/sec
>>   47,673,923,151  L1-icache-loads         # 1.698 G/sec
>> 
>> Total test time: 28.108048 seconds.
>
>We'd need to reproduce these numbers on other platforms as well and with
>different page sizes. I hope Ryan can do some tests next week.

Of course, it would be even better if reviewers could also verify it in some
real-world scenarios.

>
>Purely looking at the patch, I don't like the complexity. I'd rather go
>with your v1 if the numbers are fairly similar (even if slightly slower).

The implementation of this macro is indeed a bit complicated. If simplifying
it doesn't hurt performance, I will try to make it simpler.

>
>However, I don't trust microbenchmarks like calling mlock() in a loop.
>It was hand-crafted to dirty the whole buffer (making ptes young+dirty)
>before mlock() to make the best out of the rewritten contpte_ptep_get().
>Are there any real world workloads that would benefit from such change?
>
>As it stands, I think this patch needs better justification.
>

Indeed, I used mlock() because it is a simple way to trigger a large number
of calls to contpte_ptep_get() from user space. Of course, there are many
other paths that call this function, such as madvise() and mprotect(), but
essentially there is no difference: all of them exercise contpte_ptep_get()
and can show the performance improvement in that function. It's true that
the previous test case only covered the best-case scenario by dirtying the
whole buffer first. Next, I will test the impact of the patch in scenarios
where the "young" and "dirty" flags are absent, or where only the "young"
flag is set.

--
Thanks,
Xavier
