[mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
Xavier
xavier_qy at 163.com
Wed Apr 16 08:22:17 PDT 2025
Hi Ryan,
At 2025-04-16 20:48:57, "Ryan Roberts" <ryan.roberts at arm.com> wrote:
>On 15/04/2025 09:22, Xavier wrote:
>> Patch v3 changes the while loop to a for loop, as suggested by Dev. To improve
>> efficiency, the local variable definitions have also been removed; the macro is
>> only used within the current function, so this introduces no additional risk.
>> To verify the optimization, I wrote a test function that calls mlock() in a
>> loop, which makes the kernel call contpte_ptep_get() extensively and so
>> exercises the optimized path.
>> The function's execution time and instruction statistics were traced with perf;
>> the following are the results on a Qualcomm mobile phone SoC:
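For context, a simplified sketch of the idea (illustrative only, not the actual
kernel code and not the patch itself): contpte_ptep_get() folds the dirty/young
bits of every PTE in the contig block into the pte it returns, and the redundant
work is the checking that can no longer change the result, e.g. once both bits
have already been seen. The helper name below is made up, and it assumes
kernel-internal helpers such as __ptep_get()/pte_mkdirty()/pte_mkyoung():

	/*
	 * Illustrative sketch only -- not the patch. Once the accumulated
	 * pte is already dirty and young, the remaining entries in the
	 * contig block cannot change the result, so the scan can stop.
	 */
	static pte_t contpte_fold_sketch(pte_t *ptep, pte_t orig_pte)
	{
		int i;

		for (i = 0; i < CONT_PTES; i++, ptep++) {
			pte_t pte = __ptep_get(ptep);

			if (pte_dirty(pte))
				orig_pte = pte_mkdirty(orig_pte);
			if (pte_young(pte))
				orig_pte = pte_mkyoung(orig_pte);

			/* Nothing more to learn from the remaining entries. */
			if (pte_dirty(orig_pte) && pte_young(orig_pte))
				break;
		}

		return orig_pte;
	}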
>
>Xavier, for some reason your emails aren't hitting my inbox - I'm only seeing
>the replies from others. I'll monitor lore but apologies if I'm slow to respond
>- that's the reason.
>
>Please start the first line of the commit with "arm64/mm" instead of "mm/contpte".
>
>Also I noticed that Andrew put this into mm-new last night. I'd prefer that this
>go via the arm64 tree, if we decide we want it.
OK, I will change it to "arm64/mm" in the next version.
>>
>> Instruction Statistics - Before Optimization
>>          count  event_name               # count / runtime
>>     20,814,352  branch-load-misses       #  662.244 K/sec
>> 41,894,986,323  branch-loads             #    1.333 G/sec
>>      1,957,415  iTLB-load-misses         #   62.278 K/sec
>> 49,872,282,100  iTLB-loads               #    1.587 G/sec
>>    302,808,096  L1-icache-load-misses    #    9.634 M/sec
>> 49,872,282,100  L1-icache-loads          #    1.587 G/sec
>>
>> Total test time: 31.485237 seconds.
>>
>> Instruction Statistics - After Optimization
>>          count  event_name               # count / runtime
>>     19,340,524  branch-load-misses       #  688.753 K/sec
>> 38,510,185,183  branch-loads             #    1.371 G/sec
>>      1,812,716  iTLB-load-misses         #   64.554 K/sec
>> 47,673,923,151  iTLB-loads               #    1.698 G/sec
>>    675,853,661  L1-icache-load-misses    #   24.068 M/sec
>> 47,673,923,151  L1-icache-loads          #    1.698 G/sec
>>
>> Total test time: 28.108048 seconds.
>>
>> Function Statistics - Before Optimization
>> Arch: arm64
>> Event: cpu-cycles (type 0, config 0)
>> Samples: 1419716
>> Event count: 99618088900
>>
>> Overhead Symbol
>> 21.42% lock_release
>> 21.26% lock_acquire
>> 20.88% arch_counter_get_cntvct
>> 14.32% _raw_spin_unlock_irq
>> 6.79% contpte_ptep_get
>> 2.20% test_contpte_perf
>> 1.82% follow_page_pte
>> 0.97% lock_acquired
>> 0.97% rcu_is_watching
>> 0.89% mlock_pte_range
>> 0.84% sched_clock_noinstr
>> 0.70% handle_softirqs.llvm.8218488130471452153
>> 0.58% test_preempt_disable_long
>> 0.57% _raw_spin_unlock_irqrestore
>> 0.54% arch_stack_walk
>> 0.51% vm_normal_folio
>> 0.48% check_preemption_disabled
>> 0.47% stackinfo_get_task
>> 0.36% try_grab_folio
>> 0.34% preempt_count
>> 0.32% trace_preempt_on
>> 0.29% trace_preempt_off
>> 0.24% debug_smp_processor_id
>>
>> Function Statistics - After Optimization
>> Arch: arm64
>> Event: cpu-cycles (type 0, config 0)
>> Samples: 1431006
>> Event count: 118856425042
>>
>> Overhead Symbol
>> 22.59% lock_release
>> 22.13% arch_counter_get_cntvct
>> 22.08% lock_acquire
>> 15.32% _raw_spin_unlock_irq
>> 2.26% test_contpte_perf
>> 1.50% follow_page_pte
>> 1.49% arch_stack_walk
>> 1.30% rcu_is_watching
>> 1.09% lock_acquired
>> 1.07% sched_clock_noinstr
>> 0.88% handle_softirqs.llvm.12507768597002095717
>> 0.88% trace_preempt_off
>> 0.76% _raw_spin_unlock_irqrestore
>> 0.61% check_preemption_disabled
>> 0.52% trace_preempt_on
>> 0.50% mlock_pte_range
>> 0.43% try_grab_folio
>> 0.41% folio_mark_accessed
>> 0.40% vm_normal_folio
>> 0.38% test_preempt_disable_long
>> 0.28% contpte_ptep_get
>> 0.27% __traceiter_android_rvh_preempt_disable
>> 0.26% debug_smp_processor_id
>> 0.24% return_address
>> 0.20% __pte_offset_map_lock
>> 0.19% unwind_next_frame_record
>>
>> Assuming there is no problem with my test program, the results show a
>> significant improvement both in the overall number of instructions executed
>> (and total test time) and in the overhead of contpte_ptep_get() itself.
>>
>> If any reviewers have time, please also test it on your machines for
>> comparison. I have enabled THP and the hugepages-64kB size.
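For reproduction, these are the THP sysfs knobs I mean; a rough sketch of
setting them from C (the per-size path and the "always" value are assumptions
that may differ depending on your kernel and setup):

	#include <stdio.h>

	/* Write a value to a sysfs file; needs to run as root. */
	static void write_sysfs(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			return;
		}
		fputs(val, f);
		fclose(f);
	}

	static void enable_64k_mthp(void)
	{
		write_sysfs("/sys/kernel/mm/transparent_hugepage/enabled",
			    "always");
		write_sysfs("/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled",
			    "always");
	}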
>>
>> Test Function:
>> ---
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>>
>> #define PAGE_SIZE 4096
>> #define CONT_PTES 16
>> #define TEST_SIZE (4096 * CONT_PTES * PAGE_SIZE)
>>
>> /* Touch one byte per page so the whole range is populated and read back. */
>> void rwdata(char *buf)
>> {
>> 	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
>> 		buf[i] = 'a';
>> 		volatile char c = buf[i];
>> 	}
>> }
>>
>> /*
>>  * Repeatedly mlock()/munlock() the range so the kernel walks the contpte
>>  * mappings (and calls contpte_ptep_get()) many times.
>>  */
>> void test_contpte_perf(void)
>> {
>> 	char *buf;
>> 	int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
>>
>> 	if (ret != 0) {
>> 		fprintf(stderr, "posix_memalign failed: %d\n", ret);
>> 		exit(EXIT_FAILURE);
>> 	}
>>
>> 	rwdata(buf);
>>
>> 	for (int j = 0; j < 500; j++) {
>> 		mlock(buf, TEST_SIZE);
>>
>> 		rwdata(buf);
>>
>> 		munlock(buf, TEST_SIZE);
>> 	}
>>
>> 	free(buf);
>> }
>This is a microbenchmark in a pathological case and it's showing ~11%
>improvement. But in principle I'm ok with it. I have some comments on the actual
>change though, which I'll send through against that email.
OK. Please refer to my reply to that email.
Thanks,
Xavier