[mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations

Tue Apr 15 01:22:04 PDT 2025

Patch V3 has changed the while loop to a for loop according to the suggestions
of Dev. Meanwhile, to improve efficiency, the definition of local variables has
been removed. This macro is only used within the current function and there
will be no additional risks. In order to verify the optimization performance of
Patch V3, a test function has been designed. By repeatedly calling mlock in a
loop, the kernel is made to call contpte_ptep_get extensively to test the
optimization effect of this function.
The function's execution time and instruction statistics have been traced using
perf, and the following are the operation results on a certain Qualcomm mobile
phone chip:

Instruction Statistics - Before Optimization
#          count  event_name              # count / runtime
      20,814,352  branch-load-misses      # 662.244 K/sec
  41,894,986,323  branch-loads            # 1.333 G/sec
       1,957,415  iTLB-load-misses        # 62.278 K/sec
  49,872,282,100  iTLB-loads              # 1.587 G/sec
     302,808,096  L1-icache-load-misses   # 9.634 M/sec
  49,872,282,100  L1-icache-loads         # 1.587 G/sec

Total test time: 31.485237 seconds.

Instruction Statistics - After Optimization
#          count  event_name              # count / runtime
      19,340,524  branch-load-misses      # 688.753 K/sec
  38,510,185,183  branch-loads            # 1.371 G/sec
       1,812,716  iTLB-load-misses        # 64.554 K/sec
  47,673,923,151  iTLB-loads              # 1.698 G/sec
     675,853,661  L1-icache-load-misses   # 24.068 M/sec
  47,673,923,151  L1-icache-loads         # 1.698 G/sec

Total test time: 28.108048 seconds.

Function Statistics - Before Optimization
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 1419716
Event count: 99618088900

Overhead   Symbol
21.42%     lock_release
21.26%     lock_acquire
20.88%     arch_counter_get_cntvct
14.32%     _raw_spin_unlock_irq
6.79%      contpte_ptep_get
2.20%      test_contpte_perf
1.82%      follow_page_pte
0.97%      lock_acquired
0.97%      rcu_is_watching
0.89%      mlock_pte_range
0.84%      sched_clock_noinstr
0.70%      handle_softirqs.llvm.8218488130471452153
0.58%      test_preempt_disable_long
0.57%      _raw_spin_unlock_irqrestore
0.54%      arch_stack_walk
0.51%      vm_normal_folio
0.48%      check_preemption_disabled
0.47%      stackinfo_get_task
0.36%      try_grab_folio
0.34%      preempt_count
0.32%      trace_preempt_on
0.29%      trace_preempt_off
0.24%      debug_smp_processor_id

Function Statistics - After Optimization
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 1431006
Event count: 118856425042

Overhead   Symbol
22.59%     lock_release
22.13%     arch_counter_get_cntvct
22.08%     lock_acquire
15.32%     _raw_spin_unlock_irq
2.26%      test_contpte_perf
1.50%      follow_page_pte
1.49%      arch_stack_walk
1.30%      rcu_is_watching
1.09%      lock_acquired
1.07%      sched_clock_noinstr
0.88%      handle_softirqs.llvm.12507768597002095717
0.88%      trace_preempt_off
0.76%      _raw_spin_unlock_irqrestore
0.61%      check_preemption_disabled
0.52%      trace_preempt_on
0.50%      mlock_pte_range
0.43%      try_grab_folio
0.41%      folio_mark_accessed
0.40%      vm_normal_folio
0.38%      test_preempt_disable_long
0.28%      contpte_ptep_get
0.27%      __traceiter_android_rvh_preempt_disable
0.26%      debug_smp_processor_id
0.24%      return_address
0.20%      __pte_offset_map_lock
0.19%      unwind_next_frame_record

If there is no problem with my test program, it can be seen that there is a
significant performance improvement both in the overall number of instructions
and the execution time of contpte_ptep_get.

If any reviewers have time, you can also test it on your machines for comparison.
I have enabled THP and hugepages-64kB.

Test Function:
---
#define PAGE_SIZE 4096
#define CONT_PTES 16
#define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)

void rwdata(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
		buf[i] = 'a';
		volatile char c = buf[i];
	}
}
void test_contpte_perf()
{
	char *buf;
	int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
	if (ret != 0) {
		perror("posix_memalign failed");
		exit(EXIT_FAILURE);
	}

	rwdata(buf);

	for (int j = 0; j < 500; j++) {
		mlock(buf, TEST_SIZE);

		rwdata(buf);

		munlock(buf, TEST_SIZE);
	}

	free(buf);
}
---

Xavier (1):
  mm/contpte: Optimize loop to reduce redundant operations

 arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

-- 
2.34.1