[mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations

Tue Apr 15 19:10:27 PDT 2025

Please try to avoid presentation of a [0/N] cover letter when N==1!  A
simple singleton patch is better.

On Tue, 15 Apr 2025 16:22:04 +0800 Xavier <xavier_qy at 163.com> wrote:

> Patch V3 has changed the while loop to a for loop according to the suggestions
> of Dev. Meanwhile, to improve efficiency, the definition of local variables has
> been removed. This macro is only used within the current function and there

which function?

> will be no additional risks. In order to verify the optimization performance of
> Patch V3, a test function has been designed. By repeatedly calling mlock in a
> loop, the kernel is made to call contpte_ptep_get extensively to test the
> optimization effect of this function.
> The function's execution time and instruction statistics have been traced using
> perf, and the following are the operation results on a certain Qualcomm mobile
> phone chip:

All the words thus far appear to be discussing changes since v2.  For
the permanent kernel record, this isn't interesting or useful material.
So please present a standalone description which doesn't refer to
previous iterations.

It's great to present this what-i-changed-since-last-time material, but
that is better placed after the "^---$" separator, after the
Signed-off-by:, Reviewed-by: etc tags.

>
> ...
>

Below is what I came up with for a changelog.  Please check?

Optimize contpte_ptep_get() by adding early termination logic.  Check if
the dirty and young bits of orig_pte are already set and skip redundant
bit-setting operations during the loop.  This reduces unnecessary
iterations and improves performance.

The function's execution time and instruction statistics have been traced
using perf, and the following are the operation results on a certain
Qualcomm mobile phone chip:

Instruction Statistics - Before Optimization
#          count  event_name              # count / runtime
      20,814,352  branch-load-misses      # 662.244 K/sec
  41,894,986,323  branch-loads            # 1.333 G/sec
       1,957,415  iTLB-load-misses        # 62.278 K/sec
  49,872,282,100  iTLB-loads              # 1.587 G/sec
     302,808,096  L1-icache-load-misses   # 9.634 M/sec
  49,872,282,100  L1-icache-loads         # 1.587 G/sec

Total test time: 31.485237 seconds.

Instruction Statistics - After Optimization
#          count  event_name              # count / runtime
      19,340,524  branch-load-misses      # 688.753 K/sec
  38,510,185,183  branch-loads            # 1.371 G/sec
       1,812,716  iTLB-load-misses        # 64.554 K/sec
  47,673,923,151  iTLB-loads              # 1.698 G/sec
     675,853,661  L1-icache-load-misses   # 24.068 M/sec
  47,673,923,151  L1-icache-loads         # 1.698 G/sec

Total test time: 28.108048 seconds.

Function Statistics - Before Optimization
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 1419716
Event count: 99618088900

Overhead   Symbol
21.42%     lock_release
21.26%     lock_acquire
20.88%     arch_counter_get_cntvct
14.32%     _raw_spin_unlock_irq
6.79%      contpte_ptep_get
2.20%      test_contpte_perf
1.82%      follow_page_pte
0.97%      lock_acquired
0.97%      rcu_is_watching
0.89%      mlock_pte_range
0.84%      sched_clock_noinstr
0.70%      handle_softirqs.llvm.8218488130471452153
0.58%      test_preempt_disable_long
0.57%      _raw_spin_unlock_irqrestore
0.54%      arch_stack_walk
0.51%      vm_normal_folio
0.48%      check_preemption_disabled
0.47%      stackinfo_get_task
0.36%      try_grab_folio
0.34%      preempt_count
0.32%      trace_preempt_on
0.29%      trace_preempt_off
0.24%      debug_smp_processor_id

Function Statistics - After Optimization
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 1431006
Event count: 118856425042

Overhead   Symbol
22.59%     lock_release
22.13%     arch_counter_get_cntvct
22.08%     lock_acquire
15.32%     _raw_spin_unlock_irq
2.26%      test_contpte_perf
1.50%      follow_page_pte
1.49%      arch_stack_walk
1.30%      rcu_is_watching
1.09%      lock_acquired
1.07%      sched_clock_noinstr
0.88%      handle_softirqs.llvm.12507768597002095717
0.88%      trace_preempt_off
0.76%      _raw_spin_unlock_irqrestore
0.61%      check_preemption_disabled
0.52%      trace_preempt_on
0.50%      mlock_pte_range
0.43%      try_grab_folio
0.41%      folio_mark_accessed
0.40%      vm_normal_folio
0.38%      test_preempt_disable_long
0.28%      contpte_ptep_get
0.27%      __traceiter_android_rvh_preempt_disable
0.26%      debug_smp_processor_id
0.24%      return_address
0.20%      __pte_offset_map_lock
0.19%      unwind_next_frame_record

If there is no problem with my test program, it can be seen that there is a
significant performance improvement both in the overall number of instructions
and the execution time of contpte_ptep_get.

If any reviewers have time, you can also test it on your machines for comparison.
I have enabled THP and hugepages-64kB.

Test function:

#define PAGE_SIZE 4096
#define CONT_PTES 16
#define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)

void rwdata(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
		buf[i] = 'a';
		volatile char c = buf[i];
	}
}
void test_contpte_perf()
{
	char *buf;
	int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
	if (ret != 0) {
		perror("posix_memalign failed");
		exit(EXIT_FAILURE);
	}

	rwdata(buf);

	for (int j = 0; j < 500; j++) {
		mlock(buf, TEST_SIZE);

		rwdata(buf);

		munlock(buf, TEST_SIZE);
	}

	free(buf);
}