[mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
Xavier
xavier_qy at 163.com
Wed Apr 16 08:22:17 PDT 2025
Hi Ryan,
At 2025-04-16 20:48:57, "Ryan Roberts" <ryan.roberts at arm.com> wrote:
>On 15/04/2025 09:22, Xavier wrote:
>> Patch v3 changes the while loop to a for loop, as suggested by Dev. To improve
>> efficiency, the local variable definitions have also been removed; the macro is
>> only used within the current function, so this introduces no additional risk.
>> To verify the optimization, I wrote a test function that calls mlock() in a
>> loop, which makes the kernel call contpte_ptep_get() extensively and so
>> exercises the optimized path.
>> The function's execution time and instruction statistics were traced with perf;
>> the following are the results on a Qualcomm mobile phone SoC:
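For context, a simplified sketch of the idea (illustrative only, not the actual
kernel code and not the patch itself): contpte_ptep_get() folds the dirty/young
bits of every PTE in the contig block into the pte it returns, and the redundant
work is the checking that can no longer change the result, e.g. once both bits
have already been seen. The helper name below is made up, and it assumes
kernel-internal helpers such as __ptep_get()/pte_mkdirty()/pte_mkyoung():

	/*
	 * Illustrative sketch only -- not the patch. Once the accumulated
	 * pte is already dirty and young, the remaining entries in the
	 * contig block cannot change the result, so the scan can stop.
	 */
	static pte_t contpte_fold_sketch(pte_t *ptep, pte_t orig_pte)
	{
		int i;

		for (i = 0; i < CONT_PTES; i++, ptep++) {
			pte_t pte = __ptep_get(ptep);

			if (pte_dirty(pte))
				orig_pte = pte_mkdirty(orig_pte);
			if (pte_young(pte))
				orig_pte = pte_mkyoung(orig_pte);

			/* Nothing more to learn from the remaining entries. */
			if (pte_dirty(orig_pte) && pte_young(orig_pte))
				break;
		}

		return orig_pte;
	}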
>
>Xavier, for some reason your emails aren't hitting my inbox - I'm only seeing
>the replies from others. I'll monitor lore but apologies if I'm slow to respond
>- that's the reason.
>
>Please start the first line of the commit with "arm64/mm" instead of "mm/contpte".
>
>Also I noticed that Andrew put this into mm-new last night. I'd prefer that this
>go via the arm64 tree, if we decide we want it.
OK, I will change it to "arm64/mm" in the next version.
>>
>> Instruction Statistics - Before Optimization
>>          count  event_name               # count / runtime
>>     20,814,352  branch-load-misses       #  662.244 K/sec
>> 41,894,986,323  branch-loads             #    1.333 G/sec
>>      1,957,415  iTLB-load-misses         #   62.278 K/sec
>> 49,872,282,100  iTLB-loads               #    1.587 G/sec
>>    302,808,096  L1-icache-load-misses    #    9.634 M/sec
>> 49,872,282,100  L1-icache-loads          #    1.587 G/sec
>>
>> Total test time: 31.485237 seconds.
>>
>> Instruction Statistics - After Optimization
>>          count  event_name               # count / runtime
>>     19,340,524  branch-load-misses       #  688.753 K/sec
>> 38,510,185,183  branch-loads             #    1.371 G/sec
>>      1,812,716  iTLB-load-misses         #   64.554 K/sec
>> 47,673,923,151  iTLB-loads               #    1.698 G/sec
>>    675,853,661  L1-icache-load-misses    #   24.068 M/sec
>> 47,673,923,151  L1-icache-loads          #    1.698 G/sec
>>
>> Total test time: 28.108048 seconds.
>>
>> Function Statistics - Before Optimization
>> Arch: arm64
>> Event: cpu-cycles (type 0, config 0)
>> Samples: 1419716
>> Event count: 99618088900
>>
>> Overhead Symbol
>> 21.42% lock_release
>> 21.26% lock_acquire
>> 20.88% arch_counter_get_cntvct
>> 14.32% _raw_spin_unlock_irq
>> 6.79% contpte_ptep_get
>> 2.20% test_contpte_perf
>> 1.82% follow_page_pte
>> 0.97% lock_acquired
>> 0.97% rcu_is_watching
>> 0.89% mlock_pte_range
>> 0.84% sched_clock_noinstr
>> 0.70% handle_softirqs.llvm.8218488130471452153
>> 0.58% test_preempt_disable_long
>> 0.57% _raw_spin_unlock_irqrestore
>> 0.54% arch_stack_walk
>> 0.51% vm_normal_folio
>> 0.48% check_preemption_disabled
>> 0.47% stackinfo_get_task
>> 0.36% try_grab_folio
>> 0.34% preempt_count
>> 0.32% trace_preempt_on
>> 0.29% trace_preempt_off
>> 0.24% debug_smp_processor_id
>>
>> Function Statistics - After Optimization
>> Arch: arm64
>> Event: cpu-cycles (type 0, config 0)
>> Samples: 1431006
>> Event count: 118856425042
>>
>> Overhead Symbol
>> 22.59% lock_release
>> 22.13% arch_counter_get_cntvct
>> 22.08% lock_acquire
>> 15.32% _raw_spin_unlock_irq
>> 2.26% test_contpte_perf
>> 1.50% follow_page_pte
>> 1.49% arch_stack_walk
>> 1.30% rcu_is_watching
>> 1.09% lock_acquired
>> 1.07% sched_clock_noinstr
>> 0.88% handle_softirqs.llvm.12507768597002095717
>> 0.88% trace_preempt_off
>> 0.76% _raw_spin_unlock_irqrestore
>> 0.61% check_preemption_disabled
>> 0.52% trace_preempt_on
>> 0.50% mlock_pte_range
>> 0.43% try_grab_folio
>> 0.41% folio_mark_accessed
>> 0.40% vm_normal_folio
>> 0.38% test_preempt_disable_long
>> 0.28% contpte_ptep_get
>> 0.27% __traceiter_android_rvh_preempt_disable
>> 0.26% debug_smp_processor_id
>> 0.24% return_address
>> 0.20% __pte_offset_map_lock
>> 0.19% unwind_next_frame_record
>>
>> Assuming there is no problem with my test program, the results show a
>> significant improvement both in the overall number of instructions executed
>> (and total test time) and in the overhead of contpte_ptep_get() itself.
>>
>> If any reviewers have time, please also test it on your machines for
>> comparison. I have enabled THP and the hugepages-64kB size.
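For reproduction, these are the THP sysfs knobs I mean; a rough sketch of
setting them from C (the per-size path and the "always" value are assumptions
that may differ depending on your kernel and setup):

	#include <stdio.h>

	/* Write a value to a sysfs file; needs to run as root. */
	static void write_sysfs(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			return;
		}
		fputs(val, f);
		fclose(f);
	}

	static void enable_64k_mthp(void)
	{
		write_sysfs("/sys/kernel/mm/transparent_hugepage/enabled",
			    "always");
		write_sysfs("/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled",
			    "always");
	}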
>>
>> Test Function:
>> ---
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>>
>> #define PAGE_SIZE 4096
>> #define CONT_PTES 16
>> #define TEST_SIZE (4096 * CONT_PTES * PAGE_SIZE)
>>
>> /* Touch one byte per page so the whole range is populated and read back. */
>> void rwdata(char *buf)
>> {
>> 	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
>> 		buf[i] = 'a';
>> 		volatile char c = buf[i];
>> 	}
>> }
>>
>> /*
>>  * Repeatedly mlock()/munlock() the range so the kernel walks the contpte
>>  * mappings (and calls contpte_ptep_get()) many times.
>>  */
>> void test_contpte_perf(void)
>> {
>> 	char *buf;
>> 	int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
>>
>> 	if (ret != 0) {
>> 		fprintf(stderr, "posix_memalign failed: %d\n", ret);
>> 		exit(EXIT_FAILURE);
>> 	}
>>
>> 	rwdata(buf);
>>
>> 	for (int j = 0; j < 500; j++) {
>> 		mlock(buf, TEST_SIZE);
>>
>> 		rwdata(buf);
>>
>> 		munlock(buf, TEST_SIZE);
>> 	}
>>
>> 	free(buf);
>> }
>This is a microbenchmark in a pathological case and it's showing ~11%
>improvement. But in principle I'm ok with it. I have some comments on the actual
>change though, which I'll send through against that email.
OK. Please refer to my reply to that email.
Thanks,
Xavier